# Grounding AI Code Generation: How We Eliminated Hallucinated File Paths

Building reliable AI-powered development tools requires solving the hallucination problem. Here's how we used GitHub's Git Trees API and template variables to ground LLM prompts in real codebase context.
If you've ever worked with AI code generation tools, you've probably encountered this frustrating scenario: the AI confidently suggests importing from `./utils/helpers.js` or modifying `src/components/UserProfile.tsx` — files that don't actually exist in your codebase. This "hallucination" problem becomes critical when building automated development workflows.
Today, I want to share how we solved this challenge in our workflow engine by implementing two powerful template variables: `{{claudemd}}` and `{{fileTree}}`. The solution leverages GitHub's Git Trees API to provide real-time codebase context to our LLM prompts.
## The Problem: AI Tools Living in Fantasy Land
When Large Language Models generate code suggestions, they're working from patterns learned during training. Without concrete knowledge of your specific project structure, they'll confidently reference files and directories that seem reasonable but don't actually exist.
For our automated workflow engine, this was a showstopper. Users would run extension builders or security analysis workflows, only to receive implementation suggestions that referenced non-existent paths. The AI might suggest:
```typescript
// Hallucinated suggestion
import { validateUser } from '../utils/auth-helpers';
import UserCard from './components/UserCard';
```

…when the actual project structure looked nothing like this.
## The Solution: Real-Time Codebase Context
We implemented two template variables that inject real codebase information directly into our LLM prompts:
### 1. `{{fileTree}}` - Complete Project Structure
The `{{fileTree}}` variable provides a comprehensive view of the project's file structure. Here's how we built it:
```typescript
// Fetch the complete directory tree from GitHub's Git Trees API
async function fetchRepoTree(owner: string, repo: string, branch: string = 'main') {
  const response = await octokit.rest.git.getTree({
    owner,
    repo,
    tree_sha: branch,
    recursive: 'true' // the API expects a string here, not a boolean
  });

  // Filter out noise (node_modules, .git, dist, etc.)
  return response.data.tree
    .filter(item => !shouldIgnorePath(item.path))
    .slice(0, 500) // Cap at 500 files to manage the token budget
    .map(item => `${item.type === 'tree' ? '📁' : '📄'} ${item.path}`)
    .join('\n');
}
```
The key insight here is using GitHub's Git Trees API with recursive: true. This gives us the entire project structure in a single API call, which we then filter and format as a clean markdown code block.
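The `shouldIgnorePath` helper isn't shown above; a minimal sketch might look like the following (the exact ignore list is an assumption — tune it to your stack):

```typescript
// Hypothetical ignore filter — the prefixes below are a guess at sensible
// defaults, not the exact list used in the workflow engine.
const IGNORED_PREFIXES = ['node_modules/', '.git/', 'dist/', 'build/', 'coverage/'];

function shouldIgnorePath(path: string): boolean {
  // Match noisy directories at the repo root or nested at any depth.
  return IGNORED_PREFIXES.some(
    prefix => path.startsWith(prefix) || path.includes('/' + prefix)
  );
}
```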
### 2. `{{claudemd}}` - Project Documentation Context
The `{{claudemd}}` variable injects project-specific documentation, trying `CLAUDE.md` first (our convention for AI-friendly project descriptions), then falling back to `README.md`:
```typescript
async function loadClaudeMdContent(owner: string, repo: string, branch: string) {
  const candidates = ['CLAUDE.md', 'README.md'];

  for (const filename of candidates) {
    try {
      const response = await octokit.rest.repos.getContent({
        owner,
        repo,
        path: filename,
        ref: branch
      });

      if ('content' in response.data) {
        return Buffer.from(response.data.content, 'base64').toString('utf-8');
      }
    } catch (error) {
      // Try the next candidate
      continue;
    }
  }

  return 'No project documentation found.';
}
```
## Integration: Parallel Loading and Template Resolution
Both data sources load in parallel during workflow startup to minimize latency:
```typescript
// Load all context in parallel
const [claudeMdContent, fileTreeContent, ...otherData] = await Promise.all([
  loadClaudeMdContent(owner, repo, branch),
  loadFileTreeContent(owner, repo, branch),
  // ... other existing loaders
]);

// Make it available to all prompt templates
const chainContext = {
  claudeMdContent,
  fileTreeContent,
  // ... other context
};
```
The template resolution is straightforward:
```typescript
function resolvePrompt(template: string, context: ChainContext): string {
  // Use replacer functions so "$" sequences in the injected content
  // (e.g. "$&") aren't interpreted as special replacement patterns.
  return template
    .replace(/\{\{claudemd\}\}/g, () => context.claudeMdContent)
    .replace(/\{\{fileTree\}\}/g, () => context.fileTreeContent);
  // ... other template variables
}
```
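As a self-contained usage sketch (the context values here are illustrative, not real loader output), resolving a small template looks like:

```typescript
// Illustrative context — real values come from the parallel loaders above.
const demoContext = {
  claudeMdContent: '# My Project\nA sample CLAUDE.md.',
  fileTreeContent: '📁 src\n📄 src/index.ts',
};

// Minimal standalone resolver for the demo.
function resolveDemo(template: string): string {
  return template
    .replace(/\{\{claudemd\}\}/g, () => demoContext.claudeMdContent)
    .replace(/\{\{fileTree\}\}/g, () => demoContext.fileTreeContent);
}

const prompt = resolveDemo('## Files\n{{fileTree}}\n\n## Docs\n{{claudemd}}');
// prompt now contains the real tree and docs in place of the placeholders
```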
## Prompt Engineering: Explicit Instructions
The technical implementation is only half the battle. We also updated our system prompts to explicitly instruct the AI to use the provided context:
```markdown
## File Structure Reference
{{fileTree}}

## Project Documentation
{{claudemd}}

**CRITICAL**: You MUST reference real file paths from the provided file tree above.
Never invent or guess file paths. If you need to create new files, clearly
indicate they are new and choose paths that align with the existing structure.
```
## Results: From Hallucination to Reality
The impact was immediate. Instead of generic suggestions like:
```typescript
// Before: hallucinated paths
import { config } from '../config/app-config';
import Header from './components/Header';
```
Our AI now generates contextually accurate code:
```typescript
// After: real paths from the actual codebase
import { config } from '../src/lib/config';
import Header from '../src/components/ui/header';
```
## Lessons Learned

### Token Budget Management
With large repositories, the file tree can consume a significant share of the token budget. We cap at 500 files and filter aggressively (excluding `node_modules`, `.git`, `dist`, etc.). For most projects, this provides sufficient context while staying within reasonable limits.
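Beyond the file cap, it can help to sanity-check the rendered tree against a token budget before injecting it. The sketch below uses a rough 4-characters-per-token heuristic — an approximation, not a real tokenizer:

```typescript
// Rough token estimate: ~4 characters per token for English/code text.
// This is a heuristic; actual counts vary by model and tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Truncate a rendered file tree to a token budget, cutting on line boundaries
// and noting how many entries were dropped.
function truncateTree(tree: string, maxTokens: number): string {
  const lines = tree.split('\n');
  const out: string[] = [];
  let used = 0;

  for (const line of lines) {
    const cost = estimateTokens(line + '\n');
    if (used + cost > maxTokens) {
      out.push(`… (${lines.length - out.length} more entries omitted)`);
      break;
    }
    out.push(line);
    used += cost;
  }

  return out.join('\n');
}
```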
### Branch Handling

Always implement fallback logic for branch names. We try `main` first, then fall back to `master` if the first attempt fails:
```typescript
try {
  return await fetchRepoTree(owner, repo, 'main');
} catch (error) {
  // Older repositories often still use "master" as the default branch
  return await fetchRepoTree(owner, repo, 'master');
}
```
### API Rate Limits
GitHub's Git Trees API is efficient, but consider caching the results for frequently accessed repositories to avoid hitting rate limits during development.
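A simple in-memory cache with a TTL is often enough during development. This is a sketch, not production code, and the 5-minute TTL is an arbitrary choice:

```typescript
// Minimal TTL cache keyed by something like "owner/repo@branch".
type CacheEntry = { value: string; expiresAt: number };

const treeCache = new Map<string, CacheEntry>();
const TTL_MS = 5 * 60 * 1000; // 5 minutes — tune to taste

async function cachedFetch(
  key: string,
  fetcher: () => Promise<string>
): Promise<string> {
  const hit = treeCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  const value = await fetcher();
  treeCache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```

Usage would then wrap the tree loader, e.g. `cachedFetch(`${owner}/${repo}@${branch}`, () => fetchRepoTree(owner, repo, branch))`.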
## What's Next

This foundation opens up several interesting possibilities:

- **Selective Context Loading**: For very large repositories, we could implement smart filtering based on the workflow type (only load frontend files for UI-focused workflows)
- **Diff-Based Context**: For modification workflows, we could include recent git diffs to understand what's been changing
- **Dependency Graph Context**: Parse `package.json` and import statements to provide even richer context about project dependencies
## The Bigger Picture
This solution represents a broader principle in AI tooling: context is king. The most sophisticated language model will produce mediocre results without proper grounding in the specific problem domain. By investing in robust context injection, we transform generic AI suggestions into truly useful, actionable code generation.
The `{{claudemd}}` and `{{fileTree}}` template variables are now core components of our workflow engine, ensuring that every AI-generated suggestion is grounded in the reality of the actual codebase rather than the hallucinated fantasy of what a codebase might look like.
Want to implement something similar in your AI tooling? The key is finding the right balance between comprehensive context and token efficiency. Start with the file tree approach — it's surprisingly effective at eliminating the most common hallucination patterns.