# Grounding AI Code Generation: How We Eliminated Hallucinated File Paths

Building reliable AI-powered development tools requires solving the hallucination problem. Here's how we used GitHub's Git Trees API and template variables to ground LLM prompts in real codebase context.
If you've ever worked with AI code generation tools, you've probably encountered this frustrating scenario: the AI confidently suggests importing from `./utils/helpers.js` or modifying `src/components/UserProfile.tsx` — files that don't actually exist in your codebase. This "hallucination" problem becomes critical when building automated development workflows.
Today, I want to share how we solved this challenge in our workflow engine by implementing two powerful template variables: `{{claudemd}}` and `{{fileTree}}`. The solution leverages GitHub's Git Trees API to provide real-time codebase context to our LLM prompts.
## The Problem: AI Tools Living in Fantasy Land
When Large Language Models generate code suggestions, they're working from patterns learned during training. Without concrete knowledge of your specific project structure, they'll confidently reference files and directories that seem reasonable but don't actually exist.
For our automated workflow engine, this was a showstopper. Users would run extension builders or security analysis workflows, only to receive implementation suggestions that referenced non-existent paths. The AI might suggest:
```typescript
// Hallucinated suggestion
import { validateUser } from '../utils/auth-helpers';
import UserCard from './components/UserCard';
```

…when the actual project structure looked nothing like this.
## The Solution: Real-Time Codebase Context
We implemented two template variables that inject real codebase information directly into our LLM prompts:
### 1. `{{fileTree}}` - Complete Project Structure
The `{{fileTree}}` variable provides a comprehensive view of the project's file structure. Here's how we built it:
```typescript
// Fetch the complete directory tree from GitHub's Git Trees API
async function fetchRepoTree(owner: string, repo: string, branch: string = 'main') {
  const response = await octokit.rest.git.getTree({
    owner,
    repo,
    tree_sha: branch,
    recursive: 'true' // the API expects a string here, not a boolean
  });

  // Filter out noise (node_modules, .git, dist, etc.)
  return response.data.tree
    .filter(item => !shouldIgnorePath(item.path))
    .slice(0, 500) // Cap at 500 files to manage the token budget
    .map(item => `${item.type === 'tree' ? '📁' : '📄'} ${item.path}`)
    .join('\n');
}
```
The key insight here is using GitHub's Git Trees API with recursive: true. This gives us the entire project structure in a single API call, which we then filter and format as a clean markdown code block.
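The `shouldIgnorePath` helper isn't shown above; a minimal sketch might look like the following (the exact ignore list is an assumption — tune it to your stack):

```typescript
// Hypothetical ignore filter — the prefixes below are a guess at sensible
// defaults, not the exact list used in the workflow engine.
const IGNORED_PREFIXES = ['node_modules/', '.git/', 'dist/', 'build/', 'coverage/'];

function shouldIgnorePath(path: string): boolean {
  // Match noisy directories at the repo root or nested at any depth.
  return IGNORED_PREFIXES.some(
    prefix => path.startsWith(prefix) || path.includes('/' + prefix)
  );
}
```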
### 2. `{{claudemd}}` - Project Documentation Context
The `{{claudemd}}` variable injects project-specific documentation, trying `CLAUDE.md` first (our convention for AI-friendly project descriptions), then falling back to `README.md`:
```typescript
async function loadClaudeMdContent(owner: string, repo: string, branch: string) {
  const candidates = ['CLAUDE.md', 'README.md'];

  for (const filename of candidates) {
    try {
      const response = await octokit.rest.repos.getContent({
        owner,
        repo,
        path: filename,
        ref: branch
      });

      if ('content' in response.data) {
        return Buffer.from(response.data.content, 'base64').toString('utf-8');
      }
    } catch (error) {
      // Try the next candidate
      continue;
    }
  }

  return 'No project documentation found.';
}
```
## Integration: Parallel Loading and Template Resolution
Both data sources load in parallel during workflow startup to minimize latency:
```typescript
// Load all context in parallel
const [claudeMdContent, fileTreeContent, ...otherData] = await Promise.all([
  loadClaudeMdContent(owner, repo, branch),
  loadFileTreeContent(owner, repo, branch),
  // ... other existing loaders
]);

// Make it available to all prompt templates
const chainContext = {
  claudeMdContent,
  fileTreeContent,
  // ... other context
};
```
The template resolution is straightforward:
```typescript
function resolvePrompt(template: string, context: ChainContext): string {
  // Use replacer functions so "$" sequences in the injected content
  // (e.g. "$&") aren't interpreted as special replacement patterns.
  return template
    .replace(/\{\{claudemd\}\}/g, () => context.claudeMdContent)
    .replace(/\{\{fileTree\}\}/g, () => context.fileTreeContent);
  // ... other template variables
}
```
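As a self-contained usage sketch (the context values here are illustrative, not real loader output), resolving a small template looks like:

```typescript
// Illustrative context — real values come from the parallel loaders above.
const demoContext = {
  claudeMdContent: '# My Project\nA sample CLAUDE.md.',
  fileTreeContent: '📁 src\n📄 src/index.ts',
};

// Minimal standalone resolver for the demo.
function resolveDemo(template: string): string {
  return template
    .replace(/\{\{claudemd\}\}/g, () => demoContext.claudeMdContent)
    .replace(/\{\{fileTree\}\}/g, () => demoContext.fileTreeContent);
}

const prompt = resolveDemo('## Files\n{{fileTree}}\n\n## Docs\n{{claudemd}}');
// prompt now contains the real tree and docs in place of the placeholders
```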
## Prompt Engineering: Explicit Instructions
The technical implementation is only half the battle. We also updated our system prompts to explicitly instruct the AI to use the provided context:
```markdown
## File Structure Reference
{{fileTree}}

## Project Documentation
{{claudemd}}

**CRITICAL**: You MUST reference real file paths from the provided file tree above.
Never invent or guess file paths. If you need to create new files, clearly
indicate they are new and choose paths that align with the existing structure.
```
## Results: From Hallucination to Reality
The impact was immediate. Instead of generic suggestions like:
```typescript
// Before: hallucinated paths
import { config } from '../config/app-config';
import Header from './components/Header';
```
Our AI now generates contextually accurate code:
```typescript
// After: real paths from the actual codebase
import { config } from '../src/lib/config';
import Header from '../src/components/ui/header';
```
## Lessons Learned

### Token Budget Management
With large repositories, the file tree can consume a significant share of the token budget. We cap at 500 files and filter aggressively (excluding `node_modules`, `.git`, `dist`, etc.). For most projects, this provides sufficient context while staying within reasonable limits.
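Beyond the file cap, it can help to sanity-check the rendered tree against a token budget before injecting it. The sketch below uses a rough 4-characters-per-token heuristic — an approximation, not a real tokenizer:

```typescript
// Rough token estimate: ~4 characters per token for English/code text.
// This is a heuristic; actual counts vary by model and tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Truncate a rendered file tree to a token budget, cutting on line boundaries
// and noting how many entries were dropped.
function truncateTree(tree: string, maxTokens: number): string {
  const lines = tree.split('\n');
  const out: string[] = [];
  let used = 0;

  for (const line of lines) {
    const cost = estimateTokens(line + '\n');
    if (used + cost > maxTokens) {
      out.push(`… (${lines.length - out.length} more entries omitted)`);
      break;
    }
    out.push(line);
    used += cost;
  }

  return out.join('\n');
}
```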
### Branch Handling

Always implement fallback logic for branch names. We try `main` first, then fall back to `master` if the first attempt fails:
```typescript
try {
  return await fetchRepoTree(owner, repo, 'main');
} catch (error) {
  // Older repositories often still use "master" as the default branch
  return await fetchRepoTree(owner, repo, 'master');
}
```
### API Rate Limits
GitHub's Git Trees API is efficient, but consider caching the results for frequently accessed repositories to avoid hitting rate limits during development.
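A simple in-memory cache with a TTL is often enough during development. This is a sketch, not production code, and the 5-minute TTL is an arbitrary choice:

```typescript
// Minimal TTL cache keyed by something like "owner/repo@branch".
type CacheEntry = { value: string; expiresAt: number };

const treeCache = new Map<string, CacheEntry>();
const TTL_MS = 5 * 60 * 1000; // 5 minutes — tune to taste

async function cachedFetch(
  key: string,
  fetcher: () => Promise<string>
): Promise<string> {
  const hit = treeCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  const value = await fetcher();
  treeCache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```

Usage would then wrap the tree loader, e.g. `cachedFetch(`${owner}/${repo}@${branch}`, () => fetchRepoTree(owner, repo, branch))`.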
## What's Next

This foundation opens up several interesting possibilities:

- **Selective Context Loading**: For very large repositories, we could implement smart filtering based on the workflow type (only load frontend files for UI-focused workflows)
- **Diff-Based Context**: For modification workflows, we could include recent git diffs to understand what's been changing
- **Dependency Graph Context**: Parse `package.json` and import statements to provide even richer context about project dependencies
## The Bigger Picture
This solution represents a broader principle in AI tooling: context is king. The most sophisticated language model will produce mediocre results without proper grounding in the specific problem domain. By investing in robust context injection, we transform generic AI suggestions into truly useful, actionable code generation.
The `{{claudemd}}` and `{{fileTree}}` template variables are now core components of our workflow engine, ensuring that every AI-generated suggestion is grounded in the reality of the actual codebase rather than the hallucinated fantasy of what a codebase might look like.
Want to implement something similar in your AI tooling? The key is finding the right balance between comprehensive context and token efficiency. Start with the file tree approach — it's surprisingly effective at eliminating the most common hallucination patterns.