nyxcore-systems

Building an AI-Powered Code Analysis Extension: From Schema to Dashboard in 7 Phases

A deep dive into implementing a complete AI-powered code analysis system, from database design to real-time streaming UI, including the challenges and solutions discovered along the way.

ai · code-analysis · typescript · prisma · nextjs · llm · developer-tools

Last week, I embarked on building a comprehensive AI-powered code analysis extension from scratch. The goal was ambitious: create a system that could scan GitHub repositories, detect code patterns using LLMs, generate documentation automatically, and present everything through a real-time dashboard. Here's the story of how it came together.

The Architecture: 7 Phases to Success

Phase 1: The Foundation - Database Schema

Every great feature starts with solid data modeling. I added five new Prisma models to handle the complexity:

  • Repository - GitHub repo metadata and sync status
  • RepositoryFile - Individual file tracking with language detection
  • CodeAnalysisRun - Analysis sessions with status and metrics
  • CodePattern - Detected patterns with confidence scores
  • GeneratedDoc - AI-generated documentation with quality ratings
prisma
model Repository {
  id          String   @id @default(cuid())
  tenantId    String
  name        String
  fullName    String
  githubId    Int
  // ... relations and timestamps
}

The key insight here was designing for incremental analysis - the system needed to track what had been analyzed and when, enabling efficient re-runs on changed files only.
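
The incremental check itself can stay tiny. A minimal sketch, assuming files are tracked by content hash (`TrackedFile`, `filesNeedingAnalysis`, and the `sha` field are illustrative names, not the actual schema):

```typescript
interface TrackedFile {
  path: string;
  sha: string; // content hash recorded when the file was last indexed
}

// Given the files currently in the repo and the hashes recorded on the
// previous run, return only the files that are new or have changed.
function filesNeedingAnalysis(
  current: TrackedFile[],
  lastRun: Map<string, string>
): TrackedFile[] {
  return current.filter((f) => lastRun.get(f.path) !== f.sha);
}
```

Unchanged files drop out of the filter entirely, so a re-run only pays for the delta.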

Phase 2: The Scanner - Intelligent File Discovery

The scanner became the heart of the system, implemented as an AsyncGenerator that yields ScanEvents. This design choice proved crucial for handling large repositories without memory issues:

typescript
async function* scanRepository(repoId: string): AsyncGenerator<ScanEvent> {
  const repo = await getRepository(repoId); // look up the repo record by id
  const tree = await fetchRepoTree(repo.fullName);
  
  for (const file of tree) {
    yield { type: 'file_discovered', file };
    
    if (shouldAnalyzeFile(file)) {
      const content = await fetchFileContent(repo.fullName, file.path);
      yield { type: 'file_indexed', file, content };
    }
  }
}

The streaming approach meant users could see progress in real-time, rather than staring at a loading spinner for minutes.
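
Consuming such a generator is just a `for await` loop. A self-contained toy version (the mock generator and all names here are illustrative):

```typescript
type ScanEvent =
  | { type: 'file_discovered'; file: string }
  | { type: 'file_indexed'; file: string; content: string };

// A stand-in generator so the consumption pattern is runnable on its own.
async function* mockScan(): AsyncGenerator<ScanEvent> {
  yield { type: 'file_discovered', file: 'src/index.ts' };
  yield { type: 'file_indexed', file: 'src/index.ts', content: 'export {}' };
}

async function consume(): Promise<string[]> {
  const log: string[] = [];
  // Each event is handled the moment it is yielded -- nothing is buffered,
  // which is what keeps memory flat on large repositories.
  for await (const event of mockScan()) {
    log.push(event.type);
  }
  return log;
}
```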

Phase 3: The AI Brain - Pattern Detection

This is where the magic happens. The pattern detector processes files in batches (around 50k characters each) and sends them to LLMs for semantic analysis. The system detects 8 different pattern types:

  • Architecture patterns (MVC, microservices, etc.)
  • Naming conventions
  • Error handling strategies
  • Testing approaches
  • Dependency management
  • Security practices
  • Performance optimizations
  • Code style patterns
typescript
const patterns = await analyzeCodeBatch(batch, {
  model: provider.model,
  prompt: `Analyze this code for patterns. Return JSON with:
  {
    "patterns": [
      {
        "type": "architecture|naming|error-handling|...",
        "name": "pattern name",
        "description": "what this pattern does",
        "confidence": 0.85,
        "files": ["file1.ts", "file2.ts"]
      }
    ]
  }`
});
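
The batching step itself can be sketched as a greedy packer; the 50k limit and all names here are illustrative, not the actual implementation:

```typescript
interface SourceFile {
  path: string;
  content: string;
}

// Greedily pack files into batches of at most `maxChars` characters,
// so each batch fits comfortably in a single LLM context window.
function batchFiles(files: SourceFile[], maxChars = 50_000): SourceFile[][] {
  const batches: SourceFile[][] = [];
  let current: SourceFile[] = [];
  let size = 0;
  for (const file of files) {
    if (size + file.content.length > maxChars && current.length > 0) {
      batches.push(current); // current batch is full; start a new one
      current = [];
      size = 0;
    }
    current.push(file);
    size += file.content.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```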

The challenge was getting consistent, parseable responses from different LLM providers. The solution involved careful prompt engineering and robust JSON parsing with fallbacks.
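
One common fallback strategy (a sketch of the idea, not the exact implementation): extract the first JSON object from the reply, and treat anything unparseable as an empty result rather than throwing:

```typescript
interface DetectedPattern {
  type: string;
  name: string;
  description: string;
  confidence: number;
  files: string[];
}

// LLMs often wrap JSON in markdown fences or surround it with prose;
// pull out the outermost {...} block, parse it, and degrade gracefully.
function parsePatternResponse(raw: string): DetectedPattern[] {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return [];
  try {
    const parsed = JSON.parse(match[0]);
    return Array.isArray(parsed.patterns) ? parsed.patterns : [];
  } catch {
    return []; // fallback: treat an unparseable reply as "no patterns found"
  }
}
```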

Phase 4: The Documentation Generator

Perhaps the most ambitious part - automatically generating five types of documentation:

  1. README - Project overview and setup instructions
  2. API Documentation - Endpoint and function references
  3. Architecture Guide - System design and patterns
  4. Onboarding Guide - New developer quickstart
  5. Changelog - Recent changes and updates

Each generated document gets a quality score based on length, structure, and content heuristics:

typescript
function scoreDocQuality(content: string, type: DocType): number {
  let score = 0;
  
  // Length scoring (sweet spot varies by type)
  const targetLength = getTargetLength(type);
  score += Math.min(content.length / targetLength, 1) * 0.3;
  
  // Structure scoring (headers, lists, code blocks)
  score += analyzeStructure(content) * 0.4;
  
  // Content scoring (keywords, completeness)
  score += analyzeContent(content, type) * 0.3;
  
  return Math.min(score, 1);
}
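
The `analyzeStructure` helper isn't shown in full; a plausible version (my guess at the heuristic, not the real function) simply counts markdown features:

```typescript
// Score 0..1 based on how much markdown structure a document has:
// headers, list items, and fenced code blocks each contribute a share.
function analyzeStructure(content: string): number {
  const headers = (content.match(/^#{1,6}\s/gm) ?? []).length;
  const listItems = (content.match(/^\s*[-*]\s/gm) ?? []).length;
  const codeBlocks = (content.match(/```/g) ?? []).length / 2; // fence pairs
  let score = 0;
  if (headers >= 2) score += 0.4;
  if (listItems >= 3) score += 0.3;
  if (codeBlocks >= 1) score += 0.3;
  return Math.min(score, 1);
}
```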

Phase 5: The Plumbing - APIs and Real-time Updates

The backend needed three main components:

  1. tRPC Router - Type-safe API endpoints for CRUD operations
  2. Analysis Runner - Orchestrates the scan→patterns→docs pipeline
  3. Server-Sent Events - Real-time progress updates to the UI

The SSE implementation was particularly satisfying:

typescript
// Server-side streaming
export async function GET(request: Request, { params }: { params: { id: string } }) {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      await runAnalysis(params.id, (event) => {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(event)}\n\n`));
      });
      controller.close(); // signal the client that the run is complete
    }
  });
  
  return new Response(stream, {
    headers: { 'Content-Type': 'text/event-stream' }
  });
}
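
On the client, the browser's built-in `EventSource` handles this wire format, but the framing is simple enough to parse by hand: each event is a `data:` line followed by a blank line. An illustrative parser (it ignores multi-line `data:` fields and other SSE fields):

```typescript
// Split an SSE text chunk into the JSON payloads of its `data:` events.
function parseSseChunk(chunk: string): unknown[] {
  return chunk
    .split('\n\n')                                  // events are blank-line delimited
    .filter((block) => block.startsWith('data: '))  // keep only data events
    .map((block) => JSON.parse(block.slice('data: '.length)));
}
```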

Phase 6: The Interface - Dashboard UI

The frontend consists of three main pages:

  • List View - All repositories with analysis status
  • Add Repository - GitHub picker with manual entry fallback
  • Detail View - Four tabs showing Overview, Patterns, Documentation, and Run History

The detail page was the most complex, featuring:

  • Real-time log streaming during analysis
  • Interactive pattern visualization
  • Generated documentation with quality indicators
  • Historical run comparison

Phase 7: The Safety Net - Comprehensive Testing

56 unit tests across the core services ensured reliability:

typescript
describe('Pattern Detector', () => {
  it('should detect architecture patterns in React components', async () => {
    const files = [mockReactComponent, mockHookFile];
    const patterns = await detectPatterns(files, mockLLMProvider);
    
    expect(patterns).toHaveLength(2);
    expect(patterns[0].type).toBe('architecture');
    expect(patterns[0].confidence).toBeGreaterThanOrEqual(0.7);
  });
});

Lessons Learned: The Challenges That Made It Better

The TypeScript Configuration Gotcha

Challenge: Used [...new Set([...arr1, ...arr2])] for deduplication, but TypeScript threw TS2802: Type 'Set<string>' can only be iterated through when using the '--downlevelIteration' flag.

Solution: Switched to Array.from(new Set([...arr1, ...arr2])). This revealed an important project convention - the codebase doesn't enable downlevelIteration, so Set spreading isn't available.

Lesson: Always check your TypeScript configuration constraints before using newer syntax features.
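
For reference, the working form:

```typescript
const arr1 = ['a', 'b'];
const arr2 = ['b', 'c'];

// Array.from avoids iterating the Set directly, so it compiles without
// the downlevelIteration flag even when targeting older ES versions.
const merged = Array.from(new Set([...arr1, ...arr2])); // ['a', 'b', 'c']
```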

The Import Path Maze

Challenge: SSE route imports failed with ../../../../middleware (one level too deep).

Solution: Corrected to ../../../middleware by studying existing workflow patterns.

Lesson: When working in unfamiliar directory structures, find similar existing files and copy their import patterns exactly.

The Boundary Value Bug

Challenge: Test assertions using toBeGreaterThan(0.7) failed when scores landed exactly on 0.7.

Solution: Changed to toBeGreaterThanOrEqual() for boundary values.

Lesson: Boundary conditions in tests are more common than you think. Always consider the exact equality case.

The Results

After seven phases and 71 passing tests, the system delivers:

  • Automated repository scanning with progress tracking
  • AI-powered pattern detection across 8 categories
  • Intelligent documentation generation with quality scoring
  • Real-time dashboard with streaming updates
  • Comprehensive test coverage for reliability

What's Next?

The foundation is solid, but there's room for enhancement:

  1. Custom Pattern Rules - Allow users to define their own pattern detection criteria
  2. Diff Analysis - Focus analysis on changed files for faster iterations
  3. Team Insights - Aggregate patterns across multiple repositories
  4. Integration Webhooks - Trigger analysis on GitHub push events

Building this extension taught me that complex AI-powered features are really about orchestrating simple, well-tested components. The AI does the heavy lifting for pattern recognition and documentation, but the real engineering challenge is in the plumbing - making everything work together reliably at scale.

The streaming architecture proved especially valuable. Users see immediate feedback, the system handles large repositories gracefully, and debugging is much easier when you can watch the process unfold in real-time.

Sometimes the best way to understand your codebase is to teach an AI to analyze it for you. 🤖


Want to see the code? The complete implementation is available in commit 3aad552 with full test coverage and documentation.