nyxcore-systems

Building Resilient Code Analysis: Real-time Progress Tracking and Graceful Error Handling

How we transformed brittle batch processing into a resilient system with real-time progress tracking, graceful error recovery, and a polished user experience.

Tags: error-handling, real-time-ui, code-analysis, trpc, nextjs, developer-experience

Last week I spent a development session on work that illustrates why building robust developer tools is both challenging and rewarding. The goal was straightforward: fix the error handling in our code analysis and add a sidebar widget that shows running processes. The journey is where it gets interesting.

The Problem: When Long-Running Processes Go Silent

Our code analysis system was suffering from two classic issues that plague many developer tools:

  1. Silent Failures: When batch processing failed, users got generic "Pattern detection failed" messages with no insight into what actually went wrong
  2. Black Box Processing: Long-running analysis jobs would disappear into the void, leaving users wondering if anything was actually happening

These aren't just technical problems—they're user experience killers. Nothing erodes trust in a tool faster than mysterious failures and invisible progress.

Building Graceful Error Recovery

The first challenge was transforming our brittle all-or-nothing approach into something more resilient. Previously, if any part of the analysis pipeline failed, the entire process would crash:

typescript
// Before: One failure kills everything
if (batchFailed) {
  throw new Error("Pattern detection failed");
}

The solution was to distinguish between fatal and recoverable errors. We introduced new event types (batch_error and doc_error) and restructured the flow:

typescript
// After: Continue processing, track partial failures
try {
  const patterns = await detectPatterns(batch);
  // ... success handling
} catch (error) {
  await this.updateProgress({
    type: 'batch_error',
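    // Catch variables are `unknown` under strict TypeScript, so narrow first
    error: error instanceof Error ? error.message : String(error),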
    batchIndex
  });
  batchErrors++;
  // Continue with next batch
}

This change transformed our system from fragile to resilient. Now, if 2 out of 10 batches fail, users still get results from the 8 that succeeded, plus clear visibility into what went wrong.
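
The loop around that try/catch can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not the actual runner: `processBatches`, `BatchSummary`, the `batch_done` event, and the `detect` and `onEvent` callbacks are hypothetical names. Only the `batch_error` event type and the `batchErrors` counter come from the snippet above.

```typescript
// Hypothetical sketch: names are illustrative, not the real analysis runner.
type AnalysisEvent =
  | { type: 'batch_done'; batchIndex: number; patternCount: number }
  | { type: 'batch_error'; batchIndex: number; error: string };

interface BatchSummary {
  patterns: string[];
  batchErrors: number;
  failedBatches: number[];
}

async function processBatches(
  batches: string[][],
  detect: (batch: string[]) => Promise<string[]>,
  onEvent: (event: AnalysisEvent) => void | Promise<void>,
): Promise<BatchSummary> {
  const summary: BatchSummary = { patterns: [], batchErrors: 0, failedBatches: [] };

  for (let batchIndex = 0; batchIndex < batches.length; batchIndex++) {
    try {
      const found = await detect(batches[batchIndex]);
      summary.patterns.push(...found);
      await onEvent({ type: 'batch_done', batchIndex, patternCount: found.length });
    } catch (error) {
      // Recoverable: record the failure and continue with the next batch.
      const message = error instanceof Error ? error.message : String(error);
      summary.batchErrors++;
      summary.failedBatches.push(batchIndex);
      await onEvent({ type: 'batch_error', batchIndex, error: message });
    }
  }

  // Fatal only when nothing succeeded (an assumed policy for this sketch).
  if (batches.length > 0 && summary.batchErrors === batches.length) {
    throw new Error(`All ${batches.length} batches failed`);
  }
  return summary;
}
```

Returning `failedBatches` alongside the patterns is what lets the caller report which batches failed instead of a single generic message.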

Real-Time Process Visibility

The second piece was building a unified view of all active operations. Our system runs several types of long-duration processes:

  • Code analysis runs
  • Repository syncing
  • Workflow executions
  • Data consolidations

Each lived in its own silo, making it impossible for users to understand system state. The solution was a unified ActiveProcess abstraction:

typescript
export interface ActiveProcess {
  id: string;
  type: 'analysis' | 'workflow' | 'consolidation' | 'sync';
  title: string;
  status: 'running' | 'completed' | 'failed';
  progress?: { current: number; total: number };
  href: string;
  startedAt: Date;
}

The tRPC endpoint queries four different tables in parallel and normalizes the rows into this common interface:

typescript
activeProcesses: publicProcedure.query(async ({ ctx }) => {
  const [workflows, analysisRuns, consolidations, syncingRepos] = 
    await Promise.all([
      // Parallel queries to different process tables
    ]);
  
  return [...workflows, ...analysisRuns, ...consolidations, ...syncingRepos]
    .sort((a, b) => b.startedAt.getTime() - a.startedAt.getTime());
}),
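
The normalization step might look something like the sketch below. The row shapes (`AnalysisRunRow`, `WorkflowRow`), their field names, the titles, and the `/analysis/...` and `/workflows/...` routes are all assumptions made for illustration. Only the `ActiveProcess` interface and the newest-first sort come from the code above.

```typescript
interface ActiveProcess {
  id: string;
  type: 'analysis' | 'workflow' | 'consolidation' | 'sync';
  title: string;
  status: 'running' | 'completed' | 'failed';
  progress?: { current: number; total: number };
  href: string;
  startedAt: Date;
}

// Hypothetical row shapes; the real tables will differ.
interface AnalysisRunRow {
  id: string;
  repoName: string;
  batchesDone: number;
  batchesTotal: number;
  createdAt: Date;
}

interface WorkflowRow {
  id: string;
  name: string;
  startedAt: Date;
}

function fromAnalysisRun(row: AnalysisRunRow): ActiveProcess {
  return {
    id: row.id,
    type: 'analysis',
    title: `Analyzing ${row.repoName}`,
    status: 'running',
    progress: { current: row.batchesDone, total: row.batchesTotal },
    href: `/analysis/${row.id}`, // assumed route
    startedAt: row.createdAt,
  };
}

function fromWorkflow(row: WorkflowRow): ActiveProcess {
  return {
    id: row.id,
    type: 'workflow',
    title: row.name,
    status: 'running',
    href: `/workflows/${row.id}`, // assumed route
    startedAt: row.startedAt,
  };
}

// Merge the normalized lists and put the most recently started process first.
function mergeProcesses(...lists: ActiveProcess[][]): ActiveProcess[] {
  return ([] as ActiveProcess[])
    .concat(...lists)
    .sort((a, b) => b.startedAt.getTime() - a.startedAt.getTime());
}
```

Keeping one small mapper per table means the sidebar only ever sees `ActiveProcess`, so adding a fifth process type would not touch the UI.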

The UI: Making Progress Tangible

The sidebar widget brings all this data together in a clean, informative interface:

  • Color-coded status indicators: Green for success, blue for running, red for errors
  • Real-time progress bars: Show completion percentage where available
  • Deep linking: Click any process to jump to its detailed view
  • Auto-refresh: Updates every 5 seconds without user intervention

tsx
<div className="flex items-center gap-3">
  <StatusIcon type={process.type} status={process.status} />
  <div className="flex-1 min-w-0">
    <Link href={process.href} className="text-sm font-medium">
      {process.title}
    </Link>
    {process.progress && (
      <ProgressBar 
        current={process.progress.current} 
        total={process.progress.total} 
      />
    )}
  </div>
</div>
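
The color coding and the percentage behind the progress bar reduce to two small pure functions. The sketch below is one possible shape, not the widget's actual code: `statusColor`, `progressPercent`, and the specific Tailwind-style class names are assumptions. The green, blue, and red mapping follows the list above.

```typescript
type ProcessStatus = 'running' | 'completed' | 'failed';

// Class names are assumed; any styling approach would work.
const STATUS_COLOR: Record<ProcessStatus, string> = {
  completed: 'text-green-500',
  running: 'text-blue-500',
  failed: 'text-red-500',
};

function statusColor(status: ProcessStatus): string {
  return STATUS_COLOR[status];
}

// Convert { current, total } into a whole percentage clamped to 0..100.
// Missing or invalid totals yield 0 rather than NaN or Infinity.
function progressPercent(progress: { current: number; total: number }): number {
  const { current, total } = progress;
  if (!isFinite(current) || !isFinite(total) || total <= 0) return 0;
  const ratio = Math.min(1, Math.max(0, current / total));
  return Math.round(ratio * 100);
}
```

Clamping matters here because counters read from separate tables can briefly disagree, for example when `current` is updated before `total`.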

Lessons Learned: The Authentication Trap

Not everything went smoothly. I initially tried to create a simple CLI script to trigger analysis runs by calling our SSE endpoint directly:

bash
curl -X POST http://localhost:3000/api/analysis/run

This failed spectacularly. The endpoint requires NextAuth session cookies from the browser—there's no straightforward way to authenticate CLI requests against a session-based auth system.

The workaround was creating a standalone script that imports the analysis function directly:

typescript
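// scripts/run-analysis.ts
import { runAnalysis } from '../src/server/services/code-analysis/analysis-runner';

async function main() {
  // Read the IDs from the command line (assumes both are plain string IDs)
  const [repositoryId, userId] = process.argv.slice(2);
  if (!repositoryId || !userId) throw new Error('Usage: run-analysis.ts <repositoryId> <userId>');
  await runAnalysis(repositoryId, userId);
}

// Exit non-zero on failure so shells and CI notice
main().catch((error) => {
  console.error(error);
  process.exit(1);
});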

Lesson: When designing APIs for developer tools, consider both browser and programmatic access patterns from the start. Session-based auth is great for web UIs but creates friction for automation.
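
One way to plan for both access patterns is to resolve the caller from either a browser session or an API token. The sketch below is a hypothetical illustration, not something this project implements: `resolveCaller`, `Caller`, `AuthInput`, and the token table are invented for the example, and it deliberately avoids any NextAuth API.

```typescript
// Hypothetical sketch: none of these names come from the actual codebase.
interface Caller {
  userId: string;
  via: 'session' | 'api-token';
}

interface AuthInput {
  sessionUserId?: string;       // set when a browser session was resolved
  authorizationHeader?: string; // raw Authorization header, if present
}

// `tokens` maps API token -> user ID. A real system would store hashed tokens.
function resolveCaller(input: AuthInput, tokens: Record<string, string>): Caller | null {
  // Browser path: an already-validated session wins.
  if (input.sessionUserId) {
    return { userId: input.sessionUserId, via: 'session' };
  }
  // Programmatic path: expect "Bearer <token>".
  const match = /^Bearer\s+(\S+)$/i.exec(input.authorizationHeader ?? '');
  if (!match) return null;
  const token = match[1];
  if (!Object.prototype.hasOwnProperty.call(tokens, token)) return null;
  return { userId: tokens[token], via: 'api-token' };
}
```

With something like this in place, a CLI could send an `Authorization` header instead of needing session cookies from the browser.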

The Results

After pushing commit 8594153, our test run on a sample repository found 58 code patterns and generated 3 documentation files—all while providing clear progress feedback and graceful error handling.

The transformation is significant:

  • Before: "Something failed, good luck figuring out what"
  • After: "Batch 3 failed due to rate limiting, but batches 1, 2, and 4 completed successfully"

What's Next

This foundation opens up several interesting possibilities:

  1. Smart retry logic: Now that we track partial failures, we can implement intelligent retry strategies
  2. Resource optimization: SSE abort handling to stop wasting LLM tokens when clients disconnect
  3. Performance tuning: Adjusting polling intervals based on system load

Building resilient developer tools isn't just about handling the happy path—it's about making the unhappy paths visible, understandable, and recoverable. Sometimes the most valuable features are the ones that help when things go wrong.


Have you built similar real-time progress systems? I'd love to hear about your approaches to handling long-running processes and error recovery strategies.