Beyond 'Failed': Building Robust Analysis and Real-time Visibility
We tackled opaque long-running processes and cryptic error messages, delivering a more robust code analysis engine and a real-time 'Active Processes' sidebar widget for unparalleled visibility.
Every developer knows the frustration: you kick off a long-running process, and then... silence. Or worse, a generic "something went wrong" message that leaves you guessing. In the world of complex code analysis, where large language models (LLMs) chew through codebases and generate documentation, this opacity isn't just annoying—it's a critical blocker for productivity and trust.
This past session, we set out to banish that frustration. Our mission: make our code analysis engine more robust and transparent, and give users real-time insight into exactly what's happening under the hood. The result is a significant leap forward in both reliability and developer experience.
The Quest for Transparency: Robust Error Handling
Our existing code analysis sometimes fell short when faced with unexpected issues. A pattern detection failure or a hiccup in documentation generation would often result in a vague "Pattern detection failed" message. This was a dead end for debugging and incredibly frustrating for users.
Storing the Actual Problems
The first step was to capture and persist granular error messages. Instead of generic strings, our analysis-runner.ts now diligently stores the actual error details in the database. This simple change transforms a mystery into a solvable problem, providing developers with the context they need to understand and address issues.
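The shape of that change can be sketched in a few lines. This is illustrative only: extractErrorMessage and the commented-out persistence call are assumptions, not the real analysis-runner.ts internals.

```typescript
// Illustrative sketch: normalize an unknown thrown value into a storable
// message, so the database records the actual failure instead of a generic
// string. (extractErrorMessage and the persistence call below are
// hypothetical names, not the actual analysis-runner.ts code.)
function extractErrorMessage(err: unknown): string {
  if (err instanceof Error) return err.message;
  if (typeof err === "string") return err;
  try {
    return JSON.stringify(err);
  } catch {
    return String(err);
  }
}

// In the runner, the caught error's details would then be persisted,
// e.g. (shape assumed):
// await db.analysisRun.update({
//   where: { id: runId },
//   data: { status: "failed", error: extractErrorMessage(err) },
// });
```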
Resilience Through Non-Fatal Errors
A critical architectural decision was to make individual batch LLM errors and doc generation errors non-fatal. Previously, if one LLM batch failed, the entire analysis run would halt. This was overly punitive.
Now, in pattern-detector.ts and doc-generator.ts, we've introduced batch_error and doc_error event types. If a specific batch of LLM processing or a single documentation chunk fails, the analysis continues. We track these batchErrors and docErrors as counters within the analysis stats. This means:
- Partial Success is Still Success: Users get some results even if there are minor issues.
- Better Resource Utilization: We don't waste LLM tokens by prematurely stopping an entire run.
- Clearer Picture: The stats now accurately reflect the scale of any issues, rather than just a binary "failed/succeeded."
This shift makes our analysis engine far more resilient and user-friendly, providing valuable partial insights rather than total abandonment.
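The non-fatal pattern can be sketched as a loop that records each failure and moves on. All names here (runBatches, processBatch, the stats shape) are illustrative, not the actual pattern-detector.ts API.

```typescript
// Sketch of per-batch error tolerance: a failed batch increments a counter
// and emits a batch_error event instead of aborting the whole run.
interface AnalysisStats {
  batchesProcessed: number;
  batchErrors: number;
}

type BatchEvent =
  | { type: "batch_done"; index: number }
  | { type: "batch_error"; index: number; message: string };

async function runBatches(
  batches: string[][],
  processBatch: (batch: string[]) => Promise<void>,
  emit: (event: BatchEvent) => void,
): Promise<AnalysisStats> {
  const stats: AnalysisStats = { batchesProcessed: 0, batchErrors: 0 };
  for (const [index, batch] of batches.entries()) {
    try {
      await processBatch(batch);
      stats.batchesProcessed++;
      emit({ type: "batch_done", index });
    } catch (err) {
      // Non-fatal: record the failure and continue with the next batch.
      stats.batchErrors++;
      emit({
        type: "batch_error",
        index,
        message: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return stats;
}
```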
Bringing Operations to Life: The Active Processes Widget
Imagine you've just started a massive code analysis, or initiated a repository sync. How do you know if it's running? How far along is it? Our previous system offered little real-time feedback.
Enter the "Active Processes" sidebar widget – a game-changer for visibility.
A Unified View of Dynamic Operations
We built a new activeProcesses tRPC query in src/server/trpc/routers/dashboard.ts. It queries four distinct tables in parallel: workflows, analysis runs, consolidations, and syncing repos. It then unifies their states into a single ActiveProcess[] array.
This unified approach means that whether you're running a complex multi-step workflow or just syncing a new repository, its status is immediately visible.
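The unification step can be sketched roughly like this. The ActiveProcess shape and the fetcher functions are assumptions for illustration, not the actual dashboard.ts router code.

```typescript
// Sketch of unifying heterogeneous process rows into one list: fetch each
// source in parallel, then flatten into a single array the widget can render.
interface ActiveProcess {
  id: string;
  kind: "workflow" | "analysis" | "consolidation" | "sync";
  label: string;
  progress: number | null; // 0..1, or null when indeterminate
  href: string;
}

async function getActiveProcesses(
  fetchers: Array<() => Promise<ActiveProcess[]>>,
): Promise<ActiveProcess[]> {
  // Query all sources concurrently rather than sequentially.
  const results = await Promise.all(fetchers.map((fetch) => fetch()));
  return results.flat();
}
```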
The Real-time UI Experience
The frontend component, src/components/layout/active-processes.tsx, brings this data to life:
- Color-coded icons: Instant visual cues for process types.
- Progress bars: Visual indication of completion.
- Status labels: Clear, human-readable descriptions of the current state.
- Deep links: Clickable entries that take you directly to the relevant detail page for that process.
Integrated into src/components/layout/sidebar.tsx and refreshed every 5 seconds, this widget puts a live operational dashboard right where you need it. We even fixed a small bug to ensure workflow progress labels clamp at max steps (no more "Step 4/3"!).
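The clamp fix boils down to a one-liner. formatStepLabel is an illustrative helper, not the actual component code:

```typescript
// Sketch of the clamp fix: never report a current step beyond the total,
// and never below zero. (Hypothetical helper for illustration.)
function formatStepLabel(currentStep: number, totalSteps: number): string {
  const clamped = Math.min(Math.max(currentStep, 0), totalSteps);
  return `Step ${clamped}/${totalSteps}`;
}
```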
Now, users can see at a glance what's running, how far along it is, and navigate to its details with a single click. No more guessing, no more refreshing pages.
Challenges & Lessons Learned
Development isn't always a smooth road. Here are a couple of critical lessons from this session:
1. The CLI Authentication Conundrum
Challenge: I initially tried to trigger an analysis run directly from the command line using curl against our SSE endpoint. This seemed like a quick way to test the backend logic without the browser.
Problem: The SSE endpoint required a NextAuth session cookie for authentication, which curl doesn't inherently provide.
Lesson: When interacting with authenticated API endpoints, especially those backed by session-based authentication, directly calling them from the CLI without proper credential handling (e.g., fetching and providing session cookies) often fails.
Solution: We created a dedicated scripts/run-analysis.ts. This standalone script directly imports and invokes the runAnalysis() function, bypassing the HTTP layer entirely for internal testing and development purposes. It's a pragmatic workaround that highlights the difference between external API interaction and internal function calls.
```typescript
// scripts/run-analysis.ts (simplified)
import { runAnalysis } from '../src/server/services/code-analysis/analysis-runner';

async function main() {
  // ... setup context for runAnalysis ...
  await runAnalysis({ /* ... test params ... */ });
  console.log('Analysis triggered successfully!');
}

main().catch(console.error);
```
2. Prisma's @updatedAt and Raw SQL
Challenge: While experimenting with database interactions, I tried to create an analysis run record using a raw SQL INSERT statement.
Problem: The INSERT failed with a null value in column "updatedAt". Our Prisma schema uses @updatedAt on this field, which Prisma automatically manages. Raw SQL bypasses Prisma's ORM logic.
Lesson: When using an ORM like Prisma, it's generally best practice to use the ORM's client for all CRUD operations. Raw SQL bypasses client-managed directives like @updatedAt, which relies on the Prisma client to inject the current timestamp on every write (unlike @default(now()), which becomes a real database default and does survive raw inserts).
Solution: For raw SQL, you'd need to supply the updatedAt value explicitly (e.g., with now()). However, the better long-term solution is to use the Prisma client within any scripts or services that interact with these models, ensuring all ORM logic is respected.
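To make the fix concrete, here's a sketch of building a raw INSERT that won't trip over the NULL updatedAt. The table and column names are examples, not the real schema, and real code should use parameterized queries rather than string interpolation:

```typescript
// Illustrative sketch: a raw INSERT against a Prisma-managed table must set
// updatedAt explicitly, since @updatedAt is filled in by the Prisma client,
// not by the database. (Table/column names are hypothetical; parameterize
// values in real code instead of interpolating them.)
function buildAnalysisRunInsert(id: string, status: string): string {
  // now() supplies the timestamp that raw SQL would otherwise leave NULL.
  return (
    `INSERT INTO "AnalysisRun" ("id", "status", "updatedAt") ` +
    `VALUES ('${id}', '${status}', now());`
  );
}
```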
(Self-note: Because the consolidation model's projectIds field is a String[] @db.Uuid rather than a relation, the dashboard query must select c.name directly for display; there's no project object to join against. A good example of how data modeling choices impact query design.)
Looking Ahead
With these core features complete and pushed to main, the immediate next steps involve verification and optimization:
- UI Verification: Confirm the sidebar widget correctly renders all active processes in the browser.
- End-to-End Testing: Ensure analysis runs initiated from the UI complete successfully.
- SSE Abort Handling: Consider adding logic to stop analysis when a client disconnects from the SSE stream, preventing wasted LLM tokens.
- Polling Optimization: Adjust refetchInterval (from 5s to 10-15s) and add staleTime to reduce polling load.
- Unit Tests: Implement unit tests for the analysis-runner.ts orchestration logic, especially around batch_error and doc_error handling.
- RLS Verification: Double-check that the Row-Level Security policy for the code_patterns table (which relies on the repository FK for tenantId) is working as expected.
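For the polling tweak, the client-side change would be small. This assumes the widget's query goes through TanStack Query (the usual pairing with tRPC); the option names are standard, but the values are just the plan above:

```typescript
// Sketch of the planned polling tuning for the Active Processes query.
// Option names are standard TanStack Query options; values follow the plan.
const activeProcessesQueryOptions = {
  refetchInterval: 15_000, // widened from 5s to reduce polling load
  staleTime: 10_000, // serve cached data for brief re-mounts instead of refetching
};
```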
This session marked a significant step forward in making our code analysis platform more robust, transparent, and a joy to use. By tackling error handling head-on and providing real-time operational visibility, we're building a system that developers can trust and rely on.