The Hidden Costs of Background Processes: A UX Improvement Session That Became a Bug Hunt
What started as a simple navigation enhancement turned into discovering critical bugs in our background job handling. Here's how orphaned processes and stale data taught us valuable lessons about system resilience.
Sometimes the best debugging sessions are the ones you never planned. What began as a straightforward UX improvement to our workflow detail page quickly evolved into a deep dive through our application's background job handling, uncovering several critical bugs that had been silently wreaking havoc.
The Original Mission: Better Navigation UX
Our workflow detail page needed some love. Users were struggling to navigate through complex multi-step workflows, especially when trying to:
- Quickly jump between different steps
- Get an overview of the entire pipeline status
- Expand or collapse all steps at once
The solution seemed straightforward: add a sticky navigation bar with step indicators and bulk controls.
Building the Navigation Bar
The implementation came down to two pieces: a smooth-scroll handler for jumping to a step, and bulk expand/collapse handlers:
// Added smooth scroll navigation
const handleJumpToStep = (stepId: string) => {
  document.getElementById(`step-${stepId}`)?.scrollIntoView({
    behavior: 'smooth'
  });
};
// Bulk expand/collapse controls
const handleExpandAll = () => { /* expand logic */ };
const handleCollapseAll = () => { /* collapse logic */ };
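The expand and collapse bodies are elided above. One way to fill them in, as a minimal sketch, is to keep the ids of expanded steps in a set and derive each bulk action from it. The names `expandAll`, `collapseAll`, and `toggleStep` are hypothetical, and how the real page stores expansion state is an assumption:

```typescript
// Hypothetical model: the ids of the steps that are currently expanded.
type StepId = string;

// Expanding everything means the set contains every step id.
function expandAll(stepIds: readonly StepId[]): Set<StepId> {
  return new Set(stepIds);
}

// Collapsing everything means the set is empty.
function collapseAll(): Set<StepId> {
  return new Set<StepId>();
}

// Toggle one step without mutating the previous set, so a state
// setter receives a new reference and triggers a re-render.
function toggleStep(expanded: ReadonlySet<StepId>, id: StepId): Set<StepId> {
  const next = new Set(expanded);
  if (next.has(id)) {
    next.delete(id);
  } else {
    next.add(id);
  }
  return next;
}
```

In a component, `handleExpandAll` could then call a state setter with `expandAll(steps.map(s => s.id))`, and `handleCollapseAll` with `collapseAll()`.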
The navigation bar itself became a horizontally scrollable list of step pills, each with a status-colored indicator:
<div className="sticky top-0 z-10 bg-nyx-surface">
  <div className="flex justify-between items-center p-4">
    {/* Left: Bulk controls */}
    <div className="flex gap-2">
      <Button onClick={handleExpandAll}>Expand All</Button>
      <Button onClick={handleCollapseAll}>Collapse All</Button>
    </div>
    {/* Right: Step indicators */}
    <div className="flex gap-2 overflow-x-auto">
      {steps.map(step => (
        <div
          key={step.id}
          onClick={() => handleJumpToStep(step.id)}
          className="flex items-center gap-1 cursor-pointer"
        >
          <StatusDot status={step.status} />
          <span>{truncate(step.label, 12)}</span>
        </div>
      ))}
    </div>
  </div>
</div>
When UX Improvements Reveal Deeper Issues
While testing the new navigation, we discovered that the "Resume from here" functionality wasn't working correctly for multi-step workflows. This led us down a rabbit hole that revealed three distinct but related problems.
Problem #1: The Stale Data Trap
Our retry mutation had a subtle but critical bug. When users retried a step that had multiple generated alternatives, the system would:
- Reset the step status to "pending"
- But leave the old alternatives and selectedIndex intact
- The execution engine would see existing alternatives and think "job's already done!"
- Skip execution entirely and mark the workflow as completed
// The fix: Always clear alternatives and selection on retry
await prisma.workflowStep.update({
  where: { id: stepId },
  data: {
    status: 'pending',
    output: null,
    alternatives: Prisma.JsonNull, // Critical: clear old alternatives
    selectedIndex: null, // Critical: clear old selection
    error: null,
    startedAt: null,
    completedAt: null,
  }
});
This bug was particularly insidious because it appeared to work—the UI would show the step as "completed" almost instantly, which users might interpret as exceptionally fast processing rather than a skipped execution.
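To make the failure mode concrete, here is a small in-memory sketch of it. The `WorkflowStep` shape mirrors the fields in the update above, but the real schema and the engine's actual skip check are not shown in this post, so `looksAlreadyDone` and `resetForRetry` are assumptions for illustration only:

```typescript
// Hypothetical in-memory model of a step; not the real Prisma model.
interface WorkflowStep {
  id: string;
  status: 'pending' | 'running' | 'completed' | 'failed';
  output: string | null;
  alternatives: string[] | null;
  selectedIndex: number | null;
  error: string | null;
  startedAt: Date | null;
  completedAt: Date | null;
}

// Assumed engine behavior: stored alternatives make a step look finished.
function looksAlreadyDone(step: WorkflowStep): boolean {
  return step.alternatives !== null && step.alternatives.length > 0;
}

// Reset every field a previous run may have written, not just the status.
function resetForRetry(step: WorkflowStep): WorkflowStep {
  return {
    ...step,
    status: 'pending',
    output: null,
    alternatives: null,
    selectedIndex: null,
    error: null,
    startedAt: null,
    completedAt: null,
  };
}
```

A step whose status alone is reset still passes the `looksAlreadyDone` check, while a step that went through `resetForRetry` does not, which is the difference between a skipped run and a real one.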
Problem #2: The Orphaned Process Problem
While investigating the retry issue, we discovered that our Server-Sent Events (SSE) endpoints had a dangerous gap in error handling. When background processes crashed or clients disconnected unexpectedly, the database records would remain in "active" states indefinitely.
// Before: Silent failures left processes orphaned
export async function GET(request: Request, { params }: { params: { id: string } }) {
  try {
    // ... SSE streaming logic
  } catch (error) {
    console.error('SSE Error:', error);
    return new Response('Error', { status: 500 });
    // Database still thinks the process is running!
  }
}
// After: Proper cleanup on failure
export async function GET(request: Request, { params }: { params: { id: string } }) {
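  try {
    // ... SSE streaming logic
  } catch (error) {
    console.error('SSE Error:', error);
    // Mark the process as failed in the database. Catch cleanup errors
    // so a failed update cannot mask the original error.
    await prisma.workflow
      .update({ where: { id: params.id }, data: { status: 'failed' } })
      .catch((cleanupError) => console.error('Cleanup failed:', cleanupError));
    return new Response('Error', { status: 500 });
  }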
}
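A catch block only covers errors thrown inside the handler. A client that simply goes away usually throws nothing. The sketch below shows one way to cover that case, assuming the runtime aborts the request's `AbortSignal` when the client disconnects. The `statusById` map is a hypothetical stand-in for the database, and `markFailedIfActive` and `failOnDisconnect` are illustrative names, not part of our codebase:

```typescript
type RunStatus = 'running' | 'completed' | 'failed';

// Hypothetical stand-in for the database: workflow id -> status.
const statusById = new Map<string, RunStatus>();

// Only move a run to 'failed' while it is still active, so a late
// disconnect cannot overwrite a run that already completed.
function markFailedIfActive(id: string): boolean {
  if (statusById.get(id) !== 'running') return false;
  statusById.set(id, 'failed');
  return true;
}

// Register cleanup for a disconnect. In a route handler, `signal`
// would be `request.signal` (assumption: it aborts on disconnect).
function failOnDisconnect(id: string, signal: AbortSignal): void {
  if (signal.aborted) {
    markFailedIfActive(id);
    return;
  }
  signal.addEventListener('abort', () => { markFailedIfActive(id); }, { once: true });
}
```

Guarding on the current status keeps the cleanup idempotent, so the catch block and the disconnect listener can both call it without stepping on each other.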
Problem #3: The Token Limit Ceiling
Our investigation also revealed that several workflow steps were hitting the token limit at exactly 8,192 tokens, causing outputs to be truncated mid-sentence. The fix was simple: raise the limit to 16,384. It also highlighted the importance of monitoring resource constraints.
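An output that uses its entire token budget is a strong hint that it was cut off. Here is a minimal sketch of that check. The `GenerationResult` shape and the `hitTokenLimit` helper are hypothetical, and real provider SDKs report token usage in their own ways:

```typescript
// Hypothetical result shape; adapt to whatever your provider returns.
interface GenerationResult {
  text: string;
  outputTokens: number;
}

// A response that consumed the whole output budget was likely truncated.
function hitTokenLimit(result: GenerationResult, maxOutputTokens: number): boolean {
  return result.outputTokens >= maxOutputTokens;
}

// Example: the same response measured against the old and new limits.
const sample: GenerationResult = { text: 'The analysis shows that', outputTokens: 8192 };
const truncatedAtOldLimit = hitTokenLimit(sample, 8192);  // true
const truncatedAtNewLimit = hitTokenLimit(sample, 16384); // false
```

Logging or flagging steps where this returns true would have surfaced the truncation long before anyone noticed sentences ending mid-thought.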
Lessons Learned: Building Resilient Systems
This debugging session taught us several valuable lessons:
1. State Cleanup is Critical
When implementing retry or reset functionality, always audit all related state fields. It's not enough to reset the obvious ones—stale auxiliary data can cause unexpected behavior.
2. Error Boundaries Need Database Awareness
Background processes that update database state must handle failures gracefully. A try-catch block that logs an error but doesn't update the database creates orphaned processes that can confuse users and waste resources.
3. Auto-Cleanup Mechanisms Are Essential
We implemented automatic cleanup for processes stuck in active states:
// Clean up orphaned processes on startup
const tenMinutesAgo = new Date(Date.now() - 10 * 60 * 1000);
await prisma.analysisRun.updateMany({
  where: {
    status: { in: ['analyzing', 'pending'] },
    startedAt: { lt: tenMinutesAgo }
  },
  data: { status: 'failed' }
});
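The same rule can be written as a pure function, which makes the edge cases easier to reason about and test. This is a sketch under assumptions: the `RunRecord` shape is hypothetical, and the `createdAt` fallback is our addition for runs that never recorded a start time. If `startedAt` is nullable, a less-than filter on it will not match rows where it is null, so "pending" runs that never started may need a fallback like this one:

```typescript
// Hypothetical record shape; createdAt is an assumed field.
interface RunRecord {
  id: string;
  status: 'pending' | 'analyzing' | 'completed' | 'failed';
  createdAt: Date;
  startedAt: Date | null;
}

// A run is orphaned if it is still active and older than maxAgeMs.
function isOrphaned(run: RunRecord, now: Date, maxAgeMs: number): boolean {
  if (run.status !== 'pending' && run.status !== 'analyzing') return false;
  // Fall back to createdAt for runs that never recorded a start time.
  const reference = run.startedAt ?? run.createdAt;
  return now.getTime() - reference.getTime() > maxAgeMs;
}

// Return a new list with orphaned runs marked as failed.
function failOrphans(runs: readonly RunRecord[], now: Date, maxAgeMs: number): RunRecord[] {
  return runs.map((run) =>
    isOrphaned(run, now, maxAgeMs) ? { ...run, status: 'failed' as const } : run
  );
}
```

Passing `now` in as a parameter, rather than calling `Date.now()` inside, keeps the function deterministic and lets tests pin the clock.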
4. Resource Limits Need Monitoring
Hard limits like token counts should be monitored and adjusted based on actual usage patterns, not just theoretical requirements.
The Bigger Picture
What started as a UX improvement session became a masterclass in system resilience. The new navigation bar works beautifully, but the real value came from discovering and fixing these hidden failure modes.
This experience reinforced an important principle: surface-level improvements often reveal deeper architectural issues. When you're enhancing user-facing features, pay attention to the edge cases and error conditions you encounter—they're often symptoms of broader systemic problems.
The next time you're implementing what seems like a simple feature addition, consider it an opportunity to audit the underlying systems. You might be surprised by what you find lurking beneath the surface.
Have you encountered similar hidden bugs while working on seemingly unrelated features? I'd love to hear about your debugging adventures in the comments below.