nyxcore-systems

Building Resilient Background Jobs: Fixing Orphaned Processes and Improving UX

A deep dive into fixing orphaned background processes and building better navigation UX - plus the database debugging adventures that made it all possible.

workflow-management · background-jobs · ux-improvements · error-handling · nextjs


Late evening development sessions often lead to the most interesting discoveries. What started as a simple UI enhancement request turned into a fascinating journey through orphaned background processes, database state inconsistencies, and the art of building resilient systems.

The Mission: Better Workflow Navigation

The initial goal was straightforward - improve the workflow detail page with better navigation controls. Users needed:

  • Expand/Collapse All buttons for managing large workflows
  • Jump navigation with sticky step indicators
  • Visual status indicators for quick workflow scanning

Implementation Highlights

The solution involved adding a sticky navigation bar that sits between the settings panel and pipeline visualization:

```tsx
// Added to the workflow detail page
const handleExpandAll = () => {
  // Expand all workflow steps
};

const handleCollapseAll = () => {
  // Collapse all workflow steps
};

const handleJumpToStep = (stepId: string) => {
  document.getElementById(`step-${stepId}`)?.scrollIntoView({
    behavior: 'smooth'
  });
};
```

The navigation bar features:

  • Left side: Expand All / Collapse All buttons with intuitive chevron icons
  • Right side: Horizontally scrollable step pills with status-colored dots and truncated labels
  • Styling: Sticky positioning with proper z-index layering

Each workflow step now includes a DOM anchor (`id="step-${step.id}"`) enabling smooth scrolling navigation - a simple but effective UX improvement.
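As a sketch of how the expand/collapse state behind those handlers might be tracked, here are `Set`-based helpers - illustrative assumptions, not the actual page code, which presumably keeps this in React state:

```typescript
// Hypothetical helpers modeling expand/collapse state as a set of
// expanded step ids. Pure functions, so the transitions are easy to test.
type ExpandedState = ReadonlySet<string>;

function expandAll(stepIds: string[]): ExpandedState {
  return new Set(stepIds);
}

function collapseAll(): ExpandedState {
  return new Set();
}

function toggleStep(state: ExpandedState, stepId: string): ExpandedState {
  const next = new Set(state);
  if (next.has(stepId)) {
    next.delete(stepId);
  } else {
    next.add(stepId);
  }
  return next;
}
```

In a component, `handleExpandAll` would just write `expandAll(steps.map(s => s.id))` into state and let each step render open or closed from membership in the set.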

The Plot Twist: Orphaned Background Jobs

While testing the navigation improvements, I discovered something more concerning - orphaned background processes. Workflows and code analysis runs were getting stuck in "running" states even after their processes had crashed or been interrupted.

The Root Cause

The issue lay in our Server-Sent Events (SSE) endpoints. When background jobs encountered errors, they were only sending error events to connected clients but not updating the database state:

```typescript
// Before: only sends an SSE event
catch (error) {
  res.write(`data: ${JSON.stringify({
    type: 'error',
    error: error.message
  })}\n\n`);
  res.end();
  // Database still shows "running" status! 💥
}
```

This created a critical failure mode: if a client disconnected or a process crashed, the database would forever show the job as "running," preventing users from retrying or starting new jobs.

The Solution: Defensive Database Updates

The fix required updating both workflow and code analysis SSE endpoints to always update the database state on errors:

```typescript
// After: update the database AND send the SSE event
catch (error) {
  // Update database state first
  await prisma.workflow.update({
    where: { id: workflowId },
    data: { status: 'failed' }
  });

  // Then send the SSE event
  res.write(`data: ${JSON.stringify({
    type: 'error',
    error: error.message
  })}\n\n`);
  res.end();
}
```

Additionally, I implemented orphan cleanup logic in the job start mutations:

```typescript
// Auto-clean stuck processes before starting new ones
const tenMinutesAgo = new Date(Date.now() - 10 * 60 * 1000);

await prisma.codeAnalysisRun.updateMany({
  where: {
    repositoryId,
    status: { in: ['analyzing', 'pending'] },
    createdAt: { lt: tenMinutesAgo }
  },
  data: { status: 'failed' }
});
```
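The cutoff logic itself is trivial, but keeping it as a pure predicate makes it easy to test and to reuse across job types. A sketch - the `isOrphaned` helper is an illustrative assumption, not part of the actual mutation:

```typescript
// Threshold matching the cleanup query above: 10 minutes.
const ORPHAN_THRESHOLD_MS = 10 * 60 * 1000;

// Hypothetical predicate: a run counts as orphaned when it is still in
// an in-flight status but was created before the cutoff.
function isOrphaned(
  status: string,
  createdAt: Date,
  now: Date = new Date(),
  thresholdMs: number = ORPHAN_THRESHOLD_MS,
): boolean {
  const inFlight = status === 'analyzing' || status === 'pending';
  return inFlight && now.getTime() - createdAt.getTime() > thresholdMs;
}
```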

Lessons Learned: Database Debugging Adventures

One of the most educational parts of this session was debugging database state directly. Here are the key lessons:

Prisma CLI vs Raw SQL

The Challenge: Prisma's `db execute` command uses model names, not table names, which can be confusing when you need to write raw SQL.

The Workaround: Direct psql connection proved more reliable:

```bash
PGPASSWORD=nyxcore_dev psql -U nyxcore -h localhost -d nyxcore
```

Column Name Gotcha

The Trap: Prisma uses camelCase column names that need double-quoting in raw SQL:

```sql
-- This fails
SELECT * FROM workflows WHERE workflowId = 'abc123';

-- This works
SELECT * FROM workflows WHERE "workflowId" = 'abc123';
```
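The same quoting rule bites when dropping down to raw queries from application code, since parameter binding covers values but not identifiers. A small helper sketch (`quoteIdent` is an assumption for illustration, not an existing Prisma API):

```typescript
// Hypothetical helper: double-quote a camelCase identifier for raw
// Postgres SQL, escaping any embedded double quotes by doubling them.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}
```

Usage would look like `` `SELECT * FROM workflows WHERE ${quoteIdent('workflowId')} = $1` `` - only ever interpolate identifiers you control, never user input.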

Token Limit Discovery

While investigating stuck workflows, I discovered that two steps had hit exactly the 8,192 token completion limit. The solution was bumping their maxTokens from 8,192 to 16,384 directly in the database - a reminder to monitor token usage in production systems.
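A cheap guard against this failure mode is to flag completions that land exactly on the budget and bump the limit before retrying. A sketch under stated assumptions - the heuristic and the doubling cap below are illustrative, not the actual monitoring code:

```typescript
// Hypothetical check: a completion that consumed its full token budget
// was almost certainly truncated rather than finished naturally.
function likelyTruncated(completionTokens: number, maxTokens: number): boolean {
  return completionTokens >= maxTokens;
}

// Hypothetical retry policy: double the limit (8,192 → 16,384, as in
// the manual fix above), bounded by an overall cap.
function bumpTokenLimit(maxTokens: number, cap: number = 32768): number {
  return Math.min(maxTokens * 2, cap);
}
```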

The Bigger Picture: Building Resilient Systems

This session highlighted several important principles for building robust background job systems:

  1. Always update persistent state - Don't rely solely on real-time events
  2. Implement orphan cleanup - Systems should self-heal from inconsistent states
  3. Monitor resource limits - Token limits, memory usage, and timeouts matter
  4. Design for failure - Assume processes will crash and plan accordingly

What's Next

The immediate next steps involve:

  • Testing the Resume functionality on the previously stuck workflow
  • Retrying the token-limited steps with higher limits
  • Considering default token limit increases for long-output operations
  • Applying similar orphan cleanup patterns to other background job types

Sometimes the best development sessions are the ones that start with a simple UI request and end with a more resilient system architecture. The navigation improvements were nice, but fixing those orphaned processes? That's the kind of work that prevents 3 AM production incidents.


Have you encountered similar issues with background job management? I'd love to hear about your approaches to building resilient async systems. Feel free to reach out with your own war stories and solutions.