From Orphaned Processes to Polished Pixels: A Night of Workflow Wizardry
A deep dive into a recent development sprint, tackling critical backend bugs, enhancing user experience on workflow detail pages, and sharing key lessons learned.
Late nights in development often lead to some of the most satisfying breakthroughs. This past session was one of those, a focused sprint to not only elevate user experience but also to squash some critical backend gremlins that had been causing silent headaches. We wrapped up with a comprehensive set of improvements, from a snazzier workflow detail page to ensuring our background jobs never get lost in the ether again.
Let's unpack what went down.
Elevating the Workflow Detail Page UX
The workflow detail page is a central hub for our users, and we wanted to make it as intuitive and powerful as possible. The goal was to improve navigation, provide better control over step visibility, and offer quick jump points.
The Sticky Navigation Bar: Your New Workflow Co-Pilot
One of the biggest additions is a brand-new sticky navigation bar that sits elegantly between the workflow settings panel and the pipeline steps. Built with `sticky top-0 z-10 bg-nyx-surface`, it ensures crucial controls are always at your fingertips.
This bar features:
- Left Side: "Expand All" and "Collapse All" buttons (powered by `ChevronsDownUp` and `ChevronsUpDown` from `lucide-react`). Their handlers (`handleExpandAll()`, `handleCollapseAll()`) give users granular control over how much detail they want to see.
- Right Side: A horizontally scrollable row of "step pills." Each pill represents a step in the workflow, adorned with a status-colored dot and a truncated label (max 12 characters) for quick identification. Clicking a pill triggers `handleJumpToStep(stepId)`, smoothly scrolling the user to the corresponding step, which now has an `id={`step-${step.id}`}` DOM anchor.
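The pill behavior described above can be sketched with a couple of small helpers. The function names, the 12-character cap, and the `step-${id}` anchor convention come from the description; everything else (exact signatures, the ellipsis character) is an illustrative assumption, not the actual component code:

```typescript
// Hypothetical helpers for the step-pill row (names are assumptions).

// Truncate a step label to `max` characters, appending an ellipsis.
function pillLabel(label: string, max = 12): string {
  return label.length <= max ? label : label.slice(0, max) + '…';
}

// Resolve the DOM anchor a pill scrolls to (matches id={`step-${step.id}`}).
function stepAnchorId(stepId: string): string {
  return `step-${stepId}`;
}

// In the click handler, the jump would look something like:
// document.getElementById(stepAnchorId(step.id))
//   ?.scrollIntoView({ behavior: 'smooth', block: 'start' });
```

Keeping the truncation and anchor logic in pure functions like these makes the pill row trivial to unit-test without a DOM.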
Smarter Step Expansion and Resumption
We also fine-tuned the logic for step visibility:
- Auto-Expansion for Active Steps: The `isStepExpanded()` logic was updated to automatically expand any pending step that is currently active, not just those awaiting review or paused. This keeps the user's focus on what's happening now.
- "Resume from here" for Terminal Workflows: For workflows that have completed, failed, or paused, we've added a handy "Resume from here" button. This isn't just a simple retry; it intelligently chains a retry mutation with a resume mutation, effectively resetting the workflow from that specific step and immediately restarting execution. This empowers users to quickly recover and iterate on their work.
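The updated expansion rule can be captured in a small pure function. This is a minimal sketch: the status names and the idea of passing the currently active step's id are assumptions made for illustration, not the real signature of `isStepExpanded()`:

```typescript
// Hypothetical status values; the real app's enum may differ.
type StepStatus = 'pending' | 'awaiting_review' | 'paused' | 'completed' | 'failed';

interface StepLike {
  id: string;
  status: StepStatus;
}

// A step is auto-expanded when it needs the user's attention:
// awaiting review, paused, or (the new case) the pending step
// the engine is currently executing.
function isStepExpanded(step: StepLike, activeStepId: string | null): boolean {
  if (step.status === 'awaiting_review' || step.status === 'paused') return true;
  return step.status === 'pending' && step.id === activeStepId;
}
```

Modeling this as a pure predicate keeps the "what should be open" decision testable, independent of React state.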
Taming the Backend Beasts: Critical Bug Fixes
While the UX enhancements are visually appealing, the backend fixes were absolutely critical for the reliability and integrity of our system.
The Elusive Retry Mutation Bug
This was a head-scratcher that manifested in a very subtle, yet critical way.
- The Problem: When retrying a multi-generate step (steps that produce multiple alternative outputs, like AI suggestions), the system would appear to retry but then immediately mark the workflow as `completed`.
- The Root Cause: Our retry mutation, specifically in `src/server/trpc/routers/workflows.ts`, wasn't correctly clearing the `alternatives` and `selectedIndex` fields for these multi-generate steps. The engine would then see stale alternatives with an existing selection, assume the step had already been executed successfully, and skip processing it, leading to a premature workflow completion.
- The Fix: We explicitly added `alternatives: Prisma.JsonNull` and `selectedIndex: null` to the reset data within the retry mutation. This ensures that when a step is retried, it truly starts fresh, prompting the engine to re-execute and generate new outputs.
```typescript
// Simplified example of the fix in the retry mutation
// ...
if (isMultiGenerateStep) {
  updateData.alternatives = Prisma.JsonNull;
  updateData.selectedIndex = null;
}
// ...
```
This highlights a crucial lesson: when dealing with stateful operations like retries, always consider all relevant state variables that might influence subsequent logic.
Preventing Orphaned SSE Processes
We rely heavily on Server-Sent Events (SSE) for real-time updates on long-running processes like code analysis and workflows. However, we discovered a vulnerability that could leave these processes in a perpetual "active" state if the SSE connection crashed or disconnected unexpectedly.
- The Problem: Both our `code-analysis` and `workflows` SSE endpoints (`src/app/api/v1/events/code-analysis/[id]/route.ts` and `src/app/api/v1/events/workflows/[id]/route.ts`) lacked robust error handling in their `catch` blocks. If an error occurred or the client disconnected, the database entry for that run/workflow would remain in an `active` or `analyzing` state indefinitely, even though no process was actually running.
- The Pattern & The Fix: This was a classic "aha!" moment when we realized both endpoints suffered from the same oversight. The fix was identical: import `prisma` (if not already present) and ensure the `catch` block explicitly updates the corresponding database record to `failed`. This provides immediate feedback and prevents ghost processes.
```typescript
// Example fix in an SSE route handler catch block
// ...
try {
  // SSE streaming logic
} catch (error) {
  console.error(`SSE stream error for [id]:`, error);
  // Ensure the run/workflow is marked as failed in the DB
  await prisma.workflowRun.update({
    where: { id: workflowId },
    data: { status: 'failed', endedAt: new Date() },
  });
  // ... clean up resources
}
// ...
```
- Proactive Cleanup: To further enhance robustness, we also added a proactive cleanup mechanism to our `code-analysis` router (`src/server/trpc/routers/code-analysis.ts`). The `runs.start` mutation now automatically cleans up any code-analysis runs that have been stuck in an `active` status for more than 10 minutes, providing a fail-safe against any future orphaned states.
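The cleanup step can be sketched as a cutoff computation plus an `updateMany` at the top of the mutation. The 10-minute window is from the description above; the model name `codeAnalysisRun` and the field names in the commented query are assumptions for illustration:

```typescript
// Runs older than this that are still 'active' are considered orphaned
// (the 10-minute window comes from the cleanup described above).
const STALE_AFTER_MS = 10 * 60 * 1000;

// Compute the cutoff timestamp: anything started before it is stale.
function staleCutoff(now: Date, staleAfterMs = STALE_AFTER_MS): Date {
  return new Date(now.getTime() - staleAfterMs);
}

// Inside the runs.start mutation, a sweep might look like
// (model and field names are hypothetical):
// await prisma.codeAnalysisRun.updateMany({
//   where: { status: 'active', startedAt: { lt: staleCutoff(new Date()) } },
//   data: { status: 'failed', endedAt: new Date() },
// });
```

Running the sweep inside `runs.start` means no separate cron job is needed: every new run opportunistically garbage-collects its stuck predecessors.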
Practical Optimizations: maxTokens Adjustment
Sometimes, the simplest fixes have a big impact. We identified two specific steps ("Extend & Improve" and "Improve") in a particular workflow (f89f7f72) that were consistently hitting their `maxTokens` limit of 8192, resulting in truncated AI outputs. A quick database bump to 16384 for these steps resolved the issue, allowing for richer, more complete generations. This is a good reminder to review default limits for generative AI tasks.
Lessons Learned & What's Next
This session underscored a few critical development principles:
- Thorough State Management for Mutating Operations: Especially in retry or reset logic, ensure all relevant state variables are correctly handled. Overlooking one piece of the puzzle can lead to subtle yet critical bugs.
- Robust Error Handling in Long-Running Processes: For any background job or streaming process (like SSE), a `catch` block isn't just for logging; it's a crucial place to update the system's state, preventing orphaned processes and providing accurate feedback to users.
- Proactive System Hygiene: Implementing automated cleanup for stuck processes is invaluable. It reduces manual intervention and improves overall system reliability.
Looking ahead, we've identified a few immediate next steps:
- Review and potentially bump default `maxTokens` in `src/lib/constants.ts` for other templates that might benefit from longer outputs.
- Implement similar orphan-cleanup logic directly in the workflow `start` mutation, mirroring the `code-analysis` improvement.
- Address a pre-existing type error in the discussions page (`src/app/(dashboard)/dashboard/discussions/[id]/page.tsx:139`) related to a `Badge` variant.
It was a productive session, leaving the system more robust and user-friendly. Onwards to the next challenge!