From Orphaned Processes to Polished Pixels: A Night of Workflow Wizardry
A deep dive into a recent development sprint, tackling critical backend bugs, enhancing user experience on workflow detail pages, and sharing key lessons learned.
Late nights in development often lead to some of the most satisfying breakthroughs. This past session was one of those, a focused sprint to not only elevate user experience but also to squash some critical backend gremlins that had been causing silent headaches. We wrapped up with a comprehensive set of improvements, from a snazzier workflow detail page to ensuring our background jobs never get lost in the ether again.
Let's unpack what went down.
Elevating the Workflow Detail Page UX
The workflow detail page is a central hub for our users, and we wanted to make it as intuitive and powerful as possible. The goal was to improve navigation, provide better control over step visibility, and offer quick jump points.
The Sticky Navigation Bar: Your New Workflow Co-Pilot
One of the biggest additions is a brand-new sticky navigation bar that sits elegantly between the workflow settings panel and the pipeline steps. Built with `sticky top-0 z-10 bg-nyx-surface`, it ensures crucial controls are always at your fingertips.
This bar features:
- Left Side: "Expand All" and "Collapse All" buttons (powered by `ChevronsDownUp` and `ChevronsUpDown` from `lucide-react`). Their handlers (`handleExpandAll()`, `handleCollapseAll()`) give users granular control over how much detail they want to see.
- Right Side: A horizontally scrollable row of "step pills." Each pill represents a step in the workflow, adorned with a status-colored dot and a truncated label (max 12 characters) for quick identification. Clicking a pill triggers `handleJumpToStep(stepId)`, smoothly scrolling the user to the corresponding step, which now has an `id={`step-${step.id}`}` DOM anchor.
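The pill behavior described above can be sketched with a couple of small helpers. The function names, the 12-character cap, and the `step-${id}` anchor convention come from the description; everything else (exact signatures, the ellipsis character) is an illustrative assumption, not the actual component code:

```typescript
// Hypothetical helpers for the step-pill row (names are assumptions).

// Truncate a step label to `max` characters, appending an ellipsis.
function pillLabel(label: string, max = 12): string {
  return label.length <= max ? label : label.slice(0, max) + '…';
}

// Resolve the DOM anchor a pill scrolls to (matches id={`step-${step.id}`}).
function stepAnchorId(stepId: string): string {
  return `step-${stepId}`;
}

// In the click handler, the jump would look something like:
// document.getElementById(stepAnchorId(step.id))
//   ?.scrollIntoView({ behavior: 'smooth', block: 'start' });
```

Keeping the truncation and anchor logic in pure functions like these makes the pill row trivial to unit-test without a DOM.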
Smarter Step Expansion and Resumption
We also fine-tuned the logic for step visibility:
- Auto-Expansion for Active Steps: The `isStepExpanded()` logic was updated to automatically expand any pending step that is currently active, not just those awaiting review or paused. This keeps the user's focus on what's happening now.
- "Resume from here" for Terminal Workflows: For workflows that have completed, failed, or paused, we've added a handy "Resume from here" button. This isn't just a simple retry; it intelligently chains a retry mutation with a resume mutation, effectively resetting the workflow from that specific step and immediately restarting execution. This empowers users to quickly recover and iterate on their work.
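The updated expansion rule can be captured in a small pure function. This is a minimal sketch: the status names and the idea of passing the currently active step's id are assumptions made for illustration, not the real signature of `isStepExpanded()`:

```typescript
// Hypothetical status values; the real app's enum may differ.
type StepStatus = 'pending' | 'awaiting_review' | 'paused' | 'completed' | 'failed';

interface StepLike {
  id: string;
  status: StepStatus;
}

// A step is auto-expanded when it needs the user's attention:
// awaiting review, paused, or (the new case) the pending step
// the engine is currently executing.
function isStepExpanded(step: StepLike, activeStepId: string | null): boolean {
  if (step.status === 'awaiting_review' || step.status === 'paused') return true;
  return step.status === 'pending' && step.id === activeStepId;
}
```

Modeling this as a pure predicate keeps the "what should be open" decision testable, independent of React state.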
Taming the Backend Beasts: Critical Bug Fixes
While the UX enhancements are visually appealing, the backend fixes were absolutely critical for the reliability and integrity of our system.
The Elusive Retry Mutation Bug
This was a head-scratcher that manifested in a very subtle, yet critical way.
- The Problem: When retrying a multi-generate step (steps that produce multiple alternative outputs, like AI suggestions), the system would appear to retry but then immediately mark the workflow as `completed`.
- The Root Cause: Our retry mutation, specifically in `src/server/trpc/routers/workflows.ts`, wasn't correctly clearing the `alternatives` and `selectedIndex` fields for these multi-generate steps. The engine would then see stale alternatives with an existing selection, assume the step had already been executed successfully, and skip processing it, leading to a premature workflow completion.
- The Fix: We explicitly added `alternatives: Prisma.JsonNull` and `selectedIndex: null` to the reset data within the retry mutation. This ensures that when a step is retried, it truly starts fresh, prompting the engine to re-execute and generate new outputs.
```typescript
// Simplified example of the fix in the retry mutation
// ...
if (isMultiGenerateStep) {
  updateData.alternatives = Prisma.JsonNull;
  updateData.selectedIndex = null;
}
// ...
```
This highlights a crucial lesson: when dealing with stateful operations like retries, always consider all relevant state variables that might influence subsequent logic.
Preventing Orphaned SSE Processes
We rely heavily on Server-Sent Events (SSE) for real-time updates on long-running processes like code analysis and workflows. However, we discovered a vulnerability that could leave these processes in a perpetual "active" state if the SSE connection crashed or disconnected unexpectedly.
- The Problem: Both our `code-analysis` and `workflows` SSE endpoints (`src/app/api/v1/events/code-analysis/[id]/route.ts` and `src/app/api/v1/events/workflows/[id]/route.ts`) lacked robust error handling in their `catch` blocks. If an error occurred or the client disconnected, the database entry for that run/workflow would remain in an `active` or `analyzing` state indefinitely, even though no process was actually running.
- The Pattern & The Fix: This was a classic "aha!" moment when we realized both endpoints suffered from the same oversight. The fix was identical: import `prisma` (if not already present) and ensure the `catch` block explicitly updates the corresponding database record to `failed`. This provides immediate feedback and prevents ghost processes.
```typescript
// Example fix in an SSE route handler catch block
// ...
try {
  // SSE streaming logic
} catch (error) {
  console.error(`SSE stream error for [id]:`, error);
  // Ensure the run/workflow is marked as failed in the DB
  await prisma.workflowRun.update({
    where: { id: workflowId },
    data: { status: 'failed', endedAt: new Date() },
  });
  // ... clean up resources
}
// ...
```
- Proactive Cleanup: To further enhance robustness, we also added a proactive cleanup mechanism to our `code-analysis` router (`src/server/trpc/routers/code-analysis.ts`). The `runs.start` mutation now automatically cleans up any code-analysis runs that have been stuck in an `active` status for more than 10 minutes, providing a fail-safe against any future orphaned states.
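The cleanup step can be sketched as a cutoff computation plus an `updateMany` at the top of the mutation. The 10-minute window is from the description above; the model name `codeAnalysisRun` and the field names in the commented query are assumptions for illustration:

```typescript
// Runs older than this that are still 'active' are considered orphaned
// (the 10-minute window comes from the cleanup described above).
const STALE_AFTER_MS = 10 * 60 * 1000;

// Compute the cutoff timestamp: anything started before it is stale.
function staleCutoff(now: Date, staleAfterMs = STALE_AFTER_MS): Date {
  return new Date(now.getTime() - staleAfterMs);
}

// Inside the runs.start mutation, a sweep might look like
// (model and field names are hypothetical):
// await prisma.codeAnalysisRun.updateMany({
//   where: { status: 'active', startedAt: { lt: staleCutoff(new Date()) } },
//   data: { status: 'failed', endedAt: new Date() },
// });
```

Running the sweep inside `runs.start` means no separate cron job is needed: every new run opportunistically garbage-collects its stuck predecessors.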
Practical Optimizations: maxTokens Adjustment
Sometimes, the simplest fixes have a big impact. We identified two specific steps ("Extend & Improve" and "Improve") in a particular workflow (f89f7f72) that were consistently hitting their `maxTokens` limit of 8192, resulting in truncated AI outputs. A quick database bump to 16384 for these steps resolved the issue, allowing for richer, more complete generations. This is a good reminder to review default limits for generative AI tasks.
Lessons Learned & What's Next
This session underscored a few critical development principles:
- Thorough State Management for Mutating Operations: Especially in retry or reset logic, ensure all relevant state variables are correctly handled. Overlooking one piece of the puzzle can lead to subtle yet critical bugs.
- Robust Error Handling in Long-Running Processes: For any background job or streaming process (like SSE), a `catch` block isn't just for logging; it's a crucial place to update the system's state, preventing orphaned processes and providing accurate feedback to users.
- Proactive System Hygiene: Implementing automated cleanup for stuck processes is invaluable. It reduces manual intervention and improves overall system reliability.
Looking ahead, we've identified a few immediate next steps:
- Review and potentially bump default `maxTokens` in `src/lib/constants.ts` for other templates that might benefit from longer outputs.
- Implement similar orphan-cleanup logic directly in the workflow `start` mutation, mirroring the `code-analysis` improvement.
- Address a pre-existing type error in the discussions page (`src/app/(dashboard)/dashboard/discussions/[id]/page.tsx:139`) related to a `Badge` variant.
It was a productive session, leaving the system more robust and user-friendly. Onwards to the next challenge!