From Orphaned Runs to Seamless Journeys: Our Latest Workflow & Reliability Upgrades

Late last night, as the city slept, our dev session was buzzing. The mission? To iron out a few critical kinks and ship some welcome user experience improvements. We tackled everything from workflow navigation to making sure our background jobs never get lost in the ether again. Let's unpack the journey.

Elevating the Workflow Experience: Navigate with Ease

Our workflow detail page is where the magic happens – a visual representation of complex processes. As these workflows grow, navigating them can become cumbersome. Our primary goal for this session was to make that journey intuitive and efficient.

We introduced a suite of new navigation features:

Expand All / Collapse All: Two simple buttons (using the ChevronsDownUp and ChevronsUpDown icons from lucide-react) let users open every step at once or collapse the whole workflow into a high-level overview. This helps most on long, intricate pipelines.
Jump to Step: Ever scroll endlessly to find a specific step? Not anymore! Each workflow step now carries a unique DOM anchor (id={`step-${step.id}`}), and a new sticky navigation bar shows "step pills": small, status-colored indicators, one per step. Clicking a pill scrolls you straight to the corresponding step.
Sticky Navigation Bar: This is the glue holding it all together. Positioned between the settings panel and the pipeline visualization, the bar stays pinned to the top (sticky top-0) as you scroll. It houses the Expand All / Collapse All buttons on the left and a horizontally scrollable list of status-colored step pills on the right, with long labels truncated to keep the pills compact. Critical controls and navigation stay within reach at all times.

These additions turn the workflow detail page from a static display into an interactive canvas, and make long pipelines much quicker to get around.
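
To make the jump-to-step behavior concrete, here is a minimal sketch of the logic behind a step pill. Only the step-${step.id} anchor format comes from this post; the type names, the status list, the color classes, and the helper functions are assumptions made for illustration, not our actual component code.

```typescript
// Hypothetical types and names; only the `step-${id}` anchor format comes from the post.
type StepStatus = "pending" | "running" | "completed" | "failed";

interface StepPill {
  id: string;
  label: string;
  status: StepStatus;
}

// Must match the id attribute rendered on each workflow step.
const stepAnchorId = (stepId: string): string => `step-${stepId}`;

// Assumed status-to-color mapping (utility-class names are purely illustrative).
const PILL_CLASS: Record<StepStatus, string> = {
  pending: "bg-gray-200 text-gray-700",
  running: "bg-blue-200 text-blue-800",
  completed: "bg-green-200 text-green-800",
  failed: "bg-red-200 text-red-800",
};

// Shorten long labels so pills stay compact in the horizontally scrollable bar.
function truncateLabel(label: string, max = 16): string {
  return label.length <= max ? label : `${label.slice(0, max - 1)}…`;
}

// Everything a pill needs in order to render and to find its target step.
function pillProps(step: StepPill): { targetId: string; className: string; text: string } {
  return {
    targetId: stepAnchorId(step.id),
    className: PILL_CLASS[step.status],
    text: truncateLabel(step.label),
  };
}

// Minimal slice of the DOM API, so the logic can also run outside a browser.
interface AnchorLookup {
  getElementById(id: string): {
    scrollIntoView(options?: { behavior?: "smooth" | "auto"; block?: "start" | "center" | "end" | "nearest" }): void;
  } | null;
}

// Scroll to a step's anchor; returns false when the step is not in the DOM.
function jumpToStep(doc: AnchorLookup, stepId: string): boolean {
  const target = doc.getElementById(stepAnchorId(stepId));
  if (!target) return false;
  target.scrollIntoView({ behavior: "smooth", block: "start" });
  return true;
}
```

In the browser, a pill's click handler could call jumpToStep(document, step.id). Because the bar is sticky, a step scrolled to the very top can end up hidden beneath it; setting CSS scroll-margin-top on each step is one way to offset that (a suggestion, not something this session shipped).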

Taming the Backend Beasts: Fortifying Reliability and State Management

While improving the frontend, we also dedicated significant effort to bolstering our backend's reliability, particularly around long-running background jobs and state consistency.

When LLMs Hit Their Limit: The `maxTokens` Fix

We identified a subtle but critical issue: two steps in our workflows, "Extend & Improve" and "Improve", were consistently hitting the 8192-token completion ceiling for Large Language Model (LLM) outputs. The result was truncated responses that left users with incomplete content.

Our immediate fix was to raise maxTokens for these two steps directly in the database, from 8192 to 16384, giving the LLM room to finish its output. Outputs that were already truncated are not regenerated on their own, so users need to "Retry" these steps to get content produced under the higher limit. It was also a valuable reminder: we need to review, and possibly raise, the default maxTokens in src/lib/constants.ts for templates prone to generating longer outputs.
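
As a rough sketch of what per-step defaults could look like, the snippet below resolves a token limit by step name and flags completions that probably hit the ceiling. The constant and function names are hypothetical and do not reflect the current contents of src/lib/constants.ts; only the 8192 and 16384 values and the two step names come from this post.

```typescript
// Hypothetical defaults; only the numbers and the two step names come from the post.
const DEFAULT_MAX_TOKENS = 8192;

// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const MAX_TOKENS_BY_STEP = new Map<string, number>([
  ["Extend & Improve", 16384],
  ["Improve", 16384],
]);

// An explicit per-step value wins, then the per-name override, then the global default.
function resolveMaxTokens(stepName: string, explicit?: number): number {
  return explicit ?? MAX_TOKENS_BY_STEP.get(stepName) ?? DEFAULT_MAX_TOKENS;
}

// Heuristic: a completion that used its entire budget was probably cut off.
function looksTruncated(completionTokens: number, maxTokens: number): boolean {
  return completionTokens >= maxTokens;
}
```

Many LLM APIs also report a finish or stop reason that signals a length cutoff; where one is available, it is a more direct signal than the token-count heuristic above.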

Rescuing Orphaned Background Jobs: A Pattern of Robustness

One of the more challenging aspects of distributed systems is ensuring background jobs complete gracefully, even if the primary process crashes or a client disconnects. We uncovered a recurring vulnerability that led to "orphaned" jobs – tasks stuck in an analyzing or pending state, despite no active process.

The Problem: We found a code-analysis run (e.g., de75b23a) stuck in analyzing status without a startedAt timestamp, indicating a process crash. Similarly, a workflow (f89f7f72) showed completed status but its last step was pending, an inconsistent state. The common thread? Our Server-Sent Events (SSE) route handlers' catch blocks were only sending an SSE event, not updating the database state on error. If the client disconnected or the server crashed, the DB state remained unresolved.
The Fix: We changed the pattern in both src/app/api/v1/events/code-analysis/[id]/route.ts and src/app/api/v1/events/workflows/[id]/route.ts. The catch block now explicitly updates the database to mark the run or workflow as failed when an error occurs, so the stored state reflects the outcome even if the client has already disconnected. A hard server crash never reaches a catch block, which is why the cleanup below is needed as well.
Proactive Cleanup: To further enhance resilience, we added an "orphan cleanup" mechanism to our src/server/trpc/routers/code-analysis.ts. Before creating a new code analysis run, the runs.start mutation now automatically cleans up any runs that have been stuck in an active status for more than 10 minutes. This prevents the accumulation of stale, orphaned jobs. We're considering implementing similar logic for workflow start mutations.
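
The two halves of this fix can be sketched together. The example below uses an in-memory Map in place of the database so it runs on its own; the real handlers persist through Prisma. All function and field names are hypothetical, measuring staleness from an updatedAt field is an assumption, and only the statuses and the 10-minute window come from this post.

```typescript
// Illustrative in-memory model; names are hypothetical.
type RunStatus = "pending" | "analyzing" | "completed" | "failed";

interface Run {
  id: string;
  status: RunStatus;
  updatedAt: Date; // assumption: staleness is measured from the last update
  error: string | null;
}

const ACTIVE_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>(["pending", "analyzing"]);
const ORPHAN_AFTER_MS = 10 * 60 * 1000; // the 10-minute window from the post

function setStatus(runs: Map<string, Run>, id: string, status: RunStatus, error: string | null = null): void {
  const run = runs.get(id);
  if (!run) return;
  run.status = status;
  run.error = error;
  run.updatedAt = new Date();
}

// Run the job; on failure, persist "failed" first, then try to notify the client.
async function runWithFailurePersistence(
  runs: Map<string, Run>,
  runId: string,
  work: () => Promise<void>,
  emit: (event: string, data: string) => void,
): Promise<void> {
  try {
    await work();
    setStatus(runs, runId, "completed");
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    setStatus(runs, runId, "failed", message); // does not depend on the client
    try {
      emit("error", message);
    } catch {
      // Client already disconnected; the stored state is still correct.
    }
  }
}

// Mark active runs older than the threshold as failed; returns the cleaned ids.
function cleanupOrphans(runs: Map<string, Run>, now: Date = new Date()): string[] {
  const cleaned: string[] = [];
  runs.forEach((run) => {
    const ageMs = now.getTime() - run.updatedAt.getTime();
    if (ACTIVE_STATUSES.has(run.status) && ageMs > ORPHAN_AFTER_MS) {
      run.status = "failed";
      run.error = "Marked as failed by orphan cleanup";
      run.updatedAt = now;
      cleaned.push(run.id);
    }
  });
  return cleaned;
}
```

The ordering inside the catch block is a deliberate choice in this sketch: the state write happens before the emit, so a failed emit to a disconnected client cannot prevent the failure from being recorded.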

Challenges & Lessons Learned

Every development session comes with its unique set of hurdles. These "pain points" often transform into valuable lessons.

Database Debugging: Prisma vs. Raw SQL: Our initial attempt to inspect the database using prisma db execute hit a snag: the command runs raw SQL, so queries have to use the table names as they actually exist in the database rather than the way we refer to models in Prisma Client code, and it is built to report success or failure, not to print query results. So we dropped down to psql directly (PGPASSWORD=nyxcore_dev psql -U nyxcore -h localhost -d nyxcore). The lesson here is that while ORMs are fantastic, sometimes you need to get your hands dirty with raw SQL.
The camelCase Column Name Gotcha: When working with raw SQL on a Prisma-managed database, we hit a common pitfall: PostgreSQL folds unquoted identifiers to lowercase, so Prisma's camelCase column names (e.g., "workflowId", "stepType") must be double-quoted in raw SQL queries. Without the quotes, workflowId is looked up as workflowid and the query fails with a "column does not exist" error. A small detail, but a crucial one for direct DB interaction.
Robust Error Handling in Asynchronous Systems: The most significant lesson was the discovery of the identical bug in both SSE route handlers. It highlighted a critical pattern: in asynchronous, event-driven systems, merely emitting an event on error is insufficient. The system's persistent state (the database) must also be updated to reflect failures, especially when client-server connections are ephemeral. This ensures system integrity and prevents orphaned processes.
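
For the camelCase gotcha above, a small helper can take the guesswork out of quoting when a raw query is assembled in code. This is a sketch, not code from our repository: the function names are made up, and "workflow_steps" is a hypothetical table name used only to show the output shape.

```typescript
// Quote a PostgreSQL identifier: wrap it in double quotes and double any embedded quote.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build a SELECT with every identifier quoted.
// Values still belong in bound parameters, never in the SQL string itself.
function buildSelect(table: string, columns: string[]): string {
  if (columns.length === 0) throw new Error("at least one column is required");
  const columnList = columns.map((column) => quoteIdent(column)).join(", ");
  return `SELECT ${columnList} FROM ${quoteIdent(table)}`;
}

// "workflow_steps" is a hypothetical table name; the columns are the ones named in the post.
console.log(buildSelect("workflow_steps", ["workflowId", "stepType"]));
// prints: SELECT "workflowId", "stepType" FROM "workflow_steps"
```

Quoting every identifier is slightly noisier than quoting only the mixed-case ones, but it means nobody has to remember which columns need it.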

Looking Ahead

With these enhancements and fixes rolled out, our system is more robust and user-friendly. Here's what's immediately next:

Users should click Resume on workflow f89f7f72 to complete its final step.
Users should Retry steps "Extend & Improve" and "Improve" to benefit from the increased maxTokens limit.
We'll be reviewing default maxTokens in src/lib/constants.ts to prevent future truncation issues.
Implementing similar orphan-cleanup logic for workflow start mutations is on the roadmap.
And finally, we'll squash that pre-existing Badge variant type error in our discussions page – a minor but important cleanup!

This session was a testament to our commitment to continuous improvement, balancing elegant user experiences with rock-solid backend reliability. We're excited for you to experience the smoother, more dependable platform!