Taming Our Workflows: Sticky Nav, Smart Steps, and Slaying Orphaned Jobs
Join us on a journey through a recent dev session where we tackled both user experience enhancements for our workflow detail pages and critical backend stability issues, particularly around orphaned background jobs and SSE endpoints. We'll share our wins, our woes, and the hard-won lessons.
Late-night coding sessions often feel like a race against time, but they're also fertile ground for deep dives and impactful fixes. Recently, I found myself wrestling with two distinct but equally important challenges in our application: enhancing the user experience on our workflow detail pages and stabilizing our backend against the insidious problem of orphaned background jobs. This post chronicles that session and the technical decisions behind each fix.
We're building a system where complex workflows are central, involving multiple steps, background processes, and real-time updates via Server-Sent Events (SSE). Ensuring both a smooth user journey and robust backend operations is paramount.
Elevating the Workflow Detail Experience: Navigating Complexity
Our workflow detail page (src/app/(dashboard)/dashboard/workflows/[id]/page.tsx) is where users spend significant time monitoring and interacting with their multi-step processes. As workflows grow in complexity, simply scrolling through dozens of steps becomes cumbersome. We needed better navigation.
The solution came in three parts:
- Global Expand/Collapse: Users often want to see either a high-level overview or dive into the nitty-gritty of every single step. We implemented "Expand All" and "Collapse All" functionality using lucide-react's ChevronsDownUp and ChevronsUpDown icons. This was straightforward: a state variable to control the expanded status of all steps, toggled by simple click handlers.
- Jump Marks for Instant Access: To allow direct navigation to specific steps, we added unique id attributes to each step's outer wrapper div: id={`step-${step.id}`}. This provides stable DOM anchors, making it easy to link directly to a step or programmatically scroll to it.
- The Sticky Navigator: Always in View. This was the biggest UX win for the page. We introduced a sticky navigation bar positioned between the workflow settings panel and the main pipeline visualization. This bar remains visible as the user scrolls, offering constant access to key controls:
  - Left Side: Our newly added "Expand All" and "Collapse All" buttons.
  - Right Side: A horizontally scrollable list of "step pills." Each pill represents a step, featuring a status-colored dot (e.g., green for completed, yellow for pending) and a truncated label. Clicking a pill uses our handleJumpToStep() function to smoothly scroll the user to that specific step.
Styling this sticky bar was crucial for integration:
.sticky-navigator {
  @apply sticky top-0 z-10 bg-nyx-surface border border-nyx-border;
}
This ensures it stays visible at the top, layers correctly (z-10), and blends seamlessly with our existing UI theme.
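To make the navigation pieces concrete, here is a minimal, framework-free sketch of the expand/collapse state and the jump behaviour. It is not the page's actual code: WorkflowStep, ScrollTarget, AnchorLookup, stepAnchorId, and setAllExpanded are hypothetical names, and the body of handleJumpToStep is an assumption about how such a handler could work. In the browser, the global document object would be passed as the lookup.

```typescript
// Hypothetical step shape; the real page works with React state and the browser DOM.
interface WorkflowStep {
  id: string;
  label: string;
}

// Minimal slice of the DOM API used here, so the sketch also runs outside a browser.
interface ScrollTarget {
  scrollIntoView(options?: { behavior?: "smooth" | "auto"; block?: "start" | "center" }): void;
}
interface AnchorLookup {
  getElementById(id: string): ScrollTarget | null;
}

// Builds the DOM anchor placed on each step wrapper: id={`step-${step.id}`}.
function stepAnchorId(stepId: string): string {
  return `step-${stepId}`;
}

// "Expand All" / "Collapse All": one expanded flag per step, all set to the same value.
function setAllExpanded(steps: WorkflowStep[], expanded: boolean): Record<string, boolean> {
  const state: Record<string, boolean> = {};
  for (const step of steps) {
    state[step.id] = expanded;
  }
  return state;
}

// Scrolls smoothly to a step's anchor; returns false when the anchor is missing.
function handleJumpToStep(doc: AnchorLookup, stepId: string): boolean {
  const target = doc.getElementById(stepAnchorId(stepId));
  if (target === null) {
    return false;
  }
  target.scrollIntoView({ behavior: "smooth", block: "start" });
  return true;
}
```

Keeping the anchor format in one helper means the step wrapper and the pill click handler cannot drift apart on the id convention.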
The Battle Against Orphaned Jobs and Inconsistent States
While the UI improvements bring immediate user delight, the backend fixes address deeper stability and data integrity issues. Nothing frustrates a user more than a process stuck in limbo, or data that just doesn't make sense.
The maxTokens Gauntlet
One user reported a workflow (f89f7f72) where two critical steps, "Extend & Improve" and "Improve," consistently failed to generate complete output. Digging in, we found they were hitting the 8192 completion token ceiling exactly (8192/8192). This wasn't a bug in logic, but a hard limit being reached.
The Fix: A direct database update was needed. We bumped maxTokens for these specific steps from 8192 to 16384. The user will need to manually Retry these steps to trigger regeneration with the higher limit.
Lesson Learned: While direct DB fixes are quick, they highlight a potential need to review default maxTokens in our src/lib/constants.ts for templates that frequently produce long outputs. Hardcoding limits can lead to unexpected truncations.
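A check along these lines could flag truncated steps before a user has to report them. It is only a sketch: TokenUsage, hitTokenCeiling, and suggestMaxTokens are hypothetical names, the doubling simply mirrors the 8192 to 16384 bump above, and the modelCap default is an assumed placeholder rather than a real model limit.

```typescript
// Hypothetical usage record for one step's completion.
interface TokenUsage {
  completionTokens: number;
  maxTokens: number;
}

// A completion that consumed its entire budget was very likely cut off mid-output.
function hitTokenCeiling(usage: TokenUsage): boolean {
  return usage.completionTokens >= usage.maxTokens;
}

// Doubles the limit (as in the 8192 -> 16384 bump), capped at an assumed model maximum.
function suggestMaxTokens(current: number, modelCap: number = 32768): number {
  return Math.min(current * 2, modelCap);
}
```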
Catching the Orphans: Code Analysis Runs
A more critical issue surfaced with our code analysis runs. Run de75b23a was stuck in an analyzing status without a startedAt timestamp – a tell-tale sign of a crashed process. The client likely disconnected, or the server process died, leaving the database record in an inconsistent state.
The Root Cause: Our SSE route handler for code analysis (src/app/api/v1/events/code-analysis/[id]/route.ts) had a critical omission in its catch block. It was sending an SSE event about the error, but not updating the database. If the client was already disconnected, that event was useless, and the DB state remained analyzing.
The Fix: We modified the catch block to explicitly update the run's status to failed in the database, along with an error message.
// src/app/api/v1/events/code-analysis/[id]/route.ts (Conceptual)
try {
  // ... existing SSE logic for streaming events ...
} catch (error: unknown) {
  // Narrow the unknown error before reading its message
  const message = error instanceof Error ? error.message : 'An unknown error occurred';
  console.error(`Error in code analysis SSE for run ${runId}:`, error);
  // CRITICAL: Update database status on error to prevent orphaned jobs
  await prisma.codeAnalysisRun.update({
    where: { id: runId },
    data: { status: 'failed', errorMessage: message },
  });
  // ... (Optional) send SSE error event if client is still connected ...
}
Proactive Cleanup: To prevent future build-up of such orphans, we added a cleanup mechanism to our runs.start mutation in src/server/trpc/routers/code-analysis.ts. Before creating a new run, it now checks for and auto-cleans any runs stuck in an active status for more than 10 minutes. This provides a safety net against transient issues and ensures a clean slate for new operations.
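The stale-run check could look roughly like the sketch below. Everything schema-related is an assumption: the status values in ACTIVE_STATUSES, the createdAt field (chosen here because the stuck run had no startedAt), and the helper names findStaleRuns and staleRunWhere are all hypothetical.

```typescript
// Hypothetical status values and field names; adjust to the real schema.
const ACTIVE_STATUSES: string[] = ["pending", "analyzing"];
const STALE_AFTER_MS = 10 * 60 * 1000; // 10 minutes, as described above

interface RunRecord {
  id: string;
  status: string;
  createdAt: Date;
}

// Returns the runs that have sat in an active status for longer than the threshold.
function findStaleRuns(
  runs: RunRecord[],
  now: Date,
  staleAfterMs: number = STALE_AFTER_MS,
): RunRecord[] {
  const cutoff = now.getTime() - staleAfterMs;
  return runs.filter(
    (run) => ACTIVE_STATUSES.indexOf(run.status) !== -1 && run.createdAt.getTime() < cutoff,
  );
}

// Builds the equivalent filter object, shaped like a Prisma `where` clause.
function staleRunWhere(now: Date, staleAfterMs: number = STALE_AFTER_MS) {
  return {
    status: { in: ACTIVE_STATUSES.slice() },
    createdAt: { lt: new Date(now.getTime() - staleAfterMs) },
  };
}
```

Inside the mutation, a filter like staleRunWhere(new Date()) could be passed as the where argument of prisma.codeAnalysisRun.updateMany together with data that sets status to failed, assuming the model exposes those fields.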
Ensuring Workflow State Integrity
The pattern repeated itself with our workflows. Workflow f89f7f72 was marked completed, but its last step, "Implementation Prompts," was still pending. This inconsistent state meant the "Resume" button wouldn't appear, blocking the user from continuing.
The Root Cause & Fix: Identical to the code analysis issue, the catch block in src/app/api/v1/events/workflows/[id]/route.ts was missing the crucial database update. We applied the same fix: explicitly marking the workflow as failed in the database on error, allowing for clearer error states and easier recovery (or manual intervention).
After the fix, we manually set workflow f89f7f72 to paused, making the "Resume" button available again for the user.
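An inconsistency like this one (workflow completed, last step pending) is cheap to detect with a small invariant check. The sketch below assumes a simplified, hypothetical data shape; WorkflowSnapshot, StepStatus, and unfinishedStepsOfCompleted are illustrative names, not the application's real types.

```typescript
// Hypothetical status sets; the real schema may differ.
type StepStatus = "pending" | "running" | "completed" | "failed";

interface WorkflowSnapshot {
  status: "pending" | "running" | "paused" | "completed" | "failed";
  steps: { name: string; status: StepStatus }[];
}

// Lists the steps that contradict a "completed" workflow status.
// An empty result means the snapshot is consistent (or the workflow is not completed).
function unfinishedStepsOfCompleted(workflow: WorkflowSnapshot): string[] {
  if (workflow.status !== "completed") {
    return [];
  }
  return workflow.steps
    .filter((step) => step.status !== "completed")
    .map((step) => step.name);
}
```

Run against a snapshot shaped like workflow f89f7f72, the check would report "Implementation Prompts" as the step blocking a truthful completed status.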
Lessons Learned from the Trenches
These sessions are as much about fixing bugs as they are about solidifying development best practices. The "pain points" often become the most valuable lessons.
- Befriend Your Database Directly: While ORMs like Prisma are fantastic, there are times when you just need to talk to your database in its native tongue. I initially struggled with prisma db execute, which seemed to expect model names rather than raw table names. Switching to psql directly (PGPASSWORD=nyxcore_dev psql -U nyxcore -h localhost -d nyxcore) was faster and more robust for specific data manipulations. Don't be afraid to drop down to raw SQL when necessary; it's a powerful tool in your debugging arsenal.
- Mind Your Case, Quote Your Columns: A subtle but important detail when using raw SQL with Prisma-generated schemas: Prisma models typically use camelCase field names, and unless a field is remapped (for example to snake_case with @map), the database column keeps that exact name. PostgreSQL folds unquoted identifiers to lowercase, so camelCase columns must be double-quoted in raw SQL. For example, "workflowId" or "stepType" are necessary in raw SQL queries if your database schema preserves the camelCase. Always double-check your column naming conventions!
- Robust Error Handling is Non-Negotiable for Async Operations (Especially with SSE): This was the recurring theme and the most critical takeaway. For any long-running, asynchronous process that updates database state and communicates via SSE:
  - Always ensure your catch blocks update the database with a failed status and an error message.
  - Relying solely on SSE events for error communication is insufficient because the client might disconnect, or the server might crash mid-way.
  - The database is the single source of truth; it must reflect the actual state, even in failure scenarios. This prevents orphaned jobs and inconsistent UI states, which are notoriously difficult for users to recover from.
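To make the identifier-quoting lesson concrete, here is a small sketch of a helper that double-quotes PostgreSQL identifiers when building raw SQL by hand. The helper names quoteIdent and buildUpdateMaxTokens are hypothetical, and so is the table name used in the usage note; only the "maxTokens" column name comes from the session above.

```typescript
// Quotes a PostgreSQL identifier so its case is preserved; embedded quotes are doubled.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Builds a parameterized UPDATE statement. Values travel as $1 / $2 placeholders,
// never through string concatenation; only identifiers are interpolated.
function buildUpdateMaxTokens(table: string, idColumn: string): string {
  return `UPDATE ${quoteIdent(table)} SET ${quoteIdent("maxTokens")} = $1 WHERE ${quoteIdent(idColumn)} = $2`;
}
```

For a hypothetical table named workflow_steps, buildUpdateMaxTokens("workflow_steps", "id") yields UPDATE "workflow_steps" SET "maxTokens" = $1 WHERE "id" = $2. Without the quotes, PostgreSQL would look for a column named maxtokens and fail.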
What's Next?
With these fixes in place, our application is more stable and user-friendly. However, the journey continues:
- Users can now Resume workflow f89f7f72 and Retry the maxTokens-constrained steps with confidence.
- We'll consider bumping default maxTokens in src/lib/constants.ts for templates prone to long outputs, making this a proactive rather than reactive fix.
- The proactive orphan-cleanup logic added to code-analysis.ts is a pattern we should extend to other long-running processes, like workflow start mutations, to further harden our system.
- And of course, there's always a pre-existing type error or two waiting for attention, like the Badge variant mismatch on our discussions page!
It was a productive session, reinforcing the importance of both user-facing polish and rock-solid backend reliability. Happy coding!
{
"thingsDone": [
"Implemented Expand/Collapse All for workflow steps on detail page",
"Added DOM anchors for 'Jump to Step' functionality within workflows",
"Created sticky navigation bar with status-colored step pills and truncated labels",
"Fixed workflow maxTokens truncation by manually bumping DB limits for affected steps",
"Resolved orphaned code analysis runs by adding DB status updates in SSE catch blocks",
"Implemented proactive orphan cleanup for code analysis runs stuck >10 minutes",
"Fixed inconsistent workflow status by adding DB status updates in SSE catch blocks",
"Manually corrected specific workflow and code analysis run states in DB for user recovery"
],
"pains": [
"Struggled with Prisma CLI for raw SQL queries, resorted to psql directly for efficiency",
"Discovered Prisma camelCase column names require double-quoting in raw SQL for PostgreSQL",
"Identified a recurring bug pattern across multiple SSE handlers where catch blocks failed to update DB state, leading to orphaned jobs"
],
"successes": [
"Significant UI/UX improvement for navigating complex workflow detail pages",
"Enhanced backend stability by preventing and cleaning up orphaned background jobs",
"Improved data consistency across asynchronous operations and real-time updates",
"Established a critical pattern for robust error handling in SSE routes with database persistence",
"Introduced proactive cleanup mechanisms for long-running processes"
],
"techStack": [
"Next.js",
"React",
"Prisma",
"PostgreSQL",
"Server-Sent Events (SSE)",
"TypeScript",
"Tailwind CSS",
"Lucide React"
]
}