Late-Night Code Sprints: Taming the Workflow Beast with UX Polish and Critical Bug Fixes

It was late, the kind of late where the only sounds are the hum of your machine and the quiet tapping of keys. Another development session was wrapping up, and as is tradition, I captured the "Letter to Myself" – a raw, unfiltered log of triumphs, pains, and the immediate next steps. Today, I'm pulling back the curtain on that internal memo to share some insights from a session dedicated to both delighting users and wrestling with some gnarly backend ghosts.

Our goal was clear: elevate the user experience on our workflow detail page and, crucially, stamp out a couple of critical bugs that were causing silent failures in the background.

Elevating the Workflow Experience: Navigation Reimagined

Navigating long, multi-step workflows can be a headache. Scrolling endlessly to find a specific step, or to get a bird's-eye view, is a frustrating experience. Our primary focus for the frontend was to inject some much-needed fluidity into the workflow detail page.

We tackled this head-on by introducing a sticky navigation bar positioned strategically between the settings panel and the workflow pipeline itself. This bar, styled with sticky top-0 z-10 bg-nyx-surface, now serves as a dynamic control center:

Global Expand/Collapse: Two prominent buttons, powered by ChevronsDownUp and ChevronsUpDown from lucide-react, allow users to instantly expand or collapse all workflow steps. This is a game-changer for quickly scanning a workflow or diving deep into details.
Jump Marks for Steps: We added id={`step-${step.id}`} DOM anchors to each step wrapper. The sticky nav features a horizontally scrollable row of "step pills." Each pill displays a truncated label (max 12 chars) and a status-colored dot, letting users quickly identify and jump to any step in the workflow. No more endless scrolling!
Intelligent Auto-Expansion: We refined our isStepExpanded() logic. Previously, only paused or review steps would auto-expand if they were the currentStep. Now, any pending step that's currently active will automatically expand, providing immediate context to the user about where the workflow is progressing.
"Resume from here" for Terminal Workflows: For workflows that have completed, failed, or are paused, but aren't awaiting review, we added a "Resume from here" button. This smart little helper chains a retry mutation with a resume mutation, effectively resetting the workflow from that step and immediately restarting it. It's a huge quality-of-life improvement for iterative development.
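
The navigation behaviors above can be sketched as a few pure helpers. This is a minimal illustration rather than the page's actual code: the StepView shape, the 'review' status value, the manualOverrides lookup, and the choice to count the ellipsis toward the 12-character limit are all assumptions made for the example. Only the isStepExpanded name, the step-${step.id} anchor format, and the 12-character cap come from the session notes.

```typescript
// Hypothetical step shape; the real model likely carries more fields.
type StepStatus = 'pending' | 'paused' | 'review' | 'completed' | 'failed';

interface StepView {
  id: string;
  label: string;
  status: StepStatus;
}

const MAX_PILL_CHARS = 12;

// Truncate a pill label to at most 12 characters.
// Assumption: the ellipsis counts toward the limit.
function pillLabel(label: string): string {
  return label.length <= MAX_PILL_CHARS
    ? label
    : label.slice(0, MAX_PILL_CHARS - 1) + '…';
}

// DOM id used as the jump target for a step wrapper.
function stepAnchorId(step: StepView): string {
  return `step-${step.id}`;
}

// A manual toggle (hypothetical) wins; otherwise the current step auto-expands
// when it is paused, in review, or pending.
function isStepExpanded(
  step: StepView,
  currentStepId: string | null,
  manualOverrides: { readonly [stepId: string]: boolean | undefined },
): boolean {
  const manual = manualOverrides[step.id];
  if (manual !== undefined) return manual;
  if (step.id !== currentStepId) return false;
  return step.status === 'paused' || step.status === 'review' || step.status === 'pending';
}
```

A pill's click handler could then call document.getElementById(stepAnchorId(step))?.scrollIntoView({ behavior: 'smooth' }) to jump to the matching step.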

Battling the Ghosts in the Machine: Critical Bug Hunts

While UX polish is rewarding, some bugs demand immediate attention. This session was also about confronting two critical backend issues that were silently undermining the reliability of our system.

The Sneaky Retry Bug: When "Done" Isn't Done

This was a particularly insidious bug that surfaced in multi-generate steps. When a user would attempt to "retry" a step that had previously generated multiple alternatives, the engine would see stale alternatives and selectedIndex data. Instead of re-executing, it would simply use the old, selected output, mark the step as complete, and the workflow would prematurely finish. The user would see a "completed" workflow that, in reality, had skipped a crucial execution.

The Root Cause: Our retry mutation was not comprehensively clearing the state for multi-generate steps. It was resetting some fields, but alternatives and selectedIndex were left untouched.
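
To make the failure mode concrete, here is a minimal sketch of how a partial reset can short-circuit execution. The StepState shape and the resolveStep function are hypothetical stand-ins for the engine's real logic, which these notes do not show. The point is only that leftover alternatives and selectedIndex values are enough to make a pending step look already answered.

```typescript
// Hypothetical, simplified view of a multi-generate step.
interface StepState {
  status: 'pending' | 'completed' | 'failed';
  output: string | null;
  alternatives: string[] | null;
  selectedIndex: number | null;
}

type Resolution = { action: 'reuse'; output: string } | { action: 'execute' };

// Stand-in for the engine's decision: a step with a stored selection is
// treated as already answered, regardless of its status.
function resolveStep(step: StepState): Resolution {
  if (step.alternatives !== null && step.selectedIndex !== null) {
    const chosen = step.alternatives[step.selectedIndex];
    if (chosen !== undefined) return { action: 'reuse', output: chosen };
  }
  return { action: 'execute' };
}

// The buggy reset: status and output are cleared, the selection is not.
function partialReset(step: StepState): StepState {
  return { ...step, status: 'pending', output: null };
}

const finished: StepState = {
  status: 'completed',
  output: 'draft B',
  alternatives: ['draft A', 'draft B'],
  selectedIndex: 1,
};

// The retried step is pending again, yet the stale selection is reused.
const afterRetry = resolveStep(partialReset(finished));
```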

The Fix: We updated the retry mutation in src/server/trpc/routers/workflows.ts to explicitly set alternatives: Prisma.JsonNull and selectedIndex: null for both single-step and subsequent-steps resets.

typescript

// Simplified representation of the retry mutation update
// src/server/trpc/routers/workflows.ts
import { Prisma } from '@prisma/client'; // provides the Prisma.JsonNull sentinel
// ...
await prisma.workflow.update({
  where: { id: workflowId },
  data: {
    // ... other reset fields
    steps: {
      updateMany: {
        // The nested write is already scoped to this workflow's steps,
        // so only the step index needs filtering.
        where: {
          stepIndex: { gte: stepToRetryIndex },
        },
        data: {
          status: 'pending',
          output: null,
          alternatives: Prisma.JsonNull, // Crucial fix!
          selectedIndex: null,           // Crucial fix!
          // ... other fields to reset
        },
      },
    },
  },
});
// ...

Lesson Learned: When dealing with state-dependent operations like retries, always perform a thorough audit of all relevant state fields that might influence subsequent logic. A partial reset is often worse than no reset at all, as it creates misleading states.
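
One way to make that audit harder to skip is to keep every execution-derived field in a single, fully typed reset object. The sketch below illustrates the idea and is not the code in workflows.ts: the StepExecutionState interface and the resetForRetry helper are hypothetical. Because RETRY_RESET is typed as the complete interface rather than a partial one, adding a new execution field without deciding its reset value becomes a compile-time error.

```typescript
// Hypothetical: every field that a step execution writes.
interface StepExecutionState {
  status: 'pending' | 'running' | 'completed' | 'failed';
  output: string | null;
  alternatives: string[] | null;
  selectedIndex: number | null;
}

// Typed as the full interface on purpose: omitting a field fails to compile.
const RETRY_RESET: StepExecutionState = {
  status: 'pending',
  output: null,
  alternatives: null,
  selectedIndex: null,
};

// Returns a copy with all execution state cleared; other fields are preserved.
function resetForRetry<T extends StepExecutionState>(step: T): T {
  return { ...step, ...RETRY_RESET };
}
```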

Taming Orphaned Background Jobs: SSE Endpoint Resilience

We noticed a recurring pattern: background jobs (both for workflow execution and code analysis) getting stuck in an active status, even when their associated SSE connection had long since died or the server had crashed. These were "orphaned" processes, consuming resources and providing no meaningful updates.

The Root Cause: Our SSE route handlers (src/app/api/v1/events/code-analysis/[id]/route.ts and src/app/api/v1/events/workflows/[id]/route.ts) lacked robust error handling. While they might log an error, the catch blocks weren't updating the database to reflect a failed status for the respective run or workflow. This meant a crashed SSE connection left the DB believing the process was still analyzing or active.

The Fix: We systematically added prisma imports and await prisma.run.update(...) or await prisma.workflow.update(...) calls within the catch blocks of both SSE route handlers. Now, whenever the stream logic throws, including when a dropped connection surfaces as an error, the database state is immediately updated to failed instead of staying active. A hard server crash never reaches a catch block, so those cases still depend on cleanup logic that detects stuck records.

typescript

// Simplified example for an SSE route handler catch block
// src/app/api/v1/events/workflows/[id]/route.ts
// ...
try {
  // ... SSE stream logic
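} catch (error) {
  console.error(`SSE stream for workflow ${workflowId} failed:`, error);
  // Catch variables are typed `unknown` under strict TypeScript, so narrow before reading .message
  const errorMessage = error instanceof Error ? error.message : String(error);
  // CRITICAL: Update DB to mark workflow as failed
  await prisma.workflow.update({
    where: { id: workflowId },
    data: { status: 'failed', errorMessage },
  });
  // ... further error handling/cleanup
}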

Lesson Learned: For any long-running or streamed process, especially those involving external connections, robust error handling in catch blocks is non-negotiable. Always ensure that the system's internal state (e.g., database records) accurately reflects the real-world status of the operation, even in failure scenarios. Furthermore, consider adding proactive "janitor" processes (like the runs.start auto-cleanup for code analysis) that identify and clean up stuck entities.
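
A janitor pass of the kind mentioned above might start from a pure selection step like the following. This is a sketch under assumptions: the notes do not describe how the runs.start cleanup decides that a record is stuck, so the updatedAt heartbeat field, the JobRecord shape, and the ten-minute threshold are all hypothetical.

```typescript
type JobStatus = 'active' | 'analyzing' | 'completed' | 'failed';

// Hypothetical record shape shared by workflows and code-analysis runs.
interface JobRecord {
  id: string;
  status: JobStatus;
  updatedAt: Date; // assumed to be touched whenever the job makes progress
}

// Statuses that claim work is still in progress.
function isInFlight(status: JobStatus): boolean {
  return status === 'active' || status === 'analyzing';
}

// Returns ids of jobs that claim to be running but have been idle too long.
function findOrphanedJobs(
  jobs: readonly JobRecord[],
  now: Date,
  maxIdleMs: number,
): string[] {
  return jobs
    .filter((job) => isInFlight(job.status))
    .filter((job) => now.getTime() - job.updatedAt.getTime() > maxIdleMs)
    .map((job) => job.id);
}

// Illustrative threshold only; a real value depends on how long steps run.
const TEN_MINUTES_MS = 10 * 60 * 1000;
```

The returned ids could then be marked as failed in a single updateMany call before new work starts, which keeps the selection logic easy to test without a database.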

The Silent Token Ceiling: When LLMs Get Truncated

A subtle issue we caught was specific to a couple of "Extend & Improve" and "Improve" steps in a particular workflow (f89f7f72). Their output was consistently hitting the 8192 maxTokens limit exactly, often truncating mid-sentence. This meant valuable LLM output was being silently cut off.

The Fix: A simple but effective database-only bump: maxTokens for these specific steps was increased from 8192 to 16384.

Lesson Learned: Don't assume default LLM token limits will always suffice, especially for steps designed to generate extensive or detailed output. Monitor for signs of truncation and be prepared to adjust maxTokens based on observed behavior. In future, we might even consider dynamic token limits or smarter templates.
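
Monitoring for truncation can begin with a simple check on recorded token counts. The sketch below assumes each step run stores how many output tokens the model produced alongside the configured maxTokens. The StepRun shape, the outputTokens field, and the minHits threshold are hypothetical, and many LLM APIs also report a stop or finish reason that would be a more direct signal.

```typescript
// Hypothetical record of one LLM call made by a workflow step.
interface StepRun {
  stepId: string;
  outputTokens: number;
  maxTokens: number;
}

// A run that produced exactly as many tokens as allowed was likely cut off.
function hitTokenCeiling(run: StepRun): boolean {
  return run.outputTokens >= run.maxTokens;
}

// Returns ids of steps that hit the ceiling at least `minHits` times.
function findTruncationProneSteps(runs: readonly StepRun[], minHits = 2): string[] {
  const hits: Record<string, number> = {};
  for (const run of runs) {
    if (hitTokenCeiling(run)) {
      hits[run.stepId] = (hits[run.stepId] ?? 0) + 1;
    }
  }
  return Object.keys(hits).filter((stepId) => hits[stepId] >= minHits);
}
```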

Looking Ahead: The Never-Ending Journey

Even after a productive session, the backlog always has more to offer. Our immediate next steps include:

Revisiting Default maxTokens: We'll evaluate if other templates using 8192 maxTokens for potentially long-output steps need a similar bump.
Workflow Orphan Cleanup: Extending the proactive orphan-cleanup logic (like the one we have for code-analysis runs) to the workflow start mutation, ensuring workflows don't get stuck in limbo.
Minor Type Fix: Squashing a pre-existing Badge variant type error in our discussions page – a small detail, but important for type safety.

This session was a microcosm of development: a blend of improving the user-facing experience and diving deep into the backend to ensure reliability. Shipping these changes feels good, knowing our workflows are now more navigable, resilient, and accurate.
