nyxcore-systems

Taming the Workflow Beast: A Day of Clone Fixes, Fan-Out Wins, and Migration Maneuvers

Ever stared down a complex workflow engine, armed with a bug list and a migration plan? Join us as we recount a recent dev session – from fixing tricky cloning logic and boosting real-time feedback, to navigating tenant migrations and wrestling with stubborn Docker caches.

Workflow Engine, TypeScript, Prisma, Next.js, Docker, AI/LLM, System Design, Debugging, Data Migration

Sometimes, a development session feels like a rapid-fire series of small victories, punctuated by head-scratching moments and the sweet relief of a major breakthrough. Our most recent session was exactly that: we tackled critical workflow engine improvements, a significant data migration, and a handful of persistent operational quirks. The goal was ambitious: get our BRbase project fully operational in a new tenant (clarait), with a robust cloning mechanism, real-time fan-out progress, and a complex group workflow executed end to end.

Spoiler alert: We got there. But the path was paved with classic developer pitfalls and some clever workarounds. Let's dive in.

The Mission: A More Robust Workflow Experience

Our workflow engine is at the heart of our system, orchestrating complex sequences of AI-driven tasks. We had a few key areas demanding attention:

  1. Smarter Workflow Cloning: When duplicating a workflow, users were getting a messy copy, including auto-generated implementation-prompt and consistency-check steps that should only appear during execution. Plus, the order of steps often got scrambled.
  2. Real-time Fan-Out Progress: For workflows that fan out to process multiple items concurrently, users lacked visibility into the progress of individual items. Imagine a progress bar that only updates at 0% and 100% – not ideal for long-running tasks.
  3. Project Migration: Our BRbase project, a critical internal initiative, needed to be migrated from the nyx tenant to clarait, consolidating data and ensuring correct tenant isolation.
  4. Operational Stability: Ironing out deployment kinks, LLM provider issues, and obscure workflow resume bugs.

The Solutions: Under the Hood

1. The Clean Clone: Re-indexing and Filtering

The workflows.duplicate mutation in our src/server/trpc/routers/workflows.ts was the culprit. When a user wanted to clone a workflow, they'd get an exact replica, including steps that our engine adds dynamically during execution (like implementation-prompt for each fan-out item or a final consistency-check). This made cloned workflows cumbersome to edit.

The fix involved two main parts:

  • Filtering: We now filter out implementation-prompt and consistency-check step types during the cloning process.
  • Re-indexing: After filtering, we re-index the order property of the remaining steps, ensuring a clean, sequential order for the new workflow.

Conceptually, it looks something like this:

typescript
// src/server/trpc/routers/workflows.ts (simplified)
// Step types the engine generates at execution time; these must not be cloned.
const AUTO_GENERATED_STEP_TYPES = ['implementation-prompt', 'consistency-check'];

async function duplicateWorkflow({ ctx, input }) {
  // ... fetch originalWorkflow (with its steps) and create newWorkflow

  const newSteps = originalWorkflow.steps
    .filter(step => !AUTO_GENERATED_STEP_TYPES.includes(step.type)) // Drop auto-generated steps
    .sort((a, b) => a.order - b.order) // Preserve the original relative order before re-indexing
    .map((step, idx) => ({
      ...step,
      id: generateNewId(), // Assign new IDs for cloned steps
      workflowId: newWorkflow.id, // Link to the new workflow
      order: idx, // Re-index contiguously, starting at 0
      // ... other properties copied
    }));

  await ctx.prisma.workflowStep.createMany({ data: newSteps });
  // ...
}

With this change, a cloned workflow contains only the steps the user authored, numbered contiguously from 0, so it can be edited without first cleaning up execution leftovers.

2. Live Fan-Out Progress: Incremental Updates

Long-running fan-out operations, where a single step processes multiple items (e.g., generating 13 distinct implementation prompts), desperately needed better feedback. We tackled this in src/server/services/workflow-engine.ts:

  • Incremental subOutputs Persistence: Instead of waiting for all fan-out items to complete before saving their results, we now persist subOutputs (the results of each individual fan-out item) after each item is processed using prisma.workflowStep.update. This means if a workflow pauses or fails mid-fan-out, we don't lose the progress made.
  • Real-time Event Yielding: After each item is processed and saved, the engine yields a fan_out_progress event. This event carries crucial information: fanOutIndex (current item), fanOutTotal (total items), and fanOutHeading (a description of the current item being processed).

The UI (src/app/(dashboard)/dashboard/workflows/[id]/page.tsx) was updated to consume these events, displaying a dynamic progress bar for these previously opaque steps. We also fixed an oversight where implementation-prompt steps weren't showing a progress bar at all, by explicitly adding step.name === "implementation-prompt" to the UI condition.

typescript
// src/server/services/workflow-engine.ts (simplified conceptual snippet)
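// Declared as an async generator (function*) so it can yield progress events to the caller.
async function* executeFanOutStep(stepId: string) {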
  // ... fetch step, prepare items
  for (let i = 0; i < totalItems; i++) {
    const currentItem = items[i];
    const itemResult = await processItemWithLLM(currentItem);

    // Update subOutputs incrementally
    // (This is a simplification, actual logic involves appending to an array field)
    await prisma.workflowStep.update({
      where: { id: stepId },
      data: {
        subOutputs: {
          push: { index: i, output: itemResult } // Example: push new result
        }
      },
    });

    // Yield real-time progress event
    yield {
      type: 'fan_out_progress',
      payload: {
        fanOutIndex: i + 1, // 1-based index
        fanOutTotal: totalItems,
        fanOutHeading: currentItem.description,
      },
    };
  }
  // ... mark step as complete
}
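
On the consuming side, the events above can be folded into the state a progress bar needs. The following is a minimal sketch that assumes only the event shape shown in the snippet; `ProgressState`, `initialProgress`, and `applyProgress` are hypothetical names for illustration, not the code in our dashboard page.

```typescript
// Hypothetical client-side helper. Only the event fields come from the engine snippet above.
interface FanOutProgressEvent {
  type: "fan_out_progress";
  payload: { fanOutIndex: number; fanOutTotal: number; fanOutHeading: string };
}

interface ProgressState {
  completed: number; // items finished so far
  total: number;     // total fan-out items
  heading: string;   // description of the most recently reported item
  percent: number;   // 0..100, rounded for display
}

const initialProgress: ProgressState = { completed: 0, total: 0, heading: "", percent: 0 };

// Fold one event into the state. Stale or out-of-order events are ignored,
// so the bar never moves backwards.
function applyProgress(state: ProgressState, event: FanOutProgressEvent): ProgressState {
  const { fanOutIndex, fanOutTotal, fanOutHeading } = event.payload;
  if (fanOutTotal <= 0 || fanOutIndex < state.completed) return state;
  const completed = Math.min(fanOutIndex, fanOutTotal);
  return {
    completed,
    total: fanOutTotal,
    heading: fanOutHeading,
    percent: Math.round((completed / fanOutTotal) * 100),
  };
}
```

Keeping the fold pure makes it easy to unit test and to plug into whatever state mechanism the page already uses.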

3. BRbase Tenant Migration: A Data Juggling Act

Moving the BRbase project from our nyx tenant (b983cca6) to clarait (b5b898be) was more than just updating a tenantId field. It involved:

  • Table Updates: Modifying tenantId across numerous related tables: projects, workflows (15 of them!), repositories, project_notes (8), project_syncs (5), consolidations (3), and workflow_insights (a whopping 163 records).
  • Repository Conflict Resolution: clarait already had an empty BRbase repository (f356f796), which conflicted with the (tenantId, owner, repo) unique constraint when trying to move the real BRbase repo (from nyx, 7e746227, containing 234 patterns). The workaround was to delete the empty clarait repo first, then update the tenantId of the nyx repo.
  • Project De-duplication: A user had also created an empty BRbase project (da6fa199) in clarait, which also needed to be removed.

This meticulous data manipulation ensured that the BRbase project, with all its associated data and critical code patterns, was correctly established in its new home.
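
The ordering constraint in the repository step (delete the empty placeholder first, then move the real record) can be illustrated without a database. This is an in-memory sketch under our own assumptions: `Repo`, `patternCount`, `assertUnique`, and `moveRepo` are hypothetical names, and the real migration ran against the actual tables rather than through code like this.

```typescript
// In-memory illustration only; names and fields here are hypothetical.
interface Repo {
  id: string;
  tenantId: string;
  owner: string;
  repo: string;
  patternCount: number;
}

// Mirrors the (tenantId, owner, repo) unique constraint.
const uniqueKey = (r: Repo): string => `${r.tenantId}/${r.owner}/${r.repo}`;

function assertUnique(repos: Repo[]): void {
  const seen = new Set<string>();
  for (const r of repos) {
    const key = uniqueKey(r);
    if (seen.has(key)) throw new Error(`unique constraint violated: ${key}`);
    seen.add(key);
  }
}

// Move a repo to another tenant. An empty placeholder with the same owner/repo
// in the target tenant is deleted first; a non-empty one aborts the move.
function moveRepo(repos: Repo[], repoId: string, targetTenantId: string): Repo[] {
  const source = repos.find((r) => r.id === repoId);
  if (!source) throw new Error(`repo ${repoId} not found`);

  const conflict = repos.find(
    (r) =>
      r.id !== repoId &&
      r.tenantId === targetTenantId &&
      r.owner === source.owner &&
      r.repo === source.repo,
  );
  if (conflict && conflict.patternCount > 0) {
    throw new Error(`target tenant already has a non-empty repo ${conflict.id}`);
  }

  const result = repos
    .filter((r) => r.id !== conflict?.id) // step 1: drop the empty placeholder
    .map((r) => (r.id === repoId ? { ...r, tenantId: targetTenantId } : r)); // step 2: re-home
  assertUnique(result);
  return result;
}
```

Refusing to delete a non-empty conflicting record is a deliberate choice in this sketch: it turns a silent data loss into an explicit failure that someone has to look at.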

The Big Win: A Successful Workflow Run

With all the fixes and migrations in place, the moment of truth arrived: running a complex workflow (051fe560) in the clarait tenant. This workflow was designed to process 13 distinct items, perform a synthesis step, run a consistency check, and then generate 13 fan-out implementation-prompt steps, amounting to over 115,000 characters of LLM output.

Crucially, we also set appropriate personas (NyxCore, Athena, Cael, Harmonia, Nemesis, Aristaeus, Ipcha Mistabra, Morgan, Aletheia) for each step, guiding the AI's behavior. The workflow ran to completion, delivering all expected outputs and passing its consistency checks. We also verified that the BRbase code patterns generated during this run adhered to our internal standards (e.g., ~/ imports, Clerk auth, Jest tests, feature-based structure, tRPC patterns).

It's immensely satisfying to see such a complex, multi-stage, AI-driven process run to completion after a focused development effort, even if it took a few manual interventions along the way.

Lessons from the Trenches: The "Pain Log"

Not everything went smoothly, of course. Here are some classic pitfalls and crucial lessons learned:

  1. The "Forgot to Push" Debacle:

    • The Pain: Tried to deploy after committing locally, but forgot to git push. The deploy pipeline reported "Already up to date," but the new code wasn't live.
    • The Lesson: Always, always git push before deploying. A quick ssh ... git log --oneline -3 on the deployment target can verify the commit history. This is a classic, but one that still bites.
  2. Docker Cache Stubbornness:

    • The Pain: Used docker build --no-cache, but the build still seemed to use old code.
    • The Lesson: --no-cache tells the build not to reuse cached layers, but it does not remove the build cache that already exists. In our case, running docker builder prune -af before the build --no-cache command was what got the new code picked up, so it is now our first move when local changes aren't showing up in the image.
  3. The Elusive Workflow Resume Bug:

    • The Pain: Attempted to resume a workflow where a synthesis step (order 18) was pending, but auto-generated steps (order 19+) had already completed. The engine marked synthesis as completed with a NULL output, skipping the LLM call entirely.
    • The Lesson: This points to a deeper interaction bug between our step type logic and the resume mechanism. The engine's for-loop should execute pending steps regardless of subsequent completed steps.
    • Workaround: Manually reset the synthesis step to pending, switch the provider (e.g., to google), set the workflow to paused, and then resume from the UI. This is a critical bug for future investigation.
  4. Anthropic API Credit Crunch:

    • The Pain: All Anthropic API calls (for consistency checks, synthesis, digests) started failing with "credit balance is too low."
    • The Lesson: Robust fallback chains are essential for critical LLM-powered features. Our consistency check already had a fallback (Anthropic → Google → OpenAI), which saved us there. For individual steps, we had to manually set the provider to google/gemini-2.5-pro. This highlights the importance of monitoring API credits and implementing comprehensive provider fallbacks.
  5. Tenant Migration Unique Constraint:

    • The Pain: When moving the BRbase repository from nyx to clarait, we hit a unique constraint on (tenantId, owner, repo).
    • The Lesson: Data migrations are rarely straightforward. Always anticipate unique constraints and pre-existing, potentially empty, duplicate records in the target environment. The workaround (deleting the empty clarait repo first) was effective.
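
The behaviour we want from the resume path in item 3 can be stated as a small invariant: every step that has not genuinely completed gets executed, in order, no matter what comes after it. The sketch below illustrates that invariant only. `Step`, `resumeWorkflow`, and the injected `runStep` are hypothetical, it is not the engine's actual loop, and treating a completed step with a NULL output as unfinished is our assumption here.

```typescript
type StepStatus = "pending" | "running" | "completed" | "failed";

interface Step {
  order: number;
  name: string;
  status: StepStatus;
  output: string | null;
}

// Resume by walking steps in order and running anything not yet done.
// A completed step later in the list never causes an earlier pending step to be skipped.
// Assumption: "completed" with a null output counts as not done and is re-run.
async function resumeWorkflow(
  steps: Step[],
  runStep: (step: Step) => Promise<string>, // injected executor, e.g. the LLM call
): Promise<Step[]> {
  const ordered = [...steps].sort((a, b) => a.order - b.order);
  const result: Step[] = [];
  for (const step of ordered) {
    if (step.status === "completed" && step.output !== null) {
      result.push(step); // keep the existing output untouched
      continue;
    }
    const output = await runStep(step);
    result.push({ ...step, status: "completed", output });
  }
  return result;
}
```

With the scenario from the pain log (synthesis pending at order 18, auto-generated steps at 19+ already completed), this loop calls `runStep` exactly once, for synthesis, and leaves the later outputs as they were.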

What's Next?

Our work is never truly done. Immediate next steps include:

  1. Fixing the Synthesis Resume Bug: This is a high-priority engine bug that needs a proper root cause analysis and fix.
  2. Expanding Provider Fallbacks: Add robust fallback chains to all review steps, not just consistency checks.
  3. Codebase Housekeeping: Resolve outstanding merge conflicts and prepare for future feature development.
  4. API Credit Management: Top up Anthropic credits or update default providers to ensure uninterrupted service.
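
For the fallback work in item 2, the core of a provider chain is small. The following is a minimal sketch, assuming each provider can be wrapped behind a single `complete` function; `Provider`, `FallbackResult`, and `completeWithFallback` are hypothetical names, and the sketch makes no real SDK calls.

```typescript
interface Provider {
  name: string;
  complete: (prompt: string) => Promise<string>;
}

interface FallbackResult {
  provider: string;   // the provider that finally answered
  output: string;
  failures: string[]; // "name: message" for each provider that failed before it
}

// Try providers in order and return the first successful completion.
// Errors are collected so the caller can log why earlier providers were skipped.
async function completeWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<FallbackResult> {
  const failures: string[] = [];
  for (const p of providers) {
    try {
      const output = await p.complete(prompt);
      return { provider: p.name, output, failures };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${p.name}: ${message}`);
    }
  }
  throw new Error(`all providers failed: ${failures.join("; ")}`);
}
```

Passing the chain in as data means the order used for consistency checks (Anthropic → Google → OpenAI) could be reused for other step types, and returning `failures` keeps a depleted-credit error visible instead of swallowing it.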

This session was a microcosm of daily development: a blend of meticulous coding, strategic data management, frantic debugging, and the ultimate satisfaction of seeing a complex system perform as intended. It reinforces the idea that building robust, intelligent systems is an iterative process, constantly refined by lessons learned in the trenches.

