Taming the Workflow Beast: A Day of Clone Fixes, Fan-Out Wins, and Migration Maneuvers
Ever stared down a complex workflow engine, armed with a bug list and a migration plan? Join us as we recount a recent dev session – from fixing tricky cloning logic and boosting real-time feedback, to navigating tenant migrations and wrestling with stubborn Docker caches.
Sometimes, a development session feels like a rapid-fire series of small victories, punctuated by head-scratching moments and the sweet relief of a major breakthrough. This past week, our team embarked on just such a journey, tackling critical workflow engine improvements, a significant data migration, and a handful of persistent operational quirks. The goal was ambitious: get our BRbase project fully operational in a new tenant (clarait), with a robust cloning mechanism, real-time fan-out progress, and a successfully executed, complex group workflow.
Spoiler alert: We got there. But the path was paved with classic developer pitfalls and some clever workarounds. Let's dive in.
The Mission: A More Robust Workflow Experience
Our workflow engine is at the heart of our system, orchestrating complex sequences of AI-driven tasks. We had a few key areas demanding attention:
- Smarter Workflow Cloning: When duplicating a workflow, users were getting a messy copy that included auto-generated implementation-prompt and consistency-check steps, which should only appear during execution. On top of that, the order of steps often got scrambled.
- Real-time Fan-Out Progress: For workflows that fan out to process multiple items concurrently, users lacked visibility into the progress of individual items. Imagine a progress bar that only updates at 0% and 100%: not ideal for long-running tasks.
- Project Migration: Our BRbase project, a critical internal initiative, needed to be migrated from the nyx tenant to clarait, consolidating data and ensuring correct tenant isolation.
- Operational Stability: Ironing out deployment kinks, LLM provider issues, and obscure workflow resume bugs.
The Solutions: Under the Hood
1. The Clean Clone: Re-indexing and Filtering
The workflows.duplicate mutation in our src/server/trpc/routers/workflows.ts was the culprit. When a user wanted to clone a workflow, they'd get an exact replica, including steps that our engine adds dynamically during execution (like implementation-prompt for each fan-out item or a final consistency-check). This made cloned workflows cumbersome to edit.
The fix involved two main parts:
- Filtering: We now filter out implementation-prompt and consistency-check step types during the cloning process.
- Re-indexing: After filtering, we re-index the order property of the remaining steps, ensuring a clean, sequential order for the new workflow.
Conceptually, it looks something like this:
// src/server/trpc/routers/workflows.ts (simplified)
async duplicateWorkflow({ ctx, input }) {
  // ... fetch original workflow and steps
  const newSteps = originalWorkflow.steps
    // Filter out steps the engine generates during execution
    .filter(step => !['implementation-prompt', 'consistency-check'].includes(step.type))
    // Keep the original relative order before re-indexing
    .sort((a, b) => a.order - b.order)
    .map((step, idx) => ({
      ...step,
      id: generateNewId(),        // Assign new IDs for cloned steps
      workflowId: newWorkflow.id, // Link to the new workflow
      order: idx,                 // Re-index for a clean slate
      // ... other properties copied
    }));
  await ctx.prisma.workflowStep.createMany({ data: newSteps });
  // ...
}
This simple change significantly improved the usability of the cloning feature.
2. Live Fan-Out Progress: Incremental Updates
Long-running fan-out operations, where a single step processes multiple items (e.g., generating 13 distinct implementation prompts), desperately needed better feedback. We tackled this in src/server/services/workflow-engine.ts:
- Incremental subOutputs Persistence: Instead of waiting for all fan-out items to complete before saving their results, we now persist subOutputs (the results of each individual fan-out item) after each item is processed, using prisma.workflowStep.update. This means if a workflow pauses or fails mid-fan-out, we don't lose the progress made.
- Real-time Event Yielding: After each item is processed and saved, the engine yields a fan_out_progress event. This event carries crucial information: fanOutIndex (current item), fanOutTotal (total items), and fanOutHeading (a description of the current item being processed).
The UI (src/app/(dashboard)/dashboard/workflows/[id]/page.tsx) was updated to consume these events, displaying a dynamic progress bar for these previously opaque steps. We also fixed an oversight where implementation-prompt steps weren't showing a progress bar at all, by explicitly adding step.name === "implementation-prompt" to the UI condition.
// src/server/services/workflow-engine.ts (simplified conceptual snippet)
// Declared as an async generator method (note the `*`) so that `yield` is valid in the body.
async *executeFanOutStep(stepId: string) {
  // ... fetch step, prepare items
  for (let i = 0; i < totalItems; i++) {
    const currentItem = items[i];
    const itemResult = await processItemWithLLM(currentItem);
    // Update subOutputs incrementally
    // (This is a simplification, actual logic involves appending to an array field)
    await prisma.workflowStep.update({
      where: { id: stepId },
      data: {
        subOutputs: {
          push: { index: i, output: itemResult } // Example: push new result
        }
      },
    });
    // Yield real-time progress event
    yield {
      type: 'fan_out_progress',
      payload: {
        fanOutIndex: i + 1, // 1-based index
        fanOutTotal: totalItems,
        fanOutHeading: currentItem.description,
      },
    };
  }
  // ... mark step as complete
}
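
On the consuming side, turning a fan_out_progress event into something a progress bar can render is a small pure function. The sketch below is not the code in page.tsx: the event shape follows the fields named above, while ProgressView and toProgressView are hypothetical names introduced for illustration.

```typescript
// Event shape based on the fields described in this post.
interface FanOutProgressEvent {
  type: "fan_out_progress";
  payload: {
    fanOutIndex: number; // 1-based index of the item that just finished
    fanOutTotal: number; // total number of fan-out items
    fanOutHeading: string; // description of the current item
  };
}

// Hypothetical view model for a progress bar component.
interface ProgressView {
  percent: number;
  label: string;
}

function toProgressView(event: FanOutProgressEvent): ProgressView {
  const { fanOutIndex, fanOutTotal, fanOutHeading } = event.payload;
  // Guard against division by zero when a step has no items.
  const raw = fanOutTotal > 0 ? Math.round((fanOutIndex / fanOutTotal) * 100) : 0;
  // Clamp so a malformed event can never push the bar outside 0..100.
  const percent = Math.min(100, Math.max(0, raw));
  return { percent, label: `${fanOutIndex}/${fanOutTotal}: ${fanOutHeading}` };
}
```

Keeping the calculation separate from the component makes it easy to unit test without rendering anything.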
3. BRbase Tenant Migration: A Data Juggling Act
Moving the BRbase project from our nyx tenant (b983cca6) to clarait (b5b898be) was more than just updating a tenantId field. It involved:
- Table Updates: Modifying tenantId across numerous related tables: projects, workflows (15 of them!), repositories, project_notes (8), project_syncs (5), consolidations (3), and workflow_insights (a whopping 163 records).
- Repository Conflict Resolution: clarait already had an empty BRbase repository (f356f796), which conflicted with the (tenantId, owner, repo) unique constraint when trying to move the real BRbase repo (from nyx, 7e746227, containing 234 patterns). The workaround was to delete the empty clarait repo first, then update the tenantId of the nyx repo.
- Project De-duplication: A user had also created an empty BRbase project (da6fa199) in clarait, which also needed to be removed.
This meticulous data manipulation ensured that the BRbase project, with all its associated data and critical code patterns, was correctly established in its new home.
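
To make the ordering concrete, here is a minimal in-memory model of the repository move. It is a sketch under stated assumptions, not the migration we ran: the Repo shape, the patternCount field, the moveRepoToTenant helper, and the "example-owner" value are all hypothetical, and only the IDs come from the migration described above.

```typescript
// In-memory stand-in for the repositories table (shape is an assumption).
interface Repo {
  id: string;
  tenantId: string;
  owner: string;
  repo: string;
  patternCount: number;
}

// Mirrors the (tenantId, owner, repo) unique constraint.
const uniqueKey = (r: Repo): string => `${r.tenantId}/${r.owner}/${r.repo}`;

function moveRepoToTenant(repos: Repo[], repoId: string, targetTenantId: string): Repo[] {
  const source = repos.find((r) => r.id === repoId);
  if (!source) throw new Error(`Repository ${repoId} not found`);

  const moved: Repo = { ...source, tenantId: targetTenantId };
  // Any other row with the same key would violate the unique constraint.
  const blocker = repos.find((r) => r.id !== repoId && uniqueKey(r) === uniqueKey(moved));

  // Only an empty placeholder may be removed automatically.
  if (blocker && blocker.patternCount > 0) {
    throw new Error(`Repository ${blocker.id} in the target tenant is not empty`);
  }

  // Step 1: drop the empty blocker. Step 2: re-point the real repository.
  return repos
    .filter((r) => r.id !== blocker?.id)
    .map((r) => (r.id === repoId ? moved : r));
}

const NYX = "b983cca6";
const CLARAIT = "b5b898be";
const reposBefore: Repo[] = [
  { id: "7e746227", tenantId: NYX, owner: "example-owner", repo: "BRbase", patternCount: 234 },
  { id: "f356f796", tenantId: CLARAIT, owner: "example-owner", repo: "BRbase", patternCount: 0 },
];
const reposAfter = moveRepoToTenant(reposBefore, "7e746227", CLARAIT);
```

Against a real database, the same delete-then-update order would presumably run inside a single transaction, so a failure partway through leaves both tenants untouched.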
The Big Win: A Successful Workflow Run
With all the fixes and migrations in place, the moment of truth arrived: running a complex workflow (051fe560) in the clarait tenant. This workflow was designed to process 13 distinct items, perform a synthesis step, run a consistency check, and then generate 13 fan-out implementation-prompt steps, amounting to over 115,000 characters of LLM output.
Crucially, we also set appropriate personas (NyxCore, Athena, Cael, Harmonia, Nemesis, Aristaeus, Ipcha Mistabra, Morgan, Aletheia) for each step, guiding the AI's behavior. The workflow ran to completion, delivering all expected outputs and passing its consistency checks. We also verified that the BRbase code patterns generated during this run adhered to our internal standards (e.g., ~/ imports, Clerk auth, Jest tests, feature-based structure, tRPC patterns).
It's immensely satisfying to see such a complex, multi-stage, AI-driven process execute flawlessly after a focused development effort.
Lessons from the Trenches: The "Pain Log"
Not everything went smoothly, of course. Here are some classic pitfalls and crucial lessons learned:
- The "Forgot to Push" Debacle:
  - The Pain: Tried to deploy after committing locally, but forgot to git push. The deploy pipeline reported "Already up to date," but the new code wasn't live.
  - The Lesson: Always, always git push before deploying. A quick ssh ... git log --oneline -3 on the deployment target can verify the commit history. This is a classic, but one that still bites.
- Docker Cache Stubbornness:
  - The Pain: Used docker build --no-cache, but the build still seemed to use old code.
  - The Lesson: --no-cache prevents layer caching during the build, but doesn't prune the builder cache. For a guaranteed fresh build, especially when local changes aren't picking up, run docker builder prune -af before your docker build --no-cache command.
- The Elusive Workflow Resume Bug:
  - The Pain: Attempted to resume a workflow where a synthesis step (order 18) was pending, but auto-generated steps (order 19+) had already completed. The engine marked synthesis as completed with a NULL output, skipping the LLM call entirely.
  - The Lesson: This points to a deeper interaction bug between our step type logic and the resume mechanism. The engine's for-loop should execute pending steps regardless of subsequent completed steps.
  - Workaround: Manually reset the synthesis step to pending, switch the provider (e.g., to google), set the workflow to paused, and then resume from the UI. This is a critical bug for future investigation.
- Anthropic API Credit Crunch:
  - The Pain: All Anthropic API calls (for consistency checks, synthesis, digests) started failing with "credit balance is too low."
  - The Lesson: Robust fallback chains are essential for critical LLM-powered features. Our consistency check already had a fallback (Anthropic → Google → OpenAI), which saved us there. For individual steps, we had to manually set the provider to google/gemini-2.5-pro. This highlights the importance of monitoring API credits and implementing comprehensive provider fallbacks.
- Tenant Migration Unique Constraint:
  - The Pain: When moving the BRbase repository from nyx to clarait, we hit a unique constraint on (tenantId, owner, repo).
  - The Lesson: Data migrations are rarely straightforward. Always anticipate unique constraints and pre-existing, potentially empty, duplicate records in the target environment. The workaround (deleting the empty clarait repo first) was effective.
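
For the resume bug in particular, the behavior we want can be stated as a small selection rule: a step runs if its own status says so, no matter what later steps look like. The sketch below is not the engine's actual loop (the root cause is still uninvestigated). The WorkflowStep shape, the "running" and "failed" status values, and selectStepsToRun are assumptions made for illustration.

```typescript
// Status values beyond "pending" and "completed" are assumptions.
type StepStatus = "pending" | "running" | "completed" | "failed";

interface WorkflowStep {
  id: string;
  order: number;
  status: StepStatus;
}

// Pick steps purely by their own status, in ascending order.
// A completed step at order 19 must not hide a pending step at order 18.
function selectStepsToRun(steps: WorkflowStep[]): WorkflowStep[] {
  return [...steps]
    .sort((a, b) => a.order - b.order)
    .filter((step) => step.status !== "completed");
}

// The situation from the pain log: synthesis pending, later step already done.
const stepsToResume = selectStepsToRun([
  { id: "implementation-prompt-1", order: 19, status: "completed" },
  { id: "synthesis", order: 18, status: "pending" },
  { id: "analysis", order: 17, status: "completed" },
]);
```

A rule like this also doubles as a regression test once the real fix lands: the scenario above should always select the synthesis step.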
What's Next?
Our work is never truly done. Immediate next steps include:
- Fixing the Synthesis Resume Bug: This is a high-priority engine bug that needs a proper root cause analysis and fix.
- Expanding Provider Fallbacks: Add robust fallback chains to all review steps, not just consistency checks.
- Codebase Housekeeping: Resolve outstanding merge conflicts and prepare for future feature development.
- API Credit Management: Top up Anthropic credits or update default providers to ensure uninterrupted service.
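
As a starting point for the fallback work, a provider chain can be expressed as a loop that tries each provider in turn and collects the errors. This is a minimal sketch, not our existing consistency-check fallback: the Provider interface, callWithFallback, and the stub providers are hypothetical and stand in for real SDK calls.

```typescript
// Hypothetical provider abstraction; real SDK calls would live behind `call`.
interface Provider {
  name: string;
  call: (prompt: string) => Promise<string>;
}

interface FallbackResult {
  provider: string;
  output: string;
}

// Try providers in order and return the first successful response.
async function callWithFallback(providers: Provider[], prompt: string): Promise<FallbackResult> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const output = await provider.call(prompt);
      return { provider: provider.name, output };
    } catch (err) {
      // Record the failure and move on to the next provider in the chain.
      errors.push(`${provider.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}

// Stubs reproducing the credit failure from this session.
const stubProviders: Provider[] = [
  {
    name: "anthropic",
    call: async () => {
      throw new Error("credit balance is too low");
    },
  },
  { name: "google", call: async (prompt) => `google answered: ${prompt}` },
  { name: "openai", call: async (prompt) => `openai answered: ${prompt}` },
];
```

Calling callWithFallback(stubProviders, "...") resolves with the google stub's answer, because the anthropic stub rejects first. Collecting every error message keeps the final failure debuggable when the whole chain is down.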
This session was a microcosm of daily development: a blend of meticulous coding, strategic data management, frantic debugging, and the ultimate satisfaction of seeing a complex system perform as intended. It reinforces the idea that building robust, intelligent systems is an iterative process, constantly refined by lessons learned in the trenches.
{
"thingsDone": [
"Fixed workflow cloning mutation to filter auto-generated steps and re-index order.",
"Implemented incremental fan-out progress persistence and real-time event yielding.",
"Updated UI to display fan-out progress for auto-generated steps.",
"Successfully migrated BRbase project and its data from nyx to clarait tenant.",
"Resolved repository and project unique constraint conflicts during migration.",
"Set appropriate personas for all workflow steps.",
"Successfully ran a complex 13-item workflow with synthesis, consistency check, and fan-out implementation prompts.",
"Verified BRbase pattern compliance for generated code."
],
"pains": [
"Forgot to git push before deploying, leading to stale code.",
"Docker build --no-cache failed to use new code due to unpruned builder cache.",
"Workflow engine bug: synthesis step skipped and marked complete with NULL output when auto-generated steps follow it.",
"Anthropic API calls failed due to depleted credit balance.",
"Tenant migration failed due to unique constraint on repository (tenantId, owner, repo)."
],
"successes": [
"Achieved desired workflow cloning behavior.",
"Provided real-time feedback for long-running fan-out operations.",
"Successfully moved a critical project across tenants without data loss.",
"Executed a complex, multi-stage AI workflow end-to-end.",
"Implemented effective workarounds for immediate operational issues."
],
"techStack": [
"TypeScript",
"Prisma",
"Next.js",
"tRPC",
"Docker",
"PostgreSQL",
"LLMs (Anthropic, Google, OpenAI)",
"Git",
"SSH"
]
}