Untangling the AI Workflow: From Monoliths to Precision-Engineered Prompts
We faced a critical challenge: our AI workflow engine was generating monolithic, off-target implementation prompts for complex tasks. This post dives deep into how we diagnosed the problem and designed the fan-out solution we're now implementing to achieve precise, per-action-point code generation.
Late last Friday, as the city lights blurred outside my window, I penned a "Letter to Myself." Not a philosophical reflection, but a detailed session handoff – a snapshot of my mind after a deep dive into the guts of our AI workflow engine. The goal? To fix a glaring issue: our powerful LLM-driven system, designed to break down complex projects, was spitting out a single, often irrelevant, implementation prompt for multi-faceted tasks.
This isn't just about a bug; it's about pushing the boundaries of what AI can do in a development workflow. When you ask an AI to tackle ten distinct action points, you don't want a single, generic response. You want ten targeted, actionable prompts.
The Foundation We Built On
Before we dive into the challenges, let's set the stage. Our system is designed to take a high-level goal, break it down into granular action points, analyze dependencies, synthesize a plan, and then generate an "implementation prompt" – essentially, a detailed request for the LLM to generate code or a detailed plan for a specific task.
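To make that pipeline concrete, here's a minimal sketch of the stages as data. The names below (StepKind, WorkflowStep, Workflow) are illustrative assumptions, not the actual interfaces in workflow-engine.ts.

```typescript
// Illustrative only: a simplified model of the pipeline stages described above.
// The real types in workflow-engine.ts differ; these names are assumptions.

type StepKind =
  | 'action-point-breakdown'  // high-level goal -> granular action points
  | 'group-analysis'          // dependency and merge analysis across points
  | 'synthesis'               // the consolidated plan
  | 'implementation-prompt';  // the detailed code-generation request for the LLM

interface WorkflowStep {
  id: string;
  kind: StepKind;
  prompt: string;       // what we ask the LLM at this stage
  dependsOn: string[];  // ids of steps whose output feeds this one
}

// A group workflow is an ordered list of such steps, executed by the engine.
interface Workflow {
  goal: string;
  steps: WorkflowStep[];
  generatePrompt: boolean;  // when true, the engine appends an implementation prompt step
}
```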
We'd just deployed a significant refactor, migrating our core pages to a ProviderModelPicker (think dynamic LLM selection). This paved the way for more flexible model usage. Crucially, we verified that a complex workflow, involving 13 steps and 10 different LLM calls, executed flawlessly. This confirmed the engine's robustness for processing information.
But then came the deep consistency analysis. We meticulously traced input action points through group analysis, synthesis, and finally, to the implementation prompt. This is where the cracks appeared.
The Critical Challenges: Lessons Learned from the Trenches
While the system was robust, its output for complex, multi-item tasks was far from ideal. Here's what we uncovered, and how we plan to tackle it:
Challenge 1: The Monolithic Prompt Problem
The Symptom: Imagine you have 10 distinct tasks, like "Implement NLI Metric," "Refactor score.py," and "Add sanitize.py." Our workflow engine would generate one massive implementation prompt. The LLM, given this overwhelming input, would arbitrarily pick one of the 10 items (in our test case, "NLI Metric") and focus solely on that, completely ignoring the other nine.
The Root Cause: Our workflow-engine.ts was designed to append a single implementation prompt step when workflow.generatePrompt was true. This worked perfectly for single-item workflows, but completely fell apart for groups. It was a classic case of a solution scaled for simple cases failing under complexity.
The Solution: Fan-Out to Precision: The good news? We already had a "fan-out" mechanism built into our system, used for splitting large outputs into smaller, manageable sections (section-splitter.ts). The plan is to leverage this. Instead of one monolithic prompt, we'll configure the implementation prompt step to "fan out" from the synthesis output (which already understands individual action points). This will generate N separate implementation prompts, one for each action point section.
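To make the plan concrete, here's a minimal sketch of that fan-out, assuming the synthesis output marks each action point with a "## " heading. The helper names and section format are assumptions for illustration, not the actual section-splitter.ts API.

```typescript
// Sketch only: fan the implementation prompt out over the synthesis output.
// The section format (one "## <title>" heading per action point) and the
// helper names below are assumptions, not the real section-splitter.ts contract.

interface ActionPointSection {
  title: string;  // e.g. "Implement NLI Metric", "Refactor score.py"
  body: string;   // the merged, authoritative plan text for this action point
}

// Split the synthesis output into per-action-point sections (assumes "## " headings).
function splitSynthesisIntoSections(synthesisOutput: string): ActionPointSection[] {
  const chunks = synthesisOutput.split(/^## /m).filter((chunk) => chunk.trim().length > 0);
  return chunks.map((chunk) => {
    const [firstLine, ...rest] = chunk.split('\n');
    return { title: firstLine.trim(), body: rest.join('\n').trim() };
  });
}

// One targeted implementation prompt per action point, instead of one monolithic prompt.
function buildImplementationPrompts(synthesisOutput: string): string[] {
  return splitSynthesisIntoSections(synthesisOutput).map((section) =>
    [
      'Generate an implementation prompt for exactly one action point:',
      `## ${section.title}`,
      section.body,
      'Do not address any other action points.',
    ].join('\n\n'),
  );
}
```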
Challenge 2: Speaking the Wrong Language
The Symptom: Our action points clearly referenced Python files (ipcha/score.py, ipcha/sanitize.py). Yet the generated implementation prompt often came back full of Go code, referencing internal/audit/ and cmd/ckb/ (patterns from our internal CodeMCP project).
The Root Cause: Our project.wisdom (a collection of project-specific context) contained strong signals for Go patterns related to CodeMCP. The LLM, despite seeing Python file references in the action points, prioritized the stronger, more explicit code examples and patterns injected from project.wisdom. It was a context collision, with the injected "wisdom" overriding the implicit context of the action points.
The Solution: Explicit Context Injection: We need to teach the prompt generator to deduce the target language. The fix involves modifying implementation-prompt-generator.ts to scan action point descriptions for file references (e.g., .py, .go, .ts) and then explicitly inject a "target-context" instruction into the system prompt. This gives the LLM a clear, unambiguous signal for the desired output language.
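Here's a minimal sketch of that detection, assuming a simple extension-to-language map and a regex scan over the descriptions. The function names and the exact wording of the injected instruction are illustrative, not the actual implementation-prompt-generator.ts code.

```typescript
// Sketch only: deduce the target language from file references in action points
// and surface it as an explicit instruction. The extension map and the wording
// of the injected instruction are assumptions for illustration.

const EXTENSION_TO_LANGUAGE: Record<string, string> = {
  '.py': 'Python',
  '.go': 'Go',
  '.ts': 'TypeScript',
};

function detectTargetLanguages(actionPointDescriptions: string[]): string[] {
  const found = new Set<string>();
  const fileRefPattern = /[\w./-]+(\.py|\.go|\.ts)\b/g;

  for (const description of actionPointDescriptions) {
    for (const match of description.matchAll(fileRefPattern)) {
      found.add(EXTENSION_TO_LANGUAGE[match[1]]);
    }
  }
  return [...found];
}

function buildTargetContextInstruction(actionPointDescriptions: string[]): string {
  const languages = detectTargetLanguages(actionPointDescriptions);
  if (languages.length === 0) return '';

  // An unambiguous signal meant to outrank code patterns injected from project.wisdom.
  return (
    `TARGET CONTEXT: The files referenced by these action points are ${languages.join(' and ')}. ` +
    `Generate ${languages.join('/')} code and ignore examples in other languages.`
  );
}
```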
Challenge 3: Merged Actions, Separate Plans
The Symptom: Our Group Analysis intelligently identified that action points #4 and #7 should be merged. Excellent! However, subsequent steps still generated separate per-item plans for #4 and #7. While the final synthesis step correctly merged them, this led to redundant token usage and potential for conflicting early-stage plans.
The Root Cause: The group-prompt-builder.ts was designed to generate one step per action point, regardless of the group analysis output.
The Solution: Fan-Out from Synthesis: This issue is elegantly resolved by our fan-out strategy from Challenge 1. By fanning out the implementation prompts from the synthesis output (which already correctly merges #4 and #7), we inherently eliminate the redundant separate plans. The synthesis is the authoritative source for the final, merged action points.
Minor Challenges & Pragmatic Decisions
Not every issue warrants a full redesign. We identified two minor points:
- Dependency Ordering Ignored: Group Analysis correctly identified dependencies (e.g., #8 before #1), but steps executed sequentially (1→10). We decided to accept this. Per-item plans are independent documents, and the synthesis step ultimately dictates the correct ordering for the overall plan. The impact on implementation quality is low.
- Overestimated Resource Estimates: The synthesis estimated 22-26 engineer-weeks for what looked like a solo researcher's TODO list. This is largely cosmetic and doesn't affect the quality of the generated code prompts. We decided to accept this for now, noting it could be addressed later via persona tuning or more sophisticated estimation models.
The Fix: Immediate Next Steps
With a clear understanding, the path forward is precise:
- Refactor Workflow Engine: Modify src/server/services/workflow-engine.ts (lines 2482-2607) to conditionally create a fan-out implementation prompt step for group workflows. This step will point its fanOutConfig to the Synthesis step, splitting its output by action point sections (see the sketch after this list).
- Enhance Prompt Generation: Update src/server/services/implementation-prompt-generator.ts to detect target languages (e.g., .py, .go, .ts) from action point descriptions and inject an explicit target-context into the system prompt.
- Validate: Create and run a new group workflow to thoroughly test that the fan-out mechanism produces N distinct, correctly targeted implementation prompts.
- Cleanup: Remove the now-unused discussions.availableProviders tRPC procedure and address a minor TypeScript diagnostic (_ExpandedPreview declared but never read).
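To illustrate the first step, here's a hedged sketch of the conditional branch the refactor would add. fanOutConfig is named in the plan above, but its fields (sourceStepId, splitBy) and the step shape are assumptions, not the real workflow-engine.ts types.

```typescript
// Sketch only: the branch the refactor adds in workflow-engine.ts. fanOutConfig
// is part of the plan above, but its fields and this step shape are illustrative
// assumptions, not the actual engine types.

interface ImplementationPromptStep {
  id: string;
  prompt: string;
  dependsOn: string[];
  fanOutConfig?: { sourceStepId: string; splitBy: 'action-point-section' };
}

function buildImplementationPromptStep(
  isGroupWorkflow: boolean,
  synthesisStepId: string,
): ImplementationPromptStep {
  if (isGroupWorkflow) {
    // Group workflows: a single fan-out step that the engine expands into
    // N per-action-point prompts, one for each section of the synthesis output.
    return {
      id: 'implementation-prompt-fanout',
      prompt: 'Generate an implementation prompt for the action point section below only.',
      dependsOn: [synthesisStepId],
      fanOutConfig: { sourceStepId: synthesisStepId, splitBy: 'action-point-section' },
    };
  }

  // Single-item workflows keep the original behavior: one prompt for the whole plan.
  return {
    id: 'implementation-prompt',
    prompt: 'Generate an implementation prompt for the plan above.',
    dependsOn: [synthesisStepId],
  };
}
```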
This journey from monolithic prompts to precision-engineered, fanned-out instructions is a testament to the iterative nature of building robust AI systems. It's not just about getting an answer from an LLM, but getting the right answer, in the right format, for the right context.
By tackling these challenges, we're making our AI workflow engine smarter, more accurate, and ultimately, a more powerful tool for developers.