Taming LLM Workflows: From Haikus to Executive Summaries in Our AI Engine
A deep dive into a recent development session, tackling tricky LLM output formats, workflow digest bugs, and enhancing cost visibility in our AI-powered workflow engine.
Building sophisticated AI-powered applications is a journey filled with fascinating challenges. Our "Ipcha Mistabra" workflow, designed to orchestrate complex adversarial analyses and synthesis, is a prime example. It's a powerful tool, but like any cutting-edge system, it occasionally throws us a curveball.
Recently, we dedicated a late-afternoon session to ironing out some critical kinks in our workflow engine. The goal was clear: enhance reliability, improve AI output quality, and boost transparency. I'm excited to share the breakthroughs we made, from wrestling with overly concise LLM digests to refining our prompt engineering for better, more predictable results.
Mission Accomplished: What We Shipped
Our session focused on several key areas, all of which are now live in production. The latest Ipcha Mistabra workflow (2758e8c9-3416-446a-ac03-a1889f226e09) is now running smoother and smarter, thanks to these targeted improvements.
1. The Case of the Missing Data: Taming the Fan-out Digest Bug
One of the most perplexing issues we faced involved our workflow's fan-out steps. Imagine you have a workflow where a single input branches out into 12 parallel adversarial analyses. You'd expect the downstream "Synthesis" step to receive all 12 detailed analyses. Instead, it was getting a brief, almost Haiku-like summary. This "digest" feature, while useful for single-step outputs, was lossy-compressing our critical data for fan-out steps, effectively starving the synthesis process.
The Fix: We pinpointed the culprit in src/server/services/workflow-engine.ts (lines 403-413). The solution was to modify our {{steps.Label.content}} accessor. It now intelligently skips digest auto-preference for steps that involve subOutputs (our fan-out steps). This ensures that the full, uncompressed content of all 12 analyses is passed downstream, while the .digest accessor remains available for explicit use when a summary is desired.
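To make the idea concrete, here is a minimal sketch assuming a simplified step-output shape; StepOutput, resolveStepContent, and the joining logic are illustrative, not the actual engine code:

```typescript
// Hypothetical shape of a step's stored output; the real engine types differ.
interface StepOutput {
  content: string;            // full output of the step
  digest?: string;            // auto-generated short summary
  subOutputs?: StepOutput[];  // present only for fan-out steps
}

// Resolves {{steps.Label.content}}: ordinary steps may auto-prefer the digest,
// but fan-out steps must pass every sub-output through in full.
function resolveStepContent(step: StepOutput): string {
  const isFanOut = Array.isArray(step.subOutputs) && step.subOutputs.length > 0;

  if (isFanOut) {
    // Join all parallel analyses so the downstream synthesis step sees all of them.
    return step.subOutputs!.map((sub) => sub.content).join("\n\n---\n\n");
  }

  // Single-step outputs keep the digest auto-preference when a digest exists.
  return step.digest ?? step.content;
}
```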
2. Prompt Perfection: Guiding LLMs to Better Outcomes
Our "Ipcha Mistabra" workflow relies heavily on LLMs for critical tasks like arbitration and synthesizing results. We discovered that even subtle phrasing in our prompts could lead to drastically different, and often undesirable, outputs.
- Arbitration Prompt Rewrite (src/server/trpc/routers/workflows.ts, line 660):
  - Old Prompt: "Judge the following adversarial analysis process"
  - Problem: The LLM interpreted this literally, evaluating the methodology of our adversarial analysis rather than the subject of the analysis (our OFFPAD AS product). It also had a habit of returning raw JSON, mimicking a dual-provider judge format, which wasn't human-readable.
  - New Prompt: "Judge the SUBJECT of the adversarial analyses." We added explicit instructions: "Write a human-readable markdown summary. Do NOT output JSON or code blocks." This ensures the LLM focuses on the product and presents its judgment in a digestible format.
- Results Prompt Rewrite (src/server/trpc/routers/workflows.ts, line 673):
  - Old Prompt: A structured classification prompt (e.g., "Classify as pain_point or strength").
  - Problem: Our Gemini-2.5-pro model consistently output a structured JSON array, making it difficult for users to quickly grasp the executive summary.
  - New Prompt: We shifted to an executive summary format with clear sections: Strengths, Critical Risks, Rejected Claims, Overall Assessment. Crucially, we added the explicit directive: "Write a human-readable executive summary... Do NOT output JSON or code blocks." (A sketch of both rewritten prompts follows this list.)
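For illustration, the two rewritten prompts could be assembled roughly like this; the variable names and exact wording are illustrative, not the production strings in workflows.ts:

```typescript
// Illustrative prompt builders; the exact wording in workflows.ts differs.
const arbitrationPrompt = [
  "Judge the SUBJECT of the adversarial analyses, not the analysis process itself.",
  "Write a human-readable markdown summary.",
  "Do NOT output JSON or code blocks.",
].join("\n");

const resultsPrompt = [
  "Write a human-readable executive summary of the adversarial analyses.",
  "Use these sections: Strengths, Critical Risks, Rejected Claims, Overall Assessment.",
  "Do NOT output JSON or code blocks.",
].join("\n");
```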
3. Streamlining Workflow Creation: Disabling generatePrompt
When creating a new Ipcha workflow, we noticed an unwanted "Implementation Prompt step" was being appended. This was due to createIpcha inheriting a default generatePrompt: true from the schema.
The Fix: We explicitly set generatePrompt: false in src/server/trpc/routers/workflows.ts (line 615) for createIpcha. This small change means our workflows now start clean, without unnecessary steps.
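In outline, the workflow-creation input now pins the flag explicitly, something like the sketch below; every field except generatePrompt is a placeholder, not the real createIpcha schema:

```typescript
// Placeholder input type; only generatePrompt mirrors the real schema field.
interface CreateIpchaInput {
  name: string;
  generatePrompt?: boolean; // schema default is true, which appended the extra step
}

const input: CreateIpchaInput = {
  name: "Ipcha Mistabra",
  generatePrompt: false, // start the workflow clean, no "Implementation Prompt" step
};
```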
4. Enhanced Visibility: NerdStats and Cost Tracking
Understanding the performance and cost of our LLM workflows is paramount. We made two significant improvements here:
- NerdStats on Workflow Pages (src/app/(dashboard)/dashboard/workflows/[id]/page.tsx): We integrated our NerdStats component directly into the workflow detail pages. It now provides a comprehensive breakdown of costs and token usage per phase and per LLM provider, aggregated from individual step data and fan-out subOutputs.
- Per-Provider Table in summary.md Export (src/server/services/workflow-bundle.ts): For offline analysis and reporting, our workflow bundle export now includes a "Per Provider" section in summary.md, offering a clear aggregation of costs from all fan-out subOutputs. (A sketch of this roll-up follows below.)
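Conceptually, the per-provider roll-up just flattens fan-out subOutputs and sums by provider. A sketch under assumed field names follows; UsageRecord and its fields are not the real NerdStats types:

```typescript
// Assumed shape of per-step usage data; not the real NerdStats types.
interface UsageRecord {
  provider: string;           // e.g. "gemini-2.5-pro" or a local Ollama model
  costUsd: number;
  tokens: number;
  subOutputs?: UsageRecord[]; // fan-out children
}

// Flattens fan-out subOutputs and sums cost and tokens per provider.
function aggregateByProvider(
  steps: UsageRecord[],
): Map<string, { costUsd: number; tokens: number }> {
  const totals = new Map<string, { costUsd: number; tokens: number }>();

  const visit = (record: UsageRecord): void => {
    const entry = totals.get(record.provider) ?? { costUsd: 0, tokens: 0 };
    entry.costUsd += record.costUsd;
    entry.tokens += record.tokens;
    totals.set(record.provider, entry);
    record.subOutputs?.forEach(visit);
  };

  steps.forEach(visit);
  return totals;
}
```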
5. Accurate Cost Coverage: Accounting for All LLMs
You can't optimize what you can't measure. We noticed that gemini-2.5-pro costs were showing up as $0.000000, which was misleading.
The Fix: We updated src/server/services/llm/types.ts to include the correct cost rates for gemini-2.5-pro ($10 per 1M tokens) and also added all Ollama models, which are, delightfully, free to run locally! This ensures our cost calculations are now accurate across the board.
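A simplified sketch of such a rate table is below; the actual types.ts shape likely differs (for example, it may split input and output rates), and the model keys are just examples:

```typescript
// USD per 1M tokens; a flat-rate simplification of the real rate table.
const COST_PER_1M_TOKENS: Record<string, number> = {
  "gemini-2.5-pro": 10, // $10 per 1M tokens, as added in this fix
  "ollama/llama3": 0,   // Ollama models run locally and cost nothing
};

function estimateCostUsd(model: string, totalTokens: number): number {
  // Unknown models fall back to 0, which is exactly how the $0.000000 bug looked.
  const rate = COST_PER_1M_TOKENS[model] ?? 0;
  return (totalTokens / 1_000_000) * rate;
}
```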
Lessons Learned: Our Debugging Journey
Debugging LLM-powered systems often feels like teaching a highly intelligent, but incredibly literal, student. Our "Pain Log" from the session offers valuable insights:
- The Digest Dilemma: We initially tried using {{steps.Adversarial Analysis.content}} expecting the full output. The system's auto-digestion feature, designed for brevity, became a blocker for workflows requiring comprehensive data.
  - Lesson: Be mindful of implicit data transformations. When dealing with fan-out patterns, ensure your system explicitly handles the aggregation of full outputs, or provides a mechanism to bypass summarization.
- Prompt Precision is Paramount: Crafting prompts requires anticipating how an LLM might misinterpret vague or implied instructions.
  - Lesson 1 (Arbitration): "Judge the process" versus "Judge the subject" is a critical distinction. LLMs are highly sensitive to the exact phrasing of instructions. Always be explicit about the domain of judgment.
  - Lesson 2 (Output Format): Simply asking for a "summary" or "classification" might still yield structured data (like JSON) if the LLM's training data heavily features such formats for similar tasks. Explicitly stating "Write a human-readable executive summary. Do NOT output JSON or code blocks" is often necessary to get the desired markdown output.
What's Next?
With these critical fixes deployed, our focus shifts to validation and further enhancements:
- Verify Workflow Output: Confirm that workflow 2758e8c9's Results step produces clean markdown as intended.
- Cost Calculation Check: Ensure Gemini costs are correctly reflected in NerdStats for new workflows.
- UI Toggle for generatePrompt: Consider adding a UI option to toggle generatePrompt for Ipcha workflows, providing more flexibility.
- Stripe Integration: Finally add the pending Stripe environment variables to production and restart the container (a carry-over from a previous session).
- Retroactive Cost Recalculation: Investigate the feasibility of recalculating costs for old, completed workflows.
This session was a testament to the iterative nature of development, especially when working with complex AI systems. Each challenge overcome makes our workflow engine more robust, more transparent, and ultimately, more valuable. Onwards!