nyxcore-systems
9 min read

Taming the AI Workflow Beast: An Evening of Debugging, Discovery, and Design

Join me in the trenches of an evening dev session as we tackle stubborn LLM workflow bugs, optimize costs, and lay the groundwork for a smarter, more robust AI persona system.

LLM · AI Workflow · Debugging · Prompt Engineering · Ops · Persona · System Engineering

Another evening, another deep dive into the labyrinthine world of AI workflows. This wasn't just about squashing bugs; it was about refining the delicate dance between code, prompts, and the unpredictable nature of large language models. The goal for this session was ambitious: fix workflow fan-out issues, sharpen our Ipcha Mistabra (adversarial analysis) prompts, introduce detailed cost visibility, improve persona evaluations, switch our default Ollama model, and kick off planning for a major overhaul of our persona system.

By the time the last commit was pushed and deployed, we'd made significant strides, putting out several fires and setting the stage for future innovation. Let's unpack the journey.

The Evening's Harvest: What Got Done

It was a classic mix of bug fixes and quality-of-life improvements that make a real difference in the day-to-day.

Workflow Engine: Precision Prompting & Data Flow

The core of our system relies on complex workflows, and ensuring data flows correctly through multi-step processes is paramount.

  • Fan-out Digest Fix: A subtle but critical bug was found in src/server/services/workflow-engine.ts. Our fan-out steps, which generate multiple parallel analyses (subOutputs), were being run through an unwanted digest auto-preference. If a downstream step referenced {{steps.Label.content}}, it would receive a Haiku-compressed summary instead of the full, detailed output of 12 individual analyses. This was like asking for a detailed report and getting a haiku instead! Steps with subOutputs now skip the digest, preserving the richness of the aggregated data (see the sketch after this list).
  • Sharpening Ipcha Mistabra Prompts: Our Ipcha Mistabra (adversarial analysis) workflow is designed to challenge assumptions. We found our arbitration prompt was guiding the LLM to evaluate its own process rather than the subject matter. The fix in src/server/trpc/routers/workflows.ts involved a crucial prompt rewrite: "Judge the SUBJECT... NOT the analysis process" and explicit instructions for "no JSON, no code blocks."
  • Human-Readable Results: Similarly, the final results prompt for Ipcha Mistabra was sometimes yielding structured JSON, which isn't ideal for immediate human consumption. We've refined it to demand an executive summary format (Strengths, Critical Risks, Rejected Claims, Overall Assessment) instead of structured JSON, ensuring immediate readability.
  • Preventing Unwanted Prompt Generation: A small but annoying default in createIpcha was causing an Implementation Prompt step to be appended unnecessarily. Setting generatePrompt: false by default now streamlines the process.
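To make the fan-out fix concrete, here's a minimal sketch of the content-resolution guard. The names (StepOutput, resolveStepContent, preferDigest) are illustrative assumptions, not the actual identifiers in workflow-engine.ts:

```typescript
// Illustrative types: a fan-out step carries its parallel results in subOutputs.
interface StepOutput {
  label: string;
  content: string;
  digest?: string;        // Haiku-compressed summary, generated as an optimization
  subOutputs?: string[];  // present only on fan-out steps
}

// Resolve what {{steps.Label.content}} should expand to downstream.
function resolveStepContent(step: StepOutput, preferDigest: boolean): string {
  // Fan-out steps: always hand downstream steps the full aggregated output.
  if (step.subOutputs && step.subOutputs.length > 0) {
    return step.subOutputs.join("\n\n---\n\n");
  }
  // Ordinary steps keep the existing digest auto-preference.
  if (preferDigest && step.digest) {
    return step.digest;
  }
  return step.content;
}
```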

NerdStats & Cost Transparency

Understanding the operational costs of LLM usage is no longer a luxury; it's a necessity.

  • Introducing NerdStats: For our fellow developers, we've added a NerdStats component to the workflow detail page (src/app/(dashboard)/dashboard/workflows/[id]/page.tsx). This provides a granular breakdown of costs per phase and per provider, aggregated from step data and fan-out subOutputs.
  • Comprehensive Cost Rates: We've updated src/server/services/llm/types.ts to include accurate cost rates for gemini-2.5-pro (1.25/1.25/10 per 1M tokens) and, importantly, declared all Ollama models as "free." This means no more misleading $0.000000 for Gemini Pro. A "Per Provider" section was also added to our summary.md bundle export; a cost-rate sketch follows this list.
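As a rough illustration of how the rates and the per-provider rollup fit together. The COST_RATES table, field names, and helper functions below are assumptions, not the real types.ts API; only the input and output figures from the post are used here:

```typescript
// USD per 1M tokens (the post's middle Gemini figure is omitted in this sketch).
type Rate = { input: number; output: number };

const COST_RATES: Record<string, Rate> = {
  "gemini-2.5-pro": { input: 1.25, output: 10 },
};

interface StepUsage {
  provider: string;   // e.g. "gemini" or "ollama"
  model: string;
  inputTokens: number;
  outputTokens: number;
}

function stepCostUsd(u: StepUsage): number {
  // Ollama models (and anything without a rate entry) are treated as free.
  const rate = u.provider === "ollama" ? undefined : COST_RATES[u.model];
  if (!rate) return 0;
  return (u.inputTokens * rate.input + u.outputTokens * rate.output) / 1_000_000;
}

// Per-provider rollup like the one NerdStats and the summary.md "Per Provider" table show.
function costPerProvider(steps: StepUsage[]): Record<string, number> {
  return steps.reduce<Record<string, number>>((acc, s) => {
    acc[s.provider] = (acc[s.provider] ?? 0) + stepCostUsd(s);
    return acc;
  }, {});
}
```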

Persona Evaluations: Data Integrity

Reliable evaluation of our AI personas is critical for their development.

  • Handling Empty Responses: When an LLM provider's safety filter blocks a response, the empty reply previously went straight to our judge, which defaulted to a 50/50/50 score and skewed evaluation data. Now, src/server/services/persona-evaluator.ts auto-scores empty responses as 0/0/0 with an empty_response violation, and the UI (src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx) flags these with a red "Empty response — provider likely blocked by safety filter" indicator (sketch below).
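Here's a minimal sketch of that guard, assuming hypothetical shapes; the score dimensions and the judge callback are illustrative, not the real persona-evaluator.ts types:

```typescript
interface EvaluationResult {
  scores: { accuracy: number; tone: number; safety: number }; // illustrative dimensions
  violations: string[];
  judged: boolean;
}

function evaluateResponse(
  response: string,
  judge: (r: string) => EvaluationResult
): EvaluationResult {
  if (response.trim().length === 0) {
    // Safety-filtered or empty completion: score 0/0/0 and flag it instead of letting
    // the judge invent a 50/50/50 score for content that doesn't exist.
    return {
      scores: { accuracy: 0, tone: 0, safety: 0 },
      violations: ["empty_response"],
      judged: false,
    };
  }
  return judge(response);
}
```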

Ollama: A Pragmatic Model Switch

Running self-hosted LLMs comes with its own set of operational challenges.

  • Memory Management: Our default Ollama model, qwen2.5:7b, was consistently OOM-killed on our production server (7.5GB of RAM with a 5GB container limit). The 7B model alone required ~4.5GB of weights plus ~448MB of KV cache, pushing it over the edge; the rough arithmetic is sketched below. We've pragmatically switched the default to qwen2.5:3b (~2GB) in src/server/services/llm/adapters/ollama.ts and src/lib/constants.ts, ensuring stability.
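For the curious, here's the back-of-envelope arithmetic. The architecture parameters are assumptions (typical published values for Qwen2.5-7B with grouped-query attention), chosen because they reproduce the ~448MB KV-cache figure above:

```typescript
// Assumed Qwen2.5-7B attention geometry (not taken from our codebase).
const layers = 28;
const kvHeads = 4;            // grouped-query attention
const headDim = 128;
const bytesPerElem = 2;       // fp16 KV cache
const contextTokens = 8192;

// KV cache: 2 (K and V) * layers * kvHeads * headDim * context * bytes per element
const kvCacheBytes = 2 * layers * kvHeads * headDim * contextTokens * bytesPerElem;
console.log((kvCacheBytes / 1024 ** 2).toFixed(0), "MiB KV cache"); // ≈ 448 MiB

const weightsGiB = 4.5;       // Q4-quantized 7B weights, per the figure above
const totalGiB = weightsGiB + kvCacheBytes / 1024 ** 3;
console.log(totalGiB.toFixed(2), "GiB total"); // ≈ 4.94 GiB — almost nothing left under a 5GB limit
```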

The Commits

All these changes are live on main:

  • 22f855c — fix: skip digest auto-preference for fan-out steps
  • 05c0992 — fix: improve Ipcha Mistabra arbitration and results prompts
  • 9280162 — fix: Results step must output human-readable markdown, not JSON
  • 5a4e268 — feat: add Stats for Nerds to workflow detail page
  • e3dd795 — feat: add per-provider table to summary.md + complete cost rates
  • 4597943 — fix: handle empty responses in persona evaluations
  • 028974b — fix: switch Ollama default model from qwen2.5:7b to qwen2.5:3b

Lessons Learned from the Trenches (The "Pain Log")

Every session has its share of head-scratching moments. Here’s what we wrestled with and what we learned:

  • The Case of the Overzealous Summarizer:

    • Tried: Referencing {{steps.Adversarial Analysis.content}} in a downstream Synthesis step, expecting the full combined output of 12 analyses.
    • Failed: Received a Haiku-compressed digest instead. Our system was trying to be "helpful" by summarizing, but in this context, it was losing critical detail from the fan-out.
    • Lesson: When dealing with aggregated outputs from parallel LLM calls, explicit control over content resolution is vital. Auto-summarization is a powerful feature, but it needs to be switchable or context-aware to prevent data loss in structured workflows.
  • The LLM That Judged the Judge:

    • Tried: An arbitration prompt like "Judge the following adversarial analysis process."
    • Failed: The LLM, instead of evaluating the product of the adversarial analysis, started critiquing the methodology itself, and even returned raw JSON instead of a decision.
    • Lesson: Prompt engineering for arbitration or evaluation steps demands extreme precision. Be hyper-explicit about the subject of the evaluation and what the LLM should ignore, and always reinforce output format requirements, even if it feels redundant (a rough prompt sketch follows this pain log).
  • When JSON Becomes an Obstacle:

    • Tried: A results prompt with structured classification requirements.
    • Failed: Gemini-2.5-pro dutifully returned a JSON array, but for the end-user, a human-readable executive summary was far more valuable.
    • Lesson: Balance the desire for structured LLM output with the ultimate consumption format. Sometimes, "human-readable markdown" is the superior instruction, even if it means less programmatic parsing on our end.
  • The Memory Monster (Ollama OOM):

    • Tried: Running Ollama qwen2.5:7b on a production server with a 5GB container memory limit.
    • Failed: signal: killed – classic Out Of Memory during inference. The 7B model's weights and KV cache pushed it past the limit.
    • Lesson: Resource monitoring is non-negotiable for self-hosted LLMs. Be aware of model memory footprints (weights + KV cache) and set realistic container limits. Have smaller, performant fallback models ready for production environments.
  • The Silent Runner (Ollama Healthcheck):

    • Tried: Using curl for the Ollama container healthcheck.
    • Failed: curl: not found – the minimal Ollama image doesn't include common utilities. The healthcheck always reported "unhealthy" despite the container functioning perfectly.
    • Lesson: Understand your container images. If standard tools aren't present, adapt your healthchecks or bake in necessary utilities. For now, it's a cosmetic issue, but one that needs addressing.
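For the arbitration fix, the reworked prompt roughly follows this shape. The exact wording in workflows.ts differs; only the two quoted instructions are taken from the commit, and the template-string helper is illustrative:

```typescript
// Rough shape of the reworked arbitration prompt (not the literal production text).
const arbitrationPrompt = (subject: string, analyses: string) => `
You are arbitrating an adversarial analysis.

Judge the SUBJECT of the analyses below, NOT the analysis process itself.
Do not comment on methodology, prompt quality, or how the analyses were produced.

Subject:
${subject}

Analyses:
${analyses}

Output plain prose only: no JSON, no code blocks.
`;
```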

The Road Ahead: Persona System v2

With the immediate fires out, our focus shifts to a critical strategic initiative: Persona System v2. Our current persona evaluation system, while functional, has revealed several weaknesses through deep analysis. This next phase aims to build a more scientific, robust framework for evaluating and evolving our AI personas, along with an ambitious "Rent-a-Persona" API.

We're currently in the brainstorming phase, and a key decision point is the approach:

  • (A) Breadth: Fix all existing tests for all personas, keeping the same 3 test types.
  • (B) Depth: Add rigorous new test types, starting with 1-2 personas as a Proof of Concept.
  • (C) Hybrid (recommended): Fix critical gaps across all personas AND introduce 1 new, impactful test type.

This choice will guide the design specification.

Deep Research Findings (The "Why" for v2)

Our subagent analysis highlighted several key areas for improvement:

  • Generic Test Prompts: Many personas are tested with the same monolithic migration question, failing to probe persona-specific nuances.
  • Trivial Jailbreak Attacks: Our current jailbreak tests are often 2023-era, easily bypassed, and not reflective of sophisticated attacks.
  • Vague Judge Prompt: A lack of persona-specific rubrics, combined with truncated system prompts for the judge, leads to inconsistent evaluations.
  • Wrong Scoring Weights: The current system sometimes over-weights factors like temperature or jailbreak success inappropriately.
  • Incomplete Markers: Only a subset of our personas (Cael, Lee, Morgan, Sage) have mapped markers; others default to generic settings.
  • No Refusal Quality Dimension: Jailbreaks are scored similarly to regular tests, without distinguishing the quality of the LLM's refusal.
  • Provider Not Isolated: Our resolveAnyProvider() setup makes cross-tenant result comparisons difficult.

The v2 overhaul will tackle these head-on, allowing us to accurately measure and iteratively improve our personas.

Immediate Next Steps

  1. Persona System v2: Settle the breadth-vs-depth question, then write the design spec and invoke our internal "writing-plans" skill for an implementation plan. This will split into two parallel tracks: Persona Evaluation v2 (scientific framework) and Rent-a-Persona API (external token-based access).
  2. Verify Workflow 2758e8c9: Confirm the Results step now outputs clean markdown.
  3. Fix Ollama Healthcheck: Install wget/curl or implement a different healthcheck mechanism.
  4. Stripe Env Vars: Add pending Stripe environment variables to production.
  5. Retroactive Cost Recalculation: Consider recalculating costs for old workflows now that rates are accurate.

This session was a microcosm of modern AI development: a blend of meticulous debugging, strategic planning, and the constant learning curve of working with cutting-edge (and sometimes quirky) LLMs. It's a challenging but incredibly rewarding field, and I'm excited for what Persona System v2 will unlock.

