
Leveling Up AI Workflows: From Fan-Out Fixes to Future Personas

We just wrapped a crucial dev session, tackling everything from workflow fan-out bugs and prompt engineering nightmares to optimizing local LLMs and laying the groundwork for a revolutionary persona system.

AI · LLM · Workflow Automation · Prompt Engineering · Ollama · System Design · Development Log

The world of AI development moves fast, and often, the most impactful work happens behind the scenes, refining the intricate gears of a complex system. We recently concluded an intense evening session, a deep dive into several critical areas, aiming to enhance the robustness, transparency, and intelligence of our AI-powered workflows. From wrestling with prompt engineering quirks to optimizing local LLM deployments and sketching out the future of our persona system, it was a session packed with problem-solving and strategic planning.

Here's a breakdown of what we accomplished, the challenges we overcame, and where we're headed next.

Taming the AI Workflow Beast

Our workflow engine is designed to orchestrate complex AI tasks, often involving multiple LLM calls in sequence or in parallel. We identified a few key areas where the system wasn't behaving as expected, leading to less-than-optimal outputs.

1. The Case of the Over-Summarized Fan-Out: Imagine a workflow where a single input is fanned out to 12 analytical LLM calls. You expect to collect all 12 detailed analyses in a subsequent step. However, we discovered that our workflow engine was sometimes auto-digesting these subOutputs when a downstream step referenced them as {{steps.Adversarial Analysis.content}}, compressing them into a brief summary and losing the granular detail we needed. This was particularly noticeable with more concise models like Haiku.

  • The Fix: We adjusted src/server/services/workflow-engine.ts to skip digest auto-preference for steps that involve fan-out (subOutputs), ensuring that all rich, detailed analyses are passed downstream intact.
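
To make the behavior concrete, here's a minimal sketch of the guard. The field names (subOutputs, digest) are illustrative assumptions, not the engine's real internals:

```typescript
// Hypothetical step shape; the real types live in
// src/server/services/workflow-engine.ts and will differ in detail.
interface SubOutput {
  content: string;
}

interface WorkflowStep {
  content: string;
  digest?: string;          // short auto-generated summary
  subOutputs?: SubOutput[]; // one entry per fan-out call
}

// Resolve what {{steps.X.content}} expands to. Fan-out steps skip the
// digest auto-preference so every parallel analysis passes downstream
// intact; single-output steps may still prefer the shorter digest.
function resolveStepContent(step: WorkflowStep): string {
  if (step.subOutputs && step.subOutputs.length > 0) {
    return step.subOutputs.map((s) => s.content).join("\n\n---\n\n");
  }
  return step.digest ?? step.content;
}
```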

2. Precision in Prompt Engineering: Our Ipcha Mistabra (adversarial analysis and arbitration) process is critical, but we found the LLM sometimes misunderstood its role.

  • Arbitration Prompt: The initial prompt, "Judge the following adversarial analysis process," led the LLM to evaluate the methodology of the analysis rather than the product itself. It even returned raw JSON instead of a clear judgment.

  • Results Prompt: Similarly, our results step, intended to provide an executive summary, was sometimes producing structured JSON arrays instead of human-readable markdown.

  • The Fix: We rewrote both prompts in src/server/trpc/routers/workflows.ts:

    • For arbitration: "Judge the SUBJECT... NOT the analysis process" with explicit "no JSON, no code blocks."
    • For results: "Write human-readable executive summary format (Strengths, Critical Risks, Rejected Claims, Overall Assessment) instead of structured JSON."
  • We also ensured that new Ipcha creations correctly set generatePrompt: false to prevent an unwanted default "Implementation Prompt" step.
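
Paraphrased, the rewritten prompts look roughly like this. These are illustrative strings, not the exact text in workflows.ts:

```typescript
// Illustrative paraphrases of the rewritten prompts; the real strings
// live in src/server/trpc/routers/workflows.ts.
const ARBITRATION_PROMPT = `Judge the SUBJECT under analysis, NOT the
adversarial analysis process itself. Deliver a clear verdict in plain
prose: no JSON, no code blocks.`;

const RESULTS_PROMPT = `Write a human-readable executive summary in
markdown with the sections Strengths, Critical Risks, Rejected Claims,
and Overall Assessment. Do NOT output structured JSON.`;
```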

Shining a Light on Costs & Performance with NerdStats

Transparency in AI usage, especially concerning costs, is paramount. We've enhanced our dashboard to provide deeper insights.

  • Introducing NerdStats: We added a new component to our workflow detail page (src/app/(dashboard)/dashboard/workflows/[id]/page.tsx) that displays per-phase and per-provider cost breakdowns, aggregated from all step data and fan-out subOutputs.
  • Complete Cost Rates: We updated src/server/services/llm/types.ts to include accurate pricing for gemini-2.5-pro ($1.25 / $1.25 / $10 per 1M tokens) and correctly reflect Ollama models as "free." This ensures our cost tracking is precise and reliable.
  • A "Per Provider" section was also added to our summary.md bundle export for offline analysis.

Fortifying Persona Evaluations

Our persona evaluation system helps us understand how different LLMs perform under specific persona constraints. A critical bug involved how we handled empty responses.

  • The Problem: When an LLM provider (e.g., due to safety filters) returned an empty response, our system would still send it to the judge, which would then default to an ambiguous 50/50/50 score. This skewed evaluation results and hid the underlying issue.
  • The Fix: In src/server/services/persona-evaluator.ts, empty responses are now auto-scored 0/0/0 with an empty_response violation. The UI (src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx) now shows a red "Empty response — provider likely blocked by safety filter" badge. This provides immediate, accurate feedback.
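
Here's a minimal sketch of the guard, with assumed score fields and violation codes based on the description above:

```typescript
// Score dimensions and violation codes are assumptions for the sketch;
// the real schema lives in src/server/services/persona-evaluator.ts.
interface EvalScore {
  adherence: number;
  quality: number;
  safety: number;
  violations: string[];
}

function scoreResponse(
  response: string,
  judge: (r: string) => EvalScore
): EvalScore {
  // An empty response (often a provider-side safety block) must not reach
  // the judge, which would otherwise default to an ambiguous 50/50/50.
  if (response.trim().length === 0) {
    return { adherence: 0, quality: 0, safety: 0, violations: ["empty_response"] };
  }
  return judge(response);
}
```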

Optimizing Local LLMs with Ollama

Running large language models locally or on self-hosted infrastructure comes with its own set of challenges, particularly around resource management.

  • The OOM Killer Strikes: We initially configured Ollama to use qwen2.5:7b as the default model. On our production server, with 7.5GB of RAM and a 5GB container limit, this model (~4.5GB of weights plus KV cache) frequently ran into Out-Of-Memory (OOM) kills during inference, surfacing as signal: killed messages.
  • The Workaround: We switched the default model from qwen2.5:7b to qwen2.5:3b (~2GB) in src/server/services/llm/adapters/ollama.ts and src/lib/constants.ts; see the sketch after this list. The smaller model fits comfortably within our container limit, ensuring stable operation.
  • Minor Infra Quirk: We still have a cosmetic issue where the Ollama container healthcheck reports "unhealthy" because curl is not found in the base Ollama image. It's not impacting functionality, but it's on our list to address.
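
As referenced above, here's the swap with the rough memory budget that motivated it. The constant name is illustrative; the real defaults live in src/lib/constants.ts and the Ollama adapter:

```typescript
// Rough memory budget behind the swap (approximate figures):
//   qwen2.5:7b  ~4.5GB weights + KV cache -> exceeds the 5GB container limit
//   qwen2.5:3b  ~2.0GB weights + KV cache -> comfortable headroom
// Constant name is illustrative, not the actual identifier in the codebase.
export const DEFAULT_OLLAMA_MODEL = "qwen2.5:3b"; // was "qwen2.5:7b"
```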

Lessons Learned & Debugging Tales

Every development session brings its own set of "aha!" moments and hard-earned insights.

  • The Digest Dilemma: When dealing with complex, multi-output steps, always be explicit about what content you expect. Default behaviors (like auto-digestion for brevity) can silently undermine the quality of downstream processing if not overridden.
  • Prompt Precision is Paramount: LLMs are incredibly literal. If you ask them to "Judge the process," they will. If you don't explicitly forbid JSON, they might give you JSON. Crafting prompts with surgical precision, including negative constraints ("Do NOT output JSON"), is crucial for reliable outputs.
  • Resource Constraints are Real: Running even moderately sized LLMs requires careful resource planning. What works on a development machine might OOM on production with tighter container limits. Always profile memory usage and be prepared to optimize or downsize models.
  • Infra Quirks Persist: Sometimes the smallest things, like a missing curl utility in a container image, can lead to confusing healthcheck failures. It's a reminder that infrastructure details matter.

Looking Ahead: The Future of Personas (v2)

With these immediate fixes deployed, our sights are set on a major strategic initiative: overhauling our persona system. Our deep research has identified several key weaknesses:

  • Generic test prompts: The same monolithic question for every persona.
  • Trivial jailbreak attacks: 2023-era "ignore previous instructions" probes are no longer a sufficient test.
  • Vague judge prompt: Lacks persona-specific rubrics and has a truncated system prompt.
  • Wrong scoring weights: Over-weighting temperature rules and jailbreak role.
  • Incomplete markers: Only a few personas mapped, others use generic defaults.
  • No refusal quality dimension: Jailbreaks scored the same as regular tests.
  • Provider not isolated: resolveAnyProvider() can route each test to a different provider, making results incomparable across runs.
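
For the provider-isolation point specifically, one possible v2 shape is to pin a single provider per evaluation run instead of resolving any available one. All names below are hypothetical; none of this exists in the codebase today:

```typescript
// Hypothetical v2 contract: every test in a run uses the same pinned
// provider, so scores are comparable across runs.
interface EvalRun {
  personaId: string;
  provider: string; // pinned once, reused for every test in the run
  scores: number[];
}

function startEvalRun(personaId: string, provider: string): EvalRun {
  // Unlike resolveAnyProvider(), which may route each test to whichever
  // provider happens to be available, the run records one provider up front.
  return { personaId, provider, scores: [] };
}
```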

This overhaul will involve two parallel tracks:

  1. Persona Evaluation v2 (Scientific Framework): Moving towards a more rigorous, scientific approach to evaluating persona adherence and performance. We're currently in the brainstorming phase, deciding between a "breadth" approach (fixing critical gaps across all personas) or a "depth" approach (adding rigorous new test types for a few key personas). Our recommendation leans towards a hybrid approach.
  2. Rent-a-Persona API: Developing an external, token-based API to allow others to leverage our carefully crafted and evaluated personas.

Once we finalize the strategic approach for Persona System v2, we'll dive into writing a detailed design specification and implementation plan.


This session was a testament to the iterative nature of building complex AI systems. Each bug fixed, each prompt refined, and each new feature added brings us closer to a more robust, intelligent, and user-friendly platform. We're excited about the progress and even more so about the ambitious plans for our persona system!