nyxcore-systems
8 min read

When AI Hallucinates Healthcare: Unmasking a Subtle LLM Workflow Bug and Embracing 'Controlled Dreaming'

We faced a baffling bug: our AI workflows were hallucinating healthcare and e-commerce scenarios from seemingly unrelated prompts. The culprit? A silent, broken internal reference that led to an empty prompt, and an LLM trying its best to be 'helpful.' What started as a critical blocker turned into an exciting new feature idea: controlled hallucination.

LLM · Debugging · WorkflowEngine · TypeScript · Prisma · AI · tRPC · SoftwareDevelopment

It was late. The kind of late where the lines between "bug" and "feature" start to blur, and the only thing clearer than your screen is the looming deadline for a critical fix. My mission: track down and obliterate a particularly insidious bug where our LLM-powered workflows were generating bizarre, hallucinated content – specifically, detailed collaboration scenarios involving healthcare and e-commerce, completely out of context.

This wasn't just a minor glitch; it was a BLOCKER. Our workflow engine, designed for precision and relevant output, was suddenly dreaming of patient data and shopping carts. Here's how we dissected the problem, fixed the systemic root cause, and even found a silver lining in the process.

The Mystery Unfolds: Diagnosing the Hallucination

The first step in any good debugging session is to identify patient zero. We traced the hallucinated output back to a specific workflow: "nyxCore - Hetzner Deployment." This workflow, intended for generating code prompts related to infrastructure deployment, was instead outputting generic persona collaboration scenarios.

Digging into its structure, we found the smoking gun in Step 2, "Generate Code Prompts." Its prompt template referenced {{steps.Extract Features.content}}. The problem? There was no step with the label "Extract Features" anywhere in that workflow.

The "Aha!" Moment: The workflow engine, upon encountering this broken reference, was silently resolving {{steps.Extract Features.content}} to an empty string. An empty prompt, combined with the LLM's inherent drive to be helpful and the general specializations of the personas involved (e.g., Engineering, Security, DSGVO), led it to hallucinate generic, plausible collaboration scenarios. Healthcare and e-commerce, it turns out, are common enough domains for persona-driven problem-solving that they became the default "creative fill."

This wasn't an LLM going rogue; it was an LLM doing precisely what it's trained to do: make sense of limited input, even if that input is effectively null.
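
To make the failure mode concrete, here's a minimal sketch (hypothetical, not our actual engine code) of how a naive template resolver silently turns a broken reference into an empty prompt:

typescript
// Hypothetical illustration of the old, silent behavior.
// A reference like {{steps.X.content}} is looked up in the map of previous step
// outputs; if the label doesn't exist, it quietly resolves to "".
function naiveResolvePrompt(template: string, outputs: Record<string, string>): string {
  return template.replace(/\{\{steps\.(.+?)\.content\}\}/g, (_match, label) => outputs[label] ?? '');
}

// "Generate Code Prompts" referenced a step that never existed:
const prompt = naiveResolvePrompt('{{steps.Extract Features.content}}', {});
console.log(JSON.stringify(prompt)); // prints "" (the LLM receives an effectively empty prompt)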

The Surgical Strike: Fixing the Immediate Bug

Our first priority was to prevent this from happening again. This meant introducing validation at the point of workflow creation.

In src/server/trpc/routers/workflows.ts, within the create mutation, we added a pre-validation step:

typescript
// src/server/trpc/routers/workflows.ts (simplified)
import { TRPCError } from '@trpc/server';

// ... inside the 'create' mutation input handler ...

for (const step of input.steps) {
  const missingRefs = detectMissingStepRefs(step.prompt, input.steps.map(s => s.label));
  if (missingRefs.length > 0) {
    throw new TRPCError({
      code: 'BAD_REQUEST',
      message: `Workflow creation failed: Step '${step.label}' references non-existent steps: ${missingRefs.join(', ')}.`,
      // ... potentially add more details for the client ...
    });
  }
}

// ... proceed with prisma.workflow.create() if validation passes ...

This ensures that any new workflow attempting to reference a non-existent step will immediately fail with a TRPCError (code BAD_REQUEST), guiding the user to correct their input before the workflow is even saved.

Building Resilience: Fixing the Systemic Root Cause

Preventing new bugs is good, but what about existing workflows that might already contain these broken references? We needed a more robust solution in the core workflow engine to handle these gracefully at runtime.

The changes landed in src/server/services/workflow-engine.ts:

  1. detectMissingStepRefs() Helper: We extracted the logic for identifying broken references into a reusable helper function (a sketch follows right after this list).

  2. Smarter resolvePrompt(): The core resolvePrompt() function, responsible for injecting dynamic content into step prompts, was enhanced:

    • If detectMissingStepRefs() found broken references, instead of silently returning an empty string, it would now issue a warning and, crucially, fall back to auto-context injection. This means it would inject the outputs of all previous completed steps into the current prompt. This provides the LLM with some relevant context, rather than nothing, making it less likely to hallucinate wildly.
    • The warning message itself was improved: [WARNING: Step "X" not found. Available steps: Y, Z, ...] rather than a silent "".
    • These warnings are now yielded as progress events before both human review steps and LLM execution steps, so users can be aware of potential context issues.
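
The helper itself is simple. A hedged sketch of what it might look like (the exact names and regex are illustrative, not the real implementation):

typescript
// Conceptual sketch of the helper; the real implementation may differ in details.
// Scans a prompt template for {{steps.<label>.content}} references and returns
// every referenced label that doesn't match an existing step label.
function detectMissingStepRefs(promptTemplate: string, stepLabels: string[]): string[] {
  const referenced = [...promptTemplate.matchAll(/\{\{steps\.(.+?)\.content\}\}/g)].map(m => m[1]);
  const known = new Set(stepLabels);
  return [...new Set(referenced)].filter(label => !known.has(label));
}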

Here's a conceptual look at the refined resolvePrompt logic:

typescript
// src/server/services/workflow-engine.ts (conceptual)

function resolvePrompt(promptTemplate: string, currentStep: WorkflowStep, allSteps: WorkflowStep[], previousStepOutputs: Record<string, string>): string {
  const missingRefs = detectMissingStepRefs(promptTemplate, allSteps.map(s => s.label));

  if (missingRefs.length > 0) {
    // Emit a warning event (e.g., via a callback or event bus)
    emitProgressEvent({
      type: 'warning',
      message: `Prompt for step '${currentStep.label}' references non-existent steps: ${missingRefs.join(', ')}. Falling back to auto-context.`,
      details: { missingRefs, availableSteps: allSteps.map(s => s.label) }
    });

    // Fallback: Inject all previous step outputs as context
    const autoContext = Object.entries(previousStepOutputs)
      .map(([label, content]) => `--- Previous Step: ${label} ---\n${content}\n`)
      .join('\n');

    // Replace each broken reference with an explicit inline warning instead of a silent ""
    let annotated = promptTemplate;
    for (const ref of missingRefs) {
      annotated = annotated.replaceAll(
        `{{steps.${ref}.content}}`,
        `[WARNING: Step "${ref}" not found. Available steps: ${allSteps.map(s => s.label).join(', ')}]`
      );
    }

    return `${autoContext}\n\n${annotated}`; // Prepend context, keep the rest of the template
  }

  // Original logic: Safely resolve all existing references
  let resolved = promptTemplate;
  // ... (logic to replace {{steps.X.content}} with actual content) ...

  return resolved;
}

This dual approach means we catch errors early (at creation) and handle existing issues gracefully (at runtime), making the system much more robust.

Lessons from the Trenches: Developer Gotchas

Even with a clear path, the journey to a fix is rarely smooth. Here are a few "pain log" entries that turned into valuable lessons:

  • Prisma Inline Queries are Tricky:

    • The Problem: Trying to run complex Prisma queries directly in the shell using npx tsx -e '...'.
    • The Failure: Shell escaping issues, especially with !== and template literals, led to esbuild parse errors. It's a syntactic minefield.
    • The Workaround: For anything non-trivial, create a temporary .ts script file (e.g., scripts/temp-query.ts) within the project root, run it with npx tsx scripts/temp-query.ts, and then delete it (see the sketch after this list). Module resolution works, and you avoid shell quoting hell.
    • Lesson: Save yourself the headache. Complex scripts belong in files, not inline.
  • Data Model Clarity is Key (Prisma templateId):

    • The Problem: Attempting prisma.workflow.findFirst({ select: { templateId: true } }).
    • The Failure: templateId doesn't exist directly on the Workflow model.
    • The Workaround: Realized that workflow template information lives in the WorkflowTemplate table; a Workflow instance only carries a foreign key to it when it was created from a template, and is never itself a template.
    • Lesson: Always double-check your data model and relationships. A Workflow is an instance, a WorkflowTemplate is its blueprint.
  • Field Naming Matters (Prisma compareOutputs vs. alternatives):

    • The Problem: Querying for select: { compareOutputs: true } on workflowStep.
    • The Failure: compareOutputs doesn't exist. The field is actually alternatives (a JSON type), which holds comparison data, and checkpoint which holds review data.
    • The Workaround: Corrected the field name to alternatives.
    • Lesson: Be precise with field names. Even similar-sounding concepts can map to different fields in your schema. Review your schema.prisma often.
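
For the first gotcha, the throwaway-script pattern looks roughly like this. The model and field names below are assumptions based on our schema (workflowStep with label and prompt); adapt them to yours:

typescript
// scripts/temp-query.ts (temporary, delete after use)
// Run with: npx tsx scripts/temp-query.ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

async function main() {
  // Example ad-hoc query: list step labels and prompts so broken
  // {{steps.X.content}} references can be spotted by hand.
  const steps = await prisma.workflowStep.findMany({
    select: { id: true, label: true, prompt: true },
  });
  console.log(JSON.stringify(steps, null, 2));
}

main()
  .catch((e) => console.error(e))
  .finally(() => prisma.$disconnect());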

From Bug to Feature: Embracing Controlled Hallucination

This entire debugging journey led to an unexpected revelation: the "bug" behavior of an empty prompt leading to creative, if irrelevant, output isn't entirely useless. What if we could control this?

The idea for an /init-dream feature was born. Imagine intentionally giving our personas a vague or even empty prompt, coupled with some domain hints, and letting them "dream up" creative collaboration scenarios. This could be an incredible tool for:

  • Ideation and Brainstorming: Quickly generating diverse ideas for a new project or problem space.
  • Exploring Edge Cases: Discovering unexpected interactions between personas.
  • Creative Problem Solving: Pushing the boundaries of what our personas can generate.

We're now designing this feature. Key considerations include:

  • Parameters: Which personas to involve, optional domain hints, creativity temperature, desired output format (a rough sketch follows below this list).
  • Implementation: Could it be a new step type (dream), a specialized workflow template, or an entirely separate interface?
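
Nothing is finalized yet, but a first, purely speculative sketch of the parameter shape might look like this:

typescript
// Speculative sketch of /init-dream parameters; nothing here is implemented yet.
interface DreamStepConfig {
  personas: string[];          // which personas take part in the "dream"
  domainHints?: string[];      // optional nudges toward a domain
  temperature?: number;        // creativity dial, e.g. 0.9 for looser output
  outputFormat?: 'scenario' | 'brainstorm-list' | 'dialogue'; // desired shape of the result
}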

The key insight is that what was once a critical flaw – the LLM trying to be helpful with no input – can be harnessed when done intentionally and with guardrails.

What's Next? Solidifying the Foundation

With the core issue resolved and a new feature on the horizon, our immediate next steps involve enhancing the system further:

  1. Design /init-dream Feature: Flesh out the concept, parameters, and decide on the implementation strategy.
  2. Extend Validation: Add step cross-reference validation to the updateStep mutation as well, not just create. This ensures that even modified workflows remain valid.
  3. UI Indicators: Implement a visual warning in the workflow builder UI when a step contains broken references. This would provide immediate feedback to users.
  4. Re-run Existing Workflows: Inform users that the existing workflow 5d30703a-... (and any others with similar issues) will now benefit from the auto-context fallback, and they may want to re-run it for better results.

This journey from a critical hallucination bug to a potential "controlled dreaming" feature is a testament to the unexpected paths of software development. Sometimes, understanding why something breaks can lead you to build something even better.

json
{
  "thingsDone": [
    "Identified and fixed the root cause of LLM hallucination: broken step references in workflow prompts.",
    "Implemented pre-validation in workflow creation (tRPC `create` mutation) to prevent new broken references.",
    "Enhanced workflow engine's `resolvePrompt` to gracefully handle existing broken references by falling back to auto-context injection and emitting warnings.",
    "Refactored `PersonaPicker` and integrated it into multiple components.",
    "Restyled `ReportsTab` with themed designs and enhanced cards."
  ],
  "pains": [
    "Shell escaping issues with inline `npx tsx -e` commands for Prisma queries.",
    "Misunderstanding Prisma data model relationships (workflow vs. workflow template).",
    "Incorrect field naming in Prisma queries (`compareOutputs` vs. `alternatives`)."
  ],
  "successes": [
    "Resolved a critical BLOCKER bug.",
    "Improved system robustness by adding both compile-time (validation) and runtime (graceful fallback) error handling.",
    "Discovered and designed a potential new feature (`/init-dream`) from the observed 'bug' behavior.",
    "Enhanced developer experience for debugging Prisma queries (lesson learned about temp scripts)."
  ],
  "techStack": [
    "TypeScript",
    "Node.js",
    "Prisma",
    "tRPC",
    "Next.js",
    "LLMs (Large Language Models)",
    "Workflow Engine"
  ]
}