
Unlocking Self-Repairing LLM Workflows: Our Journey with RAG Injection

We put RAG injection to the test to see if it could create self-repairing LLM-driven workflows. The results were clear: RAG significantly enhances consistency and problem-solving, turning potential failures into self-corrections.

Tags: LLM, RAG, AI Engineering, Prompt Engineering, A/B Testing, Workflow Automation, Debugging

The promise of large language models (LLMs) is immense, but their inherent tendency towards "hallucination" or simply missing crucial context can be a significant hurdle in complex, multi-step workflows. How do you build systems that are not just powerful, but also robust and, ideally, self-correcting?

This was the central question driving our latest engineering sprint. Our goal: to prove that injecting relevant, factual information via Retrieval Augmented Generation (RAG) directly into our group workflow prompt builders could create a truly self-repairing system. We're thrilled to report: it works.

The Experiment: A/B/C Testing for Self-Repair

Our hypothesis was simple: give the LLM the right knowledge at the right time, and it will not only perform better but also identify and fix its own shortcomings. To test this, we devised an A/B/C experiment using our BRbase workflow, a complex, multi-step process that generates specific outputs based on persona and provider configurations (in this case, Google Gemini 2.5 Pro, augmented by specialized models such as NyxCore and Athena).

Here's how we set up our runs:

  • Run A (27dae5fc): The Baseline (Broken RAG)
    • This run represented our "before" state. Due to an existing bug, the RAG content (axiomContent) was never actually loaded, meaning the LLM received 0 chunks of external knowledge.
  • Run B (230085a1): RAG Loaded, But Not Wired
    • In this intermediate step, we ensured axiomContent was correctly loaded into our chainCtx (the context object passed through our workflow engine). However, a critical bug meant this content wasn't actually being passed down to the individual prompt builders that generate the LLM's instructions. A subtle but crucial miss!
  • Run C (2a3562e8): Fully Injected RAG
    • This was the moment of truth. After identifying and fixing the bug from Run B, we ensured that chainCtx.axiomContent was properly wired through all three critical call sites in src/server/services/workflow-engine.ts (around lines 2597, 2730, and 2829). Specifically, it was now correctly passed to buildGroupItemPromptInput(), buildConsistencyCheckInput(), and buildImplementationPromptInput() in src/server/services/implementation-prompt-generator.ts, so the LLM's prompt builders finally received the 257 relevant BRbase chunks we'd retrieved. A minimal sketch of the wiring appears below.
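To make the fix concrete, here's a minimal sketch of that wiring. The three builder names come from implementation-prompt-generator.ts; everything else (ChainContext, BuilderInput, runStep, and their shapes) is a simplified assumption for illustration, not our exact production code.

```typescript
// Sketch of the Run C fix: the RAG content must be forwarded explicitly
// at every builder call site, not just loaded into the chain context.
interface ChainContext {
  axiomContent?: string; // retrieved RAG chunks, loaded once per run
}

interface BuilderInput {
  persona: string;
  axiomContent?: string; // the field the Run B call sites silently omitted
}

// Stand-ins for the three builders in implementation-prompt-generator.ts.
const buildGroupItemPromptInput = (input: BuilderInput) => ({ ...input });
const buildConsistencyCheckInput = (input: BuilderInput) => ({ ...input });
const buildImplementationPromptInput = (input: BuilderInput) => ({ ...input });

function runStep(chainCtx: ChainContext, persona: string) {
  // The Run B bug: these calls dropped axiomContent, so the builders
  // assembled prompts with zero retrieved chunks.
  const groupItem = buildGroupItemPromptInput({ persona, axiomContent: chainCtx.axiomContent });
  const consistency = buildConsistencyCheckInput({ persona, axiomContent: chainCtx.axiomContent });
  const implementation = buildImplementationPromptInput({ persona, axiomContent: chainCtx.axiomContent });
  return { groupItem, consistency, implementation };
}
```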

The Breakthrough: Self-Correction in Action

The results from Run C were genuinely exciting.

When we ran our consistency checks, Run C, with full RAG injection, actually identified more issues than Runs A or B (3 critical + 5 warnings, versus 2 critical in the earlier runs). That might sound counterintuitive: is finding more issues a good thing? Yes. It means that with the RAG context, the system had a more robust model of what "correct" looked like, which translated directly into better error detection.

But here's the kicker: the subsequent "implementation prompts" within the workflow, now empowered by the RAG content, self-resolved all identified issues. This is the essence of a self-repairing system. The LLM, given the right context, was able to detect its own potential missteps and course-correct autonomously.
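Conceptually, the loop looks like the sketch below. This is a hedged illustration, assuming a simple issue list returned by the consistency check; the function names and shapes are hypothetical stand-ins, and our real pipeline is more involved.

```typescript
// Conceptual self-repair loop: the consistency check flags issues, and the
// RAG-grounded regeneration step resolves them. Names are illustrative.
interface Issue {
  severity: "critical" | "warning";
  description: string;
}

async function selfRepair(
  draft: string,
  axiomContent: string,
  checkConsistency: (draft: string, rag: string) => Promise<Issue[]>,
  regenerate: (draft: string, issues: Issue[], rag: string) => Promise<string>,
  maxRounds = 3,
): Promise<string> {
  let current = draft;
  for (let round = 0; round < maxRounds; round++) {
    const issues = await checkConsistency(current, axiomContent);
    if (issues.length === 0) return current; // nothing left to repair
    // Feed the flagged issues plus the RAG context back into generation;
    // in Run C, this step resolved every issue the checker raised.
    current = await regenerate(current, issues, axiomContent);
  }
  return current;
}
```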

The fix and a detailed report have been committed (de12a86 and 9a4a60f, respectively) and are now live in production. You can find the full report at docs/reports/2026-03-18-axiom-injection-ab-test.md.

Lessons Learned: Navigating the Production Rapids

No engineering journey is without its challenges. Here are a couple of key lessons we picked up along the way:

The SSH Heredoc Dance for psql

When making direct database updates on production (necessary for aligning our workflow configs for the A/B/C test), we ran into a classic quoting conundrum. Trying to use SSH heredocs for psql queries on a remote machine proved trickier than expected, especially with escaped double quotes inside single-quoted SSH commands.

After several attempts, the reliable pattern emerged: piping a local heredoc to the remote SSH command's stdin.

```bash
# Quoting 'EOF' prevents local variable expansion in the heredoc body;
# -T skips pseudo-tty allocation so stdin is piped cleanly to the remote command.
ssh -T root@your.prod.server 'docker exec -i nyxcore-postgres-1 psql -U nyxcore -d nyxcore' << 'EOF'
-- Your multi-line psql query goes here
UPDATE workflow_steps SET ... WHERE ...;
SELECT * FROM workflow_steps LIMIT 1;
EOF
```

Self-correction: also remember that Prisma model names don't always map directly to production table names. Our step_templates model was actually the workflow_steps table in the database, a small detail that can cause big headaches!

Anthropic API Credits: The Unsung Hero of Fallbacks

During Run C, we hit an unexpected snag: our Anthropic API credit balance was too low. This impacted our Haiku-based side features, specifically the per-step consistency checks and step digests. While the main workflow runs on Google, these auxiliary checks needed Anthropic.

Thankfully, our system is designed with fallbacks. The consistency check gracefully degraded, attempting Anthropic, then Google, then OpenAI until it found a working provider. This non-fatal issue highlighted the importance of robust credit monitoring and multi-provider strategies for critical features. Time to top up those credits!
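The graceful-degradation pattern itself is simple; here's a minimal sketch. The provider order comes from the post, but the callProvider shape and error handling are assumptions for illustration, not our production code.

```typescript
// Try each provider in priority order and return the first success.
type Provider = "anthropic" | "google" | "openai";

async function runConsistencyCheck(
  prompt: string,
  callProvider: (provider: Provider, prompt: string) => Promise<string>,
  order: Provider[] = ["anthropic", "google", "openai"],
): Promise<string> {
  let lastError: unknown;
  for (const provider of order) {
    try {
      return await callProvider(provider, prompt);
    } catch (err) {
      // e.g. a low-credit error from Anthropic: log it and fall through
      // to the next provider instead of failing the whole workflow.
      lastError = err;
      console.warn(`consistency check failed on ${provider}:`, err);
    }
  }
  throw new Error(`all providers failed: ${String(lastError)}`);
}
```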

What's Next on the Horizon?

This successful experiment has opened up several exciting avenues for further development:

  1. Top up Anthropic API credits: A practical, immediate next step to restore full functionality to our Haiku-based features.
  2. Expand RAG to earlier workflow steps: Currently, axiomContent primarily benefits the implementation prompts. We're considering adding {{axiom}} to the Group Analysis step template (step 0) to give the LLM a stronger foundation from the very beginning.
  3. Integrate per-step scoring: While we have post-workflow audit scoring (workflowInput.selfAudit), wiring per-step ipcha scoring directly into the workflow engine could provide invaluable real-time feedback and allow for dynamic adjustments.
  4. Feedback loops for low-scoring prompts: The consistency check identified action points scoring 4-6/10. We can use these scores to automatically trigger re-generation loops for prompts that fall below our quality thresholds (see the sketch after this list).
  5. Real-world validation: The ultimate test is to run the actual BRbase feature implementation from the prompts generated in Run C. This will validate the real-world quality and effectiveness of our RAG-enhanced, self-repairing system.
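As a sketch of idea 4, a score-gated regeneration loop might look like the following. The 7/10 threshold, the retry cap, and the function shapes are all assumptions for illustration.

```typescript
// Hypothetical feedback loop: re-generate any prompt whose consistency
// score falls below a quality threshold, up to a bounded number of retries.
interface ScoredPrompt {
  text: string;
  score: number; // consistency score out of 10
}

async function enforceQuality(
  prompts: ScoredPrompt[],
  regenerate: (p: ScoredPrompt) => Promise<ScoredPrompt>,
  threshold = 7,
  maxAttempts = 2,
): Promise<ScoredPrompt[]> {
  return Promise.all(
    prompts.map(async (p) => {
      let current = p;
      // The 4-6/10 action points from Run C would be retried here.
      for (let i = 0; i < maxAttempts && current.score < threshold; i++) {
        current = await regenerate(current);
      }
      return current;
    }),
  );
}
```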

This session was a significant leap forward in building more intelligent, resilient, and autonomous AI-driven workflows. By strategically injecting context, we're not just making LLMs smarter; we're teaching them to be more aware and capable of repairing their own outputs, paving the way for truly robust AI systems.