Orchestrating AI Brilliance: Gearing Up for Expert Teams & LLM A/B Testing
We're diving into the next phase of our AI workflow engine, tackling two ambitious features: dynamic expert teams and multi-provider LLM comparisons for ultimate code generation.
Hey everyone,
It's been a busy sprint, and as we wrap up one set of features, the roadmap ahead is looking incredibly exciting. Today, I want to pull back the curtain on our latest development session, where we laid the groundwork for what I believe will be a significant leap forward for our AI-powered workflows: introducing dynamic expert teams and multi-provider LLM comparisons.
The Foundation: A Stable Launchpad
Before we jump into the shiny new stuff, it's worth acknowledging the recent wins that make these ambitious features possible. We just pushed commit 8594153 to main, which includes some crucial stability and visibility improvements:
- Robust Code Analysis Error Handling: One of the biggest pain points previously was how our code analysis handled errors. Now, batch and document-level errors are non-fatal: instead of crashing the whole process, we store the actual error messages in the database. This means more resilient analysis runs and better diagnostics when things don't go perfectly (a minimal sketch of the pattern follows this list).
- Real-time Process Visibility: We've shipped a new sidebar widget for "Active Processes." This little gem, powered by `src/components/layout/active-processes.tsx` and a new `dashboard.activeProcesses` tRPC query, gives users a live overview of what the system is currently working on. No more wondering if that long-running analysis is still chugging along! A sketch of what the query might look like also follows this list.
- Green CI/CD: All 71 tests are passing, and our typecheck is squeaky clean. This stability is non-negotiable as we build more complex systems.
- Operational Code Analysis: The core code analysis feature is humming along; the latest run on our test repository found 58 patterns and generated 3 docs.
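To make the error-handling change concrete, here's a minimal sketch of the pattern, not our actual implementation; `analyzeDocument` and `recordAnalysisError` are hypothetical stand-ins for the real analysis and persistence calls:

```typescript
interface AnalysisResult {
  docId: string;
  ok: boolean;
  error?: string;
}

// Process a batch where a single bad document no longer aborts the run.
async function analyzeBatch(
  docIds: string[],
  analyzeDocument: (id: string) => Promise<void>,
  recordAnalysisError: (id: string, message: string) => Promise<void>,
): Promise<AnalysisResult[]> {
  const results: AnalysisResult[] = [];
  for (const docId of docIds) {
    try {
      await analyzeDocument(docId);
      results.push({ docId, ok: true });
    } catch (err) {
      // Document-level failure: persist the real message for diagnostics
      // and keep going instead of crashing the whole batch.
      const message = err instanceof Error ? err.message : String(err);
      await recordAnalysisError(docId, message);
      results.push({ docId, ok: false, error: message });
    }
  }
  return results;
}
```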
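And here's roughly what the `dashboard.activeProcesses` query could look like on a tRPC v10-style router; the `ActiveProcess` shape and `processStore` accessor are assumptions for illustration:

```typescript
import { initTRPC } from "@trpc/server";

const t = initTRPC.create();

interface ActiveProcess {
  id: string;
  label: string;      // e.g. "Code analysis: test-repo"
  startedAt: Date;
  progress?: number;  // 0..1 when the step can report it
}

// Stand-in for wherever running processes are actually tracked.
declare const processStore: { listActive(): Promise<ActiveProcess[]> };

export const dashboardRouter = t.router({
  activeProcesses: t.procedure.query(async () => {
    // The sidebar widget polls this and renders one row per process.
    return processStore.listActive();
  }),
});
```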
These "done" items aren't just features; they're the solid ground upon which we can build more advanced capabilities without constantly looking over our shoulders.
The Next Frontier: Orchestrating AI Intelligence
With a stable base, our focus is now shifting to two major workflow enhancements designed to make our AI agents more intelligent, adaptable, and ultimately, more effective:
1. Dynamic Expert Teams: Tailoring AI to the Task
The Problem: While our current workflow system is powerful, the prompts often lean towards a "one-size-fits-all" approach. Different tasks require different expertise. A security review needs a "security expert" persona, while performance optimization needs a "performance engineer."
The Vision: Imagine being able to select a specific "expert team" or persona for any step in a workflow. This allows us to craft highly specialized prompts and direct the AI's focus precisely where it's needed.
Our Approach:
We'll be diving deep into `src/server/services/workflow-engine.ts` to understand how workflow steps, teams, and personas are currently structured. I'm particularly interested in the existing "expert team prompts" from commit 79d2445 to see how those teams are defined.
The core design challenge here is two-fold:
- Backend: How do we dynamically resolve the correct team/persona at runtime for a given workflow step? This likely involves extending the workflow step definition to include a `teamId` or similar (one possible shape is sketched after this list).
- Frontend: How do we expose this selection to the user? A dropdown in the step editor, allowing users to pick from available expert teams, seems like the most intuitive approach.
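As a design sketch only, here's one way the `teamId` extension and runtime resolution could look; `ExpertTeam`, the registry, and the fallback are all assumptions, not the current schema:

```typescript
interface ExpertTeam {
  id: string;
  name: string;          // e.g. "Security Review Team"
  systemPrompt: string;  // the persona/expertise injected into the step
}

interface WorkflowStepDef {
  id: string;
  prompt: string;
  teamId?: string;       // optional: absent means "use the default persona"
}

const DEFAULT_TEAM: ExpertTeam = {
  id: "generalist",
  name: "Generalist",
  systemPrompt: "You are a careful senior engineer.",
};

// Resolve the persona at runtime, falling back when no team is selected
// or the selected team no longer exists.
function resolveTeam(
  step: WorkflowStepDef,
  teams: Map<string, ExpertTeam>,
): ExpertTeam {
  if (!step.teamId) return DEFAULT_TEAM;
  return teams.get(step.teamId) ?? DEFAULT_TEAM;
}

// The engine could then prepend the team's system prompt to the step prompt.
function buildPrompt(step: WorkflowStepDef, teams: Map<string, ExpertTeam>) {
  const team = resolveTeam(step, teams);
  return `${team.systemPrompt}\n\n${step.prompt}`;
}
```

On the frontend, the same `teams` registry would feed the dropdown in the step editor, so the list of selectable personas stays in one place.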
This enhancement promises to unlock a new level of nuance and effectiveness in our AI-driven processes.
2. Multi-Provider A/B Comparison for Code Prompts
The Problem: The world of Large Language Models (LLMs) is rapidly evolving, with new providers and models emerging constantly. Different models excel at different tasks, and even the same model can produce varying outputs. Relying on a single prompt to a single provider for critical steps (like generating final code) can be limiting. How do you know you're getting the best output?
The Vision: For critical steps, especially the final "Generate Code Prompt" step, we want to query multiple LLM providers (or even multiple models from the same provider) in parallel, then present the results side-by-side for human review and selection.
Our Approach:
We already have a precedent for managing alternatives with `generateCount`, `selectedIndex`, and `alternatives` on our `WorkflowStep` objects. This provides a good starting point for storing multiple outputs.
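For orientation, here's the relevant slice of that shape as I understand it; the three field names exist today, but their exact types here are my assumptions:

```typescript
interface WorkflowStepAlternatives {
  generateCount: number;         // how many outputs to produce for this step
  alternatives: string[];        // the generated candidates
  selectedIndex: number | null;  // which candidate the user picked, if any
}

// The A/B feature can reuse this directly: one alternative per provider,
// with selectedIndex recording the human's choice.
function selectedOutput(step: WorkflowStepAlternatives): string | undefined {
  return step.selectedIndex === null
    ? undefined
    : step.alternatives[step.selectedIndex];
}
```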
The new design will focus on:
- Backend Parallel Execution: The workflow engine will need to fan out requests to N providers concurrently for a given prompt, collecting all their responses (see the sketch after this list).
- Frontend A/B Comparison UI: This is where the magic happens for the user. We'll need a dedicated view that renders the outputs from each provider side-by-side, allowing for easy comparison.
- Intuitive Selection & Handoff: Once the user picks the "winning" output, we need a clear UI to confirm their selection and seamlessly pass that chosen output to the next stage, typically a coding instance.
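Here's a minimal fan-out sketch under stated assumptions: `ProviderClient` is a hypothetical wrapper around each LLM API, and the real engine would plug in its own clients:

```typescript
interface ProviderClient {
  name: string;
  complete(prompt: string): Promise<string>;
}

interface ProviderResult {
  provider: string;
  ok: boolean;
  output?: string;
  error?: string;
}

// Query N providers concurrently. One slow or failing provider must not
// sink the others, hence Promise.allSettled rather than Promise.all.
async function fanOut(
  prompt: string,
  providers: ProviderClient[],
): Promise<ProviderResult[]> {
  const settled = await Promise.allSettled(
    providers.map((p) => p.complete(prompt)),
  );
  return settled.map((res, i) => {
    const provider = providers[i].name;
    return res.status === "fulfilled"
      ? { provider, ok: true, output: res.value }
      : { provider, ok: false, error: String(res.reason) };
  });
}
```

The resulting array maps naturally onto the existing `alternatives` field, one entry per provider, with `selectedIndex` recording the winner the user picks in the comparison UI.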
This feature will empower users to leverage the strengths of various LLMs, mitigate the risks of single-point-of-failure models, and ultimately ensure higher quality, human-curated outputs.
Lessons from the "Pain Log": The Power of Preparation
Our "Pain Log" for this session was surprisingly empty: "No major issues encountered." This isn't a sign of complacency; it's a testament to the hard work put into the previous sprint.
Actionable Takeaway: A smooth development session, free of critical blockers, often means the preceding work was robust, well-tested, and thoughtfully designed. The effort we put into fixing error handling and enhancing visibility has paid dividends, allowing us to focus entirely on planning the next big thing rather than fighting fires. It underscores the importance of a stable foundation and addressing technical debt proactively.
The Road Ahead: Design, Explore, Implement
Here’s our immediate roadmap, broken down into logical phases:
- Deep Dive & Exploration:
  - Explore the existing workflow system: `src/server/services/workflow-engine.ts`, workflow steps, teams/personas.
  - Understand the current "expert team prompts" from commit 79d2445 and how teams are structured.
  - Understand the existing multi-output alternatives system (`generateCount`, `selectedIndex`, `alternatives` on `WorkflowStep`).
- Design Phase:
  - Team Selection: Design how to make teams selectable per workflow step (likely a dropdown in the step editor UI).
  - A/B Comparison: Design the "Generate Code Prompt" final step to query N providers and render results side-by-side.
  - Selection UI: Design the UI for picking the winning output and passing it to the coding instance.
- Implementation:
  - Backend: Implement team resolution logic in the workflow engine and the multi-provider parallel execution.
  - Frontend: Implement the team selector in the step editor, the A/B comparison view, and the selection/handoff UI.
It's a full-stack effort that promises to push the boundaries of what our AI workflows can achieve. I'm incredibly excited about the potential these features hold for making our system more intelligent, flexible, and powerful.
Stay tuned for updates as we bring these ideas to life!
{"thingsDone":[
"Fixed code analysis error handling (batch/doc errors non-fatal, messages stored)",
"Created sidebar active processes widget (src/components/layout/active-processes.tsx + dashboard.activeProcesses tRPC query)",
"Ensured all 71 tests pass and typecheck is clean",
"Confirmed code analysis feature is fully operational (58 patterns found, 3 docs generated on test repo)"
],"pains":[
"No major issues encountered in this specific handoff session, highlighting the success of previous stability work."
],"successes":[
"Achieved a stable and clean codebase, ready for new feature development.",
"Successfully planned two major workflow enhancements: expert teams and multi-provider A/B testing.",
"Established a clear, actionable roadmap for design and implementation."
],"techStack":[
"TypeScript",
"tRPC",
"React (for frontend components like active-processes.tsx)",
"Node.js (for backend services like workflow-engine.ts)",
"Git",
"PostgreSQL (or similar DB for storing error messages, workflow state)",
"LLM APIs (for multi-provider comparison)"
]}