Leveling Up AI Workflows: Introducing Personas & Multi-Provider A/B Testing
We've just pushed a major update, bringing powerful new capabilities to our AI workflows: dynamic, persona-driven expert teams and robust multi-provider A/B comparison for critical steps. Dive into how we built it and the lessons we learned along the way.
Building sophisticated AI applications often feels like navigating a constantly evolving landscape. As we push the boundaries of what our systems can achieve, two critical needs consistently emerge: precision in output and reliability across models. Our latest development sprint tackled these head-on, delivering features that empower users with unprecedented control and insight: workflow personas and multi-provider A/B comparison.
This post dives into the recent session where we implemented these features, sharing the technical details, the challenges we overcame, and the immediate impact on our platform.
Bringing Personalities to AI: Persona-Driven Expert Teams
Imagine orchestrating a team of highly specialized AI experts for every task. That's the power of workflow personas. Instead of generic prompts, you can now define specific roles—like a "Creative Strategist" or a "Technical Auditor"—each with its own system prompt, guiding the AI to adopt a particular style, tone, and focus.
The Why
In complex AI pipelines, the initial "expert team assembly" step is crucial. By injecting predefined personas, we ensure that the AI's foundational understanding of its role is consistent and tailored to the task at hand. This leads to more precise, relevant, and consistently high-quality outputs.
How We Built It
- Schema Evolution: We introduced `personaIds` to our `Workflow` schema, allowing users to select one or more personas for a workflow.
- Dynamic Prompt Injection: The core logic lives in `workflow-engine.ts`, where `loadPersonaSystemPrompts()` dynamically fetches and injects the selected personas' system prompts into the AI's context.
- User Experience:
  - A new persona picker UI was added to `workflows/new/page.tsx`, making it easy to assign personas when creating or updating a workflow.
  - Persona badges now appear on the workflow details page (`dashboard/workflows/[id]/page.tsx`), providing an at-a-glance view of the assigned experts.
- Team Assembly Enhancement: Crucially, all three of our initial "Step 0: Assemble the Expert Team" prompts (`deepPrompt`, `extensionPrompt`, `secPrompts`) were updated to defer to these injected personas. If personas are present, they take precedence, ensuring the expert team is assembled with their specific guidance. We also updated the default system prompts to align with the new persona-aware structure.
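To make the injection step concrete, here's a minimal sketch of how persona prompts can be resolved and prepended to a step's base prompt. The `Persona` shape, the in-memory `personaStore`, and `buildStepSystemPrompt` are illustrative stand-ins; the real `loadPersonaSystemPrompts()` in `workflow-engine.ts` reads from the database.

```typescript
// Sketch of persona prompt injection. The in-memory store below stands in
// for the real database lookup in workflow-engine.ts.
interface Persona {
  id: string;
  name: string;
  systemPrompt: string;
}

// Hypothetical stand-in for the persona table.
const personaStore = new Map<string, Persona>([
  ["p1", { id: "p1", name: "Creative Strategist", systemPrompt: "You are a Creative Strategist…" }],
  ["p2", { id: "p2", name: "Technical Auditor", systemPrompt: "You are a Technical Auditor…" }],
]);

// Resolve a workflow's personaIds into one combined system-prompt block.
function loadPersonaSystemPrompts(personaIds: string[]): string {
  return personaIds
    .map((id) => personaStore.get(id))
    .filter((p): p is Persona => p !== undefined)
    .map((p) => `## ${p.name}\n${p.systemPrompt}`)
    .join("\n\n");
}

// Personas, when present, are injected ahead of the step's own prompt
// so they take precedence over the default instructions.
function buildStepSystemPrompt(basePrompt: string, personaIds: string[]): string {
  const personaBlock = loadPersonaSystemPrompts(personaIds);
  return personaBlock ? `${personaBlock}\n\n${basePrompt}` : basePrompt;
}
```

With no personas selected, the step falls back to its unmodified default prompt, which keeps existing workflows behaving exactly as before.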
The Quest for Quality: Multi-Provider A/B Comparison
The world of Large Language Models (LLMs) is dynamic, with new models and providers emerging constantly. Each offers unique strengths and weaknesses. How do you choose the best one for a critical workflow step? Our new multi-provider A/B comparison feature provides the answer.
The Why
To ensure optimal performance and resilience, it's vital to compare how different LLMs handle the same input. This A/B testing capability allows users to run a single workflow step against multiple providers (e.g., Anthropic, OpenAI, Google, Ollama) simultaneously, then visually compare their outputs side-by-side to select the best fit.
How We Built It
- Step Configuration: We added `compareProviders` to the `WorkflowStep` configuration. This array specifies which providers should be invoked for a particular step.
- Forking Logic: Within the workflow engine, if `compareProviders` is present, the system "forks" the execution for that step, sending the same input to each specified provider.
- User Interface for Comparison:
  - A new compare-providers toggle was integrated into `SortableStepCard`, allowing users to activate this feature for any step.
  - The results are presented in an "alternatives block" with provider and model badges, clearly indicating which model generated which output.
  - "N providers" badges on step headers show at a glance when a step is configured for multi-provider comparison.
- Preventing Truncation: A key learning from early testing was that some outputs were being truncated by token limits. We bumped `maxTokens` from 8192 to 16384 for our core generation models (`deepExtend`, `deepWisdom`, `deepImprove`) in `src/lib/constants.ts` to accommodate longer, more detailed responses, especially when comparing multiple outputs.
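The forking logic reduces to fanning the same input out to every configured provider in parallel. This sketch assumes a generic `callProvider()` helper (a stand-in here; the real engine wires in each provider's SDK client) and shows the shape of the alternatives the UI then renders side-by-side:

```typescript
// Sketch of fork-and-compare step execution. callProvider is an
// illustrative stand-in for the real per-provider completion calls.
type Provider = "anthropic" | "openai" | "google" | "ollama";

interface StepResult {
  provider: Provider;
  output: string;
}

// Hypothetical stand-in: echoes the provider so results are distinguishable.
async function callProvider(provider: Provider, input: string): Promise<string> {
  return `[${provider}] response to: ${input}`;
}

// If compareProviders is set, run the same input through every provider
// in parallel and collect the alternatives for side-by-side display;
// otherwise fall back to a single default provider.
async function runStep(
  input: string,
  compareProviders?: Provider[],
): Promise<StepResult[]> {
  const providers: Provider[] =
    compareProviders && compareProviders.length > 0 ? compareProviders : ["anthropic"];
  return Promise.all(
    providers.map(async (provider) => ({
      provider,
      output: await callProvider(provider, input),
    })),
  );
}
```

Running the calls with `Promise.all` means the comparison step is only as slow as the slowest provider, not the sum of all of them.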
Under the Hood: Supporting Infrastructure
These features required robust backend and data model adjustments:
- tRPC Router for Personas: A dedicated tRPC router (`src/server/trpc/routers/personas.ts`) was created to handle `list` and `get` operations for personas, registered in `src/server/trpc/router.ts`.
- Workflow Router Updates: Our `src/server/trpc/routers/workflows.ts` was updated to handle `personaIds` during create/update/duplicate operations, and `compareProviders` in `stepConfigSchema` and the `steps.update` logic. We also bumped the `selectAlternative` maximum to 3, preparing for more sophisticated comparison options.
- Database Sync: All schema changes were pushed via `db:push`, keeping the database aligned with the new data models.
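As a rough, dependency-free sketch of the personas router's surface: the real file wraps these handlers in tRPC procedures with Zod input validation, but the `list`/`get` semantics look roughly like this (store and handler names here are illustrative stand-ins, not the actual implementation):

```typescript
// Dependency-free sketch of the personas router's list/get semantics.
// The real src/server/trpc/routers/personas.ts wraps equivalents of these
// in tRPC procedures; the in-memory array is an illustrative stand-in.
interface Persona {
  id: string;
  name: string;
  systemPrompt: string;
}

const personas: Persona[] = [
  { id: "p1", name: "Creative Strategist", systemPrompt: "…" },
  { id: "p2", name: "Technical Auditor", systemPrompt: "…" },
];

// list: return every persona, e.g. to populate the persona picker UI.
function listPersonas(): Persona[] {
  return personas;
}

// get: fetch a single persona by id, failing loudly if it does not exist.
function getPersona(id: string): Persona {
  const persona = personas.find((p) => p.id === id);
  if (!persona) throw new Error(`Persona not found: ${id}`);
  return persona;
}
```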
Lessons Learned & Overcoming Hurdles
No significant feature rollout is without its challenges. Here are a few key lessons from this sprint:
1. Type Safety vs. Flexibility with Zod Enums
- Challenge: When we tried to type `compareProviders` as `string[]` in our frontend `StepConfig`, TypeScript correctly flagged it as incompatible with our Zod enum union type (`"anthropic" | "openai" | ...`). Zod's enum expects specific literal values, not arbitrary strings.
- Workaround/Solution: We explicitly declared the type as a literal union of allowed providers: `("anthropic" | "openai" | "google" | "ollama")[]`. This maintained strict type safety while allowing the necessary flexibility.
- Insight: Sticking to strict typing, especially with schema validation libraries like Zod, is paramount. Sometimes the solution is to explicitly declare the union of allowed literals rather than fall back to a generic string array.
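In plain TypeScript, the fix looks like the sketch below. The runtime guard (`isProvider`/`toProviders`, illustrative names we've added here) is useful when values arrive from untyped UI state and need narrowing before they satisfy the literal union:

```typescript
// Sketch of the typing fix: compareProviders as a literal union array
// rather than string[], mirroring the server-side Zod enum.
type Provider = "anthropic" | "openai" | "google" | "ollama";

interface StepConfig {
  // string[] here would not assign to the Zod-inferred enum array type.
  compareProviders?: Provider[];
}

const PROVIDERS: readonly Provider[] = ["anthropic", "openai", "google", "ollama"];

// Runtime type guard: narrows an arbitrary string to the Provider union.
function isProvider(value: string): value is Provider {
  return (PROVIDERS as readonly string[]).includes(value);
}

// Filter untyped UI input down to only the allowed literal values.
function toProviders(values: string[]): Provider[] {
  return values.filter(isProvider);
}
```

Because `isProvider` is a type predicate, `values.filter(isProvider)` narrows the result to `Provider[]` with no cast, so the strictly-typed `StepConfig` is satisfied end to end.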
2. Surgical Edits in Template Literals
- Challenge: Our in-house "Edit" tool, designed for targeted string modifications, struggled to match and replace content spanning template literal boundaries (e.g., content that included `${variable}`).
- Workaround/Solution: We learned to target smaller, unique substrings within the template literal. This let the tool find precise matches without getting confused by the dynamic parts of the string.
- Insight: When building tools for string manipulation, consider the nuances of template literals and dynamic content. Designing for atomic, context-aware edits can save significant headaches.
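The workaround boils down to anchoring each edit on a literal fragment that appears exactly once and never crosses a `${...}` boundary. A minimal sketch of that idea (`surgicalReplace` is an illustrative helper, not our actual tool):

```typescript
// Sketch of the "smaller unique substring" workaround: anchor the edit
// on a literal fragment that occurs exactly once in the source, so the
// match never has to span a ${...} interpolation.
function surgicalReplace(source: string, uniqueAnchor: string, replacement: string): string {
  const first = source.indexOf(uniqueAnchor);
  if (first === -1) throw new Error("anchor not found");
  if (source.lastIndexOf(uniqueAnchor) !== first) throw new Error("anchor is not unique");
  return source.slice(0, first) + replacement + source.slice(first + uniqueAnchor.length);
}
```

Failing loudly on a missing or non-unique anchor is the important part: a silent best-effort match is exactly how edits land in the wrong place.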
3. Dev Server Shenanigans
- Challenge: Running multiple development servers simultaneously often led to port conflicts and, occasionally, stale styles or unexpected behavior.
- Workaround/Solution: Our go-to fix became a thorough clean-up and restart: kill all processes on port 3000, clear the `.next` cache, and then perform a clean restart. This ensured a fresh, unconflicted environment.
- Insight: A reliable development environment setup is crucial. We're now formalizing this into a `scripts/dev-start.sh` script to streamline the process, ensuring consistency for the entire team.
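For reference, a minimal sketch of what that `scripts/dev-start.sh` might look like, assuming a Next.js dev server on port 3000 (the exact script we ship may differ):

```shell
#!/usr/bin/env bash
# Sketch of scripts/dev-start.sh: clean up port 3000, clear the Next.js
# cache, and restart the dev server from a known-good state.
set -euo pipefail

PORT="${PORT:-3000}"

# Kill anything still holding the dev port, if present.
pids="$(lsof -ti ":${PORT}" || true)"
if [ -n "${pids}" ]; then
  kill ${pids} || true
fi

# Clear the Next.js build cache to avoid stale styles and chunks.
rm -rf .next

# Clean restart.
npm run dev
```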
What's Next?
With these foundational features landed, we're excited to see them in action. Our immediate next steps include:
- Finalizing and sharing the `scripts/dev-start.sh` script.
- Thoroughly testing workflows with linked personas to verify expert team injection and output quality.
- Validating multi-provider A/B comparison to ensure side-by-side results are accurate and informative.
- Re-running our deep build pipelines with the bumped token limits to confirm no more truncation issues.
These updates represent a significant leap forward in making our AI workflows more intelligent, controllable, and reliable. We're eager to continue iterating and building even more powerful tools!