Leveling Up AI Workflows: Introducing Personas & Multi-Provider A/B Testing
We've just pushed a major update, bringing powerful new capabilities to our AI workflows: dynamic, persona-driven expert teams and robust multi-provider A/B comparison for critical steps. Dive into how we built it and the lessons we learned along the way.
Building sophisticated AI applications often feels like navigating a constantly evolving landscape. As we push the boundaries of what our systems can achieve, two critical needs consistently emerge: precision in output and reliability across models. Our latest development sprint tackled these head-on, delivering features that empower users with unprecedented control and insight: workflow personas and multi-provider A/B comparison.
This post dives into the recent session where we implemented these features, sharing the technical details, the challenges we overcame, and the immediate impact on our platform.
Bringing Personalities to AI: Persona-Driven Expert Teams
Imagine orchestrating a team of highly specialized AI experts for every task. That's the power of workflow personas. Instead of generic prompts, you can now define specific roles—like a "Creative Strategist" or a "Technical Auditor"—each with its own system prompt, guiding the AI to adopt a particular style, tone, and focus.
The Why
In complex AI pipelines, the initial "expert team assembly" step is crucial. By injecting predefined personas, we ensure that the AI's foundational understanding of its role is consistent and tailored to the task at hand. This leads to more precise, relevant, and consistently high-quality outputs.
How We Built It
- Schema Evolution: We introduced `personaIds` to our `Workflow` schema, allowing users to select one or more personas for a workflow.
- Dynamic Prompt Injection: The core logic lives in `workflow-engine.ts`, where `loadPersonaSystemPrompts()` dynamically fetches and injects the selected personas' system prompts into the AI's context.
- User Experience:
  - A new persona picker UI was added to `workflows/new/page.tsx`, making it easy to assign personas when creating or updating a workflow.
  - Persona badges now appear on the workflow details page (`dashboard/workflows/[id]/page.tsx`), providing an at-a-glance view of the assigned experts.
- Team Assembly Enhancement: Crucially, all three of our initial "Step 0: Assemble the Expert Team" prompts (`deepPrompt`, `extensionPrompt`, `secPrompts`) were updated to defer to these injected personas. If personas are present, they take precedence, ensuring the expert team is assembled with their specific guidance. We also updated the default system prompts to align with the new persona-aware structure.
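To make the injection step concrete, here's a minimal sketch of how persona prompts can be resolved and prepended to a step's base prompt. The `Persona` shape, the in-memory `personaStore`, and `buildStepSystemPrompt` are illustrative stand-ins; the real `loadPersonaSystemPrompts()` in `workflow-engine.ts` reads from the database.

```typescript
// Sketch of persona prompt injection. The in-memory store below stands in
// for the real database lookup in workflow-engine.ts.
interface Persona {
  id: string;
  name: string;
  systemPrompt: string;
}

// Hypothetical stand-in for the persona table.
const personaStore = new Map<string, Persona>([
  ["p1", { id: "p1", name: "Creative Strategist", systemPrompt: "You are a Creative Strategist…" }],
  ["p2", { id: "p2", name: "Technical Auditor", systemPrompt: "You are a Technical Auditor…" }],
]);

// Resolve a workflow's personaIds into one combined system-prompt block.
function loadPersonaSystemPrompts(personaIds: string[]): string {
  return personaIds
    .map((id) => personaStore.get(id))
    .filter((p): p is Persona => p !== undefined)
    .map((p) => `## ${p.name}\n${p.systemPrompt}`)
    .join("\n\n");
}

// Personas, when present, are injected ahead of the step's own prompt
// so they take precedence over the default instructions.
function buildStepSystemPrompt(basePrompt: string, personaIds: string[]): string {
  const personaBlock = loadPersonaSystemPrompts(personaIds);
  return personaBlock ? `${personaBlock}\n\n${basePrompt}` : basePrompt;
}
```

With no personas selected, the step falls back to its unmodified default prompt, which keeps existing workflows behaving exactly as before.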
The Quest for Quality: Multi-Provider A/B Comparison
The world of Large Language Models (LLMs) is dynamic, with new models and providers emerging constantly. Each offers unique strengths and weaknesses. How do you choose the best one for a critical workflow step? Our new multi-provider A/B comparison feature provides the answer.
The Why
To ensure optimal performance and resilience, it's vital to compare how different LLMs handle the same input. This A/B testing capability allows users to run a single workflow step against multiple providers (e.g., Anthropic, OpenAI, Google, Ollama) simultaneously, then visually compare their outputs side-by-side to select the best fit.
How We Built It
- Step Configuration: We added `compareProviders` to the `WorkflowStep` configuration. This array specifies which providers should be invoked for a particular step.
- Forking Logic: Within the workflow engine, if `compareProviders` is present, the system "forks" the execution for that step, sending the same input to each specified provider.
- User Interface for Comparison:
  - A new compare-providers toggle was integrated into `SortableStepCard`, allowing users to activate this feature for any step.
  - The results are presented in an "alternatives block" with provider and model badges, clearly indicating which model generated which output.
  - "N providers" badges on step headers show at a glance when a step is configured for multi-provider comparison.
- Preventing Truncation: A key learning from early testing was that some outputs were being truncated by token limits. We bumped `maxTokens` from 8192 to 16384 for our core generation models (`deepExtend`, `deepWisdom`, `deepImprove`) in `src/lib/constants.ts` to accommodate longer, more detailed responses, especially when comparing multiple outputs.
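The forking logic reduces to fanning the same input out to every configured provider in parallel. This sketch assumes a generic `callProvider()` helper (a stand-in here; the real engine wires in each provider's SDK client) and shows the shape of the alternatives the UI then renders side-by-side:

```typescript
// Sketch of fork-and-compare step execution. callProvider is an
// illustrative stand-in for the real per-provider completion calls.
type Provider = "anthropic" | "openai" | "google" | "ollama";

interface StepResult {
  provider: Provider;
  output: string;
}

// Hypothetical stand-in: echoes the provider so results are distinguishable.
async function callProvider(provider: Provider, input: string): Promise<string> {
  return `[${provider}] response to: ${input}`;
}

// If compareProviders is set, run the same input through every provider
// in parallel and collect the alternatives for side-by-side display;
// otherwise fall back to a single default provider.
async function runStep(
  input: string,
  compareProviders?: Provider[],
): Promise<StepResult[]> {
  const providers: Provider[] =
    compareProviders && compareProviders.length > 0 ? compareProviders : ["anthropic"];
  return Promise.all(
    providers.map(async (provider) => ({
      provider,
      output: await callProvider(provider, input),
    })),
  );
}
```

Running the calls with `Promise.all` means the comparison step is only as slow as the slowest provider, not the sum of all of them.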
Under the Hood: Supporting Infrastructure
These features required robust backend and data model adjustments:
- tRPC Router for Personas: A dedicated tRPC router (`src/server/trpc/routers/personas.ts`) was created to handle `list` and `get` operations for personas, registered in `src/server/trpc/router.ts`.
- Workflow Router Updates: Our `src/server/trpc/routers/workflows.ts` was updated to handle `personaIds` during create/update/duplicate operations, and `compareProviders` in `stepConfigSchema` and the `steps.update` logic. We also bumped the `selectAlternative` maximum to 3, preparing for more sophisticated comparison options.
- Database Sync: All schema changes were pushed via `db:push`, keeping the database aligned with the new data models.
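As a rough, dependency-free sketch of the personas router's surface: the real file wraps these handlers in tRPC procedures with Zod input validation, but the `list`/`get` semantics look roughly like this (store and handler names here are illustrative stand-ins, not the actual implementation):

```typescript
// Dependency-free sketch of the personas router's list/get semantics.
// The real src/server/trpc/routers/personas.ts wraps equivalents of these
// in tRPC procedures; the in-memory array is an illustrative stand-in.
interface Persona {
  id: string;
  name: string;
  systemPrompt: string;
}

const personas: Persona[] = [
  { id: "p1", name: "Creative Strategist", systemPrompt: "…" },
  { id: "p2", name: "Technical Auditor", systemPrompt: "…" },
];

// list: return every persona, e.g. to populate the persona picker UI.
function listPersonas(): Persona[] {
  return personas;
}

// get: fetch a single persona by id, failing loudly if it does not exist.
function getPersona(id: string): Persona {
  const persona = personas.find((p) => p.id === id);
  if (!persona) throw new Error(`Persona not found: ${id}`);
  return persona;
}
```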
Lessons Learned & Overcoming Hurdles
No significant feature rollout is without its challenges. Here are a few key lessons from this sprint:
1. Type Safety vs. Flexibility with Zod Enums
- Challenge: When we tried to type `compareProviders` as `string[]` in our frontend `StepConfig`, TypeScript correctly flagged it as incompatible with our Zod enum union type (`"anthropic" | "openai" | ...`). Zod's enum expects specific literal values, not arbitrary strings.
- Workaround/Solution: We explicitly declared the type as a literal union of allowed providers: `("anthropic" | "openai" | "google" | "ollama")[]`. This maintained strict type safety while allowing the necessary flexibility.
- Insight: Sticking to strict typing, especially with schema validation libraries like Zod, is paramount. Sometimes the solution is to explicitly declare the union of allowed literals rather than fall back to a generic string array.
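In plain TypeScript, the fix looks like the sketch below. The runtime guard (`isProvider`/`toProviders`, illustrative names we've added here) is useful when values arrive from untyped UI state and need narrowing before they satisfy the literal union:

```typescript
// Sketch of the typing fix: compareProviders as a literal union array
// rather than string[], mirroring the server-side Zod enum.
type Provider = "anthropic" | "openai" | "google" | "ollama";

interface StepConfig {
  // string[] here would not assign to the Zod-inferred enum array type.
  compareProviders?: Provider[];
}

const PROVIDERS: readonly Provider[] = ["anthropic", "openai", "google", "ollama"];

// Runtime type guard: narrows an arbitrary string to the Provider union.
function isProvider(value: string): value is Provider {
  return (PROVIDERS as readonly string[]).includes(value);
}

// Filter untyped UI input down to only the allowed literal values.
function toProviders(values: string[]): Provider[] {
  return values.filter(isProvider);
}
```

Because `isProvider` is a type predicate, `values.filter(isProvider)` narrows the result to `Provider[]` with no cast, so the strictly-typed `StepConfig` is satisfied end to end.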
2. Surgical Edits in Template Literals
- Challenge: Our in-house "Edit" tool, designed for targeted string modifications, struggled to match and replace content spanning template literal boundaries (e.g., content that included `${variable}`).
- Workaround/Solution: We learned to target smaller, unique substrings within the template literal. This let the tool find precise matches without getting confused by the dynamic parts of the string.
- Insight: When building tools for string manipulation, consider the nuances of template literals and dynamic content. Designing for atomic, context-aware edits can save significant headaches.
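The workaround boils down to anchoring each edit on a literal fragment that appears exactly once and never crosses a `${...}` boundary. A minimal sketch of that idea (`surgicalReplace` is an illustrative helper, not our actual tool):

```typescript
// Sketch of the "smaller unique substring" workaround: anchor the edit
// on a literal fragment that occurs exactly once in the source, so the
// match never has to span a ${...} interpolation.
function surgicalReplace(source: string, uniqueAnchor: string, replacement: string): string {
  const first = source.indexOf(uniqueAnchor);
  if (first === -1) throw new Error("anchor not found");
  if (source.lastIndexOf(uniqueAnchor) !== first) throw new Error("anchor is not unique");
  return source.slice(0, first) + replacement + source.slice(first + uniqueAnchor.length);
}
```

Failing loudly on a missing or non-unique anchor is the important part: a silent best-effort match is exactly how edits land in the wrong place.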
3. Dev Server Shenanigans
- Challenge: Running multiple development servers simultaneously often led to port conflicts and, occasionally, stale styles or unexpected behavior.
- Workaround/Solution: Our go-to fix became a thorough clean-up and restart: kill all processes on port 3000, clear the `.next` cache, and then perform a clean restart. This ensured a fresh, unconflicted environment.
- Insight: A reliable development environment setup is crucial. We're now formalizing this into a `scripts/dev-start.sh` script to streamline the process, ensuring consistency for the entire team.
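For reference, a minimal sketch of what that `scripts/dev-start.sh` might look like, assuming a Next.js dev server on port 3000 (the exact script we ship may differ):

```shell
#!/usr/bin/env bash
# Sketch of scripts/dev-start.sh: clean up port 3000, clear the Next.js
# cache, and restart the dev server from a known-good state.
set -euo pipefail

PORT="${PORT:-3000}"

# Kill anything still holding the dev port, if present.
pids="$(lsof -ti ":${PORT}" || true)"
if [ -n "${pids}" ]; then
  kill ${pids} || true
fi

# Clear the Next.js build cache to avoid stale styles and chunks.
rm -rf .next

# Clean restart.
npm run dev
```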
What's Next?
With these foundational features landed, we're excited to see them in action. Our immediate next steps include:
- Finalizing and sharing the `scripts/dev-start.sh` script.
- Thoroughly testing workflows with linked personas to verify expert team injection and output quality.
- Validating multi-provider A/B comparison to ensure side-by-side results are accurate and informative.
- Re-running our deep build pipelines with the bumped token limits to confirm no more truncation issues.
These updates represent a significant leap forward in making our AI workflows more intelligent, controllable, and reliable. We're eager to continue iterating and building even more powerful tools!