Leveling Up Our AI Workflows: Personas, A/B Tests, and Token Bumps
We just wrapped a significant development session, shipping new features like AI workflow personas, multi-provider A/B comparisons, and crucial performance boosts, all while learning valuable lessons about TypeScript, tooling, and dev server management.
Just finished a pretty intense development sprint, and it's always good to pause, reflect, and document the journey. This session was all about making our AI workflow engine smarter, more robust, and more developer-friendly. We tackled everything from giving our AI agents distinct 'personalities' to enabling side-by-side model comparisons, and even wrestling with some token limit dragons.
Here's a breakdown of what landed, what we learned, and where we're headed.
Empowering AI Agents with Personas
One of the big goals was to allow users to define personas for their AI workflows. Imagine having an "Expert Coder" persona, a "Creative Writer" persona, or a "Security Auditor" persona that can be injected into the AI's system prompt. This allows our AI agents to adapt their tone, knowledge, and focus dynamically based on the task at hand.
How We Built It:
- Core Data Model: Added `personaIds: string[]` to our `Workflow` schema. This allows a workflow to reference multiple pre-defined personas.
- Prompt Injection Logic: Implemented `loadPersonaSystemPrompts()` in `workflow-engine.ts`. This function now fetches the system prompts associated with the selected personas and injects them into the initial "Assemble the Expert Team" steps (`deepPrompt`, `extensionPrompt`, `secPrompts`). This ensures the AI's initial setup is persona-aware.
- UI Integration: Added a persona picker to `workflows/new/page.tsx`, allowing users to select and assign personas when creating or editing a workflow. Persona badges also now appear on `dashboard/workflows/[id]/page.tsx` for quick visual context.
- API Layer: Created a new tRPC router, `src/server/trpc/routers/personas.ts`, with `list` and `get` endpoints to manage our persona definitions. This was then registered in `src/server/trpc/router.ts`. Our `workflows.ts` router was also updated to handle `personaIds` in create/update/duplicate operations.
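To picture the injection step, here's a minimal sketch of the idea behind `loadPersonaSystemPrompts()`. The `Persona` shape, the in-memory store, and the `buildSystemPrompt` helper are hypothetical stand-ins; the real implementation in `workflow-engine.ts` resolves personas from the database:

```typescript
// Sketch of persona prompt injection. The Persona shape and in-memory
// store are stand-ins for the real DB-backed lookup.
interface Persona {
  id: string;
  name: string;
  systemPrompt: string;
}

const personaStore = new Map<string, Persona>([
  ["expert-coder", {
    id: "expert-coder",
    name: "Expert Coder",
    systemPrompt: "You are a meticulous senior engineer.",
  }],
  ["security-auditor", {
    id: "security-auditor",
    name: "Security Auditor",
    systemPrompt: "You review every design for vulnerabilities.",
  }],
]);

// Resolve persona ids to their system prompts, skipping unknown ids.
function loadPersonaSystemPrompts(personaIds: string[]): string[] {
  return personaIds
    .map((id) => personaStore.get(id)?.systemPrompt)
    .filter((p): p is string => Boolean(p));
}

// Prepend persona prompts to a step's base system prompt.
function buildSystemPrompt(basePrompt: string, personaIds: string[]): string {
  return [...loadPersonaSystemPrompts(personaIds), basePrompt].join("\n\n");
}
```

Prepending (rather than appending) keeps the persona framing in front of the step's task instructions, so the model adopts the role before reading the task.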
The result? Our AI agents can now truly embody different roles, making our workflows far more flexible and powerful.
The Quest for AI Model Certainty: A/B Comparison
In the world of LLMs, choosing the right provider and model for a specific task is often a guessing game. We wanted to make that process data-driven. Enter multi-provider A/B comparison.
The Implementation:
- Step-Level Configuration: Introduced `compareProviders: string[]` (though with a specific type workaround, more on that in "Lessons Learned") on `WorkflowStep`. This flag tells our engine to run a particular step against multiple specified providers.
- Forking Logic: When `compareProviders` is active, the `workflow-engine` now forks the execution of that step, sending the same prompt to each selected provider.
- UI for Comparison: We updated the `SortableStepCard` to include a "Compare Providers" toggle. When activated, the alternatives block in `dashboard/workflows/[id]/page.tsx` now displays the outputs from different providers side-by-side, complete with provider+model badges. This makes evaluating and selecting the best output incredibly straightforward. We also bumped the `selectAlternative` max to 3, giving us more room for comparison.
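The forking logic boils down to fanning the same prompt out to each selected provider in parallel. A simplified sketch — the `ProviderCall` clients here are stubs standing in for the real vendor SDK wrappers:

```typescript
// Sketch of the compareProviders fork. The clients record is a stub;
// the real engine dispatches to the Anthropic/OpenAI/Google/Ollama SDKs.
type ProviderName = "anthropic" | "openai" | "google" | "ollama";

interface StepResult {
  provider: ProviderName;
  output: string;
}

type ProviderCall = (prompt: string) => Promise<string>;

const clients: Record<ProviderName, ProviderCall> = {
  anthropic: async (p) => `anthropic says: ${p}`,
  openai: async (p) => `openai says: ${p}`,
  google: async (p) => `google says: ${p}`,
  ollama: async (p) => `ollama says: ${p}`,
};

// Fan the same prompt out to every selected provider in parallel.
async function runStepComparison(
  prompt: string,
  compareProviders: ProviderName[],
): Promise<StepResult[]> {
  return Promise.all(
    compareProviders.map(async (provider) => ({
      provider,
      output: await clients[provider](prompt),
    })),
  );
}
```

`Promise.all` preserves the order of the selected providers, so each output lines up with its provider+model badge in the UI.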
This feature is a game-changer for evaluating model performance and ensuring the robustness of our AI pipelines across different LLM ecosystems.
Battling Truncation and Boosting Limits
During testing, we noticed some of our more complex AI steps were consistently truncating their outputs. The culprit? Hardcoded `maxTokens` limits.
The Fix:
Our `deepExtend`, `deepWisdom`, and `deepImprove` models in `src/lib/constants.ts` were capped at 8192 tokens. For involved tasks, especially when injecting verbose personas, this just wasn't enough. We bumped these limits to 16384.
```typescript
// src/lib/constants.ts (simplified)
export const models = {
  deepExtend: {
    // ... other properties
    maxTokens: 16384, // Previously 8192
  },
  deepWisdom: {
    // ... other properties
    maxTokens: 16384, // Previously 8192
  },
  deepImprove: {
    // ... other properties
    maxTokens: 16384, // Previously 8192
  },
  // ... other models
};
```
This simple but critical change ensures our AI agents can now produce comprehensive, untruncated responses, especially when dealing with richer context provided by personas.
Lessons from the Trenches: Our "Pain Log"
No dev session is complete without hitting a few snags. Here's what we learned along the way:
1. Type Safety vs. Flexibility with Zod Enums
- The Problem: We initially tried to define `compareProviders` as `string[]` in our frontend `StepConfig` type. However, our backend Zod schema for `WorkflowStep` used a literal union like `("anthropic" | "openai" | "google" | "ollama")[]` to ensure only valid providers could be specified. TypeScript rightly threw an error, complaining that `string[]` wasn't assignable to the more specific Zod enum union type.
- The Workaround: The solution was to explicitly use the literal union type across the frontend `StepConfig`, `LocalStep`, and `StepTemplate` definitions. This ensures type consistency end-to-end.

```typescript
// Before (failed attempt):
// compareProviders: string[];

// After (successful workaround):
type ProviderName = "anthropic" | "openai" | "google" | "ollama";
compareProviders: ProviderName[]; // Now type-safe with the Zod schema
```

- Lesson Learned: When dealing with Zod schemas and specific enum-like unions, ensure your frontend types mirror that specificity. Generic `string[]` might seem convenient but will cause type conflicts with stricter Zod validations.
2. The Fickle Nature of Template Literals and Edit Tools
- The Problem: We were trying to use an "Edit" tool to modify large, backtick-delimited template literal strings in our code. The tool struggled to match strings that spanned across the internal structure of these literals (e.g., trying to find a substring that was broken by an interpolated variable).
- The Workaround: Instead of trying to match long, complex strings, we switched to using smaller, unique substrings within the template literal. This allowed the edit tool to reliably find and replace the targeted content.
- Lesson Learned: When automating code edits, especially with template literals, identify unique, short, and non-interpolated anchor points for your search-and-replace operations. Don't rely on matching large, structurally complex strings.
3. Dev Server Hygiene is Crucial
- The Problem: Running two dev servers simultaneously led to port conflicts (naturally), but also some unexpected stale styles and inconsistent behavior that wasn't immediately obvious.
- The Workaround: A full teardown and clean restart became our go-to. This involved:
  - Killing all processes on port 3000 (`kill-port 3000` or similar).
  - Clearing the Next.js cache (`rm -rf .next`).
  - A clean restart.
- Lesson Learned: When things get weird in local development, especially with frontend frameworks, a clean slate is often the fastest path to resolution. Don't fight stale caches or lingering processes; just nuke 'em and restart. We're now creating a `scripts/dev-start.sh` to automate this cleanup and restart process.
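The teardown above is easy to script. A minimal sketch of what `scripts/dev-start.sh` could look like — the port, the `npx kill-port` invocation, and the `npm run dev` start command are assumptions about this project's setup:

```shell
#!/usr/bin/env bash
# scripts/dev-start.sh -- clean-slate dev server start (sketch).
# Assumes: npx on PATH, kill-port available, dev server on port 3000.
set -euo pipefail

PORT="${1:-3000}"

echo "Freeing port ${PORT}..."
npx kill-port "${PORT}" || true  # ignore failure if nothing is listening

echo "Clearing the Next.js cache..."
rm -rf .next

echo "Starting the dev server..."
npm run dev
```

The `|| true` guard matters: on a machine where nothing is listening on the port, `kill-port` exits nonzero, and without the guard `set -e` would abort the whole script.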
Active State & Next Steps
All the changes mentioned above have been committed to `main` (`aa3799` and `6623c29`). Our dev server is running clean, and the database schema has been pushed.
Our immediate next steps involve:
- Finalizing and testing the `scripts/dev-start.sh` script for consistent developer onboarding.
- Pushing these commits to origin.
- Thoroughly testing the new persona injection to verify expert team behavior.
- Validating multi-provider A/B comparison with different models.
- Re-running our deep build pipelines to confirm no more truncation issues with the bumped token limits.
- (Minor housekeeping) Fixing a pre-existing `Badge` variant type error in `discussions/[id]/page.tsx`.
It was a productive session, pushing our AI workflow capabilities significantly forward. Onwards!
```json
{
  "thingsDone": [
    "Implemented workflow persona injection (personaIds on Workflow, loadPersonaSystemPrompts, persona picker UI)",
    "Implemented multi-provider A/B comparison (compareProviders on WorkflowStep, provider fork, compare providers toggle UI)",
    "Created tRPC router for personas (list, get)",
    "Updated tRPC workflows router for personaIds and compareProviders",
    "Updated dashboard UI for persona badges, provider+model badges, 'N providers' badge",
    "Bumped maxTokens from 8192 to 16384 for deepExtend, deepWisdom, deepImprove models to fix truncation",
    "Made Step 0 expert team prompts persona-aware",
    "Updated system prompts for deepPrompt, extensionPrompt, secPrompts"
  ],
  "pains": [
    "TS error: string[] not assignable to Zod enum union type for compareProviders",
    "Edit tool failed to match strings crossing template literal boundaries",
    "Port conflict and stale styles when running two dev servers simultaneously"
  ],
  "successes": [
    "Successfully implemented complex features like persona injection and A/B comparison",
    "Resolved critical token truncation issue",
    "Improved developer experience by identifying and documenting common dev environment pitfalls",
    "Established a clear path for automating dev server startup"
  ],
  "techStack": [
    "Next.js",
    "TypeScript",
    "tRPC",
    "Prisma",
    "Zod",
    "LLMs (Anthropic, OpenAI, Google, Ollama)",
    "React"
  ]
}
```