Building Resilient LLM Workflows: Granular Control at Every Step

The world of Large Language Models is incredible, but it's not without its quirks. Billing issues, API rate limits, or unexpected model behavior from a specific provider can bring an entire workflow to a grinding halt. When a user's mission-critical automation is stuck because Anthropic decided to have a bad day, "just wait it out" isn't an acceptable answer.

That's why we recently embarked on a mission: to give our users granular control over their LLM workflows, right down to each individual step. No more global settings forcing an entire workflow to use a failing provider. No more rigid personas dictating every interaction. It was time for per-step provider/model selection and persona overrides.

After a focused development session, I'm thrilled to report the feature is fully implemented and ready for deployment. Let's break down how we got there, the technical decisions, and a particularly gnarly UI challenge we navigated.

The Goal: Dynamic Control, Enhanced Resilience

Our primary objective was clear: empower users to react to real-time LLM provider issues or specific step requirements by:

- Switching Providers/Models: If Anthropic-Claude-3-Opus is acting up, a user should be able to quickly switch a problematic step to OpenAI-GPT-4-Turbo or Google-Gemini-1.5-Pro directly from the workflow execution page.
- Overriding Personas: Some steps might require a different "voice" or "instruction set" than the overall workflow persona. We needed to allow a per-step persona to take precedence.

This isn't just a convenience; it's a critical resilience feature that puts control squarely in the user's hands, minimizing downtime and maximizing flexibility.

Under the Hood: The Implementation Journey

Building this required touching multiple layers of our stack – from the database schema to the tRPC API, our core workflow engine, and finally, the React frontend.

1. Database Schema & Prisma Magic

The first step was to extend our WorkflowStep model to accommodate the new personaId. We opted for a nullable UUID, allowing steps to optionally link to a specific persona.

prisma

// prisma/schema.prisma

model WorkflowStep {
  id               String      @id @default(uuid())
  workflowId       String
  workflow         Workflow    @relation(fields: [workflowId], references: [id], onDelete: Cascade)
  // ... other fields ...

  personaId        String?     @map("persona_id") @db.Uuid // New: Optional per-step persona
  persona          Persona?    @relation(fields: [personaId], references: [id])

  @@map("workflow_steps")
}

model Persona {
  id               String      @id @default(uuid()) @db.Uuid // Native type must match WorkflowStep.personaId (@db.Uuid)
  // ... other fields ...
  workflowSteps    WorkflowStep[] // New: Reverse relation for convenience
  @@map("personas")
}

After modifying the schema, a quick npm run db:push && npm run db:generate brought our database and Prisma client into sync. The persona_id column now exists on the workflow_steps table.

2. API Layer: Updating Steps via tRPC

Our backend uses tRPC for type-safe API interactions. We extended the input schema for steps.update to accept the new personaId.

typescript

// src/server/trpc/routers/workflows.ts

// ... inside steps.update input schema ...
personaId: z.string().uuid().nullable().optional(),
// ...
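
The .nullable().optional() combination gives the field three distinct states: omitted (leave the persona alone), null (clear the per-step override), or a UUID string (set it). As a minimal sketch of how an update handler might translate those states into Prisma relation operations, consider the helper below. The names toPersonaUpdate and PersonaRelationUpdate are hypothetical and not taken from the actual router; the sketch also assumes the handler updates the relation through connect/disconnect rather than writing the scalar directly.

```typescript
// Hypothetical helper: maps the three input states of `personaId`
// onto the relation operations Prisma accepts in an `update` call.
type PersonaRelationUpdate =
  | { connect: { id: string } }
  | { disconnect: true };

function toPersonaUpdate(
  personaId: string | null | undefined,
): { persona?: PersonaRelationUpdate } {
  // Omitted from the request: leave the existing persona untouched.
  if (personaId === undefined) return {};
  // Explicit null: remove the per-step override.
  if (personaId === null) return { persona: { disconnect: true } };
  // UUID string: point the step at that persona.
  return { persona: { connect: { id: personaId } } };
}
```

The result can be spread into the data object of the update call, so a request that never mentions personaId cannot accidentally clear an existing override.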

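A neat little "Prisma gotcha" emerged when handling step duplication. Prisma's create input comes in two mutually exclusive forms: one that sets relations through nested operations such as connect, and an "unchecked" one that sets raw foreign-key scalars. A single create call can't mix the two, so when the rest of the payload uses relation-style fields, copying the personaId scalar directly fails type checking. Instead, Prisma's connect syntax is the way to go: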

typescript

// ... inside duplicate mutation logic ...
const newStep = await ctx.prisma.workflowStep.create({
  data: {
    // ... other fields ...
    persona: step.personaId ? { connect: { id: step.personaId } } : undefined,
  },
});

This ensures the new step correctly references the existing persona without trying to create a new one.

3. Workflow Engine: Persona Overrides

The core logic lives in src/server/services/workflow-engine.ts. The executeStep() function was modified to prioritize the per-step persona. Previously, it would always inject the workflow-level persona. Now, it checks for a personaId on the current step first. If present, that persona is loaded from the database and used, effectively overriding any workflow-level persona configured.

This simple change unlocks powerful dynamic behavior, allowing specific steps to adopt a completely different tone or instruction set as needed.
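
The precedence rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual executeStep() code: the Persona, StepLike, and WorkflowLike shapes, the resolvePersona name, and the injected loadPersona function (standing in for the database lookup) are all hypothetical. The fallback when a referenced persona no longer exists is also an assumption.

```typescript
// Hypothetical shapes; the real models carry more fields.
interface Persona {
  id: string;
  systemPrompt: string;
}

interface StepLike {
  personaId: string | null;
}

interface WorkflowLike {
  persona: Persona | null;
}

// `loadPersona` stands in for the database lookup by persona id.
async function resolvePersona(
  step: StepLike,
  workflow: WorkflowLike,
  loadPersona: (id: string) => Promise<Persona | null>,
): Promise<Persona | null> {
  if (step.personaId) {
    const stepPersona = await loadPersona(step.personaId);
    // Assumption: if the referenced persona is missing, fall back to
    // the workflow-level persona instead of failing the step.
    if (stepPersona) return stepPersona;
  }
  // No per-step override: use the workflow-level persona (may be null).
  return workflow.persona;
}
```

Injecting the loader keeps the precedence logic testable without a database connection.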

4. Frontend: The User Interface

The most visible changes, and where we hit our biggest snag, were on the frontend.

- Provider Picker Integration: We imported our existing ProviderPicker component into src/app/(dashboard)/dashboard/workflows/[id]/page.tsx.
- Conditional Rendering: The ProviderPicker is only shown when the workflow is pending or paused, allowing modifications. For running or completed workflows, we display a read-only text representation of the selected provider/model, maintaining a consistent user experience.
- Per-Step Persona Dropdown: A new <select> dropdown was added within the expanded step body, right after the prompt editor. This allows users to pick an available persona for that specific step. We also adjusted the persona query to always load all available personas, removing a previous enabled: settingsOpen gate that was no longer suitable for this dynamic interaction.

Crucially, npm run typecheck passed cleanly across the entire stack, giving us confidence in the new types and interfaces.
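
The conditional rendering rule for the provider picker boils down to a small predicate on the workflow status. The sketch below assumes the four status names mentioned above are the relevant ones (the real status type may include more), and the helper names canEditProvider and providerLabel are hypothetical.

```typescript
// Status names mirror the ones mentioned in the text; the real type may differ.
type WorkflowStatus = "pending" | "paused" | "running" | "completed";

// Hypothetical helper: the picker is interactive only while the
// workflow can still be modified.
function canEditProvider(status: WorkflowStatus): boolean {
  return status === "pending" || status === "paused";
}

// Hypothetical read-only label shown when editing is not allowed.
function providerLabel(provider: string, model: string): string {
  return `${provider} / ${model}`;
}
```

In the page component, a check like this would select between rendering the ProviderPicker and rendering plain text with the label.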

Lessons Learned: The Nested Button Saga

Every developer knows that moment when a seemingly simple UI integration turns into a head-scratcher. For us, it was trying to place the ProviderPicker inside an existing header <button> element.

- The Problem: Our workflow step headers already had a <button> that toggled the step's expansion/collapse. The ProviderPicker component, internally, also renders its own <button> for interaction. Nesting a button inside another button (<button><button>...</button></button>) is invalid HTML, because a <button> may not contain interactive content.
- The Symptoms: Beyond being semantically incorrect, this caused erratic click propagation. Clicks on the inner ProviderPicker button bubbled up to the outer toggle button, so the step would often collapse unexpectedly while the user was trying to change the provider.
- The Fix: We restructured the step header, replacing the single outer <button> with a <div> wrapper. The original toggle <button> stays on the left, and the ProviderPicker sits in a separate <div> on the right. Because the picker is now a sibling of the toggle rather than a child, its clicks no longer pass through the toggle button. We also added onClick={(e) => e.stopPropagation()} to the picker's wrapper div so that its clicks cannot reach any click handler higher up in the header and trigger the step collapse.

tsx

// Simplified example of the fix
<div className="flex justify-between items-center">
  {/* Left: Step toggle button */}
  <button onClick={toggleStepExpansion}>
    {step.name}
  </button>

  {/* Right: ProviderPicker, protected from propagation */}
  <div onClick={(e) => e.stopPropagation()}>
    <ProviderPicker
      selectedProvider={step.provider}
      onSelect={handleProviderChange}
      filterAvailable={true} // Only show relevant options
    />
  </div>
</div>

Takeaway: Always be mindful of HTML semantics, especially when dealing with interactive components. Nested interactive elements are a common pitfall that can lead to subtle, frustrating bugs. Event propagation is your friend (and sometimes your enemy!), and stopPropagation() is a powerful tool when used judiciously.

What's Next?

The feature is implemented, type-checked, and ready for deployment. My immediate next steps involve:

Commit and Push: Getting these changes into our main branch.
Manual Verification:
- Opening a workflow with previously failed Anthropic steps.
- Clicking the provider badge to switch to OpenAI/Google.
- Re-running the step to confirm it uses the new provider.
- Verifying per-step persona selection persists and correctly influences LLM output.
- Ensuring workflow-level personas still function when no per-step override is set.
- Confirming completed/running workflows show read-only provider text.

This new level of control is a huge win for our users, offering unprecedented flexibility and resilience in their LLM-powered workflows. It's exciting to see how these small, but impactful, changes empower users to build more robust and adaptable systems.
