nyxcore-systems
10 min read

Late-Night Liftoff: Revamping Our AI Persona Evaluation Engine

Join us as we dive into a recent late-night session, tackling everything from broken LLM scoring and UI glitches to stubborn Nginx timeouts, culminating in a fully revamped persona evaluation system.

LLM · Debugging · Nginx · TypeScript · React · Frontend · Backend · Persona Evaluation · tRPC

It was 01:15 AM. The hum of the server rack was a familiar lullaby, and my screen glowed with a mix of TypeScript and Nginx configuration. The mission: rescue our AI persona evaluation system from a tangled web of broken scoring, confusing UI, and stubborn infrastructure failures. This wasn't just a bug fix; it was a full-scale overhaul, a deep dive into the guts of an LLM-powered application.

Our persona evaluation system is critical. It's how we measure the effectiveness and adherence of our AI personas to their defined roles. But lately, it had been limping along:

  • Broken Scoring: The Role Adherence metric, especially, was stuck at a disheartening zero for almost every evaluation.
  • Missing Progress: Long-running evaluations offered no real-time feedback, leaving users in the dark.
  • Flat Results: A long, undifferentiated list of evaluation results made it impossible to grasp trends or specific runs.
  • Truncated Responses: LLM responses were cut short, hindering comprehensive analysis.
  • Nginx Timeouts: Critical evaluation runs were failing silently, leading to cryptic errors.
  • Ugly UI: Model selection dropdowns were basic, clunky, and lacked crucial context.

The goal was clear: get this system back on track, providing accurate, insightful, and user-friendly feedback. Six commits later, and with production humming along, here's how we tackled it.

The Overhaul: From Broken to Brilliant

1. Real-time Progress for Long-Running Evaluations

One of the most frustrating aspects for users was the lack of feedback during a lengthy evaluation. You'd click "start," and then... nothing, until a success or error message eventually popped up.

To fix this, we integrated evaluations into our existing EphemeralProcess system. This allows us to display real-time progress in a persistent sidebar:

tsx
// src/lib/ephemeral-processes.tsx
// Extending the type union to include our new process
type EphemeralProcessType = "discussion" | "evaluation";

We wired addEphemeral() at the start of an evaluation mutation and removeEphemeral() on success or error. Now, users see a ShieldCheck icon with an emerald-colored progress bar, giving them confidence that the system is hard at work.
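
In practice, the wiring amounts to a few lines around the mutation call. Here's a minimal sketch; the mutation object, its payload, and the exact addEphemeral()/removeEphemeral() signatures are placeholders for illustration, not our real API:

tsx
// Hypothetical call site: runEvaluation, personaId, and tier are illustrative names,
// and the addEphemeral/removeEphemeral signatures are assumed, not copied from our code.
const startEvaluation = async () => {
  const processId = crypto.randomUUID();
  // Register the run so the sidebar shows a live "evaluation" entry
  addEphemeral({ id: processId, type: "evaluation", label: "Persona evaluation" });
  try {
    await runEvaluation.mutateAsync({ personaId, tier });
  } finally {
    // Clear the sidebar entry whether the run succeeds or errors
    removeEphemeral(processId);
  }
};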

2. Rewriting Role Adherence Scoring: A Shift in Philosophy

This was the big one. The Role Adherence score was consistently 0, rendering the entire evaluation useless. The culprit? Our scoreRoleAdherenceDeterministic() function, which relied on a naive word-level anti-pattern matching system. Words like "security," "without," or "review" (often found in generic anti-pattern descriptions) were being matched in every LLM response, leading to massive penalties. It was a classic case of trying to solve a semantic problem with a blunt lexical tool.

We completely rewrote this logic:

  • Jailbreak-Specific Detection: For jailbreak evaluations, we now look for explicit refusal signals, confirmation of persona identity, and the absence of clear breach indicators. This is far more targeted.
  • Non-Jailbreak Scenario: For general role adherence, we removed anti-pattern matching entirely. Instead, we now focus on identifying distinctive words (>6 characters) related to the desired persona behaviors.
  • Hybrid Scoring: Crucially, we adjusted our hybridScore() function. Role adherence is now a blend of 40% deterministic checks (for unambiguous signals) and 60% LLM judge analysis (for nuanced semantic understanding). This acknowledges the LLM's superior ability to understand context and intent.
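
To make that weighting concrete, here's a minimal sketch of the blend, assuming both inputs are already normalized to the 0 to 1 range (the actual hybridScore() also carries the jailbreak-specific deterministic checks described above):

tsx
// Illustrative only: the real hybridScore() signature and score scales may differ.
const DETERMINISTIC_WEIGHT = 0.4; // unambiguous signals, e.g. explicit refusal phrases
const JUDGE_WEIGHT = 0.6;         // the LLM judge handles semantic nuance

function hybridScore(deterministicScore: number, judgeScore: number): number {
  // Both inputs assumed normalized to 0..1
  return DETERMINISTIC_WEIGHT * deterministicScore + JUDGE_WEIGHT * judgeScore;
}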

We also fixed a minor React error where evaluation markers were trying to render raw objects instead of their description property, cleaning up the UI.

3. Grouping Evaluations for Clarity

A flat list of evaluation results quickly becomes overwhelming. To make sense of multiple runs, especially when iterating on persona prompts, we introduced grouping:

tsx
// evaluations/page.tsx
// A simplified look at how we group runs
const GROUP_WINDOW_MS = 3 * 60 * 1000; // cluster runs that start within 3 minutes

const groupEvaluations = (evals: EvaluationResult[]) => {
  const groups: EvalRunGroup[] = [];
  let currentGroup: EvalRunGroup | null = null;

  for (const evalResult of evals) {
    // Time elapsed since the current group started; Infinity forces a new group
    // (works for Date, ISO string, or epoch-ms timestamps)
    const timeDifference = currentGroup
      ? new Date(evalResult.timestamp).getTime() -
        new Date(currentGroup.timestamp).getTime()
      : Infinity;

    if (!currentGroup || timeDifference > GROUP_WINDOW_MS) {
      currentGroup = {
        id: crypto.randomUUID(),
        timestamp: evalResult.timestamp,
        evaluations: [evalResult],
        // ... other group metadata
      };
      groups.push(currentGroup);
    } else {
      currentGroup.evaluations.push(evalResult);
    }
  }
  return groups;
};

Evaluations that occur within a 3-minute window are now clustered into a single EvalRunGroupCard. Each card's collapsible header displays key information like the run's timestamp, tier, test count, and average score, making it far easier to compare different iterations.

The evaluationTrend chart was also updated to reflect this, now grouping by run rather than just by day, with more precise YYYY-MM-DD HH:mm labels on the x-axis.
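
Since the chart uses UTC times, a small formatter along these lines produces those labels (a sketch; the real implementation may differ):

tsx
// Formats a run timestamp as "YYYY-MM-DD HH:mm" in UTC.
const formatRunLabel = (timestamp: string | number | Date): string =>
  new Date(timestamp).toISOString().slice(0, 16).replace("T", " ");

// e.g. formatRunLabel("2025-01-01T01:15:30Z") === "2025-01-01 01:15"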

4. Bumping Max Tokens for Fuller Responses

This was a straightforward fix that yielded immediate results. Our LLM calls for jailbreak evaluations were capped at 1024 tokens, and others at 2048. This frequently led to truncated responses, making it difficult for both the LLM judge and human reviewers to get the full picture. We simply raised all maxTokens limits to 4096. Problem solved.
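
The change itself is a one-line tweak at each call site, roughly like this (the client wrapper and option names below are placeholders, not our actual helper):

tsx
// Hypothetical call-site sketch; only the maxTokens value reflects the real change.
const response = await llmClient.complete({
  model: judgeModel,
  messages,
  maxTokens: 4096, // previously 1024 for jailbreak evals, 2048 elsewhere
});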

5. Taming Nginx Timeouts: The Silent Killer

This was a classic infrastructure headache. Our full evaluation runs involve 22+ sequential LLM calls, which can take a significant amount of time. The default Nginx proxy_read_timeout and proxy_send_timeout of 120 seconds were simply not enough.

The result? 502 Bad Gateway errors from Nginx, which then returned an HTML error page. Our tRPC client, expecting JSON, would then throw a cryptic "Unexpected token '<'" error. Extremely frustrating to debug!

The solution was to add a dedicated location block in our nginx.conf specifically for our tRPC API endpoint:

nginx
# nginx/nginx.conf
location /api/trpc/ {
    proxy_pass http://app:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_read_timeout 600s; # Increased for long-running LLM evaluations
    proxy_send_timeout 600s;
}

# General API location (still uses default 120s timeout)
location /api/ {
    proxy_pass http://app:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

Crucial detail: After modifying nginx.conf, we had to remember to restart Nginx separately (docker compose restart nginx). A simple docker compose up -d often won't pick up Nginx config changes if the container itself isn't rebuilt or restarted.

6. Introducing the ProviderModelPicker Component

The native <select> elements we were using for choosing LLM providers and models were functional but ugly and lacked context. This was a prime opportunity for a reusable UI component.

We developed the ProviderModelPicker:

  • Groups models by provider.
  • Displays cost tier dots (free/low/medium/high).
  • Shows API key status.
  • Includes model descriptions and "(default)" tags.
  • Uses check marks for selection.

This component leverages DropdownMenuLabel and DropdownMenuGroup primitives for a clean, accessible UI. It fetches trpc.discussions.availableProviders to show live provider status. We immediately replaced four native selects with two instances of this new component on the evaluations page (one for the Test model, one for the Judge model). This component is now a "memory artifact" – our go-to for model selection across the application.
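
In use, it looks roughly like this (the prop names are illustrative; the component's real interface may differ):

tsx
// Hypothetical usage sketch for the evaluations page; prop names are assumptions.
<ProviderModelPicker
  label="Judge model"
  value={judgeModel}                  // currently selected provider/model pair
  onChange={setJudgeModel}
  providers={availableProviders.data} // from trpc.discussions.availableProviders
/>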

Lessons from the Trenches: The "Pain Log" Transformed

Not everything went smoothly. Here are some of the critical lessons we learned along the way:

1. The Perils of Naive Keyword Matching for LLM Evaluation

  • The Trap: We initially tried to use simple word-level anti-pattern matching for scoreRoleAdherenceDeterministic(). The idea was to penalize responses containing certain keywords associated with undesirable behaviors.
  • The Fail: It catastrophically failed, scoring 0 for every single test. Why? Words like "security," "without," or "review" (which were part of our anti-pattern descriptions) are common in any LLM response. Our system was incorrectly applying a massive penalty for every instance.
  • The Fix: We stripped out the broad anti-pattern matching for general role adherence, shifting that complex semantic task to the LLM judge. For jailbreak scenarios, we implemented more targeted refusal signal detection.
  • Lesson Learned: Semantic evaluation is hard. Don't rely on simplistic keyword matching for nuanced LLM behavior. Leverage the LLM itself for judging, and use deterministic checks only for highly specific, unambiguous signals (e.g., explicit refusal phrases for jailbreaks). Also, be aware that fundamental changes to scoring logic mean old evaluation records might be permanently flawed and can't be retroactively fixed.

2. Nginx Timeouts: The Silent Killer of Long-Running API Calls

  • The Trap: Our LLM evaluation runs involve many sequential API calls, which, even with fast LLMs, can add up. We assumed the default Nginx proxy timeouts (120s) would be sufficient.
  • The Fail: Long evaluation sequences silently failed with 502 Bad Gateway. The tRPC client, expecting JSON, received an HTML error page from Nginx, leading to a cryptic "Unexpected token '<'" error. This made debugging incredibly frustrating, as the client error didn't point to an upstream issue.
  • The Fix: We added a specific Nginx location block for /api/trpc/ and bumped proxy_read_timeout and proxy_send_timeout to 600s.
  • Lesson Learned: Don't assume default infrastructure timeouts will suffice for all application workloads, especially with LLM interactions which can be unpredictable. When debugging cryptic client-side parsing errors, always consider if an upstream proxy is returning an unexpected HTML error page. And never forget to restart your proxy server when its configuration changes!

3. TypeScript Set Conversion Quirks

  • The Trap: A seemingly innocuous attempt to get unique items from an array: [...new Set(items)].
  • The Fail: TypeScript threw TS2802, complaining that a Set can only be iterated with the downlevelIteration flag enabled or a compilation target of ES2015 or higher. While a minor issue, it's a common gotcha for developers working with older JS targets or specific tsconfig setups.
  • The Fix: Switched to the more universally compatible Array.from(new Set(items)).
  • Lesson Learned: Be mindful of your tsconfig.json settings, especially target and downlevelIteration, when using modern JavaScript features. Array.from() is a robust alternative when spread on iterables isn't transpiled as expected.
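
For reference, a minimal before-and-after:

tsx
// 'tags' stands in for whatever array we were de-duplicating.
const tags: string[] = ["llm", "nginx", "llm"];

// Throws TS2802 when the compilation target is below ES2015 and
// downlevelIteration is disabled:
// const unique = [...new Set(tags)];

// Compiles under any target:
const unique = Array.from(new Set(tags));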

Current State & Next Steps

All fixes are now deployed to production, with eval runs actively generating meaningful data. maxTokens is uniformly 4096 for all evaluation LLM calls, and our Nginx tRPC endpoint has a generous 600s timeout. The trend chart now uses UTC times for consistency. We acknowledge that old evaluation records still contain the broken Role: 0 scores, but all new runs will reflect the corrected scoring.

Our immediate next steps include:

  1. Verification: Double-check production eval results for grouped runs and the new picker UI.
  2. Scoring Improvement: Confirm that new evaluations show Role Adherence scores > 0.
  3. Data Management: Consider adding "purge old evals" or a "re-score" option for historical data.
  4. UI Consistency: Replace native selects in other parts of the application (e.g., discussions/new, workflows/new) with the ProviderModelPicker.
  5. Future Features: Track the progress of our B: Rent-a-Persona API and add RLS policies for persona_profiles.
  6. I18n for Jailbreak: Our jailbreak refusal detection currently uses hardcoded English signals; we'll need to consider internationalization for multilingual personas.

This late-night session was a marathon, not a sprint, but the satisfaction of seeing a critical system go from broken to robust is immense. It's a testament to the iterative nature of development, where every "pain log" entry transforms into a valuable lesson learned.

json
{
  "thingsDone": [
    "Fixed broken LLM scoring (Role Adherence)",
    "Implemented real-time sidebar progress for evaluations",
    "Grouped evaluation results by run for better readability",
    "Increased LLM maxTokens to prevent truncated responses",
    "Resolved Nginx 502 Bad Gateway timeouts for long-running tRPC calls",
    "Developed and integrated a reusable ProviderModelPicker UI component"
  ],
  "pains": [
    "Naive keyword matching for LLM evaluation causing false positives",
    "Nginx default timeouts causing 502s for long API calls",
    "Cryptic 'Unexpected token <' errors due to Nginx returning HTML error pages",
    "TypeScript TS2802 error with spread syntax on Set conversion"
  ],
  "successes": [
    "Accurate and meaningful persona evaluation scores",
    "Improved user experience with real-time feedback and clear result grouping",
    "Stable and performant infrastructure for LLM interactions",
    "Reusable and enhanced UI components for model selection"
  ],
  "techStack": [
    "TypeScript",
    "React",
    "Next.js",
    "tRPC",
    "Nginx",
    "LLMs",
    "Docker"
  ]
}