nyxcore-systems
7 min read

Cracking the Code: Overhauling Our Persona Evaluation System for Deeper Insights

A deep dive into a recent development sprint where we tackled critical issues in our LLM persona evaluation system, from broken scoring and confusing UI to stubborn Nginx timeouts. Discover how we refined our metrics, streamlined our UI, and learned valuable lessons along the way.

LLM, Evaluation, Persona, DevOps, Frontend, Backend, TypeScript, Next.js, Nginx, UX, AI

The world of Large Language Models (LLMs) is dynamic, and ensuring our custom personas behave as expected is paramount. Our persona evaluation system is the backbone of this assurance, but recently, it was showing its age – and its flaws. We faced a laundry list of issues: utterly broken scoring, a frustrating lack of progress indicators for long-running tasks, a flat and unhelpful results list, truncated LLM responses, mysterious Nginx timeouts, and clunky model selection interfaces.

After an intense, focused session, I'm thrilled to report that all these issues have been addressed and deployed to production. This post details the journey, the solutions, and the crucial lessons learned along the way.

Our Journey to Better Evals: A Deep Dive into the Fixes

This sprint involved a series of interconnected improvements, each contributing to a more robust, reliable, and user-friendly evaluation experience.

1. Keeping an Eye on Long-Running Tasks: Sidebar Progress for Evaluations

Commit: 3ace4bd

LLM evaluations aren't instantaneous. For a user, kicking off an evaluation and seeing no immediate feedback can be anxiety-inducing. We needed a clear, real-time indicator that something was happening.

We extended our EphemeralProcess system to include "evaluation" as a type. This involved updating src/lib/ephemeral-processes.tsx and adding a new, distinct UI for evaluations in src/components/layout/active-processes.tsx – complete with a ShieldCheck icon and an emerald-colored progress bar. Finally, evaluations/page.tsx was wired to addEphemeral() when a mutation starts and removeEphemeral() upon success or error.
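To make the wiring concrete, here's a minimal sketch of the pattern. The type and hook shapes below are simplified, illustrative stand-ins for what actually lives in src/lib/ephemeral-processes.tsx and evaluations/page.tsx:

```tsx
import { useMutation } from "@tanstack/react-query";

// Illustrative shapes only; the real definitions live in src/lib/ephemeral-processes.tsx.
type EphemeralProcessType = "evaluation"; // newly added alongside the existing process types

interface EphemeralProcess {
  id: string;
  type: EphemeralProcessType;
  label: string;
}

declare function addEphemeral(process: EphemeralProcess): void;
declare function removeEphemeral(id: string): void;
declare function runEvaluation(personaId: string): Promise<void>;

// In evaluations/page.tsx: the sidebar entry lives exactly as long as the mutation.
function useEvaluationWithProgress(personaId: string) {
  return useMutation({
    mutationFn: () => runEvaluation(personaId),
    onMutate: () =>
      addEphemeral({ id: personaId, type: "evaluation", label: "Running evaluation…" }),
    // onSettled fires on both success and error, so the indicator always clears.
    onSettled: () => removeEphemeral(personaId),
  });
}
```

Tying the sidebar entry to onMutate/onSettled guarantees the indicator disappears whether the run succeeds or fails.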

Impact: Users now have peace of mind, seeing their evaluation tasks actively processing in the sidebar, improving transparency and user experience.

2. From Flawed Penalties to Intelligent Judgment: Role Adherence Scoring Rewrite

Commit: 44f233a

This was perhaps the most critical fix. Our previous scoreRoleAdherenceDeterministic() function was fundamentally broken: its word-level anti-pattern matching inadvertently penalized nearly every response, driving scores to 0 across the board. Imagine your persona trying to be helpful, only to be docked points because common words like "security" or "review" appeared in its response, triggering a -75 penalty!

We completely rewrote the scoring logic:

  • Jailbreak-specific evaluations: Now check for refusal signals, explicit persona identity, and the absence of breach signals. This is far more nuanced than simple word matching.
  • Non-jailbreak evaluations: We removed anti-pattern matching entirely. Instead, we now focus on the longest distinctive words (over 6 characters) per behavior, providing a more meaningful signal of adherence.
  • Hybrid Scoring: The hybridScore() now blends deterministic checks (40%) with the power of an LLM judge (60%). This combination leverages the best of both worlds: the speed and consistency of deterministic checks for clear-cut cases, and the semantic understanding of an LLM for nuanced role adherence.
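Here's a hedged sketch of how the blend works; the signatures, weight constants, and helper below are illustrative stand-ins for the actual persona-evaluator.ts code:

```ts
// Illustrative sketch; both scorers are assumed to return a value in the same 0-100 range.
declare function scoreRoleAdherenceDeterministic(response: string, behaviors: string[]): number;
declare function scoreWithLlmJudge(response: string, behaviors: string[]): Promise<number>;

const DETERMINISTIC_WEIGHT = 0.4;
const JUDGE_WEIGHT = 0.6;

async function hybridScore(response: string, behaviors: string[]): Promise<number> {
  const deterministic = scoreRoleAdherenceDeterministic(response, behaviors);
  const judged = await scoreWithLlmJudge(response, behaviors);
  return DETERMINISTIC_WEIGHT * deterministic + JUDGE_WEIGHT * judged;
}

// Non-jailbreak deterministic signal: the longest distinctive words
// (over 6 characters) per behavior, instead of anti-pattern matching.
function distinctiveWords(behavior: string): string[] {
  return behavior
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word.length > 6)
    .sort((a, b) => b.length - a.length);
}
```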

We also fixed a minor React error (#31, "Objects are not valid as a React child") on evaluations/page.tsx, where markers were attempting to render raw objects instead of their descriptions.

Impact: Our persona evaluations are now accurate and truly reflect role adherence, providing meaningful insights into persona performance. Old evaluation records still show the broken 0 scores, but all new runs will display corrected, insightful data.

3. Bringing Order to Chaos: Evaluations Grouped by Run

Commit: 5773854

A flat list of evaluation results quickly becomes unwieldy. To make sense of multiple tests conducted in a short span, we needed a way to group them logically.

evaluations/page.tsx now intelligently groups evaluation items that occur within a 3-minute window into EvalRunGroupCard components. These cards provide a collapsible header displaying the run's time, tier, test count, and average score. Complementing this, src/server/trpc/routers/personas.ts was updated to group evaluationTrend data per run rather than per day, with the chart's x-axis labels now showing YYYY-MM-DD HH:mm for precise context.
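The windowing logic itself is straightforward. A minimal sketch, assuming a simplified item shape (the real grouping lives in evaluations/page.tsx):

```ts
// Illustrative sketch: group items into runs when consecutive items
// land within a 3-minute window. The item shape is an assumption.
interface EvalItem {
  id: string;
  createdAt: Date;
  score: number;
}

const RUN_WINDOW_MS = 3 * 60 * 1000;

function groupIntoRuns(items: EvalItem[]): EvalItem[][] {
  const sorted = [...items].sort(
    (a, b) => a.createdAt.getTime() - b.createdAt.getTime(),
  );
  const runs: EvalItem[][] = [];
  for (const item of sorted) {
    const currentRun = runs[runs.length - 1];
    const previous = currentRun ? currentRun[currentRun.length - 1] : undefined;
    if (
      currentRun &&
      previous &&
      item.createdAt.getTime() - previous.createdAt.getTime() <= RUN_WINDOW_MS
    ) {
      currentRun.push(item); // still within the same run's window
    } else {
      runs.push([item]); // gap exceeded 3 minutes: start a new run
    }
  }
  return runs;
}
```

Each resulting group then feeds an EvalRunGroupCard with its time, tier, test count, and average score.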

Impact: The evaluation results page is transformed into an organized, digestible overview, making it much easier to track performance trends across specific test runs.

4. Letting LLMs Speak Their Mind: Increased Max Tokens

Commit: c775701

Truncated LLM responses are frustrating. They can hide critical information, making it impossible to properly assess a persona's behavior or a jailbreak attempt's success.

We've increased the maxTokens limit in persona-evaluator.ts for all LLM calls from previous limits (1024 for jailbreak, 2048 for others) to a uniform 4096.
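In code terms it's a single uniform ceiling. This sketch uses an illustrative call shape, not the actual persona-evaluator.ts API:

```ts
// Illustrative call shape, not the actual persona-evaluator.ts API.
declare function callModel(options: { prompt: string; maxTokens: number }): Promise<string>;

// Previously 1024 (jailbreak) / 2048 (everything else); now one uniform ceiling.
const MAX_TOKENS = 4096;

async function evaluateOnce(prompt: string): Promise<string> {
  return callModel({ prompt, maxTokens: MAX_TOKENS });
}
```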

Impact: Fuller, more complete LLM responses, ensuring we get the full picture during evaluations.

5. Banishing the Dreaded 502: Nginx Timeout Fix

Commit: fb642e2

Running a full evaluation involves 22+ sequential LLM calls, a process that can easily exceed default server timeouts. This led to frustrating 502 Bad Gateway errors, with Nginx prematurely closing connections and sending back an HTML error page that our tRPC client would then try to parse, resulting in an "Unexpected token '<'" error.

The fix involved a targeted update to nginx/nginx.conf. We added a dedicated location block for /api/trpc/ and explicitly set proxy_read_timeout 600s and proxy_send_timeout 600s. This ensures that long-running tRPC calls have ample time to complete without being prematurely cut off.
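For reference, the relevant block looks roughly like this; the upstream name and the surrounding server block are assumptions, so adapt it to your own config:

```nginx
# Sketch of the nginx/nginx.conf addition; upstream name is illustrative.
location /api/trpc/ {
    proxy_pass http://app;            # your existing application upstream
    proxy_read_timeout 600s;          # give long-running tRPC calls time to finish
    proxy_send_timeout 600s;
}
```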

Important Note: Remember that Nginx configuration changes require a separate restart (docker compose restart nginx) from the application itself. A common pitfall!

Impact: Stable, uninterrupted evaluation runs, even for the most complex persona assessments.

6. A Polished Selection Experience: Introducing the ProviderModelPicker Component

Commit: 00ad268

Our model selection interfaces were, frankly, ugly. Using native <select> elements for provider and model selection was clunky and lacked essential information.

We developed a brand-new, reusable ProviderModelPicker component (src/components/shared/provider-model-picker.tsx). This component:

  • Groups models by provider.
  • Visually indicates cost tiers (free/low/medium/high) with dots.
  • Shows API key status.
  • Displays model descriptions and "(default)" tags.
  • Uses check marks for selected items.

To support this, we added DropdownMenuLabel and DropdownMenuGroup primitives to src/components/ui/dropdown-menu.tsx. We then replaced the 4 native <select> elements on evaluations/page.tsx with 2 instances of ProviderModelPicker (one for the Test model, one for the Judge model), dynamically fetching trpc.discussions.availableProviders for live status.
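To show the intended usage, here's a hedged sketch; the props shape is a plausible stand-in, not the component's actual API:

```tsx
import { useState } from "react";

// A plausible props shape; the real API lives in
// src/components/shared/provider-model-picker.tsx and may differ.
interface ModelChoice {
  provider: string;
  model: string;
}

interface ProviderModelPickerProps {
  label: string;
  value: ModelChoice;
  onChange: (next: ModelChoice) => void;
  // In the real page, provider/key status comes from trpc.discussions.availableProviders.
}

declare function ProviderModelPicker(props: ProviderModelPickerProps): JSX.Element;

// Two pickers replace the four native <select> elements on evaluations/page.tsx.
export function EvaluationModelPickers() {
  const [testModel, setTestModel] = useState<ModelChoice>({ provider: "", model: "" });
  const [judgeModel, setJudgeModel] = useState<ModelChoice>({ provider: "", model: "" });

  return (
    <>
      <ProviderModelPicker label="Test model" value={testModel} onChange={setTestModel} />
      <ProviderModelPicker label="Judge model" value={judgeModel} onChange={setJudgeModel} />
    </>
  );
}
```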

Impact: A significantly improved, user-friendly experience for selecting LLM providers and models, setting a new standard for future UI interactions. We've also captured it as a standing best practice: always use ProviderModelPicker for model selection.

Lessons Learned: Overcoming the Challenges

This session wasn't without its hurdles. Each challenge, however, provided a valuable lesson that will inform future development.

1. The Peril of Naive String Matching for Semantic Tasks

Challenge: Our initial approach to scoreRoleAdherenceDeterministic() relied on simple word-level anti-pattern matching. This proved disastrous, leading to every evaluation scoring 0 due to common, innocuous words being flagged as "anti-patterns."

Lesson Learned: Simple string matching is insufficient for nuanced semantic tasks like role adherence. It's a blunt instrument where a scalpel is needed. For complex semantic understanding, leveraging the power of an LLM judge is essential. A hybrid approach, combining targeted deterministic checks with LLM judgment, offers the most robust and accurate solution.

2. Nginx Configuration: The Silent Killer of Long Requests

Challenge: Full evaluation runs, involving many sequential LLM calls, consistently hit the default 120-second Nginx timeout, resulting in cryptic 502 Bad Gateway errors and client-side "Unexpected token '<'" messages (as our tRPC client tried to parse an Nginx HTML error page).

Lesson Learned: For long-running API calls, especially those involving external services like LLMs, default server timeouts are often too restrictive. Explicitly configuring higher timeouts for specific API endpoints (e.g., /api/trpc/) in nginx.conf is crucial for stability. And a perennial reminder: Nginx configuration changes require a separate, explicit restart of the Nginx service!

3. TypeScript's downlevelIteration and Set Conversion

Challenge: A minor but annoying TypeScript error (TS2802) occurred when trying to use the spread syntax [...new Set(items)] to convert a Set to an array.

Lesson Learned: Be mindful of your tsconfig.json settings, particularly downlevelIteration. While the spread syntax is elegant, Array.from(new Set(items)) is a more universally compatible and explicit way to achieve the same result, especially when downlevelIteration isn't enabled.
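A quick illustration of both forms (the error comment assumes an es5 target without downlevelIteration):

```ts
const items: string[] = ["alpha", "beta", "alpha"];

// Fails with TS2802 under an es5 target without downlevelIteration:
// const unique = [...new Set(items)];

// Compiles everywhere, and is explicit about the conversion:
const unique = Array.from(new Set(items));
```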

Looking Ahead: The Journey Continues

These changes represent a significant leap forward in the robustness, accuracy, and user experience of our persona evaluation system. All fixes are live on production, and evaluation runs are actively flowing, providing much-needed insights into persona performance.

While this session wrapped up a major set of improvements, the journey continues. Immediate next steps include:

  • Verifying on production that grouped runs and the new picker UI are functioning as expected.
  • Confirming that new evaluations show correctly improved Role Adherence scores (i.e., > 0).
  • Considering options to purge or re-score old, broken historical evaluation data.
  • Replacing native selects in other pages (like discussions/new, workflows/new) with the new ProviderModelPicker.
  • Further developing the Rent-a-Persona API.
  • Adding RLS policy for the persona_profiles table for enhanced security.
  • Expanding jailbreak refusal detection beyond hardcoded English signals to support multilingual personas.

Stay tuned for more updates as we continue to refine and enhance our LLM tooling!