nyxcore-systems

Precision & Clarity: Shipping Major Upgrades to Our AI Persona Evaluation Engine

We've just rolled out a significant update to our AI persona evaluation platform, tackling scoring accuracy, user experience, and developer insights. Dive into the challenges, solutions, and lessons learned from improving how we measure LLM role adherence and track performance.

AI · LLM · Evaluation · Development · TypeScript · Next.js · Engineering

Building robust AI systems isn't just about crafting powerful models; it's equally about having the right tools to evaluate them. When we're dealing with LLMs that need to embody specific personas—whether it's a helpful assistant, a strict moderator, or a mischievous character—measuring their adherence to that role is paramount.

Recently, our team shipped a series of critical updates to our internal AI persona evaluation platform. This wasn't just about tweaking a few lines of code; it was a deep dive into the heart of how we score, visualize, and interact with our LLM evaluations. The goal? More accurate scoring, a smoother developer experience, and clearer insights into model performance trends.

Let's break down what we accomplished and the journey we took to get there.

Elevating the Evaluation Experience

Our recent work focused on several key areas to make our evaluation platform more powerful and user-friendly:

1. Real-time Feedback for Long-Running Processes

Running comprehensive LLM evaluations can take time. Developers need to know when an evaluation has started, if it's still running, and when it completes. To address this, we introduced a new sidebar component that displays active, ephemeral processes.

Now, whenever an evaluation kicks off, a clear entry appears in the sidebar. This leverages an EphemeralProcess pattern, adding "evaluation" to its type union in src/lib/ephemeral-processes.tsx. On the frontend (src/components/layout/active-processes.tsx and evaluations/page.tsx), we've wired up useEphemeralProcesses to automatically add an entry when a mutation starts and remove it upon success or error. It's a small but mighty UX win, ensuring you're never left wondering about the status of your latest test run.
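The pattern can be sketched as a small registry with add/remove hooks around the mutation lifecycle. This is a minimal illustration, not the real module: the type union and the `addProcess`/`removeProcess` names are assumptions based on the description above, and the actual implementation lives in src/lib/ephemeral-processes.tsx.

```typescript
// Hypothetical sketch of the EphemeralProcess pattern; the real module
// is React-based, but the core bookkeeping looks roughly like this.
type EphemeralProcessType = "import" | "export" | "evaluation"; // "evaluation" newly added

interface EphemeralProcess {
  id: string;
  type: EphemeralProcessType;
  label: string;
  startedAt: number;
}

const processes = new Map<string, EphemeralProcess>();

function addProcess(id: string, type: EphemeralProcessType, label: string): void {
  processes.set(id, { id, type, label, startedAt: Date.now() });
}

function removeProcess(id: string): void {
  processes.delete(id); // called on both success and error
}

// Wrap a long-running evaluation mutation so the sidebar entry
// appears when it starts and disappears when it settles.
async function runEvaluation(id: string, work: () => Promise<void>): Promise<void> {
  addProcess(id, "evaluation", "Persona evaluation");
  try {
    await work();
  } finally {
    removeProcess(id);
  }
}
```

In the real app, `useEphemeralProcesses` plays the role of this registry, with the mutation's `onMutate`/`onSettled` callbacks doing the add and remove.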

2. A Smarter Approach to Role Adherence Scoring

This was perhaps the most critical and complex piece of the puzzle. Accurately scoring an LLM's adherence to a persona is nuanced. Our previous deterministic scoring method for "Role Adherence" had some significant flaws, particularly when it came to anti-pattern matching.

We completely rewrote scoreRoleAdherenceDeterministic() in src/server/services/persona-evaluator.ts. Here's the new strategy:

  • Jailbreak-Specific Detection: For tests specifically designed to provoke a "jailbreak," we now employ a targeted approach. This checks for explicit refusal signals, verifies the persona's identity, and actively looks for the absence of breach signals. This is far more precise than general anti-pattern matching.
  • Non-Jailbreak Context: For standard persona adherence, we moved away from simple anti-pattern word matching. Instead, we now look for the longest distinctive words (over 6 characters) from the persona's definition, ensuring the LLM is using character-specific vocabulary.
  • Hybrid Scoring: The most significant change is the introduction of hybridScore(). Role adherence now benefits from a powerful combination: 40% of its score comes from our refined deterministic checks, and a full 60% is delegated to an LLM judge, which excels at semantic understanding. This hybrid model leverages the strengths of both rule-based precision and AI-powered nuance.
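In code, the two new ideas are simple to state. The sketch below is illustrative: the 40/60 weights come from the post, but the function signatures and the distinctive-word heuristic details are assumptions, not the actual persona-evaluator.ts implementation.

```typescript
// Hedged sketch of hybridScore(): blend the deterministic checks (40%)
// with the LLM judge's semantic score (60%). Both inputs 0..100.
function hybridScore(deterministic: number, judge: number): number {
  return 0.4 * deterministic + 0.6 * judge;
}

// Distinctive-word check for non-jailbreak adherence: pull the longest
// character-specific words (over 6 characters) from the persona definition,
// then check whether the response uses that vocabulary.
function distinctiveWords(personaDefinition: string, limit = 10): string[] {
  const words = personaDefinition.toLowerCase().match(/[a-z]+/g) ?? [];
  const unique = Array.from(new Set(words.filter((w) => w.length > 6)));
  // Longest words first: they are the least likely to appear by accident.
  return unique.sort((a, b) => b.length - a.length).slice(0, limit);
}
```

The design intuition: long, rare words ("blacksmith", "technomancer") are strong evidence the model is speaking in character, while short common words prove nothing.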

We also squashed a React error (#31) on the evaluations/page.tsx where marker objects were being rendered directly. The fix now correctly displays marker.description or the string itself.

3. Grouping Evaluations for Better Trends

Previously, our evaluation trend charts were grouped by day. While useful for high-level tracking, it obscured improvements made within a single day, especially during rapid iteration cycles.

We've introduced a new grouping mechanism. In evaluations/page.tsx, groupEvaluations() now clusters individual evaluations that occur within a 3-minute window, treating them as a single "run." These are presented in new EvalRunGroupCard components, providing a collapsible header with the run's time, tier, test count, and average score.
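The clustering itself is a single pass over time-sorted evaluations. This is a simplified sketch of the idea: the `Evaluation` shape and the exact gap rule are assumptions inferred from the description, not the real `groupEvaluations()`.

```typescript
interface Evaluation {
  id: string;
  createdAt: number; // epoch milliseconds
  score: number;
}

const WINDOW_MS = 3 * 60 * 1000; // 3-minute clustering window

// Cluster evaluations into "runs": a new run starts whenever the gap
// to the previous evaluation exceeds the window.
function groupEvaluations(evals: Evaluation[]): Evaluation[][] {
  const sorted = [...evals].sort((a, b) => a.createdAt - b.createdAt);
  const runs: Evaluation[][] = [];
  for (const e of sorted) {
    const current = runs[runs.length - 1];
    if (!current || e.createdAt - current[current.length - 1].createdAt > WINDOW_MS) {
      runs.push([e]);
    } else {
      current.push(e);
    }
  }
  return runs;
}
```

Each resulting run maps onto one EvalRunGroupCard, with the header's average score computed over the run's members.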

Crucially, the evaluationTrend endpoint in src/server/trpc/routers/personas.ts has been updated to reflect this per-run grouping. Chart labels now show "YYYY-MM-DD HH:mm," allowing developers to visualize the impact of changes made throughout the day.

4. Raising LLM Token Limits

Sometimes the simplest changes have a profound impact. We've raised maxTokens for all LLM calls within our persona-evaluator.ts to 4096, up from 1024 for jailbreak targets and 2048 for everything else. A larger output budget means responses are no longer cut off mid-answer, leading to more complete and accurate evaluations, especially for longer responses or complex persona definitions.

Lessons Learned the Hard Way

No significant feature rollout comes without its share of head-scratching moments and false starts. Here are a few key lessons we picked up along the way:

The Pitfalls of Naive Anti-Pattern Matching

  • Initial Approach: We initially tried to improve scoreRoleAdherenceDeterministic() by incorporating word-level anti-pattern matching. The idea was to penalize responses containing words from descriptions of behaviors we wanted to avoid.
  • The Failure: This approach spectacularly failed. Every test scored 0 for Role Adherence. Why? Common words found in the descriptions of anti-patterns (e.g., "behave," "respond," "user") were present in almost every LLM response. With 5 anti-patterns, each carrying a -15 penalty, the accumulated -75 wiped out the adherence score on virtually every response.
  • The Workaround & Lesson: We learned that deterministic anti-pattern matching needs to be extremely precise and context-aware, which is inherently difficult for semantic tasks. We removed word-level anti-pattern matching entirely for general role adherence, shifting that responsibility to the LLM judge (which excels at semantic understanding). For jailbreaks, we adopted a much more specific, signal-based detection. Lesson: For nuanced semantic evaluation, trust the LLM judge. For deterministic rules, ensure they are incredibly narrow and specific.
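The failure mode is easy to reproduce in a toy version. The -15-per-match penalty and the five anti-patterns come from the account above; the base score and the substring matching are simplifying assumptions for illustration.

```typescript
// Toy reproduction of the naive anti-pattern scorer that zeroed every test.
// Generic words from anti-pattern descriptions match almost any response,
// so all five penalties fire at once.
function naiveAdherenceScore(response: string, antiPatterns: string[][]): number {
  const text = response.toLowerCase();
  let score = 75; // assumed base score from the other deterministic checks
  for (const words of antiPatterns) {
    if (words.some((w) => text.includes(w))) {
      score -= 15; // one penalty per matched anti-pattern
    }
  }
  return Math.max(0, score);
}
```

Even a perfectly in-character reply trips every pattern, because words like "respond" and "user" appear everywhere, and the score clamps to 0.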

React Rendering Objects Directly

  • The Problem: While refining the UI, we encountered a classic React error: Objects are not valid as a React child. This happened because we were trying to render {m.marker} directly, where m.marker was an object of type MarkerDefinition.
  • The Fix: The solution was straightforward: typeof m.marker === "string" ? m.marker : m.marker.description. This ensures we always render a string representation, either the marker itself if it's a simple string, or its description property if it's a more complex object. Lesson: Always ensure what you render in React is a valid child—a string, number, element, or array of these—never a plain object.
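Extracted as a helper, the guard looks like this. The `MarkerDefinition` shape is assumed from the description; only the `typeof` check itself comes from the actual fix.

```typescript
// Assumed shape of a marker object; the real MarkerDefinition may carry more fields.
interface MarkerDefinition {
  description: string;
}

// Normalize a marker to a renderable string before it reaches JSX.
function markerText(marker: string | MarkerDefinition): string {
  return typeof marker === "string" ? marker : marker.description;
}
```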

TypeScript Set Iteration and downlevelIteration

  • The Problem: We tried to use the concise spread syntax with a Set for deduplication: [...new Set(items)]. This resulted in a TS2802 error from TypeScript, indicating that downlevelIteration was not enabled.
  • The Workaround: The immediate fix was to use Array.from(new Set(items)). Lesson: Be mindful of your TypeScript compiler targets and configuration. While modern JS features are great, sometimes explicit polyfills or alternative syntax are required for wider compatibility or specific build setups.
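Both forms produce the same result; the difference is purely in what the compiler will accept. A minimal version of the workaround:

```typescript
// Deduplicate while preserving first-seen order. Array.from(new Set(...))
// compiles cleanly even when the TypeScript target predates ES2015 and
// downlevelIteration is off, whereas [...new Set(items)] triggers TS2802.
function dedupe<T>(items: T[]): T[] {
  return Array.from(new Set(items));
}
```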

What's Next for Us?

With these significant improvements now live in production, we're already looking ahead:

  • Verify on Production: Our immediate next steps involve thorough verification of the new features in the production environment—confirming grouped results, accurate trend charts with time labels, and truncation-free responses under the increased token limits.
  • Cleaning Up History: We're considering adding a "purge old evals" button so users can clear out pre-fix evaluation records that still carry the erroneous Role: 0 scores.
  • Internationalization: The jailbreak refusal detection currently uses hardcoded English signals. Expanding this to support multiple languages will be crucial for broader application.
  • Future Features: Beyond evaluation, we're also tracking initiatives like our Rent-a-Persona API and enhancing database security with RLS policies.

This latest update marks a substantial leap forward in our ability to accurately evaluate and iterate on AI personas. By focusing on precision, developer experience, and clear insights, we're better equipped than ever to build the next generation of intelligent agents.