nyxcore-systems
6 min read

Unlocking Persona Scores: From Zero to Hero with Better AI Evaluation and UX

A late-night deep dive into fixing broken AI persona evaluation scores, enhancing user experience with real-time progress, and squashing persistent React errors. Learn how we re-architected our scoring logic and improved our deployment process.

TypeScriptReactNext.jsLLMDebuggingFrontendBackendUXPersonaEvaluation

Ever had one of those late-night coding sessions where you tackle a critical bug, roll out a significant UX improvement, and fix a stubborn frontend error all in one go? This week, that was our reality. Our goal was ambitious: fix our persona evaluation scoring, add real-time feedback for running evaluations, and resolve a pesky React rendering issue. And by the time the sun thought about rising, all three were deployed to production.

Let's dive into the journey.

The Case of the Mysterious Zero: Re-architecting Role Adherence Scoring

Our AI personas are evaluated on several metrics, but one, in particular, was causing a lot of head-scratching: "Role Adherence." For virtually every test, this score was hitting a flat zero, rendering our evaluations useless. This was a critical blocker, as understanding how well an LLM maintains its persona is paramount.

The Problem: Naive Word Matching Gone Wrong

The culprit was our scoreRoleAdherenceDeterministic() function in src/server/services/persona-evaluator.ts. Previously, it attempted to determine role adherence by:

  1. Splitting behavior and anti-pattern descriptions into individual words longer than 4 characters.
  2. Checking if any of these words appeared in the LLM's response.

This approach was fundamentally flawed. Words like "security," "without," "review," "input," and "output" are common in many responses, even when the persona is behaving correctly. However, these same words were also present in our anti-pattern descriptions. If a response contained "security" and "without," it would trigger multiple anti-pattern penalties (each -15 points), quickly plummeting the Role Adherence score to 0. It was a classic case of regex overreach for a semantic problem.

The Solution: Context-Aware and Hybrid Scoring

We needed a surgical rewrite. The new scoreRoleAdherenceDeterministic() is now test-type-aware, recognizing that different evaluation types require different detection strategies:

  • Jailbreak Tests: For these, we're not looking for complex behavioral patterns but rather clear signs of refusal and persona identity. The scoring now checks for:
    • Refusal signals: Keywords like "cannot," "won't," "decline," etc.
    • Persona identity maintenance: Presence of the persona's name in the response, use of domain-specific terms.
    • Absence of breach signals: Ensuring the LLM doesn't reveal its true nature (e.g., "I am an AI," system prompts). Crucially, we completely removed anti-pattern word matching for jailbreak tests.
  • Non-Jailbreak Tests: For more nuanced behavioral evaluations, we shifted strategy. Instead of checking every word, we now identify "longest distinctive words" (over 6 characters, stopwords filtered) from the positive behavior descriptions. Anti-pattern word matching was entirely removed here too. Anti-patterns are semantic; they need an LLM judge, not simple keyword detection. The new score is a blend: 70% behavior match, 15% name presence, and 15% domain term ratio.

Finally, we integrated this deterministic scoring with our LLM judge. Role adherence now blends 40% deterministic score with 60% LLM judge score (previously, it was 100% deterministic). This hybrid approach combines the speed and consistency of deterministic checks with the nuanced understanding of an LLM.

Lesson Learned: Simple keyword matching is often insufficient for complex semantic evaluation. Know when to use deterministic rules (e.g., specific refusal signals) and when to delegate to a more intelligent system like an LLM judge for nuanced behavioral assessment.

Enhancing UX: Real-time Evaluation Progress in the Sidebar

Running an evaluation can take time, and leaving users staring at a blank screen or wondering if their request went through is poor UX. We wanted to provide immediate, real-time feedback.

The Solution: Ephemeral Processes and a Shiny New UI

We leveraged our existing EphemeralProcess system, which is designed for short-lived, background tasks.

  1. We extended EphemeralProcess.type to include "evaluation".
  2. In src/components/layout/active-processes.tsx, we added a distinct visual identity for evaluations: a ShieldCheck icon, text-emerald-400 for color, and bg-emerald-500 for the progress bar.
  3. In src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx, we wired up the useEphemeralProcesses hook. When an evaluation mutation starts, we addEphemeral with the persona name and test type label. Upon success or error, removeEphemeral is called, ensuring the progress indicator only appears when relevant.

Now, when a user kicks off an evaluation, a clear, emerald-colored ShieldCheck icon appears in the sidebar, providing immediate confirmation and a sense of progress.

Commit: 3ace4bd feat: show running persona evaluations in sidebar progress section

Battling the React Beast: The "Objects are not valid as a React child" Error

Mid-evaluation, we hit a classic React error: "Objects are not valid as a React child." This usually means you're trying to render a JavaScript object directly into the DOM without telling React how to display it.

The Problem: Evolving Data Structures

The issue stemmed from a change in our Marker type. Previously, marker was a simple string. However, with our v2 updates, marker evolved into a MarkerDefinition object (e.g., { pattern, description, weight }). Our frontend component was still trying to render {m.marker} directly, which, when m.marker was an object, caused React to throw an error.

The Solution: Defensive Rendering

The fix was straightforward but critical:

tsx
// Inside EvalRow rendering logic:
{typeof m.marker === "string" ? m.marker : m.marker.description}

By adding a simple type check, we ensure that if m.marker is still a string (for older data or specific cases), it renders directly. If it's the new MarkerDefinition object, we explicitly render its description property.

Lesson Learned: Always anticipate schema evolution. Defensive rendering and robust type checking are your best friends when dealing with changing API contracts.

The Unseen Hurdles: Production Deployment Woes

No late-night session is complete without a deployment hiccup, right? After pushing our changes, we encountered a puzzling "Unexpected token '<'" error when trying to parse JSON on production.

The Problem: Stale Production Environment

The root cause was that our production container was running an older build. This meant that when our updated frontend tried to query tRPC for new v2 columns (which were part of the backend changes in the same deployment), the old backend code didn't understand the request. Instead of a JSON response, it served up an HTML error page, which the frontend then tried to parse as JSON, leading to the "Unexpected token '<'" (the start of an HTML tag).

The Solution: Rebuild and Redeploy

A quick rebuild and redeploy of the production container, ensuring it pulled the latest image with the updated backend code, resolved the issue.

Lesson Learned: Always double-check your deployment process, especially caching mechanisms and container versions, to ensure that the deployed code matches your expectations. The classic "it works on my machine" often points to environment discrepancies.

Looking Ahead: The Road From Here

With these fixes deployed, we're eagerly awaiting user feedback. We expect to see meaningful Role Adherence scores (definitely >0!) and a much smoother user experience thanks to the real-time progress indicators.

Our immediate next steps include:

  • Verification: Confirming corrected scores by re-running evaluations on production.
  • Cleanup: Considering a "purge old evals" button or marking pre-fix evaluations as stale, as old records will retain the broken scores.
  • Internationalization: Our jailbreak refusal detection currently uses hardcoded English signals. We'll need to consider i18n for personas interacting in other languages.
  • Consistency: Verifying that the discrepancy flag (REVIEW) appears less frequently now that deterministic and LLM scores are more aligned.

This session was a great reminder of the power of targeted debugging, thoughtful architectural changes, and the continuous effort required to deliver a robust and user-friendly product.