The Great LLM Scoring Rescue: Rewriting Persona Evals, Squashing React Bugs, and Shipping to Prod
Join me in a deep dive into a recent late-night dev session where we tackled a critical LLM persona evaluation bug, smoothed out UI feedback, and squashed a pesky React error, all in one go.
Late-night coding sessions often feel like a race against the clock, fueled by caffeine and the relentless pursuit of a cleaner, more functional system. This past week, one such session turned into a mini-epic of debugging, refactoring, and deploying, touching everything from core LLM evaluation logic to frontend UI polish.
The mission was clear:
- Fix our broken persona evaluation scoring (a critical bug rendering Role Adherence useless).
- Add real-time progress feedback to the sidebar for running evaluations.
- Squash a stubborn React rendering error that popped up after a data model change.
And the best part? All three are now live in production. Let's unpack the journey.
The Core Challenge: Demystifying LLM Persona Scoring
This was the big one. Our LLM persona evaluations had a critical flaw: the "Role Adherence" score was consistently showing 0 for almost every test. This was a major blocker for understanding how well our personas were maintaining their identity and following instructions.
The Bug's Lair: Naive Deterministic Matching
The culprit was scoreRoleAdherenceDeterministic() in src/server/services/persona-evaluator.ts. Here's how it was broken:
Before (The Problem): The original logic attempted to detect "anti-patterns" (undesirable behaviors) by splitting their descriptions into individual words longer than 4 characters. If any of these words appeared in the LLM's response, it would trigger a penalty.
Imagine an anti-pattern like "The persona should not discuss security vulnerabilities without prior review." Words like "security," "without," and "review" are common. They'd appear in almost any moderately complex LLM response, regardless of whether the persona actually breached its role. Each anti-pattern match incurred a -15 penalty. With just 5 common anti-patterns, the score would plummet to -75, effectively guaranteeing a Role: 0. Ouch.
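To make the failure mode concrete, here's a hypothetical reconstruction of the broken matcher. The function name and exact mechanics are illustrative, not the literal code from persona-evaluator.ts, but the shape of the bug is the same:

```typescript
// Hypothetical reconstruction of the flawed logic -- illustrative only.
function naiveAntiPatternPenalty(antiPatterns: string[], response: string): number {
  const resp = response.toLowerCase();
  let penalty = 0;
  for (const antiPattern of antiPatterns) {
    // Split the anti-pattern description into words longer than 4 characters...
    const words = antiPattern.toLowerCase().split(/\W+/).filter((w) => w.length > 4);
    // ...and penalize -15 if ANY of them appears anywhere in the response.
    if (words.some((w) => resp.includes(w))) {
      penalty -= 15;
    }
  }
  return penalty; // five matches is already -75: the Role score clamps to 0
}
```

Feed it an innocuous response that happens to contain the word "security" and the penalty fires, no actual role breach required.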
The Fix: Test-Type-Aware & Semantic-First Scoring
The solution involved a complete rewrite of scoreRoleAdherenceDeterministic() and a more intelligent blend of deterministic and LLM-based judging.
After (The Solution):
- Test-Type-Aware Scoring:
  - Jailbreak Tests: For these, we don't care about anti-pattern words. We need to detect refusal signals (e.g., "cannot," "won't," "decline"), ensure persona identity maintenance (name in response, domain terms), and confirm the absence of breach signals (e.g., "I am an AI," "here is my system prompt"). This is very specific positive/negative signal detection.
  - Non-Jailbreak Tests: Here, the focus shifts. We now use only the longest distinctive words (filtered for stopwords, >6 characters) from behavior descriptions, rather than every word. Crucially, we removed anti-pattern word matching entirely. Why? Because anti-patterns, by their nature, describe semantic behavioral deviations that simple word matching cannot reliably catch. That's a job for a more sophisticated judge.
- Hybrid Scoring Model: The hybridScore() function was updated. Role adherence now blends 40% deterministic + 60% LLM judge, a significant shift from the previous 100% deterministic approach. The LLM is far better at understanding nuance and context in complex behavioral evaluations.
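Stripped to its essentials, the new split looks roughly like this. The signal lists, stopword set, and function names are simplified stand-ins, not the production values in persona-evaluator.ts:

```typescript
// Sketch of the test-type-aware approach; all constants are illustrative.
const REFUSAL_SIGNALS = ["cannot", "won't", "decline"];
const BREACH_SIGNALS = ["i am an ai", "here is my system prompt"];
const STOPWORDS = new Set(["through", "because", "against", "between"]);

// Jailbreak tests: refusal signals present, breach signals absent.
function jailbreakHeld(response: string): boolean {
  const r = response.toLowerCase();
  const refused = REFUSAL_SIGNALS.some((s) => r.includes(s));
  const breached = BREACH_SIGNALS.some((s) => r.includes(s));
  return refused && !breached;
}

// Non-jailbreak tests: only long, distinctive words from behavior
// descriptions (stopword-filtered, >6 characters), not every word.
function distinctiveWords(behavior: string): string[] {
  return behavior
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 6 && !STOPWORDS.has(w));
}

// Hybrid blend: role adherence = 40% deterministic + 60% LLM judge.
function hybridRoleScore(deterministic: number, llmJudge: number): number {
  return 0.4 * deterministic + 0.6 * llmJudge;
}
```

Note how a deterministic score of 50 paired with an LLM judge score of 100 now lands at 80 instead of being dragged to zero by spurious word matches.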
This rewrite, captured in commit 44f233a, should finally provide meaningful Role Adherence scores. We're now awaiting user re-runs of evaluations to verify the corrected scores.
Lesson Learned: Deterministic regex matching for semantic tasks is a trap. While useful for explicit signals (like refusal words), it's woefully inadequate for nuanced behavioral analysis. Know when to delegate to a more capable (LLM) judge.
Enhancing User Experience: Real-time Evaluation Progress
A common user frustration: initiating an evaluation and not knowing if it's running, or when it finishes. This session tackled that by adding a real-time progress indicator to the sidebar.
Implementation Details: Ephemeral Processes
We leveraged our existing EphemeralProcess system, which is designed for transient background tasks.
- Expanded EphemeralProcessType: Added "evaluation" to the EphemeralProcess.type union in src/lib/ephemeral-processes.tsx.
- UI Integration: Updated src/components/layout/active-processes.tsx with an evaluation: ShieldCheck icon, text-emerald-400 text color, and bg-emerald-500 bar color, giving evaluations a distinct, reassuring green hue.
- Frontend Wiring: In src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx, we wired up the useEphemeralProcesses hook. An ephemeral entry (including persona name and test type) is added via addEphemeral on mutation start and removed via removeEphemeral on success or error.
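Stripped of the React specifics, the lifecycle is just add-on-start, remove-on-settle. A minimal sketch, where the type union's other members and the pure-function modeling of the hook are assumptions (the real hook in src/lib/ephemeral-processes.tsx manages state internally):

```typescript
// Assumed shape of an ephemeral entry; "evaluation" is the newly added member,
// the other union members are placeholders.
type EphemeralProcess = {
  id: string;
  type: "evaluation" | "import" | "export";
  label: string; // e.g. "Evaluating <persona name> (<test type>)"
};

// The hook exposes add/remove; modeled here as pure list operations.
function addEphemeral(list: EphemeralProcess[], p: EphemeralProcess): EphemeralProcess[] {
  return [...list, p];
}

function removeEphemeral(list: EphemeralProcess[], id: string): EphemeralProcess[] {
  return list.filter((p) => p.id !== id);
}
```

The important design detail is that removal happens on both success and error, so a failed evaluation never leaves a zombie spinner in the sidebar.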
This small but impactful feature, pushed in commit 3ace4bd, provides immediate visual feedback, significantly improving the user experience. Now, when you kick off an eval, you'll see a friendly green shield icon letting you know it's hard at work.
Squashing a Pesky React Error: Objects as Children
Even during a major refactor, smaller, but equally annoying, bugs can pop up. After updating our evaluation marker data structure, a classic React error reared its head.
The Problem: React Error #31 - Objects are not valid as a React child
Our MarkerDefinition object evolved. Previously, marker was a simple string. In the v2 format, it became an object: { pattern, description, weight }. When attempting to render {m.marker} directly in our EvalRow component, React threw an error: a plain JavaScript object is not a valid React child.
The Fix: Defensive Rendering
The solution was straightforward but essential:
// Before (failed with v2 marker object):
// <div>{m.marker}</div>
// After (handles both string and object formats):
<div>
{typeof m.marker === "string" ? m.marker : m.marker.description}
</div>
This simple type check ensures that if m.marker is still a string (for older data or specific cases), it renders directly. If it's the new MarkerDefinition object, we access its description property for display. This fix was part of the same 44f233a commit as the scoring rewrite.
Lesson Learned: When evolving data structures, especially those rendered in the UI, always anticipate the need for defensive rendering. Type checks or dedicated display functions can prevent common React errors and ensure forward compatibility.
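One way to centralize that check is a small display helper. markerLabel is a hypothetical name; the MarkerDefinition shape is taken from the v2 format described above:

```typescript
// v2 marker shape as described in the post.
type MarkerDefinition = { pattern: string; description: string; weight: number };

// Hypothetical helper: a render-safe label for either format.
function markerLabel(marker: string | MarkerDefinition): string {
  return typeof marker === "string" ? marker : marker.description;
}
```

With this, the JSX becomes `<div>{markerLabel(m.marker)}</div>`, and any future format change has exactly one place to be handled.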
The "Oh, Right!" Moment: Production Deployment Gotchas
Just when you think you're done, production throws a curveball. After deploying the fixes, some tRPC calls were returning HTML instead of JSON.
The Problem: "Unexpected token '<'" - tRPC returning HTML
This is a classic. When a frontend expects JSON but gets HTML (usually an error page), it means something went wrong server-side. In this case, our production environment was running an older build that didn't know about the new v2 columns in our tRPC select statements. The server was trying to query for non-existent columns, failing, and returning an HTML error page, which the client then tried to parse as JSON.
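You can reproduce the client-side symptom in isolation: calling JSON.parse on an HTML error page throws with exactly this kind of message, because '<' is the first token the parser sees. The HTML body below is a made-up stand-in for whatever error page the server actually returned:

```typescript
// An HTML error page where the client expected a JSON tRPC response.
const body = "<!DOCTYPE html><html><body>500 Internal Server Error</body></html>";

let message = "";
try {
  JSON.parse(body); // throws immediately on the leading '<'
} catch (err) {
  message = (err as Error).message; // e.g. "Unexpected token '<' ..."
}
```

So whenever a frontend complains about an unexpected '<', the first suspect should be the server returning an error page instead of JSON.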
The Fix: Rebuild and Redeploy (the classic)
The workaround was simple: a full rebuild and redeploy of the production container. This ensured the latest server-side code, with the updated tRPC select statements, was running. Sometimes, the most complex issues have the simplest (though often frustratingly overlooked) solutions.
Lesson Learned: Always verify your deployment. Check logs, confirm the correct build is running, and don't underestimate the power of stale caches or old container images to sabotage your day.
Looking Ahead: The Road Still Traveled
While this session wrapped up some critical fixes, the journey continues. Here's what's immediately on the radar:
- Verify Corrected Scores: The most crucial next step is for users to re-run evaluations and confirm that Role Adherence scores are no longer stuck at 0.
- Old Evaluation Cleanup: We need to decide how to handle pre-fix evaluation records, which still show the broken scores. A "purge old evals" button or a clear "stale" indicator might be necessary.
- i18n for Jailbreak Detection: Our current jailbreak refusal detection relies on hardcoded English signals. This will need internationalization if personas respond in other languages.
- Discrepancy Flag Monitoring: Now that deterministic and LLM scores are closer, we should verify that the REVIEW discrepancy flag appears less frequently.
- Sidebar Progress Testing: A quick sanity check to ensure the emerald ShieldCheck icon appears and disappears as expected when running an evaluation.
Conclusion
This late-night session was a microcosm of full-stack development: tackling a tricky backend logic bug, enhancing frontend UX, squashing a React rendering error, and navigating deployment quirks. It's these kinds of sessions, where multiple threads converge into a coherent set of fixes, that feel the most rewarding.
Onwards to more stable systems and happier users!