Putting Our AI Personas to the Test: Building a Robust LLM Evaluation System

In the rapidly evolving world of large language models, ensuring the consistent, safe, and high-quality performance of our AI personas is paramount. It's not enough to simply prompt an LLM; we need to rigorously test, evaluate, and benchmark them to prevent degradation, guard against misuse, and optimize their behavior. That's precisely the challenge we tackled head-on, culminating in a brand-new persona evaluation and benchmarking system that's now live and kicking.

Our goal was ambitious: create a system capable of running various tests on our LLM personas, scoring their responses using an LLM-as-Judge paradigm, and visualizing the trends over time in an intuitive dashboard. We wanted to answer critical questions like: How stable is a persona's output across different temperature settings? Can it be easily jailbroken? Does its performance degrade with very long contexts?

The Core Challenge: Robust LLM Persona Evaluation

To truly understand our personas, we needed a multi-faceted approach. Our new system implements several key evaluation types:

Temperature A/B Testing: This allows us to compare how a persona responds at different temperature settings, helping us fine-tune the balance between creativity and consistency.
Jailbreak Testing: A critical security measure. We developed a suite of four common jailbreak attacks to probe for vulnerabilities, ensuring our personas remain aligned with their intended purpose and safety guidelines.
Degradation Testing: LLMs can sometimes struggle with extremely long contexts. Our degradation test pushes the limits with inputs exceeding 3000 words, verifying that our personas maintain coherence and accuracy even under heavy load.

The scoring mechanism is where things get really interesting. We implemented an LLM-as-Judge system, where a separate, more robust LLM evaluates the persona's responses against predefined criteria, assigning scores for adherence, safety, and quality. This provides a nuanced, automated way to assess performance beyond simple keyword matching.

Finally, all these evaluations needed to be digestible. We envisioned a trend dashboard, powered by Recharts, to visualize performance over time, highlight anomalies, and track improvements.

Bringing It To Life: An Implementation Deep Dive

Building this system involved touching various parts of our stack, from database schema to UI.

The Foundation: Data Models and Security

We started by laying the groundwork in our database. A new PersonaEvaluation model was added to prisma/schema.prisma, establishing relationships with existing Persona and Tenant models. Crucially, we implemented Row-Level Security (RLS) in prisma/rls.sql to ensure strict tenant isolation, a must-have for multi-tenant applications. This means each tenant can only see their own persona evaluations, maintaining data privacy and integrity.

The Brain: Our Evaluation Engine

The heart of the system lives in src/server/services/persona-evaluator.ts. This service orchestrates all the tests:

runTemperatureTest(): Manages A/B comparisons.
runJailbreakTest(): Executes the four attack vectors.
runDegradationTest(): Handles the extensive context challenges.
judgeResponse(): The LLM-as-Judge function, taking a prompt and response, and returning a structured evaluation from a separate LLM.

The API: Connecting Front-end to Back-end

To expose these capabilities, we added three new tRPC procedures to src/server/trpc/routers/personas.ts:

runEvaluation: Kicks off a new evaluation for a given persona.
evaluationHistory: Provides a cursor-paginated list of past evaluation results.
evaluationTrend: Aggregates evaluation scores over a 90-day period, limited to 500 data points for efficient charting, perfect for our dashboard.

The Face: A Dynamic Dashboard

The user interface, src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx, is where all this data comes to life. We built a comprehensive dashboard featuring:

A Recharts AreaChart to visualize performance trends over time.
Expandable result rows, allowing users to drill down into specific evaluation runs.
Detailed views of each test, including specific violations, markers, the original prompt, and the persona's response.
A Radix DropdownMenu for user actions, making the interface clean and interactive.

We also integrated a prominent "Evaluations" button on each persona's main page (src/app/(dashboard)/dashboard/personas/[id]/page.tsx), making the new functionality easily accessible.

After rigorous code reviews and applying crucial fixes—like ensuring tenant isolation across layers, correct Prisma type casting, robust cursor null-safety, and bounded queries—we ran our first live jailbreak test on our "Cael" persona. The results, showing 4 distinct jailbreak attempts and their scores, immediately populated the database and were visible in the dashboard. Success!

Navigating the Bumps: Lessons Learned Along the Way

No development journey is without its challenges. We hit a couple of interesting snags that provided valuable lessons:

Challenge 1: Script Execution Contexts

Initially, I tried running a test script from /tmp/run-jailbreak-test.ts. This seemed like a quick way to test, but tsx (our TypeScript execution environment) couldn't resolve relative module imports like '../src/server/services/persona-evaluator' from outside the project root.

Lesson Learned: When writing helper scripts that interact with your project's codebase, it's best practice to place them within the project's directory structure (e.g., scripts/). This ensures consistent module resolution paths and avoids unexpected Cannot find module errors.

Challenge 2: The Elusive tRPC Route & Next.js HMR

After adding the new tRPC procedures, I fired up the dev server and navigated to the dashboard. To my dismay, it displayed "No evaluation data yet," even though the code was correct. The tRPC queries were returning empty.

Root Cause: While Next.js's Hot Module Replacement (HMR) is fantastic, it doesn't always reliably pick up new tRPC router definitions. If the dev server was started before the new routes were added, HMR might not re-initialize the tRPC router with the latest procedures.

Lesson Learned: Whenever you introduce entirely new tRPC procedures or make significant structural changes to your tRPC router, perform a full dev server restart. Our ./scripts/dev-start.sh script, which kills the port, clears the .next cache, regenerates Prisma, and restarts the server, proved to be the reliable solution. Data appeared immediately after this full restart. Save yourself some debugging time!

What's Next? Pushing the Boundaries

With the core system feature-complete and tested, our immediate next steps involve some cleanup and expansion:

Final Polish: Committing the last debug log removals.
Database Hardening: Applying the RLS policy to our running database with psql -f prisma/rls.sql.
Persona Improvement: Critically, we'll use Cael's jailbreak test results (scores of 29 and 10 on certain attacks indicate vulnerabilities) to harden its system prompt and make it more robust.
Full Benchmarking: Running temperature and degradation tests on Cael to fully populate its trend chart, and then extending these benchmarks to our other personas like Lee, Morgan, and Sage, validating their persona-specific markers.
Future Enhancements: Looking ahead, we're considering using Server-Sent Events (SSE) for our runAllTests function. With 14 LLM calls, these tests can take 70-210 seconds, risking HTTP timeouts. SSE would allow us to stream real-time progress updates to the UI, greatly enhancing the user experience.

This journey has been incredibly rewarding, providing us with a powerful tool to ensure our AI personas are not just intelligent, but also reliable, safe, and performant. We're excited to leverage this system to build even better AI experiences!