Benchmarking AI Personas: Building a Robust Evaluation System from Scratch
Discover how we built a comprehensive system to rigorously evaluate LLM personas, from A/B testing temperatures to detecting jailbreaks, all powered by an LLM-as-Judge and visualized in a dynamic dashboard.
Every developer working with large language models knows the challenge: how do you ensure your carefully crafted AI personas maintain their integrity, performance, and security over time? How do you measure the impact of a prompt tweak or a model update? Manual testing is tedious, unscalable, and prone to human error.
This was the exact problem we set out to solve: building a robust, automated persona evaluation and benchmarking system. And I'm thrilled to share that we've just reached a major milestone: our new persona evaluation system is feature-complete and has successfully run its first live tests!
The Quest for Reliable Persona Evaluation
Our goal was ambitious: create a system that could rigorously test LLM personas across multiple dimensions and provide actionable insights. This meant developing capabilities for:
- A/B Temperature Testing: Understanding how different `temperature` settings influence a persona's output, finding that sweet spot between creativity and coherence.
- Jailbreak Detection: A critical security and integrity measure. Can our personas be tricked into violating their guidelines or roles? We needed a way to proactively identify and mitigate these risks.
- Degradation Testing: Ensuring performance doesn't suffer with complex, long-context inputs (e.g., 3000+ words). LLMs can sometimes lose their "memory" or coherence under heavy load.
- LLM-as-Judge Scoring: Automating the evaluation process by having another LLM score the persona's responses against predefined criteria. This is a game-changer for scalability and consistency, moving beyond subjective human review for initial passes.
- Trend Dashboard: Visualizing evaluation results over time with Recharts to quickly spot performance shifts and regressions at a glance.
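To make the first capability concrete, here is a minimal sketch of an A/B temperature sweep. The `generate` parameter stands in for whatever LLM client the project actually uses; its name and signature are assumptions, not code from the real system.

```typescript
// Hypothetical sketch: run the same prompt at several temperature settings
// so outputs can be compared side by side. `generate` is a stand-in for the
// real LLM client and is an assumption of this example.
type TempResult = { temperature: number; output: string };

async function sweepTemperatures(
  prompt: string,
  temperatures: number[],
  generate: (prompt: string, temperature: number) => Promise<string>,
): Promise<TempResult[]> {
  const results: TempResult[] = [];
  for (const temperature of temperatures) {
    // Same prompt every time, so any variation comes from temperature alone.
    results.push({ temperature, output: await generate(prompt, temperature) });
  }
  return results;
}
```

The pairs this returns can then be handed to a judge or a human for side-by-side comparison.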
Under the Hood: The Technical Journey
Bringing this vision to life involved touching several parts of our stack. Here's a glimpse into the key components we built and integrated:
- Data Model with Prisma: We introduced a new `PersonaEvaluation` model in `prisma/schema.prisma`, establishing clear relations to our existing `Persona` and `Tenant` models. This ensures all evaluation data is structured and linked correctly for easy querying and analysis.
- Robust Security with RLS: For multi-tenant applications, data isolation is paramount. We implemented Row-Level Security (RLS) in `prisma/rls.sql`, adding a `tenant_isolation` policy for `persona_evaluations` to ensure each tenant only sees their own sensitive evaluation data.
- The Evaluation Engine (`src/server/services/persona-evaluator.ts`): This is the brain of the operation. It houses our core test runners:
  - `runTemperatureTest()`: For systematically A/B testing temperature variations.
  - `runJailbreakTest()`: With multiple attack vectors (we started with 4 common jailbreak attempts).
  - `runDegradationTest()`: Pushing context limits with 3000+ word inputs to simulate real-world complex interactions.
  - Crucially, `judgeResponse()` leverages an LLM-as-Judge to automatically score the persona's output against predefined rules and identify violations, providing objective feedback at scale.
- API Endpoints with tRPC (`src/server/trpc/routers/personas.ts`): We exposed three new, strongly typed endpoints for seamless frontend interaction:
  - `runEvaluation`: To trigger an evaluation for a specific persona.
  - `evaluationHistory`: A cursor-paginated endpoint to fetch past evaluation results efficiently, handling large datasets without performance bottlenecks.
  - `evaluationTrend`: An aggregated endpoint providing 90 days of data, perfectly formatted for our charting component.
- The Dashboard Experience (`src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx`): The user interface brings all the data to life.
  - A dynamic Recharts `AreaChart` displays performance trends over time, allowing for quick visual assessment of persona health.
  - Expandable result rows enable users to drill down into specific evaluations, revealing violations, markers, the original prompt, and the persona's exact response for detailed investigation.
  - A Radix `DropdownMenu` provides intuitive actions and filters for navigating the data.
- Seamless Integration: A new "Evaluations" button was added to the main persona detail page, making it easy for users to access the new dashboard directly from their persona management workflow.
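The post doesn't show the internals of `judgeResponse()`, but the idea can be sketched as follows. This assumes the judge model is prompted to reply with JSON containing a 0-100 score and a list of violations; the `callJudge` parameter and the exact prompt wording are hypothetical, not taken from the real service.

```typescript
// Hypothetical sketch of an LLM-as-Judge scoring step. Assumes the judge is
// asked to reply with JSON like {"score": <0-100>, "violations": [...]};
// `callJudge` stands in for the real LLM client.
type Verdict = { score: number; violations: string[] };

async function judgeResponseSketch(
  personaOutput: string,
  rules: string[],
  callJudge: (judgePrompt: string) => Promise<string>,
): Promise<Verdict> {
  const judgePrompt = [
    "Score this response from 0 to 100 against the rules below.",
    `Rules:\n${rules.map((r) => `- ${r}`).join("\n")}`,
    `Response:\n${personaOutput}`,
    'Reply with JSON only: {"score": <number>, "violations": [<strings>]}',
  ].join("\n\n");

  const parsed = JSON.parse(await callJudge(judgePrompt)) as Verdict;
  // Clamp defensively: judge models occasionally return out-of-range scores.
  return {
    score: Math.min(100, Math.max(0, parsed.score)),
    violations: parsed.violations ?? [],
  };
}
```

In practice a step like this also needs retry and parse-failure handling, since judge models don't always return clean JSON.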
Throughout the development, we also applied several crucial code review fixes, ensuring RLS policies were correctly applied, tenant isolation was robust in the service layer, proper type casting (`Prisma.InputJsonValue`) was used, cursor null-safety was handled, and trend queries were bounded (90 days + `take` 500) for optimal performance.
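The bounded trend query mentioned above might look roughly like this Prisma-style sketch. The model and field names (`personaEvaluation`, `createdAt`) are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical sketch of a bounded trend query in the Prisma style described
// above. Model and field names are assumptions for illustration.
const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

type FindManyArgs = {
  where: { personaId: string; createdAt: { gte: Date } };
  orderBy: { createdAt: "asc" };
  take: number;
};

async function evaluationTrend(
  db: { personaEvaluation: { findMany: (args: FindManyArgs) => Promise<unknown[]> } },
  personaId: string,
) {
  return db.personaEvaluation.findMany({
    where: {
      personaId,
      // Time bound: the chart query can never scan unbounded history.
      createdAt: { gte: new Date(Date.now() - NINETY_DAYS_MS) },
    },
    orderBy: { createdAt: "asc" },
    take: 500, // hard row cap, matching the "90 days + take 500" bound
  });
}
```

Bounding by both time window and row count means a pathological tenant with thousands of evaluations still gets a fast, predictable chart query.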
Lessons Learned: Navigating the Development Hurdles
No development journey is without its bumps. We encountered a couple of interesting challenges that offered valuable lessons:
- Module Resolution in Ad-Hoc Scripts:
  - The Problem: We initially tried running a temporary test script from `/tmp/run-jailbreak-test.ts` to quickly validate our evaluation engine. This failed with a `Cannot find module '../src/server/services/persona-evaluator'` error.
  - The Cause: `tsx` (our TypeScript execution environment) couldn't resolve relative paths from an arbitrary `/tmp` directory back into our project's `src` folder. The execution context simply didn't know where to look.
  - The Solution: Moving the script into a dedicated `scripts/` directory within the project root immediately resolved the issue, as `tsx` could then correctly infer the project structure.
  - Lesson: Always be mindful of your execution context and how module resolvers interpret relative paths, especially when running scripts outside the typical application flow. Project-relative paths are generally safer than attempting to infer paths from temporary locations.
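The failure is easy to see with plain path arithmetic: a relative specifier is resolved against the importing file's directory, not the project root, so `../src/...` from `/tmp` points somewhere that doesn't exist. (The `/project` root below is illustrative.)

```typescript
import path from "node:path";

// A relative specifier like '../src/...' resolves against the importing
// file's directory, not the project root.
const fromTmp = path.resolve("/tmp", "../src/server/services/persona-evaluator");
// -> "/src/server/services/persona-evaluator"  (outside the project: module not found)

const fromScripts = path.resolve("/project/scripts", "../src/server/services/persona-evaluator");
// -> "/project/src/server/services/persona-evaluator"  (lands back inside the project)
```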
- tRPC Routes and Next.js HMR:
  - The Problem: After adding our new tRPC procedures (`runEvaluation`, `evaluationHistory`, `evaluationTrend`), the dashboard initially showed "No evaluation data yet," even though the page rendered without errors. The tRPC queries were simply returning empty.
  - The Root Cause: Next.js's Hot Module Replacement (HMR) for server-side code (like tRPC router definitions) isn't always reliable when new procedures are added. It often refreshes existing modules but doesn't necessarily re-evaluate the entire router setup.
  - The Solution: A full dev server restart via `./scripts/dev-start.sh` (which kills the port, clears the `.next` cache, regenerates Prisma, and restarts) instantly brought the data to life.
  - Lesson: When adding new tRPC router procedures, always perform a full dev server restart. Relying solely on HMR can lead to frustrating "missing data" bugs that aren't immediately obvious, wasting valuable debugging time. This is a crucial reminder for anyone working with Next.js and tRPC!
First Light: Live Test Success!
After resolving these issues and cleaning up temporary debug logs, we ran our first live jailbreak test on our "Cael" persona. The results were instantly visible in the dashboard UI! Two of the four attack vectors scored 29 and 10 respectively, clearly indicating areas where Cael's system prompt needs immediate hardening. This immediate feedback loop, from test execution to actionable insights, is exactly what we aimed for.
What's Next on the Benchmarking Horizon?
With the core system in place, our immediate next steps are:
- Commit the final cleanup (debug log removal) and push the code to our main branch.
- Apply the RLS policies to our running database to ensure production-ready security.
- Harden Cael's system prompt against the identified jailbreak attacks.
- Run the temperature and degradation tests on Cael to fully populate its trend chart and get a complete performance picture.
- Expand benchmarking to our other personas (Lee, Morgan, Sage) to validate persona-specific markers and ensure consistent quality across the board.
- Investigate SSE streaming for our `runAllTests` endpoint, as 14 LLM calls can lead to long wait times (70-210 seconds) and potential HTTP timeouts, impacting user experience.
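One way the streaming idea could take shape is an async generator that yields each test's result as soon as its LLM call resolves, rather than awaiting all of them before responding. The `TestRunner` type and function names here are hypothetical, sketching the shape rather than the eventual implementation.

```typescript
// Hypothetical sketch: yield per-test results as they finish so the client
// can render progress instead of waiting out one 70-210 second request.
// `TestRunner` and the result shape are assumptions of this example.
type TestRunner = () => Promise<{ name: string; score: number }>;

async function* runAllTestsStreaming(tests: TestRunner[]) {
  for (const test of tests) {
    // Each yield could be flushed to the client as one SSE event.
    yield await test();
  }
}
```

An SSE transport (or tRPC subscription) would then forward each yielded result as an event, keeping the connection alive across the long-running batch.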
This new evaluation system is a massive leap forward in ensuring the quality, security, and consistent performance of our LLM personas. We're incredibly excited to leverage this tool to build even more reliable and trustworthy AI experiences, confident that our personas will always perform as intended.