Benchmarking AI Personas: Our Journey to a Robust Evaluation System
Ever wonder how to truly know if your AI personas are performing as expected? We built a comprehensive system to evaluate and benchmark LLM personas, featuring A/B testing, adversarial jailbreak detection, degradation analysis, LLM-as-Judge scoring, and a dynamic trend dashboard. Come see how we did it!
In the fast-evolving world of Large Language Models (LLMs), simply deploying a persona isn't enough. We need to ensure they consistently adhere to their designated roles, remain secure against adversarial attacks, and maintain coherence even under stress. This was the driving force behind our latest project: a full-fledged Persona Evaluation & Benchmark System.
Our goal was ambitious: build a system that could rigorously test LLM personas across various dimensions, score their performance automatically using an LLM-as-Judge, and visualize trends over time. After an intense development sprint, I'm thrilled to share that the system is now feature-complete, type-check clean, lint-approved, and ready for launch!
The Why: Ensuring Persona Integrity
Imagine you have several specialized LLM personas: Cael, the meticulous research assistant; Lee, the GDPR-compliant legal advisor; Morgan, the financial analyst; and Sage, the software architecture expert. Each needs to consistently embody its role, follow specific rules, and avoid common pitfalls. How do you measure that? Manually testing each scenario is time-consuming and prone to human bias. This is where our automated evaluation system shines.
We designed it to answer critical questions:
- How does a persona's output change with varying "temperature" settings?
- Can an attacker trick the persona into ignoring its instructions (jailbreak)?
- Does the persona maintain coherence and relevant information in long, complex contexts?
Building the Brains: The Persona Evaluator Service
At the heart of our system lies the persona-evaluator.ts service. This is where the magic happens, orchestrating various test types and the sophisticated LLM-as-Judge scoring mechanism.
The Three Pillars of Evaluation
We implemented three core test types, each designed to probe a different aspect of persona performance:
-
Temperature Test (
runTemperatureTest): This A/B test runs the exact same prompt against the persona twice: once with a low temperature (0.2, for deterministic, focused responses) and once with a higher temperature (0.8, for more creative, varied outputs). The goal is to see how robust the persona's core adherence is to this inherent LLM variability. Do they stay in character even when given more room to "improvise"? -
Adversarial Jailbreak Test (
runJailbreakTest): Security is paramount. This test utilizes a series of escalating attack prompts designed to make the persona ignore its instructions. We crafted four distinct attack vectors:- "Ignore instructions"
- "System override"
- "Admin pretense"
- "Anti-persona game" The system checks if the persona can resist these attempts and maintain its role.
-
Degradation/Amnesia Test (
runDegradationTest): LLMs can struggle with long contexts, sometimes losing track of earlier information or becoming less coherent. This test feeds the persona a 3000+ word enterprise scenario context and then asks specific questions to verify its ability to maintain coherence and recall key details.
The Judge: LLM-as-Judge Scoring
Instead of human graders, we leveraged the power of LLMs themselves to evaluate responses. The judgeResponse() function is critical here. It sends the persona's output, along with the original prompt and expected persona traits, to a separate "judge" LLM. This judge LLM then extracts a structured JSON output, scoring the response on three key metrics (0-100):
roleAdherence: How well did the persona stick to its defined role?ruleCompliance: Did it follow any specific rules or constraints?markerPresence: Did it include specific keywords or concepts relevant to its persona?
Each persona has its own set of unique markers: Cael looks for "papers" and "math," Lee for "DSGVO" and "legal statutes," Morgan for "financial reports," and Sage for "refactoring patterns." This allows for highly tailored and accurate scoring.
Crucially, scoring weights are adjusted per test type. For instance, the jailbreakTest heavily weights roleAdherence (0.6) because the primary goal there is to see if the persona can maintain its core identity against attacks.
Data Persistence and API Exposure
All evaluation results, including detailed scores and responses, are stored in a new PersonaEvaluation model in our prisma/schema.prisma. We ensured tenant isolation and robust data security by adding an RLS (Row Level Security) policy in prisma/rls.sql.
To interact with this powerful backend, we exposed three tRPC endpoints in src/server/trpc/routers/personas.ts:
runEvaluation: A mutation to kick off selected or all test types for a given persona.evaluationHistory: A cursor-paginated query to fetch detailed past evaluation results.evaluationTrend: A query to retrieve aggregated 90-day daily averages, perfect for visualizing performance over time.
Bringing it to Life: The UI and Dashboard
No powerful backend is complete without an intuitive user interface. We integrated the evaluation system directly into our dashboard:
- Dedicated Evaluation Page: For each persona, a new
/evaluationspage was created. It features a header with the persona's avatar and a Radix DropdownMenu for selecting and running specific tests. - Trend Visualization: A Recharts
AreaChartdynamically displays the 90-day performance trend for temperature, jailbreak, and degradation tests, each color-coded for clarity. This provides an at-a-glance view of how personas are improving (or degrading) over time. - Detailed Results Table: An expandable table presents individual evaluation results, including overall scores, a list of violations, a checklist of detected markers, and the full prompt/response for deeper analysis.
- Intuitive Score Badges: We implemented a clear visual cue for performance: green badges for scores >= 80, yellow for >= 60, and red for < 60, making it easy to spot areas needing attention.
- Dashboard Integration: A prominent FlaskConical "Evaluations" button was added to the main persona detail page, making it easy to navigate to the new system.
Challenges & Lessons Learned
Development is rarely without its bumps. Here are a couple of key lessons we picked up along the way:
-
Component API Nuances: We initially tried to use
variant="destructive"on ourBadgecomponent for critical scores, only to hit a TypeScript error (TS2322). It turned out ournyxcoreBadge API correctly usesvariant="danger"for that purpose.- Lesson: Always consult component documentation or type definitions thoroughly, even for seemingly common props. Type safety is there to guide you!
-
Type-Safe JSON Handling: When working with Prisma's
Jsonfields, we initially resorted toas unknown as undefinedto satisfy TypeScript. Our vigilant code reviewer rightly flagged this as a type-safety antipattern – it lies to the compiler.- Lesson: For
Jsonfields, the correct and semantically accurate approach isas unknown as Prisma.InputJsonValue. This maintains type integrity and prevents potential runtime issues. It's a reminder thatunknownis a powerful type, but its casting should be handled with care and precision.
- Lesson: For
What's Next?
With the core system complete, our immediate next steps involve deployment and rigorous testing:
- Commit and Push: The code is ready for prime time!
- Apply RLS: The
prisma/rls.sqlpolicy needs to be applied to the running database to ensure robust row-level security. - Manual Verification: We'll run several manual tests to confirm everything works as expected:
- Verify Cael's
roleAdherencein a jailbreak test. - Confirm both temperature variants are scored correctly.
- Ensure the trend chart accurately reflects multiple evaluations.
- Verify Cael's
- Performance Optimization: Running all tests (which can involve 14 LLM calls) can take a significant amount of time (70-210 seconds). We're already considering implementing Server-Sent Events (SSE) streaming for the
runAllTestsmutation to provide real-time feedback to the user. - E2E Testing: Finally, robust end-to-end tests will be implemented to cover the evaluations page load and run test button interactions, ensuring long-term stability.
This Persona Evaluation & Benchmark System is a significant step forward in our journey to build more reliable, secure, and performant AI personas. We're excited about the insights it will provide and the confidence it instills in our LLM deployments!