Battle-Testing Our LLM Personas: Building a Comprehensive Evaluation & Benchmarking System
We just shipped a critical system for evaluating our LLM personas, covering everything from temperature consistency to jailbreak resilience, all powered by an LLM-as-Judge and visualized in a live dashboard. Here's a deep dive into its architecture and the invaluable lessons learned along the way.
Ensuring the quality, consistency, and safety of Large Language Model (LLM) personas isn't just a nice-to-have; it's a fundamental requirement for any serious AI-driven application. How do you know if your "Customer Support Bot" persona is truly helpful and not susceptible to prompt injection? How do you track its performance over time as models evolve or prompts are tweaked?
These were the questions driving our latest development push. We needed a robust, automated system to evaluate our LLM personas across multiple dimensions and provide actionable insights. I'm excited to share that we've just brought such a system to life, feature-complete and already yielding results from its first live tests.
The Mission: A Holistic Persona Evaluation System
Our goal was clear: build a comprehensive persona evaluation and benchmarking system. This wasn't just about simple prompt-response checks; it had to be sophisticated enough to:
- A/B Test Temperature: Understand how different temperature settings affect response variability and quality.
- Detect Jailbreaks: Proactively identify vulnerabilities to prompt injection and other adversarial attacks.
- Measure Degradation: Assess performance under stress, particularly with large contexts (e.g., 3000+ words).
- Automate Scoring: Leverage an LLM-as-Judge to provide objective, scalable evaluations.
- Visualize Trends: Offer a clear, interactive dashboard to track persona performance over time.
After an intense development sprint, we've hit "feature complete." Our first live jailbreak test on the "Cael" persona successfully ran, and its results are already illuminating our dashboard. We're now ready for final cleanup and deployment.
Under the Hood: The Architecture of Our Evaluation Engine
Let's dive into the technical details of what we built during this session.
1. Data Model & Security with Prisma and RLS
The foundation of any data-driven system is its schema. We introduced a new PersonaEvaluation model in prisma/schema.prisma:
```prisma
model PersonaEvaluation {
  id            String   @id @default(uuid())
  createdAt     DateTime @default(now())
  updatedAt     DateTime @updatedAt
  tenantId      String
  personaId     String
  testType      String   // e.g., "jailbreak", "temperature", "degradation"
  testName      String   // e.g., "attack_1", "high_temp_consistency"
  prompt        String
  response      String
  judgePrompt   String?  // Prompt used for the LLM-as-Judge
  judgeResponse String?  // Raw response from the LLM-as-Judge
  score         Int?     // Numerical score from the judge
  violations    Json?    // JSON array of specific violations detected
  markers       Json?    // JSON array of other relevant markers
  model         String?
  temperature   Float?
  maxTokens     Int?

  persona Persona @relation(fields: [personaId], references: [id])
  tenant  Tenant  @relation(fields: [tenantId], references: [id])

  @@index([tenantId])
  @@index([personaId])
}
```
This model captures all the necessary details for each evaluation run. Crucially, as a multi-tenant application, data isolation is paramount. We immediately added a Row-Level Security (RLS) policy in prisma/rls.sql to ensure tenant_isolation for persona_evaluations, preventing data leaks between tenants.
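RLS is the database-level backstop; as defense in depth, the service layer also filters every query by `tenantId` (covered again in the hardening section below). Here's a minimal sketch of that pattern — the `scopedWhere` helper name and its shape are illustrative, not the project's actual code:

```typescript
// Sketch of a service-layer guard: every evaluation query is forced to carry
// the caller's tenantId, so even if RLS were misconfigured, the application
// layer would still refuse cross-tenant reads.
// `scopedWhere` is a hypothetical helper, not the real project code.

type EvaluationWhere = { tenantId?: string; personaId?: string };

function scopedWhere(tenantId: string, where: EvaluationWhere = {}): EvaluationWhere {
  if (where.tenantId && where.tenantId !== tenantId) {
    // A caller-supplied filter naming a different tenant is always a bug.
    throw new Error("Cross-tenant query rejected");
  }
  return { ...where, tenantId }; // the session's tenantId always wins
}

// Usage (illustrative):
// prisma.personaEvaluation.findMany({ where: scopedWhere(ctx.tenantId, { personaId }) })
```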
2. The Brains: Our PersonaEvaluator Service
The core logic resides in src/server/services/persona-evaluator.ts. This service is the orchestrator, housing the functions for running each test and, most importantly, our "LLM-as-Judge" implementation.
- `runTemperatureTest()`: Prompts a persona multiple times at varying temperatures and analyzes the consistency and quality of the responses.
- `runJailbreakTest()`: This is where things get interesting. We implemented four distinct attack vectors to probe a persona's defenses against prompt injection and other manipulative prompts. Each attack generates a score indicating the persona's resilience.
- `runDegradationTest()`: Tests the persona's ability to maintain performance and coherence with extremely large contexts. We're pushing 3000+ word contexts here, which can often trip up even advanced models.
- `judgeResponse()`: This is the heart of our automated scoring. We craft a specific prompt for a separate, powerful LLM (our "judge") to evaluate the persona's response against predefined criteria (e.g., adherence to persona, factual accuracy, safety violations). The judge returns a score and detailed `violations` or `markers`.
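The trickiest part of `judgeResponse()` is turning a free-form LLM reply into structured data. A defensive parser helps, since judges often wrap their JSON verdict in prose or code fences. The sketch below is an assumption about how such parsing could look, not the project's exact implementation:

```typescript
// Hypothetical sketch of parsing an LLM-as-Judge reply. The judge is asked to
// return JSON like {"score": 85, "violations": ["leaked system prompt"]}, but
// models often wrap it in prose or fences, so we extract the first {...} block
// defensively and clamp the score to the 0-100 range.

interface Verdict {
  score: number | null;
  violations: string[];
}

function parseJudgeVerdict(raw: string): Verdict {
  const match = raw.match(/\{[\s\S]*\}/); // grab the JSON object, fences or not
  if (!match) return { score: null, violations: [] };
  try {
    const parsed = JSON.parse(match[0]);
    const score =
      typeof parsed.score === "number"
        ? Math.max(0, Math.min(100, Math.round(parsed.score)))
        : null;
    const violations = Array.isArray(parsed.violations)
      ? parsed.violations.filter((v: unknown): v is string => typeof v === "string")
      : [];
    return { score, violations };
  } catch {
    return { score: null, violations: [] }; // unparseable reply -> score missing
  }
}
```

Treating an unparseable verdict as a missing score (rather than a zero) keeps judge failures from being confused with genuine persona failures in the trend data.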
3. API Endpoints & Frontend Dashboard
To expose this functionality and visualize the results, we extended our tRPC router in src/server/trpc/routers/personas.ts with three new endpoints:
- `runEvaluation`: Triggers a specific evaluation test for a given persona.
- `evaluationHistory`: Provides a cursor-paginated list of past evaluation results, allowing users to scroll through historical data.
- `evaluationTrend`: Aggregates evaluation scores over time, currently configured for a 90-day window, to power our trend charts.
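The cursor pagination behind an endpoint like `evaluationHistory` follows a common pattern: fetch `limit + 1` rows so the extra row tells you whether a next page exists. This sketch mirrors that logic over an in-memory array (the real query runs through Prisma; names here are illustrative):

```typescript
// Sketch of the take-limit-plus-one cursor pagination pattern. The presence
// of one extra row signals a next page; the last visible row's id becomes
// the next cursor, and a null cursor means the end was reached.

interface Row {
  id: string;
  createdAt: number;
}

function paginate(rows: Row[], limit: number, cursor?: string | null) {
  // Simulates `cursor: { id }, skip: 1` semantics over an in-memory array.
  const start = cursor ? rows.findIndex((r) => r.id === cursor) + 1 : 0;
  const page = rows.slice(start, start + limit + 1); // take limit + 1
  const hasMore = page.length > limit;
  const items = hasMore ? page.slice(0, limit) : page;
  const nextCursor = hasMore ? items[items.length - 1].id : null; // null-safe end
  return { items, nextCursor };
}
```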
On the frontend, src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx became our dedicated evaluation dashboard. Here, we used Recharts to display an AreaChart showing performance trends. Individual evaluation results are presented in expandable rows, revealing the original prompt, the persona's response, the judge's prompt, its raw response, the score, and any detected violations or markers. A Radix DropdownMenu provides options for filtering or initiating new tests.
We also added a prominent "Evaluations" button to the main persona detail page (src/app/(dashboard)/dashboard/personas/[id]/page.tsx) using FlaskConical for easy access.
4. Hardening & Refinements
A significant part of the "done" list involved applying code review fixes and best practices:
- RLS Policy & Tenant Isolation: Double-checked that RLS was correctly applied and that our service layer explicitly filters data by `tenantId` to prevent any cross-tenant data access.
- Type Safety: Correctly cast JSON inputs with `Prisma.InputJsonValue`.
- Robust Queries: Ensured cursor null-safety for pagination and bounded the `evaluationTrend` query to a sensible 90-day window with a `take: 500` limit to prevent excessive data loading.
- UI Polish: Integrated Radix `DropdownMenu` for a consistent and accessible user experience.
- Error Logging: Enhanced judge error logging for better debugging.
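To make the query-bounding concrete, here's an illustrative sketch of the shape of the `evaluationTrend` aggregation: cap the rows considered, window them to 90 days, and average scores per day. The real aggregation runs in the database; this only mirrors the logic:

```typescript
// Illustrative sketch: bound trend data to a 90-day window and at most 500
// rows (mirroring `take: 500`), then bucket average scores per calendar day.

interface Eval {
  createdAt: Date;
  score: number | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function trend(evals: Eval[], now: Date, windowDays = 90, cap = 500) {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  const rows = evals
    .filter((e) => e.score !== null && e.createdAt.getTime() >= cutoff)
    .slice(0, cap); // hard bound on result size
  const buckets = new Map<string, { sum: number; n: number }>();
  for (const e of rows) {
    const day = e.createdAt.toISOString().slice(0, 10); // "YYYY-MM-DD"
    const b = buckets.get(day) ?? { sum: 0, n: 0 };
    b.sum += e.score!;
    b.n += 1;
    buckets.set(day, b);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([day, { sum, n }]) => ({ day, avgScore: sum / n }));
}
```

Unscored rows (e.g., a judge failure) are skipped rather than averaged in as zeros, so outages don't masquerade as regressions.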
Finally, we performed the first live jailbreak test on our "Cael" persona. All four attack results were successfully persisted to the database and are now proudly displayed in the dashboard, confirming the system's end-to-end functionality. Before committing, I removed some temporary debug logs from the evaluationTrend query.
Lessons from the Trenches: Overcoming Development Hurdles
No significant feature ships without its share of unexpected challenges. These "pain points" are often the most valuable learning opportunities.
Lesson 1: tsx and Relative Path Woes
- The Problem: I initially tried to run a quick test script from `/tmp/run-jailbreak-test.ts`. The script needed to import modules from our project's `src` directory (e.g., `../src/server/services/persona-evaluator`). When executed with `tsx`, it consistently failed with `Cannot find module '../src/server/services/persona-evaluator'`.
- The Root Cause: `tsx` (and Node.js in general) resolves relative import specifiers against the importing file's own location. When the script lives in `/tmp`, the relative path `../src/server/services/persona-evaluator` points to an entirely different, non-existent location. The runtime has no inherent notion of the project root.
- The Workaround: The simplest fix was to move the script into a dedicated `scripts/` directory within the project root (e.g., `scripts/run-jailbreak-test.ts`), so that all relative imports resolve correctly within the project's structure.
- Actionable Takeaway: For any utility or test scripts that need to import project modules, always place them within the project's directory structure. If external execution is truly required, ensure your `tsconfig.json` `paths` are correctly configured and `tsx` is invoked with the appropriate `TS_NODE_PROJECT` or `NODE_PATH` environment variables, or simply use paths relative to the project root.
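The failure mode is easy to see with `path.resolve`, which applies the same relative-path arithmetic (the `/home/app/my-project` prefix below is a made-up example path):

```typescript
// Why a script in /tmp can't use "../src/...": the relative segment is
// resolved against the script's own directory, not the project root.
import * as path from "node:path";

const fromTmp = path.resolve("/tmp", "../src/server/services/persona-evaluator");
// -> "/src/server/services/persona-evaluator" (outside the project entirely)

const fromScripts = path.resolve(
  "/home/app/my-project/scripts", // hypothetical project location
  "../src/server/services/persona-evaluator"
);
// -> "/home/app/my-project/src/server/services/persona-evaluator" (correct)
```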
Lesson 2: The Elusive tRPC Route & Next.js HMR
- The Problem: After adding the new tRPC router procedures (`runEvaluation`, `evaluationHistory`, `evaluationTrend`) and committing them, I fired up the dev server and navigated to the dashboard. The page rendered, but the tRPC queries returned empty results, displaying "No evaluation data yet." The data was in the database, and the queries looked correct.
- The Root Cause: Next.js's Hot Module Replacement (HMR) is fantastic, but it doesn't reliably pick up new route additions to the tRPC router. While changes to existing procedures often get hot-reloaded, entirely new procedures might be missed by the underlying build system or server-side routing logic until a full restart. The server simply wasn't aware of the new endpoints.
- The Workaround: A full restart of the development server immediately resolved the issue. This involved killing the process on port 3000, clearing the `.next` cache, regenerating the Prisma client, and restarting via `./scripts/dev-start.sh`. The data appeared instantly.
- Actionable Takeaway: Always perform a full dev server restart after adding new tRPC router procedures; HMR is not sufficient for new routes. Document this explicitly for the team to save future headaches.
What's Next? Pushing Forward
The immediate next steps are clear:
- Commit & Push: Get the clean code (debug logs removed) into our main branch.
- Apply RLS: Run `psql -f prisma/rls.sql` to ensure the Row-Level Security policy is active on our running database.
- Harden Cael: The initial jailbreak tests on Cael showed scores of 29 and 10 for `attack_1` and `attack_2` respectively, indicating significant vulnerabilities. We need to iterate on Cael's system prompt to make it more resilient.
- Full Benchmarks: Run the temperature and degradation tests on Cael to fully populate its trend chart and get a holistic view of its performance.
- Expand Coverage: Benchmark our other key personas (Lee, Morgan, Sage) to validate persona-specific markers and ensure consistent quality across the board.
- Optimize Long-Running Tests: A full `runAllTests` can involve 14 LLM calls, potentially taking 70-210 seconds, which risks HTTP timeouts. We'll need to consider SSE (Server-Sent Events) streaming for real-time progress updates and to prevent timeouts for these longer evaluation runs.
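The SSE side of that last item is mostly wire-format plumbing. A minimal sketch of framing progress events (the `progress` event name and payload shape are assumptions, not settled design):

```typescript
// Minimal sketch of Server-Sent Events framing for long-running evaluation
// runs. Each SSE message on the wire is "event: <name>\ndata: <json>\n\n".

function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// A runAllTests driver could emit one frame per completed LLM call so the
// dashboard can show live progress instead of waiting out a 70-210s request.
function progressFrames(total: number, completed: number[]): string[] {
  return completed.map((i) =>
    sseFrame("progress", { step: i, total, pct: Math.round((i / total) * 100) })
  );
}
```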
This new evaluation system is a massive leap forward in ensuring the quality, safety, and reliability of our LLM personas. It provides the data-driven insights we need to continuously improve and iterate, moving beyond subjective impressions to objective, measurable performance.
```json
{
  "thingsDone": [
    "Implemented PersonaEvaluation data model with Prisma",
    "Configured Row-Level Security (RLS) for tenant isolation on evaluations",
    "Developed comprehensive persona evaluator service (runTemperatureTest, runJailbreakTest with 4 attacks, runDegradationTest with 3000+ word context)",
    "Integrated LLM-as-Judge for automated scoring and violation detection (judgeResponse)",
    "Created tRPC endpoints for running evaluations, fetching history (cursor-paginated), and aggregated trends (90-day)",
    "Built a full dashboard UI with Recharts AreaChart, expandable result rows, and Radix DropdownMenu",
    "Added 'Evaluations' button to persona detail pages",
    "Applied critical code review fixes (RLS policy, service-layer tenant isolation, type casting, null-safety, bounded queries, error logging)",
    "Successfully ran live jailbreak test on 'Cael' persona, persisting results to DB and dashboard",
    "Cleaned up temporary debug logs from trend query"
  ],
  "pains": [
    "tsx failing to resolve relative module paths when script run from /tmp",
    "Next.js HMR not picking up new tRPC router procedures, leading to empty query results"
  ],
  "successes": [
    "Achieved feature complete status for persona evaluation system",
    "Successfully executed and visualized first live jailbreak test",
    "Established robust data model and secure multi-tenant access for evaluations",
    "Implemented automated LLM-as-Judge scoring mechanism",
    "Created an intuitive and informative dashboard for tracking persona performance"
  ],
  "techStack": [
    "TypeScript",
    "Next.js",
    "tRPC",
    "Prisma",
    "PostgreSQL",
    "Recharts",
    "Radix UI",
    "LLMs (for persona and judge)",
    "tsx"
  ]
}
```