Battle-Testing Our LLM Personas: Building a Comprehensive Evaluation & Benchmarking System
We just shipped a critical system for evaluating our LLM personas, covering everything from temperature consistency to jailbreak resilience, all powered by an LLM-as-Judge and visualized in a live dashboard. Here's a deep dive into its architecture and the invaluable lessons learned along the way.
Ensuring the quality, consistency, and safety of Large Language Model (LLM) personas isn't just a nice-to-have; it's a fundamental requirement for any serious AI-driven application. How do you know if your "Customer Support Bot" persona is truly helpful and not susceptible to prompt injection? How do you track its performance over time as models evolve or prompts are tweaked?
These were the questions driving our latest development push. We needed a robust, automated system to evaluate our LLM personas across multiple dimensions and provide actionable insights. I'm excited to share that we've just brought such a system to life, feature-complete and already yielding results from its first live tests.
The Mission: A Holistic Persona Evaluation System
Our goal was clear: build a comprehensive persona evaluation and benchmarking system. This wasn't just about simple prompt-response checks; it had to be sophisticated enough to:
- A/B Test Temperature: Understand how different temperature settings affect response variability and quality.
- Detect Jailbreaks: Proactively identify vulnerabilities to prompt injection and other adversarial attacks.
- Measure Degradation: Assess performance under stress, particularly with large contexts (e.g., 3000+ words).
- Automate Scoring: Leverage an LLM-as-Judge to provide objective, scalable evaluations.
- Visualize Trends: Offer a clear, interactive dashboard to track persona performance over time.
After an intense development sprint, we've hit "feature complete." Our first live jailbreak test on the "Cael" persona successfully ran, and its results are already illuminating our dashboard. We're now ready for final cleanup and deployment.
Under the Hood: The Architecture of Our Evaluation Engine
Let's dive into the technical details of what we built during this session.
1. Data Model & Security with Prisma and RLS
The foundation of any data-driven system is its schema. We introduced a new PersonaEvaluation model in prisma/schema.prisma:
```prisma
model PersonaEvaluation {
  id            String   @id @default(uuid())
  createdAt     DateTime @default(now())
  updatedAt     DateTime @updatedAt
  tenantId      String
  personaId     String
  testType      String   // e.g., "jailbreak", "temperature", "degradation"
  testName      String   // e.g., "attack_1", "high_temp_consistency"
  prompt        String
  response      String
  judgePrompt   String?  // Prompt used for the LLM-as-Judge
  judgeResponse String?  // Raw response from the LLM-as-Judge
  score         Int?     // Numerical score from the judge
  violations    Json?    // JSON array of specific violations detected
  markers       Json?    // JSON array of other relevant markers
  model         String?
  temperature   Float?
  maxTokens     Int?

  persona Persona @relation(fields: [personaId], references: [id])
  tenant  Tenant  @relation(fields: [tenantId], references: [id])

  @@index([tenantId])
  @@index([personaId])
}
```
This model captures all the necessary details for each evaluation run. Crucially, as a multi-tenant application, data isolation is paramount. We immediately added a Row-Level Security (RLS) policy in prisma/rls.sql to ensure tenant_isolation for persona_evaluations, preventing data leaks between tenants.
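RLS is the database-level backstop; as defense in depth, the service layer also filters every query by `tenantId` (covered again in the hardening section below). Here's a minimal sketch of that pattern — the `scopedWhere` helper name and its shape are illustrative, not the project's actual code:

```typescript
// Sketch of a service-layer guard: every evaluation query is forced to carry
// the caller's tenantId, so even if RLS were misconfigured, the application
// layer would still refuse cross-tenant reads.
// `scopedWhere` is a hypothetical helper, not the real project code.

type EvaluationWhere = { tenantId?: string; personaId?: string };

function scopedWhere(tenantId: string, where: EvaluationWhere = {}): EvaluationWhere {
  if (where.tenantId && where.tenantId !== tenantId) {
    // A caller-supplied filter naming a different tenant is always a bug.
    throw new Error("Cross-tenant query rejected");
  }
  return { ...where, tenantId }; // the session's tenantId always wins
}

// Usage (illustrative):
// prisma.personaEvaluation.findMany({ where: scopedWhere(ctx.tenantId, { personaId }) })
```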
2. The Brains: Our PersonaEvaluator Service
The core logic resides in src/server/services/persona-evaluator.ts. This service is the orchestrator, housing the functions for running each test and, most importantly, our "LLM-as-Judge" implementation.
- `runTemperatureTest()`: Prompts a persona multiple times at varying temperatures and analyzes the consistency and quality of the responses.
- `runJailbreakTest()`: This is where things get interesting. We implemented four distinct attack vectors to probe a persona's defenses against prompt injection and other manipulative prompts. Each attack generates a score indicating the persona's resilience.
- `runDegradationTest()`: Tests the persona's ability to maintain performance and coherence with extremely large contexts. We're pushing 3000+ word contexts here, which can often trip up even advanced models.
- `judgeResponse()`: This is the heart of our automated scoring. We craft a specific prompt for a separate, powerful LLM (our "judge") to evaluate the persona's response against predefined criteria (e.g., adherence to persona, factual accuracy, safety violations). The judge returns a score and detailed `violations` or `markers`.
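The trickiest part of `judgeResponse()` is turning a free-form LLM reply into structured data. A defensive parser helps, since judges often wrap their JSON verdict in prose or code fences. The sketch below is an assumption about how such parsing could look, not the project's exact implementation:

```typescript
// Hypothetical sketch of parsing an LLM-as-Judge reply. The judge is asked to
// return JSON like {"score": 85, "violations": ["leaked system prompt"]}, but
// models often wrap it in prose or fences, so we extract the first {...} block
// defensively and clamp the score to the 0-100 range.

interface Verdict {
  score: number | null;
  violations: string[];
}

function parseJudgeVerdict(raw: string): Verdict {
  const match = raw.match(/\{[\s\S]*\}/); // grab the JSON object, fences or not
  if (!match) return { score: null, violations: [] };
  try {
    const parsed = JSON.parse(match[0]);
    const score =
      typeof parsed.score === "number"
        ? Math.max(0, Math.min(100, Math.round(parsed.score)))
        : null;
    const violations = Array.isArray(parsed.violations)
      ? parsed.violations.filter((v: unknown): v is string => typeof v === "string")
      : [];
    return { score, violations };
  } catch {
    return { score: null, violations: [] }; // unparseable reply -> score missing
  }
}
```

Treating an unparseable verdict as a missing score (rather than a zero) keeps judge failures from being confused with genuine persona failures in the trend data.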
3. API Endpoints & Frontend Dashboard
To expose this functionality and visualize the results, we extended our tRPC router in src/server/trpc/routers/personas.ts with three new endpoints:
- `runEvaluation`: Triggers a specific evaluation test for a given persona.
- `evaluationHistory`: Provides a cursor-paginated list of past evaluation results, allowing users to scroll through historical data.
- `evaluationTrend`: Aggregates evaluation scores over time, currently configured for a 90-day window, to power our trend charts.
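The cursor pagination behind an endpoint like `evaluationHistory` follows a common pattern: fetch `limit + 1` rows so the extra row tells you whether a next page exists. This sketch mirrors that logic over an in-memory array (the real query runs through Prisma; names here are illustrative):

```typescript
// Sketch of the take-limit-plus-one cursor pagination pattern. The presence
// of one extra row signals a next page; the last visible row's id becomes
// the next cursor, and a null cursor means the end was reached.

interface Row {
  id: string;
  createdAt: number;
}

function paginate(rows: Row[], limit: number, cursor?: string | null) {
  // Simulates `cursor: { id }, skip: 1` semantics over an in-memory array.
  const start = cursor ? rows.findIndex((r) => r.id === cursor) + 1 : 0;
  const page = rows.slice(start, start + limit + 1); // take limit + 1
  const hasMore = page.length > limit;
  const items = hasMore ? page.slice(0, limit) : page;
  const nextCursor = hasMore ? items[items.length - 1].id : null; // null-safe end
  return { items, nextCursor };
}
```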
On the frontend, src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx became our dedicated evaluation dashboard. Here, we used Recharts to display an AreaChart showing performance trends. Individual evaluation results are presented in expandable rows, revealing the original prompt, the persona's response, the judge's prompt, its raw response, the score, and any detected violations or markers. A Radix DropdownMenu provides options for filtering or initiating new tests.
We also added a prominent "Evaluations" button to the main persona detail page (src/app/(dashboard)/dashboard/personas/[id]/page.tsx) using FlaskConical for easy access.
4. Hardening & Refinements
A significant part of the "done" list involved applying code review fixes and best practices:
- RLS Policy & Tenant Isolation: Double-checked that RLS was correctly applied and that our service layer explicitly filters data by `tenantId` to prevent any cross-tenant data access.
- Type Safety: Correctly cast JSON inputs with `Prisma.InputJsonValue`.
- Robust Queries: Ensured cursor null-safety for pagination and bounded the `evaluationTrend` query to a sensible 90-day window with a `take: 500` limit to prevent excessive data loading.
- UI Polish: Integrated Radix `DropdownMenu` for a consistent and accessible user experience.
- Error Logging: Enhanced judge error logging for better debugging.
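To make the query-bounding concrete, here's an illustrative sketch of the shape of the `evaluationTrend` aggregation: cap the rows considered, window them to 90 days, and average scores per day. The real aggregation runs in the database; this only mirrors the logic:

```typescript
// Illustrative sketch: bound trend data to a 90-day window and at most 500
// rows (mirroring `take: 500`), then bucket average scores per calendar day.

interface Eval {
  createdAt: Date;
  score: number | null;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function trend(evals: Eval[], now: Date, windowDays = 90, cap = 500) {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  const rows = evals
    .filter((e) => e.score !== null && e.createdAt.getTime() >= cutoff)
    .slice(0, cap); // hard bound on result size
  const buckets = new Map<string, { sum: number; n: number }>();
  for (const e of rows) {
    const day = e.createdAt.toISOString().slice(0, 10); // "YYYY-MM-DD"
    const b = buckets.get(day) ?? { sum: 0, n: 0 };
    b.sum += e.score!;
    b.n += 1;
    buckets.set(day, b);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([day, { sum, n }]) => ({ day, avgScore: sum / n }));
}
```

Unscored rows (e.g., a judge failure) are skipped rather than averaged in as zeros, so outages don't masquerade as regressions.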
Finally, we performed the first live jailbreak test on our "Cael" persona. All four attack results were successfully persisted to the database and are now proudly displayed in the dashboard, confirming the system's end-to-end functionality. Before committing, I removed some temporary debug logs from the evaluationTrend query.
Lessons from the Trenches: Overcoming Development Hurdles
No significant feature ships without its share of unexpected challenges. These "pain points" are often the most valuable learning opportunities.
Lesson 1: tsx and Relative Path Woes
- The Problem: I initially tried to run a quick test script from `/tmp/run-jailbreak-test.ts`. The script needed to import modules from our project's `src` directory (e.g., `../src/server/services/persona-evaluator`). When executed with `tsx`, it consistently failed with `Cannot find module '../src/server/services/persona-evaluator'`.
- The Root Cause: `tsx` (and Node.js in general) resolves relative import specifiers against the importing file's own location. When the script lives in `/tmp`, the relative path `../src/server/services/persona-evaluator` points to an entirely different, non-existent location. The runtime has no inherent notion of the project root.
- The Workaround: The simplest fix was to move the script into a dedicated `scripts/` directory within the project root (e.g., `scripts/run-jailbreak-test.ts`), so that all relative imports resolve correctly within the project's structure.
- Actionable Takeaway: For any utility or test scripts that need to import project modules, always place them within the project's directory structure. If external execution is truly required, ensure your `tsconfig.json` `paths` are correctly configured and `tsx` is invoked with the appropriate `TS_NODE_PROJECT` or `NODE_PATH` environment variables, or simply use paths relative to the project root.
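The failure mode is easy to see with `path.resolve`, which applies the same relative-path arithmetic (the `/home/app/my-project` prefix below is a made-up example path):

```typescript
// Why a script in /tmp can't use "../src/...": the relative segment is
// resolved against the script's own directory, not the project root.
import * as path from "node:path";

const fromTmp = path.resolve("/tmp", "../src/server/services/persona-evaluator");
// -> "/src/server/services/persona-evaluator" (outside the project entirely)

const fromScripts = path.resolve(
  "/home/app/my-project/scripts", // hypothetical project location
  "../src/server/services/persona-evaluator"
);
// -> "/home/app/my-project/src/server/services/persona-evaluator" (correct)
```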
Lesson 2: The Elusive tRPC Route & Next.js HMR
- The Problem: After adding the new tRPC router procedures (`runEvaluation`, `evaluationHistory`, `evaluationTrend`) and committing them, I fired up the dev server and navigated to the dashboard. The page rendered, but the tRPC queries returned empty results, displaying "No evaluation data yet." The data was in the database, and the queries looked correct.
- The Root Cause: Next.js's Hot Module Replacement (HMR) is fantastic, but it doesn't reliably pick up new route additions to the tRPC router. While changes to existing procedures often get hot-reloaded, entirely new procedures might be missed by the underlying build system or server-side routing logic until a full restart. The server simply wasn't aware of the new endpoints.
- The Workaround: A full restart of the development server immediately resolved the issue. This involved killing the process on port 3000, clearing the `.next` cache, regenerating the Prisma client, and restarting via `./scripts/dev-start.sh`. The data appeared instantly.
- Actionable Takeaway: Always perform a full dev server restart after adding new tRPC router procedures; HMR is not sufficient for new routes. Document this explicitly for the team to save future headaches.
What's Next? Pushing Forward
The immediate next steps are clear:
- Commit & Push: Get the clean code (debug logs removed) into our main branch.
- Apply RLS: Run `psql -f prisma/rls.sql` to ensure the Row-Level Security policy is active on our running database.
- Harden Cael: The initial jailbreak tests on Cael showed scores of 29 and 10 for `attack_1` and `attack_2` respectively, indicating significant vulnerabilities. We need to iterate on Cael's system prompt to make it more resilient.
- Full Benchmarks: Run the temperature and degradation tests on Cael to fully populate its trend chart and get a holistic view of its performance.
- Expand Coverage: Benchmark our other key personas (Lee, Morgan, Sage) to validate persona-specific markers and ensure consistent quality across the board.
- Optimize Long-Running Tests: A full `runAllTests` can involve 14 LLM calls, potentially taking 70-210 seconds, which risks HTTP timeouts. We'll need to consider SSE (Server-Sent Events) streaming for real-time progress updates and to prevent timeouts for these longer evaluation runs.
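The SSE side of that last item is mostly wire-format plumbing. A minimal sketch of framing progress events (the `progress` event name and payload shape are assumptions, not settled design):

```typescript
// Minimal sketch of Server-Sent Events framing for long-running evaluation
// runs. Each SSE message on the wire is "event: <name>\ndata: <json>\n\n".

function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// A runAllTests driver could emit one frame per completed LLM call so the
// dashboard can show live progress instead of waiting out a 70-210s request.
function progressFrames(total: number, completed: number[]): string[] {
  return completed.map((i) =>
    sseFrame("progress", { step: i, total, pct: Math.round((i / total) * 100) })
  );
}
```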
This new evaluation system is a massive leap forward in ensuring the quality, safety, and reliability of our LLM personas. It provides the data-driven insights we need to continuously improve and iterate, moving beyond subjective impressions to objective, measurable performance.
```json
{
  "thingsDone": [
    "Implemented PersonaEvaluation data model with Prisma",
    "Configured Row-Level Security (RLS) for tenant isolation on evaluations",
    "Developed comprehensive persona evaluator service (runTemperatureTest, runJailbreakTest with 4 attacks, runDegradationTest with 3000+ word context)",
    "Integrated LLM-as-Judge for automated scoring and violation detection (judgeResponse)",
    "Created tRPC endpoints for running evaluations, fetching history (cursor-paginated), and aggregated trends (90-day)",
    "Built a full dashboard UI with Recharts AreaChart, expandable result rows, and Radix DropdownMenu",
    "Added 'Evaluations' button to persona detail pages",
    "Applied critical code review fixes (RLS policy, service-layer tenant isolation, type casting, null-safety, bounded queries, error logging)",
    "Successfully ran live jailbreak test on 'Cael' persona, persisting results to DB and dashboard",
    "Cleaned up temporary debug logs from trend query"
  ],
  "pains": [
    "tsx failing to resolve relative module paths when script run from /tmp",
    "Next.js HMR not picking up new tRPC router procedures, leading to empty query results"
  ],
  "successes": [
    "Achieved feature complete status for persona evaluation system",
    "Successfully executed and visualized first live jailbreak test",
    "Established robust data model and secure multi-tenant access for evaluations",
    "Implemented automated LLM-as-Judge scoring mechanism",
    "Created an intuitive and informative dashboard for tracking persona performance"
  ],
  "techStack": [
    "TypeScript",
    "Next.js",
    "tRPC",
    "Prisma",
    "PostgreSQL",
    "Recharts",
    "Radix UI",
    "LLMs (for persona and judge)",
    "tsx"
  ]
}
```