Level Up Your AI: Unveiling Persona Evaluation v2 with Hybrid Scoring and Deterministic Profiles
We just shipped Persona Evaluation v2, a significant upgrade to how we assess AI behavior. Dive into our journey of building a profile-driven, hybrid scoring system that brings determinism, nuance, and transparency to LLM evaluations.
Late last night, after a marathon coding session, we pushed a major update that fundamentally changes how we evaluate our AI personas. I'm thrilled to announce the successful deployment of Persona Evaluation v2! This wasn't just a patch; it was a complete overhaul, moving from a more generic scoring system to a sophisticated, profile-driven hybrid approach that brings unprecedented determinism and insight into LLM behavior.
Our goal was ambitious: execute a robust, 11-task plan covering everything from new type definitions to a full production deployment. And I'm proud to say we hit every single one. The `feat/persona-eval-v2` branch is now merged to `main`, deployed, and running live.
Let's dive into the core changes and the journey it took to get there.
The Quest for Deeper Understanding: Why Persona Evaluation v2?
Our previous evaluation system, while functional, had limitations. It was somewhat generic, making it harder to precisely measure adherence to complex persona traits or identify subtle deviations. We needed a system that:
- Understood Nuance: Each persona is unique; its evaluation shouldn't be one-size-fits-all.
- Increased Determinism: Relying solely on LLM judges for scoring introduced variability. We needed more objective, reproducible metrics.
- Provided Actionable Insights: Beyond a simple score, we wanted to know why a persona succeeded or failed, and how it could be improved.
Persona Evaluation v2 addresses these needs head-on, introducing profile-driven hybrid scoring.
Building the Brains: A Deep Dive into the Implementation
Our journey unfolded across four key development chunks, each building on the last.
1. Laying the Foundation: Types, Schema & Persona Profiles
The first step was to define the new world. We introduced a suite of new types in `src/server/services/persona-evaluation-types.ts` to capture the complexity of persona profiles, attack vectors, marker definitions, and detailed evaluation outputs.
The `prisma/schema.prisma` file saw significant modifications: new columns on our `PersonaEvaluation` model (`refusalQuality`, `coherenceUnderLoad`, `judgeReasoning`, `compositeScore`, `evalTier`, `discrepancyFlag`) and, most importantly, a brand-new `PersonaProfile` model to store persona-specific data.
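For readers following along, here is a hedged sketch of what those schema additions might look like. Only the column and model names come from this post; the field types, optionality, and attributes are assumptions:

```prisma
// Hypothetical sketch: types and attributes are assumptions,
// only the column/model names come from the post.
model PersonaEvaluation {
  id                 String   @id @default(cuid())
  // ... existing v1 fields ...
  refusalQuality     Float?
  coherenceUnderLoad Float?
  judgeReasoning     String?
  compositeScore     Float?
  evalTier           String?  // e.g. "quick" | "full"
  discrepancyFlag    Boolean  @default(false)
}

model PersonaProfile {
  id        String   @id @default(cuid())
  personaId String   @unique
  status    String   // e.g. "BUILT_IN" | "DRAFT" | "APPROVED"
  profile   Json     // core traits, test prompts, attack vectors
  createdAt DateTime @default(now())
}
```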
Perhaps the most exciting part of this chunk was the introduction of 12 built-in persona profiles (Cael, NyxCore, Athena, Nemesis, Harmonia, Clotho, Hermes, Tyche, Themis, Prometheus, Aletheia, Ipcha Mistabra) in `src/server/services/persona-profiles.ts`. These profiles are the heart of v2, providing a rich, structured definition for each AI, including their core traits, specific test prompts, and tailored attack vectors. We even included a `GENERIC_PROFILE` fallback for custom personas without a defined profile.
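The fallback pattern is simple but worth spelling out. Here's a minimal sketch; the interface fields beyond those named in the post and the `getProfile` helper are illustrative, while `GENERIC_PROFILE` and the persona names are real:

```typescript
// Hypothetical sketch of the profile registry and its fallback.
// Field names beyond those mentioned in the post are illustrative.
interface PersonaProfile {
  name: string;
  coreTraits: string[];
  testPrompts: string[];
  attackVectors: string[]; // tailored attacks used in full-tier evals
}

const GENERIC_PROFILE: PersonaProfile = {
  name: "generic",
  coreTraits: [],
  testPrompts: ["Describe your role in one paragraph."],
  attackVectors: [],
};

const BUILT_IN_PROFILES: Record<string, PersonaProfile> = {
  Cael: { name: "Cael", coreTraits: ["calm", "precise"], testPrompts: [], attackVectors: [] },
  // ... 11 more built-in profiles ...
};

// Custom personas without a defined profile fall back to GENERIC_PROFILE.
function getProfile(personaName: string): PersonaProfile {
  return BUILT_IN_PROFILES[personaName] ?? GENERIC_PROFILE;
}
```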
2. The Core Engine: Deterministic Scoring & Evaluator Rewrite
With the foundation set, we completely rewrote our evaluation engine in `src/server/services/persona-evaluator.ts`. This is where the magic of hybrid scoring happens.
Gone are the days of purely generic scoring. Our new evaluator now features:
- `scoreMarkersDeterministic()`: A regex-based, tamper-proof system for scoring specific keywords or patterns, ensuring objective measurement.
- `scoreRoleAdherenceDeterministic()`: Uses behavior matching and anti-pattern penalties to objectively assess how well an LLM adheres to its defined role.
- `buildJudgePrompt()`: Instead of a generic system prompt, our LLM judge now receives a persona-specific prompt, allowing for more nuanced and context-aware evaluations.
- `hybridScore()`: The cornerstone. It combines our deterministic primary scores with the LLM judge's advisory input. Critically, it includes a discrepancy detection mechanism that flags cases where the deterministic score and the LLM's assessment diverge significantly, prompting further investigation.
- Profile-Driven Tests: Our `runTemperatureTest`, `runJailbreakTest`, and `runDegradationTest` now leverage the `testPrompts` defined within each persona's profile, including `GENERIC_JAILBREAKS` and 5 tailored attack vectors per persona for full-tier evaluations.
- Tiered Evaluations: We introduced `runQuickEval()` (temperature + generic jailbreaks) and `runFullEval()` (all three test types + tailored attacks) to provide flexible evaluation depths.
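To make the hybrid idea concrete, here is a minimal sketch, assuming scores in the 0–1 range, a simple weighted blend, and an illustrative discrepancy threshold. The real weighting and threshold inside `hybridScore()` are not specified in this post:

```typescript
// Minimal sketch of deterministic marker scoring plus hybrid blending.
// The 70/30 weighting and the 0.3 discrepancy threshold are illustrative assumptions.

// Score the fraction of expected markers (regex patterns) found in the reply.
function scoreMarkersDeterministic(reply: string, markers: RegExp[]): number {
  if (markers.length === 0) return 1;
  const hits = markers.filter((m) => m.test(reply)).length;
  return hits / markers.length;
}

interface HybridResult {
  compositeScore: number;
  discrepancyFlag: boolean;
}

// Blend the deterministic primary score with the LLM judge's advisory score,
// flagging a discrepancy when the two diverge significantly.
function hybridScore(deterministic: number, judgeAdvisory: number): HybridResult {
  const compositeScore = 0.7 * deterministic + 0.3 * judgeAdvisory;
  const discrepancyFlag = Math.abs(deterministic - judgeAdvisory) > 0.3;
  return { compositeScore, discrepancyFlag };
}
```

Keeping the deterministic score as the primary signal means a drifting judge model can shade the composite but never silently override the objective measurement, and the flag surfaces exactly those cases for human review.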
This rewrite transforms our evaluator into a powerful, precise, and highly configurable tool.
3. Bringing it to Life: Profile Management & UI
An intelligent system is only as good as its interface. We made significant updates to our tRPC procedures and UI:
- `src/server/trpc/routers/personas.ts`: New procedures like `generateProfileDraft` (which uses an LLM to derive a profile from a system prompt!), `approveProfile`, and `getProfile` empower users to manage persona definitions. We also updated `runEvaluation` to support "quick" and "full" tiers, and enhanced `evaluationHistory`/`evaluationTrend` to display our v2 fields.
- `src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx`: The evaluation display now showcases the `compositeScore`, `attackVector`, `evalTier`, and `discrepancyFlag` with clear badges. Bonus dimensions like `refusalQuality` and `coherenceUnderLoad` offer deeper insight, and a dedicated section for `judgeReasoning` provides transparency. Users can now select "Quick Eval" or "Full Eval" directly from the UI.
- `src/app/(dashboard)/dashboard/personas/[id]/profile/page.tsx`: A brand-new profile management page! Here, users can view status badges (BUILT-IN / DRAFT / APPROVED / NO PROFILE), inspect the profile JSON, generate new drafts, approve them, or regenerate existing ones, giving a complete workflow for custom persona tuning.
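As one small example of the UI logic, the status badge can be derived from the stored profile record. This is a hedged sketch: the `ProfileRecord` shape and `statusBadge` helper are hypothetical, while the four badge labels come from the page itself:

```typescript
// Hypothetical helper deriving the profile page's status badge.
// The record shape is an assumption; the four labels are from the post.
type ProfileStatus = "BUILT-IN" | "DRAFT" | "APPROVED" | "NO PROFILE";

interface ProfileRecord {
  builtIn: boolean;
  approved: boolean;
}

function statusBadge(record: ProfileRecord | null): ProfileStatus {
  if (record === null) return "NO PROFILE";
  if (record.builtIn) return "BUILT-IN";
  return record.approved ? "APPROVED" : "DRAFT";
}
```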
4. The Final Push: Production Deployment
After rigorous testing, the moment of truth arrived. We successfully applied the schema migration directly to our production database (7 `ALTER TABLE` statements and 1 `CREATE TABLE` statement) and deployed the rebuilt application. Persona Evaluation v2 is now live!
Navigating the Treacherous Waters: Lessons Learned
No major release is without its challenges. Here are a few critical lessons we learned along the way:
Lesson 1: Database Migrations in Production Environments
Our standard `db-migrate-safe.sh --dry-run` script failed on production due to environment variable issues and `npx` not being available directly on the host. This highlighted a critical gap in our production migration strategy.
- Insight: Always have a fallback for direct SQL application, especially for targeted schema changes. Using `docker exec` to run `psql` inside the database container is a robust way to apply precise `ALTER TABLE` and `CREATE TABLE` commands after carefully reviewing the `prisma migrate diff` output (generated from within the app container to ensure the correct Prisma version).
- Action: We need to refine our safe migration script for production or standardize on a direct SQL approach for specific changes.
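For reference, the fallback workflow looks roughly like this. The container, user, and database names are placeholders for your environment, not our actual values:

```shell
# 1. Generate the SQL diff from inside the app container
#    (ensures the correct Prisma version is used).
docker compose exec app npx prisma migrate diff \
  --from-url "$DATABASE_URL" \
  --to-schema-datamodel prisma/schema.prisma \
  --script > migration.sql

# 2. Review migration.sql carefully, then apply it via psql
#    inside the database container.
docker exec -i my-db-container psql -U app_user -d app_db < migration.sql
```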
Lesson 2: Prisma Client Regeneration
Initially, after adding new columns and models to `schema.prisma`, our TypeScript environment was flooded with errors because the Prisma client didn't know about these changes.
- Insight: The Prisma client is generated from your `schema.prisma` file. Even if your local database isn't in sync or reachable, `npx prisma generate` can still run successfully without a database connection, updating the client types purely from the schema definition.
- Action: Always run `npx prisma generate` after any `schema.prisma` change to keep your TypeScript types in sync.
Lesson 3: Verifying Services in a Containerized Environment
A quick `curl localhost:3000` from the production host failed to verify the app's deployment, returning `000` (curl's placeholder HTTP status when no response is received). This was a reminder that our application runs behind NGINX within a Docker network.
- Insight: Direct host-to-app communication isn't always possible or desirable in containerized setups. To verify internal service health, you often need to execute commands within the Docker network or the relevant container.
- Action: `docker exec nyxcore-nginx-1 wget -q -O /dev/null -S http://app:3000/` was the correct approach, verifying that the app container serves requests successfully from the perspective of NGINX.
What's Next?
With Persona Evaluation v2 live, our immediate next steps involve thorough verification on production:
- Confirming v2 scoring, judge reasoning, and markers for a Quick Eval on a built-in persona.
- Checking the read-only profile display for built-in personas.
- Running a Full Eval to test tailored attack vectors.
- Implementing RLS policies for the `persona_profiles` table for an extra layer of security.
- Testing the full custom persona flow: creation, draft generation, review, approval, and evaluation.
Beyond these, we're already looking ahead to new features, including a "Rent-a-Persona" API.
Conclusion
Persona Evaluation v2 is a monumental step forward in our ability to understand, control, and improve our AI personas. By combining deterministic scoring with LLM insights and a flexible profile management system, we've built a robust foundation for the future of AI development. This was a challenging but incredibly rewarding effort, and I'm excited to see the impact it has on the quality and reliability of our AI.