nyxcore-systems

Level Up Your AI: Unveiling Persona Evaluation v2 with Hybrid Scoring and Deterministic Profiles

We just shipped Persona Evaluation v2, a significant upgrade to how we assess AI behavior. Dive into our journey of building a profile-driven, hybrid scoring system that brings determinism, nuance, and transparency to LLM evaluations.

AI · LLM · Evaluation · TypeScript · Prisma · DevOps · Production · HybridScoring · Persona

Late last night, after a marathon coding session, we pushed a major update that fundamentally changes how we evaluate our AI personas. I'm thrilled to announce the successful deployment of Persona Evaluation v2! This wasn't just a patch; it was a complete overhaul, moving from a more generic scoring system to a sophisticated, profile-driven hybrid approach that brings unprecedented determinism and insight into LLM behavior.

Our goal was ambitious: implement a robust, 11-task plan covering everything from new type definitions to a full production deployment. And I'm proud to say, we hit every single one. The feat/persona-eval-v2 branch is now merged to main, deployed, and running live.

Let's dive into the core changes and the journey it took to get there.

The Quest for Deeper Understanding: Why Persona Evaluation v2?

Our previous evaluation system, while functional, had limitations. It was somewhat generic, making it harder to precisely measure adherence to complex persona traits or identify subtle deviations. We needed a system that:

  1. Understood Nuance: Each persona is unique; their evaluation shouldn't be one-size-fits-all.
  2. Increased Determinism: Relying solely on LLM judges for scoring introduced variability. We needed more objective, reproducible metrics.
  3. Provided Actionable Insights: Beyond a simple score, we wanted to know why a persona succeeded or failed, and how it could be improved.

Persona Evaluation v2 addresses these needs head-on, introducing profile-driven hybrid scoring.

Building the Brains: A Deep Dive into the Implementation

Our journey unfolded across four key development chunks, each building on the last.

1. Laying the Foundation: Types, Schema & Persona Profiles

The first step was to define the new world. We introduced a suite of new types in src/server/services/persona-evaluation-types.ts to capture the complexity of persona profiles, attack vectors, marker definitions, and detailed evaluation outputs.

The prisma/schema.prisma file saw significant modifications, adding crucial columns to our PersonaEvaluation model (like refusalQuality, coherenceUnderLoad, judgeReasoning, compositeScore, evalTier, discrepancyFlag). Crucially, we introduced a brand-new PersonaProfile model to store persona-specific data.
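
To give a feel for the shape, here's roughly what those new fields look like on the TypeScript side. The field names come straight from the schema change; the types themselves are a simplified sketch, not the generated Prisma client:

```typescript
// Simplified sketch of the v2 additions to PersonaEvaluation.
// Field names match the new columns; the exact generated types
// (nullability, enums vs. strings) may differ.
interface PersonaEvaluationV2Fields {
  refusalQuality: number | null;     // bonus dimension: how gracefully the persona refuses
  coherenceUnderLoad: number | null; // bonus dimension: stability during degradation tests
  judgeReasoning: string | null;     // the LLM judge's free-text rationale
  compositeScore: number;            // deterministic and advisory scores combined
  evalTier: "quick" | "full";        // which evaluation tier produced this record
  discrepancyFlag: boolean;          // set when deterministic and judge scores diverge
}
```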

Perhaps the most exciting part of this chunk was the introduction of 12 built-in persona profiles (Cael, NyxCore, Athena, Nemesis, Harmonia, Clotho, Hermes, Tyche, Themis, Prometheus, Aletheia, Ipcha Mistabra) in src/server/services/persona-profiles.ts. These profiles are the heart of v2, providing a rich, structured definition for each AI, including their core traits, specific test prompts, and tailored attack vectors. We even included a GENERIC_PROFILE fallback for custom personas without a defined profile.
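
To make that concrete, a profile looks roughly like the sketch below. Treat it as illustrative: the traits, test prompts, attack vectors, and marker definitions are the real ingredients, but the exact field names in persona-profiles.ts may differ:

```typescript
// Illustrative sketch of a persona profile; the real shape in
// persona-profiles.ts may differ in naming and detail.
interface PersonaProfile {
  name: string;            // e.g. "Athena"
  coreTraits: string[];    // the traits role adherence is scored against
  testPrompts: {
    temperature: string[]; // prompts for runTemperatureTest
    degradation: string[]; // prompts for runDegradationTest
  };
  attackVectors: string[]; // the 5 tailored jailbreak attempts for full-tier evals
  markers: {
    positive: RegExp[];    // patterns that should appear in responses
    negative: RegExp[];    // anti-patterns that incur penalties
  };
}

// Fallback for custom personas without a defined profile.
declare const GENERIC_PROFILE: PersonaProfile;
```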

2. The Core Engine: Deterministic Scoring & Evaluator Rewrite

With the foundation set, we completely rewrote our evaluation engine in src/server/services/persona-evaluator.ts. This is where the magic of hybrid scoring happens.

Gone are the days of purely generic scoring. Our new evaluator now features:

  • scoreMarkersDeterministic(): A regex-based scorer for specific keywords and patterns, ensuring objective, reproducible measurement independent of any judge model.
  • scoreRoleAdherenceDeterministic(): This function uses behavior matching and anti-pattern penalties to objectively assess how well an LLM adheres to its defined role.
  • buildJudgePrompt(): Instead of a generic system prompt, our LLM judge now receives a persona-specific prompt, allowing for more nuanced and context-aware evaluations.
  • hybridScore(): This is the cornerstone. It combines our deterministic primary scores with an LLM's advisory input. Critically, it includes a discrepancy detection mechanism to flag when the deterministic score and the LLM's assessment diverge significantly, prompting further investigation (sketched at the end of this section).
  • Profile-Driven Tests: Our runTemperatureTest, runJailbreakTest, and runDegradationTest now leverage the testPrompts defined within each persona's profile, including GENERIC_JAILBREAKS and 5 tailored attack vectors per persona for full-tier evaluations.
  • Tiered Evaluations: We introduced runQuickEval() (temperature + generic jailbreaks) and runFullEval() (all three types + tailored attacks) to provide flexible evaluation depths.

This rewrite transforms our evaluator into a powerful, precise, and highly configurable tool.
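
To illustrate the core mechanism, here's a minimal sketch of deterministic marker scoring feeding into the hybrid combiner. The weights, penalty, and discrepancy threshold are illustrative assumptions, not the production values in persona-evaluator.ts:

```typescript
// Minimal sketch of hybrid scoring; all constants here are
// illustrative, not the values used in production.

interface Markers {
  positive: RegExp[]; // patterns the response should contain
  negative: RegExp[]; // anti-patterns that incur penalties
}

// Deterministic, regex-based marker scoring: fraction of positive
// markers hit, minus a penalty per anti-pattern match, clamped to [0, 1].
function scoreMarkersDeterministic(response: string, markers: Markers): number {
  const hits = markers.positive.filter((re) => re.test(response)).length;
  const penalties = markers.negative.filter((re) => re.test(response)).length;
  const raw = hits / Math.max(markers.positive.length, 1) - 0.2 * penalties;
  return Math.min(1, Math.max(0, raw));
}

// Hybrid score: the deterministic score stays primary, the judge's
// score is advisory, and a large divergence raises the discrepancy flag.
function hybridScore(deterministic: number, judgeScore: number) {
  const compositeScore = 0.7 * deterministic + 0.3 * judgeScore;
  const discrepancyFlag = Math.abs(deterministic - judgeScore) > 0.3;
  return { compositeScore, discrepancyFlag };
}
```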

3. Bringing it to Life: Profile Management & UI

An intelligent system is only as good as its interface. We made significant updates to our tRPC procedures and UI:

  • src/server/trpc/routers/personas.ts: New procedures like generateProfileDraft (which uses an LLM to derive a profile from a system prompt!), approveProfile, and getProfile empower users to manage persona definitions; a sketch of these procedures follows this list. We also updated runEvaluation to support "quick" and "full" tiers, and enhanced evaluationHistory/evaluationTrend to display our v2 fields.
  • src/app/(dashboard)/dashboard/personas/[id]/evaluations/page.tsx: The evaluation display now showcases the compositeScore, attackVector, evalTier, and discrepancyFlag with clear badges. Bonus dimensions like refusalQuality and coherenceUnderLoad offer deeper insights, and a dedicated section for judgeReasoning provides transparency. Users can now select "Quick Eval" or "Full Eval" directly from the UI.
  • src/app/(dashboard)/dashboard/personas/[id]/profile/page.tsx: A brand new profile management page! Here, users can view status badges (BUILT-IN/DRAFT/APPROVED/NO PROFILE), inspect profile JSON, generate new drafts, approve them, or regenerate existing ones. This provides a powerful workflow for custom persona tuning.
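
As promised, here's a rough sketch of those router additions. Only the procedure names come from the actual code; the input shapes, return values, protectedProcedure helper, and import path are assumptions for illustration:

```typescript
import { z } from "zod";
// `router` and `protectedProcedure` stand in for the usual tRPC
// initialization helpers; the import path is illustrative.
import { router, protectedProcedure } from "../trpc";

export const personasRouter = router({
  // Derive a draft profile from the persona's system prompt via an LLM.
  generateProfileDraft: protectedProcedure
    .input(z.object({ personaId: z.string() }))
    .mutation(async ({ input }) => {
      // ...call the LLM, persist a PersonaProfile row with status DRAFT...
      return { personaId: input.personaId, status: "DRAFT" as const };
    }),

  // Promote a reviewed draft to APPROVED.
  approveProfile: protectedProcedure
    .input(z.object({ personaId: z.string() }))
    .mutation(async ({ input }) => {
      // ...flip the stored profile's status to APPROVED...
      return { personaId: input.personaId, status: "APPROVED" as const };
    }),

  // Run an evaluation at the requested depth.
  runEvaluation: protectedProcedure
    .input(
      z.object({
        personaId: z.string(),
        tier: z.enum(["quick", "full"]).default("quick"),
      }),
    )
    .mutation(async ({ input }) => {
      // ...dispatch to runQuickEval() or runFullEval() based on input.tier...
      return { personaId: input.personaId, tier: input.tier };
    }),
});
```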

4. The Final Push: Production Deployment

After rigorous testing, the moment of truth arrived. We successfully applied the schema migration directly to our production database (7 ALTER TABLE statements and 1 CREATE TABLE statement) and deployed the rebuilt application. Persona Evaluation v2 is now live!

Navigating the Treacherous Waters: Lessons Learned

No major release is without its challenges. Here are a few critical lessons we learned along the way:

Lesson 1: Database Migrations in Production Environments

Our standard db-migrate-safe.sh --dry-run script failed on production due to environment variable issues and npx not being available directly on the host. This highlighted a critical gap in our production migration strategy.

  • Insight: Always keep a fallback for applying SQL directly, especially for targeted schema changes. docker exec-ing into the database container and running psql is a robust way to apply precise ALTER TABLE and CREATE TABLE commands, after carefully reviewing the prisma migrate diff output (generated from within the app container to ensure the correct Prisma version).
  • Action: We need to refine our safe migration script for production or standardize on a direct SQL approach for specific changes.

Lesson 2: Prisma Client Regeneration

Initially, after adding new columns and models to schema.prisma, our TypeScript environment was flooded with errors because the Prisma client didn't know about these changes.

  • Insight: The Prisma client is generated from your schema.prisma file alone. npx prisma generate doesn't need a database connection, so even if your local database is out of sync or unreachable, it will still update the client types based purely on the schema definition.
  • Action: Always run npx prisma generate after any schema.prisma changes to keep your TypeScript types in sync.

Lesson 3: Verifying Services in a Containerized Environment

A quick curl localhost:3000 from the production host failed to verify the app's deployment, returning 000 (curl's way of reporting that it never received an HTTP response). This was a reminder that our application runs behind NGINX within a Docker network.

  • Insight: Direct host-to-app communication isn't always possible or desirable in containerized setups. To verify internal service health, you often need to execute commands within the Docker network or the relevant container.
  • Action: docker exec nyxcore-nginx-1 wget -q -O /dev/null -S http://app:3000/ was the correct approach to check if the app container was serving requests successfully from the perspective of NGINX.

What's Next?

With Persona Evaluation v2 live, our immediate next steps involve thorough verification on production:

  • Confirming v2 scoring, judge reasoning, and markers for Quick Eval on a built-in persona.
  • Checking the read-only profile display for built-in personas.
  • Running a Full Eval to test tailored attack vectors.
  • Implementing row-level security (RLS) policies for the persona_profiles table for an extra layer of security.
  • Testing the full custom persona flow: creation, draft generation, review, approval, and evaluation.

Beyond these, we're already looking ahead to new features, including a "Rent-a-Persona" API.

Conclusion

Persona Evaluation v2 is a monumental step forward in our ability to understand, control, and improve our AI personas. By combining deterministic scoring with LLM insights and a flexible profile management system, we've built a robust foundation for the future of AI development. This was a challenging but incredibly rewarding effort, and I'm excited to see the impact it has on the quality and reliability of our AI.