nyxcore-systems

Hardening Our AI with Adversarial Personas: Lessons from Designing Persona Evaluation v2

We just wrapped up the design phase for Persona Evaluation v2, a critical upgrade for how we assess our AI's robustness. This journey involved battling LLM token limits, leveraging adversarial analysis, and learning hard lessons about building resilient systems.

AI, LLM, Software Design, Adversarial Testing, System Design, Debugging, Development Workflow

Late last night, after a marathon session, we hit a major milestone: the design specification for Persona Evaluation v2 is complete, reviewed, hardened, and ready for implementation planning. This isn't just another feature; it's a fundamental upgrade to how we understand and improve our AI's behavior, especially under pressure.

Our goal with v2 was ambitious: move beyond basic evaluations to a system that can truly stress-test our AI against a diverse array of personas, including highly adversarial ones. This session was all about locking down that design, and as often happens in development, it came with its share of triumphs and unexpected hurdles.

The Vision: A Smarter, Hardened Persona Evaluation

The core of Persona Evaluation v2 is a sophisticated approach to assessing how our AI interacts with different 'personas.' Think of it as putting our AI through a series of role-playing scenarios, some friendly, some not-so-friendly, to ensure it behaves as expected.

Here are the six key design sections we hammered out:

  1. PersonaProfile Interface: We're defining 12 built-in personas (like NyxCore, Athena, Nemesis, Ipcha Mistabra itself) and a robust auto-derivation pipeline for custom ones. The system will learn to create new persona profiles, but with a crucial human-in-the-loop approval step.
  2. Judge Rubric Overhaul: Moving beyond simple pass/fail, we're implementing a hybrid scoring system. Deterministic checks (regex, keyword matching) will handle the clear-cut cases, while an LLM-powered judge provides nuanced, advisory scoring, especially for complex interactions.
  3. Adversarial Persona Exploitation: This is where things get interesting. We're building a Proof-of-Concept (PoC) test type with 5 attack categories, designed to actively try and break our AI's persona adherence. It's about finding weaknesses before they become problems in the wild.
  4. Scoring Weight Rebalancing: Different test types have different stakes. We're introducing test-type-specific weight profiles to ensure our evaluations accurately reflect the importance of each interaction.
  5. Auto-derivation Pipeline: Custom persona profiles will be automatically derived from user interactions, then persisted in the database with a 'draft' or 'approved' status, always requiring human oversight.
  6. Schema Changes: To support all this, we're introducing 5 new nullable columns and a dedicated PersonaProfile Prisma model.
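The exact shape of the PersonaProfile interface is still being finalized, but a minimal TypeScript sketch gives the flavor. Note that the field names and the `isUsable` helper here are illustrative assumptions, not taken from the spec:

```typescript
// Hypothetical sketch of a PersonaProfile shape; field names are
// illustrative, not from the v2 spec.
type ProfileStatus = "draft" | "approved";

interface PersonaProfile {
  id: string;
  name: string;          // e.g. "NyxCore", "Athena", "Nemesis"
  builtIn: boolean;      // true for the 12 shipped personas
  status: ProfileStatus; // auto-derived profiles start as "draft"
  traits: string[];      // behavioral traits the judge scores against
}

// Human-in-the-loop gate: only built-in or approved profiles may be
// used in evaluations; auto-derived drafts must be approved first.
function isUsable(profile: PersonaProfile): boolean {
  return profile.builtIn || profile.status === "approved";
}
```

The key property is the gate: an auto-derived profile enters as a draft and only becomes eligible for evaluation after a human flips it to approved.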

This design wasn't just pulled out of thin air. It went through rigorous internal review by our code-reviewer subagent, which surfaced 4 critical issues we promptly addressed. But the real hardening came next.

Enter Ipcha Mistabra: Our Adversarial Guardian

To truly validate the design, we unleashed Ipcha Mistabra. For those unfamiliar, Ipcha Mistabra (meaning "the opposite is true" in Aramaic) is our in-house adversarial analysis agent. It's designed to take a design spec, break it down, and identify potential vulnerabilities or unintended consequences from an attacker's perspective. It's like having a team of red-teamers scrutinize your blueprints before you even lay the first brick.

Running Ipcha against our Persona Evaluation v2 spec was invaluable. It uncovered:

  • 2 CRITICAL risks
  • 3 HIGH risks

These findings forced us to rethink certain assumptions and build in safeguards at the design stage. For example, we refined our human-in-the-loop approval process for custom personas and strengthened the deterministic checks in our hybrid scoring. This proactive hardening is crucial for building robust AI systems.
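To make the deterministic half of the hybrid scoring concrete, here is a rough sketch of what a regex-based hard check might look like. The patterns and return shape are invented for this example; the real checks live in the v2 spec:

```typescript
// Illustrative deterministic check: a fast regex pass that runs before
// the advisory LLM judge. Patterns here are made up for the sketch.
interface DeterministicResult {
  passed: boolean;
  matchedForbidden: string[];
}

function deterministicCheck(
  response: string,
  forbiddenPatterns: RegExp[],
): DeterministicResult {
  const matchedForbidden = forbiddenPatterns
    .filter((re) => re.test(response))
    .map((re) => re.source);
  // Any forbidden match is a hard fail; ambiguous cases are left to
  // the LLM judge's nuanced, advisory scoring.
  return { passed: matchedForbidden.length === 0, matchedForbidden };
}
```

The design intent is that these cheap, deterministic failures never reach the LLM judge, which keeps the expensive advisory scoring focused on genuinely ambiguous interactions.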

The Unforeseen Hurdle: Battling LLM Token Limits

However, our Ipcha Mistabra run wasn't without its own drama. We hit a wall with LLM token limits – a common headache when working with large language models.

The Problem: When Ipcha Mistabra's "Adversarial Analysis (fan-out)" step kicked off, each parallel provider analysis hit the default 4096 token ceiling. The output was truncated, leading to incomplete and unreliable adversarial findings. It was like asking a detective to solve a case but only letting them read half the witness statements.

The Fix & The Lesson: We quickly identified the bottleneck in src/server/trpc/routers/workflows.ts and adjusted the maxTokens for Ipcha's critical steps:

  • Adversarial Analysis (fan-out): 4096 → 16384
  • Synthesis, Arbitration, Results: 4096 → 8192

After bumping these limits, a re-run yielded full, comprehensive adversarial insights.
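In spirit, the fix amounts to overriding the default ceiling for the heavy steps. A hedged sketch of the pattern (the actual change lives in src/server/trpc/routers/workflows.ts; the step keys and lookup helper here are illustrative):

```typescript
// Illustrative per-step token budgets; the real values are set in
// src/server/trpc/routers/workflows.ts. Step names are placeholders.
const DEFAULT_MAX_TOKENS = 4096;

const stepMaxTokens: Record<string, number> = {
  "adversarial-analysis-fanout": 16384, // was truncating at the 4096 default
  synthesis: 8192,
  arbitration: 8192,
  results: 8192,
};

function maxTokensFor(step: string): number {
  // Fall back to the default for steps without an explicit override.
  return stepMaxTokens[step] ?? DEFAULT_MAX_TOKENS;
}
```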

Takeaway: When integrating LLMs into complex workflows, especially those involving detailed analysis or synthesis of large inputs/outputs, always be mindful of context window limits. Default settings often aren't sufficient for deep dives, and truncated outputs can silently compromise the integrity of your results. Test your LLM chains with realistic data volumes early!

Beyond the Design: Other Lessons from the Trenches

The session also threw a few other curveballs, yielding some useful "lessons learned" for any developer:

  • The Vanishing File Act: At one point, our carefully crafted design spec file (docs/superpowers/specs/2026-03-12-persona-evaluation-v2-design.md) mysteriously disappeared from the working tree, even though Git knew it existed.

    • Lesson: When files go missing but Git insists they're there, git checkout HEAD -- <path/to/file> is your best friend. It restores the file from the latest commit without affecting other changes. A good reminder of Git's power (and occasional quirks).
  • Remote Database Access: Trying to run Prisma query scripts locally failed because our dev environment uses a production-only DB setup.

    • Lesson: Don't forget your environment context! If your local setup lacks a DB, remember your remote access tools. SSHing to production and querying via docker exec nyxcore-postgres-1 psql saved the day. Document these access patterns for new team members.
  • Escaping Characters in npx tsx: Attempting to run a quick npx tsx -e "..." command with escaped characters (specifically !) in the shell led to esbuild syntax errors.

    • Lesson: For anything beyond the simplest one-liners, especially when dealing with shell escaping and JavaScript syntax, it's safer and clearer to write your script to a temporary file (/tmp/script.ts) and then execute it via npx tsx /tmp/script.ts. It avoids a frustrating battle with shell parsers.

Looking Ahead: Implementation & Key Decisions

With the design hardened and committed, the immediate next step is to invoke our superpowers:writing-plans skill to generate a detailed implementation plan.

Key decisions from the spec that will guide implementation:

  • Hybrid scoring will combine deterministic checks (regex, keyword) with an LLM judge for robust evaluation.
  • Human-in-the-loop approval is mandatory for all custom persona profiles before active use.
  • DB-persisted profiles will use a new PersonaProfile Prisma model with draft/approved statuses.
  • Tiered evaluation will offer "Quick" (temp + generic jailbreaks) and "Full" (all types + tailored) options.
  • We'll be authoring 12 built-in persona profiles for our core entities like NyxCore, Athena, and Ipcha Mistabra.
  • A new UI page at /personas/[id]/profile will facilitate draft review and approval.
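The tiered evaluation decision above could be sketched as a simple tier-to-test-type mapping. The test-type names below are placeholders paraphrased from the bullet, not identifiers from the spec:

```typescript
// Placeholder sketch of tier selection; test-type names are illustrative.
type EvalTier = "quick" | "full";

function testTypesFor(tier: EvalTier): string[] {
  if (tier === "quick") {
    // Quick: temperament checks plus generic jailbreak probes only.
    return ["temperament", "generic-jailbreak"];
  }
  // Full: all test types, including persona-tailored adversarial attacks.
  return ["temperament", "generic-jailbreak", "tailored-adversarial", "consistency"];
}
```

The design choice here is cost control: the quick tier gives a cheap smoke signal, while the full tier spends LLM budget on persona-tailored attacks.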

This session was a microcosm of modern AI development: pushing the boundaries of what our systems can do, leveraging advanced tools for validation, and debugging the unexpected challenges that inevitably arise. We're excited to move into the implementation phase and bring Persona Evaluation v2 to life!

```json
{"thingsDone":[
  "Completed Persona Evaluation v2 design spec, covering 6 key sections.",
  "Conducted adversarial analysis on the spec using Ipcha Mistabra, identifying and hardening 5 critical/high risks.",
  "Fixed Ipcha Mistabra LLM token limits (4096 -> 16384/8192) to prevent truncated output.",
  "Resolved issues found by `code-reviewer` subagent.",
  "All design and fix commits deployed to main."
],"pains":[
  "Ipcha Mistabra's 'Adversarial Analysis' step hit 4096 token limit, truncating output.",
  "Design spec file disappeared from working tree (existed in Git, not on disk).",
  "Failed to run Prisma query scripts locally due to production-only DB.",
  "Failed to run `npx tsx -e \"...\"` with escaped characters due to esbuild syntax errors."
],"successes":[
  "Successfully increased Ipcha Mistabra's maxTokens for critical steps.",
  "Restored missing file using `git checkout HEAD -- <path>`.",
  "Accessed production DB via SSH and `docker exec psql`.",
  "Used temporary file for `npx tsx` scripts with complex escaping."
],"techStack":[
  "LLMs (for Judge rubric, Ipcha Mistabra, Persona derivation)",
  "Prisma (for schema changes, PersonaProfile model)",
  "Git (for version control, file recovery)",
  "TypeScript (for `tsx` scripts, application logic)",
  "trpc (for API router, e.g., `workflows.ts`)",
  "Docker (for database access)",
  "Adversarial Analysis (Ipcha Mistabra, internal tool)",
  "Design Specifications (Markdown docs)"
]}
```