Inside Our Dev Session: From AI Flakiness to Bulletproof Workflows & Hard-Earned Lessons
Join us as we pull back the curtain on a recent development session, tackling AI review reliability, optimizing workflow quality, and introducing a critical consistency check to our LLM-powered engine – all while learning a few painful, yet invaluable, lessons.
Building intelligent systems that understand and interact with code is a thrilling, yet often challenging, endeavor. Every development session feels like a mini-adventure, filled with problem-solving, breakthroughs, and the occasional head-desk moment. Today, I want to share a recent session that perfectly encapsulates this journey – a deep dive into hardening our AI review process, ensuring workflow quality, and introducing a crucial consistency check into our LLM-powered workflow engine.
Our primary goals for this session were clear:
- Fortify AI Review: Make our AI code review system more robust and reliable.
- Analyze Workflow Quality: Understand and improve how our core BRbase workflows handle complex codebases.
- Restructure for Consistency: Integrate a consistency check directly into our group workflow engine to catch issues before they manifest.
Let's unpack the journey.
Taming the AI Review Beast: Robustness & Clarity
Our AI review system is a cornerstone, providing automated suggestions and feedback. But like any complex system, it needed refinement. We tackled a few key areas:
1. The Fallback Dance: Never Miss a Beat
We've all been there: one LLM provider goes down, or hits a rate limit. To combat this, we implemented a robust provider fallback mechanism in src/server/trpc/routers/reviews.ts. Now, if Google fails, we gracefully switch to Anthropic, and if that falters, OpenAI steps in. This ensures our AI review service remains highly available, even when external dependencies throw a curveball.
```typescript
// Simplified illustration of the fallback logic
async function getReviewSuggestions(diff: string): Promise<ReviewSuggestions> {
  try {
    return await googleProvider.parseReviewSuggestions(diff);
  } catch (error) {
    console.warn("Google provider failed, falling back to Anthropic...", error);
    try {
      return await anthropicProvider.parseReviewSuggestions(diff);
    } catch (anthropicError) {
      console.warn("Anthropic provider failed, falling back to OpenAI...", anthropicError);
      return await openaiProvider.parseReviewSuggestions(diff);
    }
  }
}
```
2. Diff Size Matters: Guarding Against Overload
Large diffs can overwhelm LLMs, leading to truncated responses, increased costs, or outright failures. We introduced a diff size cap of 60,000 characters. Diffs exceeding this limit are handled gracefully, preventing system overload and ensuring predictable behavior.
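A guard like this can be sketched in a few lines. This is a minimal illustration, not our actual implementation — the function and type names are hypothetical, though the 60,000-character limit matches the cap described above:

```typescript
// Hypothetical sketch of a diff size guard (names are illustrative).
const MAX_DIFF_CHARS = 60_000;

type DiffCheck =
  | { ok: true; diff: string }
  | { ok: false; reason: string };

function checkDiffSize(diff: string): DiffCheck {
  if (diff.length > MAX_DIFF_CHARS) {
    // Reject oversized diffs up front instead of letting the LLM truncate.
    return {
      ok: false,
      reason: `Diff is ${diff.length} characters, exceeding the ${MAX_DIFF_CHARS}-character limit for AI review.`,
    };
  }
  return { ok: true, diff };
}
```

Checking the length before any provider call keeps failures cheap and the user-facing message predictable.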
3. Error Propagation: Clearer User Feedback
Previously, some errors from our parseReviewSuggestions function might have been swallowed or presented ambiguously. We refined the error propagation, ensuring that users receive clear, actionable messages, differentiating between "no issues found" and an actual processing error. This was also reflected in the UI update in src/app/(dashboard)/dashboard/projects/[id]/reviews/[prNumber]/page.tsx.
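One way to make that distinction explicit is a discriminated union for the review outcome. The shape below is a hypothetical sketch of the idea, not the exact type in `reviews.ts`:

```typescript
// Hypothetical result type separating "clean review" from an actual failure.
type ReviewOutcome =
  | { status: "clean" }                          // review ran, no issues found
  | { status: "issues"; suggestions: string[] }  // review ran, found suggestions
  | { status: "error"; message: string };        // the review itself failed

function toUserMessage(outcome: ReviewOutcome): string {
  switch (outcome.status) {
    case "clean":
      return "Review complete: no issues found.";
    case "issues":
      return `Review complete: ${outcome.suggestions.length} suggestion(s).`;
    case "error":
      return `Review failed: ${outcome.message}`;
  }
}
```

Because the compiler forces every branch to be handled, an error can no longer be silently rendered as an empty suggestion list.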
Elevating Workflow Quality: Context is King
Our workflow engine generates complex implementation plans and code based on project context. Ensuring this context is accurate is paramount.
1. Unmasking Contextual Drifting
We meticulously analyzed an older workflow (a781148f) and discovered a critical issue: nyxCore specific patterns were being injected into a BRbase workflow. This led to incorrect code suggestions, duplicate sub-outputs, and file conflicts – a classic case of context pollution.
2. Validating the New Generation
In contrast, our new cce03fe4 workflow was a breath of fresh air. It correctly linked to the BRbase project, utilized 234 specific BRbase code patterns, and referenced the right documentation (clarait/BRbase). This validation confirmed our efforts in improving context management were paying off.
3. More Tokens, More Power
For the cce03fe4 workflow, we bumped the maxTokens for all 18 steps from 4096 to 8192. This seemingly small tweak, done via direct SQL on production, provides the LLMs with more breathing room to generate comprehensive and detailed responses, especially for complex tasks.
The Game Changer: Automated Consistency Checks
This was arguably the most significant architectural improvement of the session. We built an auto-generated Consistency Check step directly into our workflow engine (src/server/services/workflow-engine.ts and src/server/services/implementation-prompt-generator.ts).
Why a Consistency Check?
Imagine an LLM generating multiple implementation plans (fan-out) for a single feature. Without coordination, these plans might:
- Try to create the same file.
- Implement the same piece of logic redundantly.
- Introduce conflicting dependencies.
- Violate established coding patterns.
Our new check acts as an internal guardian angel, preventing these issues before the LLMs even generate the final prompts.
How It Works: Pre-Fan-Out Validation
The consistency check runs before the fan-out phase. It analyzes individual plans (not the generated prompts yet) for:
- File Path Conflicts: Are multiple plans trying to write to the same file?
- Duplicate Implementations: Is the same functionality being implemented in different places unnecessarily?
- Pattern Inconsistencies: Do the plans adhere to the project's established code patterns (e.g., ~/imports, Clerk auth, MySQL, Jest, Chakra UI for BRbase)?
- Dependency Violations: Are there any implicit conflicts or missing dependencies between proposed changes?
The results of this check are then injected into each fan-out prompt via a consistencyCheck parameter on buildGroupItemPromptInput. This allows the LLM to be aware of potential issues and self-correct during its generation phase, leading to far more cohesive and robust outputs. We're primarily using Claude Sonnet for this review, with fallbacks to Gemini and GPT.
```typescript
// Simplified pseudo-code for the consistency check integration
interface ConsistencyCheckResult {
  hasConflicts: boolean;
  conflictDetails: string[]; // e.g., "File 'src/utils/new-helper.ts' is created by multiple plans."
  patternViolations: string[];
}

function runConsistencyChecks(individualPlans: Plan[]): ConsistencyCheckResult {
  // Logic to analyze file paths, patterns, duplicates across plans
  // ... using Claude Sonnet for sophisticated analysis
  return { hasConflicts: true, conflictDetails: ["..."], patternViolations: ["..."] };
}

function buildGroupItemPromptInput(plan: Plan, consistencyResult: ConsistencyCheckResult) {
  return {
    ...plan.promptContext,
    consistencyCheck: consistencyResult, // Injected here!
    // ... other prompt parameters
  };
}
```
We also confirmed that our OpenAI o3 integration works as expected, with isReasoningModel() regex correctly mapping to max_completion_tokens and omitting temperature for more deterministic outputs.
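The parameter mapping can be sketched roughly as follows. Note the regex and the exact parameter shapes below are assumptions based on the behavior described, not a copy of our implementation:

```typescript
// Sketch of reasoning-model parameter mapping (regex and shapes are assumptions).
function isReasoningModel(model: string): boolean {
  // Matches OpenAI reasoning-model names like "o1", "o3", "o3-mini".
  return /^o\d/.test(model);
}

interface CompletionParams {
  model: string;
  max_tokens?: number;
  max_completion_tokens?: number;
  temperature?: number;
}

function buildCompletionParams(model: string, maxTokens: number): CompletionParams {
  if (isReasoningModel(model)) {
    // Reasoning models take max_completion_tokens and no temperature.
    return { model, max_completion_tokens: maxTokens };
  }
  return { model, max_tokens: maxTokens, temperature: 0.2 };
}
```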
Lessons from the Trenches: The "Pain Log" Transformed
Not everything was smooth sailing. We hit a few snags that served as excellent, albeit painful, learning opportunities.
Lesson 1: NEVER prisma db push on Production
This is a critical takeaway. I attempted to use prisma db push on production for a schema change and was stopped mid-command by a user. Thankfully, no damage was done — and the interruption was a sharp reminder of the explicit warning in our internal CLAUDE.md documentation: prisma db push can drop pgvector columns, leading to data loss in our vector embeddings.
Actionable Takeaway: Always use direct SQL or carefully crafted migration scripts for production schema changes. prisma db push is for development environments only. For production, prisma migrate deploy after prisma migrate dev is the intended flow, or raw SQL for specific, small changes.
Lesson 2: Prisma CamelCase vs. DB Snake_Case
I spent some time debugging a raw SQL query that used "isPersonal" as a column name. It failed. The database, of course, uses snake_case (is_personal), while Prisma's generated client uses camelCase.
Actionable Takeaway: When writing raw SQL queries, always refer to the actual database column names (snake_case). Prisma's client abstracts this, but raw SQL doesn't.
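A tiny helper can take the guesswork out of this. Here's a hypothetical sketch that converts a Prisma-style camelCase field name to the snake_case column name the database actually uses:

```typescript
// Hypothetical helper: map a Prisma camelCase field name to its
// snake_case database column name for use in raw SQL.
function toDbColumn(field: string): string {
  // Insert an underscore before each uppercase letter, then lowercase.
  return field.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}
```

With this, a raw query can be built from the same field names the Prisma client uses, without hand-translating each one.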
Lesson 3: Beware of Reserved Keywords
Another SQL hiccup involved querying the position column on workflow_steps. It turns out the column is actually named order (confirmed via Prisma schema). ORDER is a reserved keyword in SQL, so it needs to be quoted ("order").
Actionable Takeaway: Always confirm column names against your actual schema, and be mindful of SQL reserved keywords, quoting them when necessary.
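A defensive quoting helper makes this mechanical. The reserved-word set below is a tiny illustrative subset (it is nowhere near the full SQL reserved list), and the helper itself is a sketch, not production code:

```typescript
// Hypothetical helper: quote an identifier if it collides with a SQL
// reserved word. RESERVED is a small illustrative subset only.
const RESERVED = new Set(["order", "group", "user", "select", "where"]);

function quoteIdentifier(name: string): string {
  return RESERVED.has(name.toLowerCase()) ? `"${name}"` : name;
}

// e.g. building a query against workflow_steps:
const sql = `SELECT id, ${quoteIdentifier("order")} FROM workflow_steps ORDER BY ${quoteIdentifier("order")}`;
```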
What's Next? The Journey Continues
As I write this, the new cce03fe4 workflow is about to be run by a user. Our immediate next steps involve:
- Monitoring: Closely observe the user's execution of workflow cce03fe4 to ensure correct BRbase pattern injection.
- Validation: After completion, verify that our new consistency check effectively caught any issues and that implementation prompts correctly utilize BRbase patterns (e.g., ~/imports, Clerk auth, MySQL, Jest, Chakra UI).
- Codebase Sync: Resolve pending merge conflicts for PR #2276 on BRbase to bring ionos-migration up to date with main.
This session was a testament to the iterative nature of software development. We fixed immediate issues, introduced significant architectural improvements, and learned valuable lessons that will make our systems even more robust in the future. Onwards!
{"thingsDone":["AI review fallback mechanism","diff size capping","improved AI review error propagation","workflow context validation","LLM maxTokens increase","automated workflow consistency check","OpenAI o3 integration confirmation"],"pains":["attempted prisma db push on production","Prisma camelCase vs DB snake_case column name mismatch","SQL reserved keyword issue ('order' vs 'position')"],"successes":["robust and reliable AI review service","accurate and high-quality workflow generation","preventative consistency checks in workflow engine","clearer user feedback for AI reviews","learning from production mistakes and documenting solutions"],"techStack":["TypeScript","Next.js","tRPC","Prisma","PostgreSQL","Google AI","Anthropic Claude","OpenAI GPT","Docker"]}