nyxcore-systems

Fortifying Our LLM Backbone: A Tale of Fallbacks, Data Isolation, and Production Learnings

Join us on a deep dive into a recent development session where we added critical LLM provider fallbacks and fixed a multi-tenant data visibility issue, sharing the challenges, solutions, and lessons learned from the front lines of production deployment.

LLM · TypeScript · Backend · System Design · Error Handling · Data Privacy · tRPC · Production · Development Workflow

Every development session brings its unique set of puzzles, triumphs, and sometimes, a few head-scratching moments. This past week, we embarked on a mission to harden our platform's LLM (Large Language Model) integration and ensure impeccable data isolation for our multi-tenant architecture. The goal was clear: build a more resilient system and fix a specific data visibility issue for one of our tenants, ckb-nyx.

After a focused push, I'm thrilled to report that all changes are live in production, significantly enhancing our system's robustness. Let's break down the journey, the challenges we faced, and the solutions we implemented.

Building a Resilient LLM System: The Fallback Mechanism

In a world increasingly reliant on external APIs, especially for critical services like LLMs, anticipating failure is not pessimism—it's good engineering. Our primary LLM provider is robust, but what happens if a tenant hasn't configured a key, or if the primary service experiences an outage? Service disruption, that's what.

Our mission was to introduce a graceful fallback mechanism.

The Initial Approach (and its limitations)

We started by wiring up a fallback provider directly within our workflow engine's retry loop. The idea was simple: if the primary provider failed after its retries, the engine would then attempt to use the tenant's configured fallback provider before finally giving up. This was a step in the right direction, providing resilience for API call failures within the workflow engine itself.

(See commit 931e1fa)
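
To make that first approach concrete, here is a minimal sketch of the retry-then-fallback idea. This is illustrative only, not the actual workflow-engine code: the LLMProvider interface, the sleep helper, and the function name are assumptions for the sake of the example.

typescript
// Illustrative sketch of retrying the primary provider, then falling back once.
interface LLMProvider {
  complete(prompt: string): Promise<string>;
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetriesThenFallback(
  primary: LLMProvider,
  fallback: LLMProvider | null,
  prompt: string,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await primary.complete(prompt);
    } catch {
      if (attempt < maxRetries) await sleep(attempt * 1000); // simple linear backoff
    }
  }
  // Primary exhausted its retries: try the tenant's fallback once before giving up.
  if (fallback) return fallback.complete(prompt);
  throw new Error("Primary LLM provider failed and no fallback is configured.");
}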

The "Aha!" Moment: Global Resilience

While effective for the workflow engine, this approach felt piecemeal. What about all the other services that interact with LLMs directly? Duplicating fallback logic across 20+ services wasn't scalable or maintainable. This led to a crucial realization: the fallback logic needed to be baked into the very core of how we resolve an LLM provider.

Enter resolveProvider() in src/server/services/llm/resolve-provider.ts. This function is the single source of truth for getting an LLM provider instance. By modifying it, we could achieve global fallback without touching a single import in our application.

Here's how we enhanced it:

typescript
// Simplified conceptual example. The types are sketched inline for
// illustration; the real ones live elsewhere in the codebase.
interface TenantConfig {
  llmDefaults: { provider: string; fallbackProvider?: string };
  llmKeys: Record<string, string | undefined>;
}

interface LLMProvider {
  complete(prompt: string): Promise<string>;
}

declare function createProviderInstance(name: string, apiKey: string): LLMProvider;

export function resolveProvider(tenantConfig: TenantConfig): LLMProvider {
  const primaryProvider = tenantConfig.llmDefaults.provider;
  const primaryKey = tenantConfig.llmKeys[primaryProvider];

  if (primaryKey) {
    return createProviderInstance(primaryProvider, primaryKey);
  }

  // If the primary key is missing, check for a configured fallback
  const fallbackProvider = tenantConfig.llmDefaults.fallbackProvider;
  if (fallbackProvider) {
    const fallbackKey = tenantConfig.llmKeys[fallbackProvider];
    if (fallbackKey) {
      console.warn(`Primary key for ${primaryProvider} missing. Falling back to ${fallbackProvider}.`);
      return createProviderInstance(fallbackProvider, fallbackKey);
    }
  }

  // No usable key for primary or fallback: fail loudly
  // (a strict variant can return null instead of throwing).
  throw new Error("No LLM provider key configured.");
}

Now, resolveProvider() intelligently checks for the tenant's fallbackProvider if the primary provider has no key configured. This means:

  • Automatic Fallback: All 23+ services that call resolveProvider() automatically gain fallback capabilities.
  • Zero Code Changes: No import statements needed to be updated across the codebase, drastically simplifying the rollout.
  • Clean API: resolveProviderWithFallback() was kept as an alias for backward compatibility, while an internal resolveProviderStrict() can return null instead of throwing for specific use cases; both are sketched below.
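
Building on the resolveProvider sketch above, the compatibility surface might look roughly like this. The names mirror the post, but the exact signatures are assumptions:

typescript
// Backward-compatible alias: existing call sites keep working unchanged.
export const resolveProviderWithFallback = resolveProvider;

// Internal strict variant: returns null instead of throwing when no usable
// key is configured, for callers that want to handle that case themselves.
function resolveProviderStrict(tenantConfig: TenantConfig): LLMProvider | null {
  try {
    return resolveProvider(tenantConfig);
  } catch {
    return null;
  }
}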

This architectural change (commit b6a7e34) ensures that our LLM integration is not just resilient to API call failures, but also to configuration gaps.

Ensuring Data Isolation: The userId Saga

Our platform serves multiple tenants, and maintaining strict data isolation is paramount. We encountered a specific issue where the ckb-nyx tenant was seeing data that wasn't scoped to their projects, hinting at a multi-tenancy bug.

The Problem: Over-scoping with userId

Upon investigation, the root cause was clear: several tRPC query procedures in src/server/trpc/routers/projects.ts were incorrectly filtering data by userId when they should have been scoped by tenantId or the project context itself. For shared resources within a tenant, userId is too narrow a filter.

For example, a docs.list procedure intended to show all documents for a project (accessible by any user within that project/tenant) was inadvertently filtering by the current user's ID, preventing other users from seeing shared content or, in some cases, showing unscoped data.
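
To make the bug concrete, here's an illustrative sketch of that over-scoped shape. It is not the actual projects router; the zod schema, the tRPC helpers, and the Prisma-style data layer are assumptions about the setup:

typescript
import { z } from "zod";
import { protectedProcedure, router } from "../trpc"; // assumed tRPC setup

// A project-scoped list query that also filters by the calling user, so
// documents created by teammates silently disappear from the results.
export const docsRouter = router({
  list: protectedProcedure
    .input(z.object({ projectId: z.string() }))
    .query(({ ctx, input }) =>
      ctx.db.doc.findMany({
        where: {
          projectId: input.projectId,
          userId: ctx.session.user.id, // BUG: over-scopes a shared resource
        },
      }),
    ),
});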

The Fix: Strategic userId Removal

The solution involved carefully auditing and removing userId from 10 specific query procedures: healthCheck, stats, notes.list, docs.list, docs.get, blogPosts.list, blogPosts.get, blogPosts.unblogged (two instances), and overview.

(See commit fbf1d16)
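
Revisiting the sketch above, the fix amounts to dropping the per-user filter and letting project membership, enforced upstream by the protected procedure, act as the access boundary:

typescript
// The same illustrative query with the over-narrow filter removed.
ctx.db.doc.findMany({
  where: {
    projectId: input.projectId, // project-level scoping only
    // userId removed: documents are shared across the project's members
  },
});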

By removing the userId filter in these contexts, we ensure that data shared at the project or tenant level is correctly visible to all authorized users within that scope, resolving the ckb-nyx tenant's visibility issues. It's a critical reminder that in multi-tenant systems, granularity of access control is key.

Lessons Learned from the Trenches

No deployment is without its hiccups. These moments, often frustrating in the moment, are invaluable learning opportunities.

1. The ESLint Trap & Better Design

The Pain: During the initial fallback wiring, I switched to resolveProviderWithFallback in workflow-engine.ts. This left the original resolveProvider import unused, triggering an ESLint error on deployment.

The Lesson: ESLint is your friend! It caught a potential dead code path. More importantly, this minor setback pushed me to rethink the approach. Instead of adapting imports, I realized baking the fallback into resolveProvider itself was a far cleaner, more robust, and globally impactful solution. Sometimes, a small error can lead to a significant architectural improvement.

2. The Classic git push Oversight

The Pain: Attempting to deploy the userId fix to production, I ran git pull on the server only to find no changes. My local commit hadn't been pushed to the remote repository.

The Lesson: A timeless reminder for every developer: Always git push origin main before attempting to deploy! Establish a clear, consistent deployment workflow. Forgetting this step costs time and can introduce confusion. Simple hygiene prevents complex headaches.

What's Next?

With these critical updates deployed, we're moving forward with a few immediate items:

  1. User Configuration: Our ckb-nyx tenant (and others) now needs to set a preferred fallback provider in Admin > LLM Defaults.
  2. Verification: Confirm that the project detail pages load correctly for the ckb-nyx tenant.
  3. Data Processing: Re-run the docs pipeline (4ef23a06-ae50-446e-bc1b-eb66bfa2985f) to process the 13 pending items that were previously affected by the userId issue.
  4. Audit: Continue auditing remaining routers for userId in queries, ensuring proper tenantId or project-level scoping everywhere it's needed (e.g., wardrobe is intentionally user-scoped, so it's an exception).
  5. Security: Set AUDIT_CRON_SECRET on production for enhanced security.

Conclusion

This session was a fantastic example of iterative development, problem-solving, and continuous improvement. We've significantly enhanced the resilience of our LLM integrations with an intelligent, global fallback mechanism, and solidified our commitment to data isolation within our multi-tenant architecture. Every challenge overcome makes our platform stronger, more reliable, and ultimately, better for our users.