
Beyond the Bug Fix: Centralizing LLM Fallbacks and Sharpening Multi-Tenant Data Access

Join us as we recount a focused development session tackling critical multi-tenant data visibility issues and architecting a robust, global fallback system for our LLM providers, complete with real-world lessons learned.

Tags: LLM Fallback · Multi-Tenant · tRPC · Node.js · Architecture · Deployment · Git · ESLint · Software Engineering

As developers, our days are often a blend of building new features, squashing bugs, and, if we're lucky, making foundational architectural improvements. Recently, I dove deep into a session that perfectly encapsulated this mix, tackling two critical areas: ensuring correct multi-tenant data visibility and building a more resilient LLM provider system.

This post isn't just a recap of what was done; it's a look at the thought process, the missteps, and the "aha!" moments that led to a cleaner, more robust system.

The Mission: Clarity and Resilience

My primary goals for this session were twofold:

  1. Fix Multi-Tenant Data Visibility: Specifically for our ckb-nyx tenant, shared project data wasn't showing up correctly. This pointed to an issue where certain queries were inadvertently scoping data by userId instead of tenantId.
  2. Wire Up a Fallback LLM Provider System: In a world increasingly reliant on external APIs, especially LLMs, resilience is paramount. We needed a robust way to switch to a fallback provider if the primary one wasn't available or configured.

By the end of the session, both objectives were not only met but deployed to production. Let's break down how.

Unraveling the Multi-Tenant Data Mystery

The ckb-nyx tenant's issue was a classic multi-tenant pitfall: assuming userId was always the correct identifier for scoping data. While some data is user-specific (like personal settings or private notes), many entities, like project details, documents, or blog posts, belong to the tenant and should be visible to all users within that tenant.

Our investigation quickly led to src/server/trpc/routers/projects.ts. This file contained several tRPC procedures that filtered by userId in places where tenantId was the appropriate scope, or where no user filter was needed at all because the data was truly tenant-wide.

The fix involved auditing and removing userId from the input of 10 specific query procedures: healthCheck, stats, notes.list, docs.list, docs.get, blogPosts.list, blogPosts.get, blogPosts.unblogged (two instances), and overview.

For example, a query that might have looked like this:

```typescript
// Before: Potentially scoping tenant-wide data by userId
import { z } from 'zod';
import { t } from '../trpc'; // shared tRPC instance (path illustrative)

export const projectsRouter = t.router({
  docs: t.router({
    list: t.procedure
      .input(z.object({ userId: z.string(), projectId: z.string() }))
      .query(async ({ input }) => {
        // Query database filtering by input.userId
        // ...
      }),
  }),
});
```

It was refactored to remove the userId constraint, ensuring tenant-wide visibility:

```typescript
// After: Correctly scoping by tenantId (implicitly via context) or just projectId
export const projectsRouter = t.router({
  docs: t.router({
    list: t.procedure
      .input(z.object({ projectId: z.string() })) // userId removed
      .query(async ({ ctx, input }) => {
        // Query database for docs within input.projectId, accessible to ctx.tenantId
        // ...
      }),
  }),
});
```

This seemingly small change has a significant impact on data integrity and user experience within multi-tenant environments.
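
A quick note on where ctx.tenantId comes from: for this refactor to be safe, the tenant must be resolved server-side in the tRPC context or middleware, never accepted from client input (which is exactly the trap userId fell into). Here's a minimal sketch of that pattern; the Context shape and tenantProcedure name are illustrative, not our actual middleware:

```typescript
import { initTRPC, TRPCError } from '@trpc/server';

// tenantId is derived server-side per request (e.g. from the authenticated
// session or the request's hostname), never from client-supplied input.
interface Context {
  tenantId: string | null;
}

const t = initTRPC.context<Context>().create();

// Middleware that guarantees downstream queries always run with a tenant
const withTenant = t.middleware(({ ctx, next }) => {
  if (!ctx.tenantId) {
    throw new TRPCError({ code: 'UNAUTHORIZED', message: 'No tenant resolved' });
  }
  return next({ ctx: { tenantId: ctx.tenantId } });
});

// Routers built on tenantProcedure get a non-null ctx.tenantId for free
export const tenantProcedure = t.procedure.use(withTenant);
```

With a procedure like this, "scoped to the tenant" becomes the default rather than something each query has to remember.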

Architecting LLM Resilience: The Fallback Journey

The second major task was to build a robust fallback system for our LLM providers. We operate with multiple providers (e.g., OpenAI, Anthropic, Google, Kimi), and tenants can configure their primary choice. But what happens if the primary provider's API key is missing, invalid, or exhausted?

Initial Approach: Workflow Engine Retry Loop

My first thought was to integrate the fallback logic directly into our workflow engine's retry loop. The idea was simple: if a primary provider call failed after its retries, the engine would then attempt the tenant's configured fallbackProvider.
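
Conceptually, the loop looked something like this (a rough sketch: callProvider, the retry count, and the config shape are illustrative stand-ins for the engine's actual internals):

```typescript
type CallProvider = (provider: string, prompt: string) => Promise<string>;

// Sketch of the workflow-engine approach: retry the primary, then fall back
async function runWithFallback(
  call: CallProvider, // the actual LLM API call, injected for illustration
  config: { provider: string; fallbackProvider?: string },
  prompt: string,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await call(config.provider, prompt);
    } catch {
      // real code would log the failure and back off before retrying
    }
  }
  // Primary retries exhausted: try the tenant's configured fallback once
  if (config.fallbackProvider) {
    return call(config.fallbackProvider, prompt);
  }
  throw new Error('Primary provider failed and no fallback is configured');
}
```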

This worked, but it felt... limited. It only covered API call failures within the workflow engine. What about other services that directly called resolveProvider()? They wouldn't get the benefit of a fallback.

The Eureka Moment: Baking Fallback into resolveProvider()

The real breakthrough came when I realized the fallback logic belonged at the source of provider resolution. Instead of scattering retry-and-fallback logic, what if resolveProvider() itself handled the fallback when the primary provider had no key?

This led to a critical refactor in src/server/services/llm/resolve-provider.ts. The resolveProvider() function, which is used across 23+ services to get the correct LLM provider for a given tenant, was enhanced:

```typescript
// Conceptual implementation of the new resolveProvider
import { getTenantLLMConfig } from './tenant-config-service'; // Assumed service

interface LLMProvider {
  name: string;
  hasApiKey: boolean;
  // ... other provider details
}

/**
 * Resolves the appropriate LLM provider for a given tenant.
 * Automatically checks fallback if primary is configured but has no key.
 */
export async function resolveProvider(tenantId: string): Promise<LLMProvider | null> {
  const tenantConfig = await getTenantLLMConfig(tenantId);

  // 1. Try primary provider
  if (tenantConfig.primaryLLMProvider && tenantConfig.primaryLLMProvider.hasApiKey) {
    return tenantConfig.primaryLLMProvider;
  }

  // 2. If primary has no key, try fallback provider
  if (tenantConfig.fallbackLLMProvider && tenantConfig.fallbackLLMProvider.hasApiKey) {
    return tenantConfig.fallbackLLMProvider;
  }

  // 3. No suitable provider found
  return null;
}

// Kept for backward compatibility, now just an alias
export const resolveProviderWithFallback = resolveProvider;

// A stricter version that throws if no provider can be resolved
export async function resolveProviderStrict(tenantId: string): Promise<LLMProvider> {
  const provider = await resolveProvider(tenantId);
  if (!provider) {
    throw new Error(`No LLM provider found for tenant ${tenantId}`);
  }
  return provider;
}
```

This change was powerful:

  • Global Impact: All services calling resolveProvider() automatically benefit from the fallback mechanism without a single import or code change on their end.
  • Clean Separation: The concern of "which provider to use" is now fully encapsulated within this single function.
  • Clarity: resolveProvider() now clearly defines the hierarchy: primary first, then fallback if the primary isn't usable due to a missing key.

For backward compatibility, resolveProviderWithFallback() was kept as an alias. resolveProviderStrict() was also introduced to provide an option for services that must have a provider and would rather throw an error than receive null.
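
In practice, a consuming service barely changes; the names below are illustrative, but the shape is the point:

```typescript
import { resolveProvider, resolveProviderStrict } from './resolve-provider';

// Optional feature: degrade gracefully when no provider is usable
export async function summarizeIfPossible(tenantId: string, text: string) {
  const provider = await resolveProvider(tenantId);
  if (!provider) return null; // quietly skip the feature for this tenant
  return `[${provider.name}] would summarize ${text.length} chars`;
}

// Required pipeline step: better to throw than to proceed without an LLM
export async function runRequiredStep(tenantId: string, text: string) {
  const provider = await resolveProviderStrict(tenantId); // throws if none resolves
  return `[${provider.name}] processing ${text.length} chars`;
}
```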

Lessons Learned from the Trenches

No development session is complete without a few bumps in the road. These "pain points" are often the most valuable learning opportunities.

Lesson 1: The "Always Push Before Deploy" Mantra

In my eagerness to deploy the userId fix, I forgot a crucial step: pushing my local commit to the remote repository.

  • Tried: Deploying changes via SSH to production.
  • Failed: The production server's git pull command reported "Already up to date" because my local commit wasn't on origin/main.
  • Takeaway: Muscle memory is great, but sometimes a small checklist helps. Before initiating any remote deployment, a quick git status followed by git push origin main is non-negotiable. It's a simple step that saves time and confusion.

Lesson 2: ESLint as a Guard Rail (and a Design Catalyst)

During the initial fallback wiring attempt (the less ideal workflow engine approach), I switched to resolveProviderWithFallback but forgot to remove the original resolveProvider import.

  • Tried: Deploying the workflow engine fallback.
  • Failed: ESLint quickly caught an "unused import" error during the build process.
  • Takeaway: ESLint isn't just about code style; it's a powerful tool for catching logical errors and inconsistencies. In this case, it highlighted a minor issue that, ironically, made me rethink the entire fallback strategy and led to the much cleaner resolveProvider() refactor. Sometimes, friction points lead to better design.

Current State and Next Steps

Both the multi-tenant data visibility fix and the global LLM fallback system are now live in production.

Here's what's immediately next:

  1. User Configuration: The ckb-nyx tenant administrator needs to set their fallbackProvider in Admin > LLM Defaults (e.g., kimi or google).
  2. Verification: Verify that project detail pages load correctly for the ckb-nyx tenant.
  3. Pipeline Re-run: A specific docs pipeline run (4ef23a06-ae50-446e-bc1b-eb66bfa2985f) has 13 pending items. This needs to be re-run to process them with the new, correct data visibility.
  4. Audit: A quick audit of other routers for potential userId misuse in queries is on the list (though wardrobe is intentionally user-scoped).
  5. Distinguishing Fallback Scenarios: It's important to remember that the new resolveProvider() fallback handles cases where the primary provider has no key. The workflow engine's retry loop still plays a crucial role in handling API call failures (like a 402 when credits are exhausted): it retries with the primary and then, if configured, attempts the fallback provider. These two mechanisms complement each other for full resilience.

This session was a great reminder that even seemingly small issues can lead to significant architectural improvements and valuable lessons. Building robust, scalable systems requires constant vigilance, thoughtful refactoring, and a willingness to learn from every attempt, successful or not.

How do you handle LLM fallbacks or multi-tenant data challenges in your systems? Share your insights in the comments!

```json
{
  "thingsDone": [
    "Removed userId from 10 tRPC queries for tenant-wide data visibility",
    "Implemented global LLM fallback system baked into resolveProvider()",
    "Wired fallback into workflow engine retry loop (initial attempt, then refined into resolveProvider)",
    "Fixed ESLint unused import"
  ],
  "pains": [
    "Failed deployment due to unpushed local commit",
    "ESLint error from unused import (initial fallback attempt led to better design)"
  ],
  "successes": [
    "Achieved global LLM fallback without widespread code changes",
    "Resolved critical multi-tenant data visibility issue for ckb-nyx tenant",
    "Streamlined LLM provider resolution logic at a central point",
    "Improved deployment discipline"
  ],
  "techStack": [
    "Node.js",
    "TypeScript",
    "tRPC",
    "ESLint",
    "Git",
    "SSH",
    "LLM APIs",
    "PostgreSQL (implied for data queries)"
  ]
}
```