nyxcore-systems
7 min read

The Case of the Missing Embeddings: A Production Backfill Saga

We faced a silent data integrity crisis: hundreds of missing vector embeddings in production. This post details our detective work, the custom backfill solution we built, and the invaluable lessons learned while operating in the trenches.

production · backfill · database · embeddings · pgvector · prisma · docker · openai · debugging

Every developer knows the sinking feeling of discovering missing data in production. For us, it was a subtle but critical issue: hundreds of workflow_insights records, crucial for our AI-powered features, were missing their associated vector embeddings. These embeddings aren't just another column; they're the lifeblood of our semantic search and RAG capabilities, powered by pgvector.
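
Those embeddings power nearest-neighbor queries. As a rough sketch of what such a lookup involves (pgvector's <=> operator computes cosine distance; the table and column names are taken from this post, while the raw-query plumbing is an assumption):

typescript
// similar-insights.ts (illustrative): cosine-distance search with pgvector.
import { PrismaClient } from '@prisma/client';

const db = new PrismaClient();

async function similarInsights(queryEmbedding: number[], limit = 5) {
  // pgvector accepts the vector as a text literal cast server-side to vector.
  const vectorLiteral = `[${queryEmbedding.join(',')}]`;
  return db.$queryRaw<{ id: string; content: string }[]>`
    SELECT id, content
    FROM workflow_insights
    ORDER BY embedding <=> ${vectorLiteral}::vector
    LIMIT ${limit}
  `;
}

A row whose embedding is NULL effectively never surfaces in results like these, which is exactly why the gap stayed silent.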

This isn't a story of a catastrophic outage, but rather a deep dive into diagnosing a data integrity issue, building a robust (and safe) backfill mechanism, and navigating the often-tricky waters of a live production environment.

The Silent Killer: Missing Embeddings

Our workflow_insights table, which stores processed user workflow data, relies heavily on pgvector to store OpenAI embeddings. These embeddings are essential for quickly finding semantically similar insights. One afternoon, a routine check revealed the truth: 382 rows had NULL values in their embedding column.
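
Re-running that check is straightforward. Here's a minimal sketch using Prisma's $queryRaw (the script name and entry point are assumptions; pgvector columns are typically mapped as Unsupported("vector") in Prisma 5.x, so raw SQL is the dependable way to inspect them):

typescript
// check-missing-embeddings.ts (illustrative)
import { PrismaClient } from '@prisma/client';

const db = new PrismaClient();

async function main() {
  // COUNT(*) comes back from Postgres as bigint, hence the cast and BigInt type.
  const [{ missing }] = await db.$queryRaw<{ missing: bigint }[]>`
    SELECT COUNT(*)::bigint AS missing
    FROM workflow_insights
    WHERE embedding IS NULL
  `;
  console.log(`Rows missing embeddings: ${missing}`);
}

main().finally(() => db.$disconnect());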

The pgvector extension was healthy, the column existed, and new insights were getting embeddings. This suggested an intermittent issue or a past bug during a specific period. Regardless of the root cause (which we'll investigate further), the immediate priority was clear: backfill those missing embeddings.

Crafting the Solution: A Safe Backfill Endpoint

Directly manipulating production data is always a high-stakes operation. We needed a controlled, auditable, and repeatable way to generate and insert these embeddings. The answer? A dedicated admin API endpoint.

We spun up src/app/api/v1/admin/backfill-embeddings/route.ts as a POST endpoint. Here's why this approach worked for us:

  1. Security First: Protected by an x-admin-secret header, validated against a dedicated server-side secret (falling back to AUTH_SECRET), ensuring only authorized personnel could trigger it.
  2. Idempotence & Safety:
    • dryRun=true query parameter: Crucial for testing the logic without touching the database. It simulates the entire process, including OpenAI calls, but skips the final UPDATE.
    • batchSize query parameter: Allows us to control the load on our OpenAI API and database. We settled on 50 insights per batch.
  3. Multi-tenancy Aware: Our system supports Bring-Your-Own-Key (BYOK) for OpenAI. The endpoint intelligently grouped insights by tenantId to ensure the correct OpenAI API key was used for each embedding generation. This was a non-negotiable requirement.
  4. OpenAI Integration: It fetched the insights, called the OpenAI API in batches to get the embeddings, and then updated the workflow_insights table.

A simplified conceptual view of the endpoint's logic:

typescript
// src/app/api/v1/admin/backfill-embeddings/route.ts (conceptual)

import { NextRequest, NextResponse } from 'next/server';
import { db } from '@/lib/db'; // Your Prisma client instance
import { generateEmbedding } from '@/lib/openai'; // Your OpenAI integration

export async function POST(req: NextRequest) {
  // Require the secret header and compare it to the server-side secret
  // (a dedicated ADMIN_SECRET, falling back to AUTH_SECRET). A missing
  // header must fail closed, never fall back to the secret itself.
  const adminSecret = req.headers.get('x-admin-secret');
  const expectedSecret = process.env.ADMIN_SECRET || process.env.AUTH_SECRET;
  if (!adminSecret || adminSecret !== expectedSecret) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  const { searchParams } = new URL(req.url);
  const dryRun = searchParams.get('dryRun') === 'true';
  const batchSize = parseInt(searchParams.get('batchSize') || '50', 10);

  try {
    const insightsToBackfill = await db.workflowInsight.findMany({
      where: { embedding: null },
      select: { id: true, content: true, tenantId: true },
      take: batchSize, // Process in chunks
    });

    if (insightsToBackfill.length === 0) {
      return NextResponse.json({ message: 'No missing embeddings found.' });
    }

    const updates = [];
    for (const insight of insightsToBackfill) {
      // Resolve tenant-specific OpenAI key if applicable
      const embedding = await generateEmbedding(insight.content, insight.tenantId);
      updates.push({
        where: { id: insight.id },
        data: { embedding: embedding },
      });
    }

    if (!dryRun) {
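      // NOTE: if embedding is mapped as Unsupported("vector") in the Prisma
      // schema, a typed update like this won't compile; the write would then
      // go through $executeRaw with an explicit ::vector cast. Conceptual only.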
      await db.$transaction(
        updates.map(update => db.workflowInsight.update(update))
      );
      return NextResponse.json({
        message: `Successfully backfilled ${updates.length} embeddings.`,
        dryRun: false,
      });
    } else {
      return NextResponse.json({
        message: `Dry run: Would backfill ${updates.length} embeddings.`,
        dryRun: true,
        insightsAffected: insightsToBackfill.map(i => i.id),
      });
    }

  } catch (error) {
    console.error('Backfill error:', error);
    return NextResponse.json({ error: 'Failed to backfill embeddings' }, { status: 500 });
  }
}
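
One simplification worth flagging: the code above resolves the OpenAI key per insight, whereas the production endpoint grouped insights by tenantId first (point 3 above). A grouping sketch, where resolveOpenAIKey and embedBatch are hypothetical helpers standing in for our BYOK lookup and batched OpenAI call:

typescript
// Sketch: one OpenAI call per tenant group, using that tenant's own key (BYOK).
type Insight = { id: string; content: string; tenantId: string };

async function embedByTenant(
  insights: Insight[],
  resolveOpenAIKey: (tenantId: string) => Promise<string>, // hypothetical BYOK lookup
  embedBatch: (texts: string[], apiKey: string) => Promise<number[][]>, // hypothetical batched embed call
): Promise<Map<string, number[]>> {
  // Bucket insights per tenant so each batch uses the right key.
  const byTenant = new Map<string, Insight[]>();
  for (const insight of insights) {
    const bucket = byTenant.get(insight.tenantId) ?? [];
    bucket.push(insight);
    byTenant.set(insight.tenantId, bucket);
  }

  // Map each insight id to its embedding vector.
  const vectorsById = new Map<string, number[]>();
  for (const [tenantId, group] of byTenant) {
    const apiKey = await resolveOpenAIKey(tenantId);
    const vectors = await embedBatch(group.map(i => i.content), apiKey);
    group.forEach((insight, i) => vectorsById.set(insight.id, vectors[i]));
  }
  return vectorsById;
}

Batching per tenant also cuts the number of round-trips, since OpenAI's embeddings API accepts an array of inputs per request.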

Deployment and Execution: The Moment of Truth

With the endpoint built and tested locally (albeit on an empty dev DB – more on that later), it was time to deploy.

  1. Deployment: A quick git push of commit d439543 to main triggered our CI/CD, and the new code landed on our production nyxcore-app-1 container.

  2. Accessing Production:

    • sshpass -p '<redacted>' ssh root@46.225.232.35 (yes, we know, temporary credentials for emergency access).
    • Navigating into the running app container.
  3. The Backfill Command: We used wget (since curl wasn't available in our Alpine-based image) from inside the app container to call the newly deployed endpoint. This avoided network complexities and ensured the call originated from a trusted source.

    bash
    # Inside the 'nyxcore-app-1' container
    wget --post-data '' --header="x-admin-secret: $ADMIN_SECRET" 'http://0.0.0.0:3000/api/v1/admin/backfill-embeddings?dryRun=true'
    # (Checked dry run output, looked good)
    
    wget --post-data '' --header="x-admin-secret: $ADMIN_SECRET" http://0.0.0.0:3000/api/v1/admin/backfill-embeddings
    # (Repeated until all missing embeddings were processed)
    

The result? A resounding success! All 382 missing embeddings were successfully generated and updated, with 0 errors. Data integrity restored.
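
For the record, the repetition could also be scripted: each call processes at most one batch (the take: batchSize above), so a small driver loop would drain the backlog. A sketch, assuming Node 18+'s built-in fetch and the endpoint's response messages shown earlier:

typescript
// backfill-driver.ts (illustrative): call the endpoint until no work remains.
const ENDPOINT = 'http://0.0.0.0:3000/api/v1/admin/backfill-embeddings';

async function drain() {
  for (;;) {
    const res = await fetch(`${ENDPOINT}?batchSize=50`, {
      method: 'POST',
      headers: { 'x-admin-secret': process.env.ADMIN_SECRET ?? '' },
    });
    const body = await res.json();
    console.log(res.status, body.message);
    // Stop on errors or once the endpoint reports nothing left to backfill.
    if (!res.ok || body.message === 'No missing embeddings found.') break;
  }
}

drain();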

The "Pain Log" & Hard-Earned Lessons

No production fix is without its share of bumps. Here are the "gotchas" and lessons we picked up along the way:

  1. Docker Compose V2 Naming:

    • Tried: docker exec nyxcore-postgres
    • Failed: Container not found.
    • Lesson: Docker Compose v2 (and later) appends -1 (or -N) to container names by default. Always verify actual container names with docker ps or docker compose ps. The correct name was nyxcore-postgres-1 (and nyxcore-app-1).
  2. Alpine's Minimalist Nature:

    • Tried: curl from inside the app container.
    • Failed: curl not installed.
    • Lesson: Alpine Linux images are tiny for a reason – they strip out non-essential utilities. If you need curl or similar tools for debugging, consider a custom Docker image or installing it temporarily (apk add curl). For us, wget was already present and sufficient.
    • Bonus Lesson: Watch the target address when calling an endpoint from inside a container. From the same container, localhost can resolve to the IPv6 loopback (::1) and fail if the server only listens on IPv4; targeting the address the server actually binds to (here http://0.0.0.0:3000, which Linux routes to the local machine) worked. From a different container, localhost refers to that container's own loopback, so use the Compose service name instead.
  3. Prisma Versioning & Environment Loading:

    • Tried: npx prisma db push locally.
    • Failed: Global npx pulled Prisma 7.x (with breaking changes) instead of our project's 5.x. Also, DATABASE_URL wasn't loaded from .env.
    • Lesson: Always use the project-local Prisma CLI for database operations to ensure version compatibility (npx prisma@5.22.0 ... or better, npm exec prisma ...). Ensure your .env file is correctly loaded for database connection strings.
  4. Local Dev DB State:

    • Our local docker compose up created fresh, empty volumes. This meant our dev database was completely empty, making it challenging to fully test the backfill logic against existing data before deploying to production.
    • Lesson: For critical data operations, consider having a representative local dataset (e.g., anonymized production dump or robust seeding scripts) to thoroughly test logic that relies on existing data states.
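
On that last point, even a tiny seed script goes a long way. A sketch using Prisma's createMany (model and field names are assumed from this post; the embedding column is deliberately left NULL so backfill logic has something to chew on):

typescript
// prisma/seed.ts (illustrative)
import { PrismaClient } from '@prisma/client';

const db = new PrismaClient();

async function main() {
  await db.workflowInsight.createMany({
    data: Array.from({ length: 25 }, (_, i) => ({
      content: `Seed insight ${i}`,
      tenantId: 'tenant-dev',
      // embedding intentionally omitted: mimics the production gap
    })),
  });
}

main().finally(() => db.$disconnect());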

Future-Proofing and Reflections

The backfill endpoint remains deployed and can be reused if this issue ever resurfaces. However, this experience highlighted a critical area for improvement: proactive embedding generation during schema migrations.

  • Next Step: We're now considering adding embedding generation logic directly into our safe migration scripts. This would ensure that any schema changes or data imports that might inadvertently create records without embeddings would automatically trigger a backfill as part of the migration process. This shifts from a reactive fix to a proactive data integrity measure. A sketch of such a guard follows this list.
  • Monitoring: We'll be closely monitoring our nyxcore-app-1 container health, as it was showing "unhealthy" before this deployment and restart.
  • Axiom RAG: The axiom_chunks table is still missing in production. This indicates a future migration will be needed if we fully integrate Axiom RAG features.
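
On the first point, the proactive check could be as simple as a post-migration guard. A hedged sketch (the count query mirrors the diagnosis above; whether it aborts the deploy or calls the backfill endpoint is a policy choice):

typescript
// post-migrate-check.ts (illustrative): fail loudly if a migration leaves
// workflow_insights rows without embeddings.
import { PrismaClient } from '@prisma/client';

const db = new PrismaClient();

export async function ensureEmbeddingsAfterMigration(): Promise<void> {
  const [{ missing }] = await db.$queryRaw<{ missing: bigint }[]>`
    SELECT COUNT(*)::bigint AS missing
    FROM workflow_insights
    WHERE embedding IS NULL
  `;
  if (missing > 0n) {
    // Alternatively: POST to /api/v1/admin/backfill-embeddings here.
    throw new Error(`${missing} workflow_insights rows lack embeddings after migration`);
  }
}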

This session was a stark reminder that even well-designed systems can encounter data integrity issues. The ability to quickly diagnose, build a safe recovery mechanism, and learn from the operational challenges is what truly defines a robust development process.


For the record, the structured summary of the session:

json
{
  "thingsDone": [
    "Diagnosed 382 missing vector embeddings on production workflow_insights table",
    "Created a secure, idempotent backfill admin endpoint (`/api/v1/admin/backfill-embeddings`)",
    "Implemented `x-admin-secret` authentication for the endpoint",
    "Added `dryRun=true` and `batchSize` query parameters for safe execution",
    "Ensured BYOK (Bring-Your-Own-Key) compatibility by grouping insights by tenantId for OpenAI calls",
    "Deployed the backfill endpoint to production (commit d439543)",
    "Executed the backfill using `wget` from inside the production app container",
    "Successfully backfilled all 382 missing embeddings with 0 errors",
    "Verified production container names (`nyxcore-app-1`, `nyxcore-postgres-1`)"
  ],
  "pains": [
    "Incorrect Docker Compose v2 container naming (`nyxcore-postgres` vs `nyxcore-postgres-1`)",
    "Missing `curl` in Alpine-based Docker image, requiring `wget` workaround",
    "Calling endpoint from inside container required `http://0.0.0.0:3000` instead of `localhost`",
    "Global `npx` pulling incompatible Prisma version (7.x instead of 5.x)",
    "`.env` not loaded correctly for local Prisma commands",
    "Local development environment having empty Docker volumes/DB, hindering full testing"
  ],
  "successes": [
    "Complete restoration of data integrity for workflow_insights embeddings",
    "Development of a reusable and secure backfill tool",
    "Successful deployment and execution in a live production environment",
    "Gained practical insights into production debugging and Docker quirks"
  ],
  "techStack": [
    "PostgreSQL",
    "pgvector",
    "Prisma (5.x)",
    "Next.js (API Routes)",
    "TypeScript",
    "Docker",
    "Docker Compose (v2)",
    "OpenAI API",
    "Alpine Linux"
  ]
}