The Case of the Missing Embeddings: A Production Backfill Saga
We faced a silent data integrity crisis: hundreds of missing vector embeddings in production. This post details our detective work, the custom backfill solution we built, and the invaluable lessons learned while operating in the trenches.
Every developer knows the sinking feeling of discovering missing data in production. For us, it was a subtle but critical issue: hundreds of `workflow_insights` records, crucial for our AI-powered features, were missing their associated vector embeddings. These aren't just any values; they're the lifeblood of our semantic search and RAG capabilities, powered by pgvector.
This isn't a story of a catastrophic outage, but rather a deep dive into diagnosing a data integrity issue, building a robust (and safe) backfill mechanism, and navigating the often-tricky waters of a live production environment.
The Silent Killer: Missing Embeddings
Our workflow_insights table, which stores processed user workflow data, relies heavily on pgvector to store OpenAI embeddings. These embeddings are essential for quickly finding semantically similar insights. One afternoon, a routine check revealed the truth: 382 rows had NULL values in their embedding column.
The pgvector extension was healthy, the column existed, and new insights were getting embeddings. This suggested an intermittent issue or a past bug during a specific period. Regardless of the root cause (which we'll investigate further), the immediate priority was clear: backfill those missing embeddings.
Crafting the Solution: A Safe Backfill Endpoint
Directly manipulating production data is always a high-stakes operation. We needed a controlled, auditable, and repeatable way to generate and insert these embeddings. The answer? A dedicated admin API endpoint.
We spun up src/app/api/v1/admin/backfill-embeddings/route.ts as a POST endpoint. Here's why this approach worked for us:
- **Security First:** Protected by an `x-admin-secret` header (checked against `AUTH_SECRET`), ensuring only authorized personnel could trigger it.
- **Idempotence & Safety:** A `dryRun=true` query parameter, crucial for testing the logic without touching the database: it simulates the entire process, including the OpenAI calls, but skips the final `UPDATE`. A `batchSize` query parameter lets us control the load on the OpenAI API and the database; we settled on 50 insights per batch.
- **Multi-tenancy Aware:** Our system supports Bring-Your-Own-Key (BYOK) for OpenAI. The endpoint grouped insights by `tenantId` to ensure the correct OpenAI API key was used for each embedding generation. This was a non-negotiable requirement.
- **OpenAI Integration:** It fetched the insights, called the OpenAI API in batches to get the embeddings, and then updated the `workflow_insights` table.
A simplified conceptual view of the endpoint's logic:
```typescript
// src/app/api/v1/admin/backfill-embeddings/route.ts (conceptual)
import { NextRequest, NextResponse } from 'next/server';
import { db } from '@/lib/db'; // Your Prisma client instance
import { generateEmbedding } from '@/lib/openai'; // Your OpenAI integration

export async function POST(req: NextRequest) {
  // Require the header and compare it to the expected secret. (Don't fall
  // back to the env value itself, or a request with no header would pass.)
  const adminSecret = req.headers.get('x-admin-secret');
  if (!adminSecret || adminSecret !== process.env.AUTH_SECRET) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  const { searchParams } = new URL(req.url);
  const dryRun = searchParams.get('dryRun') === 'true';
  const batchSize = parseInt(searchParams.get('batchSize') || '50', 10);

  try {
    const insightsToBackfill = await db.workflowInsight.findMany({
      where: { embedding: null },
      select: { id: true, content: true, tenantId: true },
      take: batchSize, // Process in chunks
    });

    if (insightsToBackfill.length === 0) {
      return NextResponse.json({ message: 'No missing embeddings found.' });
    }

    const updates = [];
    for (const insight of insightsToBackfill) {
      // Resolve the tenant-specific (BYOK) OpenAI key if applicable
      const embedding = await generateEmbedding(insight.content, insight.tenantId);
      updates.push({
        where: { id: insight.id },
        data: { embedding },
      });
    }

    if (!dryRun) {
      // Note: if the column is declared as Unsupported("vector") in the
      // Prisma schema, the real update needs $executeRaw rather than a
      // typed update like this conceptual version.
      await db.$transaction(
        updates.map((update) => db.workflowInsight.update(update))
      );
      return NextResponse.json({
        message: `Successfully backfilled ${updates.length} embeddings.`,
        dryRun: false,
      });
    }

    return NextResponse.json({
      message: `Dry run: Would backfill ${updates.length} embeddings.`,
      dryRun: true,
      insightsAffected: insightsToBackfill.map((i) => i.id),
    });
  } catch (error) {
    console.error('Backfill error:', error);
    return NextResponse.json({ error: 'Failed to backfill embeddings' }, { status: 500 });
  }
}
```
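The tenant grouping and batching described above can be sketched as a pure helper. This is a hypothetical `planBatches` function (not our production code), assuming the `id`/`content`/`tenantId` shape selected by the endpoint:

```typescript
// Hypothetical sketch: group insights by tenant so each OpenAI call can use
// that tenant's (BYOK) API key, then split each group into bounded batches.
interface InsightRow {
  id: string;
  content: string;
  tenantId: string;
}

function planBatches(insights: InsightRow[], batchSize: number): InsightRow[][] {
  const byTenant = new Map<string, InsightRow[]>();
  for (const insight of insights) {
    const group = byTenant.get(insight.tenantId) ?? [];
    group.push(insight);
    byTenant.set(insight.tenantId, group);
  }

  const batches: InsightRow[][] = [];
  for (const group of byTenant.values()) {
    for (let start = 0; start < group.length; start += batchSize) {
      batches.push(group.slice(start, start + batchSize));
    }
  }
  return batches;
}
```

Each inner array then becomes one embeddings request, issued with the owning tenant's key.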
Deployment and Execution: The Moment of Truth
With the endpoint built and tested locally (albeit on an empty dev DB – more on that later), it was time to deploy.
1. **Deployment:** A quick `git push` of commit `d439543` to `main` triggered our CI/CD, and the new code landed on our production `nyxcore-app-1` container.

2. **Accessing Production:** `sshpass -p '<redacted>' ssh root@<production-host>` (yes, we know, temporary credentials for emergency access), then a shell into the running app container.

3. **The Backfill Command:** We used `wget` (since `curl` wasn't available in our Alpine-based image) from inside the app container to call the newly deployed endpoint. This avoided network complexities and ensured the call originated from a trusted source.

   ```bash
   # Inside the 'nyxcore-app-1' container
   wget --post-data '' --header='x-admin-secret: <redacted>' \
     'http://0.0.0.0:3000/api/v1/admin/backfill-embeddings?dryRun=true'
   # (Checked dry run output, looked good)

   wget --post-data '' --header='x-admin-secret: <redacted>' \
     http://0.0.0.0:3000/api/v1/admin/backfill-embeddings
   # (Repeated until all missing embeddings were processed)
   ```
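Since the endpoint processes one batch per call, the number of invocations needed is just the missing-row count divided by the batch size, rounded up: at `batchSize=50`, our 382 rows took eight calls. A trivial sketch (hypothetical helper name):

```typescript
// How many batched invocations are needed to clear a backlog of missing rows.
function runsNeeded(missingRows: number, batchSize: number): number {
  return Math.ceil(missingRows / batchSize);
}
```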
The result? A resounding success! All 382 missing embeddings were successfully generated and updated, with 0 errors. Data integrity restored.
The "Pain Log" & Hard-Earned Lessons
No production fix is without its share of bumps. Here are the "gotchas" and lessons we picked up along the way:
- **Docker Compose V2 Naming:**
  - Tried: `docker exec nyxcore-postgres`
  - Failed: container not found.
  - Lesson: Docker Compose v2 (and later) appends `-1` (or `-N`) to container names by default. Always verify actual container names with `docker ps` or `docker compose ps`. The correct names were `nyxcore-postgres-1` and `nyxcore-app-1`.
- **Alpine's Minimalist Nature:**
  - Tried: `curl` from inside the app container.
  - Failed: `curl` not installed.
  - Lesson: Alpine Linux images are tiny for a reason: they strip out non-essential utilities. If you need `curl` or similar tools for debugging, consider a custom Docker image or install it temporarily (`apk add curl`). For us, `wget` was already present and sufficient.
  - Bonus Lesson: when calling an endpoint from a *different* container, use the Compose service name, not `localhost` or `127.0.0.1`, since `localhost` resolves to that container's own loopback interface. From inside the app container itself, `http://0.0.0.0:3000` did the job.
- **Prisma Versioning & Environment Loading:**
  - Tried: `npx prisma db push` locally.
  - Failed: global `npx` pulled Prisma 7.x (with breaking changes) instead of our project's 5.x. Also, `DATABASE_URL` wasn't loaded from `.env`.
  - Lesson: always use the project-local Prisma CLI for database operations to ensure version compatibility (`npx prisma@5.22.0 ...` or, better, `npm exec prisma ...`), and make sure your `.env` file is actually loaded so the database connection string resolves.
- **Local Dev DB State:**
  - Our local `docker compose up` created fresh, empty volumes. This meant our dev database was completely empty, making it challenging to fully test the backfill logic against existing data before deploying to production.
  - Lesson: for critical data operations, keep a representative local dataset (e.g., an anonymized production dump or robust seeding scripts) to thoroughly test logic that relies on existing data states.
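One way to act on that last lesson is a deterministic seeding helper that recreates the production shape locally, including rows with NULL embeddings. A minimal sketch (hypothetical `makeSeedInsights`; the 1536 dimension matches OpenAI's `text-embedding-ada-002`):

```typescript
// Generate deterministic seed rows: every `missingEvery`-th row has a NULL
// embedding, so backfill logic can be exercised against realistic data.
interface SeedInsight {
  id: string;
  content: string;
  tenantId: string;
  embedding: number[] | null;
}

function makeSeedInsights(count: number, missingEvery: number): SeedInsight[] {
  return Array.from({ length: count }, (_, i) => ({
    id: `seed-${i}`,
    content: `Seed insight ${i}`,
    tenantId: `tenant-${i % 3}`, // spread rows across a few tenants
    embedding: i % missingEvery === 0 ? null : new Array(1536).fill(0),
  }));
}
```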
Future-Proofing and Reflections
The backfill endpoint remains deployed and can be reused if this issue ever resurfaces. However, this experience highlighted a critical area for improvement: proactive embedding generation during schema migrations.
- Next Step: We're now considering adding embedding generation logic directly into our safe migration scripts. This would ensure that any schema changes or data imports that might inadvertently create records without embeddings would automatically trigger a backfill as part of the migration process. This shifts from a reactive fix to a proactive data integrity measure.
- **Monitoring:** We'll be closely monitoring our `nyxcore-app-1` container health, as it was showing "unhealthy" before this deployment and restart.
- **Axiom RAG:** The `axiom_chunks` table is still missing in production. This indicates a future migration will be needed if we fully integrate Axiom RAG features.
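The proactive version of this fix could be as simple as a post-migration assertion that no rows were left without embeddings. A hypothetical sketch of that guard, operating on already-fetched rows:

```typescript
// Return the ids of rows whose embedding is null/undefined; a migration
// script can fail fast (or trigger a backfill) when this is non-empty.
function findMissingEmbeddings(
  rows: { id: string; embedding: unknown }[]
): string[] {
  return rows.filter((r) => r.embedding == null).map((r) => r.id);
}
```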
This session was a stark reminder that even well-designed systems can encounter data integrity issues. The ability to quickly diagnose, build a safe recovery mechanism, and learn from the operational challenges is what truly defines a robust development process.
For the record, a structured summary of the session:

```json
{
  "thingsDone": [
    "Diagnosed 382 missing vector embeddings on production workflow_insights table",
    "Created a secure, idempotent backfill admin endpoint (`/api/v1/admin/backfill-embeddings`)",
    "Implemented `x-admin-secret` authentication for the endpoint",
    "Added `dryRun=true` and `batchSize` query parameters for safe execution",
    "Ensured BYOK (Bring-Your-Own-Key) compatibility by grouping insights by tenantId for OpenAI calls",
    "Deployed the backfill endpoint to production (commit d439543)",
    "Executed the backfill using `wget` from inside the production app container",
    "Successfully backfilled all 382 missing embeddings with 0 errors",
    "Verified production container names (`nyxcore-app-1`, `nyxcore-postgres-1`)"
  ],
  "pains": [
    "Incorrect Docker Compose v2 container naming (`nyxcore-postgres` vs `nyxcore-postgres-1`)",
    "Missing `curl` in Alpine-based Docker image, requiring `wget` workaround",
    "Calling endpoint from inside container required `http://0.0.0.0:3000` instead of `localhost`",
    "Global `npx` pulling incompatible Prisma version (7.x instead of 5.x)",
    "`.env` not loaded correctly for local Prisma commands",
    "Local development environment having empty Docker volumes/DB, hindering full testing"
  ],
  "successes": [
    "Complete restoration of data integrity for workflow_insights embeddings",
    "Development of a reusable and secure backfill tool",
    "Successful deployment and execution in a live production environment",
    "Gained practical insights into production debugging and Docker quirks"
  ],
  "techStack": [
    "PostgreSQL",
    "pgvector",
    "Prisma (5.x)",
    "Next.js (API Routes)",
    "TypeScript",
    "Docker",
    "Docker Compose (v2)",
    "OpenAI API",
    "Alpine Linux"
  ]
}
```