The Case of the Missing Embeddings: A Production Backfill Story
A deep dive into diagnosing and fixing missing vector embeddings in a production database, highlighting the challenges and solutions involved in a critical backfill operation.
Every developer knows the unique thrill (and dread) of a production incident. This past week, I found myself deep in the trenches, wrestling with a critical issue: missing vector embeddings on our `workflow_insights` table. When your AI-powered features suddenly go quiet, you know it's time to put on your detective hat.
The mission was clear: diagnose why 382 crucial rows lacked their vector embeddings and then devise a robust way to backfill them without disrupting our live system. Here's how we tackled it, the curveballs we faced, and the valuable lessons learned along the way.
The Mystery of the Missing Vectors
Our `workflow_insights` table is the backbone for several AI-driven features, relying heavily on vector embeddings for similarity searches and contextual understanding. Imagine our surprise when these features started behaving erratically. A quick peek into the production database confirmed our fears: 382 rows in `workflow_insights` had `NULL` values in their `embedding` column.
Crucially, the `pgvector` extension was still active, and the column type was correct. This wasn't a schema migration gone wrong; it was a data integrity issue, likely due to a past bug in our embedding generation pipeline. The immediate goal shifted from "why did this happen?" to "how do we fix it, now?"
Engineering a Production-Ready Backfill
Restoring 382 missing embeddings manually was out of the question. We needed an automated, safe, and auditable solution. My approach was to build a dedicated admin endpoint:
- A New Admin Endpoint: I created a `POST` endpoint at `src/app/api/v1/admin/backfill-embeddings/route.ts`. This ensured it was part of our existing application, benefiting from its environment and database connection.
- Secure Access: The endpoint was protected by an `x-admin-secret` header, falling back to our `AUTH_SECRET` for convenience. Security first!
- Safe Execution:
  - A `dryRun=true` query parameter allowed us to simulate the operation without making any changes, crucial for verifying logic.
  - A `batchSize` parameter provided control over the number of records processed per database transaction and OpenAI API call, preventing timeouts and resource exhaustion.
- Tenant-Aware Processing: Our system supports Bring Your Own Key (BYOK) for OpenAI. The backfill logic grouped insights by `tenantId` to ensure the correct OpenAI API key was used for each batch.
- Efficient API Calls: OpenAI embedding calls were batched (50 per batch) to optimize network requests and stay within rate limits.
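The heart of the endpoint is plain planning logic: group rows by tenant (so each batch uses that tenant's BYOK key) and slice each group into fixed-size batches. Here's a minimal sketch of that step; the row shape and function names are assumptions for illustration, not the actual schema or code:

```typescript
// Hypothetical shape of a row awaiting backfill (field names assumed, not our real schema).
interface InsightRow {
  id: string;
  tenantId: string;
  content: string;
}

// Split an array into fixed-size batches (we used 50 embeddings per OpenAI call).
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Group rows by tenantId so each batch is embedded with that tenant's own API key.
function groupByTenant(rows: InsightRow[]): Map<string, InsightRow[]> {
  const groups = new Map<string, InsightRow[]>();
  for (const row of rows) {
    const group = groups.get(row.tenantId) ?? [];
    group.push(row);
    groups.set(row.tenantId, group);
  }
  return groups;
}

// Dry-run planning: report what WOULD be processed, without calling OpenAI or writing rows.
function planBackfill(
  rows: InsightRow[],
  batchSize: number
): { tenantId: string; rows: number; batches: number }[] {
  const plan: { tenantId: string; rows: number; batches: number }[] = [];
  for (const [tenantId, group] of groupByTenant(rows)) {
    plan.push({ tenantId, rows: group.length, batches: chunk(group, batchSize).length });
  }
  return plan;
}
```

Because `planBackfill` is pure, the `dryRun=true` path can return its output directly as JSON, which is exactly what made the dry run cheap to verify before touching production data.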
After thorough local testing (even with an empty dev DB, more on that later!), the new endpoint was ready for deployment.
Deployment and Victory
Commit `d439543` was pushed to `main` and deployed to production. With bated breath, I executed the backfill command directly from the production container. The process ran flawlessly, with zero errors reported. All 382 `workflow_insights` rows now proudly sported their vector embeddings. Our AI features were back online, and the crisis was averted.
Lessons from the Trenches: The "Pain Log"
No production fix is ever without its share of minor (or major) headaches. These are the moments that truly forge a developer's experience.
- Docker Compose v2 Naming:
  - The Trap: Trying to `docker exec nyxcore-postgres` on production.
  - The Reality: Docker Compose v2 generates container names like `service-name-1`. The correct container name was `nyxcore-postgres-1`.
  - Lesson: Always verify container names, especially after upgrading Docker Compose or in new environments. A quick `docker ps` saves a lot of head-scratching.
- Alpine's Lean Footprint:
  - The Trap: Attempting to `curl` the newly deployed backfill endpoint from inside our Alpine-based app container.
  - The Reality: Alpine images are minimal by design; `curl` wasn't installed. Also, `localhost` and `127.0.0.1` won't work for internal container communication when you're exec'd into one container trying to reach another service within the same compose network. You need to use the service name, or `0.0.0.0` for self-reference.
  - The Workaround: Used `wget` (which was installed) and targeted `http://0.0.0.0:3000` (our app's internal port).
  - Lesson: Understand your base image. For inter-container communication, use service names, or `0.0.0.0` for self-referencing within the network.
- Prisma Versioning & Local Dev Woes:
  - The Trap: Running `npx prisma db push` locally to set up my dev environment.
  - The Reality: My global `npx` picked up a much newer Prisma version (7.x) with breaking changes, leading to errors. Additionally, I hadn't loaded my `.env` variables, so `DATABASE_URL` was missing.
  - The Workaround: Explicitly specified the correct Prisma version (`npx prisma@5.22.0`) and ensured my `.env` was properly sourced.
  - Lesson: Pin your dependencies! Always use `npx prisma@<version>` or the local `node_modules/.bin/prisma` to avoid global version conflicts. And never forget your `.env`!
- Empty Local Dev DB:
  - The Note: A fresh `docker compose up` locally created new, empty Docker volumes. My dev database was completely empty, which meant my local testing was purely for logic validation, not data integrity.
  - Lesson: For critical data operations, test against a representative local dataset (or a staging environment), or at least be aware of the limitations of an empty dev DB.
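One lightweight way to avoid the empty-DB problem is a deterministic seed-data generator. The sketch below is purely hypothetical (invented field names, not our actual seed script) and deliberately leaves some embeddings `NULL`, so a dry run of a backfill tool has realistic work to find:

```typescript
// Hypothetical seed row shape for local dev (names invented for illustration).
interface SeedInsight {
  id: string;
  tenantId: string;
  content: string;
  embedding: number[] | null; // some rows intentionally missing, to exercise backfill
}

// Generate deterministic seed data: rows rotate across tenants, and every
// `missingEvery`-th row gets a NULL embedding to simulate the production gap.
function makeSeedInsights(
  count: number,
  tenants: string[],
  missingEvery = 5
): SeedInsight[] {
  return Array.from({ length: count }, (_, i) => ({
    id: `insight-${i}`,
    tenantId: tenants[i % tenants.length],
    content: `Seed insight #${i}`,
    embedding: i % missingEvery === 0 ? null : [0.1, 0.2, 0.3],
  }));
}
```

Because the output is deterministic, the same seed run always produces the same number of "broken" rows, which makes local dry-run output easy to eyeball against expectations.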
Looking Ahead: Fortifying Our Defenses
This incident, while stressful, provided valuable insights and led to crucial improvements:
- Reusable Tooling: The backfill endpoint remains deployed and can be reused if embeddings are ever lost again. It's a powerful "break glass in case of emergency" tool.
- Schema Migration Safety: We're now considering integrating embedding generation directly into our safe schema migration scripts, ensuring that new features or schema changes automatically backfill embeddings where necessary.
- Health Monitoring: The app container showed "unhealthy" before the restart. We'll be monitoring health closely and reviewing our health check configurations.
- Local Dev Seeding: For future development, seeding a local DB with representative data (`npm run db:push && npm run db:generate && npm run db:seed`) will make local testing more robust.
Every production incident is a learning opportunity. By sharing these experiences, we not only improve our systems but also contribute to a collective wisdom that makes us all better engineers. Here's to fewer missing embeddings and more robust production environments!