The Case of the Missing Embeddings: A Production Backfill Story
A deep dive into diagnosing and fixing missing vector embeddings in a production database, highlighting the challenges and solutions involved in a critical backfill operation.
Every developer knows the unique thrill (and dread) of a production incident. This past week, I found myself deep in the trenches, wrestling with a critical issue: missing vector embeddings on our `workflow_insights` table. When your AI-powered features suddenly go quiet, you know it's time to put on your detective hat.
The mission was clear: diagnose why 382 crucial rows lacked their vector embeddings and then devise a robust way to backfill them without disrupting our live system. Here's how we tackled it, the curveballs we faced, and the valuable lessons learned along the way.
The Mystery of the Missing Vectors
Our `workflow_insights` table is the backbone for several AI-driven features, relying heavily on vector embeddings for similarity searches and contextual understanding. Imagine our surprise when these features started behaving erratically. A quick peek into the production database confirmed our fears: 382 rows in `workflow_insights` had `NULL` values in their `embedding` column.
Crucially, the `pgvector` extension was still active, and the column type was correct. This wasn't a schema migration gone wrong; it was a data integrity issue, likely due to a past bug in our embedding generation pipeline. The immediate goal shifted from "why did this happen?" to "how do we fix it, now?"
Engineering a Production-Ready Backfill
Restoring 382 missing embeddings manually was out of the question. We needed an automated, safe, and auditable solution. My approach was to build a dedicated admin endpoint:
- A New Admin Endpoint: I created a `POST` endpoint at `src/app/api/v1/admin/backfill-embeddings/route.ts`. This ensured it was part of our existing application, benefiting from its environment and database connection.
- Secure Access: The endpoint was protected by an `x-admin-secret` header, falling back to our `AUTH_SECRET` for convenience. Security first!
- Safe Execution:
  - A `dryRun=true` query parameter allowed us to simulate the operation without making any changes, crucial for verifying logic.
  - A `batchSize` parameter provided control over the number of records processed per database transaction and OpenAI API call, preventing timeouts and resource exhaustion.
- Tenant-Aware Processing: Our system supports Bring Your Own Key (BYOK) for OpenAI. The backfill logic grouped insights by `tenantId` to ensure the correct OpenAI API key was used for each batch.
- Efficient API Calls: OpenAI embedding calls were batched (50 per batch) to optimize network requests and stay within rate limits.
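The heart of the endpoint is plain planning logic: group rows by tenant (so each batch uses that tenant's BYOK key) and slice each group into fixed-size batches. Here's a minimal sketch of that step; the row shape and function names are assumptions for illustration, not the actual schema or code:

```typescript
// Hypothetical shape of a row awaiting backfill (field names assumed, not our real schema).
interface InsightRow {
  id: string;
  tenantId: string;
  content: string;
}

// Split an array into fixed-size batches (we used 50 embeddings per OpenAI call).
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Group rows by tenantId so each batch is embedded with that tenant's own API key.
function groupByTenant(rows: InsightRow[]): Map<string, InsightRow[]> {
  const groups = new Map<string, InsightRow[]>();
  for (const row of rows) {
    const group = groups.get(row.tenantId) ?? [];
    group.push(row);
    groups.set(row.tenantId, group);
  }
  return groups;
}

// Dry-run planning: report what WOULD be processed, without calling OpenAI or writing rows.
function planBackfill(
  rows: InsightRow[],
  batchSize: number
): { tenantId: string; rows: number; batches: number }[] {
  const plan: { tenantId: string; rows: number; batches: number }[] = [];
  for (const [tenantId, group] of groupByTenant(rows)) {
    plan.push({ tenantId, rows: group.length, batches: chunk(group, batchSize).length });
  }
  return plan;
}
```

Because `planBackfill` is pure, the `dryRun=true` path can return its output directly as JSON, which is exactly what made the dry run cheap to verify before touching production data.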
After thorough local testing (even with an empty dev DB, more on that later!), the new endpoint was ready for deployment.
Deployment and Victory
Commit `d439543` was pushed to `main` and deployed to production. With bated breath, I executed the backfill command directly from the production container. The process ran flawlessly, with zero errors reported. All 382 `workflow_insights` rows now proudly sported their vector embeddings. Our AI features were back online, and the crisis was averted.
Lessons from the Trenches: The "Pain Log"
No production fix is ever without its share of minor (or major) headaches. These are the moments that truly forge a developer's experience.
- Docker Compose v2 Naming:
  - The Trap: Trying to `docker exec nyxcore-postgres` on production.
  - The Reality: Docker Compose v2 generates container names like `service-name-1`. The correct container name was `nyxcore-postgres-1`.
  - Lesson: Always verify container names, especially after upgrading Docker Compose or in new environments. A quick `docker ps` saves a lot of head-scratching.
- Alpine's Lean Footprint:
  - The Trap: Attempting to `curl` the newly deployed backfill endpoint from inside our Alpine-based app container.
  - The Reality: Alpine images are minimal by design; `curl` wasn't installed. Also, `localhost` and `127.0.0.1` won't work for internal container communication when you're exec'd into one container trying to reach another service within the same compose network. You need to use the service name, or `0.0.0.0` for self-reference.
  - The Workaround: Used `wget` (which was installed) and targeted `http://0.0.0.0:3000` (our app's internal port).
  - Lesson: Understand your base image. For inter-container communication, use service names, or `0.0.0.0` for self-referencing within the network.
- Prisma Versioning & Local Dev Woes:
  - The Trap: Running `npx prisma db push` locally to set up my dev environment.
  - The Reality: My global `npx` picked up a much newer Prisma version (7.x) with breaking changes, leading to errors. Additionally, I hadn't loaded my `.env` variables, so `DATABASE_URL` was missing.
  - The Workaround: Explicitly specified the correct Prisma version (`npx prisma@5.22.0`) and ensured my `.env` was properly sourced.
  - Lesson: Pin your dependencies! Always use `npx prisma@<version>` or the local `node_modules/.bin/prisma` to avoid global version conflicts. And never forget your `.env`!
- Empty Local Dev DB:
  - The Note: A fresh `docker compose up` locally created new, empty Docker volumes. My dev database was completely empty, which meant my local testing was purely for logic validation, not data integrity.
  - Lesson: For critical data operations, test against a representative local dataset (or a staging environment), or at least be aware of the limitations of an empty dev DB.
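One lightweight way to avoid the empty-DB problem is a deterministic seed-data generator. The sketch below is purely hypothetical (invented field names, not our actual seed script) and deliberately leaves some embeddings `NULL`, so a dry run of a backfill tool has realistic work to find:

```typescript
// Hypothetical seed row shape for local dev (names invented for illustration).
interface SeedInsight {
  id: string;
  tenantId: string;
  content: string;
  embedding: number[] | null; // some rows intentionally missing, to exercise backfill
}

// Generate deterministic seed data: rows rotate across tenants, and every
// `missingEvery`-th row gets a NULL embedding to simulate the production gap.
function makeSeedInsights(
  count: number,
  tenants: string[],
  missingEvery = 5
): SeedInsight[] {
  return Array.from({ length: count }, (_, i) => ({
    id: `insight-${i}`,
    tenantId: tenants[i % tenants.length],
    content: `Seed insight #${i}`,
    embedding: i % missingEvery === 0 ? null : [0.1, 0.2, 0.3],
  }));
}
```

Because the output is deterministic, the same seed run always produces the same number of "broken" rows, which makes local dry-run output easy to eyeball against expectations.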
Looking Ahead: Fortifying Our Defenses
This incident, while stressful, provided valuable insights and led to crucial improvements:
- Reusable Tooling: The backfill endpoint remains deployed and can be reused if embeddings are ever lost again. It's a powerful "break glass in case of emergency" tool.
- Schema Migration Safety: We're now considering integrating embedding generation directly into our safe schema migration scripts, ensuring that new features or schema changes automatically backfill embeddings where necessary.
- Health Monitoring: The app container showed "unhealthy" before the restart. We'll be monitoring health closely and reviewing our health check configurations.
- Local Dev Seeding: For future development, seeding a local DB with representative data (`npm run db:push && npm run db:generate && npm run db:seed`) will make local testing more robust.
Every production incident is a learning opportunity. By sharing these experiences, we not only improve our systems but also contribute to a collective wisdom that makes us all better engineers. Here's to fewer missing embeddings and more robust production environments!