Rescuing Production: The Case of the Missing AI Embeddings
A deep dive into diagnosing and fixing a critical production issue: hundreds of missing vector embeddings for AI features, and the lessons learned along the way.
Picture this: You're cruising through your day, everything seems fine, then a critical internal report flags a gaping hole in your data. For us, it was 382 rows in our workflow_insights table, all missing their crucial vector embeddings. In an AI-driven application, missing embeddings are like a car without an engine – fundamental features just grind to a halt.
These embeddings are the numerical representations that power our AI's understanding of user workflows, enabling features like semantic search, similarity matching, and intelligent recommendations. Without them, core functionality was silently failing, impacting the intelligence our application provided.
The Mystery of the Missing Vectors
Our mission was clear: diagnose the problem and bring those embeddings back to life. A quick check of our production PostgreSQL database, specifically the workflow_insights table, confirmed the grim reality: 382 entries had NULL values in their embedding column. Interestingly, the pgvector extension and the column itself were perfectly intact, ruling out a schema migration gone wrong. The data was there, but the intelligence was gone.
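The check itself was nothing fancy: essentially counting rows where the embedding column was NULL. As an illustrative sketch (the types and names here are hypothetical, not our actual schema code), the same triage can be expressed over rows fetched from the table:

```typescript
// Hypothetical diagnosis helper: given rows fetched from workflow_insights,
// report which ones are missing their vector embedding.
type Insight = { id: string; tenantId: string; embedding: number[] | null };

function findMissingEmbeddings(rows: Insight[]): Insight[] {
  return rows.filter((r) => r.embedding === null);
}

// Example: two of these three rows lack an embedding.
const sample: Insight[] = [
  { id: "a", tenantId: "t1", embedding: [0.1, 0.2] },
  { id: "b", tenantId: "t1", embedding: null },
  { id: "c", tenantId: "t2", embedding: null },
];
console.log(findMissingEmbeddings(sample).length); // 2
```

In production the equivalent was a simple SQL count over the `embedding` column, which is how the 382 figure surfaced.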
Crafting the Backfill Solution
Manually updating 382 rows was out of the question – it's prone to errors and certainly not scalable. We needed a robust, programmatic solution. The answer came in the form of a dedicated admin backfill endpoint, designed for safety and efficiency:
- Secure Access: A `POST` endpoint at `/api/v1/admin/backfill-embeddings`, protected by an `x-admin-secret` header (falling back to our `AUTH_SECRET`). Security first, always.
- Safety First with `dryRun`: A `dryRun=true` query parameter was implemented, allowing us to simulate the backfill without making any actual changes. This was crucial for verifying our logic before touching production data.
- Batching for Performance: To manage OpenAI API limits and improve efficiency, the endpoint supported a `batchSize` parameter. It also intelligently grouped insights by `tenantId`, which was vital for correctly resolving 'Bring Your Own Key' (BYOK) OpenAI API keys for our multi-tenant environment. OpenAI embedding calls were further batched (50 per batch) to optimize network requests.
This endpoint became our scalpel, precise and controlled, ready to mend the data.
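To make the grouping and batching concrete, here's a minimal sketch of that part of the flow, with simplified types; `groupByTenant` and `toBatches` are illustrative names, not our production code:

```typescript
// Group insights by tenantId so each group can resolve its own (BYOK) OpenAI
// API key, then split each group into batches of up to 50 for embedding calls.
type Insight = { id: string; tenantId: string; text: string };

function groupByTenant(insights: Insight[]): Map<string, Insight[]> {
  const groups = new Map<string, Insight[]>();
  for (const insight of insights) {
    const group = groups.get(insight.tenantId) ?? [];
    group.push(insight);
    groups.set(insight.tenantId, group);
  }
  return groups;
}

function toBatches<T>(items: T[], size = 50): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

The point of structuring it this way is that a `dryRun` pass can run the exact same grouping and counting path and simply stop before issuing any embedding calls or writes.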
Navigating Production: Lessons Learned
No production fix is ever without its twists and turns. Here's what we learned while operating in the live environment:
- Docker Compose v2 Naming: Our first hurdle was simply finding the right container. We initially tried `docker exec nyxcore-postgres`, only to be met with 'No such container.' Turns out, Docker Compose v2 appends `-1` to service names, so the correct target was `nyxcore-postgres-1` (and `nyxcore-app-1` for the application container).
  - Lesson: Always verify your container names in a live `docker ps` output, especially after updates to Docker Compose versions or if you're working in an unfamiliar environment.
- Alpine's Minimalist Nature: When trying to `curl` our newly deployed endpoint from inside the app container, we hit another wall: `curl` wasn't installed. Our base image was Alpine Linux, known for its small footprint and minimal default packages. The workaround? `wget`, which was thankfully present. We also had to use `http://0.0.0.0:3000` instead of `localhost` or `127.0.0.1` for inter-container communication within the Docker network.
  - Lesson: Don't assume common debugging tools are present in minimal production Docker images. Know your base image and its utilities, or consider adding specific tools for debugging if absolutely necessary (and removing them for production builds).
- Local Development Environment Quirks:
  - Prisma Versioning: Attempting `npx prisma db push` locally pulled Prisma 7.x, introducing breaking changes incompatible with our codebase. We quickly learned to pin our Prisma CLI version with `npx prisma@5.22.0`.
  - Environment Variables: Forgetting to load our `.env` file meant `DATABASE_URL` wasn't set, leading to local DB connection failures.
  - Ephemeral Docker Volumes: A fresh `docker compose up` locally meant new, empty Docker volumes for PostgreSQL. This was a good reminder that local development often starts with a blank slate, requiring specific seeding if data is needed for testing.
  - Lesson: Consistent tooling versions, meticulous environment variable management, and understanding Docker volume behavior are paramount for local development sanity and avoiding unexpected issues.
These challenges, though frustrating in the moment, provided valuable insights into our operational environment and development workflow.
The Successful Backfill
With our backfill endpoint deployed (commit d439543 to main) and the lessons from our debugging journey firmly in mind, it was time for the moment of truth. We executed the backfill command from within the production app container, carefully monitoring the logs. The result? A perfect 382/382 insights successfully re-embedded, with zero errors. The workflow_insights table was whole again, and our AI features could once again leverage their full intelligence.
Looking Ahead: Future-Proofing
While the immediate crisis was averted, this incident prompted us to think about prevention and future resilience:
- Reusable Endpoint: The backfill endpoint remains deployed, a valuable tool in our arsenal should embeddings ever go missing again.
- Schema Migrations & New Features: The `axiom_chunks` table, essential for future RAG (Retrieval Augmented Generation) features, isn't yet in production. This highlights the need for careful schema migration planning and potential auto-backfilling of embeddings as part of those scripts.
- Health Monitoring: We noted the app container was showing 'unhealthy' before the restart and deployment. Continuous monitoring of container health is critical.
- Automating Embedding Generation: The biggest takeaway is to integrate embedding generation directly into our data pipeline or schema migration scripts. This would ensure that new or altered data always gets its embeddings, preventing such a manual backfill from ever being necessary again.
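As a sketch of that last idea (all names here are hypothetical; `embed` stands in for whatever embedding client the pipeline uses), a write-path wrapper can make it impossible to persist an insight without its vector:

```typescript
// Hypothetical write-path wrapper: generate the embedding before the row is
// persisted, so no insight can be saved with a NULL embedding column.
type NewInsight = { tenantId: string; text: string };
type StoredInsight = NewInsight & { embedding: number[] };

async function withEmbedding(
  insight: NewInsight,
  embed: (text: string) => Promise<number[]>,
): Promise<StoredInsight> {
  const embedding = await embed(insight.text);
  if (embedding.length === 0) {
    // Fail loudly rather than silently storing an insight the AI can't use.
    throw new Error("refusing to store an insight with an empty embedding");
  }
  return { ...insight, embedding };
}
```

Pushing the invariant into the write path (or into the migration script itself) is what turns "we noticed 382 NULLs in a report" into "the insert fails immediately and visibly."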
This journey from crisis to resolution reinforced the importance of robust debugging practices, meticulous deployment, and a continuous learning mindset in the ever-evolving world of production systems.
Conclusion
Debugging production issues can be a high-stakes adventure, but it's also where some of the most profound learning happens. By methodically diagnosing the problem, crafting a precise solution, and learning from every hiccup along the way, we not only restored critical functionality but also fortified our understanding of our infrastructure and development processes. Here's to robust systems and the invaluable lessons learned in the heat of the moment!