nyxcore-systems

Rescuing Production: The Case of the Missing AI Embeddings

A deep dive into diagnosing and fixing a critical production issue: hundreds of missing vector embeddings for AI features, and the lessons learned along the way.

production · debugging · database · embeddings · AI · PostgreSQL · pgvector · Next.js · Docker · Prisma

Picture this: You're cruising through your day, everything seems fine, then a critical internal report flags a gaping hole in your data. For us, it was 382 rows in our workflow_insights table, all missing their crucial vector embeddings. In an AI-driven application, missing embeddings are like a car without an engine – fundamental features just grind to a halt.

These embeddings are the numerical representations that power our AI's understanding of user workflows, enabling features like semantic search, similarity matching, and intelligent recommendations. Without them, core functionality was silently failing, impacting the intelligence our application provided.
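For intuition, the similarity matching these features rely on boils down to comparing vectors, most commonly by cosine similarity (pgvector exposes this as a distance operator). A minimal sketch, not our production code:

```typescript
// Cosine similarity between two embedding vectors: 1 means the vectors point
// in the same direction (semantically closest), 0 means they are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With the embedding column NULL, there is simply no vector to compare, so every similarity query over those 382 rows came back empty-handed.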

The Mystery of the Missing Vectors

Our mission was clear: diagnose the problem and bring those embeddings back to life. A quick check of our production PostgreSQL database, specifically the workflow_insights table, confirmed the grim reality: 382 entries had NULL values in their embedding column. Interestingly, the pgvector extension and the column itself were perfectly intact, ruling out a schema migration gone wrong. The data was there, but the intelligence was gone.

Crafting the Backfill Solution

Manually updating 382 rows was out of the question – it's prone to errors and certainly not scalable. We needed a robust, programmatic solution. The answer came in the form of a dedicated admin backfill endpoint, designed for safety and efficiency:

  • Secure Access: A POST endpoint at /api/v1/admin/backfill-embeddings, protected by an x-admin-secret header (falling back to our AUTH_SECRET). Security first, always.
  • Safety First with dryRun: A dryRun=true query parameter was implemented, allowing us to simulate the backfill without making any actual changes. This was crucial for verifying our logic before touching production data.
  • Batching for Performance: To manage OpenAI API limits and improve efficiency, the endpoint supported a batchSize parameter. It also intelligently grouped insights by tenantId, which was vital for correctly resolving 'Bring Your Own Key' (BYOK) OpenAI API keys for our multi-tenant environment. OpenAI embedding calls were further batched (50 per batch) to optimize network requests.
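In rough TypeScript, the guard and the grouping/batching at the heart of the endpoint might look like the sketch below. The names (`ADMIN_SECRET`, `WorkflowInsight`, `EMBED_BATCH_SIZE`) are illustrative, not our exact identifiers, and our reading of the fallback is an assumption: the header is checked against a dedicated admin secret, falling back to `AUTH_SECRET` when none is set.

```typescript
interface WorkflowInsight {
  id: string;
  tenantId: string;
  content: string;
}

// Illustrative: OpenAI embedding calls were batched 50 inputs at a time.
const EMBED_BATCH_SIZE = 50;

// Accept the request only when the x-admin-secret header matches the
// configured secret (ADMIN_SECRET, falling back to AUTH_SECRET).
function isAuthorized(
  headerValue: string | null,
  env: { ADMIN_SECRET?: string; AUTH_SECRET?: string }
): boolean {
  const expected = env.ADMIN_SECRET ?? env.AUTH_SECRET;
  return Boolean(expected) && headerValue === expected;
}

// Group insights by tenant so each group can resolve that tenant's
// own (BYOK) OpenAI API key before any embedding calls are made.
function groupByTenant(
  insights: WorkflowInsight[]
): Map<string, WorkflowInsight[]> {
  const groups = new Map<string, WorkflowInsight[]>();
  for (const insight of insights) {
    const group = groups.get(insight.tenantId) ?? [];
    group.push(insight);
    groups.set(insight.tenantId, group);
  }
  return groups;
}

// Split a tenant's insights into fixed-size batches for the embeddings API.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With `dryRun=true`, the handler runs exactly this grouping and batching but skips the API calls and writes, reporting only what it *would* have updated.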

This endpoint became our scalpel, precise and controlled, ready to mend the data.

Navigating Production: Lessons Learned

No production fix is ever without its twists and turns. Here's what we learned while operating in the live environment:

  • Docker Compose v2 Naming: Our first hurdle was simply finding the right container. We initially tried docker exec nyxcore-postgres, only to be met with 'No such container.' Docker Compose v2 names containers <project>-<service>-<index> (with hyphens, where v1 used underscores), so the correct target was nyxcore-postgres-1 (and nyxcore-app-1 for the application container).
    • Lesson: Always verify your container names in a live docker ps output, especially after updates to Docker Compose versions or if you're working in an unfamiliar environment.
  • Alpine's Minimalist Nature: When trying to curl our newly deployed endpoint from inside the app container, we hit another wall: curl wasn't installed. Our base image was Alpine Linux, known for its small footprint and minimal default packages. The workaround? wget, which was thankfully present. We also had to target http://0.0.0.0:3000 instead of localhost or 127.0.0.1 to reach the server from within the container.
    • Lesson: Don't assume common debugging tools are present in minimal production Docker images. Know your base image and its utilities, or consider adding specific tools for debugging if absolutely necessary (and removing them for production builds).
  • Local Development Environment Quirks:
    • Prisma Versioning: Attempting npx prisma db push locally pulled Prisma 7.x, introducing breaking changes incompatible with our codebase. We quickly learned to pin our Prisma CLI version with npx prisma@5.22.0.
    • Environment Variables: Forgetting to load our .env file meant DATABASE_URL wasn't set, leading to local DB connection failures.
    • Ephemeral Docker Volumes: A fresh docker compose up locally meant new, empty Docker volumes for PostgreSQL. This was a good reminder that local development often starts with a blank slate, requiring specific seeding if data is needed for testing.
    • Lesson: Consistent tooling versions, meticulous environment variable management, and understanding Docker volume behavior are paramount for local development sanity and avoiding unexpected issues.

These challenges, though frustrating in the moment, provided valuable insights into our operational environment and development workflow.

The Successful Backfill

With our backfill endpoint deployed (commit d439543 to main) and the lessons from our debugging journey firmly in mind, it was time for the moment of truth. We executed the backfill command from within the production app container, carefully monitoring the logs. The result? All 382/382 insights successfully embedded, with zero errors. The workflow_insights table was whole again, and our AI features could once again leverage their full intelligence.

Looking Ahead: Future-Proofing

While the immediate crisis was averted, this incident prompted us to think about prevention and future resilience:

  • Reusable Endpoint: The backfill endpoint remains deployed, a valuable tool in our arsenal should embeddings ever go missing again.
  • Schema Migrations & New Features: The axiom_chunks table, essential for future RAG (Retrieval Augmented Generation) features, isn't yet in production. This highlights the need for careful schema migration planning and potential auto-backfilling of embeddings as part of those scripts.
  • Health Monitoring: We noted the app container was showing 'unhealthy' before the restart and deployment. Continuous monitoring of container health is critical.
  • Automating Embedding Generation: The biggest takeaway is to integrate embedding generation directly into our data pipeline or schema migration scripts. This would ensure that new or altered data always gets its embeddings, preventing such a manual backfill from ever being necessary again.
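Folding embedding generation into the write path could look something like the sketch below. This is a hypothetical shape, not our pipeline code: `InsightInput` and `ensureEmbeddings` are illustrative names, and the embed function is injected so the real OpenAI client (or a stub in tests) can be supplied by the caller.

```typescript
interface InsightInput {
  id: string;
  content: string;
  embedding: number[] | null;
}

// Injected embedder: maps a batch of texts to one vector per text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Return the rows with any missing embeddings filled in before persisting.
// Rows that already carry a vector pass through untouched, so the step is
// cheap to run on every write (and idempotent as a periodic backfill).
async function ensureEmbeddings(
  rows: InsightInput[],
  embed: EmbedFn
): Promise<InsightInput[]> {
  const missing = rows.filter((r) => r.embedding === null);
  if (missing.length === 0) return rows;
  const vectors = await embed(missing.map((r) => r.content));
  const byId = new Map<string, number[]>(
    missing.map((r, i) => [r.id, vectors[i]])
  );
  return rows.map((r) =>
    r.embedding === null ? { ...r, embedding: byId.get(r.id)! } : r
  );
}
```

Running a step like this inside the same transaction as the insert (or as a guaranteed post-write hook) would make a NULL embedding a transient state rather than a silent failure mode.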

This journey from crisis to resolution reinforced the importance of robust debugging practices, meticulous deployment, and a continuous learning mindset in the ever-evolving world of production systems.

Conclusion

Debugging production issues can be a high-stakes adventure, but it's also where some of the most profound learning happens. By methodically diagnosing the problem, crafting a precise solution, and learning from every hiccup along the way, we not only restored critical functionality but also fortified our understanding of our infrastructure and development processes. Here's to robust systems and the invaluable lessons learned in the heat of the moment!