The Case of the Missing Embeddings: Debugging a pgvector Production Mystery
Ever had your system report success, yet the database tells a different story? Join us as we unravel a production mystery: 1719 missing pgvector embeddings and the crucial lessons learned about schema migrations, API robustness, and observability.
The Silent Killer: When Success Logs Lie
Imagine this: you've got a shiny new vector database setup, pgvector, humming along, powering intelligent insights for your users. You're ingesting data, generating embeddings via OpenAI, and everything looks fine. Your audit logs are green, your metrics are steady.
Then, a gut feeling (or maybe a user report) prompts a deeper look. You query your workflow_insights table, expecting rich vector data... and find NULLs. All 1719 of them. Every single embedding column was empty.
Welcome to my Wednesday afternoon.
This isn't just a bug; it's a silent killer. The system thought it was doing its job, but the data told a different, much emptier, story. Our goal was clear: diagnose the root cause, fix the immediate data integrity issue, and build a more resilient system to prevent future embarrassments.
The Investigation: Tracing the Phantom Column
My first instinct was to check the obvious:
- Is pgvector installed and enabled? (Yes, v0.8.2)
- Is the embedding column correctly defined as vector(1536)? (Yes)
- Is the HNSW index in place? (Yes)
Everything looked perfect at a glance. The schema was correct, the extension was active. So why the NULLs?
The breakthrough came from digging into old audit_logs. We found a handful of insight.embedding_failed entries from months ago showing a very specific error: column "embedding" does not exist.
Aha! This pointed directly to a schema migration issue. We use Prisma for our ORM, and prisma db push can be aggressive. What likely happened was:
- An early prisma db push temporarily dropped the embedding column.
- Our rls.sql script (which defines row-level security policies) then restored it.
- During that brief window, or possibly due to caching/schema refresh issues, some embedding writes failed with the "column does not exist" error.
- Crucially, once the column was restored, subsequent embedding writes continued to fail silently. The application logic would attempt the UPDATE, but for reasons unclear at the time, the embedding column would remain NULL even though no error was thrown in the application layer or reported in the audit logs for these later failures. This was the truly painful part: a system that appeared to be working perfectly, but wasn't.
Testing the Full Pipeline
To ensure the current setup was capable of writing embeddings, we ran an end-to-end test on production:
- Decrypted an API key.
- Called the OpenAI API to generate an embedding.
- Executed a raw SQL UPDATE statement via $executeRaw to store the vector.
- Verified the vector was correctly stored in the database.
This confirmed the pipeline could work. The problem was historical and related to the robustness of our embedding write process.
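One detail worth calling out in a raw-SQL write path like this: pgvector accepts vectors as a bracketed text literal (e.g. '[0.1,0.2,0.3]') cast to ::vector. A minimal serialization helper might look like the following; the function name is mine, not the codebase's:

```typescript
// Hypothetical helper: serialize an embedding into the bracketed text
// literal that pgvector accepts, e.g. '[0.1,0.2,0.3]'::vector.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) {
    throw new Error("Cannot serialize an empty embedding");
  }
  return `[${embedding.join(",")}]`;
}
```

With Prisma, the resulting string can then be bound as a parameter and cast in the raw statement (table and column names assumed from the post), e.g. prisma.$executeRaw`UPDATE workflow_insights SET embedding = ${toVectorLiteral(vec)}::vector WHERE id = ${id}`.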
The Fixes: Backfill, Retries, and Observability
With the root cause identified (a historical schema hiccup leading to silent failures), it was time for action.
1. The Great Backfill of '26
First, we needed to populate those 1719 missing embeddings. Thankfully, we had an existing admin endpoint: /api/v1/admin/backfill-embeddings. This endpoint iterates through all workflow_insights with NULL embeddings, generates them, and writes them back.
It wasn't entirely smooth sailing. Transient 5xx errors from the OpenAI API meant we had to run the backfill three times. But eventually, we got there: 1719/1719 embeddings now populated. Phew.
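If the endpoint processes rows in batches rather than one giant pass (an assumption; the post doesn't say how it iterates), the batching itself is a small pure helper, which makes the retry-and-resume behavior easier to reason about:

```typescript
// Split the NULL-embedding rows into fixed-size batches so a transient
// failure only forces re-running the affected batch, not the whole set.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```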
2. Robustness: Retries for External APIs
The transient OpenAI errors during the backfill highlighted another weak point: our integration with external APIs wasn't robust enough. We added retry logic with exponential backoff to our openaiEmbed() service:
// src/server/services/embedding-service.ts
async function openaiEmbed(text: string): Promise<number[]> {
  const maxAttempts = 3;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      const response = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: text,
      });
      return response.data[0].embedding;
    } catch (error: any) {
      const status = error.response?.status;
      const retryable = status >= 500 || status === 429;
      if (!retryable || i === maxAttempts - 1) {
        throw error; // Non-retryable error, or out of attempts
      }
      const delay = Math.pow(2, i) * 1000; // Exponential backoff: 1s, 2s
      console.warn(`OpenAI embedding failed (attempt ${i + 1}/${maxAttempts}), retrying in ${delay / 1000}s:`, error.message);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error("Failed to generate OpenAI embedding after multiple retries.");
}
This simple addition significantly improves the reliability of our embedding generation process.
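The same pattern generalizes beyond embeddings. As a sketch, the retry loop could be extracted into a reusable helper; the names (withRetry, isRetryableError) and the configurable base delay are illustrative assumptions, not the production code:

```typescript
// Decide whether an error is worth retrying: 5xx server errors or 429 rate limits.
function isRetryableError(error: any): boolean {
  const status = error?.response?.status;
  return status === 429 || (typeof status === "number" && status >= 500);
}

// Generic retry wrapper with exponential backoff for any async external call.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      // Give up on non-retryable errors, or once attempts are exhausted.
      if (!isRetryableError(error) || i === attempts - 1) throw error;
      const delay = Math.pow(2, i) * baseDelayMs; // 1x, 2x, 4x the base delay
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable"); // satisfies the compiler's return analysis
}
```

Any third-party call can then be wrapped as `await withRetry(() => someExternalCall())`, keeping the backoff policy in one place.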
3. Observability: Logging Successes, Not Just Failures
The silent failure mode was the most insidious part of this whole ordeal. To prevent this in the future, we beefed up our logging. Now, whenever embeddings are successfully written, we log it explicitly in all three relevant write paths:
- src/server/services/insight-persistence.ts
- src/server/services/pipeline-insight-extractor.ts
- src/server/services/discussion-knowledge.ts
Log messages now look something like:
[insight-persistence] Embeddings: X/Y written
This provides immediate visibility into whether the embedding write was successful, allowing us to quickly spot any future discrepancies between application logic and database state.
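A minimal sketch of how such a summary line could be assembled from per-row write results (the function name is illustrative, not the real code):

```typescript
// Build the "[source] Embeddings: X/Y written" summary from per-row outcomes,
// so a gap between attempted and successful writes is visible at a glance.
function embeddingWriteSummary(results: boolean[], source: string): string {
  const written = results.filter(Boolean).length;
  return `[${source}] Embeddings: ${written}/${results.length} written`;
}
```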
4. Cleanup
Finally, a quick sweep to remove any temporary test scripts from the production container.
Lessons Learned: From Pain to Progress
This session was a stark reminder of several critical development principles:
- Trust, But Verify Your Data: Never blindly trust success messages from your application or even basic audit logs for critical data. Always verify the actual state of your data in the database, especially for new or complex data types like vectors.
- Schema Migrations Are Tricky: Be incredibly cautious with schema migration tools, especially those that might temporarily drop and restore columns (prisma db push can be notorious for this). Understand their exact behavior and how it interacts with running application code.
- Robustness for External APIs is Non-Negotiable: External services will have transient failures. Implementing retry logic with exponential backoff is a fundamental requirement for any system relying on third-party APIs.
- Log for Observability, Not Just Errors: Don't just log when things go wrong. Log successful operations, especially critical data transformations. This provides a clear audit trail and helps diagnose subtle discrepancies that don't manifest as immediate errors.
- Admin Endpoints Are Invaluable: Having a dedicated backfill or administration endpoint can be a lifesaver for data recovery and maintenance. Ensure they are securely authenticated!
What's Next?
With the immediate crisis averted and our systems more resilient, the next steps are clear:
- Deploy the Changes: Push the retry logic and enhanced logging to production.
- Verify New Insights: Run a new workflow, generate insights, and check the logs for the Embeddings: X/Y written messages to confirm the fixes.
- Consider Scheduled Backfills: To catch any future, unforeseen gaps, we'll explore scheduling a periodic backfill (e.g., via cron or a health check) for NULL embeddings.
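A scheduled check along those lines could be sketched as follows. Everything here is hypothetical: the dependency names and the wiring to the count query and the admin endpoint are assumptions based on the post, not the real implementation.

```typescript
// Hypothetical scheduled health check for NULL embeddings.
type BackfillDeps = {
  countNullEmbeddings: () => Promise<number>; // e.g. SELECT count(*) ... WHERE embedding IS NULL
  triggerBackfill: () => Promise<void>;       // e.g. POST /api/v1/admin/backfill-embeddings
};

// Pure decision: should this run trigger a backfill, and why?
function planBackfill(nullCount: number): { trigger: boolean; reason: string } {
  return nullCount > 0
    ? { trigger: true, reason: `${nullCount} insights missing embeddings` }
    : { trigger: false, reason: "all embeddings present" };
}

// Run periodically (cron, health check, etc.): log the state, backfill if needed.
async function runScheduledBackfill(deps: BackfillDeps): Promise<void> {
  const plan = planBackfill(await deps.countNullEmbeddings());
  console.log(`[backfill-check] ${plan.reason}`);
  if (plan.trigger) {
    await deps.triggerBackfill();
  }
}
```

Keeping the decision pure makes the gap-detection logic trivially testable, while the I/O stays behind injected dependencies.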
This journey from mysterious NULLs to a fully populated, robust vector database was a valuable, albeit painful, learning experience. It reinforced the importance of deep debugging, defensive programming, and proactive observability. Here's to fewer silent killers in our future!
{"thingsDone":[
"Diagnosed all 1719 `workflow_insights` rows had `NULL` embedding columns on production.",
"Confirmed pgvector extension, embedding column, and HNSW index were intact.",
"Found root cause: early `column 'embedding' does not exist` errors during `prisma db push` schema migration, leading to later silent failures.",
"Tested full pipeline end-to-end on production to confirm vector storage capability.",
"Ran backfill via `POST /api/v1/admin/backfill-embeddings` (3 rounds due to transient OpenAI 500s) to populate all 1719 embeddings.",
"Added retry logic (3 retries with exponential backoff for 5xx/429 errors) to `openaiEmbed()`.",
"Added success logging to all 3 embedding write paths for better observability.",
"Cleaned up temporary test scripts from production container."
],"pains":[
"Understanding why audit logs showed success but DB had zero embeddings written.",
"Could not definitively pinpoint why inline embedding writes silently failed after the column was restored.",
"Transient OpenAI API 500 errors during backfill required multiple runs."
],"successes":[
"All 1719 embeddings successfully backfilled.",
"Implemented robust retry logic for external API calls.",
"Significantly improved observability for embedding write operations.",
"Identified and mitigated a critical schema migration vulnerability.",
"Established a clear plan for future maintenance and prevention."
],"techStack":[
"PostgreSQL",
"pgvector",
"OpenAI API",
"Prisma",
"TypeScript",
"Docker Compose",
"Backend Development"
]}