nyxcore-systems

The Case of the Missing Embeddings: Debugging a Silent pgvector Failure

A deep dive into diagnosing and fixing a critical issue where our pgvector database silently failed to store embeddings, leading to broken workflow insights, and how we built a more resilient system.

pgvector, embeddings, debugging, database, OpenAI, vector-db, typescript, postgres

Every now and then, a bug emerges that's more than just a bug; it's a detective story. This week, we found ourselves in the middle of such a narrative, centered around our pgvector database and the crucial embeddings powering our workflow insights. Despite our system reporting success, our vector database was a desolate wasteland of NULL values where rich vector data should have been. This is the story of how we uncovered the mystery, fixed the problem, and fortified our system against future silent failures.

The Silent Killer: Zero Embeddings, Zero Errors?

Our workflow insights, a core feature relying on semantic search and similarity, suddenly went dark. The symptom was clear: all 1719 workflow_insights rows in production had NULL in their embedding columns. This was a critical issue, rendering a key part of our application effectively blind.

Initial checks were baffling:

  • The pgvector extension (v0.8.2) was installed and active.
  • The embedding vector(1536) column was present.
  • The HNSW index for efficient similarity search was intact.

Everything looked fine on the database side, yet the data wasn't there. Compounding the mystery, our internal audit_logs showed embeddingErrors: 0. This was the ultimate red herring: the system thought it was successfully writing embeddings, but the database told a different story.
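
Checks like these can also be scripted from the application side. The sketch below is illustrative rather than our actual tooling: it runs the extension, column, index, and NULL-count checks through Prisma's raw query API, assuming the workflow_insights table and embedding column described above.

typescript
    import { PrismaClient } from "@prisma/client";

    const prisma = new PrismaClient();

    async function checkEmbeddingHealth(): Promise<void> {
        // Is the pgvector extension installed, and at what version?
        const extension = await prisma.$queryRaw`
            SELECT extname, extversion FROM pg_extension WHERE extname = 'vector'`;

        // Does the embedding column exist on workflow_insights, and with what type?
        const column = await prisma.$queryRaw`
            SELECT column_name, udt_name
            FROM information_schema.columns
            WHERE table_name = 'workflow_insights' AND column_name = 'embedding'`;

        // Are the indexes (including the HNSW index) still present?
        const indexes = await prisma.$queryRaw`
            SELECT indexname FROM pg_indexes WHERE tablename = 'workflow_insights'`;

        // How many rows actually have an embedding? COUNT(embedding) skips NULLs.
        const counts = await prisma.$queryRaw`
            SELECT COUNT(*)::int AS total, COUNT(embedding)::int AS with_embedding
            FROM workflow_insights`;

        console.log({ extension, column, indexes, counts });
    }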

The Breakthrough: Peering into the Past

The real "Aha!" moment came from digging deeper into historical audit_logs. We found some very early entries marked insight.embedding_failed which contained a crucial error message: "column "embedding" does not exist".

This was the smoking gun! It revealed a fleeting period where the embedding column was actually missing. Our hypothesis:

  1. An early prisma db push operation might have inadvertently dropped the embedding column.
  2. A subsequent rls.sql script, responsible for setting up Row-Level Security, likely recreated the column (or a migration ran that restored it).

The issue was that after the column was restored, our application code (which used $executeRaw for direct SQL updates) kept trying to write to it, but something in the pipeline, perhaps a subtle timing issue or an unhandled edge case, caused those writes to fail silently without propagating an error back to our application's audit logs. The column existed, but the data wasn't making it in.

The Fix: Backfilling and Fortifying

With the root cause identified, it was time for a two-pronged approach: immediate data recovery and long-term system resilience.

1. Data Recovery: The Great Backfill

We needed to populate those 1719 missing embeddings. We leveraged an existing internal admin endpoint: POST /api/v1/admin/backfill-embeddings. This endpoint orchestrates the entire embedding process:

  • Decrypting sensitive keys.
  • Calling the OpenAI API to generate embeddings.
  • Executing a direct $executeRaw UPDATE query to store the vector in pgvector (sketched below).
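
To make that last step concrete, here is a minimal sketch of what a direct pgvector write via $executeRaw can look like. It is illustrative rather than our exact persistence code: the embedding is serialized into pgvector's bracketed text format and cast to vector in SQL, and writeEmbedding is a hypothetical helper name (adjust the id comparison to your schema, e.g. with a ::uuid cast).

typescript
    import { PrismaClient } from "@prisma/client";

    const prisma = new PrismaClient();

    // Serialize the embedding to pgvector's text format ('[0.1,0.2,...]') and cast it
    // to vector in SQL. $executeRaw resolves to the number of rows affected.
    async function writeEmbedding(insightId: string, embedding: number[]): Promise<number> {
        const vectorLiteral = `[${embedding.join(",")}]`;
        return prisma.$executeRaw`
            UPDATE workflow_insights
            SET embedding = ${vectorLiteral}::vector
            WHERE id = ${insightId}`;
    }

Checking that returned row count (and flagging when it comes back as zero) is exactly the kind of signal that was missing while the writes were failing silently.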

The backfill ran in three rounds. Why three? Because even OpenAI, a robust service, can experience transient 500 errors. This highlighted another area for improvement. In the end, all 1719 embeddings were successfully populated, bringing our workflow insights back to life.

2. System Fortification: Retries and Robust Logging

To prevent future silent failures and handle transient external API issues, we implemented several key improvements:

  • OpenAI API Retry Logic: We added a robust retry mechanism to our openaiEmbed() service. Now, any 5xx or 429 (Too Many Requests) errors from OpenAI trigger 3 retries with exponential backoff (1s, 2s, 4s). This significantly improves the reliability of our embedding generation.

    typescript
    // src/server/services/embedding-service.ts (simplified)
    import OpenAI from "openai";

    // In the real service the API key is decrypted first; a default client is shown here.
    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    async function openaiEmbed(text: string): Promise<number[]> {
        const maxRetries = 3; // up to 3 retries after the initial attempt
        for (let attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                const response = await openai.embeddings.create({
                    model: "text-embedding-ada-002",
                    input: text,
                });
                return response.data[0].embedding;
            } catch (error: any) {
                // Retry on 429 (Too Many Requests) and any 5xx; rethrow everything else.
                const status = error?.status ?? error?.response?.status;
                const retryable = status === 429 || (typeof status === "number" && status >= 500);
                if (!retryable || attempt === maxRetries) {
                    throw error;
                }
                const delay = Math.pow(2, attempt) * 1000; // Exponential backoff: 1s, 2s, 4s
                console.warn(`OpenAI API error (${status}). Retrying in ${delay / 1000}s...`);
                await new Promise(resolve => setTimeout(resolve, delay));
            }
        }
        throw new Error("Failed to get OpenAI embedding after multiple retries.");
    }
    
  • Enhanced Embedding Write Logging: The "silent failure" was the most painful aspect of this bug. To combat this, we added explicit success logging to all three critical embedding write paths:

    • src/server/services/insight-persistence.ts
    • src/server/services/pipeline-insight-extractor.ts
    • src/server/services/discussion-knowledge.ts

    Now, our logs clearly state [service-name] Embeddings: X/Y written, providing immediate visibility into whether embeddings are truly being persisted. This closes the loop on observability, ensuring our application logs accurately reflect the database's state (a sketch of the pattern follows this list).

  • Cleanup: We also removed some old test scripts from the production container, tidying up our deployment.
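
To make the logging pattern concrete, here is a minimal sketch under the same assumptions as the earlier snippets (persistEmbeddings and the [insight-persistence] prefix are illustrative names, not our exact code). The count comes from the rows each UPDATE actually affected, so the log line reflects what the database really stored rather than what we attempted.

typescript
    import { PrismaClient } from "@prisma/client";

    const prisma = new PrismaClient();

    async function persistEmbeddings(
        insights: { id: string; embedding: number[] }[]
    ): Promise<void> {
        let written = 0;
        for (const insight of insights) {
            const vectorLiteral = `[${insight.embedding.join(",")}]`;
            // $executeRaw resolves to the number of rows the UPDATE affected.
            const affected = await prisma.$executeRaw`
                UPDATE workflow_insights
                SET embedding = ${vectorLiteral}::vector
                WHERE id = ${insight.id}`;
            if (affected > 0) written++;
        }
        console.log(`[insight-persistence] Embeddings: ${written}/${insights.length} written`);
    }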

Lessons Learned

This debugging session offered valuable insights into building resilient data pipelines:

  1. Observability at the Data Layer is Paramount: Relying solely on application-level success messages can be misleading. Always verify data integrity at the persistence layer, especially for critical data like embeddings. Robust logging that confirms writes to the database is crucial.
  2. Schema Evolution is Tricky: Be extremely cautious with database schema changes, especially when multiple tools (like Prisma and raw SQL scripts) interact with the DDL. Subtle interactions can lead to unexpected column states.
  3. Expect Transient Failures: External APIs, no matter how reliable, will have transient issues. Implementing retry logic with exponential backoff is a non-negotiable for any system relying on third-party services.
  4. Admin Tools are Lifesavers: Having an accessible admin endpoint for data backfills or re-processing is invaluable for disaster recovery and data hygiene.

The Road Ahead

With the fixes in place, our immediate next steps include:

  1. Deploying the retry logic and enhanced logging to production.
  2. Verifying the deployment by monitoring new insights for successful embedding writes.
  3. Considering scheduled periodic backfills or health checks to proactively catch any future embedding gaps.

This journey through the missing embeddings reinforced the importance of meticulous debugging, robust error handling, and comprehensive observability. Our pgvector database is now healthier, and our workflow insights are flowing smoothly once more.