# The Great Backfill & The Local LLM Leap: A Production Deployment Tale
Join me on a recent production adventure: tackling a critical vector database embedding failure, integrating self-hosted LLMs via Ollama, and collecting hard-won lessons from the trenches.
Alright team, buckle up! Just wrapped up a rather intense development session that took us through the depths of a broken production vector database and straight into the exciting world of self-hosting LLMs. The goal was clear: fix some critical data integrity issues and bring Ollama into our Hetzner production environment. What started as a 'quick' deployment turned into a full-on debugging marathon, but we emerged victorious. Here’s the story, the fixes, and the hard-earned lessons.
## The Case of the Missing Embeddings: A Vector DB Detective Story
Our application relies heavily on vector embeddings for features like `workflow_insights`. So, imagine the cold sweat when we discovered a whopping 1719 `workflow_insights` records on production were sporting NULL embeddings. That's essentially a critical part of our AI features silently failing.
**The Root Cause:** This one was tricky. It turns out a `prisma db push` operation, intended to apply schema changes, had inadvertently dropped our `embedding` column. Our `rls.sql` script (for Row-Level Security) would restore the column, but in the window before it did, inline embedding writes failed silently, because the application code assumed the column was always there. A classic race condition meets silent failure: a developer's nightmare.
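Before getting to what we actually shipped, it's worth noting how cheap a guard against this class of schema drift can be. Here's a minimal sketch of a startup assertion (not something from this session; the import path and helper name are illustrative assumptions):

```typescript
// Illustrative startup guard (not part of this session's fix): fail fast
// if the embedding column ever goes missing again, instead of failing silently.
import { prisma } from "@/server/db"; // assumed Prisma client export

export async function assertEmbeddingColumn(): Promise<void> {
  const rows = await prisma.$queryRaw<{ column_name: string }[]>`
    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = 'workflow_insights' AND column_name = 'embedding'
  `;
  if (rows.length === 0) {
    throw new Error("workflow_insights.embedding column is missing: schema drift detected");
  }
}
```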
**The Fix:**

- **Backfill:** First order of business was getting those embeddings back. We spun up a dedicated `POST /api/v1/admin/backfill-embeddings` endpoint (sketched just after this list). This took several rounds of execution, largely due to transient `500` errors from OpenAI's API. Persistence paid off, and eventually all 1719 records were re-embedded.
- **Robustness:** To prevent future issues with external API instability, we beefed up our `openaiEmbed()` function in `src/server/services/embedding-service.ts`. It now includes a retry mechanism with exponential backoff for `5xx` and `429` errors. Because when you're relying on external services, you *will* face transient issues.

  ```typescript
  // src/server/services/embedding-service.ts (simplified)
  async function openaiEmbed(text: string): Promise<number[]> {
    let retries = 0;
    const MAX_RETRIES = 3;
    while (retries < MAX_RETRIES) {
      try {
        const response = await openai.embeddings.create({
          model: "text-embedding-ada-002",
          input: text,
        });
        return response.data[0].embedding;
      } catch (error: any) {
        // Retry transient server errors (5xx) and rate limits (429) with exponential backoff.
        if (error.status >= 500 || error.status === 429) {
          console.warn(`OpenAI embedding failed (status: ${error.status}). Retrying in ${Math.pow(2, retries)}s...`);
          await new Promise((res) => setTimeout(res, Math.pow(2, retries) * 1000));
          retries++;
        } else {
          throw error; // Re-throw other errors immediately
        }
      }
    }
    throw new Error(`Failed to get OpenAI embedding after ${MAX_RETRIES} retries.`);
  }
  ```

- **Visibility:** Finally, we added success logging to all three critical embedding write paths: `insight-persistence.ts`, `pipeline-insight-extractor.ts`, and `discussion-knowledge.ts`. No more silent failures for us!
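The backfill endpoint itself is conceptually simple: find rows with NULL embeddings, re-embed, write back. Here's a rough sketch of the loop, with the caveat that the function shape, batch size, and column names are assumptions rather than our exact production code. pgvector columns are opaque to the Prisma client, hence the raw SQL:

```typescript
// Sketch of the loop behind POST /api/v1/admin/backfill-embeddings.
// Batching and naming are illustrative, not the exact production code.
import { prisma } from "@/server/db"; // assumed Prisma client export
import { openaiEmbed } from "@/server/services/embedding-service";

export async function backfillEmbeddings(batchSize = 50): Promise<number> {
  let fixed = 0;
  for (;;) {
    // pgvector columns aren't supported by the Prisma client, so use raw SQL.
    const rows = await prisma.$queryRaw<{ id: string; content: string }[]>`
      SELECT id, content FROM workflow_insights
      WHERE embedding IS NULL
      LIMIT ${batchSize}
    `;
    if (rows.length === 0) return fixed;

    for (const row of rows) {
      const vector = await openaiEmbed(row.content); // retries 5xx/429 internally
      await prisma.$executeRaw`
        UPDATE workflow_insights
        SET embedding = ${`[${vector.join(",")}]`}::vector
        WHERE id = ${row.id}
      `;
      fixed++;
    }
  }
}
```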
## Bringing LLMs Home: Ollama Integration
With our embeddings back in shape, it was time for the fun part: integrating self-hosted LLMs via Ollama. The goal was to gain more control, potentially reduce costs, and experiment with a wider range of models directly on our infrastructure.
**Setting up Ollama:** We added `ollama` as a new service to our `docker-compose.production.yml`. It's configured for CPU-only operation (we're on a VM, not a GPU instance), with sensible resource limits (5GB RAM, 3 CPUs) and a dedicated `ollama_data` volume for model persistence.
```yaml
# docker-compose.production.yml (excerpt)
services:
  ollama:
    image: ollama/ollama:latest
    container_name: nyxcore-ollama-1
    restart: unless-stopped
    profiles: ["ollama"] # Only start if explicitly requested
    ports:
      - "11434:11434" # Expose for host access if needed
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        limits:
          cpus: '3.0'
          memory: 5G
```
**The OllamaProvider Adapter:** Our LLM architecture uses a provider pattern, so integrating Ollama meant creating a new `OllamaProvider` adapter at `src/server/services/llm/adapters/ollama.ts`. This adapter implements our standard `complete()`, `stream()`, `isAvailable()`, and `listModels()` methods, all talking to Ollama's native `/api/chat` endpoint.
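To give a feel for the adapter, here's a condensed sketch of the non-streaming paths (streaming omitted for brevity). The class shape and return types are simplified assumptions about our provider contract; the Ollama endpoints themselves are real, and the sketch leans on `/api/tags` (Ollama's native model-listing endpoint) for availability and model listing:

```typescript
// src/server/services/llm/adapters/ollama.ts (condensed sketch, not verbatim)
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

export class OllamaProvider {
  constructor(
    private baseUrl = process.env.OLLAMA_BASE_URL ?? "http://ollama:11434"
  ) {}

  // complete(): one-shot chat completion via Ollama's native /api/chat.
  async complete(model: string, messages: ChatMessage[]): Promise<string> {
    const res = await fetch(`${this.baseUrl}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false }),
    });
    if (!res.ok) throw new Error(`Ollama /api/chat failed with ${res.status}`);
    const data = await res.json();
    return data.message.content; // Ollama returns { message: { role, content }, ... }
  }

  // isAvailable(): Ollama is "up" if it answers at all; no API key involved.
  async isAvailable(): Promise<boolean> {
    try {
      return (await fetch(`${this.baseUrl}/api/tags`)).ok;
    } catch {
      return false;
    }
  }

  // listModels(): /api/tags lists locally pulled models.
  async listModels(): Promise<string[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    const data = await res.json();
    return (data.models ?? []).map((m: { name: string }) => m.name);
  }
}
```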
**Wiring it Up:**

- `src/server/services/llm/resolve-provider.ts` was updated to special-case Ollama: no API key needed, and it checks for `OLLAMA_BASE_URL` in environment variables (defaulting to `http://ollama:11434` for Docker-internal communication). `validateProviderAvailability()` was also adjusted to correctly handle Ollama's availability without needing a database API key lookup.
- We added three models to our `MODEL_CATALOG` in `src/lib/constants.ts`: `qwen2.5:7b` (as our new default self-hosted option), `qwen2.5:3b` (for faster, lighter tasks), and `llama3.2:3b`. `qwen2.5:3b` also landed in `FAST_MODELS` (see the catalog sketch after this list).
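The catalog change is just data. A sketch of what those entries look like, assuming a simple map keyed by model ID (the real `MODEL_CATALOG` shape in `src/lib/constants.ts` has more fields):

```typescript
// src/lib/constants.ts (illustrative shape; the real catalog differs in detail)
export const MODEL_CATALOG = {
  // ...existing hosted models...
  "qwen2.5:7b": { provider: "ollama", label: "Qwen 2.5 7B (self-hosted default)" },
  "qwen2.5:3b": { provider: "ollama", label: "Qwen 2.5 3B (fast, lighter tasks)" },
  "llama3.2:3b": { provider: "ollama", label: "Llama 3.2 3B" },
} as const;

// qwen2.5:3b also joins the fast tier.
export const FAST_MODELS = [/* ...existing entries..., */ "qwen2.5:3b"];
```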
**A Quick Win on Disk Space:** Before deploying, our production server was at 83% disk utilization. A quick `docker system prune` freed up a massive 57GB, bringing us down to a comfortable 13% utilization. Always a good feeling!
## Lessons from the Trenches: The "Pain Log" Transformed
Not every step was smooth sailing. Here are some of the critical lessons learned from the debugging phase:
### 1. Debugging Inside Production Containers: Don't Fight the Shell
- **The Challenge:** I needed to run some quick diagnostic Node.js scripts inside the running application container on production. My go-to was `docker exec nyxcore-app-1 node -e 'console.log(process.env.MY_VAR)'`.
- **The Pain:** Escaping dollar signs (`$`) for environment variables across SSH, then `docker exec`, then `node -e` became an impossible syntax nightmare. The shell parsing layers were just too complex to reliably get a variable like `$MY_VAR` through.
- **The Workaround:** Instead of trying to pass a complex one-liner, I wrote the diagnostic script to a temporary file on the host (`/tmp/debug.js`), used `docker cp /tmp/debug.js nyxcore-app-1:/app/debug.js` to copy it into the container, and then executed it with `docker exec -w /app nyxcore-app-1 node debug.js` (an example script follows this list).
- **The Takeaway:** For anything more complex than a trivial command, `docker cp` is your friend. It isolates the script execution from the host's shell escaping rules. Also, remember containers often run as non-root users; use `docker exec -u 0` for cleanup tasks if permissions are an issue.
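What goes into such a script is whatever you'd otherwise have crammed into `node -e`. Mine looked roughly like this (the exact variable names checked are illustrative):

```typescript
// /tmp/debug.js: copied into the container with docker cp, so no shell
// escaping of $ is needed; it runs with the container's own environment.
const keys = ["DATABASE_URL", "OLLAMA_BASE_URL", "ADMIN_SECRET", "AUTH_SECRET"];
for (const key of keys) {
  // Log presence only, never values; this is a production box.
  console.log(`${key}: ${process.env[key] ? "set" : "missing"}`);
}
```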
### 2. Minimalist Container Images & Environment Variable Access
- **The Challenge:** I needed to make an internal `curl` request from within the app container to verify an endpoint.
- **The Pain:** The Node.js production image is lean: `curl` wasn't installed. This meant I couldn't easily test internal network calls directly.
- **The Workaround:** I resorted to `docker exec nyxcore-app-1 printenv AUTH_SECRET` from the host to grab the necessary authentication token, and then made the `curl` request from the host machine to the public https://nyxcore.cloud endpoint.
- **The Takeaway:** Production images are minimal for a reason (security, size). Don't assume common dev tools are present. Always know your environment variables: in this case, the backfill endpoint fell back to `AUTH_SECRET` because `ADMIN_SECRET` wasn't explicitly set in the container, which was an important detail for authentication. When container tools are missing, leverage host tools or temporary `docker run` images to diagnose (or the fetch-based stand-in sketched after this list).
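One more option worth knowing: the image may lack `curl`, but it always has Node, and Node 18+ ships a global `fetch`. So the lesson-1 trick (write a script, `docker cp` it in) doubles as a curl substitute. A sketch, with the local port and header shape assumed:

```typescript
// /tmp/check-endpoint.js: a curl stand-in for inside the container.
// The local port and Authorization header format are assumptions.
const secret = process.env.ADMIN_SECRET ?? process.env.AUTH_SECRET; // same fallback the endpoint uses

(async () => {
  const res = await fetch("http://localhost:3000/api/v1/admin/backfill-embeddings", {
    method: "POST",
    headers: { Authorization: `Bearer ${secret}` },
  });
  console.log(res.status, await res.text());
})();
```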
## Wrapping Up & What's Next
This session was a fantastic reminder of the realities of production development: critical bug fixes, exciting new feature integrations, and the inevitable debugging hurdles. We've successfully resurrected our vector database embeddings and brought the power of self-hosted LLMs into our stack with Ollama.
Immediate Next Steps (Post-Deployment Checklist):
- Commit all changes (docker-compose, Ollama adapter, provider resolution, constants, embedding retry/logging).
- Push to `main` and pull on the production server.
- Start Ollama: `docker compose -f docker-compose.production.yml up -d ollama`.
- Pull the default Ollama models: `docker exec nyxcore-ollama-1 ollama pull qwen2.5:7b` (~4.4GB) and `ollama pull qwen2.5:3b` (~2GB).
- Rebuild and restart the app: `docker compose -f docker-compose.production.yml build --no-cache app && docker compose -f docker-compose.production.yml up -d app`.
- **Crucial test:** Select Ollama as the provider in our workflow/enrichment UI and verify it works end-to-end.
- (Optional) Pull `llama3.2:3b` for more lightweight experimentation.
It's always a journey, but seeing these pieces come together is incredibly rewarding. What are your go-to debugging tricks in production, or your experiences with self-hosting LLMs? Share in the comments!