Rescuing Embeddings & Embracing Ollama: My Latest Production Adventure
Join me on a recent production journey as I tackled mysterious pgvector embedding issues, backfilled critical data, and successfully integrated a self-hosted Ollama LLM to cut costs and boost resilience.
Every now and then, a development session turns into a full-blown production saga. This past Wednesday was one of those times. What started as a dual mission – fixing broken pgvector embeddings and bringing a self-hosted Ollama LLM into our production environment – evolved into a deep dive into data integrity, container debugging, and cost-effective AI.
By the end of the evening, both goals were not just met, but fully deployed and verified on our production server. Here's a look at the journey, the solutions, and the hard-won lessons.
The Case of the Missing Embeddings
Our `workflow_insights` table, which powers crucial search and knowledge features, was silently failing. A quick check revealed 1719 out of 1719 entries had `NULL` embeddings on production. A perfect score, but in the worst possible way.
The Culprit: A Silent Schema Drift
After some digging, the root cause became clear: a `prisma db push` command had, at some point, dropped the `embedding` column. While our `rls.sql` script (which defines row-level security policies) correctly restored the column, subsequent inline writes to it were silently failing. The column existed, but new data wasn't making it in. This is a subtle but critical failure mode – no errors, just missing data.
The Fix: Backfill & Fortify
- Mass Backfill: The immediate solution was a full backfill. I spun up a dedicated `POST /api/v1/admin/backfill-embeddings` endpoint, which re-processed all existing `workflow_insights`, re-generating and storing their embeddings. Verification was satisfying: `SELECT count(embedding) FROM workflow_insights WHERE embedding IS NOT NULL;` finally returned `1719`.
- Robust Retries: To prevent future transient issues, I added retry logic to our `openaiEmbed()` function in `src/server/services/embedding-service.ts`. It now attempts 3 retries with exponential backoff (1s/2s/4s), specifically targeting 5xx server errors and 429 rate limits (see the sketch after this list).
- Enhanced Logging: To ensure we'd catch silent failures going forward, I instrumented `insight-persistence.ts`, `pipeline-insight-extractor.ts`, and `discussion-knowledge.ts` with explicit logging for embedding writes: `[service] Embeddings: X/Y written`. This gives us immediate visibility into success rates.
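In sketch form, the retry wrapper looks roughly like this (function names and the error shape here are illustrative, not the exact `embedding-service.ts` code):

```typescript
// Sketch: retry wrapper with 1s/2s/4s exponential backoff, retrying only
// on 429 rate limits and 5xx server errors. Names are illustrative.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function embedWithRetry(
  embed: () => Promise<number[]>, // e.g. a closure around the OpenAI embeddings call
  maxRetries = 3,
): Promise<number[]> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await embed();
    } catch (err: any) {
      // Assumes the error carries an HTTP status; adjust to your client's shape.
      const status = err?.status ?? err?.response?.status;
      const retryable = status === 429 || (status >= 500 && status < 600);
      if (!retryable || attempt >= maxRetries) throw err;
      await sleep(1000 * 2 ** attempt); // 1s, 2s, 4s
    }
  }
}
```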
Enter Ollama: Our Self-Hosted LLM Hero
The second major objective was to integrate a self-hosted LLM solution, primarily driven by a desire for cost control and greater independence from third-party API providers (especially after hitting Anthropic credit limits, more on that later). Ollama was the clear choice for its ease of use and local deployment capabilities.
Dockerizing Ollama
Integrating Ollama into our existing Docker-Compose setup was straightforward:
```yaml
# docker-compose.production.yml snippet
services:
  ollama:
    image: ollama/ollama:latest
    container_name: nyxcore-ollama-1
    restart: always
    profiles: ["ollama"] # Only start if explicitly requested or part of default
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434" # Expose for host access if needed for debugging
    deploy:
      resources:
        limits:
          cpus: '3.0' # Allocate 3 CPUs
          memory: 5G # Allocate 5GB RAM (qwen2.5:7b needs ~8GB to run comfortably)
        reservations:
          cpus: '1.0'
          memory: 2G
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11434/api/tags || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

volumes:
  ollama_data:
```
Note: I initially budgeted 5GB RAM, but for models like `qwen2.5:7b`, 8GB+ is recommended for optimal performance. This is something to monitor.
Building the OllamaProvider
Our LLM service uses a provider abstraction, so I implemented a new `OllamaProvider` in `src/server/services/llm/adapters/ollama.ts`:

- `complete()`: Handles non-streaming chat completions via `/api/chat`.
- `stream()`: Implements NDJSON streaming via `/api/chat`, yielding `text`, `done`, or `error` chunks (see the sketch after this list). This was crucial for a responsive UI.
- `isAvailable()`: A simple health check with a 3-second timeout to confirm Ollama connectivity.
- `listModels()`: Dynamically discovers available models via Ollama's API.
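To give a feel for the streaming half, here's a minimal sketch of that NDJSON loop. The function and chunk names are illustrative rather than the exact `ollama.ts` code, but the protocol is real: Ollama's `/api/chat` emits one JSON object per line, and the adapter translates those into UI-friendly chunks.

```typescript
// Sketch of the NDJSON streaming loop (simplified: the real adapter also
// surfaces error chunks mid-stream, timeouts, and abort signals).
type StreamChunk =
  | { type: "text"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

async function* streamChat(
  baseUrl: string,
  model: string,
  messages: { role: "system" | "user" | "assistant"; content: string }[],
): AsyncGenerator<StreamChunk> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: true }),
  });
  if (!res.ok || !res.body) {
    yield { type: "error", message: `Ollama returned HTTP ${res.status}` };
    return;
  }
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Ollama streams one JSON object per line; split on newlines as they arrive.
    let nl: number;
    while ((nl = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, nl).trim();
      buffer = buffer.slice(nl + 1);
      if (!line) continue;
      const chunk = JSON.parse(line);
      if (chunk.message?.content) yield { type: "text", text: chunk.message.content };
      if (chunk.done) yield { type: "done" };
    }
  }
}
```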
Wiring It All Up
The `OllamaProvider` was then integrated into `src/server/services/llm/resolve-provider.ts`. A key design decision here was that Ollama doesn't require a DB API key; instead, its availability is checked directly via `isAvailable()`. This simplifies configuration for self-hosted instances.
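In sketch form, that decision looks like this (names are illustrative, not the exact `resolve-provider.ts` contents):

```typescript
// Sketch: Ollama is gated on liveness (isAvailable()), not on an API key
// stored in the DB like the cloud providers. Names are illustrative.
interface LLMProvider {
  isAvailable(): Promise<boolean>;
  // complete(), stream(), listModels() elided for brevity
}

async function resolveProvider(
  requested: "ollama" | "openai" | "anthropic",
  makeOllama: () => LLMProvider, // e.g. () => new OllamaProvider(baseUrl)
  makeCloud: (name: string) => Promise<LLMProvider>, // looks up the API key in the DB
): Promise<LLMProvider> {
  if (requested === "ollama") {
    const ollama = makeOllama();
    // No DB lookup: a live Ollama instance is its own credential.
    if (await ollama.isAvailable()) return ollama;
    throw new Error("Ollama requested but not reachable");
  }
  return makeCloud(requested);
}
```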
Finally, `src/lib/constants.ts` was updated to include Ollama models in our `MODEL_CATALOG` (`qwen2.5:7b` as default, `qwen2.5:3b`, `llama3.2:3b`) and `FAST_MODELS.ollama` for quick, lightweight inference. The default base URL for Ollama was set to `http://ollama:11434` (leveraging Docker's internal networking), with an `OLLAMA_BASE_URL` environment variable for overrides.
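Roughly, the new constants take this shape. This is illustrative only; in particular, which model lands in `FAST_MODELS.ollama` is my assumption, and the real `constants.ts` entries carry more metadata:

```typescript
// Sketch of the constants shape (illustrative, not the real file).
const OLLAMA_DEFAULT_MODEL = "qwen2.5:7b";
const OLLAMA_MODELS = ["qwen2.5:7b", "qwen2.5:3b", "llama3.2:3b"] as const;

// FAST_MODELS maps each provider to a cheap model for lightweight calls.
const FAST_MODELS = {
  ollama: "qwen2.5:3b", // assumption: the smaller model is the "fast" pick
} as const;

// Docker's internal DNS resolves the service name; override via env var.
const OLLAMA_BASE_URL = process.env.OLLAMA_BASE_URL ?? "http://ollama:11434";
```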
Deployment & Model Pulls
Before deploying, I freed up 57GB of disk space with `docker system prune -af` (dropping usage from 83% to 13%!), which was critical given the size of LLM models. After committing and pulling the changes, the Ollama container was started, and I manually pulled the `qwen2.5:7b` (4.5GB) and `qwen2.5:3b` (1.9GB) models. A quick rebuild of our app and everything was running. Verification confirmed Ollama was responding, inference was working, and our app was healthy.
Lessons from the Production Trenches (The "Pain Log" Transformed)
Not everything went smoothly, and these struggles provided valuable lessons:
1. Debugging Inside Containers: The `docker exec` Dilemma

- Problem: I needed to run a quick Node.js script inside the app container. My initial thought was `docker exec node -e "console.log('hello')"`.
- Pain: Shell escaping across SSH, `docker exec`, and `node -e` quickly became a nightmare. Special characters, quotes, and multi-line scripts were nearly impossible to manage reliably.
- Lesson Learned: For anything beyond the simplest one-liner, don't try to inline scripts.
- Actionable Takeaway:
  - Write your script to a temporary file (`/tmp/script.js` on the host).
  - Use `docker cp /tmp/script.js <container_id>:/app/script.js` to copy it into the container.
  - Then, `docker exec -w /app <container_id> node script.js` to run it.
  - Pro-tip: If your container runs as a non-root user (good practice!), you might need `docker exec -u 0` for cleanup commands (e.g., `rm script.js`) to gain root privileges temporarily.
2. Container Networking & Missing Tools
- Problem: I wanted to `curl` or `wget` an internal endpoint (e.g., Ollama's API) from inside another container (like our app container) to verify connectivity.
- Pain: Neither `curl` nor `wget` were installed in our lightweight app or Ollama Docker images. This is common for optimized production images.
- Lesson Learned: Don't assume common network tools are present in production container images.
- Actionable Takeaway: If you need to test internal container network communication from the host, find the container's IP address (e.g., `docker inspect <container_id> | grep "IPAddress"`) and `curl` it directly from the host. For example: `curl http://172.18.0.6:11434/api/tags`.
3. The Cost of Cloud LLMs: A Timely Reminder
- Problem: Our Anthropic API credits ran out, causing all Anthropic-dependent workflows to return 400 "credit balance too low" errors.
- Pain: This highlighted a critical vulnerability: reliance on a single, paid external provider for core functionality.
- Lesson Learned: Diversification and fallback mechanisms for critical external services are essential.
- Actionable Takeaway: Ollama's integration came at the perfect time, providing a free, self-hosted fallback. This reinforces the value of having multiple providers or an in-house option. Top up those credits, but also lean into the resilience of self-hosting.
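The fallback pattern itself is simple to express at the call site. A hedged sketch (the provider functions here are stand-ins, not our actual service code):

```typescript
// Sketch: try the primary cloud provider first; fall back to the
// self-hosted Ollama instance when the cloud call fails (e.g. a
// 400 "credit balance too low" error). Names are illustrative.
async function completeWithFallback(
  primary: (prompt: string) => Promise<string>,  // e.g. the Anthropic call
  fallback: (prompt: string) => Promise<string>, // e.g. the Ollama call
  prompt: string,
): Promise<string> {
  try {
    return await primary(prompt);
  } catch (err) {
    console.warn("[llm] primary provider failed, falling back to Ollama:", err);
    return fallback(prompt);
  }
}
```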
4. Healthchecks Matter (Even When They Don't)
- Problem: After deployment, `docker compose up -d app` reported our `app` service as "unhealthy."
- Pain: Initial panic, but quickly realized it was a false alarm. The app was responding perfectly fine via `/api/v1/health`. The issue was simply that we hadn't defined a healthcheck for the `app` service in our `docker-compose.production.yml`.
- Lesson Learned: Explicit healthchecks are invaluable for accurate service status reporting and automated restarts.
- Actionable Takeaway: Add a proper healthcheck to the `app` service in `docker-compose.production.yml` to reflect its true operational status.
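For reference, here's roughly what that might look like. The port and the use of Node 18+'s global `fetch` are assumptions about our app image; notably, the probe avoids `curl`/`wget`, which (per lesson 2) aren't installed:

```yaml
# Sketch: healthcheck for the app service. Port 3000 and Node >= 18
# (global fetch) are assumptions about the image, not verified config.
services:
  app:
    healthcheck:
      test: ["CMD-SHELL", "node -e \"fetch('http://localhost:3000/api/v1/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))\""]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
```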
Current State & The Road Ahead
As of now, our production system is robust:
- All 1719 `workflow_insights` embeddings are populated.
- Ollama is running smoothly on `nyxcore-ollama-1` at `172.18.0.6:11434`, with `qwen2.5:7b` and `qwen2.5:3b` models available.
- Disk usage is healthy at ~56GB free.
Immediate next steps include topping up Anthropic credits, thoroughly testing Ollama from the UI, and potentially pulling `llama3.2:3b` for more model variety. I'll also be closely monitoring embedding logs to confirm new writes are succeeding.
This session was a great reminder that even with careful planning, production environments always have a few surprises in store. The key is to diagnose, fix, fortify, and learn from every challenge.
{"thingsDone":["Fixed broken pgvector embeddings for 1719 records","Implemented retry logic for embedding generation","Added comprehensive logging for embedding writes","Integrated Ollama self-hosted LLM into production via Docker-Compose","Developed a full OllamaProvider with streaming and model discovery","Wired Ollama into LLM resolution logic without API keys","Freed 57GB disk space via docker system prune","Deployed new code and pulled Ollama models on production"],"pains":["Shell escaping for inline scripts via docker exec","Lack of curl/wget in container images for debugging","Anthropic API credit depletion causing service outages","Misleading 'unhealthy' status due to missing Docker healthcheck"],"successes":["Successful backfill of all missing embeddings","Robust and resilient embedding generation pipeline","Seamless integration of a cost-effective, self-hosted LLM","Improved system resilience against external API failures","Efficient disk space management","Verified production deployment and functionality"],"techStack":["pgvector","Ollama","LLM","Docker","Docker-Compose","Prisma","Node.js","TypeScript","OpenAI API","Anthropic API","cURL"]}