nyxcore-systems
6 min read

From Nulls to Neural Nets: Rescuing Embeddings and Embracing Self-Hosted LLMs in Production

A deep dive into fixing critical data integrity issues with pgvector embeddings, successfully integrating a self-hosted Ollama LLM into production, and the invaluable lessons learned along the way.

LLM · Ollama · pgvector · Embeddings · Production · Docker · Debugging · PostgreSQL · Prisma · TypeScript

Just wrapped up a marathon development session, and what a ride it was! The mission was clear: tackle a critical data integrity issue with our pgvector embeddings and, in parallel, integrate a self-hosted Ollama LLM into our production environment. I'm thrilled to report: both missions accomplished.

This post chronicles the journey, from diagnosing silent failures to wrangling Docker networking, and the invaluable lessons learned along the way.

The Embedding Emergency: A Tale of Missing Vectors

Our workflow_insights table is the backbone of many intelligent features, relying heavily on pgvector embeddings for semantic search and context retrieval. Imagine our dismay when we discovered 1719 out of 1719 records had a glaring NULL in their embedding column on production. Yikes!

Diagnosing the Root Cause

The culprit? A classic prisma db push gotcha. While prisma db push is great for quickly syncing schema changes, it can take an aggressive approach to column changes. In this case, it had dropped our embedding column. Our rls.sql script (for Row-Level Security) subsequently restored it, but this dance left the database in a state where inline writes to the newly restored column were silently failing: records were being saved, but the vector never made it in.
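To catch this class of failure quickly, a null-vs-populated count on the table tells the whole story. Here's a minimal sketch of that check, assuming a Prisma client and the workflow_insights / embedding names above; the file name and exact query are illustrative, not the actual script we ran:

```typescript
// check-embeddings.ts — hypothetical diagnostic, not the exact script we used.
// Counts total rows vs. rows with a non-null embedding in workflow_insights.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function main() {
  // pgvector columns are unsupported types in Prisma's client API,
  // so we go through raw SQL for the check.
  const [row] = await prisma.$queryRaw<
    { total: bigint; with_embedding: bigint }[]
  >`
    SELECT
      count(*)                                      AS total,
      count(*) FILTER (WHERE embedding IS NOT NULL) AS with_embedding
    FROM workflow_insights
  `;
  console.log(`total=${row.total} with_embedding=${row.with_embedding}`);
}

main().finally(() => prisma.$disconnect());
```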

The Fix: Backfilling, Retries, and Observability

  1. Mass Backfill: The immediate priority was data integrity. We spun up a dedicated admin endpoint (POST /api/v1/admin/backfill-embeddings) to re-process and generate embeddings for all 1719 affected workflow_insights. A quick SELECT count(embedding) FROM workflow_insights returning 1719 confirmed success.
  2. Robustness with Retries: To prevent future silent failures and handle transient API issues, we added crucial retry logic to the openaiEmbed() function in src/server/services/embedding-service.ts. It now makes up to 3 retries with exponential backoff (1s, 2s, 4s), specifically for 5xx server errors and 429 rate limits (a sketch of the pattern follows this list).
  3. Enhanced Observability: No more silent failures! We instrumented our embedding write paths in insight-persistence.ts, pipeline-insight-extractor.ts, and discussion-knowledge.ts with explicit logging. Now, we'll see messages like [service] Embeddings: X/Y written, giving us immediate feedback on success or failure.
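For reference, the retry shape described in item 2 looks roughly like this. It's a sketch of the pattern, not the actual openaiEmbed() implementation; the withRetries helper name and the error-shape handling are assumptions:

```typescript
// Illustrative retry helper — the real logic lives in openaiEmbed() in
// src/server/services/embedding-service.ts; names below are hypothetical.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const status = (err as { status?: number }).status;
      // Only retry transient failures: 5xx server errors and 429 rate limits.
      const retryable =
        status === 429 || (status !== undefined && status >= 500);
      if (!retryable || attempt === maxRetries) throw err;
      // Exponential backoff: 1s, 2s, 4s.
      const delayMs = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```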

Welcoming Ollama: Our New Local LLM Powerhouse

Why Ollama? The answer was simple: our Anthropic API credits had dwindled to zero. We needed a cost-effective, self-sufficient alternative for internal use and as a fallback. Ollama, with its ability to run open-source LLMs locally, was the perfect fit.

Integrating Ollama into Production

  1. Docker-Compose Setup: First, Ollama needed a home in our docker-compose.production.yml. We provisioned it with 3 CPUs and 5GB of RAM (CPU-only for now), persisted its data in a dedicated ollama_data volume, and added a healthcheck against /api/tags so orchestration knows when it's ready to serve (see the compose sketch after this list).
  2. OllamaProvider Implementation: We built a full-fledged OllamaProvider at src/server/services/llm/adapters/ollama.ts (sketched below this list). This adapter handles:
    • complete() for single-shot, non-streaming requests via /api/chat.
    • stream() for real-time, NDJSON streaming responses, yielding text, done, and error chunks.
    • isAvailable() with a 3-second timeout to check connectivity.
    • listModels() for dynamic model discovery from the Ollama instance.
  3. System Integration: Wiring Ollama into our existing LLM resolution logic in src/server/services/llm/resolve-provider.ts was straightforward. We special-cased Ollama: it doesn't require a database API key and instead checks its availability directly via isAvailable(). We also updated validateProviderAvailability() accordingly (sketched after this list).
  4. Model Catalog & Configuration: We updated src/lib/constants.ts to include our chosen Ollama models (qwen2.5:7b as default, qwen2.5:3b, llama3.2:3b) and defined FAST_MODELS.ollama for quick, lightweight inference. The default base URL for Ollama is set to http://ollama:11434, leveraging Docker's internal service discovery. This can be overridden with the OLLAMA_BASE_URL environment variable.
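The compose service from item 1 ended up roughly along these lines. This is a sketch only: the resource limits, the data volume, and the /api/tags healthcheck target come from the description above, but the image tag, the probe command (the ollama CLI here, since curl and wget aren't in the image), and the timing values are assumptions:

```yaml
# Sketch of the Ollama service in docker-compose.production.yml.
# No published ports: other services reach it at http://ollama:11434
# over the internal Docker network.
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        limits:
          cpus: "3"
          memory: 5G
    healthcheck:
      # The post's probe targets /api/tags; `ollama list` exercises the same
      # model-listing endpoint and is used here only as a stand-in.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 5s
      retries: 3

volumes:
  ollama_data:
```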
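The adapter from item 2 is shown below as standalone functions rather than the actual provider class; types and error handling are simplified assumptions, but the request shapes match Ollama's /api/chat and /api/tags endpoints:

```typescript
// Sketch of src/server/services/llm/adapters/ollama.ts — simplified, not the
// production adapter.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type StreamChunk =
  | { type: "text"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

const baseUrl = process.env.OLLAMA_BASE_URL ?? "http://ollama:11434";

// Single-shot completion via POST /api/chat with stream: false.
export async function complete(
  model: string,
  messages: ChatMessage[],
): Promise<string> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama error: ${res.status}`);
  const data = await res.json();
  return data.message?.content ?? "";
}

// Streaming completion: Ollama emits newline-delimited JSON (NDJSON) chunks.
export async function* stream(
  model: string,
  messages: ChatMessage[],
): AsyncGenerator<StreamChunk> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: true }),
  });
  if (!res.ok || !res.body) {
    yield { type: "error", message: `Ollama error: ${res.status}` };
    return;
  }
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let newline: number;
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Each complete line is one JSON chunk.
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;
      const chunk = JSON.parse(line);
      if (chunk.message?.content) {
        yield { type: "text", text: chunk.message.content };
      }
      if (chunk.done) yield { type: "done" };
    }
  }
}

// Availability probe with a short timeout, mirroring isAvailable().
export async function isAvailable(): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/api/tags`, {
      signal: AbortSignal.timeout(3000),
    });
    return res.ok;
  } catch {
    return false;
  }
}
```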
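And the special-casing from items 3 and 4, in rough outline. Again a sketch: the real code lives in src/lib/constants.ts and resolve-provider.ts, and the names, signatures, and fast-model choice below are assumptions:

```typescript
// Illustrative shape of the Ollama special case — not the production code.
import { isAvailable } from "./adapters/ollama"; // availability probe, as in the adapter sketch

export const OLLAMA_MODELS = ["qwen2.5:7b", "qwen2.5:3b", "llama3.2:3b"] as const;
export const DEFAULT_OLLAMA_MODEL = "qwen2.5:7b";
// The post defines FAST_MODELS.ollama but doesn't name the model; 3b is a guess.
export const FAST_MODELS = { ollama: "qwen2.5:3b" } as const;

export async function validateProviderAvailability(provider: string): Promise<void> {
  if (provider === "ollama") {
    // Self-hosted: no API key row in the database; just probe the instance.
    if (!(await isAvailable())) {
      throw new Error("Ollama is not reachable at OLLAMA_BASE_URL");
    }
    return;
  }
  // Other providers still go through the existing database API-key check.
  await assertDatabaseApiKey(provider);
}

// Hypothetical stand-in for the existing key lookup, kept abstract here.
declare function assertDatabaseApiKey(provider: string): Promise<void>;
```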

Deployment: Clearing the Path

Before pushing new models and a new service, a quick docker system prune -af liberated a whopping 57GB of disk space on our production server, bringing usage down from 83% to a comfortable 13%. Always satisfying to reclaim that much space!

With the stage set, we committed a7d9146, pulled to production, and brought Ollama to life. We started the Ollama container, pulled the qwen2.5:7b (4.5GB) and qwen2.5:3b (1.9GB) models, then rebuilt our application with docker compose build --no-cache app && docker compose up -d app.

A quick test confirmed Ollama was responding, and inference requests (like a simple "Hello" from qwen2.5:3b) were working flawlessly. All 1719 workflow insight embeddings were successfully populated. Production is looking healthy!

Lessons Learned from the Trenches

No production deployment is without its quirks. Here are a few challenges we navigated and the insights gained:

1. Container Scripting Woes

  • Problem: Running Node.js scripts inside a Docker container via docker exec node -e proved to be a nightmare due to complex shell escaping across SSH and Docker exec contexts.
  • Lesson: For anything beyond trivial commands, avoid inline node -e via docker exec. Instead, docker cp your script into the container (e.g., to /tmp/script.js), then run it with docker exec -w /app <container> node /tmp/script.js so it resolves modules from the app directory. And remember, if your container runs as a non-root user, use docker exec -u 0 for file cleanup or other operations requiring root permissions.

2. Invisible Container Networking

  • Problem: Trying to curl or wget from inside our app or Ollama containers failed because neither tool was installed in those slim production images. This made debugging inter-container communication tricky.
  • Lesson: Don't assume debugging tools are present in lean production images. If you need to test connectivity from the host to a container, use its internal IP address (e.g., curl http://172.18.0.6:11434/api/tags). For more in-depth debugging, consider temporarily adding curl or netcat in a debug image or leveraging docker exec with shell commands to check basic connectivity (like ping if available, or just trying to open a socket).

3. API Credit Management

  • Problem: Our Anthropic API credits were depleted, causing all Anthropic API calls to return 400 "credit balance too low." This was the primary driver for accelerating Ollama integration.
  • Lesson: Proactive monitoring of API credits is crucial. Integrate alerts for low balances. Having a robust fallback mechanism, like our new Ollama setup, can be a lifesaver when external services hit their limits or become unavailable.

4. The Case for Healthchecks

  • Problem: After deployment, docker compose up -d app showed our app service as "unhealthy," even though the application was responding perfectly fine via its /api/v1/health endpoint.
  • Lesson: This was a false alarm caused by not having an explicit healthcheck defined for the app service in our docker-compose.yml. Always define explicit healthchecks for all critical services. This provides accurate status reporting, prevents false alarms, and ensures your orchestration system correctly understands the state of your applications.
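A healthcheck along these lines would close that gap (it's also item 4 under "What's Next" below). Sketch only: it assumes the /api/v1/health endpoint above, an app port of 3000 (not stated in the post), and a Node-based probe because curl and wget aren't in the slim app image; the timing values are illustrative:

```yaml
# Sketch for the app service in docker-compose.yml — not yet in place.
services:
  app:
    healthcheck:
      # Probe with node since curl/wget aren't installed in the image.
      test:
        [
          "CMD",
          "node",
          "-e",
          "fetch('http://localhost:3000/api/v1/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))",
        ]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s
```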

What's Next?

With the core work done, a few immediate next steps remain:

  1. Top up Anthropic API credits to restore full functionality for workflows that explicitly rely on it.
  2. Thoroughly test Ollama from the nyxCore UI: create a workflow step with Ollama as the provider and verify both complete and stream modes work as expected.
  3. Optionally, pull llama3.2:3b (docker exec nyxcore-ollama-1 ollama pull llama3.2:3b) to offer another lightweight LLM alternative.
  4. Add a proper healthcheck for the app service in docker-compose.yml to prevent future "unhealthy" false positives.
  5. Monitor embedding logs after the next workflow run to confirm inline embedding writes are consistently succeeding.

This session was a stark reminder of the dynamic nature of production environments and the importance of resilience, observability, and continuous learning. We've not only fixed a critical data issue and introduced a powerful new local LLM, but we've also hardened our deployment processes and learned valuable lessons that will undoubtedly benefit future endeavors.