Rescuing Embeddings & Embracing Ollama: My Latest Production Adventure
Join me on a recent production journey as I tackled mysterious pgvector embedding issues, backfilled critical data, and successfully integrated a self-hosted Ollama LLM to cut costs and boost resilience.
Every now and then, a development session turns into a full-blown production saga. This past Wednesday was one of those times. What started as a dual mission – fixing broken pgvector embeddings and bringing a self-hosted Ollama LLM into our production environment – evolved into a deep dive into data integrity, container debugging, and cost-effective AI.
By the end of the evening, both goals were not just met, but fully deployed and verified on our production server. Here's a look at the journey, the solutions, and the hard-won lessons.
The Case of the Missing Embeddings
Our `workflow_insights` table, which powers crucial search and knowledge features, was silently failing. A quick check revealed 1719 out of 1719 entries had `NULL` embeddings on production. A perfect score, but in the worst possible way.
The Culprit: A Silent Schema Drift
After some digging, the root cause became clear: a `prisma db push` command had, at some point, dropped the `embedding` column. While our `rls.sql` script (which defines row-level security policies) correctly restored the column, subsequent inline writes to it were silently failing. The column existed, but new data wasn't making it in. This is a subtle but critical failure mode – no errors, just missing data.
The Fix: Backfill & Fortify
- Mass Backfill: The immediate solution was a full backfill. I spun up a dedicated `POST /api/v1/admin/backfill-embeddings` endpoint, which re-processed all existing `workflow_insights`, re-generating and storing their embeddings. Verification was satisfying: `SELECT count(embedding) FROM workflow_insights WHERE embedding IS NOT NULL;` finally returned `1719`.
- Robust Retries: To prevent future transient issues, I added retry logic to our `openaiEmbed()` function in `src/server/services/embedding-service.ts`. It now attempts 3 retries with exponential backoff (1s/2s/4s), specifically targeting 5xx server errors and 429 rate limits (see the sketch after this list).
- Enhanced Logging: To ensure we'd catch silent failures going forward, I instrumented `insight-persistence.ts`, `pipeline-insight-extractor.ts`, and `discussion-knowledge.ts` with explicit logging for embedding writes: `[service] Embeddings: X/Y written`. This gives us immediate visibility into success rates.
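In sketch form, the retry wrapper looks roughly like this (function names and the error shape here are illustrative, not the exact `embedding-service.ts` code):

```typescript
// Sketch: retry wrapper with 1s/2s/4s exponential backoff, retrying only
// on 429 rate limits and 5xx server errors. Names are illustrative.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function embedWithRetry(
  embed: () => Promise<number[]>, // e.g. a closure around the OpenAI embeddings call
  maxRetries = 3,
): Promise<number[]> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await embed();
    } catch (err: any) {
      // Assumes the error carries an HTTP status; adjust to your client's shape.
      const status = err?.status ?? err?.response?.status;
      const retryable = status === 429 || (status >= 500 && status < 600);
      if (!retryable || attempt >= maxRetries) throw err;
      await sleep(1000 * 2 ** attempt); // 1s, 2s, 4s
    }
  }
}
```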
Enter Ollama: Our Self-Hosted LLM Hero
The second major objective was to integrate a self-hosted LLM solution, primarily driven by a desire for cost control and greater independence from third-party API providers (especially after hitting Anthropic credit limits, more on that later). Ollama was the clear choice for its ease of use and local deployment capabilities.
Dockerizing Ollama
Integrating Ollama into our existing Docker-Compose setup was straightforward:
```yaml
# docker-compose.production.yml snippet
services:
  ollama:
    image: ollama/ollama:latest
    container_name: nyxcore-ollama-1
    restart: always
    profiles: ["ollama"] # Only start if explicitly requested or part of default
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434" # Expose for host access if needed for debugging
    deploy:
      resources:
        limits:
          cpus: '3.0' # Allocate 3 CPUs
          memory: 5G # Allocate 5GB RAM (qwen2.5:7b needs ~8GB to run comfortably)
        reservations:
          cpus: '1.0'
          memory: 2G
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11434/api/tags || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

volumes:
  ollama_data:
```
Note: I initially budgeted 5GB RAM, but for models like `qwen2.5:7b`, 8GB+ is recommended for optimal performance. This is something to monitor.
Building the OllamaProvider
Our LLM service uses a provider abstraction, so I implemented a new `OllamaProvider` in `src/server/services/llm/adapters/ollama.ts`:

- `complete()`: Handles non-streaming chat completions via `/api/chat`.
- `stream()`: Implements NDJSON streaming via `/api/chat`, yielding `text`, `done`, or `error` chunks (see the sketch after this list). This was crucial for a responsive UI.
- `isAvailable()`: A simple health check with a 3-second timeout to confirm Ollama connectivity.
- `listModels()`: Dynamically discovers available models via Ollama's API.
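To give a feel for the streaming half, here's a minimal sketch of that NDJSON loop. The function and chunk names are illustrative rather than the exact `ollama.ts` code, but the protocol is real: Ollama's `/api/chat` emits one JSON object per line, and the adapter translates those into UI-friendly chunks.

```typescript
// Sketch of the NDJSON streaming loop (simplified: the real adapter also
// surfaces error chunks mid-stream, timeouts, and abort signals).
type StreamChunk =
  | { type: "text"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

async function* streamChat(
  baseUrl: string,
  model: string,
  messages: { role: "system" | "user" | "assistant"; content: string }[],
): AsyncGenerator<StreamChunk> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, stream: true }),
  });
  if (!res.ok || !res.body) {
    yield { type: "error", message: `Ollama returned HTTP ${res.status}` };
    return;
  }
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Ollama streams one JSON object per line; split on newlines as they arrive.
    let nl: number;
    while ((nl = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, nl).trim();
      buffer = buffer.slice(nl + 1);
      if (!line) continue;
      const chunk = JSON.parse(line);
      if (chunk.message?.content) yield { type: "text", text: chunk.message.content };
      if (chunk.done) yield { type: "done" };
    }
  }
}
```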
Wiring It All Up
The `OllamaProvider` was then integrated into `src/server/services/llm/resolve-provider.ts`. A key design decision here was that Ollama doesn't require a DB API key; instead, its availability is checked directly via `isAvailable()`. This simplifies configuration for self-hosted instances.
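In sketch form, that decision looks like this (names are illustrative, not the exact `resolve-provider.ts` contents):

```typescript
// Sketch: Ollama is gated on liveness (isAvailable()), not on an API key
// stored in the DB like the cloud providers. Names are illustrative.
interface LLMProvider {
  isAvailable(): Promise<boolean>;
  // complete(), stream(), listModels() elided for brevity
}

async function resolveProvider(
  requested: "ollama" | "openai" | "anthropic",
  makeOllama: () => LLMProvider, // e.g. () => new OllamaProvider(baseUrl)
  makeCloud: (name: string) => Promise<LLMProvider>, // looks up the API key in the DB
): Promise<LLMProvider> {
  if (requested === "ollama") {
    const ollama = makeOllama();
    // No DB lookup: a live Ollama instance is its own credential.
    if (await ollama.isAvailable()) return ollama;
    throw new Error("Ollama requested but not reachable");
  }
  return makeCloud(requested);
}
```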
Finally, `src/lib/constants.ts` was updated to include Ollama models in our `MODEL_CATALOG` (`qwen2.5:7b` as default, `qwen2.5:3b`, `llama3.2:3b`) and `FAST_MODELS.ollama` for quick, lightweight inference. The default base URL for Ollama was set to `http://ollama:11434` (leveraging Docker's internal networking), with an `OLLAMA_BASE_URL` environment variable for overrides.
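Roughly, the new constants take this shape. This is illustrative only; in particular, which model lands in `FAST_MODELS.ollama` is my assumption, and the real `constants.ts` entries carry more metadata:

```typescript
// Sketch of the constants shape (illustrative, not the real file).
const OLLAMA_DEFAULT_MODEL = "qwen2.5:7b";
const OLLAMA_MODELS = ["qwen2.5:7b", "qwen2.5:3b", "llama3.2:3b"] as const;

// FAST_MODELS maps each provider to a cheap model for lightweight calls.
const FAST_MODELS = {
  ollama: "qwen2.5:3b", // assumption: the smaller model is the "fast" pick
} as const;

// Docker's internal DNS resolves the service name; override via env var.
const OLLAMA_BASE_URL = process.env.OLLAMA_BASE_URL ?? "http://ollama:11434";
```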
Deployment & Model Pulls
Before deploying, I freed up 57GB of disk space with `docker system prune -af` (dropping usage from 83% to 13%!), which was critical given the size of LLM models. After committing and pulling the changes, the Ollama container was started, and I manually pulled the `qwen2.5:7b` (4.5GB) and `qwen2.5:3b` (1.9GB) models. A quick rebuild of our app and everything was running. Verification confirmed Ollama was responding, inference was working, and our app was healthy.
Lessons from the Production Trenches (The "Pain Log" Transformed)
Not everything went smoothly, and these struggles provided valuable lessons:
1. Debugging Inside Containers: The `docker exec` Dilemma

- Problem: I needed to run a quick Node.js script inside the app container. My initial thought was `docker exec node -e "console.log('hello')"`.
- Pain: Shell escaping across SSH, `docker exec`, and `node -e` quickly became a nightmare. Special characters, quotes, and multi-line scripts were nearly impossible to manage reliably.
- Lesson Learned: For anything beyond the simplest one-liner, don't try to inline scripts.
- Actionable Takeaway:
  - Write your script to a temporary file (`/tmp/script.js` on the host).
  - Use `docker cp /tmp/script.js <container_id>:/app/script.js` to copy it into the container.
  - Then, `docker exec -w /app <container_id> node script.js` to run it.
  - Pro-tip: If your container runs as a non-root user (good practice!), you might need `docker exec -u 0` for cleanup commands (e.g., `rm script.js`) to gain root privileges temporarily.
2. Container Networking & Missing Tools
- Problem: I wanted to `curl` or `wget` an internal endpoint (e.g., Ollama's API) from inside another container (like our app container) to verify connectivity.
- Pain: Neither `curl` nor `wget` were installed in our lightweight app or Ollama Docker images. This is common for optimized production images.
- Lesson Learned: Don't assume common network tools are present in production container images.
- Actionable Takeaway: If you need to test internal container network communication from the host, find the container's IP address (e.g., `docker inspect <container_id> | grep "IPAddress"`) and `curl` it directly from the host. For example: `curl http://172.18.0.6:11434/api/tags`.
3. The Cost of Cloud LLMs: A Timely Reminder
- Problem: Our Anthropic API credits ran out, causing all Anthropic-dependent workflows to return 400 "credit balance too low" errors.
- Pain: This highlighted a critical vulnerability: reliance on a single, paid external provider for core functionality.
- Lesson Learned: Diversification and fallback mechanisms for critical external services are essential.
- Actionable Takeaway: Ollama's integration came at the perfect time, providing a free, self-hosted fallback. This reinforces the value of having multiple providers or an in-house option. Top up those credits, but also lean into the resilience of self-hosting.
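The fallback pattern itself is simple to express at the call site. A hedged sketch (the provider functions here are stand-ins, not our actual service code):

```typescript
// Sketch: try the primary cloud provider first; fall back to the
// self-hosted Ollama instance when the cloud call fails (e.g. a
// 400 "credit balance too low" error). Names are illustrative.
async function completeWithFallback(
  primary: (prompt: string) => Promise<string>,  // e.g. the Anthropic call
  fallback: (prompt: string) => Promise<string>, // e.g. the Ollama call
  prompt: string,
): Promise<string> {
  try {
    return await primary(prompt);
  } catch (err) {
    console.warn("[llm] primary provider failed, falling back to Ollama:", err);
    return fallback(prompt);
  }
}
```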
4. Healthchecks Matter (Even When They Don't)
- Problem: After deployment, `docker compose up -d app` reported our `app` service as "unhealthy."
- Pain: Initial panic, but quickly realized it was a false alarm. The app was responding perfectly fine via `/api/v1/health`. The issue was simply that we hadn't defined a healthcheck for the `app` service in our `docker-compose.production.yml`.
- Lesson Learned: Explicit healthchecks are invaluable for accurate service status reporting and automated restarts.
- Actionable Takeaway: Add a proper healthcheck to the `app` service in `docker-compose.production.yml` to reflect its true operational status.
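For reference, here's roughly what that might look like. The port and the use of Node 18+'s global `fetch` are assumptions about our app image; notably, the probe avoids `curl`/`wget`, which (per lesson 2) aren't installed:

```yaml
# Sketch: healthcheck for the app service. Port 3000 and Node >= 18
# (global fetch) are assumptions about the image, not verified config.
services:
  app:
    healthcheck:
      test: ["CMD-SHELL", "node -e \"fetch('http://localhost:3000/api/v1/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))\""]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 20s
```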
Current State & The Road Ahead
As of now, our production system is robust:
- All 1719 `workflow_insights` embeddings are populated.
- Ollama is running smoothly on `nyxcore-ollama-1` at `172.18.0.6:11434`, with `qwen2.5:7b` and `qwen2.5:3b` models available.
- Disk usage is healthy at ~56GB free.
Immediate next steps include topping up Anthropic credits, thoroughly testing Ollama from the UI, and potentially pulling `llama3.2:3b` for more model variety. I'll also be closely monitoring embedding logs to confirm new writes are succeeding.
This session was a great reminder that even with careful planning, production environments always have a few surprises in store. The key is to diagnose, fix, fortify, and learn from every challenge.
{"thingsDone":["Fixed broken pgvector embeddings for 1719 records","Implemented retry logic for embedding generation","Added comprehensive logging for embedding writes","Integrated Ollama self-hosted LLM into production via Docker-Compose","Developed a full OllamaProvider with streaming and model discovery","Wired Ollama into LLM resolution logic without API keys","Freed 57GB disk space via docker system prune","Deployed new code and pulled Ollama models on production"],"pains":["Shell escaping for inline scripts via docker exec","Lack of curl/wget in container images for debugging","Anthropic API credit depletion causing service outages","Misleading 'unhealthy' status due to missing Docker healthcheck"],"successes":["Successful backfill of all missing embeddings","Robust and resilient embedding generation pipeline","Seamless integration of a cost-effective, self-hosted LLM","Improved system resilience against external API failures","Efficient disk space management","Verified production deployment and functionality"],"techStack":["pgvector","Ollama","LLM","Docker","Docker-Compose","Prisma","Node.js","TypeScript","OpenAI API","Anthropic API","cURL"]}