nyxcore-systems

Resurrecting Embeddings & Bringing Local LLMs to Life with Ollama in Production

A deep dive into a recent development session, where we tackled critical vector database embedding issues and successfully integrated self-hosted Ollama LLMs into our production environment. Learn about the challenges, solutions, and lessons learned from deploying cutting-edge AI features.

Tags: LLM, Ollama, VectorDB, Embeddings, Prisma, Docker, Production, Hetzner, Troubleshooting

Ever had one of those development sessions where you dive in expecting a quick fix, only to uncover a deeper mystery? Or perhaps you're finally bringing a long-awaited feature to production, juggling infrastructure, code, and deployment nuances? That was my Wednesday afternoon. The mission: mend our broken vector database embeddings and, in the same breath, usher in the era of self-hosted LLMs via Ollama to our Hetzner production server.

It was a marathon of diagnosis, backfilling, infrastructure setup, and careful code integration. As I prepared to commit and deploy, I took a moment to document the journey – a "Letter to Myself" that I'm now sharing with you.

The Vector DB Resurrection: From NULL to Full

Our application relies heavily on vector embeddings to power intelligent features like insight extraction and knowledge retrieval. So, imagine the cold dread when diagnostics revealed a gaping hole: all 1719 workflow_insights on production had NULL embeddings. A critical component, silently failing.

The Case of the Missing Column

The hunt for the root cause began. It turned out to be a subtle clash in our database migration strategy: a prisma db push had, surprisingly, dropped the embedding column, and although our rls.sql script restored it, subsequent inline writes were still silently failing. The column existed, but the application wasn't successfully writing to it. It was a classic "it works on my machine (and in dev)" scenario, where the production environment's nuances unveiled a deeper issue.

Backfilling the Void

With the root cause identified, the immediate priority was data integrity. We needed to backfill those 1719 missing embeddings. I fired up our internal POST /api/v1/admin/backfill-embeddings endpoint. This wasn't a one-shot deal; due to transient 500 errors from the OpenAI API, it took three painstaking rounds to ensure every single insight got its rightful embedding.

Fortifying Against Future Failures

This experience highlighted a fragility in our embedding generation. To prevent future silent failures, I implemented a robust retry mechanism within our src/server/services/embedding-service.ts. The openaiEmbed() function now includes 3 retries with exponential backoff specifically for 5xx and 429 (rate limit) errors. This small but crucial change significantly improves the resilience of our embedding pipeline.

typescript
// src/server/services/embedding-service.ts (simplified)
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function openaiEmbed(texts: string[]): Promise<number[][]> {
  const maxRetries = 3;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: texts,
      });
      // Log successful embedding
      console.log(`Successfully embedded ${texts.length} items.`);
      return response.data.map((d) => d.embedding);
    } catch (error: any) {
      // Retry only transient failures: 5xx server errors and 429 rate limits.
      const retryable = error.status >= 500 || error.status === 429;
      if (!retryable || attempt === maxRetries - 1) {
        console.error("OpenAI embedding failed critically:", error);
        throw error;
      }
      const delay = Math.pow(2, attempt) * 1000; // Exponential backoff: 1s, 2s, 4s
      console.warn(`OpenAI embedding failed (attempt ${attempt + 1}/${maxRetries}): ${error.message}. Retrying in ${delay / 1000}s...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error(`Failed to embed text after ${maxRetries} attempts.`);
}

Finally, to enhance observability, I added success logging to all three critical embedding write paths: insight-persistence.ts, pipeline-insight-extractor.ts, and discussion-knowledge.ts. Now, we'll know for sure when an embedding has been successfully written.
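The pattern is the same in all three files. Here's a minimal sketch of one path, assuming a raw pgvector write and a uuid primary key; toSqlVector() is an illustrative helper, not the exact production code:

typescript
// insight-persistence.ts (illustrative sketch -- toSqlVector() is a stand-in
// serializing number[] into the '[1,2,3]' literal pgvector accepts)
import { PrismaClient } from "@prisma/client";
import { openaiEmbed } from "./embedding-service";

const prisma = new PrismaClient();
const toSqlVector = (v: number[]) => `[${v.join(",")}]`;

export async function persistInsightEmbedding(id: string, content: string) {
  const [embedding] = await openaiEmbed([content]);
  // pgvector columns aren't covered by Prisma's typed client, so write raw
  // (assuming a uuid primary key here).
  await prisma.$executeRaw`
    UPDATE workflow_insights
    SET embedding = ${toSqlVector(embedding)}::vector
    WHERE id = ${id}::uuid`;
  // The new success log: silence here is exactly what hid the original bug.
  console.log(`[embeddings] insight ${id}: wrote ${embedding.length}-dim embedding`);
}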

Bringing Local LLMs to Life with Ollama

While fixing the past, we were also building for the future: integrating self-hosted LLMs. Our choice: Ollama, a fantastic tool for running large language models locally. This move not only offers more control but also opens doors for cost-effective, private LLM inference.

Dockerizing Ollama

The first step was infrastructure. I added an ollama service to our docker-compose.production.yml. It's a lean setup: CPU-only, capped at 5GB RAM and 3 CPUs, with a dedicated ollama_data volume to persist downloaded models.

yaml
# docker-compose.production.yml (excerpt)
services:
  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '3'
          memory: 5G
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434" # Expose for host access if needed, though app uses internal network
volumes:
  ollama_data:

The Ollama Provider Adapter

The core of the integration lies in the new OllamaProvider adapter at src/server/services/llm/adapters/ollama.ts. This adapter translates our application's generic complete() and stream() LLM interface calls into requests compatible with Ollama's native /api/chat endpoint. It also includes isAvailable() and listModels() methods to allow our application to dynamically check Ollama's status and available models.
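In simplified form, the adapter looks something like this. The /api/chat and /api/tags endpoints are Ollama's real native API; the types and method signatures here are trimmed-down sketches, and stream() is omitted for brevity.

typescript
// src/server/services/llm/adapters/ollama.ts (simplified sketch)
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

export class OllamaProvider {
  constructor(
    private baseUrl: string = process.env.OLLAMA_BASE_URL ?? "http://ollama:11434"
  ) {}

  // Translate our generic complete() call into Ollama's native /api/chat format.
  async complete(model: string, messages: ChatMessage[]): Promise<string> {
    const res = await fetch(`${this.baseUrl}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false }),
    });
    if (!res.ok) throw new Error(`Ollama chat failed: ${res.status}`);
    const data = await res.json();
    return data.message.content;
  }

  // A cheap liveness probe: /api/tags only answers when the daemon is up.
  async isAvailable(): Promise<boolean> {
    try {
      const res = await fetch(`${this.baseUrl}/api/tags`);
      return res.ok;
    } catch {
      return false;
    }
  }

  // List locally pulled models via /api/tags.
  async listModels(): Promise<string[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    const data = await res.json();
    return data.models.map((m: { name: string }) => m.name);
  }
}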

Seamless System Integration

Wiring Ollama into our existing LLM resolution logic was straightforward. In src/server/services/llm/resolve-provider.ts, I special-cased Ollama: no API key is needed, and it checks for an OLLAMA_BASE_URL environment variable, defaulting to http://ollama:11434 (leveraging Docker's internal networking for service discovery). We also updated validateProviderAvailability() to correctly handle Ollama without needing a database key lookup.
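Roughly, the special case looks like this (a sketch: hasStoredApiKey() stands in for the real database key lookup, and the rest of the resolver is elided):

typescript
// src/server/services/llm/resolve-provider.ts (sketch)
import { OllamaProvider } from "./adapters/ollama";

// Hypothetical stand-in for our real database key lookup.
declare function hasStoredApiKey(provider: string): Promise<boolean>;

export function resolveOllamaBaseUrl(): string {
  // No API key needed; default to Docker's internal service name.
  return process.env.OLLAMA_BASE_URL ?? "http://ollama:11434";
}

export async function validateProviderAvailability(provider: string): Promise<boolean> {
  if (provider === "ollama") {
    // Special case: skip the database key lookup and probe the daemon directly.
    return new OllamaProvider(resolveOllamaBaseUrl()).isAvailable();
  }
  return hasStoredApiKey(provider);
}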

Model Selection and Optimization

To give our users choice, I added three models to our MODEL_CATALOG in src/lib/constants.ts: qwen2.5:7b (our new default for Ollama), qwen2.5:3b (a faster, lighter option), and llama3.2:3b. The qwen2.5:3b model was specifically added to our FAST_MODELS list for quick, less resource-intensive operations.
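For illustration, here's the shape of those entries (simplified; the real catalog entries carry more metadata than shown):

typescript
// src/lib/constants.ts (excerpt, shape simplified)
export const MODEL_CATALOG = {
  "qwen2.5:7b": { provider: "ollama", label: "Qwen 2.5 7B (Ollama default)" },
  "qwen2.5:3b": { provider: "ollama", label: "Qwen 2.5 3B (fast, lighter)" },
  "llama3.2:3b": { provider: "ollama", label: "Llama 3.2 3B" },
} as const;

// qwen2.5:3b joins the fast tier for quick, low-resource operations.
export const FAST_MODELS = [/* ...existing fast models... */ "qwen2.5:3b"];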

As a practical pre-deployment step, I freed up a significant 57GB of disk space on the production server using docker system prune. Our 83% disk usage dropped to a healthy 13%, making ample room for the new Ollama models.

Lessons Learned from the Trenches

No deployment is without its quirks, especially when dealing with production environments and containerization. Here are a few "gotchas" that taught me valuable lessons:

  • The docker exec Escaping Nightmare: My initial thought for quick debugging inside the app container was docker exec nyxcore-app-1 node -e 'console.log(process.env.MY_VAR)'. Simple enough, right? Wrong. The dollar sign escaping across SSH -> docker exec -> node -e proved to be an impossible puzzle.

    • Lesson: For anything more complex than a trivial one-liner, write your script to a temporary file, docker cp it into the container, and then docker exec to run it. Remember that containers often run as non-root users, so docker exec -u 0 might be needed for cleanup.
  • The Missing curl: When I needed to quickly check an internal endpoint from within the app container, my trusty curl command failed – it simply wasn't installed in our lean Node.js production image.

    • Lesson: Don't assume common utilities are present in minimal production images. For external checks, leverage the host machine. I used docker exec nyxcore-app-1 printenv AUTH_SECRET from the host to get the necessary token, then curl from the host directly to https://nyxcore.cloud. This also highlighted that our backfill endpoint falls back to AUTH_SECRET if ADMIN_SECRET isn't set in the container's environment (sketched below). Good to know!
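For reference, that fallback boils down to something like this (a sketch; the handler shape and Bearer-header scheme are simplifications):

typescript
// POST /api/v1/admin/backfill-embeddings -- sketch of just the auth check
export async function POST(req: Request): Promise<Response> {
  // Fall back to AUTH_SECRET when ADMIN_SECRET isn't set in the container.
  const secret = process.env.ADMIN_SECRET ?? process.env.AUTH_SECRET;
  if (!secret || req.headers.get("authorization") !== `Bearer ${secret}`) {
    return new Response("Unauthorized", { status: 401 });
  }
  // ...find workflow_insights rows with NULL embeddings and re-embed them...
  return Response.json({ ok: true });
}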

The Road Ahead: Deployment Plan

With the code polished, the infrastructure defined, and the lessons learned, the final push to production is imminent. Here's the deployment sequence I'm about to execute:

  1. Commit & Push: All changes (docker-compose, Ollama adapter, LLM resolver, constants, embedding retry/logging) are bundled into a single commit.
  2. Pull on Production: The latest main branch is pulled onto our Hetzner server.
  3. Start Ollama: docker compose -f docker-compose.production.yml up -d ollama brings the Ollama service online.
  4. Pull Models: We then pull the necessary models into the Ollama container:
    • docker exec nyxcore-ollama-1 ollama pull qwen2.5:7b (our default, ~4.4GB)
    • docker exec nyxcore-ollama-1 ollama pull qwen2.5:3b (our fast model, ~2GB)
  5. Rebuild & Restart App: The application container needs to be rebuilt to incorporate the new code, then restarted: docker compose -f docker-compose.production.yml build --no-cache app && docker compose -f docker-compose.production.yml up -d app.
  6. Verification: The crucial step. Select Ollama as a provider in our workflow/enrichment UI and verify it performs as expected; a quick API-level smoke test is sketched after this list.
  7. Optional Models: If needed, we'll pull llama3.2:3b as another lightweight alternative.
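For step 6 (and to double-check step 4), a throwaway script against Ollama's API makes a handy smoke test before touching the UI. The model and prompt here are just examples; run it from the host, where port 11434 is published:

typescript
// verify-ollama.ts -- throwaway smoke test (run with an ESM-aware runner)
const base = process.env.OLLAMA_BASE_URL ?? "http://localhost:11434";

// Step 4 check: are the pulled models visible?
const tags = await (await fetch(`${base}/api/tags`)).json();
console.log("models:", tags.models.map((m: { name: string }) => m.name));

// Step 6 check: does a one-shot completion come back?
const res = await fetch(`${base}/api/chat`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5:7b",
    messages: [{ role: "user", content: "Reply with one short sentence." }],
    stream: false,
  }),
});
console.log((await res.json()).message.content);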

Our production environment (root@46.225.232.35, nyxcore.cloud) is ready. The OLLAMA_BASE_URL will default to http://ollama:11434 within Docker, and our pgvector setup is healthy with all 1719 embeddings happily populated. We have 63GB of free disk space after the prune, ready for those new model downloads.


This session was a prime example of the multi-faceted nature of modern web development: diagnosing database integrity issues, enhancing system resilience, and integrating cutting-edge AI capabilities, all while navigating the practicalities of production deployment. It's these kinds of challenges that make the work so rewarding.

Here's to robust systems and smarter applications!