# The Great Backfill & The Local LLM Leap: A Production Deployment Tale
Join me on a recent production adventure: tackling a critical vector database embedding failure, integrating self-hosted LLMs via Ollama, and collecting hard-won lessons from the trenches.
Alright team, buckle up! Just wrapped up a rather intense development session that took us through the depths of a broken production vector database and straight into the exciting world of self-hosting LLMs. The goal was clear: fix some critical data integrity issues and bring Ollama into our Hetzner production environment. What started as a 'quick' deployment turned into a full-on debugging marathon, but we emerged victorious. Here’s the story, the fixes, and the hard-earned lessons.
## The Case of the Missing Embeddings: A Vector DB Detective Story
Our application relies heavily on vector embeddings for features like `workflow_insights`. So, imagine the cold sweat when we discovered a whopping 1719 `workflow_insights` records on production were sporting NULL embeddings. That's essentially a critical part of our AI features silently failing.
**The Root Cause:** This one was tricky. It turns out a `prisma db push` operation, intended to apply schema changes, had inadvertently dropped our `embedding` column. Our `rls.sql` script (for Row-Level Security) would restore the column, but in the window before it did, inline embedding writes failed silently, because the application code assumed the column was always there. A classic race condition meets silent failure: a developer's nightmare.
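Before getting to what we actually shipped, it's worth noting how cheap a guard against this class of schema drift can be. Here's a minimal sketch of a startup assertion (not something from this session; the import path and helper name are illustrative assumptions):

```typescript
// Illustrative startup guard (not part of this session's fix): fail fast
// if the embedding column ever goes missing again, instead of failing silently.
import { prisma } from "@/server/db"; // assumed Prisma client export

export async function assertEmbeddingColumn(): Promise<void> {
  const rows = await prisma.$queryRaw<{ column_name: string }[]>`
    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = 'workflow_insights' AND column_name = 'embedding'
  `;
  if (rows.length === 0) {
    throw new Error("workflow_insights.embedding column is missing: schema drift detected");
  }
}
```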
**The Fix:**

- **Backfill:** First order of business was getting those embeddings back. We spun up a dedicated `POST /api/v1/admin/backfill-embeddings` endpoint (sketched just after this list). This took several rounds of execution, largely due to transient `500` errors from OpenAI's API. Persistence paid off, and eventually all 1719 records were re-embedded.
- **Robustness:** To prevent future issues with external API instability, we beefed up our `openaiEmbed()` function in `src/server/services/embedding-service.ts`. It now includes a retry mechanism with exponential backoff for `5xx` and `429` errors. Because when you're relying on external services, you *will* face transient issues.

  ```typescript
  // src/server/services/embedding-service.ts (simplified)
  async function openaiEmbed(text: string): Promise<number[]> {
    let retries = 0;
    const MAX_RETRIES = 3;
    while (retries < MAX_RETRIES) {
      try {
        const response = await openai.embeddings.create({
          model: "text-embedding-ada-002",
          input: text,
        });
        return response.data[0].embedding;
      } catch (error: any) {
        // Retry transient server errors (5xx) and rate limits (429) with exponential backoff.
        if (error.status >= 500 || error.status === 429) {
          console.warn(`OpenAI embedding failed (status: ${error.status}). Retrying in ${Math.pow(2, retries)}s...`);
          await new Promise((res) => setTimeout(res, Math.pow(2, retries) * 1000));
          retries++;
        } else {
          throw error; // Re-throw other errors immediately
        }
      }
    }
    throw new Error(`Failed to get OpenAI embedding after ${MAX_RETRIES} retries.`);
  }
  ```

- **Visibility:** Finally, we added success logging to all three critical embedding write paths: `insight-persistence.ts`, `pipeline-insight-extractor.ts`, and `discussion-knowledge.ts`. No more silent failures for us!
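The backfill endpoint itself is conceptually simple: find rows with NULL embeddings, re-embed, write back. Here's a rough sketch of the loop, with the caveat that the function shape, batch size, and column names are assumptions rather than our exact production code. pgvector columns are opaque to the Prisma client, hence the raw SQL:

```typescript
// Sketch of the loop behind POST /api/v1/admin/backfill-embeddings.
// Batching and naming are illustrative, not the exact production code.
import { prisma } from "@/server/db"; // assumed Prisma client export
import { openaiEmbed } from "@/server/services/embedding-service";

export async function backfillEmbeddings(batchSize = 50): Promise<number> {
  let fixed = 0;
  for (;;) {
    // pgvector columns aren't supported by the Prisma client, so use raw SQL.
    const rows = await prisma.$queryRaw<{ id: string; content: string }[]>`
      SELECT id, content FROM workflow_insights
      WHERE embedding IS NULL
      LIMIT ${batchSize}
    `;
    if (rows.length === 0) return fixed;

    for (const row of rows) {
      const vector = await openaiEmbed(row.content); // retries 5xx/429 internally
      await prisma.$executeRaw`
        UPDATE workflow_insights
        SET embedding = ${`[${vector.join(",")}]`}::vector
        WHERE id = ${row.id}
      `;
      fixed++;
    }
  }
}
```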
## Bringing LLMs Home: Ollama Integration
With our embeddings back in shape, it was time for the fun part: integrating self-hosted LLMs via Ollama. The goal was to gain more control, potentially reduce costs, and experiment with a wider range of models directly on our infrastructure.
**Setting up Ollama:** We added `ollama` as a new service to our `docker-compose.production.yml`. It's configured for CPU-only operation (we're on a VM, not a GPU instance), with sensible resource limits (5GB RAM, 3 CPUs) and a dedicated `ollama_data` volume for model persistence.
```yaml
# docker-compose.production.yml (excerpt)
services:
  ollama:
    image: ollama/ollama:latest
    container_name: nyxcore-ollama-1
    restart: unless-stopped
    profiles: ["ollama"] # Only start if explicitly requested
    ports:
      - "11434:11434" # Expose for host access if needed
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        limits:
          cpus: '3.0'
          memory: 5G
```
**The OllamaProvider Adapter:** Our LLM architecture uses a provider pattern, so integrating Ollama meant creating a new `OllamaProvider` adapter at `src/server/services/llm/adapters/ollama.ts`. This adapter implements our standard `complete()`, `stream()`, `isAvailable()`, and `listModels()` methods, all talking to Ollama's native `/api/chat` endpoint.
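To give a feel for the adapter, here's a condensed sketch of the non-streaming paths (streaming omitted for brevity). The class shape and return types are simplified assumptions about our provider contract; the Ollama endpoints themselves are real, and the sketch leans on `/api/tags` (Ollama's native model-listing endpoint) for availability and model listing:

```typescript
// src/server/services/llm/adapters/ollama.ts (condensed sketch, not verbatim)
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

export class OllamaProvider {
  constructor(
    private baseUrl = process.env.OLLAMA_BASE_URL ?? "http://ollama:11434"
  ) {}

  // complete(): one-shot chat completion via Ollama's native /api/chat.
  async complete(model: string, messages: ChatMessage[]): Promise<string> {
    const res = await fetch(`${this.baseUrl}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false }),
    });
    if (!res.ok) throw new Error(`Ollama /api/chat failed with ${res.status}`);
    const data = await res.json();
    return data.message.content; // Ollama returns { message: { role, content }, ... }
  }

  // isAvailable(): Ollama is "up" if it answers at all; no API key involved.
  async isAvailable(): Promise<boolean> {
    try {
      return (await fetch(`${this.baseUrl}/api/tags`)).ok;
    } catch {
      return false;
    }
  }

  // listModels(): /api/tags lists locally pulled models.
  async listModels(): Promise<string[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    const data = await res.json();
    return (data.models ?? []).map((m: { name: string }) => m.name);
  }
}
```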
**Wiring it Up:**

- `src/server/services/llm/resolve-provider.ts` was updated to special-case Ollama: no API key needed, and it checks for `OLLAMA_BASE_URL` in environment variables (defaulting to `http://ollama:11434` for Docker-internal communication). `validateProviderAvailability()` was also adjusted to correctly handle Ollama's availability without needing a database API key lookup.
- We added three models to our `MODEL_CATALOG` in `src/lib/constants.ts`: `qwen2.5:7b` (as our new default self-hosted option), `qwen2.5:3b` (for faster, lighter tasks), and `llama3.2:3b`. `qwen2.5:3b` also landed in `FAST_MODELS` (see the catalog sketch after this list).
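The catalog change is just data. A sketch of what those entries look like, assuming a simple map keyed by model ID (the real `MODEL_CATALOG` shape in `src/lib/constants.ts` has more fields):

```typescript
// src/lib/constants.ts (illustrative shape; the real catalog differs in detail)
export const MODEL_CATALOG = {
  // ...existing hosted models...
  "qwen2.5:7b": { provider: "ollama", label: "Qwen 2.5 7B (self-hosted default)" },
  "qwen2.5:3b": { provider: "ollama", label: "Qwen 2.5 3B (fast, lighter tasks)" },
  "llama3.2:3b": { provider: "ollama", label: "Llama 3.2 3B" },
} as const;

// qwen2.5:3b also joins the fast tier.
export const FAST_MODELS = [/* ...existing entries..., */ "qwen2.5:3b"];
```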
**A Quick Win on Disk Space:** Before deploying, our production server was at 83% disk utilization. A quick `docker system prune` freed up a massive 57GB, bringing us down to a comfortable 13% utilization. Always a good feeling!
## Lessons from the Trenches: The "Pain Log" Transformed
Not every step was smooth sailing. Here are some of the critical lessons learned from the debugging phase:
### 1. Debugging Inside Production Containers: Don't Fight the Shell
- **The Challenge:** I needed to run some quick diagnostic Node.js scripts inside the running application container on production. My go-to was `docker exec nyxcore-app-1 node -e 'console.log(process.env.MY_VAR)'`.
- **The Pain:** Escaping dollar signs (`$`) for environment variables across SSH, then `docker exec`, then `node -e` became an impossible syntax nightmare. The shell parsing layers were just too complex to reliably get a variable like `$MY_VAR` through.
- **The Workaround:** Instead of trying to pass a complex one-liner, I wrote the diagnostic script to a temporary file on the host (`/tmp/debug.js`), used `docker cp /tmp/debug.js nyxcore-app-1:/app/debug.js` to copy it into the container, and then executed it with `docker exec -w /app nyxcore-app-1 node debug.js` (an example script follows this list).
- **The Takeaway:** For anything more complex than a trivial command, `docker cp` is your friend. It isolates the script execution from the host's shell escaping rules. Also, remember containers often run as non-root users; use `docker exec -u 0` for cleanup tasks if permissions are an issue.
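What goes into such a script is whatever you'd otherwise have crammed into `node -e`. Mine looked roughly like this (the exact variable names checked are illustrative):

```typescript
// /tmp/debug.js: copied into the container with docker cp, so no shell
// escaping of $ is needed; it runs with the container's own environment.
const keys = ["DATABASE_URL", "OLLAMA_BASE_URL", "ADMIN_SECRET", "AUTH_SECRET"];
for (const key of keys) {
  // Log presence only, never values; this is a production box.
  console.log(`${key}: ${process.env[key] ? "set" : "missing"}`);
}
```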
### 2. Minimalist Container Images & Environment Variable Access
- **The Challenge:** I needed to make an internal `curl` request from within the app container to verify an endpoint.
- **The Pain:** The Node.js production image is lean: `curl` wasn't installed. This meant I couldn't easily test internal network calls directly.
- **The Workaround:** I resorted to `docker exec nyxcore-app-1 printenv AUTH_SECRET` from the host to grab the necessary authentication token, and then made the `curl` request from the host machine to the public https://nyxcore.cloud endpoint.
- **The Takeaway:** Production images are minimal for a reason (security, size). Don't assume common dev tools are present. Always know your environment variables: in this case, the backfill endpoint fell back to `AUTH_SECRET` because `ADMIN_SECRET` wasn't explicitly set in the container, which was an important detail for authentication. When container tools are missing, leverage host tools or temporary `docker run` images to diagnose (or the fetch-based stand-in sketched after this list).
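One more option worth knowing: the image may lack `curl`, but it always has Node, and Node 18+ ships a global `fetch`. So the lesson-1 trick (write a script, `docker cp` it in) doubles as a curl substitute. A sketch, with the local port and header shape assumed:

```typescript
// /tmp/check-endpoint.js: a curl stand-in for inside the container.
// The local port and Authorization header format are assumptions.
const secret = process.env.ADMIN_SECRET ?? process.env.AUTH_SECRET; // same fallback the endpoint uses

(async () => {
  const res = await fetch("http://localhost:3000/api/v1/admin/backfill-embeddings", {
    method: "POST",
    headers: { Authorization: `Bearer ${secret}` },
  });
  console.log(res.status, await res.text());
})();
```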
## Wrapping Up & What's Next
This session was a fantastic reminder of the realities of production development: critical bug fixes, exciting new feature integrations, and the inevitable debugging hurdles. We've successfully resurrected our vector database embeddings and brought the power of self-hosted LLMs into our stack with Ollama.
Immediate Next Steps (Post-Deployment Checklist):
- Commit all changes (docker-compose, Ollama adapter, provider resolution, constants, embedding retry/logging).
- Push to `main` and pull on the production server.
- Start Ollama: `docker compose -f docker-compose.production.yml up -d ollama`.
- Pull the default Ollama models: `docker exec nyxcore-ollama-1 ollama pull qwen2.5:7b` (~4.4GB) and `ollama pull qwen2.5:3b` (~2GB).
- Rebuild and restart the app: `docker compose -f docker-compose.production.yml build --no-cache app && docker compose -f docker-compose.production.yml up -d app`.
- **Crucial test:** Select Ollama as the provider in our workflow/enrichment UI and verify it works end-to-end.
- (Optional) Pull `llama3.2:3b` for more lightweight experimentation.
It's always a journey, but seeing these pieces come together is incredibly rewarding. What are your go-to debugging tricks in production, or your experiences with self-hosting LLMs? Share in the comments!