nyxcore-systems

Beyond the Default: Taming Ollama's Context and Timeouts for Robust AI Workflows

A deep dive into our recent development sprint, tackling critical Ollama context and timeout issues to ensure reliable AI-powered workflows, alongside a significant improvement in user-facing workflow naming.

LLM, Ollama, TypeScript, Debugging, Performance, UX, AI, Development

Development is rarely a straight line. It's often a dance between exciting new features and the nitty-gritty of squashing bugs, optimizing performance, and refining the user experience. This past session was a perfect example, a focused sprint to resolve two key areas: making our AI-driven group workflows more intuitive and, crucially, wrestling with the underlying performance and reliability of our self-hosted LLM, Ollama.

The good news? Both fixes are now deployed to production, making our application more stable and user-friendly. Let's break down what we tackled.

Making Sense of AI Workflows: A Naming Overhaul

One of the core features of our application involves grouping related actions into AI-generated workflows. While the underlying AI logic was powerful, the way these groups were named could be... less than ideal. Imagine a workflow group titled "Action 1 Title, Action 2 Title, Action 3 Title..." – not exactly user-friendly, especially with many actions.

The Problem: Our createGroupWorkflow function was simply concatenating the titles of all items within a group, leading to unwieldy and uninformative group labels.

The Solution: We refined the workflow creation logic to introduce an optional groupName input. Now, when a user selects multiple action points that originate from the same source note, our frontend intelligently derives a meaningful group name from that sourceNote.title.

For instance, instead of: "Draft intro paragraph, Research key points, Outline section 2"

You now see something much clearer: "feat: blog post draft (3 actions)"

This small change significantly improves the clarity and usability of our workflow dashboard.

  • Code Touches:
    • src/server/trpc/routers/action-points.ts: Modified createGroupWorkflow to accept an optional groupName.
    • src/app/(dashboard)/dashboard/projects/[id]/page.tsx: Updated the frontend to pass a groupName derived from the shared source note's title (sourceNote.title) when applicable.
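
For illustration, here's a minimal sketch of how that frontend derivation might look. The helper name and the exact shape of the action-point objects are assumptions for the example, not the actual implementation:

typescript
// Hypothetical helper: derive a group name only when every selected action
// point shares the same source note; otherwise let the server fall back
// to its default naming in createGroupWorkflow.
interface ActionPoint {
  title: string;
  sourceNote?: { id: string; title: string };
}

function deriveGroupName(selected: ActionPoint[]): string | undefined {
  const noteIds = new Set(selected.map((a) => a.sourceNote?.id));
  if (noteIds.size === 1 && selected[0]?.sourceNote) {
    // e.g. "feat: blog post draft (3 actions)"
    return `${selected[0].sourceNote.title} (${selected.length} actions)`;
  }
  return undefined;
}

Returning undefined simply leaves the previous behaviour in place for selections that span multiple source notes.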

Taming the Titan: Ollama Context and Timeout Challenges

This was the heavier lift of the session, addressing critical stability issues with our self-hosted Ollama LLM. When integrating large language models, especially open-source ones run locally or on modest hardware, you quickly run into practical limits. Our application uses a qwen2.5:7b model, which, while capable, demands careful resource management.

The Context Conundrum: When Prompts Get Truncated

The Challenge: Our AI workflow prompts, which incorporate detailed project wisdom, memory, and specific instructions ({{project.wisdom}}, {{memory}}, etc.), were growing in complexity and length. Ollama's default context window for qwen2.5:7b is 4096 tokens.

The Symptom: We started seeing ominous warnings in the Ollama logs:

WARN source=runner.go:153 msg="truncating input prompt" limit=4096 prompt=5010

This meant our carefully crafted prompts were being cut off mid-sentence, leading to incomplete or nonsensical AI responses and, often, outright 500 errors. Our prompts were consistently landing around 5,010 tokens, roughly 900 tokens over the default limit.

The Fix: The solution was straightforward but crucial: explicitly increasing the num_ctx parameter in our Ollama request options. We bumped it from 4096 to 8192.

typescript
// src/server/services/llm/adapters/ollama.ts
// ... inside the Ollama adapter configuration
const response = await fetch(`${this.baseUrl}/api/generate`, {
  method: 'POST',
  body: JSON.stringify({
    model,
    prompt,
    options: {
      num_ctx: 8192, // Increased context window!
      // ... other options
    },
    // ...
  }),
  // ...
});

This change immediately resolved the truncation warnings and allowed our full prompts to be processed.
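
To keep an eye on how close we now sit to that limit, the final non-streaming /api/generate response from Ollama reports how many tokens the prompt actually consumed (prompt_eval_count). A rough guard like the one below, with a threshold we picked arbitrarily at 90%, can flag prompts that are creeping back toward the ceiling:

typescript
// Continuing from the fetch call above: warn when the prompt nears the
// configured context window. prompt_eval_count is the prompt token count
// Ollama includes in the final /api/generate response body.
const NUM_CTX = 8192;
const data = (await response.json()) as {
  response: string;
  prompt_eval_count?: number;
};

if ((data.prompt_eval_count ?? 0) > NUM_CTX * 0.9) {
  console.warn(
    `Ollama prompt used ${data.prompt_eval_count}/${NUM_CTX} tokens; ` +
      `consider raising num_ctx or trimming the prompt templates.`
  );
}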

The Timeout Tangle: Waiting for CPU Inference

The Challenge: Even with the context issue resolved, we were still hitting 500 errors, especially for longer, non-streaming AI completions. The culprit? Timeouts. Running a 7B LLM on a CPU-only server, even a capable one, takes time, and the default fetch timeout in our server runtime was simply too short.

The Symptom: After about 5 minutes, non-streaming complete() calls would fail, even if Ollama was still actively processing the request.

The Fix: We introduced an explicit, longer timeout for non-streaming calls using AbortSignal.timeout. We set it to a generous 10 minutes (600,000 milliseconds) to account for the CPU-bound inference time.

typescript
// src/server/services/llm/adapters/ollama.ts
// ... inside the non-streaming complete() method
const response = await fetch(`${this.baseUrl}/api/generate`, {
  method: 'POST',
  body: JSON.stringify({
    model,
    prompt,
    options: {
      num_ctx: 8192,
      // ...
    },
  }),
  signal: AbortSignal.timeout(600_000), // 10-minute timeout for CPU-only inference
});

It's worth noting that streaming requests don't typically hit this particular timeout: as long as data chunks keep flowing, the connection remains active.
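
For completeness, here's roughly what the streaming path looks like. Ollama streams newline-delimited JSON chunks, and reading them as they arrive is what keeps the connection alive; buffering of lines split across chunk boundaries is omitted here for brevity:

typescript
// Simplified streaming read: Ollama emits one JSON object per line, each with
// a partial "response" string, until a final chunk arrives with done: true.
const res = await fetch(`${this.baseUrl}/api/generate`, {
  method: 'POST',
  body: JSON.stringify({ model, prompt, stream: true, options: { num_ctx: 8192 } }),
  // No AbortSignal.timeout here: as long as chunks keep flowing,
  // the request stays alive even on slow CPU-only inference.
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let output = '';

outer: while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (!line.trim()) continue;
    const chunk = JSON.parse(line);
    output += chunk.response ?? '';
    if (chunk.done) break outer;
  }
}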

These two fixes have dramatically improved the reliability of our Ollama integration, ensuring that our AI workflows execute consistently without truncation or premature timeouts.

Looking Ahead: Optimizations and Future Steps

With these critical fixes deployed (c5fa142), our system is significantly more stable. However, the journey of optimization never truly ends, especially when dealing with LLMs.

Currently, Ollama is our primary LLM provider as our Anthropic API credits are depleted. While Ollama works, CPU-only inference is noticeably slower.

Our immediate next steps include:

  1. Top Up Anthropic API Credits: Re-enabling Anthropic will provide a faster, more robust alternative for certain tasks.
  2. Monitor Ollama Logs: Closely observe production logs to ensure the context and timeout fixes are holding up under real-world load.
  3. Consider Further num_ctx Increase: If prompts continue to grow, we might need to increase num_ctx to 16384, understanding that this consumes more RAM (see the configuration sketch after this list).
  4. Model Optimization: Explore switching the default Ollama model to qwen2.5:3b for faster CPU inference, potentially trading off some capability for speed.
  5. Hardware Upgrade (Optional but impactful): Adding a GPU to our Hetzner server would dramatically accelerate Ollama inference, making our self-hosted LLM experience much snappier.
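
To make the model and context-window experiments cheap to try, one option is to lift both into environment variables instead of hard-coding them in the adapter. The variable names below (OLLAMA_MODEL, OLLAMA_NUM_CTX) are illustrative, not what we ship today:

typescript
// Illustrative configuration: read the Ollama model and context window from the
// environment so switching to qwen2.5:3b or raising num_ctx to 16384 needs no code change.
const OLLAMA_MODEL = process.env.OLLAMA_MODEL ?? 'qwen2.5:7b';
const OLLAMA_NUM_CTX = Number(process.env.OLLAMA_NUM_CTX ?? 8192);

const response = await fetch(`${this.baseUrl}/api/generate`, {
  method: 'POST',
  body: JSON.stringify({
    model: OLLAMA_MODEL,
    prompt,
    options: { num_ctx: OLLAMA_NUM_CTX },
  }),
  signal: AbortSignal.timeout(600_000),
});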

This session was a great reminder that building robust AI-powered applications involves not just clever algorithms but also a deep understanding of the underlying infrastructure, model limitations, and the constant pursuit of a better user experience. We're excited about these improvements and what's next!