nyxcore-systems

Wrestling with Ollama: How I Tackled LLM Context Limits and Stubborn Timeouts

Dive into a recent dev session where I wrestled with common challenges in self-hosting LLMs – specifically Ollama's context window and stubborn timeouts – alongside refining a critical workflow naming feature.

Ollama · LLM · Context Window · Timeout · TypeScript · Next.js · tRPC · Debugging · AI Infrastructure · Developer Workflow

The life of a solo developer, especially when building an AI-powered tool, is a constant dance between feature development and infrastructure wrangling. Every now and then, you hit a session that beautifully encapsulates this duality – a mix of refining user-facing features and diving deep into the guts of the LLM stack. This past week, I had one such session, culminating in two critical fixes deployed to production.

Let's break down the journey, the pains, and the lessons learned.

The Quest: Better Workflows, Smarter AI

My primary goals for this session were twofold:

  1. Refine Group Workflow Naming: Make the automated grouping of tasks more intuitive and readable for users.
  2. Stabilize Ollama: Tackle some insidious context window truncation and frustrating timeout issues that were plaguing our locally hosted LLM inference.

Both are now live, thanks to commits 7238f34 and c5fa142.

Part 1: Naming Things - The Developer's Old Nemesis

One of the hardest problems in computer science, they say, is naming things. This holds true even for auto-generated workflow groups. Previously, when a user selected multiple action points and grouped them into an AI-generated workflow, the group label was a clumsy concatenation of all selected item titles. This was far from ideal, especially for complex workflows.

The fix was elegant and user-centric: if all selected action points originated from the same sourceNoteId, we could derive a much more meaningful group name from that source note's title.

Before: "Action 1: Buy milk. Action 2: Buy eggs. Action 3: Buy bread." After: "feat: Grocery List (3 actions)" (if all came from a "Grocery List" note)

Here's a peek at how this was handled in src/server/trpc/routers/action-points.ts:

typescript
// Inside createGroupWorkflow input definition
interface CreateGroupWorkflowInput {
  actionPointIds: string[];
  // ... other inputs
  groupName?: string; // New optional input
}

// And in src/app/(dashboard)/dashboard/projects/[id]/page.tsx
// ... when creating a group workflow
const getGroupName = (selectedItems: ActionPoint[]): string | undefined => {
  if (selectedItems.length === 0) return undefined;

  const firstSourceNoteId = selectedItems[0].sourceNoteId;
  const allFromSameNote = selectedItems.every(
    (item) => item.sourceNoteId === firstSourceNoteId
  );

  if (allFromSameNote && firstSourceNoteId) {
    // Assumes sourceNote is pre-fetched alongside each action point
    const sourceTitle = selectedItems[0].sourceNote?.title;
    if (sourceTitle) {
      return `${sourceTitle} (${selectedItems.length} actions)`;
    }
  }
  return undefined; // Fall back to default title concatenation when items span multiple notes
};

// Then passed to the tRPC call:
const groupName = getGroupName(selectedActionPoints);
createGroupWorkflow.mutate({ actionPointIds, groupName });
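
On the server side, the router just threads the optional groupName through. Below is a minimal sketch of that shape, assuming a zod-validated tRPC procedure; protectedProcedure, router, and buildLabelFromTitles are illustrative names, not the actual implementation:

typescript
// Illustrative sketch of src/server/trpc/routers/action-points.ts, NOT the real router code
import { z } from 'zod';
import { protectedProcedure, router } from '../trpc'; // assumed helpers

// Hypothetical helper reproducing the old behaviour (concatenating item titles)
declare function buildLabelFromTitles(actionPointIds: string[]): Promise<string>;

export const actionPointsRouter = router({
  createGroupWorkflow: protectedProcedure
    .input(
      z.object({
        actionPointIds: z.array(z.string()).min(1),
        groupName: z.string().optional(), // the new optional input
      })
    )
    .mutation(async ({ input }) => {
      // Prefer the client-derived name; otherwise fall back to title concatenation
      const label =
        input.groupName ?? (await buildLabelFromTitles(input.actionPointIds));
      // ... create the workflow group using `label`
      return { label };
    }),
});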

This small change significantly improves the clarity and usability of our workflow management, making generated groups immediately understandable.

Part 2: Ollama's Gauntlet - Context, Timeouts, and CPU Woes

This was the more challenging beast. Our system relies heavily on Ollama, currently running qwen2.5:7b on a CPU-only server, to power our core AI workflows. These workflows involve complex prompts that include {{project.wisdom}}, {{memory}}, and other contextual data.

The Pain: Truncation and Mysterious 500s

I started noticing a familiar, unsettling message in the Ollama logs:

WARN source=runner.go:153 msg="truncating input prompt" limit=4096 prompt=5010

This was the smoking gun. Ollama loads models with a default num_ctx (4096 tokens in our setup), regardless of the much larger context qwen2.5:7b natively supports. Our workflow prompts, especially with all the dynamic context injected, were consistently hitting around 5010 tokens, which meant roughly 20% of every prompt was being silently dropped, leading to incomplete or nonsensical responses.

Compounding this, these requests often came back as 500 errors after about 5 minutes. While the truncation was clear, the timeout was more opaque: CPU-only qwen2.5:7b inference is slow, and a single completion can easily run past the roughly five-minute default timeouts of Node's fetch implementation (undici), which then aborts the request and surfaces as a 500.
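
In hindsight, the application could have flagged this itself, since the truncation warning only shows up in Ollama's logs. The snippet below is a rough, hypothetical guard (the helper names and the ~4 characters/token ratio are assumptions, not production code) that logs a warning whenever a prompt gets close to the configured context window:

typescript
// Hypothetical helper, not production code: a crude ~4 characters/token estimate.
// It will never match the real tokenizer, but it is enough to spot a 5000-token
// prompt heading into a 4096-token window before Ollama silently truncates it.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function warnIfPromptNearLimit(fullPrompt: string, numCtx: number): void {
  const estimated = estimateTokens(fullPrompt);
  if (estimated > numCtx * 0.9) {
    console.warn(
      `Prompt is ~${estimated} tokens, close to num_ctx=${numCtx}; ` +
        'expect truncation or trim the injected context.'
    );
  }
}

// e.g. warnIfPromptNearLimit(fullPrompt, 4096) before calling Ollama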

The Fix: More Context, More Time

The solutions, once identified, were relatively straightforward but crucial.

  1. Increased Context Window (num_ctx): I bumped the num_ctx parameter in our Ollama adapter from the default 4096 to 8192. This gives our prompts ample breathing room.

    typescript
    // src/server/services/llm/adapters/ollama.ts
    // ... inside the complete() function
    const response = await fetch(`${this.ollamaUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: modelName,
        prompt: fullPrompt,
        stream: false, // For non-streaming calls
        options: {
          num_ctx: 8192, // Increased context window (was the 4096 default)
        },
      }),
      signal: AbortSignal.timeout(600_000), // Our new 10-minute timeout
    });
    

    Lesson Learned: Always be mindful of your LLM's default context window and the actual length of your prompts. Dynamic prompt engineering with large context injections can easily exceed defaults. Increasing num_ctx does consume more RAM on the Ollama host, so it's a trade-off to monitor.

  2. Extended Timeout (AbortSignal.timeout): For non-streaming complete() calls, I explicitly added an AbortSignal.timeout(600_000) (10 minutes) to the fetch request, so slow CPU inference no longer aborts prematurely. Streaming requests, interestingly, don't suffer from this as much: as long as chunks keep flowing, the connection never sits idle long enough to trip a timeout (see the sketch after this list).

    Lesson Learned: Default HTTP client timeouts (browser, fetch, axios) are often too short for CPU-only LLM inference, especially with larger models. Explicitly setting a generous timeout is essential for stability. Distinguish between streaming and non-streaming requests when considering timeout strategies.
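
For contrast, here is a rough sketch of what the streaming path can look like. This is not our exact adapter code (streamComplete and onToken are illustrative names); it simply shows why streaming calls get away without a long fixed timeout: every NDJSON chunk Ollama emits keeps the connection busy.

typescript
// Hypothetical streaming sketch against Ollama's /api/generate (stream: true).
// Each chunk is a newline-delimited JSON object with a partial "response" field.
async function streamComplete(
  ollamaUrl: string,
  modelName: string,
  fullPrompt: string,
  onToken: (token: string) => void
): Promise<string> {
  const response = await fetch(`${ollamaUrl}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: modelName,
      prompt: fullPrompt,
      stream: true,
      options: { num_ctx: 8192 },
    }),
    // No AbortSignal.timeout here: arriving chunks keep the connection active.
  });

  if (!response.ok || !response.body) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let full = '';
  let buffered = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });

    // Ollama streams newline-delimited JSON; keep any partial trailing line buffered
    const lines = buffered.split('\n');
    buffered = lines.pop() ?? '';
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line) as { response?: string; done?: boolean };
      if (chunk.response) {
        full += chunk.response;
        onToken(chunk.response);
      }
    }
  }
  return full;
}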

The Current State

With these fixes, our Ollama instance is now much more robust:

  • Ollama context: 8192 tokens (up from 4096)
  • Ollama non-streaming timeout: 10 minutes (up from ~5 min default)
  • Models: qwen2.5:7b, qwen2.5:3b

While qwen2.5:7b is still slow on CPU, at least it's completing its tasks without truncation or premature timeouts.

What's Next? The Road Ahead

Even with these fixes, the journey continues. Here are a few immediate next steps:

  1. Top Up Anthropic API Credits: Ollama is a great fallback, but cloud LLMs are significantly faster.
  2. Monitor Ollama Logs: Ensure the context/timeout fixes are holding up under real-world load.
  3. Consider num_ctx: 16384: If prompts grow even further, we might need to double the context again, keeping an eye on RAM usage.
  4. Explore qwen2.5:3b as Default: For CPU-only inference, a smaller model might strike a better balance between speed and quality.
  5. Dream of a GPU: The ultimate solution for faster local inference on Hetzner would be adding a GPU to the server. One can dream!

This session was a stark reminder that building with LLMs isn't just about prompt engineering; it's about managing the underlying infrastructure, understanding their limitations, and continuously optimizing for both user experience and system stability. It's challenging, but incredibly rewarding to see these pieces click into place.

Happy coding!

json
{
  "thingsDone": [
    "Fixed group workflow naming to use source note titles for clarity.",
    "Increased Ollama's `num_ctx` from 4096 to 8192 tokens to prevent prompt truncation.",
    "Added a 10-minute `AbortSignal.timeout` to non-streaming Ollama requests to prevent premature timeouts on CPU-only inference."
  ],
  "pains": [
    "Ollama `qwen2.5:7b` default context (4096) was insufficient for complex workflow prompts (~5010 tokens).",
    "Ollama requests were timing out after ~5 minutes due to slow CPU-only inference, resulting in 500 errors.",
    "Workflow group names were unhelpful concatenations of action point titles."
  ],
  "successes": [
    "Improved user experience with more meaningful workflow group names.",
    "Eliminated prompt truncation issues with Ollama.",
    "Resolved Ollama request timeouts, ensuring successful completions for long-running inferences.",
    "Deployed critical stability fixes to production."
  ],
  "techStack": [
    "TypeScript",
    "Next.js",
    "tRPC",
    "Ollama",
    "qwen2.5:7b",
    "qwen2.5:3b",
    "LLM",
    "Fetch API",
    "AbortSignal"
  ]
}