Unveiling the AI's Inner Workings: Bringing Live LLM Observability to Our Pipelines
Ever wished you could see exactly what your AI is 'thinking' and *costing* in real-time? This post details our journey to add live token usage, model info, and 'stats for nerds' to our AutoFix and Refactor pipelines.
As developers building AI-powered tools, we often find ourselves wrestling with black boxes. Our applications leverage powerful Large Language Models (LLMs) to perform complex tasks like automated code fixes or intelligent refactoring, but the actual "work" happening inside those models remains opaque. How many tokens were used? Which specific model handled the request? What was the estimated cost? And crucially, how long did it really take?
These aren't just academic questions. For debugging, performance optimization, cost management, and ultimately, a better developer experience, real-time visibility into these metrics is invaluable. That's precisely the challenge we tackled in our last development session: bringing live, streaming LLM usage statistics to our AutoFix and Refactor pipeline run detail pages.
The Quest for Transparency: Our Goal
Our primary objective was clear: during an active AutoFix or Refactor pipeline run, as Server-Sent Events (SSE) stream updates to the UI, we wanted to display:
- Live Token Usage: Prompt, completion, and total tokens.
- Model Information: Which specific LLM model was invoked.
- Cost Estimates: A running tally of the monetary cost.
- "Stats for Nerds": Beyond the basics, we aimed for insights like energy consumption and estimated time saved, leveraging our existing
workflow-metricslibrary.
This wasn't about adding a new feature per se, but about enhancing the observability of existing, critical features.
Diving In: Where the Data Lives (or Doesn't Yet)
The first step was to understand our current state. Where is this data already being captured, and where are the gaps? This exploration revealed some crucial insights:
The Good News: We're Already Capturing Key LLM Data!
A huge win right off the bat: our `LLMCompletionResult` type (defined in `src/server/services/llm/types.ts`) already meticulously captures:
```typescript
interface LLMCompletionResult {
  tokenUsage: {
    prompt: number;
    completion: number;
    total: number;
  };
  costEstimate: number;
  model: string;
  provider: string;
  // ... other fields
}
```
This means the core data – token counts, cost estimates, model, and provider – is readily available at the point of LLM interaction. This saved us from having to instrument every LLM call from scratch.
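To ground that, here's a minimal sketch of how a pipeline step sees this data at the call site today. The import alias, provider interface, and function are illustrative, not verbatim excerpts from our code:

```typescript
import type { LLMCompletionResult } from "@/server/services/llm/types"; // import path assumed

// Illustrative provider shape; the real interface lives in our LLM service layer.
interface LLMProvider {
  complete(prompt: string): Promise<LLMCompletionResult>;
}

async function runDetectionStep(provider: LLMProvider, prompt: string) {
  const result = await provider.complete(prompt);

  // Everything we want to surface is already on the result...
  const { tokenUsage, costEstimate, model, provider: providerName } = result;

  // ...but today it stops here instead of being attached to an SSE event.
  return { tokenUsage, costEstimate, model, providerName };
}
```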
The Catch: It's Not Making It to the Client (Yet)
While the data exists, it's not being propagated to the client-side via our SSE streams. Here's a breakdown of the current state across our pipelines:
- AutoFix Pipeline (`issue-detector.ts`, `fix-generator.ts`): These components correctly call `provider.complete()` and receive the `LLMCompletionResult`. However, the `tokenUsage` and related data are not included in the SSE events that update the UI.
- Refactor Pipeline (`improvement-generator.ts`): This step does save `tokenUsage`, `costEstimate`, and `model` to the `RefactorItem` database record. But, again, this data isn't actively pushed via SSE events during the run.
- Refactor Pipeline (`opportunity-detector.ts`): This was a key discovery – this particular step doesn't capture token data at all. A definite gap we need to address.
- Code-Analysis Pipeline (`pattern-detector.ts`): This pipeline accumulates `totalTokens` and `totalCost`, but only sends these statistics as part of a `phase_complete` event, not as continuous updates.
- Reusable Metrics (`src/lib/workflow-metrics.ts`): We have existing utilities for calculating energy consumption (Wh) and estimated time saved per model family. This is perfect for our "stats for nerds" section!
This exploration confirmed our hypothesis: the data is there, but the plumbing to stream it live to the UI is missing.
Navigating the Trenches: Lessons from the Dev Server
Even in a planning session, you hit unexpected snags. Here are a couple of "pain points" that turned into immediate lessons learned:
- **Prisma `db execute --stdin` doesn't return SELECT results:**
  - **The Attempt:** I tried using `npx prisma db execute --stdin` to quickly run a `SELECT` query and inspect some data in the database.
  - **The Fail:** No output. It turns out this command is designed for DDL (Data Definition Language) or DML (Data Manipulation Language) statements, not for returning results from `SELECT` queries.
  - **The Workaround/Lesson:** For interactive database querying with Prisma, the correct approach is to use `npx tsx -e` (or `node -r tsx`) to execute a TypeScript file that uses the Prisma Client. This allows you to write and run arbitrary Prisma queries and log their results. A simple `console.log(await prisma.autoFixRun.findMany())` can save a lot of head-scratching (see the sketch after this list).
- **Next.js `--turbopack` flag:**
  - **The Attempt:** Out of curiosity, I tried starting the dev server with `npm run dev -- --turbopack`.
  - **The Fail:** `error: unknown option '--turbopack'`.
  - **The Workaround/Lesson:** While Turbopack is a promising next-gen bundler for Next.js, it's still evolving and not always fully integrated with all versions or development setups. For now, sticking to the standard `npm run dev` or our `./scripts/dev-start.sh` ensures a stable development environment. Sometimes, the tried and true is the best path.
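For reference, here's a minimal sketch of that Prisma workaround as a throwaway script (the file name is illustrative; the `autoFixRun` model matches our schema):

```typescript
// inspect-runs.ts (hypothetical throwaway script); run with: npx tsx inspect-runs.ts
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function main() {
  // Arbitrary read queries work here, unlike `prisma db execute --stdin`.
  const runs = await prisma.autoFixRun.findMany({ take: 5 });
  console.log(JSON.stringify(runs, null, 2));
}

main().finally(() => prisma.$disconnect());
```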
These small detours are a natural part of the development process and underscore the importance of understanding your tools.
The Blueprint: Crafting the Real-time Experience
With a clear understanding of the data landscape and some practical lessons under our belt, we moved to design the implementation. This is where the "no code changes yet" status becomes exciting – it's all about solidifying the plan before writing a single line of feature code.
Our plan involves a series of coordinated changes across the backend and frontend:
- **Extend SSE Event Types:** We'll enhance our `AutoFixEvent` and `RefactorEvent` types to include new fields: `tokenUsage`, `model`, `provider`, `costEstimate`, and `timing` (for phase-specific durations). This provides a standardized contract for the data streaming to the client (see the sketch after this list).
- **Instrument Pipeline Steps:**
  - Modify `issue-detector.ts` and `fix-generator.ts` (AutoFix) to capture the `LLMCompletionResult` data and include it when yielding new events.
  - Modify `opportunity-detector.ts` and `improvement-generator.ts` (Refactor) to both capture token data (where missing) and yield it in their respective events.
- **Accumulate and Stream in `pipeline.ts`:** The core `pipeline.ts` logic for both AutoFix and Refactor will be updated to:
  - Maintain running totals for `totalTokens`, `totalCost`, and `phaseTimings`.
  - Include these accumulated stats in the SSE events that are pushed to the client.
- **Persist Final Stats to DB:** Once a run completes, the final aggregated `{ totalTokens, totalCost, provider, model, phaseTimings }` will be saved to the `AutoFixRun.stats` and `RefactorRun.stats` JSON fields in the database for historical analysis.
- **Build a Shared UI Component:** A new, reusable `<LiveTokenStats>` React component will be created. This component will be responsible for:
  - Listening to the SSE stream.
  - Displaying real-time token usage, cost, model, and provider.
  - Integrating our `workflow-metrics.ts` to show estimated energy consumption and time saved.
  - Presenting phase-specific timings.
- **Wire It Up:** Finally, we'll integrate the `<LiveTokenStats>` component into the AutoFix and Refactor detail pages, connecting it to the SSE endpoints.
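To make the first and third steps concrete, here's a minimal sketch of what the extended event payload and running-totals logic might look like. The event union and any field names beyond those listed above are assumptions, not the final contract:

```typescript
// Sketch only: the exact event shapes are still to be finalized.
interface LLMUsageStats {
  tokenUsage: { prompt: number; completion: number; total: number };
  model: string;
  provider: string;
  costEstimate: number;            // running USD estimate
  timing?: Record<string, number>; // phase name -> duration (ms)
}

// One possible shape for the enriched events (simplified from the real union).
type AutoFixEvent =
  | { type: "phase_start"; phase: string }
  | { type: "llm_usage"; usage: LLMUsageStats }
  | { type: "phase_complete"; phase: string; usage?: LLMUsageStats };

// In pipeline.ts, an accumulator would fold each per-call result into
// running totals before every SSE push; something like:
function accumulate(totals: LLMUsageStats, result: LLMUsageStats): LLMUsageStats {
  return {
    ...result, // latest model/provider/timing win
    tokenUsage: {
      prompt: totals.tokenUsage.prompt + result.tokenUsage.prompt,
      completion: totals.tokenUsage.completion + result.tokenUsage.completion,
      total: totals.tokenUsage.total + result.tokenUsage.total,
    },
    costEstimate: totals.costEstimate + result.costEstimate,
  };
}
```

On run completion, the same accumulated object is what would be persisted to the `AutoFixRun.stats` and `RefactorRun.stats` JSON fields.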
What's Next?
The planning is complete, the blueprint is drawn. The next session will be all about execution. Here's our immediate action plan:
- Finalize the exact schema for extending `AutoFixEvent` and `RefactorEvent`.
- Implement the capture and yielding of token data in `issue-detector.ts` and `fix-generator.ts`.
- Do the same for `opportunity-detector.ts` and `improvement-generator.ts`.
- Update the `pipeline.ts` logic to accumulate and include these running totals in SSE events.
- Extend the `AutoFixRun.stats` and `RefactorRun.stats` DB schemas to persist final metrics.
- Develop the shared `<LiveTokenStats>` frontend component (sketched below).
- Integrate the component into the existing detail pages.
- Conduct thorough end-to-end testing with real pipeline runs.
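As a starting point for that component, here's a minimal sketch, assuming a hypothetical SSE endpoint URL prop and the `LLMUsageStats` shape sketched earlier:

```tsx
// Minimal sketch of <LiveTokenStats>; the streamUrl prop and event shape
// are assumptions, not the final API.
"use client";
import { useEffect, useState } from "react";

interface LLMUsageStats {
  tokenUsage: { prompt: number; completion: number; total: number };
  model: string;
  provider: string;
  costEstimate: number;
}

export function LiveTokenStats({ streamUrl }: { streamUrl: string }) {
  const [stats, setStats] = useState<LLMUsageStats | null>(null);

  useEffect(() => {
    const source = new EventSource(streamUrl);
    source.onmessage = (e) => {
      const event = JSON.parse(e.data);
      // Pipeline events carry cumulative totals, so the latest one wins.
      if (event.usage) setStats(event.usage);
    };
    return () => source.close();
  }, [streamUrl]);

  if (!stats) return <p>Waiting for LLM activity…</p>;

  return (
    <dl>
      <dt>Model</dt>
      <dd>{stats.provider} / {stats.model}</dd>
      <dt>Tokens</dt>
      <dd>
        {stats.tokenUsage.total} total ({stats.tokenUsage.prompt} prompt,{" "}
        {stats.tokenUsage.completion} completion)
      </dd>
      <dt>Est. cost</dt>
      <dd>${stats.costEstimate.toFixed(4)}</dd>
    </dl>
  );
}
```

Energy and time-saved figures from `workflow-metrics.ts` would slot into the same list once wired up.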
This feature is more than just "stats"—it's about empowering developers with immediate feedback, helping them understand the performance characteristics and cost implications of their AI-driven workflows. It's a significant step towards a more transparent and debuggable LLM-powered development environment. Stay tuned for the implementation details!