Building Better UX: Adding Real-Time Metrics to Workflow Execution
How we transformed a basic workflow dashboard into an insightful metrics powerhouse, complete with energy consumption tracking, time savings calculations, and a few hard-learned lessons about UI consistency.
Ever looked at a workflow execution and wondered: "Was this actually worth running?" or "How much energy did my AI model consume?" We recently tackled this exact problem by adding comprehensive per-step metrics to our workflow dashboard. Here's the story of how we went from basic execution logs to a metrics-rich experience that actually helps users understand the value of their automated workflows.
The Mission: Making Workflows Transparent
Our goal was straightforward but ambitious: show users the real impact of each workflow step. We wanted to display:
- Energy consumption (in Wh/mWh) based on token usage and model type
- Time saved compared to manual execution
- Token compression savings when using digest features
- All while making dark mode the default and cleaning up some visual inconsistencies
The Technical Journey
Phase 1: Setting the Foundation
We started by creating a dedicated utility module workflow-metrics.ts to handle all the heavy lifting:
```typescript
// Energy rates by model family (Wh per million tokens)
const ENERGY_RATES = {
  'claude-3-sonnet': 110,
  'claude-3-haiku': 45,
  'gpt-4o': 540,
  'gpt-4o-mini': 110,
  'gemini-1.5-flash': 85,
  // ... with fallbacks for unknown models
};

export function computeStepMetrics(step: WorkflowStep) {
  const tokenUsage = extractTokenUsage(step);
  const energy = calculateEnergyConsumption(tokenUsage, step.model);
  const timeSaved = estimateTimeSaved(tokenUsage);
  return { energy, timeSaved, tokensSaved: calculateTokenSavings(step) };
}
```
The beauty here is in the model-specific energy rates. Different AI models have vastly different power consumption profiles: per our rate table, GPT-4o (540 Wh per million tokens) uses roughly 12x more energy per token than Claude 3 Haiku (45 Wh per million tokens). Users deserve to know this!
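For completeness, here is one plausible way the energy lookup could work. The post doesn't show calculateEnergyConsumption, so this is an assumed sketch: the rate table is repeated for self-containment, and DEFAULT_RATE plus the longest-prefix matching strategy (so a dated model ID like "gpt-4o-mini-2024-07-18" still resolves) are assumptions, not the actual implementation.

```typescript
type TokenUsage = { inputTokens: number; outputTokens: number };

// Wh per million tokens, per the rate table above
const ENERGY_RATES: Record<string, number> = {
  'claude-3-sonnet': 110,
  'claude-3-haiku': 45,
  'gpt-4o': 540,
  'gpt-4o-mini': 110,
  'gemini-1.5-flash': 85,
};

const DEFAULT_RATE = 150; // hypothetical fallback for unknown models

export function calculateEnergyConsumption(usage: TokenUsage, model: string): number {
  // Pick the longest matching family prefix so 'gpt-4o-mini-…' doesn't
  // accidentally resolve to the 'gpt-4o' rate.
  const key = Object.keys(ENERGY_RATES)
    .filter((k) => model.startsWith(k))
    .sort((a, b) => b.length - a.length)[0];
  const rateWhPerMillion = key ? ENERGY_RATES[key] : DEFAULT_RATE;
  const totalTokens = usage.inputTokens + usage.outputTokens;
  return (totalTokens / 1_000_000) * rateWhPerMillion; // result in Wh
}
```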
Phase 2: Smart Token Savings Calculation
One of the trickier challenges was calculating "downstream token savings" when users employ digest compression:
```typescript
// When a step compresses 10,000 tokens down to 3,000 tokens,
// and that compressed output gets referenced by 3 downstream steps,
// the total savings = 7,000 tokens × 3 references = 21,000 tokens saved
const compressionSavings = originalTokens * (1 - DIGEST_COMPRESSION);
const downstreamMultiplier = estimateDownstreamReferences(step);
const totalSavings = compressionSavings * downstreamMultiplier;
```
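Packaged as a standalone function, the arithmetic above looks like this. Note the assumptions: DIGEST_COMPRESSION = 0.3 (the digest keeps ~30% of the original tokens, which reproduces the 10,000 → 3,000 example) is a fixed ratio here, whereas the real module may derive it from the actual digest output.

```typescript
const DIGEST_COMPRESSION = 0.3; // assumed: digest retains ~30% of original tokens

export function computeDownstreamSavings(
  originalTokens: number,
  downstreamReferences: number
): number {
  // Tokens avoided each time a downstream step reads the digest
  // instead of the full original output.
  const savedPerReference = originalTokens * (1 - DIGEST_COMPRESSION);
  return savedPerReference * downstreamReferences;
}

// 10,000 tokens compressed to ~3,000, referenced by 3 steps:
computeDownstreamSavings(10_000, 3); // 21000 tokens saved
```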
Phase 3: UI Integration with Smart Conditionals
The metrics bar needed to integrate seamlessly with existing step displays. Here's where we hit our first major challenge.
```tsx
// New metrics bar for completed steps
{(step.status === 'completed' || step.status === 'failed') && (
  <div className="flex flex-wrap items-center gap-4 text-sm text-muted-foreground mt-2">
    <span>{formatTokens(tokens)} tokens</span>
    <span>{formatDuration(duration)}</span>
    <span>{formatCost(cost)}</span>
    <div className="flex items-center gap-1">
      <Zap className="h-3 w-3" />
      <span>{formatEnergy(energy)}</span>
    </div>
    <span>~{formatTimeSaved(timeSaved)} saved</span>
  </div>
)}
```
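The formatters used in the metrics bar aren't shown in the post; here is one plausible implementation. The Wh/mWh threshold (switch to milliwatt-hours below 0.1 Wh) and the rounding choices are assumptions.

```typescript
// Small energy values read better in mWh (assumed 0.1 Wh threshold).
export function formatEnergy(wh: number): string {
  if (wh < 0.1) return `${Math.round(wh * 1000)} mWh`;
  return `${wh.toFixed(1)} Wh`;
}

// Abbreviate large token counts: 15000 -> "15.0k".
export function formatTokens(tokens: number): string {
  if (tokens >= 1000) return `${(tokens / 1000).toFixed(1)}k`;
  return String(tokens);
}

// Time saved, in minutes; roll over to hours past 60.
export function formatTimeSaved(minutes: number): string {
  if (minutes >= 60) return `${(minutes / 60).toFixed(1)} h`;
  return `${Math.round(minutes)} min`;
}
```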
Lessons Learned: The Duplicate Display Dilemma
Here's where things got interesting. We initially had two places showing similar information:
- The new metrics bar (outside the expandable step content)
- The old metadata line (inside the expanded step body)
For completed steps, users would see duplicate token counts, costs, and durations. Not great UX!
The Solution: We implemented a conditional rendering strategy using an IIFE (Immediately Invoked Function Expression):
```tsx
{/* Only show old metadata for non-completed steps, or just retry info for completed ones */}
{(() => {
  const showingMetricsBar = step.status === 'completed' || step.status === 'failed';
  if (showingMetricsBar) {
    // Only show retry count if it exists
    return step.retryCount > 0 ? `Retry ${step.retryCount}` : null;
  }
  // Show full metadata for pending/running steps
  return `${tokens} tokens • ${duration} • ${cost}`;
})()}
```
This keeps the UI clean while preserving information density where it matters.
The Small Wins: Dark Mode and Visual Polish
Sometimes the best improvements are the subtle ones:
```typescript
// Before: system preference with light fallback
const theme = getTheme() || 'system';

// After: dark by default (because developers love dark mode)
const theme = getTheme() || 'dark';
```
We also cleaned up badge borders by removing unnecessary `border border-nyx-*/20` classes from the colored variants; the colored backgrounds already provide sufficient visual distinction.
Real-World Impact
The metrics now tell a story. Users can see:
- "This GPT-4o step consumed 2.1 Wh of energy" (awareness of environmental impact)
- "Saved ~45 minutes compared to manual work" (ROI justification)
- "Digest compression saved 15,000 tokens downstream" (optimization insights)
What's Next
This foundation opens up exciting possibilities:
- Aggregate workflow metrics (total energy, time, cost across all steps)
- Mobile optimization for the metrics bar layout
- Unit tests for the metrics calculation functions
- Tooltips explaining the downstream savings approximation
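The first item on that list, aggregate workflow metrics, could be sketched as a simple reduction over per-step results. StepMetrics mirrors the shape returned by computeStepMetrics earlier in the post, but the exact field names and units here are assumptions.

```typescript
interface StepMetrics {
  energy: number;      // Wh
  timeSaved: number;   // minutes
  tokensSaved: number;
}

// Sum per-step metrics into workflow-level totals.
export function aggregateWorkflowMetrics(steps: StepMetrics[]): StepMetrics {
  return steps.reduce(
    (total, step) => ({
      energy: total.energy + step.energy,
      timeSaved: total.timeSaved + step.timeSaved,
      tokensSaved: total.tokensSaved + step.tokensSaved,
    }),
    { energy: 0, timeSaved: 0, tokensSaved: 0 }
  );
}
```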
Key Takeaways
- Model-specific energy rates matter - different AI models have vastly different power consumption
- Downstream token savings can be significant with digest compression
- UI consistency requires careful planning - avoid duplicate information displays
- Small UX improvements (like default dark mode) compound into better user experience
- Metrics should tell a story - raw numbers are less valuable than contextual insights
The next time you're building workflow tools, consider: what story are your metrics telling? Users don't just want to know what happened—they want to understand the value of what happened.
Want to see more posts about building developer tools and workflow automation? Follow along for more technical deep-dives and UX insights.