From Raw LLM Output to Actionable Insights: Supercharging AI Workflow Reviews
We've revolutionized our AI workflow review process with automated key point extraction, intelligent user controls, and enhanced dashboard visibility. Dive into how we made LLM outputs truly actionable.
Building sophisticated AI-driven applications often involves complex, multi-step workflows. A critical part of these workflows is the "review step," where human oversight ensures the AI's output aligns with expectations and requirements. However, sifting through verbose LLM responses to pinpoint actionable feedback can be tedious and inefficient.
That's precisely the challenge we tackled in our latest development sprint. Our goal was clear: transform raw LLM review outputs into structured, actionable key points, empowering users to quickly understand, refine, and iterate on their AI workflows. I'm thrilled to report that we've shipped a comprehensive solution!
The Core Problem: Making AI Reviews Actionable
Imagine an AI workflow designed to generate a marketing campaign brief. After several steps, the AI presents a draft for human review. This review step might involve an LLM summarizing the campaign, highlighting potential issues, or suggesting improvements. The output, while insightful, often comes in a free-form text block.
Our users needed a way to:
- Quickly grasp the most important feedback from the LLM.
- Easily act on that feedback—whether it's accepting a suggestion, editing it, or using it to restart part of the workflow.
- Gain better visibility into the performance and status of their ongoing workflows.
This led us to a multi-phase implementation, touching both our backend services and frontend experience.
Phase 1: Intelligent Key Point Extraction with Haiku
The first crucial step was to distill the essence of an LLM's review. We introduced a new service, `src/server/services/review-key-points.ts`, housing our `extractKeyPoints()` function.
For this task, we leveraged `claude-haiku-4-5-20251001`. Why Haiku? Its speed and cost-effectiveness make it ideal for quick, structured data extraction without sacrificing accuracy for this specific use case. The function is designed to:
- Prompt Haiku to extract a JSON array of key points from the review output.
- Validate the shape of the extracted JSON to ensure data integrity.
- Assign unique UUIDs to each key point for stable identification.
- Truncate fields (e.g., 200 characters for a summary, 2000 for a detailed description) to keep data concise and manageable.
- Cap the total items at 50 to prevent overwhelming the user.
Once extracted, these structured key points are stored directly in the workflow step's `checkpoint.keyPoints` field, making them persistent and accessible.
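As a sketch of what that post-processing might look like, here is a hypothetical `normalizeKeyPoints` helper; the field names, severity values, and limits mirror the description above but are assumptions, not the actual implementation:

```typescript
import { randomUUID } from "crypto";

// Illustrative shape for a normalized key point (field names assumed).
interface ReviewKeyPoint {
  id: string;
  summary: string;     // truncated to 200 chars
  description: string; // truncated to 2000 chars
  severity: string;
}

const MAX_ITEMS = 50;

function normalizeKeyPoints(raw: unknown): ReviewKeyPoint[] {
  // Validate the overall shape: we expect a JSON array.
  if (!Array.isArray(raw)) return [];

  return raw
    // Drop anything that isn't an object with fields we can read.
    .filter((item): item is Record<string, unknown> =>
      typeof item === "object" && item !== null)
    // Cap the total number of items to avoid overwhelming the user.
    .slice(0, MAX_ITEMS)
    .map((item) => ({
      id: randomUUID(), // stable identifier for per-item actions
      summary: String(item.summary ?? "").slice(0, 200),
      description: String(item.description ?? "").slice(0, 2000),
      severity: String(item.severity ?? "info"),
    }));
}
```

Validating and normalizing at this boundary means the rest of the system can trust the stored `keyPoints` shape, even when the LLM returns malformed or oversized JSON.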
Phase 2: Empowering User Control and Workflow Manipulation
Having extracted key points is only half the battle; users need to interact with them. We developed several new mutations and refined existing ones to provide granular control:
- `updateKeyPointsMutation`: This is the heart of user interaction. It allows users to merge their actions (keeping, discarding, or editing individual key points) back into the workflow step's checkpoint. Inline edits are fully supported, providing a seamless refinement experience.
- `recreateFromKeyPointMutation`: This is where feedback turns into action. Users can select specific key points and trigger a "recreate" action. The mutation resets the target step (and any subsequent steps), then injects the selected key point suggestions as a "hint block" directly into the target step's prompt. This guides the next LLM execution, ensuring the AI learns from previous feedback.
- Refined `resume` and `retryFromReview` mutations: We identified a subtle bug where these mutations could inadvertently overwrite existing `keyPoints` data. A quick fix ensured that new `reviewNotes` are now merged into the existing checkpoint state using a spread operator (`{ ...existingCheckpoint, reviewNotes }`), preserving all previously extracted insights.
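The merge fix can be illustrated with a minimal sketch; the `Checkpoint` shape and `mergeReviewNotes` helper here are hypothetical, used only to show the spread pattern:

```typescript
// Illustrative checkpoint shape (fields assumed).
interface Checkpoint {
  keyPoints?: { id: string; summary: string }[];
  reviewNotes?: string;
  [key: string]: unknown;
}

function mergeReviewNotes(existing: Checkpoint, reviewNotes: string): Checkpoint {
  // Spreading the existing checkpoint first preserves keyPoints (and any
  // other fields) while updating only reviewNotes.
  return { ...existing, reviewNotes };
}
```

The buggy version was effectively `checkpoint = { reviewNotes }`, which silently dropped every other field; the spread makes the update additive.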
Phase 3: Bringing it to Life - The Frontend Experience
A powerful backend needs an intuitive frontend. We introduced `src/components/workflow/review-key-points-panel.tsx`, an interactive panel that brings these features to life:
- Severity Summary Bar: A visual overview of the types and quantities of key points.
- Grouped Key Points List: Organizes key points for easy digestion.
- Per-Item Actions: Users can individually "Keep," "Edit" (inline), or "Discard" each key point.
- Bulk Actions: For efficiency, we added "Accept All," "Recreate with Hints," and "Discard & Recreate from Source" buttons, allowing users to apply actions to multiple points simultaneously.
This panel is dynamically integrated into `src/app/(dashboard)/dashboard/workflows/[id]/page.tsx`, appearing interactively for pending review steps and as a read-only summary for completed ones.
Phase 4: Enhanced Workflow Visibility
Beyond individual review steps, we also improved the overall dashboard experience. The workflow list page (`src/app/(dashboard)/dashboard/workflows/page.tsx`) now displays richer metadata for each workflow card:
- Aggregated Cost: See the total cost incurred by a workflow.
- Token Usage: Monitor the LLM token consumption.
- Step Progress: Understand how many steps are completed out of the total.
- Duration: Track the total time a workflow has been running.
- Creation Date: Quickly reference when a workflow was initiated.
This provides users with invaluable at-a-glance insights into their workflow's performance, efficiency, and status, right from the main dashboard.
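A rough sketch of how this per-card metadata might be aggregated from a workflow's steps; the `Step` shape and `summarizeWorkflow` helper are illustrative assumptions, not our actual schema:

```typescript
// Illustrative per-step record (fields assumed).
interface Step {
  costUsd: number;
  tokens: number;
  status: "completed" | "pending" | "failed";
}

function summarizeWorkflow(steps: Step[]) {
  return {
    // Aggregated cost across all steps.
    totalCost: steps.reduce((sum, s) => sum + s.costUsd, 0),
    // Total LLM token consumption.
    totalTokens: steps.reduce((sum, s) => sum + s.tokens, 0),
    // Step progress as "completed/total".
    progress: `${steps.filter((s) => s.status === "completed").length}/${steps.length}`,
  };
}
```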
Challenges & Lessons Learned
No development sprint is without its hurdles. Here are a few critical lessons we learned along the way:
- JSON Field Handling in Prisma/TypeScript: Storing a dynamic array of objects (`Record<string, unknown>[]`) directly into a Prisma `Json` field can lead to TypeScript errors (TS2322). The standard workaround, casting via `as unknown as Prisma.InputJsonValue`, proved essential and is now a well-established pattern in our codebase for such scenarios.
- Preventing Hint Accumulation in Prompts: When implementing the `recreateFromKeyPoint` mutation, an initial approach simply appended new hint blocks to the step prompt. A crucial code review caught that repeated retries would lead to an ever-growing prompt, potentially exceeding token limits and degrading LLM performance. The fix was elegant: we now strip any previous hint blocks using `prompt.split(HINT_SEPARATOR)[0]` before appending new ones, ensuring a clean slate for each iteration. This highlights the importance of defensive prompt engineering.
- Preserving State During Checkpoint Updates: Similar to the hint accumulation, early versions of our `resume` and `retryFromReview` mutations were overwriting the entire `checkpoint` object with only the `reviewNotes`. This would have silently destroyed our newly implemented `keyPoints` data! The fix was to use the object spread syntax (`{ ...existingCheckpoint, reviewNotes }`) to merge new data while preserving existing fields. A great reminder about the importance of immutable updates and thorough state management.
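The hint-stripping pattern can be sketched as follows; the `HINT_SEPARATOR` value and the `injectHints` helper are assumptions for illustration, but the split-then-append logic is the one described above:

```typescript
// Assumed sentinel marking where injected hints begin in a step prompt.
const HINT_SEPARATOR = "\n\n--- REVIEW HINTS ---\n";

function injectHints(prompt: string, hints: string[]): string {
  // Strip any previous hint block so retries never accumulate hints:
  // everything after the separator is discarded.
  const basePrompt = prompt.split(HINT_SEPARATOR)[0];
  if (hints.length === 0) return basePrompt;
  return basePrompt + HINT_SEPARATOR + hints.map((h) => `- ${h}`).join("\n");
}
```

Because the function is idempotent over its own output, calling it on every retry keeps the prompt bounded regardless of how many recreate cycles a user triggers.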
Looking Ahead: What's Next?
With these features now live, our immediate next steps involve thorough testing and some refinement:
- Verifying the end-to-end flow of key point extraction and hint injection.
- Ensuring per-item and bulk actions persist state correctly.
- Refactoring: Consolidating the duplicated `ReviewKeyPoint` type into a shared `src/types/review.ts` for better maintainability.
- Security: Considering `rehype-sanitize` for our `MarkdownRenderer` as a defense-in-depth against potential XSS from LLM outputs.
- Maintainability: The workflow detail page is quite large (~1,640 lines). We're looking to extract panels (review, alternatives, fan-out viewer) into separate components to improve modularity.
This sprint has significantly enhanced the usability and power of our AI workflows. By turning raw LLM output into structured, actionable insights, we're making it easier for users to build, review, and iterate on complex AI applications faster than ever before.