nyxcore-systems

Unlocking Human-in-the-Loop: Building LLM-Powered Smart Review Steps

We just shipped a major milestone: Smart Review Steps. Dive into the architectural decisions, coding challenges, and lessons learned from integrating LLMs for critical human-in-the-loop workflow checkpoints.

LLM · WorkflowEngine · TypeScript · Next.js · DeveloperExperience · AI · HumanInTheLoop · tRPC · Prisma

Building intelligent systems isn't just about chaining LLM calls; it's about orchestrating them with human intelligence, feedback, and oversight. That's been our guiding principle as we tackled our latest major feature: Smart Review Steps.

These aren't just passive checkpoints. They're LLM-powered workflow pauses that execute a review prompt, present a detailed analysis, and then offer a clear human decision path: approve, retry, or add notes. It's a fundamental shift towards truly collaborative AI workflows.

After an intense session, I'm thrilled to report that Phases 1-4 are complete and tested. The core functionality—LLM analysis, pausing with output, and a fully interactive review panel UI—is live, committed as 1ec291b, and pushed to main. Let's break down how we got here, what we built, and the invaluable lessons learned along the way.

The Vision: Smart Review Steps

Imagine a complex workflow: analyzing a codebase, designing new features, then generating an implementation plan. At each critical juncture, a human needs to step in, review the LLM's output, and provide guidance. That's where Smart Review Steps come in. They inject a crucial human-in-the-loop element, ensuring quality and alignment with high-level goals.

Our goal was clear:

  1. LLM Analysis: An LLM should perform a deep review based on a specific prompt.
  2. Pause & Present: The workflow should pause, displaying the LLM's analysis in an intuitive UI.
  3. Human Feedback: The user can approve, add notes for downstream steps, or even retry a previous step with an updated prompt.

Architecting the Human Pause

Implementing this required touching several key parts of our stack:

Signalling the Pause: review_ready Event

The first step was to introduce a new event type: review_ready within our WorkflowEvent union (src/server/services/workflow-engine.ts:59). This event is the workflow engine's signal to the outside world that it's waiting for human input at a review step.
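To make that concrete, here's a minimal sketch of how the new variant might slot into the engine's event union. Only the review_ready type name comes from our code; the sibling variants and payload fields shown here are illustrative:

```ts
// Sketch only: the review_ready name is real, but the payload fields and
// sibling variants are illustrative assumptions.
type WorkflowEvent =
  | { type: "step_started"; stepId: string }
  | { type: "step_completed"; stepId: string; output: string }
  | {
      type: "review_ready"; // the engine is waiting for human input
      stepId: string;       // which review step paused the workflow
      output: string;       // the LLM's review analysis to present
    }
  | { type: "workflow_failed"; error: string };
```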

Injecting Human Wisdom: ChainContext & {{steps.Label.notes}}

For human feedback to be truly useful, it needs to flow back into the LLM chain. We achieved this by:

  • Adding stepReviewNotes: Map<string, string> to our ChainContext interface, populated with the review notes stored on completed checkpoints.
  • Implementing a new template variable, {{steps.Label.notes}}, in resolvePrompt(). Downstream LLM steps can now directly incorporate user-provided review notes, letting the AI act on human corrections.
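A minimal sketch of how that resolution might look, assuming a simple regex-based template pass (the actual resolvePrompt internals differ):

```ts
// Sketch of {{steps.Label.notes}} resolution; only stepReviewNotes and the
// variable syntax come from our implementation, the rest is illustrative.
interface ChainContext {
  // ...existing fields omitted
  stepReviewNotes: Map<string, string>; // step label -> human review notes
}

function resolvePrompt(template: string, ctx: ChainContext): string {
  return template.replace(
    /\{\{steps\.([^.}]+)\.notes\}\}/g,
    (_match, label: string) => ctx.stepReviewNotes.get(label) ?? "",
  );
}

// Usage: "Address this feedback: {{steps.Review.notes}}" resolves to the
// reviewer's notes for the step labeled "Review", or "" if none exist.
```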

From Passive to Active: The executeStep() Overhaul

Previously, a review step might have just displayed a static message. No more. We completely replaced this passive logic:

  • executeStep() now triggers a full LLM execution for review prompts.
  • It stores the LLM's output, digest, token usage, and cost.
  • Crucially, the step's status is set to "pending", and the review_ready event is yielded, pausing the workflow.
  • Even if the LLM fails during the review, we still store the error and pause. Why? Because a human review is still valuable, even if it's to debug an LLM issue or decide to bypass the flawed analysis.
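Here's a condensed sketch of that review branch, building on the ChainContext sketch above. The helper names (runLLM, saveStepResult, setStepStatus) are hypothetical stand-ins for our internal APIs:

```ts
// Hypothetical helpers standing in for the engine's internals:
declare function runLLM(prompt: string): Promise<{ text: string; tokens: number; costUsd: number }>;
declare function saveStepResult(stepId: string, data: Record<string, unknown>): Promise<void>;
declare function setStepStatus(stepId: string, status: "pending" | "running" | "completed"): Promise<void>;

interface Step { id: string; kind: string; prompt: string }
type ReviewEvent = { type: "review_ready"; stepId: string; output: string };

async function* executeReviewStep(step: Step, ctx: ChainContext): AsyncGenerator<ReviewEvent> {
  let output: string;
  try {
    const result = await runLLM(resolvePrompt(step.prompt, ctx));
    output = result.text;
    await saveStepResult(step.id, { output, tokens: result.tokens, costUsd: result.costUsd });
  } catch (err) {
    // Store the failure and pause anyway: a human can still inspect the
    // error and decide to debug or bypass the flawed analysis.
    output = `LLM error: ${(err as Error).message}`;
    await saveStepResult(step.id, { output, error: true });
  }
  await setStepStatus(step.id, "pending"); // pause rather than complete
  yield { type: "review_ready", stepId: step.id, output };
}
```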

Giving Control: resume and retryFromReview Mutations

On the tRPC side (src/server/trpc/routers/workflows.ts), we enhanced user control:

  • The resume mutation now accepts optional reviewNotes (up to 5000 characters), which are stored directly in the checkpoint.
  • We introduced retryFromReview. This mutation verifies that the workflow is paused, optionally updates the target step's prompt, resets that target step and all subsequent steps, and then leaves the workflow paused at the newly reset target. A user can therefore pinpoint an issue in an earlier step, tweak its prompt, and re-run the chain from there. Both procedures are sketched below.
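Here's roughly what the router shape might look like, assuming a standard zod-validated tRPC setup (the router and protectedProcedure helpers and their import path are assumptions; mutation bodies are paraphrased as comments):

```ts
import { z } from "zod";
import { router, protectedProcedure } from "../trpc"; // project helpers, path assumed

export const workflowsRouter = router({
  resume: protectedProcedure
    .input(z.object({
      workflowId: z.string(),
      reviewNotes: z.string().max(5000).optional(), // stored on the checkpoint
    }))
    .mutation(async ({ input }) => {
      // persist input.reviewNotes, then let the engine continue past the review
    }),

  retryFromReview: protectedProcedure
    .input(z.object({
      workflowId: z.string(),
      targetStepId: z.string(),
      updatedPrompt: z.string().optional(), // optionally tweak before re-running
    }))
    .mutation(async ({ input }) => {
      // 1. verify the workflow is paused at a review step
      // 2. overwrite the target step's prompt if updatedPrompt is provided
      // 3. reset the target step and all subsequent steps
      // 4. leave the workflow paused at the newly reset target
    }),
});
```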

Crafting the User Experience: The Review Panel UI

The frontend work brought all this together. In src/app/(dashboard)/dashboard/workflows/[id]/page.tsx, we built a comprehensive review panel:

  • A MarkdownRenderer beautifully displays the LLM's detailed analysis.
  • A collapsible section shows the original review prompt.
  • A notes textarea provides a dedicated space for user feedback, complete with a {{steps.Review.notes}} hint to show how their input will be used.
  • Clear "Approve & Continue" and "Retry a previous step" buttons provide intuitive control. The "Retry" dropdown allows selection of any prior step, with an editable prompt field.
  • Completed review steps now neatly display both the original prompt and any stored review notes.
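Stripped down to its core controls, the panel looks roughly like this; component and prop names are illustrative, not the actual page code:

```tsx
import { useState } from "react";
import { MarkdownRenderer } from "@/components/MarkdownRenderer"; // path assumed

function ReviewPanel(props: {
  analysis: string;                                   // the LLM's review output
  onApprove: (notes: string) => void;                 // resume with optional notes
  onRetry: (stepId: string, prompt?: string) => void; // retryFromReview trigger
}) {
  const [notes, setNotes] = useState("");
  return (
    <div>
      <MarkdownRenderer content={props.analysis} />
      <textarea
        value={notes}
        onChange={(e) => setNotes(e.target.value)}
        placeholder="Notes for downstream steps ({{steps.Review.notes}})"
      />
      <button onClick={() => props.onApprove(notes)}>Approve & Continue</button>
      {/* retry dropdown with an editable prompt field omitted for brevity */}
    </div>
  );
}
```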

Finally, we enhanced our LLM review templates (deepReview1, deepReview2, extensionReview, secReview in src/lib/constants.ts) by adding a systemPrompt, ensuring the LLM understands its role and context for these critical reviews.
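The change itself is small: each template now carries an explicit system prompt alongside its review prompt. Roughly, with everything beyond the systemPrompt field being illustrative:

```ts
// Illustrative template shape; only the systemPrompt addition is the new part.
export const deepReview1 = {
  label: "Deep Review 1",
  systemPrompt:
    "You are a senior engineer performing a critical design review. " +
    "Assess correctness, completeness, and alignment with the stated goals.",
  prompt: "Perform a deep review of the output produced so far.",
};
```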

All these changes passed npm run typecheck (aside from one pre-existing issue in discussions/[id]/page.tsx), giving us confidence in the type safety of our new features.

Lessons from the Trenches: The "Pain Log" Turned Wisdom

No major feature ships without a few bumps in the road. Here's what tripped us up and what we learned:

1. TypeScript Execution Headaches

  • Tried: Running the workflow engine directly via node -e with require().
  • Failed: Node.js doesn't directly understand TypeScript modules without a transpilation step.
  • Lesson Learned: For quick TypeScript execution outside the main server, npx tsx is an absolute lifesaver. It transpiles on the fly, making it perfect for scripts and one-off tests. Running npx tsx scripts/test-review-step.ts worked perfectly.

2. Shell Escaping Shenanigans

  • Tried: Using npx tsx -e "..." for inline script execution.
  • Failed: Shell escaping of the ! character (which bash treats specially, e.g. for history expansion) mangled the inline source and caused esbuild syntax errors, making inline tsx -e frustrating for anything non-trivial.
  • Lesson Learned: For anything beyond the simplest inline commands, write your TypeScript code to a .ts file first (e.g., scripts/my-test.ts) and then execute it with npx tsx scripts/my-test.ts. It avoids a whole class of shell-related headaches.

3. The Indispensable dev-start.sh

  • Problem: During development, we frequently ran into issues with port 3000 being occupied by previous dev servers (sometimes even on ports 3001, 3002, etc. from failed processes). This led to frustrating "Address already in use" errors.
  • Lesson Learned: Always, always use a robust dev server startup script. Our scripts/dev-start.sh became invaluable. It handles:
    • Killing old processes on port 3000.
    • Clearing the .next cache.
    • Regenerating the Prisma client.
  This ensures a clean, consistent development environment every time and saves countless minutes of debugging phantom issues.

The Current State & What's Next

Our dev server is happily running on localhost:3000. We successfully tested with workflow 5954da72-ae20-4c05-97f4-825f9d437a50 ("nyxCore Kimi K2 v3"), which is currently paused at review step 2cdd8ede-7cff-4cb9-aa01-ac1593e20a3f with a rich LLM output: "Critical Review: Kimi K2 LLM Integration Feature Specification" – 3682 tokens, $0.0286, ~33s of processing.

This is a huge step forward, but the journey continues. Our immediate next steps involve:

  1. Structured Review Output: Extracting key points from the LLM's review analysis into a structured list, with per-item actions (recreate with hints, keep, edit manually, discard). This will require new engine logic to parse LLM output and new tRPC mutations/UI components.
  2. Multi-Actions: Implementing "Accept All," "Recreate All (from review criteria)," and "Discard All & Recreate from Design Features" for faster decision-making.
  3. Workflow Overview Enhancements: Adding crucial metadata to the workflow list view: costs, step counts, token usage, creation time, and iterations per workflow.

It's an exciting time to be building in this space. Shipping this core functionality feels like unlocking a new dimension of human-AI collaboration. Onwards to the next phase!

```json
{
  "thingsDone": [
    "Implemented `review_ready` event type for workflow pauses",
    "Added `stepReviewNotes` to `ChainContext` for injecting human feedback",
    "Created `{{steps.Label.notes}}` template variable for prompt resolution",
    "Replaced passive review steps with full LLM execution, storing output and pausing",
    "Updated `resume` mutation to accept and store `reviewNotes`",
    "Implemented `retryFromReview` mutation for re-running steps with updated prompts",
    "Built full review panel UI with MarkdownRenderer, notes, and action buttons",
    "Enhanced existing review templates with `systemPrompt`",
    "Ensured all changes pass `npm run typecheck`"
  ],
  "pains": [
    "Direct TypeScript execution with `node -e`",
    "Shell escaping `!` character in inline `tsx -e` commands",
    "Port conflicts and stale caches from previous dev server instances"
  ],
  "successes": [
    "Successful implementation and testing of Smart Review Steps",
    "Robust `npx tsx` usage for TypeScript script execution",
    "Development of a comprehensive `dev-start.sh` script for consistent environment setup",
    "Successful LLM analysis and UI integration for human-in-the-loop workflow"
  ],
  "techStack": [
    "TypeScript",
    "Next.js",
    "tRPC",
    "Prisma",
    "LLMs (Large Language Models)",
    "Workflow Engine (custom)",
    "React (for UI)",
    "MarkdownRenderer"
  ]
}
```