The Case of the Phantom Pipeline: Debugging Server-Sent Events Gone Wild
How a seemingly innocent navigation action was secretly restarting entire data processing pipelines, and the elegant solution that fixed it.
Picture this: You're running a long data processing pipeline—maybe it's analyzing code for automated fixes or performing complex refactoring operations. You navigate away from the progress page to grab some coffee, and when you come back, the pipeline has mysteriously restarted from the beginning. Sound familiar?
This exact scenario had me scratching my head for hours until I uncovered a sneaky bug in our Server-Sent Events (SSE) implementation. Here's the story of how a simple navigation action was secretly spawning phantom pipelines.
The Mystery Unfolds
Our application has two main pipeline operations: AutoFix and Refactor. Both use SSE to stream real-time progress updates to the frontend. Users can monitor phases like "scanning," "analyzing," and "improving" as they happen.
The bug manifested in a frustrating way:
- Start a pipeline (it begins processing normally)
- Navigate away from the detail page
- Navigate back to check progress
- The pipeline restarts from Phase 1 😱
What made this particularly insidious was that it looked like normal behavior on the surface. The UI would reconnect and show progress updates—users might not even realize their pipeline had been reset.
Following the Digital Breadcrumbs
The investigation led me through our SSE architecture. Here's what was happening under the hood:
// The problematic SSE route (simplified — the ReadableStream setup and
// Response plumbing are omitted; safeEnqueue writes an event to the stream)
export async function GET(request: Request, { params }: { params: { id: string } }) {
  const runId = params.id;

  // 🚨 This was the problem - unconditional pipeline start:
  // every new SSE connection kicked off a fresh pipeline for the same run
  const generator = runAutoFix(runId);
  for await (const event of generator) {
    // Stream events to client
    await safeEnqueue(event);
  }
}
Every time the client connected to the SSE endpoint, it would call runAutoFix() or runRefactor() unconditionally. This meant:
- First visit: Pipeline starts normally ✅
- Navigate away: Client disconnects, but pipeline continues running in background
- Navigate back: New SSE connection triggers a second pipeline instance 💥
The original pipeline would complete in the background, updating the database. But the new pipeline would overwrite the status back to "scanning" and start fresh.
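The overwrite is easy to reproduce in miniature. Here's a hypothetical sketch (the in-memory Map and `startPipelineUnguarded` are illustrative stand-ins, not our real code) showing a reconnect clobbering a run that had already made progress:

```typescript
type Status = "pending" | "scanning" | "analyzing" | "improving" | "done";

// In-memory stand-in for the runs table (illustrative only)
const runs = new Map<string, Status>();

// Unguarded start: every SSE connection calls this, so a reconnect
// resets the run back to "scanning" regardless of prior progress
function startPipelineUnguarded(runId: string): void {
  runs.set(runId, "scanning");
}

runs.set("run-1", "pending");
startPipelineUnguarded("run-1"); // first connection: expected behavior
runs.set("run-1", "improving");  // background pipeline makes progress
startPipelineUnguarded("run-1"); // reconnect: phantom start
console.log(runs.get("run-1"));  // "scanning" — the progress was wiped out
```

Two independent writers, no coordination: the last one to touch the row wins, and the last one is always the freshly spawned phantom.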
The Elegant Solution
The fix turned out to be beautifully simple. Instead of always starting a new pipeline, we added a status guard:
// Fixed SSE route (same simplification as above)
export async function GET(request: Request, { params }: { params: { id: string } }) {
  const run = await getRun(params.id);

  // 🎯 The key insight: only pending runs should start pipelines
  if (run.status === "pending") {
    const generator = runAutoFix(params.id);
    for await (const event of generator) {
      await safeEnqueue(event);
    }
  } else {
    // For active/completed runs: send current status and close
    await safeEnqueue({ type: "status", data: run.status });
    return; // Close stream immediately
  }
}
But this created a new challenge: if active runs don't use SSE, how does the UI get updates? The solution was to add intelligent polling:
// Frontend: Poll for updates when SSE isn't available
const { data: run } = api.autoFix.get.useQuery(
  { id: runId },
  {
    // Only poll while the run is in an active status
    refetchInterval: (data) =>
      data && ACTIVE_STATUSES.includes(data.status) ? 3000 : false,
  }
);
Lessons Learned
This bug taught me several valuable lessons about real-time systems:
1. SSE Connections Are Ephemeral
A client disconnecting and reconnecting doesn't mean your server-side process should restart. Always check the current state before taking action on a new connection.
2. Guard Your Entry Points
Any endpoint that triggers expensive operations should validate whether that operation is actually needed. A simple status check saved us from countless phantom pipelines.
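One hardening step beyond the plain `if` check: make the guard an atomic claim, so two simultaneous connections can't both see "pending" and both start a pipeline. This is a sketch under assumed names (`tryClaimRun` and the in-memory store are hypothetical); in a real database the same idea is a conditional `UPDATE ... WHERE status = 'pending'` whose affected-row count tells you who won:

```typescript
type Status = "pending" | "running" | "done" | "failed";

// Illustrative in-memory store. In production this check-and-set must be
// atomic at the database level, e.g.:
//   UPDATE runs SET status = 'running' WHERE id = ? AND status = 'pending'
// and only the caller whose UPDATE affected 1 row starts the pipeline.
const store = new Map<string, Status>();

// Returns true only for the single caller that flips pending → running
function tryClaimRun(runId: string): boolean {
  if (store.get(runId) !== "pending") return false;
  store.set(runId, "running");
  return true;
}

store.set("run-1", "pending");
console.log(tryClaimRun("run-1")); // true  — this connection starts the pipeline
console.log(tryClaimRun("run-1")); // false — a concurrent reconnect is rejected
```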
3. Hybrid Approaches Work
We don't have to choose between SSE and polling. Use SSE for new operations and fall back to polling for reconnections. This gives us the best of both worlds: real-time updates with resilient reconnection behavior.
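The reconnect decision can be captured in one small helper. A sketch: `ACTIVE_STATUSES` mirrors the constant the polling hook uses, but `chooseTransport` itself is hypothetical, not code from our app:

```typescript
type Transport = "sse" | "poll" | "none";

// Statuses where a pipeline is in flight (assumed set, mirroring the
// ACTIVE_STATUSES constant used by the frontend polling hook)
const ACTIVE_STATUSES = ["scanning", "analyzing", "improving"];

// Pending runs open an SSE stream (which starts the pipeline);
// active runs poll; terminal runs need no live updates at all
function chooseTransport(status: string): Transport {
  if (status === "pending") return "sse";
  if (ACTIVE_STATUSES.includes(status)) return "poll";
  return "none";
}

console.log(chooseTransport("pending"));   // "sse"
console.log(chooseTransport("analyzing")); // "poll"
console.log(chooseTransport("done"));      // "none"
```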
4. Background Processes Are Invisible
The most dangerous bugs are the ones that "work" from the user's perspective. Our phantom pipelines were consuming server resources and potentially causing race conditions, but users might never notice.
Alternative Approaches Considered
I briefly considered adding resume logic to the pipeline generators themselves—essentially making them stateful and able to skip completed phases. However, this approach had several drawbacks:
- Complexity: Each pipeline phase would need resume logic
- Fragility: Database state during mid-phase operations can be ambiguous
- Performance: Checking completion status for every phase adds overhead
The SSE-level guard was much cleaner and more reliable.
Future Considerations
While this fix solves the immediate problem, it reveals some areas for future improvement:
- Manual Re-run Capability: We might want to add a "Re-run" button that explicitly resets status to "pending"
- Orphaned Process Detection: If the server crashes, active runs might get stuck forever. A cleanup job could help identify and handle these cases
- Better State Management: More granular status tracking could help with resume functionality in the future
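For the orphaned-process idea, a periodic sweep could flag active runs that have gone silent. This is only a sketch under assumed field names (`lastUpdatedAt` as a heartbeat timestamp is hypothetical, as is the 10-minute threshold):

```typescript
interface Run {
  id: string;
  status: string;
  lastUpdatedAt: number; // epoch ms of the last progress event (assumed field)
}

const STALE_AFTER_MS = 10 * 60 * 1000; // 10 minutes without a heartbeat

// Returns the ids of active runs that have stopped reporting progress;
// a cron job could flip these to "failed" so the UI stops polling forever
function findOrphanedRuns(runs: Run[], now: number): string[] {
  const active = new Set(["scanning", "analyzing", "improving"]);
  return runs
    .filter(r => active.has(r.status) && now - r.lastUpdatedAt > STALE_AFTER_MS)
    .map(r => r.id);
}

const now = Date.now();
const orphans = findOrphanedRuns(
  [
    { id: "a", status: "scanning", lastUpdatedAt: now - 15 * 60 * 1000 },
    { id: "b", status: "analyzing", lastUpdatedAt: now - 1000 },
    { id: "c", status: "done", lastUpdatedAt: now - 60 * 60 * 1000 },
  ],
  now
);
console.log(orphans); // only "a" is active *and* stale
```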
The Fix in Action
The final commit (bd96a5a) was surprisingly small for such an impactful bug fix:
- Modified SSE routes for both AutoFix and Refactor operations
- Added status guards to prevent unnecessary pipeline starts
- Implemented intelligent polling on the frontend
- Zero breaking changes or schema modifications
Sometimes the best solutions are the simplest ones. A single conditional check eliminated phantom pipelines and made our real-time system much more robust.
Have you encountered similar issues with SSE or real-time systems? I'd love to hear about your debugging adventures in the comments below.