Unmasking the Silent 404: A Deep Dive into Robust File Uploads and Race Conditions
We tackled a critical bug where Axiom document uploads were silently failing, leading to frustrating ENOENT errors. This post details our journey from debugging a phantom 404 to implementing a secure and resilient file upload pipeline, complete with lessons on race conditions and API design.
Every developer knows the unique frustration of a critical feature failing without a clear error message. Recently, our Axiom document upload functionality began exhibiting just such a behavior. Users were trying to upload ISO 27001 compliance documents, only for them to vanish into the ether, eventually manifesting as a cryptic ENOENT ("no such file or directory") when our backend tried to process them. The files simply weren't there.
This wasn't just a minor glitch; it was a showstopper for a core compliance feature. Our mission: unravel the mystery of the missing files and build a rock-solid upload mechanism.
The Case of the Missing File: Our Debugging Journey
Our initial setup for document uploads was fairly standard for a modern web application:
- A tRPC mutation on the client (`axiom.upload`) would initiate the process.
- The server would respond with a presigned URL.
- The client would then perform a `PUT` request directly to this URL, uploading the file.
- Upon the `PUT`'s perceived success, another mutation (`confirmUpload`) would be triggered to notify the backend that the file was ready for processing.
- Finally, a `processDocument` step on the server would attempt to read and ingest the uploaded file.
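In outline, the intended happy path looks like the sketch below. This is a simulation with an in-memory "disk" standing in for temporary storage; all function names here (`requestPresignedUrl`, `putFile`, and so on) are illustrative, not our actual API.

```typescript
// Minimal simulation of the four-step upload flow. A Map stands in
// for the temporary storage directory on disk.
const disk = new Map<string, Uint8Array>();

async function requestPresignedUrl(filename: string): Promise<string> {
  // Server side: register a pending document and hand back a storage key.
  return `uploads/${Date.now()}-${filename}`;
}

async function putFile(storageKey: string, data: Uint8Array): Promise<void> {
  // Client side: PUT the bytes to the presigned URL.
  disk.set(storageKey, data);
}

async function confirmUpload(storageKey: string): Promise<void> {
  // Client side: tell the backend the file is ready for processing.
  if (!disk.has(storageKey)) throw new Error("ENOENT: file was never written");
}

async function processDocument(storageKey: string): Promise<number> {
  // Server side: read and ingest the uploaded file.
  const data = disk.get(storageKey);
  if (!data) throw new Error("ENOENT");
  return data.byteLength;
}

async function upload(filename: string, data: Uint8Array): Promise<number> {
  const key = await requestPresignedUrl(filename);
  await putFile(key, data); // this is the step that silently failed in production
  await confirmUpload(key);
  return processDocument(key);
}
```

When every step actually completes, the chain works; the bug described next lived in the gap between "initiated" and "completed" for the `PUT` step.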
Sounds logical, right? Yet, somewhere in this chain, files were getting lost. The ENOENT error was a symptom, not the root cause. Our backend was diligently looking for files that simply hadn't been written to disk.
The breakthrough came when we realized the "presigned URL" wasn't actually pointing to an external storage service (like S3), but rather to an internal API endpoint: /api/v1/uploads/{storageKey}. And here was the critical flaw: no route handler existed for this PUT request.
The client was happily performing a PUT to /api/v1/uploads/{storageKey}, which was silently 404'ing. Because it was a client-side PUT and not directly part of our tRPC mutation's success path, the uploadMutation.onSuccess callback was firing, blissfully unaware that the actual file transfer had failed. This led to a classic race condition:
- The client's `uploadMutation` would succeed (meaning it had merely started the `PUT` request).
- `uploadMutation.onSuccess` would immediately call `confirmMutation.mutate()`.
- The backend would mark the document as `pending` and try to `processDocument`.
- `processDocument` would then try to `fs.readFile()` a file that was never written, resulting in the dreaded `ENOENT`.
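The broken ordering can be reproduced in a few lines. This is a simulation, not our actual code: the "PUT" silently returns a 404 without writing anything, yet the flow confirms and processes anyway.

```typescript
// Simulation of the race: the confirm/process path runs even though the
// PUT itself 404'd. All names are illustrative.
const disk = new Map<string, Uint8Array>();

async function brokenPut(_storageKey: string, _data: Uint8Array): Promise<number> {
  // No route handler existed for this endpoint, so the PUT silently
  // returned 404 and never wrote anything to disk.
  return 404;
}

async function processDocument(storageKey: string): Promise<string> {
  return disk.has(storageKey) ? "ok" : "ENOENT";
}

async function buggyFlow(key: string, data: Uint8Array): Promise<string> {
  // uploadMutation "succeeds" merely because the PUT was *initiated*...
  void brokenPut(key, data);
  // ...so onSuccess immediately confirms, and the backend tries to
  // process a file that was never written.
  return processDocument(key);
}
```

The fix, described below, is to make confirmation depend on the `PUT` response itself rather than on the mutation that initiated it.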
The irony was that our REST API path (/api/v1/rag/ingest), which handled uploads differently by writing the file directly within its own request handler, worked perfectly fine. It was only the tRPC-driven browser upload path, relying on a separate PUT, that was broken.
Lessons Learned: Distributed Operations and Silent Failures
This experience highlighted a few crucial takeaways:
- Verify Every Link in the Chain: When dealing with multi-step, distributed operations (like a client-initiated PUT to a separate endpoint), never assume success based solely on the initiation of a step. Each step needs its own explicit success/failure handling.
- Beware of Silent Failures: A 404 on an API endpoint can be particularly insidious if the client isn't properly logging or reacting to the HTTP status code. Always ensure your error handling cascades correctly.
- Race Conditions are Sneaky: The `onSuccess` callback of an initial mutation should only trigger actions that are truly dependent on its complete success, not the success of subsequent asynchronous operations it merely initiates.
Building a Robust Upload Pipeline: The Fix
With the root causes identified, we set about fixing the problem by addressing both the missing API route and the client-side orchestration.
1. The Missing Piece: Our Local File Upload API
The primary fix was to create the missing PUT handler: src/app/api/v1/uploads/[...path]/route.ts. This endpoint is now responsible for receiving the file data and securely writing it to our local temporary storage.
We built this handler with several key considerations for robustness and security:
- Authentication & Authorization: Every request passes through `authenticateRequest()`, ensuring only authenticated users can upload. Furthermore, we implemented tenant isolation, verifying that the `tenantId` extracted from the upload path matches the authenticated session's `tenantId`.
- Path Traversal Prevention: As a critical security measure, a `path.resolve()` boundary assertion was added to prevent malicious users from writing files outside their designated upload directory (e.g., /tmp/nyxcore-uploads/axiom/{tenantId}/{projectId}/).
- Extension Allowlist: To mitigate risks from executable or otherwise dangerous file types, we enforce an explicit allowlist of extensions: `.md`, `.txt`, `.pdf`, `.ts`, `.js`, `.py`, `.json`, `.yaml`, `.yml`, `.toml`, `.html`, `.css`.
- Database Verification: Before accepting an upload, we verify that a `ProjectDocument` with a matching `storageKey` exists in a `status: "pending"` state. This ensures that only legitimate, pre-registered upload intentions are fulfilled.
- Security Headers: We added the `X-Content-Type-Options: nosniff` header to prevent browsers from trying to "guess" the content type, hardening the endpoint against MIME-sniffing attacks.
Files are now securely written to /tmp/nyxcore-uploads/axiom/{tenantId}/{projectId}/{timestamp}-{filename}.
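The two filesystem-level guards can be sketched as small helpers like the ones below. These are illustrative stand-ins, not our exact handler code; the function names and the hard-coded root are assumptions for the example.

```typescript
import * as path from "node:path";

// Extensions we accept, mirroring the allowlist described above.
const ALLOWED_EXTENSIONS = new Set([
  ".md", ".txt", ".pdf", ".ts", ".js", ".py",
  ".json", ".yaml", ".yml", ".toml", ".html", ".css",
]);

function hasAllowedExtension(filename: string): boolean {
  return ALLOWED_EXTENSIONS.has(path.extname(filename).toLowerCase());
}

// Resolve the requested storage key against the upload root and assert
// the result never escapes that root (the path traversal guard).
function resolveWithinRoot(root: string, storageKey: string): string {
  const resolved = path.resolve(root, storageKey);
  if (!resolved.startsWith(path.resolve(root) + path.sep)) {
    throw new Error("Path traversal attempt rejected");
  }
  return resolved;
}
```

The `startsWith` check after `path.resolve()` is the key move: `resolve` collapses any `..` segments first, so a key like `../../etc/passwd` lands outside the root and is rejected before any write happens.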
2. Orchestrating the Client Flow
On the client side, in src/app/(dashboard)/dashboard/projects/[id]/page.tsx, we adjusted the upload flow to correctly handle the sequence of operations:
- Correcting the Race Condition: We moved `confirmMutation.mutate()` from `uploadMutation.onSuccess` to after the successful `PUT` request to our new local upload handler. This ensures that the backend is only notified to process a file after it has actually been written to disk.
- Robust Error Handling: We added a `continue` statement on `PUT` failure. This prevents the `confirmMutation` from being triggered if the actual file upload fails, ensuring that broken uploads don't mistakenly trigger backend processing.
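The corrected per-file loop looks roughly like this sketch. The `put` and `confirm` functions are injected stand-ins for our real `fetch` call and `confirmMutation.mutate()`; `uploadAll` and its parameter shapes are assumptions for the example.

```typescript
type PutFn = (url: string, body: Uint8Array) => Promise<{ ok: boolean }>;
type ConfirmFn = (storageKey: string) => Promise<void>;

// Confirm each file only after its PUT actually succeeded; skip
// (continue past) any file whose upload failed.
async function uploadAll(
  files: { key: string; url: string; body: Uint8Array }[],
  put: PutFn,
  confirm: ConfirmFn,
): Promise<string[]> {
  const confirmed: string[] = [];
  for (const file of files) {
    const res = await put(file.url, file.body);
    if (!res.ok) {
      // PUT failed: do NOT confirm, so the backend never tries to
      // process a file that was never written.
      continue;
    }
    await confirm(file.key);
    confirmed.push(file.key);
  }
  return confirmed;
}
```

The essential change from the buggy version is that confirmation sits *after* the awaited `PUT` response and behind an `ok` check, rather than in the initiating mutation's `onSuccess`.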
What's Next? Continuous Improvement
While the critical ENOENT bug is squashed and our upload pipeline is significantly more robust, development is an ongoing process. Our immediate next steps include:
- Deployment & Verification: Pushing the fix (`baf22ab`) to `main` and re-uploading the ISO 27001 documents to confirm successful processing.
- Download Functionality: Considering adding a `GET` handler for `/api/v1/uploads/[...path]` so `LocalStorageAdapter.getDownloadUrl()` can work, enabling users to download their uploaded documents.
- Actual File Deletion: Implementing proper file deletion in `LocalStorageAdapter.delete()`, which is currently a no-op. This is crucial for data hygiene and storage management.
- Optimizing Large Uploads: Exploring streaming body reads instead of `req.arrayBuffer()` to prevent out-of-memory (OOM) errors when handling very large file uploads.
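For the large-upload item, one possible direction (a sketch of what we might do, not what we've shipped) is to pipe the request body straight to disk with Node's stream utilities, so memory stays bounded regardless of file size:

```typescript
import { createWriteStream, statSync } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Stream a web ReadableStream (the shape of Request.body in a route
// handler) to disk instead of buffering it with req.arrayBuffer().
// Returns the number of bytes written. Error handling kept minimal.
async function streamBodyToFile(
  body: ReadableStream<Uint8Array>,
  destPath: string,
): Promise<number> {
  // Bridge the web stream to a Node stream, then pipe it to the file.
  const nodeStream = Readable.fromWeb(body as any);
  await pipeline(nodeStream, createWriteStream(destPath));
  return statSync(destPath).size;
}
```

Because `pipeline` applies backpressure, only a small window of the upload is ever held in memory, which is exactly the property `req.arrayBuffer()` lacks.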
This session was a stark reminder that even seemingly simple features like file uploads can hide complex interactions and subtle bugs. By meticulously tracing the flow, understanding the underlying mechanisms, and prioritizing security at every step, we've transformed a critical failure into a more resilient and trustworthy system.