The Case of the Tainted Text: Debugging a Production Crawler and Learning Hard Lessons
Ever found your meticulously extracted content suddenly riddled with raw HTML? Join me on a deep dive into a production bug, the surgical cleanup, and the critical lessons learned about deploying database changes and remote execution.
It was a quiet afternoon, or so I thought, when the tell-tale signs of a production issue began to emerge. Our site crawler, a critical component of our Retrieval-Augmented Generation (RAG) system, was acting up. Users reported seeing raw HTML tags – <b>, <p>, <a> – embedded directly within search results. This wasn't just a cosmetic issue; it meant our embeddings were tainted, our search relevance compromised, and our users were getting a less-than-ideal experience. The mission was clear: find the source of the HTML leak, clean up the corrupted data, and deploy a robust fix.
This isn't just a story about a bug fix; it's a chronicle of the real-world development process – the frustration, the "aha!" moments, and the invaluable lessons learned when pushing changes to production.
The Problem Unveiled: A Tale of Tainted Text
Our RAG system relies on meticulously cleaned and chunked documents. The crawler's job is to fetch web pages, extract the relevant content, strip out all the cruft (like navigation, ads, and, crucially, HTML tags), and then hand over the pristine text for embedding and indexing.
The symptom was unmistakable: search results for a specific document set (our BetrVG documents, a German labor law compendium) were displaying raw HTML. This indicated that somewhere along the line, the HTML stripping process had failed, or worse, been bypassed.
The Detective Work: Tracing the HTML Leak
My first instinct was to check the HTML stripping logic itself. Had a regex gone rogue? Was a library misbehaving? But after some digging, the true culprit emerged, hiding in plain sight within src/server/services/rag/document-processor.ts.
The document-processor has a critical function: processDocument(). This function is responsible for taking a raw document (either from a local file or a URL) and turning it into a clean, chunkable piece of text. The bug lay in its content loading mechanism.
// src/server/services/rag/document-processor.ts (simplified snippet of the buggy version)
async processDocument(document: Document) {
  let content: string;
  // The bug: this logic re-fetched from sourceUrl whenever the localPath check
  // didn't pan out, even if the crawler had already processed and cleaned the page.
  // (Note: fs.accessSync() returns undefined on success and throws on failure,
  // so gating on its return value doesn't behave the way it reads.)
  if (document.localPath && fs.accessSync(document.localPath, fs.constants.F_OK)) {
    content = fs.readFileSync(document.localPath, 'utf8');
  } else {
    // This path was being hit inadvertently, re-fetching the raw HTML from the
    // original sourceUrl and undoing the crawler's work.
    const response = await fetch(document.sourceUrl);
    content = await response.text();
  }
  // ... rest of the processing, including HTML stripping ...
}
The issue was subtle: our crawler already extracts and cleans the HTML, saving the cleaned content to a local file. However, the document-processor's processDocument method, when called later (e.g., for re-processing or internal consistency checks), would sometimes bypass the localPath check if fs.accessSync failed for some reason, or if localPath wasn't immediately set up in a specific flow. It would then fall back to re-fetching from the sourceUrl – which, of course, contained the original, raw HTML. The cleaned version was being overwritten or ignored.
The Fix (Part 1): Prioritizing Local Content
The solution was to explicitly ensure that if a local, cleaned version of the document exists, we always prefer it. The original code's fs.accessSync check had nuances that could lead to the fallback. A more robust check was needed.
// src/server/services/rag/document-processor.ts (lines 397-415, conceptual change)
async processDocument(document: Document) {
  let content: string;
  // ✅ FIX: Prefer the local file on disk if it exists, ensuring we use the already-cleaned content.
  try {
    if (document.localPath) {
      await fs.promises.access(document.localPath, fs.constants.F_OK); // Check that the file exists
      content = await fs.promises.readFile(document.localPath, 'utf8');
    } else {
      // Fall back to fetching from sourceUrl only if no localPath exists
      const response = await fetch(document.sourceUrl);
      content = await response.text();
    }
  } catch (error) {
    // If local file access fails, fall back to fetching from sourceUrl
    console.warn(`Could not access local file for ${document.id}, falling back to sourceUrl.`, error);
    const response = await fetch(document.sourceUrl);
    content = await response.text();
  }
  // ... rest of the processing ...
}
This change ensures that once our crawler has done its job and saved a clean version, the document-processor will respect that, preventing the re-introduction of raw HTML.
The Fix (Part 2): Correcting the Crawler's MimeType
While investigating, I also found a related, albeit less critical, issue in src/server/services/rag/site-crawler-service.ts. The crawler was still reporting its mimeType as text/html even after it had extracted the content. This was a logical inconsistency, as the content it was handing off was already text/plain.
// src/server/services/rag/site-crawler-service.ts (line 333, conceptual change)
// Before: mimeType: 'text/html'
// After:
mimeType: 'text/plain', // ✅ FIX: Content is already extracted plain text
A small change, but important for accuracy and future-proofing.
The Cleanup Operation: Eradicating the Taint
With the root cause identified and fixed, the next critical step was to clean up the existing corrupted data. There were 153 BetrVG documents on production that were "tainted" – meaning their associated chunks contained raw HTML.
A surgical strike was needed. Thanks to well-defined foreign key constraints with ON DELETE CASCADE, deleting the parent Document records would automatically remove all their associated DocumentChunk records.
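For context, that cascade comes from a foreign key on the chunk table shaped roughly like the statement below; the constraint and column names here are assumptions for illustration, not copied from our actual schema.
-- Assumed shape of the chunk-to-document relationship (illustrative; names not verified against the schema)
ALTER TABLE "DocumentChunk" ADD CONSTRAINT "DocumentChunk_documentId_fkey"
  FOREIGN KEY ("documentId") REFERENCES "Document"("id") ON DELETE CASCADE ON UPDATE CASCADE;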
-- scripts/cleanup-crawled-html.sql
DELETE FROM "Document"
WHERE "projectId" = '4c445cc1-ceb0-49c0-9e41-aaae419cb4bb' -- Specific BetrVG Project ID
AND "sourceUrl" LIKE 'https://www.gesetze-im-internet.de/betrvg/%';
This script precisely targeted the problematic documents, ensuring a clean slate for the re-crawl.
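For good measure, a count along these lines confirms the scope before and after the delete. This is a minimal sketch that assumes the chunk text lives in a DocumentChunk.content column; the exact table and column names may differ.
-- Sanity check: count tainted documents and chunks (chunk table/column names assumed)
SELECT COUNT(DISTINCT d."id") AS tainted_documents, COUNT(c."id") AS tainted_chunks
FROM "Document" d
JOIN "DocumentChunk" c ON c."documentId" = d."id"
WHERE d."projectId" = '4c445cc1-ceb0-49c0-9e41-aaae419cb4bb'
  AND d."sourceUrl" LIKE 'https://www.gesetze-im-internet.de/betrvg/%'
  AND c."content" ~ '</?[a-z]+>';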
Navigating Production Minefields: Lessons Learned the Hard Way
Deploying changes to a live production environment is always a high-stakes operation. This session provided a fresh batch of "oh, right!" moments and solidified some critical best practices.
Lesson 1: SSH Heredocs and Docker Exec - A Tricky Combo
The Problem: I needed to run multi-statement SQL on the production PostgreSQL instance, which was running inside a Docker container. My go-to for multi-line commands over SSH is a heredoc. My initial attempt looked something like this:
ssh user@host << EOF
docker exec -i nyxcore-db-1 psql -U postgres -d nyxcore_prod
CREATE TABLE crawl_jobs ( ... );
ALTER TABLE crawl_jobs ADD CONSTRAINT ...;
EOF
The Failure: The CREATE TABLE statement never executed. The heredoc body never reached psql as multi-line input; it was consumed somewhere along the ssh → remote shell → docker exec chain instead of being delivered to psql's stdin.
The Workaround: I resorted to executing each SQL statement individually, using the -c flag for psql and carefully escaping quotes for a single-line command.
ssh user@host "docker exec -i nyxcore-db-1 psql -U postgres -d nyxcore_prod -c 'CREATE TABLE crawl_jobs (...);'"
ssh user@host "docker exec -i nyxcore-db-1 psql -U postgres -d nyxcore_prod -c 'ALTER TABLE crawl_jobs ADD CONSTRAINT ...;'"
Takeaway: When chaining remote commands and input streams, especially with ssh and docker exec, the shell's interpretation of heredocs can be surprisingly complex. For critical production changes, always prefer explicit, single-line commands or well-tested scripts.
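For future reference, one pattern that does work is keeping the statements in a local .sql file and streaming it over stdin, since ssh forwards its standard input and docker exec -i attaches it to psql. A sketch, with a hypothetical file name:
# Stream a local SQL file over stdin; ssh forwards it and docker exec -i hands it to psql
ssh user@host "docker exec -i nyxcore-db-1 psql -U postgres -d nyxcore_prod -v ON_ERROR_STOP=1" < create-crawl-jobs.sql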
Lesson 2: Prisma db push is NOT for Production (Especially with pgvector)
The Problem: We needed a new crawl_jobs table. My initial thought was to use npx prisma db push inside the production container, as it's quick and usually handles schema syncing.
The Failure (Critical!): db push is designed for rapid development iteration. When it encounters a schema mismatch, it might try to "reset" the schema to match your schema.prisma. In our case, this meant it attempted to drop columns it didn't recognize, specifically the embedding columns used by pgvector. This would have resulted in catastrophic data loss for all our existing document embeddings!
The Workaround: I immediately aborted the db push and manually wrote the CREATE TABLE and ALTER TABLE statements for the new crawl_jobs table, along with the necessary foreign key constraints.
-- Manual DDL for crawl_jobs table
CREATE TABLE "crawl_jobs" (
"id" UUID NOT NULL DEFAULT gen_random_uuid(),
"tenantId" UUID NOT NULL,
"projectId" UUID NOT NULL,
"userId" UUID NOT NULL,
"status" TEXT NOT NULL DEFAULT 'PENDING',
"startedAt" TIMESTAMP(3),
"completedAt" TIMESTAMP(3),
"error" TEXT,
"createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
"updatedAt" TIMESTAMP(3) NOT NULL,
CONSTRAINT "crawl_jobs_pkey" PRIMARY KEY ("id")
);
ALTER TABLE "crawl_jobs" ADD CONSTRAINT "crawl_jobs_tenantId_fkey" FOREIGN KEY ("tenantId") REFERENCES "Tenant"("id") ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE "crawl_jobs" ADD CONSTRAINT "crawl_jobs_projectId_fkey" FOREIGN KEY ("projectId") REFERENCES "Project"("id") ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE "crawl_jobs" ADD CONSTRAINT "crawl_jobs_userId_fkey" FOREIGN KEY ("userId") REFERENCES "User"("id") ON DELETE CASCADE ON UPDATE CASCADE;
Takeaway: NEVER use prisma db push against a production database, especially if you rely on custom types or extensions like pgvector! Use prisma migrate deploy or carefully reviewed, version-controlled DDL scripts for production schema changes. db push is a development convenience; migrate deploy is the production path.
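The safer workflow, sketched below, is to generate and commit a migration locally, then apply only committed migrations on the server; the migration name is illustrative.
# Locally: create and commit a migration describing the schema change
npx prisma migrate dev --name add_crawl_jobs
# On the server: apply only the committed, already-reviewed migrations
npx prisma migrate deploy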
Lesson 3: Health Checks from Inside the Container
The Problem: After deployment, I wanted to quickly verify the application's health. My usual curl localhost:3000/api/v1/health from the SSH host failed with "Connection refused."
The Failure: This is a classic Docker networking gotcha. localhost on the host machine doesn't reach localhost inside the container unless the container explicitly publishes the port to the host (e.g., with -p 3000:3000).
The Workaround: The fastest way to check was to execute the health check from within the application container itself.
docker exec nyxcore-app-1 node -e "fetch('http://localhost:3000/api/v1/health').then(res => res.json()).then(console.log)"
Takeaway: Understand your container networking. For quick checks, executing commands directly inside the container (docker exec) bypasses host-level networking complexities.
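If you do need to reach the service from the host, first check whether the container publishes its port at all; a quick sketch using the container name from above:
# Show which host port (if any) is mapped to the app container's port 3000
docker port nyxcore-app-1 3000
# Or list the published ports for all running containers
docker ps --format '{{.Names}}\t{{.Ports}}'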
The Road Ahead: Verifying and Securing
With the fixes deployed, the tainted documents deleted, and the new crawl_jobs table in place, the immediate next steps are crucial:
- User Re-crawl: The user needs to re-crawl the BetrVG site. This is the ultimate verification step.
- Verify Clean Text: I'll confirm that the extracted text is now clean, devoid of HTML tags, both in the raw database and in search results.
- RLS for crawl_jobs: The new crawl_jobs table currently lacks Row-Level Security (RLS) policies. This is a security gap that needs to be closed soon to guarantee data isolation between tenants (a sketch of such a policy follows below).
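As a starting point, the policy will likely resemble the sketch below; it assumes the application sets an app.current_tenant_id setting per request, which is a common pattern but not something our code does yet.
-- Hypothetical tenant-isolation policy (assumes the app sets app.current_tenant_id per request)
ALTER TABLE "crawl_jobs" ENABLE ROW LEVEL SECURITY;
CREATE POLICY "crawl_jobs_tenant_isolation" ON "crawl_jobs"
  USING ("tenantId" = current_setting('app.current_tenant_id')::uuid);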
Conclusion
This session was a stark reminder of the complexities of building and maintaining robust RAG systems. From tracking down a subtle bug in content processing to navigating the treacherous waters of production database schema changes and Docker networking, every step reinforced the importance of meticulous debugging, careful deployment, and continuous learning.
The key takeaways? Prioritize content integrity, respect your database tools' intended use-cases, and always be prepared to learn from the inevitable bumps in the road. And sometimes, the most effective fix is a simple fs.access check.
Session Summary
What got done:
- Fixed the HTML stripping bug in the document-processor
- Corrected the crawler's mimeType to text/plain
- Wrote and executed a SQL script to delete the 153 tainted documents
- Manually created the crawl_jobs table on production, with foreign keys
- Deployed the application with the fixes
- Confirmed the production health check
Pain points:
- The SSH heredoc failed to pass multi-statement SQL to psql via docker exec
- A prisma db push attempt on production nearly dropped the pgvector columns
- curl against localhost from the SSH host failed due to Docker networking
Wins:
- Single-line SQL commands with escaped quotes worked via docker exec
- The critical table and foreign keys were created manually without data loss
- Application health was verified from inside the Docker container
- The root cause of the HTML stripping bug was identified and fixed
Tech stack: TypeScript, Node.js, PostgreSQL, Prisma, Docker, RAG (Retrieval-Augmented Generation), site crawler, fs.promises, fetch API