nyxcore-systems

From Tainted Tags to Pristine Pages: Debugging Our Site Crawler & Lessons Learned

A deep dive into fixing a critical HTML stripping bug in our site crawler, the challenges of production database migrations, and the invaluable lessons learned along the way.

web-crawling · prisma · postgresql · typescript · devops · debugging · lessons-learned

Every developer knows the feeling: you push a feature, it seems perfect in testing, and then bam—something unexpected happens in production. Our recent adventure involved battling phantom HTML tags, cleaning up tainted data, and navigating a few database migration minefields, all to ensure our site crawler delivers pristine, plain-text content.

This post chronicles a recent development session, detailing the bug, the fix, the cleanup, and some hard-won lessons from deploying changes to a live system.

The Unwanted Encore: HTML's Persistent Presence

Our site crawler is designed to fetch web pages, extract meaningful text, and store it for RAG (Retrieval-Augmented Generation) purposes. The critical part of this process is stripping away all the noise—the HTML tags, scripts, and styling—leaving only the pure content.
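As a rough illustration of that stripping step (the production code is more involved; `stripHtml` here is a hypothetical helper, not our actual implementation):

```typescript
// Hedged sketch of HTML stripping; the real crawler handles many more cases
// (entities, encoding, block-level spacing, etc.).
function stripHtml(html: string): string {
  return html
    // Drop script and style blocks entirely, including their contents
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, '')
    // Replace all remaining tags with a space
    .replace(/<[^>]+>/g, ' ')
    // Collapse the whitespace left behind by removed markup
    .replace(/\s+/g, ' ')
    .trim();
}
```

For example, `stripHtml('<p>Hello <b>world</b></p>')` yields the plain text `Hello world`.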

The Problem: We started noticing raw HTML tags appearing in our extracted documents, specifically within the BetrVG (German Works Constitution Act) documents. This was a significant issue, as it meant our RAG system would be indexing and searching against noisy, unformatted data.

The root cause was subtle:

  1. Our crawler would fetch a page from sourceUrl.
  2. It would then process and strip the HTML, saving the clean text to a local file.
  3. However, during later processing stages (chunking, embedding), our document-processor re-fetched content from the sourceUrl whenever it didn't first check for the local file, re-introducing the raw HTML and effectively undoing the initial stripping!

The Fix: The solution involved a crucial change in our document-processor. We modified the content loading logic (src/server/services/rag/document-processor.ts) to prioritize checking for the local, already-processed file on disk using fs.access before attempting to re-fetch from the sourceUrl. This ensures we always work with the clean, stripped content.

typescript
// Simplified logic for illustration
import { constants, promises as fs } from 'node:fs';

async function processDocument(documentPath: string, sourceUrl: string) {
  let content: string;
  try {
    // Prefer the local, already-stripped file on disk
    await fs.access(documentPath, constants.F_OK);
    content = await fs.readFile(documentPath, 'utf8');
  } catch {
    // Local file missing: fetch and strip from sourceUrl.
    // This path should only be taken on the initial crawl, never on re-processing.
    content = await fetchAndStripHtml(sourceUrl);
    await fs.writeFile(documentPath, content); // Persist the stripped content
  }
  // ... further processing of 'content'
}

Additionally, a minor but important adjustment was made in src/server/services/rag/site-crawler-service.ts to correctly set the mimeType to text/plain for content that had already been extracted. This reinforces that the data we're handling is plain text, not raw HTML.
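For illustration, the adjustment amounts to something like the following (`CrawledDocument` and `setMimeTypeForExtracted` are simplified, hypothetical names, not the actual identifiers in site-crawler-service.ts):

```typescript
// Hedged sketch: illustrative types and names, not the real service code.
interface CrawledDocument {
  sourceUrl: string;
  content: string;   // already-stripped plain text
  mimeType: string;
}

// Once HTML has been stripped, the stored content is plain text,
// regardless of the Content-Type the server originally reported.
function setMimeTypeForExtracted(doc: CrawledDocument): CrawledDocument {
  return { ...doc, mimeType: 'text/plain' };
}
```

Downstream consumers can then trust the mimeType field instead of re-inspecting (or re-fetching) the source.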

Cleaning House: Deleting the Tainted Past

With the bug squashed, we faced a new challenge: 153 BetrVG documents had already been crawled and saved with the unwanted HTML. These needed to be purged from production.

We crafted a simple SQL script (scripts/cleanup-crawled-html.sql) to target and delete these specific documents. Thanks to well-defined foreign key constraints with ON DELETE CASCADE, deleting the parent document records automatically removed their associated chunks, ensuring a clean slate.

sql
-- Example SQL for cleanup (actual script was more specific)
DELETE FROM "documents"
WHERE "projectId" = '4c445cc1-ceb0-49c0-9e41-aaae419cb4bb' -- BetrVG project ID
  AND "sourceUrl" LIKE 'https://www.gesetze-im-internet.de/betrvg/%';

Navigating the Database Minefield: Lessons from Production

Beyond the bug fix and data cleanup, we also needed to introduce a new crawl_jobs table to manage our crawling operations more robustly. This led to a few critical lessons about managing databases in production:

Lesson 1: NEVER prisma db push on Production

Our application uses Prisma as its ORM, and while prisma migrate is the standard path for schema changes, db push can seem tempting for quick table additions during development. However, running npx prisma@5.22.0 db push inside our production container proved to be a near-disastrous mistake.

The Pitfall: db push is designed for rapid prototyping and development environments. It attempts to synchronize the database schema with the Prisma schema by potentially dropping and recreating tables or columns. In our case, it would have dropped the crucial embedding columns, which are of the pgvector type, on our existing tables. Losing these columns means losing all our vector embeddings—a catastrophic data loss!

The Workaround: The only safe approach for production schema changes, especially when dealing with custom column types or existing data, is to write explicit SQL DDL statements. We manually created the crawl_jobs table and added the necessary foreign key constraints (tenantId, projectId, userId with ON DELETE CASCADE) directly via docker exec psql.

sql
-- Manual SQL for creating the crawl_jobs table
CREATE TABLE "crawl_jobs" (
    "id" UUID NOT NULL DEFAULT gen_random_uuid(),
    "createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
    "updatedAt" TIMESTAMP(3) NOT NULL,
    "tenantId" UUID NOT NULL,
    "projectId" UUID NOT NULL,
    "userId" UUID NOT NULL,
    "status" TEXT NOT NULL DEFAULT 'PENDING',
    "url" TEXT NOT NULL,
    "recursive" BOOLEAN NOT NULL DEFAULT FALSE,
    "maxDepth" INTEGER,
    "processedDocuments" INTEGER NOT NULL DEFAULT 0,
    "failedDocuments" INTEGER NOT NULL DEFAULT 0,
    "errorMessage" TEXT,
    CONSTRAINT "crawl_jobs_pkey" PRIMARY KEY ("id")
);

ALTER TABLE "crawl_jobs" ADD CONSTRAINT "crawl_jobs_tenantId_fkey" FOREIGN KEY ("tenantId") REFERENCES "tenants"("id") ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE "crawl_jobs" ADD CONSTRAINT "crawl_jobs_projectId_fkey" FOREIGN KEY ("projectId") REFERENCES "projects"("id") ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE "crawl_jobs" ADD CONSTRAINT "crawl_jobs_userId_fkey" FOREIGN KEY ("userId") REFERENCES "users"("id") ON DELETE CASCADE ON UPDATE CASCADE;

This experience solidified the importance of careful, explicit migration strategies for production databases.

Lesson 2: Taming Remote SQL Execution with docker exec

When executing multi-statement SQL scripts remotely via SSH and docker exec psql, we ran into an unexpected hurdle: using a heredoc (<<EOF) to pipe the script failed. The SQL statements simply weren't being passed through the SSH+docker exec chain.

The Pitfall: When a heredoc sits inside a quoted ssh command string, quoting and stdin handling across the ssh and docker exec -i layers interact in non-obvious ways; in our case, the statements never reached psql at all.

The Workaround: For a quick, multi-statement execution, the most reliable method was to use single-line -c flags for psql, carefully escaping any quotes within the SQL statements.

bash
# Failed attempt (simplified)
ssh user@host "docker exec -i nyxcore-db-1 psql -U user -d db_name <<EOF
  CREATE TABLE ...;
  ALTER TABLE ...;
EOF"

# Successful workaround (simplified)
ssh user@host 'docker exec nyxcore-db-1 psql -U user -d db_name -c "CREATE TABLE \"crawl_jobs\" (...); ALTER TABLE \"crawl_jobs\" ADD CONSTRAINT ..."'

While less elegant than a script, this direct approach ensures commands are correctly interpreted. For more complex migrations, dedicated migration tools are always preferred.

Lesson 3: Debugging Health Checks in Dockerized Environments

After deployment, a quick health check is standard procedure. Our initial attempt: curl localhost:3000 from the SSH host.

The Pitfall: Connection refused. The application runs inside a Docker container, and localhost:3000 on the host only reaches it if the port is explicitly published (e.g. -p 3000:3000); otherwise nothing is listening on the host side.

The Workaround: The most direct way to check the health of an application inside its container is to execute the check from within the container itself.

bash
docker exec nyxcore-app-1 node -e "fetch('http://localhost:3000/api/v1/health').then(res => res.json()).then(console.log)"

This confirmed the application was indeed healthy and responding correctly within its isolated environment.

The Road to Recovery: Deployment & Verification

With all fixes and new features implemented, data cleaned, and lessons learned, we completed the deployment:

  • Committed (0957b1b), pushed to main.
  • Built a new Docker image.
  • Deployed and restarted the application.
  • Verified health checks were passing ({"status":"healthy","checks":{"database":true,"redis":true}}).

The crawl_jobs table is now live, and all 153 tainted BetrVG documents have been deleted, ready for a fresh, clean re-crawl.

What's Next? Solidifying the Foundation

Our immediate next steps involve:

  1. User Re-crawl: The user will re-crawl https://www.gesetze-im-internet.de/betrvg/ via the Axiom tab. We'll verify that the extracted text is now clean, free of HTML tags.
  2. Search Verification: Confirm that Axiom search results display plain text, not raw HTML.
  3. Crawler Plan: Continue with remaining site crawler tasks, including integration tests.
  4. Security Enhancement: Consider adding Row-Level Security (RLS) policies for the new crawl_jobs table to ensure proper access control.
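For steps 1 and 2, a lightweight check like the following could flag residual markup in extracted text or search results (a hypothetical helper, not part of our codebase):

```typescript
// Hedged sketch: flags text that still looks like it contains HTML markup.
// A match is a strong hint that the stripping step was bypassed again.
function containsHtmlTags(text: string): boolean {
  // Look for tag-like patterns such as <div class="x"> or </p>,
  // while ignoring bare comparisons like "a < b".
  return /<\/?[a-z][a-z0-9-]*(\s[^<>]*)?>/i.test(text);
}
```

Running this over the re-crawled BetrVG documents should return false for every chunk if the fix holds.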

This session was a stark reminder that even seemingly small bugs can have significant data integrity implications, and that production deployments always come with their own unique set of challenges and learning opportunities. Every "pain log" entry is a step towards a more robust and resilient system.