Navigating the Production Minefield: Lessons from a Multi-Tenant Superadmin Deployment

The hum of the servers, the frantic git pull followed by a docker build, and the nervous anticipation of a production deploy – it's a familiar dance for many developers. This past week, we embarked on such a journey for Nyxcore.cloud, aiming to roll out our shiny new superadmin panel and multi-tenant onboarding flow. The goal was clear: get it live, get it healthy, and get the superadmin privileges set for yours truly.

While the mission was ultimately accomplished, the path was anything but smooth. Like any real-world production deployment, it came with its share of head-scratching moments, critical security alerts, and invaluable lessons.

The Mission: Superadmin & Multi-Tenant Onboarding

Our primary objective was to deploy the core infrastructure for multi-tenancy, including:

A Superadmin Panel: A dedicated interface for managing tenants, users, and overall system health.
Multi-Tenant Onboarding: The initial flow for new organizations to sign up and create their isolated environment.

By the end of the session, we had successfully deployed the new features, confirmed the health of the system, and granted superadmin access to oliver.baer@gmail.com. The app was running, the database and Redis were up, and the new code was live.

Triumphs on the Battlefield: What Went Right

Amidst the chaos, several key tasks were completed smoothly:

Next.js Suspense Fix: We hit a common Next.js production pitfall: a Client Component that calls useSearchParams() needs a <Suspense> boundary above it, otherwise the production build can fail while prerendering the route. Wrapping our NoTenantGuard component in <Suspense> in src/app/(dashboard)/layout.tsx resolved the problem and restored production build compatibility. The layout now shows the Suspense fallback until the search parameters are available on the client.
Successful Code Deployment: The new codebase, specifically commit b517329, was pulled, built, and deployed to production.
Initial Superadmin Setup: We successfully added the isSuperAdmin column to our production users table and set the flag for the designated superadmin user.
Health Checks Green: Our internal health endpoints confirmed both the database and Redis were up and responsive, indicating the core services were operational.

The Battle Scars: Lessons Learned in Production

This is where the real story lies – the "Pain Log" that transforms into actionable "Lessons Learned."

Lesson 1: Prisma and `pgvector` - A Data Loss Trap Averted

The Problem: We needed to add a simple isSuperAdmin boolean column to our users table, and the obvious route was Prisma. But when we ran npx prisma@5.22.0 db push against production, Prisma issued a chilling warning: it wanted to DROP the embedding vector(1536) column on the workflow_insights table. That column is crucial for our AI/ML features, held 24 non-null values, and uses a type that Prisma can only represent as Unsupported in its schema.

Why it Happened: db push does not replay reviewed migrations. It compares the live database with schema.prisma and changes the database until the two match. Prisma has no native type for pgvector, so a vector column can only be declared as Unsupported("vector(1536)"). If that declaration is missing, or does not line up with what Prisma sees in the database, the column most likely looks like drift that should be removed. For a column full of embeddings, "correcting" that drift is catastrophic.

The Workaround: We immediately aborted the db push. Instead, we resorted to raw SQL to add the isSuperAdmin column:

```sql
ALTER TABLE users
ADD COLUMN IF NOT EXISTS "isSuperAdmin" BOOLEAN DEFAULT FALSE;
```

The Takeaway:

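NEVER run db push --accept-data-loss against production unless you have read the warning and are certain every listed change is intended. The flag does exactly what its name says: whatever Prisma listed is dropped, data included.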
Understand your ORM's limitations. If you're using custom database types (like pgvector), be aware that your ORM might not fully support them for schema migrations.
Have a raw SQL fallback plan. For sensitive production database changes, raw SQL is often the safest and most precise method, especially when dealing with ORM blind spots.
Consider a dedicated production migration workflow. For future changes, we need a robust Prisma migration strategy that doesn't conflict with pgvector columns, potentially involving manual review and custom SQL scripts integrated into the process.
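One way to make the "manual review" step concrete is to generate the SQL Prisma intends to run (for example with prisma migrate diff and its --script option) and refuse to apply it automatically if it contains destructive statements. The following is a minimal sketch of such a guard, not part of Prisma itself: the function name, the pattern list, and the sample diff are assumptions made for illustration.

```typescript
// Patterns for statements that can destroy data. Illustrative, not exhaustive.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\bDROP\s+(TABLE|COLUMN|INDEX|TYPE|SCHEMA)\b/i,
  /\bTRUNCATE\b/i,
  /\bALTER\s+COLUMN\b[^;]*\bTYPE\b/i, // type changes can rewrite or lose data
];

// Split a SQL script into statements and return the ones that look destructive.
// A naive split on ";" is good enough for generated DDL, not for arbitrary SQL.
function findDestructiveStatements(sqlScript: string): string[] {
  return sqlScript
    .split(";")
    .map((statement) => statement.replace(/--.*$/gm, "").trim()) // drop line comments
    .filter((statement) => statement.length > 0)
    .filter((statement) =>
      DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(statement))
    );
}

// Hypothetical diff resembling our situation: one safe change, one destructive one.
const exampleDiff = `
-- AlterTable
ALTER TABLE "users" ADD COLUMN "isSuperAdmin" BOOLEAN NOT NULL DEFAULT false;
-- AlterTable
ALTER TABLE "workflow_insights" DROP COLUMN "embedding";
`;

const flagged = findDestructiveStatements(exampleDiff);
if (flagged.length > 0) {
  console.error("Refusing to apply automatically. Review these statements first:");
  flagged.forEach((statement) => console.error("  " + statement));
}
```

With the sample input, only the DROP COLUMN statement on workflow_insights is flagged, while the additive isSuperAdmin change passes. A check like this does not replace a human reading the diff; it only stops the deploy script from getting there first.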

Lesson 2: Docker's Stale Container References

The Problem: After successfully building a new Docker image for our app service, a simple docker compose up -d app failed with: Error response from daemon: No such container.

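Why it Happened: docker compose up reuses what it already knows about a service: it looks up the existing container and recreates it in place. A new image ID on its own normally just triggers a clean recreate, so the rebuild (we had used --no-cache) was probably the trigger rather than the root cause. The error means the container Compose expected was no longer there, for example because it had been removed or replaced outside of Compose in the meantime. It's like Docker is looking for a specific house that you've just rebuilt on the same plot, and it can't find the old one.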

The Workaround: The solution was to explicitly bring down and then bring up the service, forcing a recreation:

```bash
docker compose down app && docker compose up -d app
```

The Takeaway:

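When in doubt, recreate. After rebuilding an image, especially in a production context, docker compose up -d --force-recreate app, or docker compose down app (or docker compose rm -f app) followed by docker compose up -d app, ensures you're running a fresh container from the newly built image. Always name the service: docker compose down without one stops and removes the entire stack, including the database and Redis containers.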
Understand Docker Compose lifecycle. Know when up, down, stop, start, and rm are appropriate, and how they interact with image builds.
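The recovery we did by hand can be written down as a small fallback in a deploy script. The sketch below assumes a hypothetical runner function that executes a docker compose subcommand and throws an Error whose message contains Docker's output; redeployService, ComposeRunner, and the fake runner are illustrative names, not an existing API.

```typescript
// Runs one `docker compose <args>` command and throws an Error if it fails.
type ComposeRunner = (args: string[]) => void;

// Try a normal `up -d` first. If Docker reports a missing container,
// bring the service down and create it fresh, as we did manually.
function redeployService(service: string, run: ComposeRunner): string[][] {
  const executed: string[][] = [];
  const exec = (args: string[]): void => {
    executed.push(args);
    run(args);
  };

  try {
    exec(["up", "-d", service]);
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    if (!/No such container/.test(message)) {
      throw error; // unrelated failure: surface it instead of masking it
    }
    exec(["down", service]);
    exec(["up", "-d", service]);
  }
  return executed;
}

// Fake runner for demonstration: the first `up` fails the way our deploy did.
let upAttempts = 0;
const fakeRunner: ComposeRunner = (args) => {
  if (args[0] === "up") {
    upAttempts += 1;
    if (upAttempts === 1) {
      throw new Error("Error response from daemon: No such container");
    }
  }
};

const commands = redeployService("app", fakeRunner);
console.log(commands.map((args) => "docker compose " + args.join(" ")).join("\n"));
```

In a real script the runner would wrap something like Node's child_process and would need to put Docker's stderr into the thrown error's message. Passing the runner in as a parameter keeps the fallback logic testable without a Docker daemon.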

Lesson 3: Google Safe Browsing - The New Domain Reputation Game

The Problem: Our brand new domain, nyxcore.cloud, was flagged as "dangerous" by Google Safe Browsing.

Why it Happened: New domains, especially those that launch with new web applications, often start with a very low reputation score. This makes them susceptible to false positives from automated security scanners which might flag them until enough positive reputation is built.

The Workaround: There is nothing to fix in the code for a false positive. The most direct remedy is to manually submit the domain for review via Google's Safe Browsing report error page: https://safebrowsing.google.com/safebrowsing/report_error/?url=https://nyxcore.cloud.

The Takeaway:

Factor domain reputation into launch plans. For new domains, anticipate potential flagging issues.
Monitor security alerts. Keep an eye on Google Search Console and other security monitoring tools for your domain.
Be proactive. Submit for review as soon as you detect a false positive.

Lesson 4: The Grave Danger of Leaked Secrets

The Problem: During a rapid development phase, an .env file containing live API keys (OpenAI, Anthropic, JWT secret, encryption key, DB password for our mini-rag project) was accidentally pasted into a shared communication channel.

Why it Happened: Human error, often exacerbated by the pressure of quick iterations and information sharing.

The Workaround: I immediately advised the user to rotate all affected keys. Despite a casual dismissal ("vergiss das" - forget that), the critical nature of this cannot be overstated.

The Takeaway:

CRITICAL: Assume compromise, rotate immediately. Any leak of a live secret, no matter how brief or seemingly contained, must be treated as a full compromise. Rotation is non-negotiable.
Implement automated secret scanning. Tools like GitGuardian, GitHub Secret Scanning, or local pre-commit hooks can prevent secrets from ever reaching repositories or public channels.
Educate your team. Continuously reinforce best practices for handling sensitive information. Never share secrets in plain text, even in "private" chats. Use secure vaults or environment variables directly.
Least Privilege & Short-Lived Credentials: Design systems so that credentials have the minimum necessary permissions and are rotated frequently.
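To give a feel for what a local pre-commit check does, here is a minimal sketch of a secret scanner. It is no substitute for the dedicated tools named above: the patterns are rough assumptions about what API keys and secret assignments tend to look like, and scanForSecrets and the sample values are made up for illustration.

```typescript
// Illustrative rules only; real scanners ship far larger, maintained rule sets.
const SECRET_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "Anthropic-style API key", pattern: /\bsk-ant-[A-Za-z0-9_-]{20,}/ },
  { name: "OpenAI-style API key", pattern: /\bsk-(?!ant-)[A-Za-z0-9_-]{20,}/ },
  {
    name: "Secret-looking assignment",
    pattern: /\b[A-Z0-9_]*(SECRET|PASSWORD|API_KEY|ENCRYPTION_KEY)[A-Z0-9_]*\s*=\s*\S{8,}/,
  },
];

interface Finding {
  line: number; // 1-based line number in the scanned text
  rule: string; // name of the first rule that matched
}

// Check every line against every rule and report the first match per line.
function scanForSecrets(text: string): Finding[] {
  const findings: Finding[] = [];
  text.split(/\r?\n/).forEach((content, index) => {
    for (const { name, pattern } of SECRET_PATTERNS) {
      if (pattern.test(content)) {
        findings.push({ line: index + 1, rule: name });
        break; // one finding per line is enough to block the commit
      }
    }
  });
  return findings;
}

// Obviously fake values, shaped like the kind of .env file that leaked.
const sampleEnv = [
  "NODE_ENV=production",
  "OPENAI_API_KEY=sk-FAKEFAKEFAKEFAKEFAKEFAKE1234",
  "JWT_SECRET=not-a-real-secret-value",
  "PORT=3000",
].join("\n");

const findings = scanForSecrets(sampleEnv);
findings.forEach((f) => console.log("line " + f.line + ": " + f.rule));
```

For the sample input, lines 2 and 3 are reported and the harmless settings are ignored. Wired into a pre-commit hook, the same function would scan the output of git diff --cached and exit non-zero on any finding. It cannot stop someone pasting into a chat, which is why rotation remains the only real fix after a leak.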

Current Production State

As of the last check:

Commit: b517329 is live.
Health: {"status":"healthy","checks":{"database":true,"redis":true}}
Superadmin: isSuperAdmin = true for oliver.baer@gmail.com.
Services: Postgres, Redis, App, Nginx containers are all running.
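A payload like the one above is easy to check mechanically after each deploy. The following sketch assumes the response always has the shape shown (a status string and a map of boolean checks); failingChecks and the "degraded" sample are hypothetical, since the post only shows the healthy case.

```typescript
// Assumed shape of the health endpoint response, based on the payload above.
interface HealthResponse {
  status: string;
  checks: Record<string, boolean>;
}

// Parse the response body and return the names of all failing checks.
// An empty result means every check passed and the status is "healthy".
function failingChecks(body: string): string[] {
  const parsed = JSON.parse(body) as HealthResponse;
  const failing = Object.keys(parsed.checks).filter(
    (name) => !parsed.checks[name]
  );
  if (parsed.status !== "healthy" && failing.length === 0) {
    failing.push("status"); // unhealthy overall without a specific failed check
  }
  return failing;
}

const liveBody = '{"status":"healthy","checks":{"database":true,"redis":true}}';
// Hypothetical failure case for comparison.
const degradedBody = '{"status":"degraded","checks":{"database":true,"redis":false}}';

console.log(failingChecks(liveBody));     // []
console.log(failingChecks(degradedBody)); // [ 'redis' ]
```

A deploy script could call this right after the containers come up and abort, or roll back, when the returned list is not empty.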

What's Next on the Horizon

With the initial deployment behind us, here are the immediate next steps:

Verify Superadmin Panel: Log in to nyxcore.cloud and confirm the "Superadmin" link appears in the sidebar and the tenant switcher works in the header.
Test Tenant Creation: Create a test tenant via the superadmin panel, invite a test user, and verify complete isolation between tenants.
Google Safe Browsing Review: Follow up on the submission for nyxcore.cloud.
Rotate Mini-RAG Secrets: Reiterate and ensure the rotation of OpenAI key, Anthropic key, JWT secret, encryption key, and DB password. This is paramount.
Refine Prisma Migration Workflow: Begin planning and implementing a production-safe Prisma migration workflow that respects and doesn't conflict with pgvector columns.

Conclusion

Production deployments are rarely just about code. They're about anticipating infrastructure quirks, understanding ORM limitations, navigating external reputation systems, and most critically, maintaining ironclad security practices. Each "pain" point was a learning opportunity, reinforcing the need for diligence, robust testing, and a healthy dose of paranoia when dealing with live systems. Here's to smoother sailing, but always being prepared for the next wave!

```json
{
  "thingsDone": [
    "Fixed useSearchParams() Suspense issue in Next.js layout with <Suspense>",
    "Committed and pushed b517329 to origin/main",
    "Deployed to production (git pull, docker build, container restart)",
    "Added isSuperAdmin column to production DB via raw SQL",
    "Set isSuperAdmin = true for oliver.baer@gmail.com on production DB",
    "Confirmed DB + Redis up and app running"
  ],
  "pains": [
    "Prisma db push attempting to DROP pgvector embedding column due to 'Unsupported' type",
    "Docker compose up failing with 'No such container' after image rebuild",
    "nyxcore.cloud flagged as 'dangerous' by Google Safe Browsing",
    "Accidental leak of live API keys and secrets (OpenAI, Anthropic, JWT, encryption, DB password)"
  ],
  "successes": [
    "Successful deployment of core multi-tenant and superadmin features",
    "Effective workaround for Prisma pgvector incompatibility using raw SQL",
    "Resolution of Docker stale container reference issue",
    "Identification and immediate action on Google Safe Browsing flag",
    "Confirmation of production system health and superadmin access"
  ],
  "techStack": [
    "Next.js",
    "Prisma",
    "PostgreSQL",
    "pgvector",
    "Redis",
    "Docker",
    "Docker Compose",
    "Nginx",
    "OpenAI API",
    "Anthropic API",
    "JWT"
  ]
}
```
