Navigating the Production Minefield: Lessons from a Multi-Tenant Superadmin Deployment
From database migration surprises to Docker woes and critical security alerts, deploying a new multi-tenant superadmin feature to production was a rollercoaster. Here's what we learned from the trenches.
The hum of the servers, the frantic git pull followed by a docker build, and the nervous anticipation of a production deploy – it's a familiar dance for many developers. This past week, we embarked on such a journey for Nyxcore.cloud, aiming to roll out our shiny new superadmin panel and multi-tenant onboarding flow. The goal was clear: get it live, get it healthy, and get the superadmin privileges set for yours truly.
While the mission was ultimately accomplished, the path was anything but smooth. Like any real-world production deployment, it came with its share of head-scratching moments, critical security alerts, and invaluable lessons.
The Mission: Superadmin & Multi-Tenant Onboarding
Our primary objective was to deploy the core infrastructure for multi-tenancy, including:
- A Superadmin Panel: A dedicated interface for managing tenants, users, and overall system health.
- Multi-Tenant Onboarding: The initial flow for new organizations to sign up and create their isolated environment.
By the end of the session, we had successfully deployed the new features, confirmed the health of the system, and granted superadmin access to oliver.baer@gmail.com. The app was running, the database and Redis were up, and the new code was live.
Triumphs on the Battlefield: What Went Right
Amidst the chaos, several key tasks were completed smoothly:
- Next.js Suspense Fix: We hit a common Next.js production pitfall: a component that calls useSearchParams() must render inside a Suspense boundary, or the production build can fail. Wrapping our NoTenantGuard component in <Suspense> in src/app/(dashboard)/layout.tsx resolved the problem and let the layout handle cases where search parameters weren't immediately available.
- Successful Code Deployment: The new codebase, specifically commit b517329, was pulled, built, and deployed to production.
- Initial Superadmin Setup: We successfully added the isSuperAdmin column to our production users table and set the flag for the designated superadmin user.
- Health Checks Green: Our internal health endpoints confirmed both the database and Redis were up and responsive, indicating the core services were operational.
The Battle Scars: Lessons Learned in Production
This is where the real story lies – the "Pain Log" that transforms into actionable "Lessons Learned."
Lesson 1: Prisma and pgvector - A Data Loss Trap Averted
The Problem: We needed to add a simple isSuperAdmin boolean column to our users table. Naturally, the first thought was to use Prisma's migration capabilities. However, when attempting npx prisma@5.22.0 db push on production, Prisma issued a chilling warning: it wanted to DROP our embedding vector(1536) column on the workflow_insights table. This column, crucial for our AI/ML features, contained 24 non-null values and was flagged as Unsupported in Prisma's schema.
Why it Happened: Prisma, while powerful, has limitations with certain custom database types and extensions like pgvector. db push works by making the live database match the Prisma schema file, so any column it cannot reconcile with that schema looks like drift. When it sees a type it cannot fully model, it can propose "correcting" the column by dropping or recreating it, which is catastrophic for data-rich columns.
The Workaround: We immediately aborted the db push. Instead, we resorted to raw SQL to add the isSuperAdmin column:
ALTER TABLE users
ADD COLUMN IF NOT EXISTS "isSuperAdmin" BOOLEAN DEFAULT FALSE;
The Takeaway:
- NEVER use db push --accept-data-loss on production without absolute certainty of its impact. It will destroy data.
- Understand your ORM's limitations. If you're using custom database types (like pgvector), be aware that your ORM might not fully support them for schema migrations.
- Have a raw SQL fallback plan. For sensitive production database changes, raw SQL is often the safest and most precise method, especially when dealing with ORM blind spots.
- Consider a dedicated production migration workflow. For future changes, we need a robust Prisma migration strategy that doesn't conflict with pgvector columns, potentially involving manual review and custom SQL scripts integrated into the process.
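One way to make the manual review step harder to skip is to scan any SQL script for destructive statements before it goes near production. The sketch below is a minimal illustration, not our actual tooling: the function name, the Finding type, and the pattern list are all hypothetical, and the patterns are deliberately incomplete. It assumes the SQL is available as text, for example a hand-written script or one produced by prisma migrate diff with the --script flag.

```typescript
// Hypothetical pre-flight check: flag destructive statements in a migration
// script so a human reviews them before the script reaches production.

interface Finding {
  line: number;      // 1-based line number within the script
  statement: string; // the offending line, trimmed
}

// Illustrative patterns only; this list is an assumption and not exhaustive.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\bDROP\s+(TABLE|COLUMN|INDEX|SCHEMA)\b/i,
  /\bTRUNCATE\b/i,
  /\bALTER\s+COLUMN\b.*\bTYPE\b/i,
];

function findDestructiveStatements(sql: string): Finding[] {
  const findings: Finding[] = [];
  sql.split("\n").forEach((raw, index) => {
    const text = raw.trim();
    if (/^--/.test(text)) return; // skip SQL line comments
    if (DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(text))) {
      findings.push({ line: index + 1, statement: text });
    }
  });
  return findings;
}

// The additive change from this post passes the check.
const safeScript = [
  "ALTER TABLE users",
  'ADD COLUMN IF NOT EXISTS "isSuperAdmin" BOOLEAN DEFAULT FALSE;',
].join("\n");

// A statement of the kind db push wanted to run is flagged.
const riskyScript = 'ALTER TABLE "workflow_insights" DROP COLUMN "embedding";';

console.log(findDestructiveStatements(safeScript));
console.log(findDestructiveStatements(riskyScript));
```

A check like this is line-based and easy to fool, so it complements a human review rather than replacing one. Its value is in stopping the obvious cases, like a dropped embedding column, before anyone confirms a prompt.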
Lesson 2: Docker's Stale Container References
The Problem: After successfully building a new Docker image for our app service, a simple docker compose up -d app failed with: Error response from daemon: No such container.
Why it Happened: docker compose up tries to use existing container references. If you've rebuilt an image (especially with --no-cache), the underlying image ID changes, and the old container reference might become stale or invalid, leading to this confusing error. It's like Docker is looking for a specific house that you've just rebuilt on the same plot, and it can't find the old one.
The Workaround: The solution was to explicitly bring down and then bring up the service, forcing a recreation:
docker compose down app && docker compose up -d app
The Takeaway:
- When in doubt, recreate. After rebuilding Docker images, especially in a production context, docker compose down (or docker compose rm -f) followed by docker compose up -d ensures you're running a fresh container from the newly built image.
- Understand Docker Compose lifecycle. Know when up, down, stop, start, and rm are appropriate, and how they interact with image builds.
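The "recreate after rebuild" rule is easy to forget under pressure, so it can help to generate the command sequence rather than type it from memory. The sketch below is a hypothetical helper, not something we ran during this deployment. Note that it swaps in docker compose up with the --force-recreate flag instead of the down-then-up pair shown above; both aim at the same result, a fresh container from the new image. The function and option names are assumptions made for illustration.

```typescript
// Hypothetical helper: build the shell commands for redeploying one Compose
// service so its container is always recreated from the freshly built image.

interface RedeployOptions {
  service: string;   // Compose service name, e.g. "app"
  noCache?: boolean; // rebuild the image without the layer cache
}

function redeployCommands({ service, noCache = false }: RedeployOptions): string[] {
  const build = ["docker", "compose", "build", ...(noCache ? ["--no-cache"] : []), service];
  // --force-recreate replaces the container even if Compose sees no config change.
  const up = ["docker", "compose", "up", "-d", "--force-recreate", service];
  return [build.join(" "), up.join(" ")];
}

console.log(redeployCommands({ service: "app", noCache: true }).join("\n"));
```

Returning strings instead of executing anything keeps the helper safe to call anywhere, and the output can be pasted into a deploy script or a runbook.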
Lesson 3: Google Safe Browsing - The New Domain Reputation Game
The Problem: Our brand new domain, nyxcore.cloud, was flagged as "dangerous" by Google Safe Browsing.
Why it Happened: New domains, especially those that launch with new web applications, often start with a very low reputation score. This makes them susceptible to false positives from automated security scanners which might flag them until enough positive reputation is built.
The Workaround: The remedy available to us was to manually submit the domain for review via Google's Safe Browsing report error page: https://safebrowsing.google.com/safebrowsing/report_error/?url=https://nyxcore.cloud.
The Takeaway:
- Factor domain reputation into launch plans. For new domains, anticipate potential flagging issues.
- Monitor security alerts. Keep an eye on Google Search Console and other security monitoring tools for your domain.
- Be proactive. Submit for review as soon as you detect a false positive.
Lesson 4: The Grave Danger of Leaked Secrets
The Problem: During a rapid development phase, an .env file containing live API keys (OpenAI, Anthropic, JWT secret, encryption key, DB password for our mini-rag project) was accidentally pasted into a shared communication channel.
Why it Happened: Human error, often exacerbated by the pressure of quick iterations and information sharing.
The Workaround: I immediately advised the user to rotate all affected keys. The advice was casually dismissed ("vergiss das", German for "forget that"), but the critical nature of this cannot be overstated.
The Takeaway:
- CRITICAL: Assume compromise, rotate immediately. Any leak of a live secret, no matter how brief or seemingly contained, must be treated as a full compromise. Rotation is non-negotiable.
- Implement automated secret scanning. Tools like GitGuardian, GitHub Secret Scanning, or local pre-commit hooks can prevent secrets from ever reaching repositories or public channels.
- Educate your team. Continuously reinforce best practices for handling sensitive information. Never share secrets in plain text, even in "private" chats. Use secure vaults or environment variables directly.
- Least Privilege & Short-Lived Credentials: Design systems so that credentials have the minimum necessary permissions and are rotated frequently.
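To make the secret scanning idea concrete, here is a minimal sketch of the kind of check a pre-commit hook or chat bot could run on text before it is shared. It is an illustration only: the rule names, the regular expressions, and the key prefixes they assume (sk- and sk-ant-) are simplifications we picked for the example, and dedicated scanners ship far larger and better-tested rule sets. The sample values are fake.

```typescript
// Hypothetical secret scan for pasted text. Rules are illustrative, not exhaustive.

interface SecretRule {
  name: string;
  pattern: RegExp;
}

interface SecretHit {
  line: number; // 1-based line number in the scanned text
  rule: string; // name of the first rule that matched
}

const RULES: SecretRule[] = [
  { name: "Anthropic-style API key", pattern: /\bsk-ant-[A-Za-z0-9_-]{20,}/ },
  { name: "OpenAI-style API key", pattern: /\bsk-(?!ant-)[A-Za-z0-9_-]{20,}/ },
  {
    name: "Sensitive env assignment",
    pattern: /^\s*[A-Z0-9_]*(SECRET|PASSWORD|TOKEN|API_KEY|ENCRYPTION_KEY)[A-Z0-9_]*\s*=\s*\S+/,
  },
];

function scanForSecrets(text: string): SecretHit[] {
  const hits: SecretHit[] = [];
  text.split("\n").forEach((line, index) => {
    for (const rule of RULES) {
      if (rule.pattern.test(line)) {
        hits.push({ line: index + 1, rule: rule.name });
        break; // one hit per line is enough to block the message
      }
    }
  });
  return hits;
}

// Fake values that only mimic the shape of a pasted .env file.
const pastedEnv = [
  "NODE_ENV=production",
  "OPENAI_API_KEY=sk-EXAMPLEONLY0000000000000000",
  "JWT_SECRET=not-a-real-secret",
].join("\n");

console.log(scanForSecrets(pastedEnv));
```

A scan like this only reduces the odds of a leak. Once a live secret has been pasted anywhere, the rule above still applies: rotate it.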
Current Production State
As of the last check:
- Commit: b517329 is live.
- Health: {"status":"healthy","checks":{"database":true,"redis":true}}
- Superadmin: isSuperAdmin = true for oliver.baer@gmail.com.
- Services: Postgres, Redis, App, Nginx containers are all running.
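The health payload above is simple enough to verify automatically after each deploy instead of reading it by eye. The sketch below parses a response body of that shape and lists any checks that are not passing. The HealthReport type and the failedChecks function are hypothetical names; only the JSON shape comes from our endpoint's output, and how the body is fetched is left out on purpose.

```typescript
// Hypothetical post-deploy gate: list failing checks in a health response body.

interface HealthReport {
  status: string;
  checks: Record<string, boolean>;
}

function failedChecks(body: string): string[] {
  const report = JSON.parse(body) as HealthReport;
  // Any check that is not exactly true counts as failed.
  const failed = Object.keys(report.checks).filter((name) => report.checks[name] !== true);
  // Also catch a non-healthy status that is not explained by an individual check.
  if (report.status !== "healthy" && failed.length === 0) {
    failed.push("status");
  }
  return failed;
}

const liveBody = '{"status":"healthy","checks":{"database":true,"redis":true}}';
console.log(failedChecks(liveBody)); // prints []
```

A deploy script could treat a non-empty result as a reason to stop and investigate before moving on to the next step.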
What's Next on the Horizon
With the initial deployment behind us, here are the immediate next steps:
- Verify Superadmin Panel: Log in to nyxcore.cloud and confirm the "Superadmin" link appears in the sidebar and the tenant switcher works in the header.
- Test Tenant Creation: Create a test tenant via the superadmin panel, invite a test user, and verify complete isolation between tenants.
- Google Safe Browsing Review: Follow up on the submission for nyxcore.cloud.
- Rotate Mini-RAG Secrets: Reiterate and ensure the rotation of OpenAI key, Anthropic key, JWT secret, encryption key, and DB password. This is paramount.
- Refine Prisma Migration Workflow: Begin planning and implementing a production-safe Prisma migration workflow that respects and doesn't conflict with pgvector columns.
Conclusion
Production deployments are rarely just about code. They're about anticipating infrastructure quirks, understanding ORM limitations, navigating external reputation systems, and most critically, maintaining ironclad security practices. Each "pain" point was a learning opportunity, reinforcing the need for diligence, robust testing, and a healthy dose of paranoia when dealing with live systems. Here's to smoother sailing, but always being prepared for the next wave!
{
"thingsDone": [
"Fixed useSearchParams() Suspense issue in Next.js layout with <Suspense>",
"Committed and pushed b517329 to origin/main",
"Deployed to production (git pull, docker build, container restart)",
"Added isSuperAdmin column to production DB via raw SQL",
"Set isSuperAdmin = true for oliver.baer@gmail.com on production DB",
"Confirmed DB + Redis up and app running"
],
"pains": [
"Prisma db push attempting to DROP pgvector embedding column due to 'Unsupported' type",
"Docker compose up failing with 'No such container' after image rebuild",
"nyxcore.cloud flagged as 'dangerous' by Google Safe Browsing",
"Accidental leak of live API keys and secrets (OpenAI, Anthropic, JWT, encryption, DB password)"
],
"successes": [
"Successful deployment of core multi-tenant and superadmin features",
"Effective workaround for Prisma pgvector incompatibility using raw SQL",
"Resolution of Docker stale container reference issue",
"Identification and immediate action on Google Safe Browsing flag",
"Confirmation of production system health and superadmin access"
],
"techStack": [
"Next.js",
"Prisma",
"PostgreSQL",
"pgvector",
"Redis",
"Docker",
"Docker Compose",
"Nginx",
"OpenAI API",
"Anthropic API",
"JWT"
]
}