From Bugs to Builds: My Journey Through a Full-Stack Development Sprint
Join me as I recount a recent intense development session, tackling everything from a tricky dual-provider bug to setting up robust CI/CD and even getting an AI to audit my architectural plans.
Every now and then, you have one of those development sessions that feels like a full-blown sprint. You go in with a laundry list of items, some critical, some quality-of-life, and emerge hours later, blinking, with a sense of immense satisfaction (and a few new battle scars). This was one of those days.
I often write a "Letter to Myself" as a session handoff – a raw brain dump of what got done, what hurt, and what's next. It's a snapshot of the developer's mind. Today, I'm sharing a cleaned-up version of that snapshot, hoping my experiences, triumphs, and stumbles can offer some insights for your own journey.
The Core Mission: Intelligent Provider Routing
One of the cornerstone features of my platform involves intelligently routing requests to different AI providers based on various criteria (cost, performance, specific capabilities). This is managed by a "dual-provider gate" within our workflow engine.
The Bug: The system was designed to compare multiple providers, but a subtle bug emerged when compareProviders was active, yet only one provider was actually configured. The engine would get stuck, waiting for a comparison that couldn't happen.
The Fix: A simple but crucial condition check. I updated workflow-engine.ts:1292 to ensure the alternatives block only fires if step.compareProviders.length > 1. This prevents unnecessary waiting and ensures the system gracefully handles single-provider scenarios even when the "compare" flag is set.
// workflow-engine.ts:1292 (simplified)
if (step.compareProviders && step.compareProviders.length > 1) {
  // ... proceed with dual-provider comparison logic ...
} else {
  // ... fall back to single provider or default logic ...
}
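To make the guard easy to test in isolation, here is a minimal, self-contained sketch. The WorkflowStep shape and the shouldCompareProviders helper are hypothetical names for illustration; only the compareProviders field and the length > 1 check come from the actual fix.

```typescript
// Hypothetical step shape; only `compareProviders` comes from the fix above.
interface WorkflowStep {
  id: string;
  compareProviders?: string[];
}

// Returns true only when there are at least two providers to compare.
function shouldCompareProviders(step: WorkflowStep): boolean {
  return (step.compareProviders?.length ?? 0) > 1;
}

const single: WorkflowStep = { id: "s1", compareProviders: ["anthropic"] };
const dual: WorkflowStep = { id: "s2", compareProviders: ["anthropic", "google"] };

console.log(shouldCompareProviders(single)); // false: take the single-provider path
console.log(shouldCompareProviders(dual)); // true: a comparison is possible
```

Pulling the condition into a small predicate like this also covers the missing and empty-array cases, which the engine should treat the same way as a single provider.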
Production Test: The real test is always production. I spun up workflow a0a00001-0001-4001-8001-d00000000002 ("Dual-Provider Test v2"), pitting Anthropic against Google. Cael (my AI orchestrator) correctly selected Anthropic. Seeing the caelReview and selectedProvider fields populate correctly in the checkpoint was a mini-celebration. Total cost: $0.0217 for 4,996 tokens in 23.5 seconds. Sweet.
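For a rough sense of scale, the reported figures can be turned into rates. This is plain division over the numbers above; it is a blended figure for the whole run, not a per-provider price.

```typescript
// Figures reported for the production test run.
const costUsd = 0.0217;
const tokens = 4996;
const seconds = 23.5;

// Derived rates (simple division, no provider pricing assumed).
const usdPerMillionTokens = (costUsd / tokens) * 1_000_000;
const tokensPerSecond = tokens / seconds;

console.log(usdPerMillionTokens.toFixed(2)); // "4.34"
console.log(tokensPerSecond.toFixed(1)); // "212.6"
```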
Streamlining Deployments: CI/CD & Certbot Automation
Manual deployments are a drag, and certificate expiry is a silent killer. Tackling these two was high on the list.
Building a CI/CD Pipeline
My goal was simple: push to main, and let GitHub Actions handle the rest. No more SSHing in, pulling, building, and restarting Docker containers by hand.
The Setup: I created .github/workflows/deploy.yml. It's a standard flow:
- Triggers on a push to main after CI passes.
- SSHes into my Hetzner server.
- Pulls the latest code.
- Builds the Docker image for the app service.
- Recreates the application containers using docker compose.
- Runs a health check to ensure everything's up and running.
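The last step deserves a closer look, since freshly recreated containers often need a few seconds before they respond. Below is a minimal sketch of a retrying health check. It is not the check from my workflow: waitForHealthy, Probe, and httpProbe are hypothetical names, and the retry counts are arbitrary defaults.

```typescript
// Hypothetical probe type: resolves to true when the service reports healthy.
type Probe = () => Promise<boolean>;

// Retries the probe until it succeeds or the attempts run out.
async function waitForHealthy(
  probe: Probe,
  attempts = 5,
  delayMs = 2000,
): Promise<boolean> {
  for (let i = 1; i <= attempts; i++) {
    try {
      if (await probe()) return true;
    } catch {
      // Treat network errors as "not healthy yet" and keep retrying.
    }
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe built on the global fetch available in Node 18+.
// Any URL passed in is up to the caller; none is assumed here.
const httpProbe = (url: string): Probe => async () => (await fetch(url)).ok;
```

A deploy script could call waitForHealthy(httpProbe(url)) with whatever health endpoint the app exposes and exit non-zero when it returns false, so a failed deployment fails the workflow run.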
Crucially, I set up GitHub secrets: DEPLOY_HOST, DEPLOY_USER, and DEPLOY_SSH_KEY. The DEPLOY_SSH_KEY is an ed25519 deploy key generated directly on the server (in ~/.ssh/deploy_key) for secure, passwordless access.
Lesson Learned (GitHub Environments): I initially tried to use a GitHub environment with wait_timer: 0 as a protection rule. It turns out the free plan doesn't support environment protection rules on private repositories. A quick pivot was needed: I removed the environment: production declaration from the workflow and now rely on repo-level secrets instead. Sometimes you just have to work with what you've got.
Automating Certbot Renewal
Let's Encrypt certificates are fantastic, but manual renewal is a recipe for disaster. I needed proper automation.
The Challenge: My existing Certbot setup used the standalone authenticator, which requires temporarily stopping the web server to bind to port 80. Not ideal for a continuously running service.
The Solution: Switch to the webroot authenticator. This method serves a challenge file from a specific directory, which Nginx can then expose.
- I mounted /opt/certbot/www into my Nginx container.
- Updated the Certbot configuration to use webroot.
- Added a deploy hook: docker exec nyxcore-nginx-1 nginx -s reload. This ensures Nginx reloads its configuration after a successful certificate renewal, picking up the new certs without downtime.
A certbot dry-run passed with flying colors, confirming the configuration works. My certificates now expire on 2026-05-31, and I'm confident the systemd timer will handle the auto-renewal ~30 days prior. What a relief!
Navigating the Minefield: Lessons from the Pain Log
Not everything goes smoothly. Here are a few "pain points" that turned into valuable lessons:
Docker Compose on Production vs. Local
The Pain: On production, I reflexively typed docker compose build app.
The Failure: no such service: app. My production setup uses a specific docker-compose.production.yml file, not the default docker-compose.yml.
The Lesson: Always be explicit with environment-specific configurations. On production, it's docker compose -f docker-compose.production.yml every time. This is a common pitfall when juggling multiple environments.
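One way to make "always be explicit" hard to forget is to build the command in one place. The sketch below is a hypothetical helper, not something from my repo: DeployEnv, COMPOSE_FILES, and composeArgs are illustrative names, and only the two file names come from the setup described above.

```typescript
type DeployEnv = "local" | "production";

// File names as described in the post; everything else here is illustrative.
const COMPOSE_FILES: Record<DeployEnv, string> = {
  local: "docker-compose.yml",
  production: "docker-compose.production.yml",
};

// Builds the argument list for `docker`, always passing -f explicitly.
function composeArgs(env: DeployEnv, ...rest: string[]): string[] {
  return ["compose", "-f", COMPOSE_FILES[env], ...rest];
}

console.log(["docker", ...composeArgs("production", "build", "app")].join(" "));
// docker compose -f docker-compose.production.yml build app
```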
Generating NextAuth JWE Tokens Programmatically
The Pain: I needed to generate a NextAuth JWE token programmatically, perhaps for an internal API or a server-side process, and tried to do it inside my Next.js standalone Docker container.
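The Failure: The jose and next-auth/jwt libraries couldn't be imported inside the Next.js standalone output. It's a lean build that copies only the files Next.js traces as necessary for running the server, so packages you'd expect to find in node_modules may be missing.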
The Lesson: For now, user-triggered workflows from the UI are the way to go for auth-related actions. For future programmatic auth, I've documented the manual process: generate JWE locally using HKDF (sha256, AUTH_SECRET, salt=__Secure-authjs.session-token, info=Auth.js Generated Encryption Key (__Secure-authjs.session-token), length=64), then encrypt with A256CBC-HS512. It's a reminder that sometimes you have to get low-level with crypto when your framework's helpers aren't available.
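The key-derivation half of that manual process can be sketched with Node's built-in crypto module alone. This is a minimal sketch under the parameters listed above: deriveEncryptionKey is a hypothetical helper name, the secret is a placeholder, and the encryption step itself is left out because it needs a JWE library such as jose.

```typescript
import { hkdfSync } from "node:crypto";

// Parameters as listed above; the cookie name doubles as the HKDF salt.
const salt = "__Secure-authjs.session-token";
const info = `Auth.js Generated Encryption Key (${salt})`;

// Derives the 64-byte key that A256CBC-HS512 expects.
function deriveEncryptionKey(authSecret: string): Uint8Array {
  return new Uint8Array(hkdfSync("sha256", authSecret, salt, info, 64));
}

// Placeholder secret for illustration only, not a real AUTH_SECRET.
const key = deriveEncryptionKey("replace-with-AUTH_SECRET");
console.log(key.length); // 64
```

Because the cookie name is part of both the salt and the info string, a key derived for one cookie name will not decrypt tokens issued under another.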
Rapid SSH Connections Dropping
The Pain: While frantically testing the CI/CD, I was making rapid SSH connections to the Hetzner server. After 2-3 connections, subsequent attempts would fail.
The Failure: Connection drops, "connection refused."
The Lesson: This is often due to sshd's MaxStartups setting, which limits the number of concurrent unauthenticated connections; once the limit is reached, sshd starts dropping new ones.
The Workaround: For now, I'm adding a 10-20 second sleep between SSH sessions or, better yet, combining multiple commands into a single SSH call (e.g., ssh user@host 'command1 && command2'). The long-term fix is to adjust MaxStartups in sshd_config.
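The "one SSH call" workaround is easy to wrap in a helper so scripts never open more connections than they need. This is a hypothetical sketch: sshArgs is an illustrative name, and the target, path, and commands are placeholders.

```typescript
// Joins commands with && so the chain stops at the first failure,
// and returns the argument list for a single `ssh` invocation.
function sshArgs(target: string, commands: string[]): string[] {
  if (commands.length === 0) {
    throw new Error("at least one command is required");
  }
  return [target, commands.join(" && ")];
}

// Placeholder target and commands for illustration.
const args = sshArgs("user@host", ["cd /srv/app", "git pull", "docker compose ps"]);
console.log(args[1]); // cd /srv/app && git pull && docker compose ps
```

Returning an argument array rather than one shell string means it can be handed to a process spawner without a second round of local shell quoting.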
The AI Assistant Strikes Again: Cael's Deep Dive
Beyond the infrastructure and bug fixes, I unleashed Cael, my AI assistant, on a significant architectural challenge: a 9-step multi-tenant team management plan. This wasn't just about generating code; it was about a consistency review.
The Task: Review workflow 197f3e9c-031d-414c-9f41-d0487c9d24f8.
The Findings: Cael identified a whopping 23 issues, including:
- 1 CRITICAL: "Team vs Tenant model confusion." This is huge, as it points to a fundamental conceptual flaw in how I was thinking about the data model.
- 3 HIGH, 12 MEDIUM, 3 LOW: A mix of logical inconsistencies, potential edge cases, and areas for improvement.
The Output: Cael didn't just point out problems; it built an authoritative final implementation prompt, broken down into 8 phases, detailing 11 new files and 6 modified files. This output, saved to /tmp/cael-review.md (1,383 lines!), is a blueprint for the next major feature. It's a powerful demonstration of AI's capability not just for generation, but for critical architectural review.
What's Next?
This session was incredibly productive, moving the needle on several fronts. But the developer's journey is never truly done. Here's what's immediately on the horizon:
- Implement Team Management: Dive into Cael's detailed prompt from /tmp/cael-review.md and start building out the new team management features, beginning with schema extensions.
- Fix SSH MaxStartups: Adjust the sshd configuration on the server to prevent future SSH connection drops.
- Dual-Provider UI Toggle: Add a simple checkbox to the workflow builder UI for dualProviderAutoSelect to make this powerful feature accessible.
- NerdStats Per-Provider Cost: Complete the UI for src/components/shared/nerd-stats.tsx to display a detailed cost table per AI provider.
- Expand Dual-Provider Testing: Test the dual-provider logic in other critical pipelines (auto-fix, refactor, docs, code-analysis) to ensure its robustness across the board.
It was a good day. The kind of day that reminds you why we do this: the thrill of solving complex problems, the satisfaction of automating tedious tasks, and the constant learning curve that keeps us on our toes. Onwards!
{
"thingsDone": [
"Fixed dual-provider gate bug in workflow engine",
"Tested dual-provider functionality on production",
"Created CI/CD pipeline with GitHub Actions",
"Configured GitHub secrets for deployment",
"Set up Certbot auto-renewal with webroot authenticator and Nginx reload hook",
"Ran Cael consistency review on team management workflow",
"Generated authoritative implementation prompt from Cael review"
],
"pains": [
"Used incorrect docker compose file on production",
"Failed to generate NextAuth JWE token in standalone Docker container",
"GitHub environment protection rules not available on free plan",
"Rapid SSH connections dropped on Hetzner server"
],
"successes": [
"Dual-provider routing working as expected",
"CI/CD pipeline successfully deploys to production",
"Certbot dry-run passed, ensuring automated certificate renewal",
"Cael identified critical architectural flaw in team management plan",
"Generated a detailed blueprint for next major feature implementation"
],
"techStack": [
"TypeScript",
"Next.js",
"Docker",
"Docker Compose",
"GitHub Actions",
"Hetzner Cloud",
"Certbot",
"Nginx",
"SSH",
"AI/LLM (Cael)",
"NextAuth.js"
]
}