Shipping Project Sync Phase 1: How We Built It and Almost Lost Our Embeddings

It’s an exciting time in our development cycle! We've just pushed a significant new feature, "Project Sync," into production. This is a big one – a three-phase beast designed to seamlessly integrate external code repositories with our internal knowledge base, ensuring our AI models are always working with the freshest context. Today, I want to talk about the journey through Phase 1: getting the foundational sync mechanism up and running, and the invaluable (and sometimes painful) lessons we learned along the way.

The Vision: Project Sync Phase 1

Our goal with Project Sync is simple yet powerful: allow users to connect their projects to a GitHub repository, select a branch, and have our system intelligently scan and import relevant files, keeping our internal "Memory Entries" up-to-date. Phase 1 focused on the core mechanics: connecting to GitHub, fetching branches, scanning file trees, and performing a diff-aware import.

This wasn't just a backend task; it was a full-stack symphony.

What We Shipped in Phase 1:

Database Foundation (prisma/schema.prisma): We introduced a ProjectSync model to track sync history and extended existing MemoryEntry, RepositoryFile, and Repository models with new sync-related fields. This forms the backbone for tracking what's synced and what's changed.
GitHub Connector (src/server/services/github-connector.ts): The gateway to the code. We built out functions like fetchBranches(), fetchBranchHead(), and fetchRepoTreeWithSha() to interact with the GitHub API, pulling down the necessary metadata to initiate a sync.
The Brains: Project Sync Service (src/server/services/project-sync-service.ts): This is where the magic happens. We engineered a robust AsyncGenerator pipeline with distinct stages: prepare → scan → import → finalize. The import stage is particularly clever, performing a diff-aware sync to only process changes, rather than re-importing everything. This is crucial for efficiency and resource management.
Real-time Progress (SSE API Endpoint): To keep users informed during potentially long sync operations, we created an SSE (Server-Sent Events) endpoint at src/app/api/v1/events/project-sync/[syncId]/route.ts. This streams real-time progress updates directly to the frontend.
Seamless Backend Integration (src/server/trpc/routers/projects.ts): Our tRPC router now includes a dedicated sync sub-router, exposing endpoints for fetching branches, checking sync status, initiating new syncs, viewing history, and even restoring memory entries.
Frontend Experience:
- Progress Tracking Hook (src/hooks/use-project-sync.ts): A custom React hook to consume the SSE stream and update UI state with sync progress.
- Sync Banner (src/components/project/sync-banner.tsx): A dynamic banner displaying phase indicators, a progress bar, and key statistics during an active sync.
- Sync Controls (src/components/project/sync-controls.tsx): The user interface for selecting a branch and triggering the sync, integrated directly into the project overview.

This comprehensive approach meant that from the moment a user clicks "Sync," they get real-time feedback until their project's knowledge base is fully updated. We also baked in safeguards, updating 9 files with a status: "active" filter to protect against superseded memory entries.

Navigating the Minefield: Lessons from the Trenches

No significant feature ships without its share of challenges. These "pain points" are often where the deepest learning happens. Here's a look at what we stumbled upon and how we overcame it, offering some critical lessons for anyone building similar systems.

Lesson 1: Never, Ever `prisma db push` on Production (Unless You Want to Lose Data)

This was, without a doubt, the most harrowing moment of the session. In a moment of oversight, attempting to apply new schema changes, I ran prisma db push --accept-data-loss on our production database.

The Fallout: prisma db push (especially with --accept-data-loss) is designed for rapid iteration in development, not for controlled production migrations. It dropped the embedding vector(1536) column on our workflow_insights table. This meant all 382 of our carefully generated embeddings – the very heart of our AI's understanding – were gone. Poof.

The Fix & The Lesson:

Immediate panic, followed by deep breaths.
Recreated the column via raw SQL: ALTER TABLE workflow_insights ADD COLUMN embedding vector(1536);
Triggered our embedding backfill endpoint (src/app/api/v1/admin/backfill-embeddings/route.ts) to regenerate and restore all 382 embeddings.
Applied the remaining schema changes individually, via raw SQL.

Takeaway: prisma db push is a development tool. For production, always use manual SQL migrations or a carefully crafted, data-preserving migration script. Our internal ./scripts/db-migrate-safe.sh exists for a reason, and this incident reinforced its importance. We're now considering extending that script to specifically protect critical columns like embedding.

Lesson 2: The Perils of Nested SSH & Heredoc SQL Quoting

Applying those raw SQL migrations on production wasn't straightforward either. My initial attempt was to use a heredoc SQL block over SSH, like so:

bash

ssh user@host << EOF
  docker exec -i psql -c "
    ALTER TABLE my_table
    ADD COLUMN my_column TEXT DEFAULT '';
    -- More SQL statements here
  "
EOF

The Fallout: Quote escaping within the nested docker exec psql -c "..." context became an absolute nightmare. The shell interpreter was getting confused, leading to syntax errors and failed commands.

The Fix & The Lesson: Instead of one giant heredoc, we broke it down. Run individual docker exec psql -c "..." commands, one per SQL statement.

bash

# Example of the workaround
ssh root@46.225.232.35 "docker exec nyxcore-postgres-1 psql -U nyxcore -d nyxcore -c \"ALTER TABLE project_syncs ADD COLUMN new_field TEXT;\""
ssh root@46.225.232.35 "docker exec nyxcore-postgres-1 psql -U nyxcore -d nyxcore -c \"CREATE INDEX my_idx ON project_syncs (new_field);\""

(Note: sshpass was used to automate the password for quick execution in this specific scenario, but generally not recommended for security.)

Takeaway: When dealing with complex command nesting and remote execution, simplify. Break down operations into smaller, manageable, and individually verifiable steps. Sometimes, the "smart" one-liner isnocuous.

Lesson 3: tRPC Context: `ctx.user.id` vs `ctx.userId`

A minor but common gotcha when working with tRPC and authentication.

The Fallout: I initially tried to access ctx.userId within a tRPC router, only to be met with a TypeScript error: Property 'userId' does not exist on type 'Context'.

The Fix & The Lesson: Our context structure stores user information under a user object. The correct way to access the authenticated user's ID was ctx.user.id.

typescript

// Incorrect (TypeScript error)
// const userId = ctx.userId;

// Correct
const userId = ctx.user.id;

Takeaway: Always double-check your context object's structure, especially after middleware or authentication layers have populated it. TypeScript is your friend here, guiding you to the correct paths.

Lesson 4: Prisma Self-Relation Requires `@unique`

We needed to establish a self-referencing relation on the ProjectSync model to link a sync to its previousSyncId.

The Fallout: Without the @unique attribute on previousSyncId, Prisma threw a validation error. It expects foreign key references to point to unique fields.

The Fix & The Lesson: Adding @unique to the previousSyncId field in prisma/schema.prisma resolved the issue.

prisma

model ProjectSync {
  id           String @id @default(cuid())
  // ... other fields
  previousSyncId String? @unique // This was the fix!
  previousSync ProjectSync? @relation("ProjectSyncHistory", fields: [previousSyncId], references: [id])
  nextSync     ProjectSync? @relation("ProjectSyncHistory")
}

Takeaway: When setting up relations, especially self-relations or one-to-one relationships, remember that foreign keys often need to reference unique fields on the target model. Prisma's validation helps enforce database integrity.

What's Next? The Road Ahead

With Phase 1 successfully deployed and the lessons learned etched into our collective memory, we're already looking forward to the next stages of Project Sync:

Phase 2: Code Analysis & Docs Regeneration: Extending the sync pipeline to perform deeper code analysis and automatically regenerate documentation based on the synced codebase.
Phase 3: Consolidation, Axiom & Embedding Refresh: The final phase will involve sophisticated consolidation of memory entries, integration with our Axiom logging, and intelligent embedding refreshes.
RLS Policies: Implementing Row-Level Security (RLS) policies for the new project_syncs table to ensure data privacy and access control.
End-to-End Testing: Thoroughly testing the full sync feature with a diverse range of real GitHub repositories.
Migration Script Refinement: Updating ./scripts/db-migrate-safe.sh to explicitly protect critical, new sync-related columns during future migrations.

Shipping features is exhilarating, but the journey often involves unexpected detours and valuable learning experiences. Project Sync Phase 1 was a testament to our team's ability to build complex systems while navigating the realities of production deployments. We’re excited for the next phases and to bring even more powerful capabilities to our users!

Happy coding!

json

{
  "thingsDone": [
    "Implemented Project Sync Phase 1 (full 3-phase feature)",
    "Created embedding backfill endpoint and restored 382 production vectors",
    "Extended Prisma schema with ProjectSync model and sync fields",
    "Developed GitHub connector services (fetchBranches, fetchRepoTreeWithSha)",
    "Built core ProjectSyncService with AsyncGenerator pipeline (prepare→scan→import→finalize) and diff-aware sync",
    "Created SSE endpoint for real-time project sync progress streaming",
    "Added tRPC sync sub-router (branches, status, start, history, restoreMemory)",
    "Developed useProjectSync hook for SSE progress tracking",
    "Created Project Sync UI components (banner, controls)",
    "Integrated SyncControls into ProjectOverview",
    "Updated 9 files with 'status: active' filter for superseded entry protection",
    "Applied production schema changes via manual SQL (after initial mishap)",
    "Rebuilt app and ensured embeddings were intact on production"
  ],
  "pains": [
    "Lost 382 production embeddings due to accidental `prisma db push --accept-data-loss`",
    "Struggled with quote escaping in nested SSH + docker exec heredoc SQL commands",
    "Incorrectly tried to access `ctx.userId` instead of `ctx.user.id` in tRPC",
    "Prisma validation error for self-relation without `@unique` on `previousSyncId`"
  ],
  "successes": [
    "Successfully restored lost embeddings via raw SQL and backfill endpoint",
    "Developed robust manual SQL migration workaround for production",
    "Identified and corrected tRPC context access pattern",
    "Resolved Prisma self-relation issue by adding `@unique`",
    "Successfully deployed a complex full-stack feature (Project Sync Phase 1)",
    "Implemented real-time progress streaming for user feedback"
  ],
  "techStack": [
    "Next.js",
    "Prisma",
    "TypeScript",
    "PostgreSQL (vector extension)",
    "GitHub API",
    "Server-Sent Events (SSE)",
    "tRPC",
    "React",
    "Docker",
    "SSH"
  ]
}