Project Sync Phase 1 Shipped: Navigating Feature Delivery, Data Safety, and Production Pitfalls
We just shipped Phase 1 of our ambitious Project Sync feature, bringing robust branch selection to our users. Join us as we recount the architectural choices, the production challenges, and the invaluable lessons learned along the way.
Shipping a major feature is always a blend of excitement, meticulous planning, and unexpected challenges. We've just deployed Phase 1 of our Project Sync feature, a crucial step towards seamlessly integrating user codebases with our platform, complete with branch selection. This post dives into the journey of building out this complex functionality, the architectural decisions, the production hurdles we overcame, and the key lessons we learned.
The Vision: Project Sync with Branch Selection
Our goal with Project Sync is ambitious: to provide a robust, three-phase system for users to synchronize their code repositories with our platform, enabling powerful insights and intelligent actions. Phase 1 focused on the foundational capabilities, primarily branch selection and a reliable synchronization pipeline. This allows users to pick a specific branch from their GitHub repository, and our system will pull in the latest code, keeping it up-to-date.
With Phase 1 now live, our users can initiate syncs, track their progress in real-time, and ensure their project's knowledge base is current. The immediate feedback from our users has been incredibly positive, fueling our excitement for Phases 2 and 3.
Building Phase 1: A Full-Stack Journey
Bringing Project Sync to life required touching almost every part of our stack, from the database schema to the front-end user interface. Here’s a breakdown of the key components we developed and integrated:
1. Data Model Foundation: prisma/schema.prisma
We extended our prisma/schema.prisma to introduce the ProjectSync model. This model tracks each synchronization event, linking it to MemoryEntry, RepositoryFile, and Repository models with new sync-related fields. A crucial design decision here was adding a status: "active" filter across various models to protect against superseded entries, ensuring that only the latest, most relevant data is processed. This helps maintain data integrity and prevents stale information from polluting the system.
// Example snippet (simplified)
model ProjectSync {
  id           String     @id @default(cuid())
  repositoryId String
  repository   Repository @relation(fields: [repositoryId], references: [id])
  branch       String
  status       String
  startedAt    DateTime   @default(now())
  finishedAt   DateTime?
  // ... other fields like previousSyncId
}
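The `status: "active"` filter mentioned above gets attached to most reads against these models. A minimal sketch of how such a filter could be centralized so superseded entries are consistently excluded (the `activeOnly` helper name is illustrative, not from our codebase):

```typescript
// Hypothetical helper: builds the "only active rows" where-clause attached to
// Prisma queries so superseded MemoryEntry/RepositoryFile rows are skipped.
// The field name "status" and value "active" mirror the schema convention above.
type WhereClause = Record<string, unknown>;

export function activeOnly(extra: WhereClause = {}): WhereClause {
  return { status: "active", ...extra };
}

// Usage (sketch):
//   prisma.memoryEntry.findMany({ where: activeOnly({ repositoryId }) })
```

Centralizing the clause means a future status value (say, `"superseded"`) only has to be handled in one place.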
2. GitHub Integration: src/server/services/github-connector.ts
To enable branch selection and content fetching, we enhanced our github-connector.ts service with new methods:
- fetchBranches(): To list all available branches for a given repository.
- fetchBranchHead(): To get the latest commit SHA for a specific branch.
- fetchRepoTreeWithSha(): To retrieve the entire file tree for a repository at a given commit SHA.
These functions are the backbone of pulling raw code data into our system.
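Under the hood these methods map onto well-known GitHub REST v3 endpoints. A sketch of the URLs each one would hit (the helper names are illustrative; the actual connector wraps these calls with auth and error handling):

```typescript
// Sketch of the GitHub REST v3 endpoints behind the connector methods.
const GITHUB_API = "https://api.github.com";

// GET /repos/{owner}/{repo}/branches — backs fetchBranches()
export function branchesUrl(owner: string, repo: string): string {
  return `${GITHUB_API}/repos/${owner}/${repo}/branches`;
}

// GET /repos/{owner}/{repo}/branches/{branch} — commit.sha in the response
// gives the head SHA that fetchBranchHead() returns
export function branchHeadUrl(owner: string, repo: string, branch: string): string {
  return `${GITHUB_API}/repos/${owner}/${repo}/branches/${branch}`;
}

// GET /repos/{owner}/{repo}/git/trees/{sha}?recursive=1 — the full file tree
// at a commit, as used by fetchRepoTreeWithSha()
export function treeUrl(owner: string, repo: string, sha: string): string {
  return `${GITHUB_API}/repos/${owner}/${repo}/git/trees/${sha}?recursive=1`;
}
```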
3. The Core Sync Engine: src/server/services/project-sync-service.ts
This is where the magic happens. We implemented a sophisticated AsyncGenerator pipeline within project-sync-service.ts to manage the entire synchronization process:
- prepare: Initializes the sync process and fetches branch details.
- scan: Traverses the repository tree and identifies new, modified, or deleted files.
- import: Processes identified files, creating or updating RepositoryFile and MemoryEntry records. This phase is "diff-aware," meaning it intelligently updates only what has changed, optimizing performance and resource usage.
- finalize: Cleans up, marks the sync as complete, and handles any post-processing.
This pipeline ensures a resilient and observable sync process.
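The four stages above can be sketched as a single AsyncGenerator. In the real service each stage does network and database work; here every stage just yields a progress event, which is what makes the pipeline observable from the outside (the event shape is illustrative):

```typescript
// Minimal sketch of the four-stage AsyncGenerator sync pipeline.
type SyncEvent = {
  phase: "prepare" | "scan" | "import" | "finalize";
  detail: string;
};

export async function* runSync(branch: string): AsyncGenerator<SyncEvent> {
  // Each yield is a natural checkpoint: consumers see progress as it happens,
  // and an error thrown in any phase stops the pipeline cleanly.
  yield { phase: "prepare", detail: `resolved head of ${branch}` };
  yield { phase: "scan", detail: "diffed repository tree against last sync" };
  yield { phase: "import", detail: "upserted changed files" };
  yield { phase: "finalize", detail: "marked sync complete" };
}

// Consumers iterate with `for await`, forwarding each event onward,
// e.g. to an SSE stream.
```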
4. Real-time Feedback: SSE for Progress Streaming
User experience is paramount. To keep users informed about the progress of their syncs, we implemented a Server-Sent Events (SSE) endpoint at src/app/api/v1/events/project-sync/[syncId]/route.ts. This allows us to stream real-time updates directly to the client. On the front-end, src/hooks/use-project-sync.ts consumes these events, updating the UI dynamically.
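On the wire, each progress update is framed per the SSE protocol: a `data:` line followed by a blank line, which is what the browser's EventSource API expects. A minimal sketch of the framing (the event shape is illustrative):

```typescript
// Frames one progress event for a Server-Sent Events response. Per the SSE
// spec, each message is one or more "data:" lines terminated by a blank line.
export function toSseFrame(event: { phase: string; progress: number }): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// On the client, an EventSource subscribed to the endpoint receives each
// frame as a MessageEvent whose .data is the JSON payload.
```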
5. API & UI: tRPC, React Components
- src/server/trpc/routers/projects.ts: We added a dedicated sync sub-router to handle all Project Sync API calls, including fetching branches, checking status, starting new syncs, and viewing history.
- React Components:
  - src/components/project/sync-banner.tsx: Displays phase indicators, progress bars, and sync statistics.
  - src/components/project/sync-controls.tsx: Provides the branch dropdown and the "Sync Now" button.
  - src/components/project/project-overview.tsx: Integrates the SyncControls component to make the feature easily accessible.
A Quick Win: Embedding Backfill
As part of preparing for Project Sync, we also created an admin endpoint (src/app/api/v1/admin/backfill-embeddings/route.ts) to restore 382 lost embedding vectors on production. This was a critical step to ensure our AI capabilities remained fully functional, demonstrating our commitment to data integrity and recoverability.
Navigating the Treacherous Waters: Lessons Learned from Production
While the development phase was smooth, deploying to production brought its own set of challenges, leading to some invaluable lessons.
1. The prisma db push Trap: A Data Loss Scare
Challenge: In an attempt to quickly apply schema changes, we tried running prisma db push --accept-data-loss on our production database.
Outcome: This command, designed for development, dropped the embedding column (type vector(1536)) on one of our tables (workflow_insights), destroying all 382 stored embeddings.
Lesson Learned: NEVER use prisma db push on production with existing data. It's designed for schema development, not safe migrations. For production, always use controlled migration scripts (./scripts/db-migrate-safe.sh in our case) or manual SQL. We recovered by manually recreating the column via raw SQL, then re-running our embedding backfill. This incident reinforced the absolute necessity of a robust, data-preserving migration strategy for production environments.
-- Example of manual column recreation
ALTER TABLE workflow_insights
ADD COLUMN embedding vector(1536);
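A more defensive variant of the same recovery, wrapped in a transaction and guarded so re-running it is harmless (assuming PostgreSQL with the pgvector extension installed):

```sql
-- Idempotent recovery: safe to re-run if a previous attempt partially applied
BEGIN;
ALTER TABLE workflow_insights
    ADD COLUMN IF NOT EXISTS embedding vector(1536);
COMMIT;
```

Guards like IF NOT EXISTS matter most exactly when you are hand-applying SQL under pressure, since a failed first attempt should not make the second attempt error out.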
2. Escaping the Escapes: Heredoc SQL on Remote Servers
Challenge: When applying manual SQL changes via SSH, we attempted to use heredoc syntax with escaped quotes for multi-line statements within a docker exec psql command.
Outcome: The nested SSH and docker exec context broke the quote escaping, leading to syntax errors.
Lesson Learned: Complex shell escaping in nested environments is a recipe for headaches. Simplify where possible. Our workaround was to run individual docker exec psql -c "..." commands, one per SQL statement. This made the process slower but significantly more reliable and debuggable.
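Concretely, the working pattern looked something like the following (host, container, and database names are illustrative): one SQL statement per invocation, single-quoted at the SSH layer and double-quoted at the psql layer, so no layer has to escape the other's quotes.

```shell
# One statement per invocation: slower, but each statement's success or
# failure is visible on its own, and the quoting never nests.
ssh prod-host 'docker exec app-db psql -U app -d app -c "ALTER TABLE workflow_insights ADD COLUMN embedding vector(1536);"'
ssh prod-host 'docker exec app-db psql -U app -d app -c "UPDATE workflow_insights SET embedding = NULL;"'
```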
3. tRPC Context: Dot-Notation Details
Challenge: Inside a tRPC router, we initially tried to access the user ID via ctx.userId.
Outcome: TypeScript threw an error, indicating that userId did not exist directly on the context.
Lesson Learned: Always consult your framework's context definitions and types. In our tRPC setup, the user ID was correctly accessed via ctx.user.id. A small detail, but one that can halt progress.
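The shape of the mistake is easy to reproduce in miniature. A sketch of a context type where the id lives one level down (the exact shape of our real tRPC context differs; this is illustrative):

```typescript
// Illustrative context shape: the user object is nested, so ctx.userId does
// not exist — the id is reached via ctx.user.id.
interface Context {
  user: { id: string; email: string };
}

export function requireUserId(ctx: Context): string {
  // `ctx.userId` here would be a compile-time error: TypeScript catches the
  // wrong access path before it ever reaches production.
  return ctx.user.id;
}
```

This is exactly the class of bug where leaning on the framework's exported context types, rather than guessing property names, pays off immediately.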
4. Prisma Self-Relation: The @unique Requirement
Challenge: We implemented a self-relation on the ProjectSync model for previousSyncId without adding the @unique attribute.
Outcome: Prisma validation errors, because a one-to-one self-relation requires its foreign-key field to be unique: each previousSyncId is meant to identify exactly one superseded sync.
Lesson Learned: When defining relations or fields that should represent a unique link or identifier, ensure the @unique attribute is applied in your Prisma schema. This enforces data integrity at the database level and prevents unexpected validation failures.
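For reference, the documented Prisma pattern for a one-to-one self-relation looks like this (the relation name "SyncHistory" and the nextSync back-reference are illustrative; the key point is @unique on the scalar foreign key):

```prisma
model ProjectSync {
  id             String       @id @default(cuid())
  // One-to-one self-relation: each sync points at the sync it supersedes.
  // The scalar FK must be @unique for the relation to validate.
  previousSyncId String?      @unique
  previousSync   ProjectSync? @relation("SyncHistory", fields: [previousSyncId], references: [id])
  nextSync       ProjectSync? @relation("SyncHistory")
}
```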
What's Next: The Road Ahead
Phase 1 is just the beginning. Our immediate next steps involve building out the remaining phases of Project Sync:
- Phase 2: Code Analysis & Docs Regeneration: Extending the sync pipeline to include deep code analysis and automated documentation regeneration based on the synchronized codebase.
- Phase 3: Consolidation, Axiom & Embedding Refresh: Further enhancing the pipeline with knowledge consolidation, integration with our Axiom reasoning engine, and refreshing embeddings for updated code.
- Security Enhancements: Adding Row-Level Security (RLS) policies for the new project_syncs table to ensure data privacy and access control.
- End-to-End Testing: Thoroughly testing the full sync feature on real-world GitHub projects.
- Migration Script Updates: Considering updates to our ./scripts/db-migrate-safe.sh to specifically protect the new sync-related columns during future schema changes.
Conclusion
Shipping Phase 1 of Project Sync has been a rewarding experience, showcasing our ability to deliver complex, full-stack features. It's a testament to careful planning, robust architecture, and a resilient team. More importantly, the challenges encountered during deployment, especially regarding database migrations, have provided invaluable lessons that will make our future development and deployment processes even more robust and secure. We're excited for what Phases 2 and 3 will bring!