Summary
This RFC proposes migrating the Tuist server’s primary Postgres database off Supabase and into our self-hosted CAPI Kubernetes clusters on Hetzner, managed by the CloudNativePG operator. ClickHouse stays on its current managed footprint. The migration uses Postgres logical replication to keep an in-cluster copy continuously synced with Supabase, then performs an atomic cutover of both reads and writes during a short read-only window. Reverse logical replication runs for a soak period after the cutover so a roll-back to Supabase remains available if a fault appears.
Motivation
Tuist’s transactional workload is small. The heavy analytical tables live in ClickHouse, and what remains in Postgres is account metadata, project graphs, Oban coordination, and similar OLTP rows. That workload fits comfortably on a modest in-cluster cluster and benefits from being adjacent to the application pods that read it.
Three concrete pain points motivate the move:
- Operational boundary tax. Role provisioning, network policy, and secret rotation cross the Supabase boundary today. The
tuist_processorrole lives as a one-shot SQL file ininfra/supabase/tuist-processor-role.sqlbecause Supabase only allowsCREATE ROLEas thepostgressuperuser, and the file’s own header anticipates the move: “If/when we move Postgres in-cluster, this file gets collapsed into the Postgres operator’s Cluster CR (declarative role management).” Every new role, grant change, or env bootstrap currently runs as out-of-band SQL with 1Password lookups, instead of being a declarative resource alongside the rest of the chart. - Connection topology. Supabase Supavisor sits between every Tuist pod and the database. It has its own failure modes, its own latency profile, and its own connection accounting that we cannot tune. In-cluster, the Pooler is a Kubernetes resource we own.
- Cost and control. Three Supabase projects (staging, canary, production) consume budget for capacity we already have idle on the Hetzner cluster. More importantly, the GDPR data-handling surface is shared with Supabase today. Bringing Postgres in-cluster consolidates the data-processing boundary inside infrastructure we already audit for ClickHouse and S3.
The migration does not solve any user-visible problem on its own. It removes a class of operational friction and unblocks declarative role management, in-cluster observability, and faster iteration on database-layer concerns.
Current state
Postgres runs on three Supabase projects, one per environment (inzgspjesrhqhleomvkb for staging, thapckmapmffdgaiarvu for canary, yltsvvtmlktxdgpcbylg for production). The Helm chart at infra/helm/tuist/ already abstracts the database configuration into a tri-state postgresql.mode enum (embedded for self-hosters, external for our managed deploys, with the cluster-internal StatefulSet path covering the embedded case). Per-env values files point the external mode at the Supabase pooler endpoints.
The application connects through Ecto via Tuist.Repo, configured in server/config/runtime.exs. Migrations live in server/priv/repo/migrations/ and number 274 at the time of writing. The processor Deployment connects with a least-privilege role provisioned via SQL run by hand against each Supabase project.
ClickHouse is on a separate managed footprint and is not in scope for this RFC. The build processor and xcresult processor both write to ClickHouse via Tuist.IngestRepo, and that path remains unchanged.
The /ops scope in server/lib/tuist_web/router.ex already mounts third-party UIs (Oban, FunWithFlags, LiveDashboard) behind the :ops pipeline. FunWithFlags is in use for per-account feature flags (:kura, :qa_billing_enabled).
Proposal
Operator and topology
The migration uses CloudNativePG as the Postgres operator. It is CNCF Sandbox, production-deployed at scale, and models the database as a declarative Cluster resource with companion Pooler, Backup, ScheduledBackup, and managed-role specs. Major-version upgrades are blue/green through a fresh cluster plus logical replication, which is the same mechanic this migration uses.
Per-environment topology:
| Environment | Instances | Storage | Resource floor |
|---|---|---|---|
| Staging | 2 (1 primary + 1 sync replica) | 100 GiB | 1 CPU, 4 GiB |
| Canary | 3 (1 primary + 2 sync replicas) | 250 GiB | 2 CPU, 8 GiB |
| Production | 3 (1 primary + 2 sync replicas) | 250 GiB | 2 CPU, 8 GiB |
Storage uses the hcloud-volumes StorageClass already wired through infra/k8s/mgmt/bootstrap/hcloud-csi-values.yaml. Volumes are online-resizable. Pod anti-affinity scopes each instance to a different node. Sync replication means a node loss does not drop committed writes.
A connection pooler is part of the day-one topology rather than a later addition. The application is already structured around a pooler hop: the Supabase pooler endpoint on port 6543 is what the server pods, the processor at infra/helm/tuist/values.yaml:294, and the xcresult processor connect to today. Removing the pooler at cutover would change the connection accounting in a way the workload was never sized for. The Tuist pod fleet (web server replicas, processor pods, xcresult processor pods, autoscaler-driven surges, plus a connection per BEAM scheduler in each pod) opens far more frontend connections than Postgres should hold as dedicated backends, and transaction pooling is what bridges the two.
Each Cluster therefore comes with two Pooler CRs: a read-write pooler in transaction mode pointed at the primary, and a read-only pooler in transaction mode pointed at the replica set. Two pooler instances per side, both behind a Service that the chart consumes. The application’s DATABASE_URL points at the read-write pooler; pgweb, the agent MCP server, and the future read-only role surfaces point at the read-only pooler. Direct connections to the primary (port 5432 on the cluster’s -rw service) are kept available for the rare cases that need session-bound features (advisory locks, session-level SET, prepared LISTEN/NOTIFY); the migration job uses this path. Transaction-mode pooling does not support named prepared statements on its own. The application’s Postgrex configuration sets prepare: :unnamed (matching the current Supavisor-fronted setup), and PgBouncer 1.21+ named-prepared-statement support is available as an upgrade path if Postgrex defaults change.
A new postgresql.mode: cnpg branch in the chart emits the CNPG Cluster, the two Pooler resources, and the ScheduledBackup resource. The existing embedded and external modes stay, so self-hosters are unaffected and we can mode-flip per environment.
The processor’s role provisioning moves from infra/supabase/tuist-processor-role.sql into Cluster.spec.managed.roles with grants set by a bootstrap.initdb.postInitApplicationSQL block on first init. The one-shot SQL file is retired.
Backups and restore
WAL archiving and base backups use CNPG’s built-in Barman Cloud integration, targeting Tigris (S3-compatible, already in use for app artifacts). Backups land in a dedicated tuist-pg-backups bucket. Retention is 30 days with continuous WAL archiving for point-in-time recovery anywhere in that window.
Backup schedule per environment:
- Continuous WAL archiving to Tigris (RPO target: under 5 minutes)
- Daily base backup at 03:00 UTC
- Retention: 30 days
Backups are encrypted at rest using bucket-level server-side encryption on the tuist-pg-backups bucket (Tigris-managed keys, the SSE-S3 equivalent). This is the same encryption posture as our existing Tigris buckets for app artifacts, requires no per-operation key handling in the CNPG backup pipeline, and Tigris handles key rotation. If a future compliance requirement calls for customer-managed keys, the migration to SSE-C is a bucket-policy change plus a key reference in the Cluster barmanObjectStore spec, not a redesign.
Restore drills are part of the rollout. A successful restore from Tigris into a fresh Cluster resource is a gate before moving from staging to canary. The drill is then repeated quarterly as a runbook entry.
Tigris credentials for the backup target sync via ESO from a new 1Password entry per environment (TIGRIS_CNPG_BACKUP_ACCESS_KEY_ID, TIGRIS_CNPG_BACKUP_SECRET_ACCESS_KEY), mirroring the pattern in infra/helm/tuist/templates/processor-external-secrets.yaml.
Access for engineers and agents
Two access surfaces matter beyond the application: humans inspecting the database, and agents querying it. Both run through the same defense-in-depth model: a dedicated read-only role per audience, routed to the read-replica pooler, with role-scoped GUCs enforcing read-only transactions and timeouts.
Engineers (pgweb under /ops/db). A pgweb Deployment runs in-cluster as a ClusterIP Service with no Ingress. The Phoenix application reverse-proxies /ops/db to it via ReverseProxyPlug, so the existing :ops pipeline authorizes every request before pgweb sees it. The sidebar in server/lib/tuist_web/components/app_layout_components.ex gets a new Database entry between Jobs and Flags. This follows the established pattern at server/lib/tuist_web/router.ex:735 where forward "/flags" and live_dashboard "/dashboard" already mount third-party UIs under /ops.
The pgweb pod connects as a tuist_ops_ro role pointed at the read-only Pooler (-pooler-ro).
There is intentionally no write path inside the /ops UI. Writes, deletions, and schema introspection that mutates state happen exclusively through kubectl cnpg psql against the read-write Pooler, authenticating as the tuist_app role from a developer’s authenticated kubectl context. This deliberately puts friction in front of destructive operations: the engineer has to be on the cluster’s kubectl context, has to invoke a CLI command, and has to type the statement themselves rather than click into a row. The /ops/db browser UI stays a read-only surface for inspection. If a recurring class of write operations later emerges as something that should live in the UI, it becomes a purpose-built LiveView with its own role and audit channel, not a generalized SQL editor.
Agents (MCP server with tuist_agent_ro). A separate Deployment runs an MCP server exposing read-only Postgres access over the protocol agents already speak. The Service is not exposed via public Ingress. Instead, it joins the existing Tuist tailnet through a Tailscale sidecar, so the MCP endpoint is only reachable from devices already authenticated to the tailnet. The agent role is distinct from the engineer role so the audit trail and resource limits are independent. Tailscale ACLs scope which tagged identities (developer laptops, CI agent runners) may reach the MCP service at all.
Postgres only ever sees the shared tuist_agent_ro role on the wire, so per-agent attribution lives at the MCP layer. The MCP server reads the caller’s identity from the Tailscale connection metadata (the tagged device and, where present, the logged-in tailnet user). Every accepted query produces a structured log entry that includes the Tailscale identity, an MCP session ID, the query text, duration, and rows returned. These entries ship to Loki via Alloy on the same audit channel as pgaudit. The Postgres-side pgaudit log and the MCP-side log can be joined on session start time and a per-query correlation ID generated by the MCP server and emitted as a Postgres SET application_name = ... at session open. The result is a single audit timeline that answers “which agent ran what” at attribution granularity, even though the database role is shared.
Both roles share the same defense stack:
- Network-level gate. The two surfaces are unreachable from the public internet. pgweb has no Ingress and is reached only by Phoenix reverse-proxying authenticated
/opsrequests through cluster DNS. The agent MCP server has no Ingress and is reached only over the Tailscale tailnet, with Tailscale ACLs restricting which identities can connect. A leaked Postgres credential without network-level access is not enough on its own. - Postgres-level grants.
SELECT-only onpublic, withALTER DEFAULT PRIVILEGESso new tables are auto-included. NoINSERT/UPDATE/DELETE/TRUNCATE/EXECUTEanywhere. Postgres returnspermission deniedon any other statement. - Replica-only routing. Both roles connect to the
-pooler-roservice, which routes to hot standbys. Postgres rejects writes at the engine level on a standby even if a privilege escalation slipped through. - Role-scoped GUCs.
default_transaction_read_only = on, plusstatement_timeout,idle_in_transaction_session_timeout, andlock_timeoutcalibrated tighter for agents than for engineers. A runaway query bounds at the role level. - Tool-level allow-list. pgweb’s
--readonlyflag and the MCP server’s restricted mode reject non-SELECT statements before they leave the tool. This layer produces better error messages, not security, but stops obvious mistakes earlier. - Audit.
pgauditis enabled in theClusterspec, scoped to log all statements from the two read-only roles, with output shipped to Loki via the existing Grafana Alloy.
Writes are an explicit out-of-band promotion: a different role (tuist_agent_rw for agents, manual tuist_app access via kubectl cnpg psql for engineers), a different secret, a different audit channel. Read-only is the path of least resistance.
Cutover via logical replication
The cutover is the riskiest phase, and the design optimizes for simplicity and reversibility. The application code is unchanged: a single Tuist.Repo continues to handle every read and write. Only the connection URL it points at changes during the cutover.
Catch-up phase. Supabase remains the writer until the cutover moment. A PUBLICATION on Supabase covers all tables except oban_jobs and oban_peers, and a SUBSCRIPTION on the CNPG cluster pulls changes continuously. Postgres logical replication preserves transaction ordering and per-transaction atomicity, so CNPG stays a faithful copy of Supabase with sub-second lag on this workload size. The catch-up phase runs until row counts and per-table checksums match between the two clusters and replication lag has been at zero for the configured soak window. Soak targets: 72 hours in staging, 48 hours in canary, 72 hours in production after lag first stabilizes at zero. These are the minimum gates rather than fixed waits; if any anomaly appears during the soak (slot disconnection, unexpected lag spikes, schema drift), the clock restarts.
Pre-cutover validation. Before the flip, an operator runs:
- Replication-slot lag check:
SELECT slot_name, confirmed_flush_lsn, pg_current_wal_lsn() - confirmed_flush_lsn AS lag FROM pg_replication_slots;against Supabase. Lag must read zero. - Per-table row-count parity across the publication set.
- Spot-check checksums on a representative subset of tables (
accounts,projects,command_events, billing tables). - Confirm that no schema-only or out-of-band SQL has been applied to Supabase since the subscription was established.
If any of these fail, the cutover is aborted and the catch-up window is extended. No partial flip.
Oban is the exception. The oban_jobs and oban_peers tables cannot be logically replicated. Two consumers with SELECT ... FOR UPDATE SKIP LOCKED against the same logical job set would double-execute every job, with real downstream damage (duplicate ClickHouse rows, duplicate webhooks). Oban is drained before cutover (pause enqueues, let the queue run to empty), the tables are recreated empty on CNPG, and enqueues resume on the new primary after the flip.
Atomic cutover. The flip is a single short read-only window covering both reads and writes:
- Set the application to read-only (return 503 on mutation routes).
- Pause Oban; wait for in-flight jobs to drain.
- Re-run the pre-cutover validation; confirm replication lag is still zero.
- Enable reverse replication (CNPG to Supabase) as a roll-back path.
- Flip the
DATABASE_URLinvalues-managed-<env>.yamlfrom the Supabase pooler to the CNPG pooler; redeploy. - Resume Oban; lift the read-only flag.
Total user-visible impact: roughly 30 seconds of “submit failed, please retry.” Reads recover on the same window as writes. The window is comparable to a routine deploy with a long migration.
Reverse replication as safety net. After cutover, CNPG publishes back to Supabase via the reverse subscription for 14 days. The reverse PUBLICATION covers all replicated tables, matching the forward direction, rather than a curated rollback subset; one branching rule is simpler than maintaining a separate list and avoids the failure mode of needing to roll back a table that was not in the subset. If a fault appears that warrants a roll-back, Supabase has a current copy and the DATABASE_URL flip is reversed. After 14 days of green operation, the reverse replication, the Supabase projects, and infra/supabase/ are retired.
Scope
In scope:
- Migration of all data currently in the three Supabase Postgres projects to in-cluster CNPG.
- Backup and restore infrastructure targeting Tigris.
- Read-only access surfaces for engineers (pgweb under
/ops/db) and agents (MCP server). - The cutover mechanism: logical replication, atomic flip of reads and writes, reverse-replication safety net.
Out of scope:
- ClickHouse. Stays on its current managed footprint. Build and xcresult processors continue writing through
Tuist.IngestRepo. The natural follow-on migration is sketched in the Future work section below. - The self-hoster path. The chart’s
embeddedmode keeps working unchanged for self-hosted installs that run their own single Postgres pod. - Multi-region failover. Each environment remains single-region; production-us-east and production-us-west cluster configurations stay as they are for now.
- Schema changes. Migrations are unaffected and continue to run against
Tuist.Repoexclusively. The same Repo serves the cluster before and after the cutover; only its connection URL changes.
Trade-offs
Advantages
- Declarative database state. Roles, grants, backups, pooler configuration, and version live in the chart alongside the rest of the stack. New environments come up with one
helm installplus replication wiring, instead of a separate Supabase project provisioning step. - Tighter operational boundary. GDPR data-handling surface consolidates into infrastructure we already audit. Network policy can scope database access at the cluster level. Backup destination is a bucket we own.
- Better connection topology. Pooler instances scale with the cluster, and connection accounting is observable through CNPG’s bundled exporters. No external pooler hop.
- Faster iteration on database concerns. New roles are a CR edit; today they are a SQL file run by hand against each project. Tuning a parameter is
kubectl apply; today it is a Supabase dashboard click. - Major-version upgrades become routine. CNPG’s blue/green upgrade through logical replication is the same mechanic this migration uses. The first one being the migration itself is incidental.
- Cost. Three Supabase projects retire. The capacity is already provisioned on the Hetzner cluster.
Disadvantages
- Ongoing Postgres operations. Major-version upgrades, vacuum behavior, replication slot management, failover testing, and on-call for a stateful workload become Tuist’s responsibility. The workload is small, but the on-call surface is non-zero.
- Backup correctness depends on us. Supabase’s continuous backup is implicit; CNPG’s is configured and tested by us. The restore drill is mandatory, not optional.
- No Supabase dashboard. The “click row, edit row” affordance for one-off support is replaced by pgweb (read-only) plus
kubectl cnpg psql(read-write). pgweb covers most cases; the write path is intentionally more deliberate. - Cutover window. A short read-only window during the writer flip. Comparable to a normal deploy maintenance window but visible to writes.
- All-at-once flip. Reads and writes move together. There is no gradual ramp to catch behavioral surprises in a small slice of traffic before the rest. The roll-back path is the reverse replication, which keeps Supabase warm enough to flip back to if a fault appears post-cutover. This trade keeps the migration code surface near zero at the cost of a less granular per-account staging.
- Compliance reframing. Customer-facing documentation that references Supabase as a sub-processor needs updating. This is a paperwork item, but it has a fixed cost.
Alternatives considered
Stay on Supabase
The least-effort option. It defers the migration cost indefinitely. It also defers the operational friction items listed in Motivation indefinitely, and locks the database-layer iteration speed to whatever Supabase’s product evolution makes available. Rejected because the inflection in iteration speed and the declarative-role-management unlock outweigh the migration cost on a one-time basis.
Migrate to a different managed Postgres provider
Options considered: AWS RDS, Crunchy Bridge, Neon. Each replaces one set of managed-provider constraints with another. The operational boundary tax does not go away; it relocates. Iteration speed on database-layer concerns improves marginally if at all. Network topology gets worse, because the database is no longer adjacent to the application pods. Cost rises in most cases. Rejected because none of these address the underlying motivation.
App-layer dual-write
Write every mutation to both Supabase and CNPG in application code, with a flag-gated read cutover on top. Discussed at length during the design phase. Rejected because:
- Two-phase commit is absent, so any partial write failure produces permanent silent drift.
- Concurrent transaction ordering is not preserved across two databases.
- Oban dual-write produces duplicate job execution, with real downstream damage.
- Every Ecto migration would need to run against both repos, doubling migration plumbing.
- The fan-out Repo wrapper is a large code surface that exists only for the migration window.
The advantages dual-write seems to offer (smooth read cutover, ability to compare results) are equally available through logical replication paired with a reverse-replication roll-back path, without any of the drift hazards. The cutover window is also small enough (around 30 seconds) that “no user-visible window at all” is not worth the dual-write complexity.
Gradual read ramp with feature flags and shadow reads
An earlier design split reads from writes during the cutover. A second Tuist.Repo.CNPG Ecto repo, a Tuist.Repo.Router module, and a per-account :read_from_cnpg flag would have allowed reads to ramp at 1%, 10%, 50%, 100% while writes still went to Supabase. A :shadow_read_cnpg flag would have driven a fire-and-forget comparison runner logging divergence between the two repos. Writes would have switched in a separate flip later.
Rejected because the safety property the ramp provides is small relative to its cost. Logical replication preserves transaction ordering and atomicity by design; the class of bugs that shadow reads would catch is almost entirely application-layer routing bugs introduced by the ramp itself rather than Postgres replication bugs. The cost is concrete: a second Repo, the router module, the shadow runner, two FunWithFlags entries, per-callsite migration of hot read paths, and a cleanup PR after cutover to collapse it all back. The reverse-replication safety net preserves the actual roll-back property without any of this code surface.
Self-managed Postgres without an operator
A raw StatefulSet running Postgres with hand-rolled backups (WAL-G, cron jobs) and manual failover. Rejected because CNPG already implements all of this declaratively, with a mature CRD model, automated failover, blue/green upgrades, and integrated monitoring. Rolling our own gives up the leverage of a community-maintained operator for no benefit.
Future work
The most natural follow-on to this migration is moving ClickHouse in-cluster on the same pattern. Several of the gains in this RFC apply with even more force to ClickHouse. The build processor and xcresult processor already run in-cluster on Hetzner and emit the heaviest write traffic in the stack, so every xcactivitylog row, command event, and cache event currently leaves the cluster to reach the managed ClickHouse footprint. Pulling that adjacent saves real egress and write latency. Declarative ops, ESO-managed credentials, Tigris-backed backups, and a consolidated compliance boundary all transfer cleanly. Cost is a stronger driver here than it is for Postgres because retention and ingestion volumes are larger.
The migration mechanic is different. ClickHouse has no logical-replication equivalent; instead, ReplicatedMergeTree lets in-cluster nodes join an existing shard as replicas coordinated through ClickHouse Keeper, catch up via the replication log, and become the only replicas after the managed nodes are removed. The tooling is production-grade: the Altinity ClickHouse Operator is the de facto standard on Kubernetes, ClickHouse Keeper replaces the ZooKeeper dependency, and clickhouse-backup covers incremental S3-compatible backups to Tigris.
The recommended sequencing is to let the Postgres migration settle for one or two operational cycles before opening the ClickHouse one. The team absorbs one new operator at a time, restore-drill and on-call muscle builds, and the second migration inherits the runbook and chart conventions established here. The operational delta over Postgres is real but bounded: schema changes are heavier, mutations are asynchronous background work, and stuck-merge investigations are a class of problem that does not exist in the Postgres operator. The intent is to budget for ClickHouse in-cluster explicitly when that RFC is written, not to wait for a trigger event.
References
- CloudNativePG project documentation: https://cloudnative-pg.io/documentation/
- CloudNativePG backup and recovery: Redirecting to CloudNativePG Docs...
- CloudNativePG managed roles: Redirecting to CloudNativePG Docs...
- Postgres logical replication documentation: PostgreSQL: Documentation: 16: Chapter 31. Logical Replication
- pgaudit project: GitHub - pgaudit/pgaudit: PostgreSQL Audit Extension · GitHub
- PgBouncer transaction pooling: PgBouncer command-line usage
- Tuist processor role bootstrap (current state):
infra/supabase/tuist-processor-role.sql - Tuist Helm chart values:
infra/helm/tuist/values.yaml - Tuist server data export documentation:
server/data-export.md