Moving Postgres in-cluster

marekfort · May 20, 2026, 8:14am

Summary

This RFC proposes migrating the Tuist server’s primary Postgres database off Supabase and into our self-hosted CAPI Kubernetes clusters on Hetzner, managed by the CloudNativePG operator. ClickHouse stays on its current managed footprint. The migration uses Postgres logical replication to keep an in-cluster copy continuously synced with Supabase, then performs an atomic cutover of both reads and writes during a short read-only window. Reverse logical replication runs for a soak period after the cutover so a roll-back to Supabase remains available if a fault appears.

Motivation

Tuist’s transactional workload is small. The heavy analytical tables live in ClickHouse, and what remains in Postgres is account metadata, project graphs, Oban coordination, and similar OLTP rows. That workload fits comfortably on a modest in-cluster cluster and benefits from being adjacent to the application pods that read it.

Three concrete pain points motivate the move:

Operational boundary tax. Role provisioning, network policy, and secret rotation cross the Supabase boundary today. The tuist_processor role lives as a one-shot SQL file in infra/supabase/tuist-processor-role.sql because Supabase only allows CREATE ROLE as the postgres superuser, and the file’s own header anticipates the move: “If/when we move Postgres in-cluster, this file gets collapsed into the Postgres operator’s Cluster CR (declarative role management).” Every new role, grant change, or env bootstrap currently runs as out-of-band SQL with 1Password lookups, instead of being a declarative resource alongside the rest of the chart.
Connection topology. Supabase Supavisor sits between every Tuist pod and the database. It has its own failure modes, its own latency profile, and its own connection accounting that we cannot tune. In-cluster, the Pooler is a Kubernetes resource we own.
Cost and control. Three Supabase projects (staging, canary, production) consume budget for capacity we already have idle on the Hetzner cluster. More importantly, the GDPR data-handling surface is shared with Supabase today. Bringing Postgres in-cluster consolidates the data-processing boundary inside infrastructure we already audit for ClickHouse and S3.

The migration does not solve any user-visible problem on its own. It removes a class of operational friction and unblocks declarative role management, in-cluster observability, and faster iteration on database-layer concerns.

Current state

Postgres runs on three Supabase projects, one per environment (inzgspjesrhqhleomvkb for staging, thapckmapmffdgaiarvu for canary, yltsvvtmlktxdgpcbylg for production). The Helm chart at infra/helm/tuist/ already abstracts the database configuration into a tri-state postgresql.mode enum (embedded for self-hosters, external for our managed deploys, with the cluster-internal StatefulSet path covering the embedded case). Per-env values files point the external mode at the Supabase pooler endpoints.

The application connects through Ecto via Tuist.Repo, configured in server/config/runtime.exs. Migrations live in server/priv/repo/migrations/ and number 274 at the time of writing. The processor Deployment connects with a least-privilege role provisioned via SQL run by hand against each Supabase project.

ClickHouse is on a separate managed footprint and is not in scope for this RFC. The build processor and xcresult processor both write to ClickHouse via Tuist.IngestRepo, and that path remains unchanged.

The /ops scope in server/lib/tuist_web/router.ex already mounts third-party UIs (Oban, FunWithFlags, LiveDashboard) behind the :ops pipeline. FunWithFlags is in use for per-account feature flags (:kura, :qa_billing_enabled).

Proposal

Operator and topology

The migration uses CloudNativePG as the Postgres operator. It is CNCF Sandbox, production-deployed at scale, and models the database as a declarative Cluster resource with companion Pooler, Backup, ScheduledBackup, and managed-role specs. Major-version upgrades are blue/green through a fresh cluster plus logical replication, which is the same mechanic this migration uses.

Per-environment topology:

Environment	Instances	Storage	Resource floor
Staging	2 (1 primary + 1 sync replica)	100 GiB	1 CPU, 4 GiB
Canary	3 (1 primary + 2 sync replicas)	250 GiB	2 CPU, 8 GiB
Production	3 (1 primary + 2 sync replicas)	250 GiB	2 CPU, 8 GiB

Storage uses the hcloud-volumes StorageClass already wired through infra/k8s/mgmt/bootstrap/hcloud-csi-values.yaml. Volumes are online-resizable. Pod anti-affinity scopes each instance to a different node. Sync replication means a node loss does not drop committed writes.

A connection pooler is part of the day-one topology rather than a later addition. The application is already structured around a pooler hop: the Supabase pooler endpoint on port 6543 is what the server pods, the processor at infra/helm/tuist/values.yaml:294, and the xcresult processor connect to today. Removing the pooler at cutover would change the connection accounting in a way the workload was never sized for. The Tuist pod fleet (web server replicas, processor pods, xcresult processor pods, autoscaler-driven surges, plus a connection per BEAM scheduler in each pod) opens far more frontend connections than Postgres should hold as dedicated backends, and transaction pooling is what bridges the two.

Each Cluster therefore comes with two Pooler CRs: a read-write pooler in transaction mode pointed at the primary, and a read-only pooler in transaction mode pointed at the replica set. Two pooler instances per side, both behind a Service that the chart consumes. The application’s DATABASE_URL points at the read-write pooler; pgweb, the agent MCP server, and the future read-only role surfaces point at the read-only pooler. Direct connections to the primary (port 5432 on the cluster’s -rw service) are kept available for the rare cases that need session-bound features (advisory locks, session-level SET, prepared LISTEN/NOTIFY); the migration job uses this path. Transaction-mode pooling does not support named prepared statements on its own. The application’s Postgrex configuration sets prepare: :unnamed (matching the current Supavisor-fronted setup), and PgBouncer 1.21+ named-prepared-statement support is available as an upgrade path if Postgrex defaults change.

A new postgresql.mode: cnpg branch in the chart emits the CNPG Cluster, the two Pooler resources, and the ScheduledBackup resource. The existing embedded and external modes stay, so self-hosters are unaffected and we can mode-flip per environment.

The processor’s role provisioning moves from infra/supabase/tuist-processor-role.sql into Cluster.spec.managed.roles with grants set by a bootstrap.initdb.postInitApplicationSQL block on first init. The one-shot SQL file is retired.

Backups and restore

WAL archiving and base backups use CNPG’s built-in Barman Cloud integration, targeting Tigris (S3-compatible, already in use for app artifacts). Backups land in a dedicated tuist-pg-backups bucket. Retention is 30 days with continuous WAL archiving for point-in-time recovery anywhere in that window.

Backup schedule per environment:

Continuous WAL archiving to Tigris (RPO target: under 5 minutes)
Daily base backup at 03:00 UTC
Retention: 30 days

Backups are encrypted at rest using bucket-level server-side encryption on the tuist-pg-backups bucket (Tigris-managed keys, the SSE-S3 equivalent). This is the same encryption posture as our existing Tigris buckets for app artifacts, requires no per-operation key handling in the CNPG backup pipeline, and Tigris handles key rotation. If a future compliance requirement calls for customer-managed keys, the migration to SSE-C is a bucket-policy change plus a key reference in the Cluster barmanObjectStore spec, not a redesign.

Restore drills are part of the rollout. A successful restore from Tigris into a fresh Cluster resource is a gate before moving from staging to canary. The drill is then repeated quarterly as a runbook entry.

Tigris credentials for the backup target sync via ESO from a new 1Password entry per environment (TIGRIS_CNPG_BACKUP_ACCESS_KEY_ID, TIGRIS_CNPG_BACKUP_SECRET_ACCESS_KEY), mirroring the pattern in infra/helm/tuist/templates/processor-external-secrets.yaml.

Access for engineers and agents

Two access surfaces matter beyond the application: humans inspecting the database, and agents querying it. Both run through the same defense-in-depth model: a dedicated read-only role per audience, routed to the read-replica pooler, with role-scoped GUCs enforcing read-only transactions and timeouts.

Engineers (pgweb under /ops/db). A pgweb Deployment runs in-cluster as a ClusterIP Service with no Ingress. The Phoenix application reverse-proxies /ops/db to it via ReverseProxyPlug, so the existing :ops pipeline authorizes every request before pgweb sees it. The sidebar in server/lib/tuist_web/components/app_layout_components.ex gets a new Database entry between Jobs and Flags. This follows the established pattern at server/lib/tuist_web/router.ex:735 where forward "/flags" and live_dashboard "/dashboard" already mount third-party UIs under /ops.

The pgweb pod connects as a tuist_ops_ro role pointed at the read-only Pooler (-pooler-ro).

There is intentionally no write path inside the /ops UI. Writes, deletions, and schema introspection that mutates state happen exclusively through kubectl cnpg psql against the read-write Pooler, authenticating as the tuist_app role from a developer’s authenticated kubectl context. This deliberately puts friction in front of destructive operations: the engineer has to be on the cluster’s kubectl context, has to invoke a CLI command, and has to type the statement themselves rather than click into a row. The /ops/db browser UI stays a read-only surface for inspection. If a recurring class of write operations later emerges as something that should live in the UI, it becomes a purpose-built LiveView with its own role and audit channel, not a generalized SQL editor.

Agents (MCP server with tuist_agent_ro). A separate Deployment runs an MCP server exposing read-only Postgres access over the protocol agents already speak. The Service is not exposed via public Ingress. Instead, it joins the existing Tuist tailnet through a Tailscale sidecar, so the MCP endpoint is only reachable from devices already authenticated to the tailnet. The agent role is distinct from the engineer role so the audit trail and resource limits are independent. Tailscale ACLs scope which tagged identities (developer laptops, CI agent runners) may reach the MCP service at all.

Postgres only ever sees the shared tuist_agent_ro role on the wire, so per-agent attribution lives at the MCP layer. The MCP server reads the caller’s identity from the Tailscale connection metadata (the tagged device and, where present, the logged-in tailnet user). Every accepted query produces a structured log entry that includes the Tailscale identity, an MCP session ID, the query text, duration, and rows returned. These entries ship to Loki via Alloy on the same audit channel as pgaudit. The Postgres-side pgaudit log and the MCP-side log can be joined on session start time and a per-query correlation ID generated by the MCP server and emitted as a Postgres SET application_name = ... at session open. The result is a single audit timeline that answers “which agent ran what” at attribution granularity, even though the database role is shared.

Both roles share the same defense stack:

Network-level gate. The two surfaces are unreachable from the public internet. pgweb has no Ingress and is reached only by Phoenix reverse-proxying authenticated /ops requests through cluster DNS. The agent MCP server has no Ingress and is reached only over the Tailscale tailnet, with Tailscale ACLs restricting which identities can connect. A leaked Postgres credential without network-level access is not enough on its own.
Postgres-level grants. SELECT-only on public, with ALTER DEFAULT PRIVILEGES so new tables are auto-included. No INSERT/UPDATE/DELETE/TRUNCATE/EXECUTE anywhere. Postgres returns permission denied on any other statement.
Replica-only routing. Both roles connect to the -pooler-ro service, which routes to hot standbys. Postgres rejects writes at the engine level on a standby even if a privilege escalation slipped through.
Role-scoped GUCs. default_transaction_read_only = on, plus statement_timeout, idle_in_transaction_session_timeout, and lock_timeout calibrated tighter for agents than for engineers. A runaway query bounds at the role level.
Tool-level allow-list. pgweb’s --readonly flag and the MCP server’s restricted mode reject non-SELECT statements before they leave the tool. This layer produces better error messages, not security, but stops obvious mistakes earlier.
Audit. pgaudit is enabled in the Cluster spec, scoped to log all statements from the two read-only roles, with output shipped to Loki via the existing Grafana Alloy.

Writes are an explicit out-of-band promotion: a different role (tuist_agent_rw for agents, manual tuist_app access via kubectl cnpg psql for engineers), a different secret, a different audit channel. Read-only is the path of least resistance.

Cutover via logical replication

The cutover is the riskiest phase, and the design optimizes for simplicity and reversibility. The application code is unchanged: a single Tuist.Repo continues to handle every read and write. Only the connection URL it points at changes during the cutover.

Catch-up phase. Supabase remains the writer until the cutover moment. A PUBLICATION on Supabase covers all tables except oban_jobs and oban_peers, and a SUBSCRIPTION on the CNPG cluster pulls changes continuously. Postgres logical replication preserves transaction ordering and per-transaction atomicity, so CNPG stays a faithful copy of Supabase with sub-second lag on this workload size. The catch-up phase runs until row counts and per-table checksums match between the two clusters and replication lag has been at zero for the configured soak window. Soak targets: 72 hours in staging, 48 hours in canary, 72 hours in production after lag first stabilizes at zero. These are the minimum gates rather than fixed waits; if any anomaly appears during the soak (slot disconnection, unexpected lag spikes, schema drift), the clock restarts.

Pre-cutover validation. Before the flip, an operator runs:

Replication-slot lag check: SELECT slot_name, confirmed_flush_lsn, pg_current_wal_lsn() - confirmed_flush_lsn AS lag FROM pg_replication_slots; against Supabase. Lag must read zero.
Per-table row-count parity across the publication set.
Spot-check checksums on a representative subset of tables (accounts, projects, command_events, billing tables).
Confirm that no schema-only or out-of-band SQL has been applied to Supabase since the subscription was established.

If any of these fail, the cutover is aborted and the catch-up window is extended. No partial flip.

Oban is the exception. The oban_jobs and oban_peers tables cannot be logically replicated. Two consumers with SELECT ... FOR UPDATE SKIP LOCKED against the same logical job set would double-execute every job, with real downstream damage (duplicate ClickHouse rows, duplicate webhooks). Oban is drained before cutover (pause enqueues, let the queue run to empty), the tables are recreated empty on CNPG, and enqueues resume on the new primary after the flip.

Atomic cutover. The flip is a single short read-only window covering both reads and writes:

Set the application to read-only (return 503 on mutation routes).
Pause Oban; wait for in-flight jobs to drain.
Re-run the pre-cutover validation; confirm replication lag is still zero.
Enable reverse replication (CNPG to Supabase) as a roll-back path.
Flip the DATABASE_URL in values-managed-<env>.yaml from the Supabase pooler to the CNPG pooler; redeploy.
Resume Oban; lift the read-only flag.

Total user-visible impact: roughly 30 seconds of “submit failed, please retry.” Reads recover on the same window as writes. The window is comparable to a routine deploy with a long migration.

Reverse replication as safety net. After cutover, CNPG publishes back to Supabase via the reverse subscription for 14 days. The reverse PUBLICATION covers all replicated tables, matching the forward direction, rather than a curated rollback subset; one branching rule is simpler than maintaining a separate list and avoids the failure mode of needing to roll back a table that was not in the subset. If a fault appears that warrants a roll-back, Supabase has a current copy and the DATABASE_URL flip is reversed. After 14 days of green operation, the reverse replication, the Supabase projects, and infra/supabase/ are retired.

Scope

In scope:

Migration of all data currently in the three Supabase Postgres projects to in-cluster CNPG.
Backup and restore infrastructure targeting Tigris.
Read-only access surfaces for engineers (pgweb under /ops/db) and agents (MCP server).
The cutover mechanism: logical replication, atomic flip of reads and writes, reverse-replication safety net.

Out of scope:

ClickHouse. Stays on its current managed footprint. Build and xcresult processors continue writing through Tuist.IngestRepo. The natural follow-on migration is sketched in the Future work section below.
The self-hoster path. The chart’s embedded mode keeps working unchanged for self-hosted installs that run their own single Postgres pod.
Multi-region failover. Each environment remains single-region; production-us-east and production-us-west cluster configurations stay as they are for now.
Schema changes. Migrations are unaffected and continue to run against Tuist.Repo exclusively. The same Repo serves the cluster before and after the cutover; only its connection URL changes.

Trade-offs

Advantages

Declarative database state. Roles, grants, backups, pooler configuration, and version live in the chart alongside the rest of the stack. New environments come up with one helm install plus replication wiring, instead of a separate Supabase project provisioning step.
Tighter operational boundary. GDPR data-handling surface consolidates into infrastructure we already audit. Network policy can scope database access at the cluster level. Backup destination is a bucket we own.
Better connection topology. Pooler instances scale with the cluster, and connection accounting is observable through CNPG’s bundled exporters. No external pooler hop.
Faster iteration on database concerns. New roles are a CR edit; today they are a SQL file run by hand against each project. Tuning a parameter is kubectl apply; today it is a Supabase dashboard click.
Major-version upgrades become routine. CNPG’s blue/green upgrade through logical replication is the same mechanic this migration uses. The first one being the migration itself is incidental.
Cost. Three Supabase projects retire. The capacity is already provisioned on the Hetzner cluster.

Disadvantages

Ongoing Postgres operations. Major-version upgrades, vacuum behavior, replication slot management, failover testing, and on-call for a stateful workload become Tuist’s responsibility. The workload is small, but the on-call surface is non-zero.
Backup correctness depends on us. Supabase’s continuous backup is implicit; CNPG’s is configured and tested by us. The restore drill is mandatory, not optional.
No Supabase dashboard. The “click row, edit row” affordance for one-off support is replaced by pgweb (read-only) plus kubectl cnpg psql (read-write). pgweb covers most cases; the write path is intentionally more deliberate.
Cutover window. A short read-only window during the writer flip. Comparable to a normal deploy maintenance window but visible to writes.
All-at-once flip. Reads and writes move together. There is no gradual ramp to catch behavioral surprises in a small slice of traffic before the rest. The roll-back path is the reverse replication, which keeps Supabase warm enough to flip back to if a fault appears post-cutover. This trade keeps the migration code surface near zero at the cost of a less granular per-account staging.
Compliance reframing. Customer-facing documentation that references Supabase as a sub-processor needs updating. This is a paperwork item, but it has a fixed cost.

Alternatives considered

Stay on Supabase

The least-effort option. It defers the migration cost indefinitely. It also defers the operational friction items listed in Motivation indefinitely, and locks the database-layer iteration speed to whatever Supabase’s product evolution makes available. Rejected because the inflection in iteration speed and the declarative-role-management unlock outweigh the migration cost on a one-time basis.

Migrate to a different managed Postgres provider

Options considered: AWS RDS, Crunchy Bridge, Neon. Each replaces one set of managed-provider constraints with another. The operational boundary tax does not go away; it relocates. Iteration speed on database-layer concerns improves marginally if at all. Network topology gets worse, because the database is no longer adjacent to the application pods. Cost rises in most cases. Rejected because none of these address the underlying motivation.

App-layer dual-write

Write every mutation to both Supabase and CNPG in application code, with a flag-gated read cutover on top. Discussed at length during the design phase. Rejected because:

Two-phase commit is absent, so any partial write failure produces permanent silent drift.
Concurrent transaction ordering is not preserved across two databases.
Oban dual-write produces duplicate job execution, with real downstream damage.
Every Ecto migration would need to run against both repos, doubling migration plumbing.
The fan-out Repo wrapper is a large code surface that exists only for the migration window.

The advantages dual-write seems to offer (smooth read cutover, ability to compare results) are equally available through logical replication paired with a reverse-replication roll-back path, without any of the drift hazards. The cutover window is also small enough (around 30 seconds) that “no user-visible window at all” is not worth the dual-write complexity.

Gradual read ramp with feature flags and shadow reads

An earlier design split reads from writes during the cutover. A second Tuist.Repo.CNPG Ecto repo, a Tuist.Repo.Router module, and a per-account :read_from_cnpg flag would have allowed reads to ramp at 1%, 10%, 50%, 100% while writes still went to Supabase. A :shadow_read_cnpg flag would have driven a fire-and-forget comparison runner logging divergence between the two repos. Writes would have switched in a separate flip later.

Rejected because the safety property the ramp provides is small relative to its cost. Logical replication preserves transaction ordering and atomicity by design; the class of bugs that shadow reads would catch is almost entirely application-layer routing bugs introduced by the ramp itself rather than Postgres replication bugs. The cost is concrete: a second Repo, the router module, the shadow runner, two FunWithFlags entries, per-callsite migration of hot read paths, and a cleanup PR after cutover to collapse it all back. The reverse-replication safety net preserves the actual roll-back property without any of this code surface.

Self-managed Postgres without an operator

A raw StatefulSet running Postgres with hand-rolled backups (WAL-G, cron jobs) and manual failover. Rejected because CNPG already implements all of this declaratively, with a mature CRD model, automated failover, blue/green upgrades, and integrated monitoring. Rolling our own gives up the leverage of a community-maintained operator for no benefit.

Future work

The most natural follow-on to this migration is moving ClickHouse in-cluster on the same pattern. Several of the gains in this RFC apply with even more force to ClickHouse. The build processor and xcresult processor already run in-cluster on Hetzner and emit the heaviest write traffic in the stack, so every xcactivitylog row, command event, and cache event currently leaves the cluster to reach the managed ClickHouse footprint. Pulling that adjacent saves real egress and write latency. Declarative ops, ESO-managed credentials, Tigris-backed backups, and a consolidated compliance boundary all transfer cleanly. Cost is a stronger driver here than it is for Postgres because retention and ingestion volumes are larger.

The migration mechanic is different. ClickHouse has no logical-replication equivalent; instead, ReplicatedMergeTree lets in-cluster nodes join an existing shard as replicas coordinated through ClickHouse Keeper, catch up via the replication log, and become the only replicas after the managed nodes are removed. The tooling is production-grade: the Altinity ClickHouse Operator is the de facto standard on Kubernetes, ClickHouse Keeper replaces the ZooKeeper dependency, and clickhouse-backup covers incremental S3-compatible backups to Tigris.

The recommended sequencing is to let the Postgres migration settle for one or two operational cycles before opening the ClickHouse one. The team absorbs one new operator at a time, restore-drill and on-call muscle builds, and the second migration inherits the runbook and chart conventions established here. The operational delta over Postgres is real but bounded: schema changes are heavier, mutations are asynchronous background work, and stuck-merge investigations are a class of problem that does not exist in the Postgres operator. The intent is to budget for ClickHouse in-cluster explicitly when that RFC is written, not to wait for a trigger event.

References

CloudNativePG project documentation: https://cloudnative-pg.io/documentation/
CloudNativePG backup and recovery: Redirecting to CloudNativePG Docs...
CloudNativePG managed roles: Redirecting to CloudNativePG Docs...
Postgres logical replication documentation: PostgreSQL: Documentation: 16: Chapter 31. Logical Replication
pgaudit project: GitHub - pgaudit/pgaudit: PostgreSQL Audit Extension · GitHub
PgBouncer transaction pooling: PgBouncer command-line usage
Tuist processor role bootstrap (current state): infra/supabase/tuist-processor-role.sql
Tuist Helm chart values: infra/helm/tuist/values.yaml
Tuist server data export documentation: server/data-export.md

pepicrft · May 20, 2026, 12:54pm

Have you considered a client-side solution that we render in /ops/db, routing the PG connection through the server for authn & authz purposes? Something I like about this model is that it’d work for our on-premise installations too. Not sure what the state of client-side options is.

If we go with the proxy model, how do we authenticate the user once it’s routed to a separate instance? I believe both our server, which is the source of truth for identities, and the other instance would need to agree on a model that allows the latter to verify the authentication is valid.

Really like this idea. We can probably extend it later on to serve other infrastructure-related tools and prompts. What do you have in mind for this server? Will it be an Elixir app?

Very excited abou this direction. There’s no point in having a service managing this for us when we’ve already built the muscle and the foundation for us to host this.

marekfort · May 20, 2026, 1:30pm

I’d start with something that works out-of-the-box. It’s not just about running queries, but also inspecting the table, etc. I think we should nudge self-hosters to use the helm chart where we can include extra components like this instead of forcing ourselves to reimplement features like this.

Any request would still go through the Phoenix server. The pgweb instance wouldn’t be reachable on its own. We can still consider down the line to reimplement pgweb directly, but I wouldn’t want to do that in the initial iteration.

This can be exposed via the main server with a dedicated endpoint.

pepicrft · May 20, 2026, 2:07pm

Sounds good

Not sure if the agent accounted for this or not, but being proxied requires the server to be proxy-aware, pgweb in this case. They seem to support that via the --prefix flag, which I’d use to avoid route collision with the main server --prefix=/ops/db. I’d also consider using --log-forwarded-user and the header X-Forwarded-Email when proxying so the access is attributed to a particular user.

As we keep adding more operations tooling to the main tuist server, I was thinking whether it’s a good idea to conflate product & operations under the same server. Not as something to address as part of this, but something for us to keep in mind regarding is this a good idea long term?

esnunes · May 20, 2026, 4:40pm

What is the retry count and retry backoff strategy? Is it possible that a broken job blocks the migration for a considerable amount of time?

esnunes · May 20, 2026, 5:06pm

Will Oban tables be part of the migrated data? The RFC mentions that those tables will be created empty. In case they won’t be part of the initial sync, is there a plan to migrate them later? During the soak window, will Oban tables be reverse replicated?

Will schema migrations be allowed during the soak window and the 14 days of reverse subscription?

Would be possible to get some sample data + statistics of the database at the time the backup is performed and automate the database restore drill by creating a new cluster from the backup and validate whether the data matches the sample data? I guess tables that are append-only wouldn’t need the sample data as they could be validated against historical data up to the point of the backup.

Will be possible to associate queries performed via `kubectl cnpg psql` to the kubernetes authenticated user for auditing purposes?

marekfort · May 21, 2026, 8:24am

We can pause enqueuing Oban jobs and then when doing the cutover, lift the available + scheduled + retryable jobs to the CPNG database. Something to try and do on staging, but since the migration should happen during a low-traffic window (i.e., a weekend), the impact of losing some Oban jobs, in the worst case scenario, should be low.

Similar to the previous answer, we won’t migrate all the Oban data (since the Oban data is not useful historically) and just replicate the to-be-executed jobs during the cutover.

This is a very good point. We’ll need to ensure the migrations are replayed against the CNPG database, too. Alternatively, we can decrease the operational complexity of this and keep the soak window to the minimum (maybe 1-2 days) during which we can afford to freeze migrations.

Yes, this is possible and something worth doing before the cutover to increase our confidence in the migration.

I think the ultimate goal should be for no-one having access to the kubectl cnpg psql capability. We could do attribution there, but it would still depend on users executing the command in a certain way (such as via a wrapper script). I think I may try to see how difficult it would be to reimplement the pg web surfaces that we actually need in /ops, so read queries would be only available either via the MCP or through our built-in SQL editor, in both cases, our server could always ensure to authenticate given user’s permissions (and build an audit log).

\

I agree with this. The other benefit of having /ops in its own server would be that we could enforce Tailscale rules for better defense-in-depth. We should consider doing this (albeit, that should be its own effort as you mentioned)