Post-mortem: Cache and registry outage, April 7, 2026

On April 7, 2026, Tuist experienced an incident that made cache endpoints unavailable globally for approximately 12 minutes, from 16:30 UTC to 16:42 UTC.

Impact

The incident affected all users globally.

During the incident window, cache operations failed with:

✖ Error
  None of the cache endpoints are reachable.

Timeline

On April 7, 2026:

  • 16:30 UTC — A configuration change was deployed to our cache endpoints that prevented them from starting.
  • 16:39 UTC — The issue was identified and rollback began.
  • 16:42 UTC — Rollback completed and service fully recovered.

What happened?

We have been working on a distributed metadata plane for our global fleet of cache nodes. The feature had been in development for several weeks and was rolled out manually, in stages, earlier that day.

After deploying the feature to roughly half of our regions, the rollout to a subsequent region hit the underlying database's connection limit, and the endpoint in that region failed to start. At that stage there was no customer impact, because the affected endpoint was still in maintenance mode.

To address that connection-limit issue, we moved the application from direct Postgres connections to PgBouncer by updating the configured database port in our secrets management from 5432 (Postgres) to 6432 (PgBouncer).
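
The change itself was small. As a sketch (hostname, user, and database name below are placeholders, not our real configuration), only the port in the configured database URL moved:

```python
# Illustration only: hostname, user, and database name are placeholders.
# The deployed change swapped the port in the configured database URL
# from Postgres (5432) to PgBouncer (6432).
old_url = "postgres://cache_app@db.internal:5432/cache_db"
new_url = old_url.replace(":5432/", ":6432/")
print(new_url)  # → postgres://cache_app@db.internal:6432/cache_db
```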

The application was still configured to set tcp_keepalives_idle on its database connections. That worked when connecting directly to Postgres, and it also worked during the initial manual rollout. However, PgBouncer manages that setting itself on its server-side connections to Postgres, and it rejects client connections that attempt to set it as a startup parameter.
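
PgBouncer's documented behavior is to reject startup parameters it does not track itself, unless they are listed in its ignore_startup_parameters setting. A minimal sketch of that acceptance logic (modelled on the documented defaults, not PgBouncer's actual source):

```python
# Sketch of PgBouncer-style handling of client startup parameters
# (modelled on documented behavior; not actual PgBouncer code).
# PgBouncer tracks a small set of parameters itself and rejects any
# other startup parameter unless it appears in ignore_startup_parameters.
TRACKED = {
    "client_encoding",
    "datestyle",
    "timezone",
    "standard_conforming_strings",
    "application_name",
}

def accepts_connection(startup_params, ignore_startup_parameters=()):
    """Return True if a PgBouncer-like proxy would accept the connection."""
    for name in startup_params:
        if name not in TRACKED and name not in ignore_startup_parameters:
            # PgBouncer would fail the connection with
            # "unsupported startup parameter: <name>"
            return False
    return True

# Direct Postgres happily accepts tcp_keepalives_idle, but through the
# proxy the same connection attempt is rejected:
print(accepts_connection({"tcp_keepalives_idle": "60"}))  # False
print(accepts_connection({"tcp_keepalives_idle": "60"},
                         ignore_startup_parameters=("tcp_keepalives_idle",)))  # True
```

One general mitigation PgBouncer documents is adding such parameters to ignore_startup_parameters; in this incident, rolling back the configuration change restored service.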

Because the cache application runs database migrations during startup, it failed to connect to the database during the migration phase and never became healthy. As a result, the service did not start, which caused the outage.
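
The ordering that turned a connection error into an outage can be sketched as follows (function names are illustrative, not our actual code):

```python
# Illustrative startup sequence: migrations run before the server starts,
# so a database connection failure here means the process exits before
# any health or metrics endpoint ever comes up.
def start(connect, run_migrations, serve):
    conn = connect()        # failed here: connection rejected by PgBouncer
    run_migrations(conn)    # never reached
    serve()                 # health/metrics endpoints never start

def failing_connect():
    raise ConnectionError("unsupported startup parameter: tcp_keepalives_idle")

try:
    start(failing_connect, run_migrations=lambda conn: None, serve=lambda: None)
except ConnectionError as exc:
    print(f"startup aborted: {exc}")
```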

Preventative measures

We identified several gaps that allowed this issue to reach production and delayed detection:

  • Migration failures were not reported. Our error-reporting library only initialized after full application startup, so the database error during the migration step was never captured. We are investigating how to initialize error reporting earlier so failures during migrations are reported as well.
  • Health checks did not verify database readiness. We have already deployed a change that includes database connectivity in the Docker health check. This prevents an old container from being taken out of rotation before the new one is considered fully healthy.
  • Alerting assumed metrics would exist. Because the application never fully started, the /metrics endpoint was also unavailable. We have updated our alerting so that missing request data now triggers an alert instead of being treated as healthy.
  • CI did not match production closely enough. This issue was specific to the behavior of the upstream PgBouncer configuration and did not reproduce against base Postgres in local development or CI. We are now looking at adding PgBouncer to our CI workflows so production-like connection behavior is exercised before deployment.
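
Putting PgBouncer in front of Postgres in CI might look like the following service-definition sketch (the image, credentials, and pool mode are illustrative, not our actual workflow; edoburu/pgbouncer is a community image):

```yaml
# Illustrative CI services sketch: route test-suite connections through
# PgBouncer so proxied connection behavior is exercised, not direct Postgres.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
  pgbouncer:
    image: edoburu/pgbouncer   # community image; illustrative choice
    environment:
      DB_HOST: postgres
      DB_USER: postgres
      DB_PASSWORD: postgres
      POOL_MODE: transaction
    ports:
      - "6432:5432"            # tests connect to 6432, as in production
```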
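
The first measure, initializing error reporting before the migration step, can be sketched like this (the function names are hypothetical stand-ins, not a specific SDK):

```python
# Hypothetical sketch: initialize error reporting before migrations so a
# migration-phase failure is reported instead of vanishing with the process.
def start_with_reporting(init_error_reporting, connect, run_migrations,
                         serve, report):
    init_error_reporting()      # previously this ran only after full startup
    try:
        conn = connect()
        run_migrations(conn)
    except Exception as exc:
        report(exc)             # migration failures now reach error reporting
        raise
    serve()
```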
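
The database-aware health check might look like the following sketch. This is a bare TCP reachability probe for illustration; a real readiness check would likely also run a query such as SELECT 1:

```python
import socket

def database_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Probe whether the database (or PgBouncer) endpoint accepts TCP
    connections within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def container_healthy(db_host: str, db_port: int) -> bool:
    # Healthy only if the database answers too, not merely because the
    # application process is running.
    return database_reachable(db_host, db_port)
```

Wired into a Docker HEALTHCHECK as a script that exits non-zero on failure, this keeps an old container in rotation until the new one's database connectivity is confirmed.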
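
The alerting change amounts to treating an empty scrape window as a failure signal in its own right, along these lines (a simplified stand-in for the actual alerting configuration):

```python
def should_alert(request_counts):
    """request_counts: request-rate samples scraped from /metrics over the
    alert window; an empty list means the endpoint was unreachable."""
    if not request_counts:
        return True    # missing data is now an alert, not implicit health
    return False       # with data present, normal threshold rules apply
```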

We’re sorry for the disruption this caused, and we’re using the changes above to reduce the likelihood of similar incidents in the future.