Self-hosted GitHub Actions cache for Tuist Runners

This RFC proposes a self-hosted GitHub Actions cache backend for Tuist Runners, co-located with each runner fleet. Today, a customer job running on a Tuist runner still ships its actions/cache traffic across the internet to GitHub’s hosted cache (Azure Blob Storage). The round trip negates much of the value of running the job close to fast, dedicated hardware. We propose terminating the GitHub Actions cache protocol inside our own infrastructure and storing artifacts in self-hosted, S3-compatible object storage (SeaweedFS), one cluster per fleet. The customer changes nothing in their workflow. This is an infrastructure and performance improvement, not a new user-facing product surface.

The two fleets are served by two independent storage clusters: SeaweedFS on Scaleway for the macOS fleet, SeaweedFS on Hetzner for the Linux fleet. The split is the locality-optimal design, not a compromise, and falls out naturally because cache keys are already OS-scoped.

Motivation

Tuist Runners place customer CI jobs on dedicated macOS (Scaleway Apple Silicon) and Linux (Hetzner bare-metal) hardware. The compute is fast and close, but the cache is not:

  1. Cache I/O leaves the datacenter. actions/cache uploads and downloads tarballs to GitHub’s hosted cache, which is Azure Blob Storage. A job on a Hetzner box in Germany or a Mac mini on Scaleway pulls and pushes its cache across the public internet on every run.
  2. Cache transfer dominates wall-clock for cache-heavy jobs. For Swift/Xcode work the cache payloads (DerivedData, SwiftPM .build, actions/cache archives) are large, and the transfer time is a large fraction of total job time.
  3. We pay for egress we do not control. Cross-internet cache traffic is bandwidth we cannot optimize and cannot keep on a private network.
  4. The hardware advantage is undercut. Customers move to Tuist Runners for speed. A slow cache path is the most visible way that promise leaks.

The problem statement: cache artifacts for jobs running on Tuist Runners should be stored on fast storage co-located with the runner, on a private network, with zero changes to the customer’s workflow.

Prior Art

Blacksmith

Blacksmith (blacksmith.sh/blog/cache) solves the same problem for their Firecracker-based runners. Their design:

  • A lightweight proxy inside each runner VM intercepts cache requests and forwards them to a host-level proxy, while non-cache GitHub control-plane traffic goes to its usual destination.
  • nftables on the host atomically swaps redirect rules per VM, replacing an earlier iptables approach that did not handle multi-VM rule management cleanly.
  • A host proxy decodes GitHub’s Azure-Blob-style URLs and translates them into S3-compatible calls against a self-hosted MinIO cluster, using an in-house Azure-Blob-to-S3 SDK.
  • They reverse-engineered GitHub’s move to a Twirp-based cache service that uses the Azure Blob Storage SDK, and found the SDK skips concurrency optimizations when the hostname does not look like an Azure endpoint, so they mint Azure-like URLs.

The key lesson we adopt: terminate the cache protocol with network interception, and translate Azure Blob operations to S3. The key place we diverge: storage backend (SeaweedFS, not MinIO, see Alternatives) and where the interception terminates.

actions/toolkit cache internals

We grounded the design in the toolkit source rather than assumptions:

  • packages/cache/src/internal/config.ts: the action selects cache service v2 only if the runner sets ACTIONS_CACHE_SERVICE_V2, otherwise v1 (and always v1 on GitHub Enterprise Server). The service URL for v2 is ACTIONS_RESULTS_URL; for v1 it is ACTIONS_CACHE_URL falling back to ACTIONS_RESULTS_URL. These come straight from the runner-injected environment.
  • packages/cache/src/internal/uploadUtils.ts: v2 upload uses @azure/storage-blob BlockBlobClient.uploadFile(path, { blockSize, concurrency, maxSingleShotSize: 128 * 1024 * 1024 }). Files at or below 128 MB go up as a single Put Blob; larger files are staged as Put Block calls and committed with Put Block List.
  • packages/cache/src/internal/downloadUtils.ts: v2 download uses BlockBlobClient.downloadToBuffer with a 128 MB max segment and internal ranged GETs, parallelized by downloadConcurrency.
  • packages/cache/src/internal/shared/cacheTwirpClient.ts: coordination is POST to /twirp/github.actions.results.api.v1.CacheService/<Method>, JSON or protobuf body, Authorization: Bearer <ACTIONS_RUNTIME_TOKEN>.
  • actions/toolkit issue #1051: there is no supported way to override the cache URL, and ACTIONS_RUNTIME_TOKEN is multiplexed with other runner subsystems (artifacts, OIDC), so it cannot be repointed without breaking them. This rules out the environment-variable override approach (see Alternatives).
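
To make the upload behavior from uploadUtils.ts concrete for gateway sizing, here is a minimal sketch of the request shape we would expect per archive. PlanUpload and UploadPlan are hypothetical names for this illustration, and the block size is an input rather than a value read from the toolkit.

```go
package main

import "fmt"

// maxSingleShot mirrors the maxSingleShotSize the toolkit passes to uploadFile.
const maxSingleShot int64 = 128 * 1024 * 1024

// UploadPlan is a hypothetical summary of the Azure calls expected for one archive.
type UploadPlan struct {
	SingleShot bool  // true: a single Put Blob
	Blocks     int64 // number of Put Block calls before the Put Block List
}

// PlanUpload predicts the request shape for an archive of size bytes.
// blockSize is supplied by the caller; it is an assumption of this sketch.
func PlanUpload(size, blockSize int64) (UploadPlan, error) {
	if size < 0 || blockSize <= 0 {
		return UploadPlan{}, fmt.Errorf("invalid size %d or block size %d", size, blockSize)
	}
	if size <= maxSingleShot {
		return UploadPlan{SingleShot: true}, nil
	}
	// Ceiling division: the last block may be shorter than blockSize.
	return UploadPlan{Blocks: (size + blockSize - 1) / blockSize}, nil
}

func main() {
	const mib = 1024 * 1024
	for _, size := range []int64{100 * mib, 128 * mib, 128*mib + 1, 1024 * mib} {
		plan, err := PlanUpload(size, 32*mib)
		if err != nil {
			panic(err)
		}
		fmt.Printf("size=%d singleShot=%v blocks=%d\n", size, plan.SingleShot, plan.Blocks)
	}
}
```

The practical consequence is that the gateway must handle both paths from day one: small archives never touch the block endpoints, and anything one byte over the threshold does.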

Current state

Tuist Runners use a shared warm pool with dispatch-time binding. A runner Pod or Tart VM boots without an identity, polls /api/internal/runners/dispatch, and only learns its tenant (owner) in the response that hands it a JIT config. The VM or container is single-job ephemeral and is destroyed after the job.

  • macOS: Tart VMs on Scaleway Apple Silicon Mac minis, driven by tart-kubelet. The VM is NAT’d behind the mini. Env and the SA token are staged host-side by tart-kubelet and shared into the guest.
  • Linux: kata-fc (Firecracker) microVMs on Hetzner bare-metal, scheduled as Pods. Env and the SA token are projected natively by kubelet.

There is already a host-side reverse proxy pattern in tart-kubelet (used today to expose VM metrics to the cluster). nftables is already in use on the Linux substrate. Both are relevant building blocks for the interception layer.

No cache backend exists today. Customer jobs use GitHub’s hosted cache as-is.

Proposal

Architecture overview

One self-hosted cache stack per fleet. Each stack has three parts:

  1. SeaweedFS S3-compatible object storage, server-side only, on the fleet’s private network.
  2. cache-gateway, a Go service that terminates the GitHub Actions cache v2 protocol (Twirp coordination plus the Azure Block Blob transfer protocol) and translates it to S3 against SeaweedFS.
  3. runner-cache-proxy, a host-side SNI-routing proxy that transparently redirects cache traffic from the guest to the gateway and leaves all other traffic alone.

 ┌─ Fleet datacenter (Scaleway or Hetzner) ─────────────────────────┐
 │                                                                   │
 │  Runner host (Mac mini / bare-metal node)                         │
 │   ┌─ Guest (Tart VM / kata-fc microVM) ─┐                         │
 │   │  actions/cache (v2, Azure Blob SDK)  │                        │
 │   └──────────────┬───────────────────────┘                       │
 │        :443 DNAT │ (pf / nftables)                                │
 │                  ▼                                                 │
 │   ┌──────────────────────────────┐                                │
 │   │  runner-cache-proxy (host)    │  SNI peek:                     │
 │   │  • CacheService → gateway     │   GitHub cache host → MITM     │
 │   │  • everything else → GitHub   │   other SNI → blind splice     │
 │   └──────────────┬───────────────┘                                │
 │                  ▼                                                 │
 │   ┌──────────────────────────────┐      ┌────────────────────┐    │
 │   │  cache-gateway (per fleet)    │─────▶│  SeaweedFS (S3)    │    │
 │   │  twirp + Azure-blob ⇄ S3      │◀─────│  private network   │    │
 │   └──────────────────────────────┘      └────────────────────┘    │
 └───────────────────────────────────────────────────────────────────┘

The guest only ever talks to the gateway’s own hostnames. SeaweedFS is never exposed to the guest.

Storage: SeaweedFS, one cluster per fleet

SeaweedFS runs as a dedicated storage tier on both fleets, deliberately separated from the runner nodes:

  • Scaleway (macOS): a dedicated Scaleway cloud instance with a block volume in the same region/AZ as the Mac minis. We do not run storage on the minis: they are precious compute, CAPI-managed, and reimaged.
  • Hetzner (Linux): dedicated storage nodes on the same private network as the runner bare-metal boxes, not the runner boxes themselves. Default to dedicated bare-metal NVMe storage nodes (ordered through the same Hetzner Robot flow the runner fleet already uses); fall back to Hetzner Cloud VMs with attached Cloud Volumes if avoiding extra Robot hardware outweighs the throughput penalty (see below).

Why a dedicated storage tier rather than co-locating on the runner nodes. Putting SeaweedFS on the runner bare-metal boxes is tempting (same NVMe, zero network hops) but couples storage to the revenue-generating workload in three ways we do not want:

  1. Resource contention. Cache is large and write-heavy. SeaweedFS would compete with customer microVMs for disk I/O bandwidth and, critically, disk capacity. A busy or full cache could slow or fail customer builds, which is exactly the workload we are trying to make faster. Runner density is planned on CPU and RAM; cache grows on disk. The two should scale independently.
  2. Lifecycle coupling. Runner nodes are cattle: CAPI reimages them (installimage on RAID 1). Stateful cache data on a node that gets reprovisioned is an anti-pattern, forcing constant rebalancing and risking loss beyond what replication covers.
  3. Blast radius. A SeaweedFS fault (memory leak, disk fill) on a runner node would degrade or take down customer jobs on that node.

Why the locality goal still holds. The goal is to keep cache I/O off the public internet, not literally on the same chassis. A dedicated storage node on the same datacenter private network is sub-millisecond away over 10 Gbit, which for bulk tarball transfer (throughput-bound, not latency-bound) is indistinguishable from same-box in practice.

Bare-metal storage nodes vs Cloud VMs, within the dedicated tier. Cache restore sits on the critical path of job start, so storage read throughput matters. Bare-metal NVMe storage nodes saturate the 10 Gbit fabric; Hetzner Cloud Volumes are network-attached block storage with materially lower per-volume throughput, which can bottleneck large restores. We therefore default to dedicated bare-metal NVMe storage nodes and treat Cloud VMs as the lower-effort fallback when the operational cost of another Robot box outweighs the throughput difference. This mirrors the Scaleway decision to use a dedicated instance rather than the compute hosts.

Each cluster runs master, volume, filer, and S3 gateway components. It holds a single bucket, gha-cache, with object keys prefixed per account. Objects are written with a 7-day TTL (matching GitHub’s default cache lifetime) so SeaweedFS handles routine expiry. Static S3 credentials are delivered via 1Password and External Secrets Operator.

The two clusters are fully independent. A Scaleway incident cannot affect Linux cache, and vice versa. Cache loss is degraded performance, not data loss, so replication factor is tuned modestly per fleet.

Why two stores and not one shared store. The point of self-hosting is LAN-speed proximity. A shared store would force one fleet to pull cache across the internet, defeating the purpose. The split loses nothing, because actions/cache keys conventionally embed ${{ runner.os }} and the content is OS-specific, so a macOS job and a Linux job essentially never resolve to the same entry even on GitHub’s own backend. The dispatch path already knows the OS at JIT-mint time, so it simply points each fleet at its local stack.

Expected storage cost

The two fleets have very different cost profiles, because Hetzner bare metal is cheap and Scaleway compute is not. Figures below are approximate monthly euros as of mid-2026 (Hetzner raised prices on 2026-04-01 and Scaleway updated on 2026-06-01, so confirm before committing). The cache gateway is a small service that co-locates on the storage node or existing cluster capacity, so its compute cost is negligible and is not broken out.

Hetzner (Linux fleet): self-host on bare metal, recommended.

Option                    Spec                                                Approx. EUR/mo
1 dedicated NVMe node     AX41 / AX42-class, ~1-2 TB local NVMe included      ~55-70
2 nodes (replication)     as above, times two                                 ~110-140
Cloud fallback, per node  CCX23 (~31.49) + 1 TB Cloud Volume (0.057/GB, ~57)  ~88
Local NVMe is included in the dedicated price, so the bare-metal option is both faster and cheaper than the Cloud VM plus Cloud Volume fallback. That is the dominant reason to prefer bare metal here.

Scaleway (macOS fleet): self-host on an instance plus block storage.

Product: Scaleway Instances (Production-Optimized PRO2 line) plus Scaleway Block Storage (15,000 IOPS SSD), in the same region as the Apple Silicon Mac minis.

Component                Spec                      Approx. EUR/mo
Instance                 PRO2-XS (4 vCPU, 16 GiB)  ~120
Block Storage            1 TB at 0.086/GB          ~86
One node + 1 TB                                    ~206
Two nodes (replication)                            ~410

A smaller Cost-Optimized instance lowers the compute line if SeaweedFS fits in less RAM.

A cost flag for the Scaleway side. Self-hosting on Scaleway runs roughly 3-4x the Hetzner cost, because Scaleway compute and block storage are both pricier. Scaleway’s managed Object Storage (S3-compatible, in the same region so locality still holds) is far cheaper, on the order of 0.0146/GB/month with no compute, so roughly 15-40 EUR/mo for the same working set. The self-host-everywhere stance was a deliberate choice, but on the macOS fleet specifically the locality argument for self-hosting is weak (managed Object Storage is in-region too) and the cost gap is large. The gateway code is identical either way; only the S3 endpoint and credentials change. See Open questions.
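
The arithmetic behind the figures in this section can be reproduced with a short sketch. All prices are the approximate ones quoted in the tables; treating 1 TB as 1000 GB is an assumption of this example, and the function names are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// All prices are the approximate EUR/month figures quoted in this section.
const (
	workingSetGB        = 1000.0 // assumption: 1 TB counted as 1000 GB
	hetznerNodeLow      = 55.0
	hetznerNodeHigh     = 70.0
	scalewayInstance    = 120.0
	scalewayBlockPerGB  = 0.086
	scalewayObjectPerGB = 0.0146
)

// round2 rounds to two decimal places for display and comparison.
func round2(v float64) float64 { return math.Round(v*100) / 100 }

// ScalewaySelfHosted is one PRO2-XS instance plus block storage for the working set.
func ScalewaySelfHosted() float64 {
	return round2(scalewayInstance + workingSetGB*scalewayBlockPerGB)
}

// ScalewayManaged is managed Object Storage for the same working set, with no compute.
func ScalewayManaged() float64 {
	return round2(workingSetGB * scalewayObjectPerGB)
}

// RatioToHetzner returns how many times cost exceeds one Hetzner node,
// measured against the high and then the low end of the Hetzner price range.
func RatioToHetzner(cost float64) (low, high float64) {
	return round2(cost / hetznerNodeHigh), round2(cost / hetznerNodeLow)
}

func main() {
	selfHosted := ScalewaySelfHosted()
	low, high := RatioToHetzner(selfHosted)
	fmt.Printf("Scaleway self-hosted, one node: ~%.0f EUR/mo\n", selfHosted)
	fmt.Printf("Scaleway managed object storage: ~%.1f EUR/mo\n", ScalewayManaged())
	fmt.Printf("Self-hosted Scaleway vs one Hetzner node: %.2fx to %.2fx\n", low, high)
}
```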

cache-gateway: the Azure-Blob to S3 translator

A Go service (Go matches every other infra control-plane and data-plane component: runners-controller, tart-kubelet, hetzner-robot-controller), one instance per fleet, co-located with that fleet’s SeaweedFS. It exposes two surfaces.

Coordination (Twirp). POST /twirp/github.actions.results.api.v1.CacheService/<Method>, implementing the JSON encoding first (the client can send JSON or protobuf bodies). Three methods matter:

Method                    Request                         Gateway response
CreateCacheEntry          { key, version }                { ok, signed_upload_url }
FinalizeCacheEntry        { key, version, size_bytes }    { ok, entry_id }
GetCacheEntryDownloadURL  { key, restore_keys, version }  { ok, matched_key, signed_download_url }

The gateway validates the tenant cache token, derives the prefix gha-cache/<account_id>/<repo>/<version>/<key>, and mints signed URLs that point at its own blob surface, never at an Azure or raw-S3 hostname. Once coordination is terminated by us, the entire blob path is on infrastructure we control.
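
A minimal sketch of the key derivation described above, assuming the tenant fields arrive as already-verified claims. ObjectKey and Claims are hypothetical names, and percent-escaping each component (so a repo such as org/app stays a single path segment) is an assumption of this example rather than a settled part of the design.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// Claims holds the tenant fields taken from the verified cache token.
type Claims struct {
	AccountID string
	Repo      string
}

// ObjectKey builds <account_id>/<repo>/<version>/<key> for use inside the
// gha-cache bucket. Tenant components come only from the claims; version and
// key come from the request and are escaped so they cannot add path segments.
func ObjectKey(c Claims, version, key string) (string, error) {
	parts := []string{c.AccountID, c.Repo, version, key}
	for i, p := range parts {
		if p == "" || p == "." || p == ".." {
			return "", errors.New("invalid path component")
		}
		parts[i] = url.PathEscape(p)
	}
	return strings.Join(parts, "/"), nil
}

func main() {
	claims := Claims{AccountID: "acct1", Repo: "org/app"}
	key, err := ObjectKey(claims, "v1", "macOS-spm-abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
}
```

Escaping is byte-wise, so a prefix of a cache key remains a prefix of its escaped form, which keeps restore_keys prefix listing workable.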

Blob transfer (Azure Block Blob subset). Served on the gateway’s own hostname with a real certificate and HMAC-URL auth. The subset of Azure operations the action actually performs, and their S3 mapping:

Azure Block Blob op (what the action sends)  When                          SeaweedFS S3 op
Put Blob (single-shot, at or below 128 MB)   small upload                  PutObject
Put Block (comp=block&blockid=...)           each block of a large upload  CreateMultipartUpload (lazy, first block) + UploadPart
Put Block List (comp=blocklist)              commit a large upload         CompleteMultipartUpload
Get Blob Properties (HEAD)                   size before download          HeadObject
Get Blob + Range                             ranged download               GetObject + Range
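
The mapping above can be expressed as a small classifier over the HTTP method and the comp query parameter. This is a sketch only: Classify and Mapping are hypothetical names, and anything outside the five listed operations is rejected rather than guessed at.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// Mapping pairs the Azure operation we recognized with the S3 call it becomes.
type Mapping struct {
	Azure string
	S3    string
}

// Classify inspects the method and raw query string of a blob request.
// It returns false for any shape outside the supported subset.
func Classify(method, rawQuery string) (Mapping, bool) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return Mapping{}, false
	}
	comp := q.Get("comp")
	switch {
	case method == http.MethodPut && comp == "":
		return Mapping{Azure: "Put Blob", S3: "PutObject"}, true
	case method == http.MethodPut && comp == "block" && q.Get("blockid") != "":
		// CreateMultipartUpload is issued lazily before the first UploadPart.
		return Mapping{Azure: "Put Block", S3: "UploadPart"}, true
	case method == http.MethodPut && comp == "blocklist":
		return Mapping{Azure: "Put Block List", S3: "CompleteMultipartUpload"}, true
	case method == http.MethodHead && comp == "":
		return Mapping{Azure: "Get Blob Properties", S3: "HeadObject"}, true
	case method == http.MethodGet && comp == "":
		return Mapping{Azure: "Get Blob", S3: "GetObject"}, true
	}
	return Mapping{}, false
}

func main() {
	requests := [][2]string{
		{"PUT", "sig=x&exp=1"},
		{"PUT", "sig=x&exp=1&comp=block&blockid=QUJD"},
		{"PUT", "sig=x&exp=1&comp=blocklist"},
		{"HEAD", "sig=x&exp=1"},
		{"GET", "sig=x&exp=1"},
		{"PUT", "sig=x&exp=1&comp=lease"},
	}
	for _, r := range requests {
		m, ok := Classify(r[0], r[1])
		fmt.Printf("%-4s %-40s -> ok=%v %s => %s\n", r[0], r[1], ok, m.Azure, m.S3)
	}
}
```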

State strategy, mostly database-free:

  • Entry index is SeaweedFS itself. Keys are deterministic, so an exact match is a HeadObject and a restore_keys prefix match is ListObjectsV2 with a prefix, newest LastModified wins. No separate entry database.
  • In-flight multipart state is in-memory, keyed by blob path: { azure_blockid -> (uploadId, partNumber, ETag) }. The gateway lazily issues CreateMultipartUpload on the first Put Block; on Put Block List it reads the ordered block-ID list from the XML body and issues CompleteMultipartUpload with parts in that order. We run one gateway instance per fleet to start; HA via sticky routing or shared (Redis) state is deferred.
  • Eviction is the SeaweedFS 7-day TTL plus a periodic per-tenant size-cap sweeper.
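
The entry-index lookup from the first bullet can be sketched as follows. The Index interface stands in for HeadObject and ListObjectsV2; it, Resolve, and the in-memory memIndex are hypothetical names used only for this example, and trying restore keys in the order given is an assumption about the intended semantics.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Object is the subset of S3 listing metadata the lookup needs.
type Object struct {
	Key          string
	LastModified time.Time
}

// Index abstracts the two S3 calls; a real gateway would back it with
// HeadObject and ListObjectsV2 against SeaweedFS.
type Index interface {
	Head(key string) (Object, bool)
	List(prefix string) []Object
}

// Resolve returns the object key to restore. scope is the tenant and version
// prefix ending in "/". An exact match wins; otherwise restore keys are tried
// in order and the newest object under the first matching prefix is chosen.
func Resolve(idx Index, scope, key string, restoreKeys []string) (string, bool) {
	if obj, ok := idx.Head(scope + key); ok {
		return obj.Key, true
	}
	for _, rk := range restoreKeys {
		matches := idx.List(scope + rk)
		if len(matches) == 0 {
			continue
		}
		newest := matches[0]
		for _, m := range matches[1:] {
			if m.LastModified.After(newest.LastModified) {
				newest = m
			}
		}
		return newest.Key, true
	}
	return "", false
}

// memIndex is an in-memory stand-in for the object store, for this example only.
type memIndex map[string]time.Time

// newMemIndex builds a memIndex from key -> unix-seconds pairs.
func newMemIndex(entries map[string]int64) memIndex {
	m := memIndex{}
	for k, sec := range entries {
		m[k] = time.Unix(sec, 0)
	}
	return m
}

func (m memIndex) Head(key string) (Object, bool) {
	t, ok := m[key]
	return Object{Key: key, LastModified: t}, ok
}

func (m memIndex) List(prefix string) []Object {
	var out []Object
	for k, t := range m {
		if strings.HasPrefix(k, prefix) {
			out = append(out, Object{Key: k, LastModified: t})
		}
	}
	return out
}

func main() {
	idx := newMemIndex(map[string]int64{
		"acct1/repo/v1/spm-aaa": 100,
		"acct1/repo/v1/spm-bbb": 200,
	})
	key, ok := Resolve(idx, "acct1/repo/v1/", "spm-ccc", []string{"spm-"})
	fmt.Println(key, ok)
}
```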

Two sharp edges, covered by tests from day one:

  1. Sign the path, not the full query. The Azure SDK appends &comp=block&blockid=... to the signed URL we handed it. The gateway’s HMAC must cover a canonical subset (object path plus our own sig and exp params) and ignore the SDK-added params, or every block PUT fails signature validation. The signed URL itself is the only auth on the blob path (the SAS-URL pattern), so no bearer token is needed there.
  2. S3 multipart minimum part size is 5 MB (except the last part). Fine at the default block size, but we assert it so a customer-tuned uploadChunkSize below 5 MB degrades gracefully (buffer or single-shot) instead of failing CompleteMultipartUpload.
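
The first sharp edge can be illustrated with a minimal signing sketch. The sig and exp parameter names come from the text above; the HMAC-SHA256 choice, the hex encoding, the path-newline-expiry canonical string, and the function names are assumptions of this example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// mac covers only the (escaped) object path and the expiry, never the full query.
func mac(secret []byte, path string, exp int64) string {
	h := hmac.New(sha256.New, secret)
	h.Write([]byte(path))
	h.Write([]byte{'\n'})
	h.Write([]byte(strconv.FormatInt(exp, 10)))
	return hex.EncodeToString(h.Sum(nil))
}

// SignURL returns base+path with exp and sig attached. path must already be
// in its escaped form; exp is unix seconds.
func SignURL(secret []byte, base, path string, exp int64) string {
	q := url.Values{}
	q.Set("exp", strconv.FormatInt(exp, 10))
	q.Set("sig", mac(secret, path, exp))
	return base + path + "?" + q.Encode()
}

// VerifyURL checks expiry and signature. Query parameters other than exp and
// sig (such as the SDK-added comp and blockid) do not affect the result.
func VerifyURL(secret []byte, rawURL string, now int64) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	q := u.Query()
	exp, err := strconv.ParseInt(q.Get("exp"), 10, 64)
	if err != nil {
		return errors.New("missing or malformed exp")
	}
	if now > exp {
		return errors.New("url expired")
	}
	want := mac(secret, u.EscapedPath(), exp)
	if !hmac.Equal([]byte(want), []byte(q.Get("sig"))) {
		return errors.New("bad signature")
	}
	return nil
}

func main() {
	secret := []byte("example-secret")
	signed := SignURL(secret, "https://cache.example.invalid", "/blob/acct1/v1/key", 2000)
	// The SDK appends its own parameters; verification must still succeed.
	fmt.Println(VerifyURL(secret, signed+"&comp=block&blockid=QUJD", 1000))
	// Past the expiry the same URL is rejected.
	fmt.Println(VerifyURL(secret, signed, 3000))
}
```

Because the expiry is part of the signed string, a client cannot extend a URL’s lifetime by editing exp.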

Throughput note. The Azure SDK skips parallel-block optimizations when the host does not look Azure-like (Blacksmith’s finding). We either give the gateway blob surface an Azure-shaped hostname or accept serialized transfers. This is a tuning detail; both are functionally correct.

runner-cache-proxy: host-side selective interception

We cannot use environment-variable override (see Alternatives and Prior Art): the runner re-injects the cache URLs per job, and ACTIONS_RUNTIME_TOKEN is multiplexed with other subsystems. Transparent operation therefore requires network interception, as Blacksmith found.

The proxy runs on the host (Mac mini or Hetzner node), never inside the customer guest, so the CA private key and the tenant token never touch customer-controlled ground.

  • pf (macOS) or nftables (Linux) DNAT all guest :443 to the proxy.
  • The proxy peeks the ClientHello SNI without decrypting:
    • SNI matches GitHub Actions cache hostnames: MITM with an on-the-fly certificate signed by a baked CA, then route by path. .../CacheService/* goes to cache-gateway (with token swap, below). Everything else (.../ArtifactService/*, OIDC, telemetry) is forwarded to genuine GitHub with the original token untouched. This selective routing is essential: actions/upload-artifact@v4 shares the same ACTIONS_RESULTS_URL host under a different service name.
    • Any other SNI: blind TCP splice, no decryption. MITM is bounded to GitHub’s cache plane only, which keeps unrelated customer traffic private and avoids breaking cert-pinned connections.
  • Token swap without trusting the guest. tart-kubelet already stages env and the SA token per VM on the host; it also stages the dispatch-minted cache token there. The proxy maps the connection source IP (the guest’s NAT address) to that guest’s cache token and injects it as the bearer to the gateway. The guest never holds a cache secret.
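
The routing rules above reduce to a small decision function. This sketch assumes a hypothetical Router type; the hostname used in main is a placeholder rather than a real GitHub hostname, and forwarding to GitHub when no token is staged for a source IP is an assumption in line with the fail-open stance.

```go
package main

import (
	"fmt"
	"strings"
)

type Route int

const (
	Splice          Route = iota // SNI not on the allowlist: blind TCP splice
	ForwardToGitHub              // allowlisted host, but not a cache call we handle
	ToGateway                    // known CacheService method with a staged token
)

const cacheServicePrefix = "/twirp/github.actions.results.api.v1.CacheService/"

// knownMethods lists the three coordination methods the gateway implements.
var knownMethods = map[string]bool{
	"CreateCacheEntry":         true,
	"FinalizeCacheEntry":       true,
	"GetCacheEntryDownloadURL": true,
}

// Router holds host-side state only; nothing here lives in the guest.
type Router struct {
	CacheHosts map[string]bool   // SNI allowlist
	Tokens     map[string]string // guest source IP -> staged cache token
}

// Decide returns the route and, for ToGateway, the token to inject as bearer.
// In a real proxy the SNI check happens before decryption and the path check
// after TLS termination; they are combined here for brevity.
func (r *Router) Decide(sni, path, srcIP string) (Route, string) {
	if !r.CacheHosts[strings.ToLower(sni)] {
		return Splice, ""
	}
	if !strings.HasPrefix(path, cacheServicePrefix) ||
		!knownMethods[path[len(cacheServicePrefix):]] {
		return ForwardToGitHub, ""
	}
	token, ok := r.Tokens[srcIP]
	if !ok {
		return ForwardToGitHub, "" // no staged token: let GitHub serve it
	}
	return ToGateway, token
}

func main() {
	r := &Router{
		CacheHosts: map[string]bool{"results.example.invalid": true},
		Tokens:     map[string]string{"192.168.64.2": "tok-a"},
	}
	route, token := r.Decide("results.example.invalid", cacheServicePrefix+"CreateCacheEntry", "192.168.64.2")
	fmt.Println(route == ToGateway, token)
	route, _ = r.Decide("registry.example.invalid", "", "192.168.64.2")
	fmt.Println(route == Splice)
}
```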

Only the CA public certificate is baked into the runner images’ trust stores (infra/runner-image, infra/linux-runner-image). The CA private key, the proxy binary, and the pf/nftables rules ship via host bootstrap (infra/macos-host-bootstrap and the Linux node KubeadmConfigTemplate).

Tenant scoping and the cache token

Tenant isolation is enforced in depth, by independent layers, so that no single failure (including full compromise of a customer guest) breaches another tenant’s cache. We treat the guest as adversarial: on macOS the entire Tart VM is the customer’s environment with root, so the customer can inspect or bypass anything running in the guest. The design does not depend on them not doing so.

The independent layers, outer to inner:

  1. Compute isolation. Each job runs in its own kata-fc microVM or Tart VM. There is no shared process or filesystem surface between tenants; the only shared resource is the cache store, reached solely through the gateway.
  2. Storage is not guest-routable. SeaweedFS lives on the private storage network and is never exposed to the guest. A guest can address only the gateway’s narrow protocol surface, never the object store directly.
  3. Host-side token custody. The cache token is staged host-side and injected by the host proxy based on a host-controlled source-IP to guest mapping. The token is never present inside the guest, and a guest cannot make the proxy attach a different tenant’s token (it cannot forge another guest’s NAT source identity). A fully compromised guest still cannot obtain a credential for another account.
  4. Cryptographic, least-privilege token scoping. The token is asymmetrically signed (the server holds the private key; the gateway only verifies, so the data plane holds no signing material), scoped to { account_id, repo, fleet }, and short-lived. The gateway derives the tenant portion of the SeaweedFS key prefix (account and repo) purely from the verified claims, never from request input, so there is no request parameter that lets a valid token for one account address another account’s prefix.
  5. Path-scoped, expiring blob URLs. Signed blob URLs embed the already-resolved object path plus an HMAC and an expiry. The blob surface validates only the HMAC and the expiry and never re-derives tenancy from client input, so even a leaked URL grants access to a single object for a bounded window, not to a tenant’s namespace.

Bypassing the host proxy is therefore not a privilege escalation: it costs the customer their own cache and grants nothing, because layers 1, 2, 4, and 5 hold without it. Network interception is a transparency and performance mechanism, not the tenant-isolation boundary. This is deliberate: the customer-controlled guest is an adversarial boundary, so isolation is layered rather than resting on any single control.

At dispatch (Tuist.Runners.dispatch_for_sa / mint_jit), the server already knows account_id, repo, and the fleet, and mints the token described in layer 4 alongside the fleet’s cache_gateway_url. Host agents (tart-kubelet, runners-controller) stage it host-side per layer 3.
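
As an illustration of layer 4, the sketch below mints and verifies a claims token. The RFC only specifies asymmetric signing; Ed25519, the payload.signature token layout, and the function names are assumptions made for this example.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Claims is the scope carried by the cache token.
type Claims struct {
	AccountID string `json:"account_id"`
	Repo      string `json:"repo"`
	Fleet     string `json:"fleet"`
	Exp       int64  `json:"exp"` // unix seconds
}

// NewKeyPair generates a signing key pair (server side only).
func NewKeyPair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(nil) // nil selects crypto/rand
}

// Mint runs on the server at dispatch time; the gateway never sees priv.
func Mint(priv ed25519.PrivateKey, c Claims) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	return enc.EncodeToString(payload) + "." + enc.EncodeToString(ed25519.Sign(priv, payload)), nil
}

// Verify runs on the gateway with only the public key. fleet is the gateway's
// own fleet; now is unix seconds.
func Verify(pub ed25519.PublicKey, token, fleet string, now int64) (Claims, error) {
	var c Claims
	p, s, ok := strings.Cut(token, ".")
	if !ok {
		return c, errors.New("malformed token")
	}
	enc := base64.RawURLEncoding
	payload, err := enc.DecodeString(p)
	if err != nil {
		return c, errors.New("malformed payload")
	}
	sig, err := enc.DecodeString(s)
	if err != nil || !ed25519.Verify(pub, payload, sig) {
		return c, errors.New("bad signature")
	}
	if err := json.Unmarshal(payload, &c); err != nil {
		return Claims{}, errors.New("malformed claims")
	}
	if c.AccountID == "" || c.Repo == "" || c.Fleet != fleet || now >= c.Exp {
		return Claims{}, errors.New("claims rejected")
	}
	return c, nil
}

func main() {
	pub, priv, err := NewKeyPair()
	if err != nil {
		panic(err)
	}
	token, err := Mint(priv, Claims{AccountID: "acct1", Repo: "org/app", Fleet: "linux", Exp: 2000})
	if err != nil {
		panic(err)
	}
	claims, err := Verify(pub, token, "linux", 1000)
	fmt.Println(claims.AccountID, claims.Repo, err)
}
```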

Server and host-agent changes

  • Tuist.Runners.dispatch_for_sa / mint_jit: add cache_token and cache_gateway_url to the dispatch response.
  • Host agents (tart-kubelet, runners-controller) stage the cache token host-side per guest, alongside the existing env and SA-token staging.
  • The gateway is configured with the verification public key. Nothing else changes server-side.

Protocol drift: failing open and observability

GitHub controls the cache protocol and can change it without notice (a new actions/cache version, a new Twirp service version, a changed blob behavior). The system is designed so that any such change degrades to cache misses, never to broken customer workflows, and so that we detect the drift immediately. This is safe to do aggressively precisely because interception is a performance mechanism and not the tenant-isolation boundary (see Tenant scoping): pass-through routes to GitHub’s own per-repo-isolated cache.

Failing open to GitHub. The interception layer is allowlist-based and treats anything it does not explicitly recognize as pass-through to genuine GitHub:

  • The runner-cache-proxy only MITMs SNIs on its cache-hostname allowlist and only redirects paths matching the known CacheService methods. An unrecognized hostname, service version, method, or unparseable request is blind-spliced or forwarded straight to GitHub, so the customer transparently gets GitHub’s hosted cache. Protocol drift is therefore fail-open by construction: an unknown shape routes to GitHub automatically.
  • The decision is made at the coordination call and is sticky for that entry: if we forward CreateCacheEntry / GetCacheEntryDownloadURL to GitHub, GitHub returns its own Azure blob URLs and the subsequent blob traffic passes through automatically. We never half-intercept an entry.
  • A health-gated circuit breaker trips the proxy to full pass-through when the gateway or SeaweedFS is unhealthy, unreachable, or slow (strict per-call timeouts; on timeout we pass through). Infra faults degrade the same way protocol drift does.
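
The fail-open behavior at the coordination call can be sketched as below. FailOpen, Handler, and simulate are hypothetical names, the 50 ms timeout is an arbitrary value for the demo, and the health flag stands in for whatever health checks the proxy ends up using.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Handler performs one coordination call against a backend.
type Handler func(ctx context.Context) (string, error)

// FailOpen wraps the gateway with a strict timeout and a GitHub fallback.
type FailOpen struct {
	Timeout time.Duration
	Healthy func() bool
	Gateway Handler
	GitHub  Handler
}

// Coordinate returns the response and which backend served it. Any gateway
// error, including a timeout, falls through to genuine GitHub.
func (f *FailOpen) Coordinate(parent context.Context) (resp, servedBy string, err error) {
	if f.Healthy() {
		ctx, cancel := context.WithTimeout(parent, f.Timeout)
		resp, err = f.Gateway(ctx)
		cancel()
		if err == nil {
			return resp, "gateway", nil
		}
	}
	resp, err = f.GitHub(parent)
	return resp, "github", err
}

// simulate runs one coordination call with a gateway that takes gatewayDelay
// to answer, and reports which backend served the call.
func simulate(healthy bool, gatewayDelay time.Duration) string {
	f := &FailOpen{
		Timeout: 50 * time.Millisecond,
		Healthy: func() bool { return healthy },
		Gateway: func(ctx context.Context) (string, error) {
			select {
			case <-time.After(gatewayDelay):
				return "gateway-signed-url", nil
			case <-ctx.Done():
				return "", ctx.Err()
			}
		},
		GitHub: func(ctx context.Context) (string, error) {
			return "github-azure-url", nil
		},
	}
	_, servedBy, _ := f.Coordinate(context.Background())
	return servedBy
}

func main() {
	fmt.Println("fast gateway:     ", simulate(true, 0))
	fmt.Println("slow gateway:     ", simulate(true, 2*time.Second))
	fmt.Println("unhealthy gateway:", simulate(false, 0))
}
```

Applying the fallback only at the coordination call matches the sticky per-entry decision: whichever backend answers there also hands out the blob URLs that follow.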

The degraded state is the one we can tolerate: recent entries we had been serving from SeaweedFS are cold on GitHub, so restores miss until GitHub re-warms, but every workflow keeps running. actions/cache reinforces this, since cache restore failures are warnings rather than job failures by default; our job is only to ensure we never turn a cache interaction into something worse than a miss (a TLS failure, a connection refused, a hang).

Detecting drift immediately. The gateway and proxy emit Prometheus metrics into the same prom_ex / Grafana / Grafana IRM stack the runners feature already uses:

  • Pass-through-fallback rate as a first-class SLI. In steady state effectively all cache traffic is handled by us, so a rising fallback rate is the single strongest signal of either protocol drift or an unhealthy gateway. Alert on any sustained nonzero rate of unknown-protocol-shape fallbacks, and on fallback rate crossing a low threshold.
  • Protocol-shape counters tagged by service name, method, detected version, and schema-validation result, so an alert points directly at what changed. Unrecognized shapes are logged (sanitized) for fast diagnosis.
  • Per-fleet cache SLIs: coordination success and error, blob upload and download success and error, Azure-to-S3 translation errors, and cache hit rate. A hit-rate drop or 5xx spike is a secondary signal.
  • A synthetic canary: a scheduled job runs a real actions/cache save-and-restore through the gateway end to end (the same controlled-runner harness from rollout step 2), pinned to the latest actions/cache, and alerts on failure. This exercises the real action against our gateway continuously, catching an action update or GitHub protocol change as it ships, typically before customers hit it.
  • Release watch on actions/cache and actions/toolkit: a new release triggers re-validation of the gateway against it, since protocol changes usually arrive via a new action version first.

Helm wiring

  • A cacheGatewayFleet[] block per fleet in infra/helm/tuist: image digest pin, SeaweedFS S3 endpoint and credentials secret ref, verification public key.
  • SeaweedFS via its upstream chart (or a thin wrapper), one release per fleet.
  • Runner-image digest bumps flow through the existing release pipelines. The CA certificate and proxy ship via host bootstrap.

Scope

In scope:

  • GitHub Actions cache v2 protocol (current and forward-looking).
  • macOS (Scaleway) and Linux (Hetzner) fleets, each with an independent SeaweedFS cluster.
  • Transparent operation: no customer workflow changes.

Out of scope for the first iteration:

  • Cache v1 protocol. Added only if telemetry shows old actions/cache pins in customer workflows.
  • Gateway HA (multi-replica with shared multipart state).
  • Cross-fleet or global cache deduplication.
  • A customer-facing cache management UI or API.

Trade-offs

Advantages

  • Cache I/O stays on the fleet’s private network at LAN speed, which is the entire point of running jobs on dedicated hardware.
  • Zero customer-facing change. Unmodified actions/cache keeps working.
  • We own the storage and can tune durability, retention, and capacity per fleet.
  • The interception surface is bounded to GitHub’s cache plane, so other traffic (artifacts, OIDC) is untouched and private.
  • Secrets (CA key, tenant token) stay host-side, off the customer-controlled guest.

Disadvantages

  • We take on a new data-plane service (cache-gateway) and a host-side MITM proxy, plus SeaweedFS as new stateful infrastructure to operate, on call, in two datacenters.
  • The Azure-Block-Blob-to-S3 translation is non-trivial: stateful block-ID to multipart-part mapping and the sign-the-path-not-the-query detail are easy to get subtly wrong.
  • MITM of GitHub’s cache hostnames requires a baked CA in the runner images. This is acceptable because we own the images, but it is a trust decision worth stating plainly.
  • Per-account cache quota and metrics become per-fleet because of the macOS/Linux split. A unified account view requires extra roll-up.
  • We are coupled to GitHub’s cache protocol. If GitHub changes the Twirp service or the blob protocol, the gateway must follow. The interception layer is designed to fail open to GitHub’s hosted cache so a change degrades to cache misses rather than broken workflows, and a synthetic canary plus pass-through-rate alerting catch drift early (see Protocol drift: failing open and observability).

Alternatives considered

Environment-variable override instead of network interception

Set ACTIONS_RESULTS_URL, ACTIONS_CACHE_SERVICE_V2, and ACTIONS_RUNTIME_TOKEN in the job environment to point actions/cache at our gateway. Rejected. The runner re-injects these per job from the job message, so a pre-staged value is overwritten (config.ts reads them straight from process.env). Even if the URL stuck, ACTIONS_RUNTIME_TOKEN is multiplexed with artifacts and OIDC (toolkit issue #1051), so repointing it would break other subsystems. Interception is the only transparent option.

Swap actions/cache for an S3-native action

Have customers use an action like tespkg/actions-cache that targets S3 directly. Rejected. It requires every customer to change their workflow, which violates the zero-change goal and would not transparently cover existing pipelines.

Managed S3 (Tigris or similar) instead of self-hosted

Use our existing managed object storage rather than running SeaweedFS. Rejected for this use case. The whole point is LAN-speed locality next to the runner; a managed bucket reintroduces a network hop and egress we cannot keep on a private network. Self-hosting co-located storage is the requirement, not an implementation detail.

MinIO as the storage backend

Historically the obvious S3-compatible self-hosted choice. Rejected. MinIO stripped the console from its community edition in May 2025, and in February 2026 the upstream repository was marked no longer maintained and archived. It is not a safe foundation for a new deployment. SeaweedFS (Apache 2.0, mature since 2015, fast on both large and small objects, rack-aware replication) is the chosen backend. Garage (AGPL, Rust, simple multi-node) was the runner-up and remains a fallback.

One shared store across both fleets

A single SeaweedFS cluster serving both macOS and Linux. Rejected. It forces one fleet to cross the internet for cache, defeating the locality goal, and provides no real benefit because cache entries are already OS-scoped by key convention.

In-guest proxy instead of host-side

Run the interception proxy inside the runner VM (closer to Blacksmith’s nginx-in-VM). Rejected as the primary design. It would place the CA private key and tenant token inside the customer-controlled guest. Host-side termination keeps both secrets off customer ground while reusing the existing tart-kubelet host-proxy pattern and the existing nftables substrate.

Relationship to sticky cache volumes

A natural follow-on question is whether we should also offer per-repo sticky cache volumes, exemplified by Blacksmith’s sticky disks: a persistent block volume attached to the runner that carries arbitrary on-disk state across runs (Docker layer cache, incremental compilation state, large dependency trees). This is complementary, not an alternative. It is not a different backing store for the GHA cache and replaces nothing in this RFC; it is an additive feature that would sit alongside it.

Blacksmith itself splits the two the same way: its Cache product backs actions/cache with object storage (MinIO / S3), exactly as proposed here, while its sticky disks are a separate product whose canonical use is Docker layer caching. The two coexist on a single job: actions/cache traffic goes to our S3-backed gateway, and a sticky volume could be mounted at, say, the Docker data root for state that actions/cache models poorly. We are deferring the volume piece deliberately, and would treat it as its own RFC.

How it differs. The GHA cache in this RFC is protocol-level and key-addressed: it intercepts actions/cache, stores keyed tarballs in shared S3 namespaced by the token’s claims, and never attaches anything to the VM. A sticky volume is block-level: the job sees a mounted filesystem that persists as a per-repo image, capturing whatever the build leaves on disk, not only what the customer explicitly wired into actions/cache. It needs volume lifecycle management (provision, attach, snapshot, garbage-collect, quota) and, critically, concurrency control, because a read-write block device cannot be safely shared between two jobs of the same repo running at once.

Why not now. Sticky volumes conflict head-on with the shared warm pool and dispatch-time binding the runners architecture is built on. A per-tenant block device must be attached at VM or Pod creation (tart run --disk, a CSI PVC), but at that moment the Pod is still identity-less and does not learn its tenant until it claims a job. Supporting sticky volumes would mean either binding identity ahead of dispatch (giving up the shared pool), hot-attaching a device after the claim (poorly supported on Tart today), or cloning a per-repo volume on claim. All three are significant architectural changes. By contrast, the protocol-level cache covers the most common and highest-value case (actions/cache) transparently, with no change to dispatch and no per-tenant attach. It is the right first step, and it stands on its own regardless of whether we later add volumes.

Very rough future shape. On Linux (kata-fc), virtio block hot-plug after the claim is the most plausible path: keep a per-repo volume on a block-snapshot store, and on dispatch attach a copy-on-write snapshot to the microVM, reconciling on job completion (newest-wins, or per-branch volumes). On macOS (Tart), attach a per-repo disk image at launch via --disk, which likely requires a per-tenant warm sub-pool bound slightly ahead of dispatch, since Tart hot-plug is weak. Either way the substrate is block storage with copy-on-write snapshots (local NVMe or Cloud Volume snapshots), not the SeaweedFS object tier this RFC introduces, and the central new problem to solve is per-repo concurrency (a single-writer lock, or a COW snapshot per job with a merge or last-writer policy).
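
To make the per-repo concurrency problem concrete, here is a minimal sketch of one possible policy: the first job to claim a repo becomes its single writer, and any concurrent job of the same repo gets a throwaway copy-on-write snapshot whose changes are discarded. The class name, the mode strings, and the discard policy are all hypothetical; a merge or last-writer policy would replace the "snapshot" branch.

```python
from typing import Dict


class RepoVolumeLeases:
    """In-memory single-writer lease per repo (illustrative only, not durable)."""

    def __init__(self) -> None:
        self._writers: Dict[str, str] = {}  # repo -> job_id holding the writer lease

    def claim(self, repo: str, job_id: str) -> str:
        # setdefault stores job_id only if no writer exists yet for this repo.
        holder = self._writers.setdefault(repo, job_id)
        # The lease holder mounts the volume read-write; everyone else gets
        # a copy-on-write snapshot (assumed to be discarded at job end).
        return "read-write" if holder == job_id else "snapshot"

    def release(self, repo: str, job_id: str) -> bool:
        # Only the current holder may release; returns True if a lease was freed.
        if self._writers.get(repo) == job_id:
            del self._writers[repo]
            return True
        return False
```

A production version would need the lease to live in a durable store with an expiry, so that a crashed job cannot hold a repo's writer slot forever.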

Rollout plan

The rollout is staged so that each layer is validated before the next depends on it.

  1. SeaweedFS up per fleet, validated via raw S3 access. No runner integration yet.
  2. Gateway correctness, decoupled from interception. Point a real actions/cache v2 run at the gateway from a controlled internal runner where we set the env directly (nothing overwrites it there). This proves the Azure-to-S3 translation end-to-end before any MITM exists.
  3. Host proxy (SNI routing, MITM, token swap) on one canary host. Bake the CA, wire pf/nftables.
  4. Dispatch token minting and host-side staging.
  5. Canary one fleet, measure cache hit rate and blob throughput against GitHub’s hosted cache, then roll the second fleet.

Each environment progression (staging, canary, production) gates on cache correctness and a throughput comparison.

Future work

  • Cache v1 protocol support if telemetry warrants it.
  • Gateway HA: multi-replica with sticky routing or Redis-backed in-flight multipart state.
  • Per-account quota and metrics roll-up across both fleets for a unified customer view.
  • Cache analytics surfaced to customers (hit rate, size, savings).
  • Evaluate Kura (our Rust distributed cache mesh) as a cross-node backing layer for warm cache beyond a single node, if node-local stickiness proves insufficient.
  • Per-repo sticky cache volumes for workloads that re-tarball large on-disk state every run (Docker layer cache, incremental builds). See Relationship to sticky cache volumes for how it complements this proposal and a rough shape; it warrants its own RFC.

Open questions

  • Should per-fleet cache quota be exposed to customers as-is, or rolled up into an account-wide number from day one?
  • Do we make the gateway blob hostname Azure-shaped to keep the SDK’s parallel-block path, or accept serialized transfers initially and revisit on measured throughput?
  • What is the right per-tenant size cap and eviction policy beyond the 7-day TTL, and should it scale with plan?
  • On the macOS fleet, do we self-host SeaweedFS on a Scaleway instance plus Block Storage, or point the gateway at Scaleway’s managed Object Storage? The latter is materially cheaper and still in-region, and the gateway is identical either way. See Expected storage cost.

References

Thanks for writing this. The overall direction makes sense to me, especially keeping the cache path local to each runner fleet and treating the guest as adversarial.

There are a few things I’d like to bring up:

First, tenant isolation looks right at the token level, but I’d make the path and key canonicalization story explicit. The gateway derives paths from account_id, repo, version, and key, and key / restore_keys are workflow-controlled. Can we guarantee that raw keys never let someone escape their namespace through ../, encoded slashes, double-decoding, backslashes, Unicode normalization, or router-level path normalization? I’d prefer either a canonical slash-free encoding that still preserves restore-prefix semantics, or a small metadata index plus opaque object IDs. The tests should cover traversal, percent-encoding, broad restore prefixes, and the exact canonical value used by signing, routing, and S3 operations.

Related to that, I think we should decide explicitly whether we want to match GitHub’s branch/ref cache scope semantics or intentionally accept broader repo-level cache sharing. The action computes the key and version, but GitHub also applies server-side scope rules for current branches, default branches, base branches, and PR merge refs. If our gateway only scopes by { account_id, repo, fleet, version, key }, two runs with the same key could share cache entries in cases where GitHub-hosted cache would keep them separate.

Second, one storage cluster per fleet is fine for the first version, but I’d like the RFC to imagine how this would scale once cache traffic grows. Not necessarily to implement sharding now, but to make the future shape explicit: what dimension we would shard on (account, repo, fleet, or region); how dispatch or the gateway would find the right shard; how a large tenant would be moved or isolated; and whether rebalancing would mean accepting cold misses or doing background copy.

I also expect a noisy-neighbor problem around the storage servers’ NICs: a few large restores or saves can saturate private-network bandwidth and make everyone else’s jobs look slower, even if CPU and disk are fine. I’d like the RFC to define the metrics that tell us this part of the infrastructure is degrading the perceived performance of the runners:

  • cache restore/save wall-clock time as a percentage of job time
  • p50/p95/p99 blob throughput and first-byte latency
  • gateway coordination latency and error rate
  • fallback-to-GitHub rate
  • object-store read/write latency
  • storage utilization and eviction pressure
  • per-tenant bandwidth
  • per-shard NIC utilization and private-network saturation

Those are the signals that would tell us when the cache is no longer helping, or when one tenant or one region is starting to affect everyone else.

Third, I think there’s a broader artifact story here that overlaps with the discussions around Kura as a technology for making cached artifacts reusable beyond CI environments. A resolved node_modules/ installation is a good example: it is useful in CI, but the same artifact could also be valuable outside the GitHub Actions cache protocol if we can address it, validate it, and distribute it through a Tuist-owned cache layer. The RFC proposes SeaweedFS as the storage backend for actions/cache, while Kura is the technology we have been discussing for reusable distributed cache artifacts. I think the RFC should explain how we reconcile those two storage directions. Is SeaweedFS only an implementation detail behind the GitHub cache gateway, with a boundary that Kura could replace or front later? Should Kura eventually become the artifact index and distribution layer, with SeaweedFS only storing opaque blobs? Or are these intentionally separate systems with no migration path between them? I wouldn’t expand this RFC to solve reusable artifacts, but I’d like it to state the relationship explicitly so we don’t build metadata, addressing, quotas, metrics, and retention twice, or create a cache layout that makes future Kura-backed reuse harder.

1. Key canonicalization / namespace safety. Agreed, and it changes the design. I’m dropping the DB-less, key-derived-path approach for opaque, server-generated object IDs plus a small metadata index. key/restore_keys never become a path, URL, or signed string — they’re only parameterized index values, while storage paths use the token-derived account_id plus an opaque ID. That makes traversal structurally impossible rather than something we filter for: there’s no concatenation to escape from. Keys are treated as opaque bytes with no normalization (byte-distinct = distinct entry), and one canonical value (the opaque ID) is used identically by signing, routing, and the S3 op. The test matrix covers ../ and ..\, single and double percent-encoding of slashes and dots, overlong UTF-8 and NFC/NFD pairs, empty and single-character restore prefixes, router and SDK path normalization, and a property test that signed == routed == S3 key.
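
As a rough illustration of the opaque-ID approach, the sketch below keeps key and restore_keys as parameterized index values and builds the storage path only from the token-derived account_id and a server-generated ID. The table layout, function names, and the in-memory SQLite index are assumptions for illustration, and the restore order (exact key, then restore_keys as prefixes, newest first) is a simplification of the real lookup rules.

```python
import sqlite3
import uuid
from typing import Optional, Sequence

# Hypothetical schema: the workflow-controlled key is stored as raw bytes
# in a column and is never concatenated into a path, URL, or signed string.
SCHEMA = """
CREATE TABLE cache_entries (
    account_id TEXT NOT NULL,
    repo       TEXT NOT NULL,
    version    TEXT NOT NULL,
    cache_key  BLOB NOT NULL,
    object_id  TEXT NOT NULL,
    created_at INTEGER NOT NULL
)
"""


def open_index() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def storage_path(account_id: str, object_id: str) -> str:
    # Both parts are server-side values: account_id comes from the token,
    # object_id is generated below. No caller input reaches this string.
    return f"{account_id}/{object_id}"


def save(conn, account_id: str, repo: str, version: str, key: bytes, now: int) -> str:
    object_id = uuid.uuid4().hex  # opaque, 32 hex characters
    conn.execute(
        "INSERT INTO cache_entries VALUES (?, ?, ?, ?, ?, ?)",
        (account_id, repo, version, key, object_id, now),
    )
    return storage_path(account_id, object_id)


def restore(conn, account_id: str, repo: str, version: str, key: bytes,
            restore_keys: Sequence[bytes] = ()) -> Optional[str]:
    rows = conn.execute(
        "SELECT cache_key, object_id FROM cache_entries "
        "WHERE account_id = ? AND repo = ? AND version = ? "
        "ORDER BY created_at DESC, rowid DESC",
        (account_id, repo, version),
    ).fetchall()
    # Exact byte match first: keys are opaque bytes, with no normalization.
    for stored, object_id in rows:
        if stored == key:
            return storage_path(account_id, object_id)
    # Then each restore key as a byte prefix, newest entry first.
    # (Filtering in Python keeps the sketch simple; a real index would
    # push the prefix match into the query.)
    for prefix in restore_keys:
        for stored, object_id in rows:
            if stored.startswith(prefix):
                return storage_path(account_id, object_id)
    return None
```

With this shape, a key such as ../../acct-2/secret is just a column value: it can match or miss, but it cannot influence where the object is stored.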

2. Ref / branch scope. Agreed this is a security boundary (cross-branch and fork-PR cache poisoning), not just a hit-rate concern. We’ll match GitHub’s ref-scoping rather than collapse to repo-level sharing. The dispatch-minted cache token already carries account_id/repo/fleet; we extend it with the ref-scope claims dispatch already has from the workflow_job webhook (creating ref, default branch, PR base ref, untrusted-fork flag). Those claims become a scope dimension in the index, applied as a read predicate alongside account_id. A canary can start with a simplified own-ref + default-branch rule on trusted repos, but GitHub-equivalent scoping lands before we enable any untrusted or multi-branch workflows.
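
A minimal sketch of the read predicate, assuming the rule is roughly "a run writes to its own ref and may read from its own ref, the PR base ref if any, and the default branch". The class and function names are hypothetical. The untrusted-fork flag is deliberately left out because its exact policy is not pinned down here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RefScopeClaims:
    """Ref-scope claims carried in the cache token (hypothetical shape)."""
    ref: str                         # ref that created the run
    default_branch: str              # e.g. refs/heads/main
    base_ref: Optional[str] = None   # set only for pull request runs


def write_scope(claims: RefScopeClaims) -> str:
    # Entries are always written under the run's own ref.
    return claims.ref


def readable_scopes(claims: RefScopeClaims) -> List[str]:
    # Own ref first, then PR base ref, then default branch, without duplicates.
    scopes = [claims.ref]
    for candidate in (claims.base_ref, claims.default_branch):
        if candidate and candidate not in scopes:
            scopes.append(candidate)
    return scopes


def can_read(entry_scope: str, claims: RefScopeClaims) -> bool:
    # Applied alongside the account_id predicate when querying the index.
    return entry_scope in readable_scopes(claims)
```

Under this rule a feature branch cannot read entries written by a sibling branch or by a pull request's merge ref, which is the poisoning path the scope is meant to close.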


3. Scaling, noisy neighbors, and metrics. I’ll correct my own framing here first: leading with an application-level “shard on account_id + shard map” was over-engineered. That pattern is operationally heavy (maintain the map, run migrations, promote tenants) and explicitly doesn’t scale by itself. The right model is the one Blacksmith effectively uses with their MinIO cluster: one self-scaling object-store cluster per fleet, with tenants as buckets, and app-level sharding kept only as a rare escape hatch.

How it self-scales:

  • SeaweedFS scales as a cluster. Capacity and throughput grow by adding volume servers; the cluster starts placing new volumes onto them automatically, and the front end (filer + S3 gateway) scales horizontally. Adding capacity is “the cluster absorbs a node,” not a migration project.
  • The TTL bounds the working set. A 7-day expiry means the cache has a steady-state size, not unbounded growth — it’s not a system of record. New writes flow onto free capacity immediately, and old volumes age out, so the cluster self-levels over a TTL window without active rebalancing. Most of the time it doesn’t need to scale at all because it self-trims.
  • Tenants are buckets, which gives per-customer credentials, quota, and clean deletion at near-zero marginal cost. Onboarding a customer is creating a bucket, never a sharding decision.
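
The TTL argument above can be put as back-of-envelope arithmetic. This sketch assumes entries expire a fixed number of days after they are written and that the daily write volume is roughly constant, which makes write volume times TTL an upper bound on the working set. The function names, the replication factor, and the headroom fraction are illustrative assumptions, not measured values.

```python
import math


def steady_state_bytes(write_bytes_per_day: int, ttl_days: int = 7,
                       replication: int = 1) -> int:
    # Everything older than the TTL has expired, so at most ttl_days of
    # writes (times the replication factor) are on disk at any time.
    return write_bytes_per_day * ttl_days * replication


def volume_servers_needed(working_set_bytes: int, usable_bytes_per_server: int,
                          headroom: float = 0.3) -> int:
    # Keep a fraction of each server free so new writes always have room
    # while old volumes age out.
    effective = usable_bytes_per_server * (1 - headroom)
    return math.ceil(working_set_bytes / effective)
```

For example, a hypothetical fleet writing 500 GB per day with a 7-day TTL tops out around 3.5 TB, which fits on three servers with 2 TB usable each at 30% headroom.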

The real scaling signal isn’t capacity, it’s NIC / private-network bandwidth. A handful of large restores or saves can saturate a storage server’s NIC and slow every other job on it while CPU and disk sit idle — that’s the noisy-neighbor failure, and it’s a physical-bandwidth problem, not a CPU/disk one. Mitigations are per-tenant bandwidth limits at the gateway, and adding a node when a cluster’s NIC trends toward saturation. The metrics are what tell us that’s happening before customers feel it; your list is exactly right and I’ll add all of it, framed around “is the cache still helping / is one tenant or region degrading others”: restore+save time as a share of job time (the north-star), hit rate, blob throughput p50/95/99 and first-byte latency, coordination latency/error, fallback-to-GitHub rate, object-store latency, storage utilization + eviction pressure, per-tenant bandwidth, and per-shard NIC utilization / private-network saturation.
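
One way the per-tenant bandwidth limit at the gateway could look is a token bucket per tenant, sketched below. The class name, the rate and burst parameters, and the idea of charging per transferred chunk are assumptions for illustration; the gateway's real limiter may differ.

```python
import time
from typing import Callable, Dict, Tuple


class TenantBandwidthLimiter:
    """Token bucket per tenant: tokens are bytes, refilled at a fixed rate."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last_ts)

    def try_consume(self, tenant: str, nbytes: int) -> bool:
        now = self._clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = nbytes <= tokens
        if allowed:
            tokens -= nbytes
        self._buckets[tenant] = (tokens, now)
        return allowed
```

The caller would check try_consume before forwarding each chunk and delay or shed the transfer when it returns False. Chunks larger than burst_bytes can never pass, so the chunk size has to stay at or below the burst.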

App-level sharding then drops to an escape hatch, not the strategy — reserved for the rare tenant that outgrows a single cluster or needs hard physical isolation (a noisy whale, or a data-residency/compliance requirement). Because it’s rare and deliberate, it never becomes routine ops. The one honest limit: on the Hetzner bare-metal fleet, adding a node is still a manual Robot order (those hosts are monthly-billed and operator-provisioned, same as the runner fleet), so hardware growth isn’t fully autonomous — but the cluster absorbs the node automatically once it’s in, and the TTL-bounded working set makes that a rare action rather than a treadmill.


4. Kura. I want to draw a firmer line than my framing implied, and correct how I drew it. You’re right that “input/output-defined vs. opaque” doesn’t hold up — Kura is a CAS storing arbitrary blobs too, so that’s not the distinction. The real difference is where the artifacts are consumed, and the distribution topology that follows from it:

  • The GitHub Actions cache is only useful inside the runner environment. Its sole consumer is a runner in one fleet/region; nothing outside that fleet ever reads it. So the correct topology is region-local and disposable — one store per fleet, no cross-region replication. SeaweedFS-per-fleet is exactly that.
  • Kura exists for the opposite requirement: replicating and distributing artifacts across regions and environments (CI, local dev, other machines) so they’re reusable everywhere. Geo-distribution is its reason to exist.

Putting the GitHub cache on Kura would mean running a cross-region distribution mesh for a workload whose artifacts are only ever consumed in the region that produced them — paying for replication that has no consumer, and adding coupling and latency for nothing. The two have opposite distribution requirements: one deliberately local, the other deliberately distributed.

So they stay separate systems, by design. SeaweedFS-per-fleet is the right backend for the GitHub cache, and Kura should not back it — not because of how blobs are addressed (both are CASes), but because replicating this cache beyond its fleet buys nothing. The only thing worth not duplicating is the generic infrastructure underneath — object storage, metrics/alerting plumbing, capacity planning — not the systems themselves.