This RFC proposes a self-hosted GitHub Actions cache backend for Tuist Runners, co-located with each runner fleet. Today, a customer job running on a Tuist runner still ships its actions/cache traffic across the internet to GitHub’s hosted cache (Azure Blob Storage). The round trip negates much of the value of running the job close to fast, dedicated hardware. We propose terminating the GitHub Actions cache protocol inside our own infrastructure and storing artifacts in self-hosted, S3-compatible object storage (SeaweedFS), one cluster per fleet. The customer changes nothing in their workflow. This is an infrastructure and performance improvement, not a new user-facing product surface.
The two fleets are served by two independent storage clusters: SeaweedFS on Scaleway for the macOS fleet, SeaweedFS on Hetzner for the Linux fleet. The split is the locality-optimal design, not a compromise, and falls out naturally because cache keys are already OS-scoped.
Motivation
Tuist Runners place customer CI jobs on dedicated macOS (Scaleway Apple Silicon) and Linux (Hetzner bare-metal) hardware. The compute is fast and close, but the cache is not:
- Cache I/O leaves the datacenter. actions/cache uploads and downloads tarballs to GitHub’s hosted cache, which is Azure Blob Storage. A job on a Hetzner box in Germany or a Mac mini on Scaleway pulls and pushes its cache across the public internet on every run.
- Cache transfer dominates wall-clock for cache-heavy jobs. For Swift/Xcode work the cache payloads (DerivedData, SwiftPM .build, actions/cache archives) are large, and the transfer time is a large fraction of total job time.
- We pay for egress we do not control. Cross-internet cache traffic is bandwidth we cannot optimize and cannot keep on a private network.
- The hardware advantage is undercut. Customers move to Tuist Runners for speed. A slow cache path is the most visible way that promise leaks.
The problem statement: cache artifacts for jobs running on Tuist Runners should be stored on fast storage co-located with the runner, on a private network, with zero changes to the customer’s workflow.
Prior Art
Blacksmith
Blacksmith (blacksmith.sh/blog/cache) solves the same problem for their Firecracker-based runners. Their design:
- A lightweight proxy inside each runner VM intercepts cache requests and forwards them to a host-level proxy, while non-cache GitHub control-plane traffic goes to its usual destination.
- nftables on the host atomically swaps redirect rules per VM, replacing an earlier iptables approach that did not handle multi-VM rule management cleanly.
- A host proxy decodes GitHub’s Azure-Blob-style URLs and translates them into S3-compatible calls against a self-hosted MinIO cluster, using an in-house Azure-Blob-to-S3 SDK.
- They reverse-engineered GitHub’s move to a Twirp-based cache service that uses the Azure Blob Storage SDK, and found the SDK skips concurrency optimizations when the hostname does not look like an Azure endpoint, so they mint Azure-like URLs.
The key lesson we adopt: terminate the cache protocol with network interception, and translate Azure Blob operations to S3. The key place we diverge: storage backend (SeaweedFS, not MinIO, see Alternatives) and where the interception terminates.
actions/toolkit cache internals
We grounded the design in the toolkit source rather than assumptions:
- packages/cache/src/internal/config.ts: the action selects cache service v2 only if the runner sets ACTIONS_CACHE_SERVICE_V2, otherwise v1 (and always v1 on GitHub Enterprise Server). The service URL for v2 is ACTIONS_RESULTS_URL; for v1 it is ACTIONS_CACHE_URL, falling back to ACTIONS_RESULTS_URL. These come straight from the runner-injected environment.
- packages/cache/src/internal/uploadUtils.ts: v2 upload uses @azure/storage-blob BlockBlobClient.uploadFile(path, { blockSize, concurrency, maxSingleShotSize: 128 * 1024 * 1024 }). Files at or below 128 MB go up as a single Put Blob; larger files are staged as Put Block calls and committed with Put Block List.
- packages/cache/src/internal/downloadUtils.ts: v2 download uses BlockBlobClient.downloadToBuffer with a 128 MB max segment and internal ranged GETs, parallelized by downloadConcurrency.
- packages/cache/src/internal/shared/cacheTwirpClient.ts: coordination is POST to /twirp/github.actions.results.api.v1.CacheService/<Method>, JSON or protobuf body, Authorization: Bearer <ACTIONS_RUNTIME_TOKEN>.
- actions/toolkit issue #1051: there is no supported way to override the cache URL, and ACTIONS_RUNTIME_TOKEN is multiplexed with other runner subsystems (artifacts, OIDC), so it cannot be repointed without breaking them. This rules out the environment-variable override approach (see Alternatives).
Current state
Tuist Runners use a shared warm pool with dispatch-time binding. A runner Pod or Tart VM boots without an identity, polls /api/internal/runners/dispatch, and only learns its tenant (owner) in the response that hands it a JIT config. The VM or container is single-job ephemeral and is destroyed after the job.
- macOS: Tart VMs on Scaleway Apple Silicon Mac minis, driven by tart-kubelet. The VM is NAT’d behind the mini. Env and the SA token are staged host-side by tart-kubelet and shared into the guest.
- Linux: kata-fc (Firecracker) microVMs on Hetzner bare-metal, scheduled as Pods. Env and the SA token are projected natively by kubelet.
There is already a host-side reverse proxy pattern in tart-kubelet (used today to expose VM metrics to the cluster). nftables is already in use on the Linux substrate. Both are relevant building blocks for the interception layer.
No cache backend exists today. Customer jobs use GitHub’s hosted cache as-is.
Proposal
Architecture overview
One self-hosted cache stack per fleet. Each stack has three parts:
- SeaweedFS, S3-compatible object storage, server-side only, on the fleet’s private network.
- cache-gateway, a Go service that terminates the GitHub Actions cache v2 protocol (Twirp coordination plus the Azure Block Blob transfer protocol) and translates it to S3 against SeaweedFS.
- runner-cache-proxy, a host-side SNI-routing proxy that transparently redirects cache traffic from the guest to the gateway and leaves all other traffic alone.
┌─ Fleet datacenter (Scaleway or Hetzner) ─────────────────────────┐
│                                                                  │
│  Runner host (Mac mini / bare-metal node)                        │
│  ┌─ Guest (Tart VM / kata-fc microVM) ─┐                         │
│  │ actions/cache (v2, Azure Blob SDK)  │                         │
│  └──────────────┬──────────────────────┘                         │
│       :443 DNAT │ (pf / nftables)                                │
│                 ▼                                                │
│  ┌──────────────────────────────┐                                │
│  │ runner-cache-proxy (host)    │  SNI peek:                     │
│  │ • CacheService → gateway     │  GitHub cache host → MITM      │
│  │ • everything else → GitHub   │  other SNI → blind splice      │
│  └──────────────┬───────────────┘                                │
│                 ▼                                                │
│  ┌──────────────────────────────┐      ┌────────────────────┐    │
│  │ cache-gateway (per fleet)    │─────▶│ SeaweedFS (S3)     │    │
│  │ twirp + Azure-blob ⇄ S3      │◀─────│ private network    │    │
│  └──────────────────────────────┘      └────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
The guest only ever talks to the gateway’s own hostnames. SeaweedFS is never exposed to the guest.
Storage: SeaweedFS, one cluster per fleet
SeaweedFS runs as a dedicated storage tier on both fleets, deliberately separated from the runner nodes:
- Scaleway (macOS): a dedicated Scaleway cloud instance with a block volume in the same region/AZ as the Mac minis. We do not run storage on the minis: they are precious compute, CAPI-managed, and reimaged.
- Hetzner (Linux): dedicated storage nodes on the same private network as the runner bare-metal boxes, not the runner boxes themselves. Default to dedicated bare-metal NVMe storage nodes (ordered through the same Hetzner Robot flow the runner fleet already uses); fall back to Hetzner Cloud VMs with attached Cloud Volumes if avoiding extra Robot hardware outweighs the throughput penalty (see below).
Why a dedicated storage tier rather than co-locating on the runner nodes. Putting SeaweedFS on the runner bare-metal boxes is tempting (same NVMe, zero network hops) but couples storage to the revenue-generating workload in three ways we do not want:
- Resource contention. Cache is large and write-heavy. SeaweedFS would compete with customer microVMs for disk I/O bandwidth and, critically, disk capacity. A busy or full cache could slow or fail customer builds, which is exactly the workload we are trying to make faster. Runner density is planned on CPU and RAM; cache grows on disk. The two should scale independently.
- Lifecycle coupling. Runner nodes are cattle: CAPI reimages them (installimage on RAID 1). Stateful cache data on a node that gets reprovisioned is an anti-pattern, forcing constant rebalancing and risking loss beyond what replication covers.
- Blast radius. A SeaweedFS fault (memory leak, disk fill) on a runner node would degrade or down customer jobs on that node.
Why the locality goal still holds. The goal is to keep cache I/O off the public internet, not literally on the same chassis. A dedicated storage node on the same datacenter private network is sub-millisecond away over 10 Gbit, which for bulk tarball transfer (throughput-bound, not latency-bound) is indistinguishable from same-box in practice.
Bare-metal storage nodes vs Cloud VMs, within the dedicated tier. Cache restore sits on the critical path of job start, so storage read throughput matters. Bare-metal NVMe storage nodes saturate the 10 Gbit fabric; Hetzner Cloud Volumes are network-attached block storage with materially lower per-volume throughput, which can bottleneck large restores. We therefore default to dedicated bare-metal NVMe storage nodes and treat Cloud VMs as the lower-effort fallback when the operational cost of another Robot box outweighs the throughput difference. This mirrors the Scaleway decision to use a dedicated instance rather than the compute hosts.
Each cluster runs master, volume, filer, and S3 gateway components and holds a single bucket, gha-cache, with keys prefixed per account. Objects are written with a 7-day TTL (matching GitHub’s default cache lifetime) so SeaweedFS handles routine expiry. Static S3 credentials are delivered via 1Password and External Secrets Operator.
The two clusters are fully independent. A Scaleway incident cannot affect Linux cache, and vice versa. Cache loss is degraded performance, not data loss, so replication factor is tuned modestly per fleet.
Why two stores and not one shared store. The point of self-hosting is LAN-speed proximity. A shared store would force one fleet to pull cache across the internet, defeating the purpose. The split loses nothing, because actions/cache keys conventionally embed ${{ runner.os }} and the content is OS-specific, so a macOS job and a Linux job essentially never resolve to the same entry even on GitHub’s own backend. The dispatch path already knows the OS at JIT-mint time, so it simply points each fleet at its local stack.
Expected storage cost
The two fleets have very different cost profiles, because Hetzner bare metal is cheap and Scaleway compute is not. Figures below are approximate monthly euros as of mid-2026 (Hetzner raised prices on 2026-04-01 and Scaleway updated on 2026-06-01, so confirm before committing). The cache gateway is a small service that co-locates on the storage node or existing cluster capacity, so its compute cost is negligible and is not broken out.
Hetzner (Linux fleet): self-host on bare metal, recommended.
| Option | Spec | Approx. EUR/mo |
|---|---|---|
| 1 dedicated NVMe node | AX41 / AX42-class, ~1-2 TB local NVMe included | ~55-70 |
| 2 nodes (replication) | as above, times two | ~110-140 |
| Cloud fallback, per node | CCX23 (~31.49) + 1 TB Cloud Volume (0.057/GB, ~57) | ~88 |
Local NVMe is included in the dedicated price, so the bare-metal option is both faster and cheaper than the Cloud VM plus Cloud Volume fallback. That is the dominant reason to prefer bare metal here.
Scaleway (macOS fleet): self-host on an instance plus block storage.
Product: Scaleway Instances (Production-Optimized PRO2 line) plus Scaleway Block Storage (15,000 IOPS SSD), in the same region as the Apple Silicon Mac minis.
| Component | Spec | Approx. EUR/mo |
|---|---|---|
| Instance | PRO2-XS (4 vCPU, 16 GiB) | ~120 |
| Block Storage | 1 TB at 0.086/GB | ~86 |
| One node + 1 TB | | ~206 |
| Two nodes (replication) | | ~410 |
A smaller Cost-Optimized instance lowers the compute line if SeaweedFS fits in less RAM.
A cost flag for the Scaleway side. Self-hosting on Scaleway runs roughly 3-4x the Hetzner cost, because Scaleway compute and block storage are both pricier. Scaleway’s managed Object Storage (S3-compatible, in the same region so locality still holds) is far cheaper, on the order of 0.0146/GB/month with no compute, so roughly 15-40 EUR/mo for the same working set. The self-host-everywhere stance was a deliberate choice, but on the macOS fleet specifically the locality argument for self-hosting is weak (managed Object Storage is in-region too) and the cost gap is large. The gateway code is identical either way; only the S3 endpoint and credentials change. See Open questions.
cache-gateway: the Azure-Blob to S3 translator
A Go service (Go matches every other infra control-plane and data-plane component: runners-controller, tart-kubelet, hetzner-robot-controller), one instance per fleet, co-located with that fleet’s SeaweedFS. It exposes two surfaces.
Coordination (Twirp). POST /twirp/github.actions.results.api.v1.CacheService/<Method>, JSON first. Three methods matter:
| Method | Request | Gateway response |
|---|---|---|
| CreateCacheEntry | { key, version } | { ok, signed_upload_url } |
| FinalizeCacheEntry | { key, version, size_bytes } | { ok, entry_id } |
| GetCacheEntryDownloadURL | { key, restore_keys, version } | { ok, matched_key, signed_download_url } |
The gateway validates the tenant cache token, derives the prefix gha-cache/<account_id>/<repo>/<version>/<key>, and mints signed URLs that point at its own blob surface, never at an Azure or raw-S3 hostname. Once coordination is terminated by us, the entire blob path is on infrastructure we control.
Blob transfer (Azure Block Blob subset). Served on the gateway’s own hostname with a real certificate and HMAC-URL auth. The subset of Azure operations the action actually performs, and their S3 mapping:
| Azure Block Blob op (what the action sends) | When | SeaweedFS S3 op |
|---|---|---|
| Put Blob (single-shot, at or below 128 MB) | small upload | PutObject |
| Put Block (comp=block&blockid=...) | each block of a large upload | CreateMultipartUpload (lazy, first block) + UploadPart |
| Put Block List (comp=blocklist) | commit a large upload | CompleteMultipartUpload |
| Get Blob Properties (HEAD) | size before download | HeadObject |
| Get Blob + Range | ranged download | GetObject + Range |
State strategy, mostly database-free:
- Entry index is SeaweedFS itself. Keys are deterministic, so an exact match is a HeadObject and a restore_keys prefix match is ListObjectsV2 with a prefix, newest LastModified wins. No separate entry database.
- In-flight multipart state is in-memory, keyed by blob path: { azure_blockid -> (uploadId, partNumber, ETag) }. The gateway lazily issues CreateMultipartUpload on the first Put Block; on Put Block List it reads the ordered block-ID list from the XML body and issues CompleteMultipartUpload with parts in that order. We run one gateway instance per fleet to start; HA via sticky routing or shared (Redis) state is deferred.
- Eviction is the SeaweedFS 7-day TTL plus a periodic per-tenant size-cap sweeper.
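A minimal sketch of the in-memory multipart state, under stated assumptions: the tracker, upload, and part types are hypothetical, and how a part number is chosen for each block is left to the caller. That choice matters because blocks can arrive out of order while S3 expects the committed parts in ascending part-number order, so the sketch rejects a block list that would violate it instead of silently producing a wrong object.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// part is what S3 needs back at CompleteMultipartUpload time.
type part struct {
	Number int32
	ETag   string
}

// upload is the in-flight state for one blob path.
type upload struct {
	id    string          // S3 uploadId
	parts map[string]part // Azure block ID -> uploaded S3 part
}

// tracker holds in-flight uploads in memory, keyed by blob path.
type tracker struct {
	mu      sync.Mutex
	uploads map[string]*upload
}

func newTracker() *tracker { return &tracker{uploads: map[string]*upload{}} }

// begin returns the uploadId for path, invoking create only on the first
// Put Block. The lock is held across create for simplicity.
func (t *tracker) begin(path string, create func() (string, error)) (string, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if u, ok := t.uploads[path]; ok {
		return u.id, nil
	}
	id, err := create()
	if err != nil {
		return "", err
	}
	t.uploads[path] = &upload{id: id, parts: map[string]part{}}
	return id, nil
}

// record stores the result of one UploadPart under its Azure block ID.
func (t *tracker) record(path, blockID string, p part) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	u, ok := t.uploads[path]
	if !ok {
		return errors.New("no upload in flight for " + path)
	}
	u.parts[blockID] = p
	return nil
}

// commit resolves the ordered block-ID list from Put Block List into S3 parts
// and forgets the upload. It rejects unknown blocks and non-ascending parts.
func (t *tracker) commit(path string, blockIDs []string) (string, []part, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	u, ok := t.uploads[path]
	if !ok {
		return "", nil, errors.New("no upload in flight for " + path)
	}
	parts := make([]part, 0, len(blockIDs))
	for _, id := range blockIDs {
		p, ok := u.parts[id]
		if !ok {
			return "", nil, fmt.Errorf("block %q was never uploaded", id)
		}
		if n := len(parts); n > 0 && p.Number <= parts[n-1].Number {
			return "", nil, fmt.Errorf("block %q breaks ascending part order", id)
		}
		parts = append(parts, p)
	}
	delete(t.uploads, path)
	return u.id, parts, nil
}

func main() {
	t := newTracker()
	id, _ := t.begin("acct/repo/v1/key", func() (string, error) { return "upload-1", nil })
	_ = t.record("acct/repo/v1/key", "YjI=", part{Number: 2, ETag: "etag-2"}) // arrives first
	_ = t.record("acct/repo/v1/key", "YjE=", part{Number: 1, ETag: "etag-1"})
	_, parts, err := t.commit("acct/repo/v1/key", []string{"YjE=", "YjI="})
	fmt.Println(id, parts, err)
}
```

Because this state lives in one process, a gateway restart mid-upload loses it and the upload fails; for a cache that is a miss on the next run, which fits the single-instance starting point described above.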
Two sharp edges, covered by tests from day one:
- Sign the path, not the full query. The Azure SDK appends &comp=block&blockid=... to the signed URL we handed it. The gateway’s HMAC must cover a canonical subset (object path plus our own sig and exp params) and ignore the SDK-added params, or every block PUT fails signature validation. The signed URL itself is the only auth on the blob path, the SAS-URL pattern, so no bearer token is needed there.
- S3 multipart minimum part size is 5 MB (except the last part). Fine at the default block size, but we assert it so a customer-tuned uploadChunkSize below 5 MB degrades gracefully (buffer or single-shot) instead of failing CompleteMultipartUpload.
Throughput note. The Azure SDK skips parallel-block optimizations when the host does not look Azure-like (Blacksmith’s finding). We either give the gateway blob surface an Azure-shaped hostname or accept serialized transfers. This is a tuning detail; both are functionally correct.
runner-cache-proxy: host-side selective interception
We cannot use environment-variable override (see Alternatives and Prior Art): the runner re-injects the cache URLs per job, and ACTIONS_RUNTIME_TOKEN is multiplexed with other subsystems. Transparent operation therefore requires network interception, as Blacksmith found.
The proxy runs on the host (Mac mini or Hetzner node), never inside the customer guest, so the CA private key and the tenant token never touch customer-controlled ground.
- pf (macOS) or nftables (Linux) DNAT all guest :443 traffic to the proxy.
- The proxy peeks the ClientHello SNI without decrypting:
  - SNI matches GitHub Actions cache hostnames: MITM with an on-the-fly certificate signed by a baked CA, then route by path. .../CacheService/* goes to cache-gateway (with token swap, below). Everything else (.../ArtifactService/*, OIDC, telemetry) is forwarded to genuine GitHub with the original token untouched. This selective routing is essential: actions/upload-artifact@v4 shares the same ACTIONS_RESULTS_URL host under a different service name.
  - Any other SNI: blind TCP splice, no decryption. MITM is bounded to GitHub’s cache plane only, which keeps unrelated customer traffic private and avoids breaking cert-pinned connections.
- Token swap without trusting the guest. tart-kubelet already stages env and the SA token per VM on the host; it also stages the dispatch-minted cache token there. The proxy maps the connection source IP (the guest’s NAT address) to that guest’s cache token and injects it as the bearer to the gateway. The guest never holds a cache secret.
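The routing decision described in the list above can be sketched as a pure function. Everything named here is hypothetical: the router type, the placeholder hostname results.example.invalid, and the choice to forward to GitHub when no token is staged for the source IP. The sketch only shows the order of checks, SNI first, then path, then token lookup.

```go
package main

import (
	"fmt"
	"strings"
)

type action string

const (
	splice          action = "splice"            // blind TCP splice, no decryption
	forwardToGitHub action = "forward-to-github" // MITM'd host, original token untouched
	routeToGateway  action = "route-to-gateway"  // CacheService call, token swapped host-side
)

const cacheServicePrefix = "/twirp/github.actions.results.api.v1.CacheService/"

type router struct {
	cacheHosts   map[string]bool   // SNI allowlist (placeholder values in main)
	knownMethods map[string]bool   // CacheService methods the gateway implements
	tokens       map[string]string // guest source IP -> staged cache token
}

// shouldMITM is decided from the ClientHello SNI alone, before any decryption.
func (r *router) shouldMITM(sni string) bool {
	return r.cacheHosts[strings.ToLower(sni)]
}

// decide returns the action for a request and, for gateway routes, the token.
// path is only meaningful once the connection has been MITM'd.
func (r *router) decide(sni, path, guestIP string) (action, string) {
	if !r.shouldMITM(sni) {
		return splice, ""
	}
	if !strings.HasPrefix(path, cacheServicePrefix) {
		return forwardToGitHub, ""
	}
	method := strings.TrimPrefix(path, cacheServicePrefix)
	token, ok := r.tokens[guestIP]
	if !r.knownMethods[method] || !ok {
		return forwardToGitHub, "" // unknown shape or no staged token: fail open
	}
	return routeToGateway, token
}

func main() {
	r := &router{
		cacheHosts:   map[string]bool{"results.example.invalid": true},
		knownMethods: map[string]bool{"CreateCacheEntry": true, "FinalizeCacheEntry": true, "GetCacheEntryDownloadURL": true},
		tokens:       map[string]string{"192.168.64.7": "cache-token-for-guest-7"},
	}
	fmt.Println(r.decide("results.example.invalid", cacheServicePrefix+"CreateCacheEntry", "192.168.64.7"))
	fmt.Println(r.decide("results.example.invalid", "/twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact", "192.168.64.7"))
	fmt.Println(r.decide("registry.npmjs.org", "", "192.168.64.7"))
}
```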
Only the CA public certificate is baked into the runner images’ trust stores (infra/runner-image, infra/linux-runner-image). The CA private key, the proxy binary, and the pf/nftables rules ship via host bootstrap (infra/macos-host-bootstrap and the Linux node KubeadmConfigTemplate).
Tenant scoping and the cache token
Tenant isolation is enforced in depth, by independent layers, so that no single failure (including full compromise of a customer guest) breaches another tenant’s cache. We treat the guest as adversarial: on macOS the entire Tart VM is the customer’s environment with root, so the customer can inspect or bypass anything running in the guest. The design does not depend on them not doing so.
The independent layers, outer to inner:
1. Compute isolation. Each job runs in its own kata-fc microVM or Tart VM. There is no shared process or filesystem surface between tenants; the only shared resource is the cache store, reached solely through the gateway.
2. Storage is not guest-routable. SeaweedFS lives on the private storage network and is never exposed to the guest. A guest can address only the gateway’s narrow protocol surface, never the object store directly.
3. Host-side token custody. The cache token is staged host-side and injected by the host proxy based on a host-controlled source-IP to guest mapping. The token is never present inside the guest, and a guest cannot make the proxy attach a different tenant’s token (it cannot forge another guest’s NAT source identity). A fully compromised guest still cannot obtain a credential for another account.
4. Cryptographic, least-privilege token scoping. The token is asymmetrically signed (the server holds the private key; the gateway only verifies, so the data plane holds no signing material), scoped to { account_id, repo, fleet }, and short-lived. The gateway derives the SeaweedFS key prefix purely from the verified claims, never from request input, so there is no request parameter that lets a valid token for one account address another account’s prefix.
5. Path-scoped, expiring blob URLs. Signed blob URLs embed the already-resolved object path plus an HMAC and an expiry. The blob surface validates only the HMAC and never re-derives tenancy from client input, so even a leaked URL grants access to a single object for a bounded window, not to a tenant’s namespace.
Bypassing the host proxy is therefore not a privilege escalation: it costs the customer their own cache and grants nothing, because layers 1, 2, 4, and 5 hold without it. Network interception is a transparency and performance mechanism, not the tenant-isolation boundary. This is deliberate: the customer-controlled guest is an adversarial boundary, so isolation is layered rather than resting on any single control.
At dispatch (Tuist.Runners.dispatch_for_sa / mint_jit), the server already knows account_id, repo, and the fleet, and mints the token described in layer 4 alongside the fleet’s cache_gateway_url. Host agents (tart-kubelet, runners-controller) stage it host-side per layer 3.
Server and host-agent changes
- Tuist.Runners.dispatch_for_sa / mint_jit: add cache_token and cache_gateway_url to the dispatch response.
- Host agents (tart-kubelet, runners-controller) stage the cache token host-side per guest, alongside the existing env and SA-token staging.
- The gateway is configured with the verification public key. Nothing else changes server-side.
Protocol drift: failing open and observability
GitHub controls the cache protocol and can change it without notice (a new actions/cache version, a new Twirp service version, a changed blob behavior). The system is designed so that any such change degrades to cache misses, never to broken customer workflows, and so that we detect the drift immediately. This is safe to do aggressively precisely because interception is a performance mechanism and not the tenant-isolation boundary (see Tenant scoping): pass-through routes to GitHub’s own per-repo-isolated cache.
Failing open to GitHub. The interception layer is allowlist-based and treats anything it does not explicitly recognize as pass-through to genuine GitHub:
- The runner-cache-proxy only MITMs SNIs on its cache-hostname allowlist and only redirects paths matching the known CacheService methods. An unrecognized hostname, service version, method, or unparseable request is blind-spliced or forwarded straight to GitHub, so the customer transparently gets GitHub’s hosted cache. Protocol drift is therefore fail-open by construction: an unknown shape routes to GitHub automatically.
- The decision is made at the coordination call and is sticky for that entry: if we forward CreateCacheEntry / GetCacheEntryDownloadURL to GitHub, GitHub returns its own Azure blob URLs and the subsequent blob traffic passes through automatically. We never half-intercept an entry.
- A health-gated circuit breaker trips the proxy to full pass-through when the gateway or SeaweedFS is unhealthy, unreachable, or slow (strict per-call timeouts; on timeout we pass through). Infra faults degrade the same way protocol drift does.
The degraded state is the one we can tolerate: recent entries we had been serving from SeaweedFS are cold on GitHub, so restores miss until GitHub re-warms, but every workflow keeps running. actions/cache reinforces this, since cache restore failures are warnings rather than job failures by default; our job is only to ensure we never turn a cache interaction into something worse than a miss (a TLS failure, a connection refused, a hang).
Detecting drift immediately. The gateway and proxy emit Prometheus metrics into the same prom_ex / Grafana / Grafana IRM stack the runners feature already uses:
- Pass-through-fallback rate as a first-class SLI. In steady state effectively all cache traffic is handled by us, so a rising fallback rate is the single strongest signal of either protocol drift or gateway ill health. Alert on any sustained nonzero rate of unknown-protocol-shape fallbacks, and on fallback rate crossing a low threshold.
- Protocol-shape counters tagged by service name, method, detected version, and schema-validation result, so an alert points directly at what changed. Unrecognized shapes are logged (sanitized) for fast diagnosis.
- Per-fleet cache SLIs: coordination success and error, blob upload and download success and error, Azure-to-S3 translation errors, and cache hit rate. A hit-rate drop or 5xx spike is a secondary signal.
- A synthetic canary: a scheduled job runs a real actions/cache save-and-restore through the gateway end to end (the same controlled-runner harness from rollout step 2), pinned to the latest actions/cache, and alerts on failure. This exercises the real action against our gateway continuously, catching an action update or GitHub protocol change as it ships, typically before customers hit it.
- Release watch on actions/cache and actions/toolkit: a new release triggers re-validation of the gateway against it, since protocol changes usually arrive via a new action version first.
Helm wiring
- A cacheGatewayFleet[] block per fleet in infra/helm/tuist: image digest pin, SeaweedFS S3 endpoint and credentials secret ref, verification public key.
- SeaweedFS via its upstream chart (or a thin wrapper), one release per fleet.
- Runner-image digest bumps flow through the existing release pipelines. The CA certificate and proxy ship via host bootstrap.
Scope
In scope:
- GitHub Actions cache v2 protocol (current and forward-looking).
- macOS (Scaleway) and Linux (Hetzner) fleets, each with an independent SeaweedFS cluster.
- Transparent operation: no customer workflow changes.
Out of scope for the first iteration:
- Cache v1 protocol. Added only if telemetry shows old actions/cache pins in customer workflows.
- Gateway HA (multi-replica with shared multipart state).
- Cross-fleet or global cache deduplication.
- A customer-facing cache management UI or API.
Trade-offs
Advantages
- Cache I/O stays on the fleet’s private network at LAN speed, which is the entire point of running jobs on dedicated hardware.
- Zero customer-facing change. Unmodified actions/cache keeps working.
- We own the storage and can tune durability, retention, and capacity per fleet.
- The interception surface is bounded to GitHub’s cache plane, so other traffic (artifacts, OIDC) is untouched and private.
- Secrets (CA key, tenant token) stay host-side, off the customer-controlled guest.
Disadvantages
- We take on a new data-plane service (cache-gateway) and a host-side MITM proxy, plus SeaweedFS as new stateful infrastructure to operate, on call, in two datacenters.
- The Azure-Block-Blob-to-S3 translation is non-trivial: stateful block-ID to multipart-part mapping and the sign-the-path-not-the-query detail are easy to get subtly wrong.
- MITM of GitHub’s cache hostnames requires a baked CA in the runner images. This is acceptable because we own the images, but it is a trust decision worth stating plainly.
- Per-account cache quota and metrics become per-fleet because of the macOS/Linux split. A unified account view requires extra roll-up.
- We are coupled to GitHub’s cache protocol. If GitHub changes the Twirp service or the blob protocol, the gateway must follow. The interception layer is designed to fail open to GitHub’s hosted cache so a change degrades to cache misses rather than broken workflows, and a synthetic canary plus pass-through-rate alerting catch drift early (see Protocol drift: failing open and observability).
Alternatives considered
Environment-variable override instead of network interception
Set ACTIONS_RESULTS_URL, ACTIONS_CACHE_SERVICE_V2, and ACTIONS_RUNTIME_TOKEN in the job environment to point actions/cache at our gateway. Rejected. The runner re-injects these per job from the job message, so a pre-staged value is overwritten (config.ts reads them straight from process.env). Even if the URL stuck, ACTIONS_RUNTIME_TOKEN is multiplexed with artifacts and OIDC (toolkit issue #1051), so repointing it would break other subsystems. Interception is the only transparent option.
Swap actions/cache for an S3-native action
Have customers use an action like tespkg/actions-cache that targets S3 directly. Rejected. It requires every customer to change their workflow, which violates the zero-change goal and would not transparently cover existing pipelines.
Managed S3 (Tigris or similar) instead of self-hosted
Use our existing managed object storage rather than running SeaweedFS. Rejected for this use case. The whole point is LAN-speed locality next to the runner; a managed bucket reintroduces a network hop and egress we cannot keep on a private network. Self-hosting co-located storage is the requirement, not an implementation detail.
MinIO as the storage backend
Historically the obvious S3-compatible self-hosted choice. Rejected. By early 2026 MinIO had stripped its community-edition console (May 2025), and the upstream repository had been marked no-longer-maintained and then archived (February 2026). It is not a safe foundation for a new deployment. SeaweedFS (Apache 2.0, mature since 2015, fast on both large and small objects, rack-aware replication) is the chosen backend. Garage (AGPL, Rust, simple multi-node) was the runner-up and remains a fallback.
One shared store across both fleets
A single SeaweedFS cluster serving both macOS and Linux. Rejected. It forces one fleet to cross the internet for cache, defeating the locality goal, and provides no real benefit because cache entries are already OS-scoped by key convention.
In-guest proxy instead of host-side
Run the interception proxy inside the runner VM (closer to Blacksmith’s nginx-in-VM). Rejected as the primary design. It would place the CA private key and tenant token inside the customer-controlled guest. Host-side termination keeps both secrets off customer ground while reusing the existing tart-kubelet host-proxy pattern and the existing nftables substrate.
Relationship to sticky cache volumes
A natural follow-on question is whether we should also offer per-repo sticky cache volumes, exemplified by Blacksmith’s sticky disks: a persistent block volume attached to the runner that carries arbitrary on-disk state across runs (Docker layer cache, incremental compilation state, large dependency trees). This is complementary, not an alternative. It is not a different backing store for the GHA cache and replaces nothing in this RFC; it is an additive feature that would sit alongside it.
Blacksmith itself splits the two the same way: its Cache product backs actions/cache with object storage (MinIO / S3), exactly as proposed here, while its sticky disks are a separate product whose canonical use is Docker layer caching. The two coexist on a single job: actions/cache traffic goes to our S3-backed gateway, and a sticky volume could be mounted at, say, the Docker data root for state that actions/cache models poorly. We are deferring the volume piece deliberately, and would treat it as its own RFC.
How it differs. The GHA cache in this RFC is protocol-level and content-addressed: it intercepts actions/cache, stores keyed tarballs in shared S3 namespaced by token, and never attaches anything to the VM. A sticky volume is block-level: the job sees a mounted filesystem that persists as a per-repo image, capturing whatever the build leaves on disk, not only what the customer explicitly wired into actions/cache. It needs volume lifecycle management (provision, attach, snapshot, garbage-collect, quota) and, critically, concurrency control, because a read-write block device cannot be safely shared between two jobs of the same repo running at once.
Why not now. It conflicts head-on with the shared warm pool and dispatch-time binding the runners architecture is built on. A per-tenant block device must be attached at VM or Pod creation (tart run --disk, a CSI PVC), but at that moment the Pod is still identity-less and does not learn its tenant until it claims a job. Supporting sticky volumes would mean either binding identity ahead of dispatch (giving up the shared pool), hot-attaching a device after the claim (poorly supported on Tart today), or cloning a per-repo volume on claim. All three are significant architectural changes. By contrast, the protocol-level cache covers the most common and highest-value case (actions/cache) transparently, with no change to dispatch and no per-tenant attach, so it is the right first step, and it stands on its own regardless of whether we later add volumes.
Very rough future shape. On Linux (kata-fc), virtio block hot-plug after the claim is the most plausible path: keep a per-repo volume on a block-snapshot store, and on dispatch attach a copy-on-write snapshot to the microVM, reconciling on job completion (newest-wins, or per-branch volumes). On macOS (Tart), attach a per-repo disk image at launch via --disk, which likely requires a per-tenant warm sub-pool bound slightly ahead of dispatch, since Tart hot-plug is weak. Either way the substrate is block storage with copy-on-write snapshots (local NVMe or Cloud Volume snapshots), not the SeaweedFS object tier this RFC introduces, and the central new problem to solve is per-repo concurrency (a single-writer lock, or a COW snapshot per job with a merge or last-writer policy).
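To make the per-repo concurrency problem concrete, here is a minimal sketch of the "COW snapshot per job with a last-writer policy" option. Everything in it is hypothetical: `VolumeStore`, `checkout`, and `commit` are illustrative names, and an in-memory dictionary stands in for whatever block-snapshot store would actually be used.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict


@dataclass
class VolumeStore:
    """In-memory stand-in for a block-snapshot store (assumption, not a real API)."""

    # repo -> id of the snapshot that new jobs clone from
    base: Dict[str, str] = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def checkout(self, repo: str) -> str:
        """Hand the job its own copy-on-write clone of the current base.

        Never blocks: concurrent jobs of the same repo each get a distinct
        clone, so no read-write device is shared between them.
        """
        base = self.base.setdefault(repo, f"{repo}@empty")
        return f"{base}+job{next(self._ids)}"

    def commit(self, repo: str, job_snapshot: str) -> None:
        """Last-writer-wins: the job that finishes last becomes the new base."""
        self.base[repo] = job_snapshot


store = VolumeStore()
job_a = store.checkout("org/app")  # both jobs start from the same base
job_b = store.checkout("org/app")
store.commit("org/app", job_a)
store.commit("org/app", job_b)     # overwrites job A's result
```

The trade-off is visible in the last two lines: job A's on-disk state is discarded when job B commits. A single-writer lock would avoid that loss at the cost of serializing jobs, and per-branch volumes would narrow the set of jobs that can collide.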
Rollout plan
The rollout is staged so that each layer is validated before the next depends on it.
- SeaweedFS up per fleet, validated via raw S3 access. No runner integration yet.
- Gateway correctness, decoupled from interception. Point a real actions/cache v2 run at the gateway from a controlled internal runner where we set the env directly (nothing overwrites it there). This proves the Azure-to-S3 translation end-to-end before any MITM exists.
- Host proxy (SNI routing, MITM, token swap) on one canary host. Bake the CA, wire pf/nftables.
- Dispatch token minting and host-side staging.
- Canary one fleet, measure cache hit rate and blob throughput against GitHub’s hosted cache, then roll the second fleet.
Each environment progression (staging, canary, production) gates on cache correctness and a throughput comparison.
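The gate between environments could be expressed as a single predicate over measured results. The sketch below is an assumption about how that might look, not a defined process: `StageResult`, `may_promote`, and the `min_speedup` threshold are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageResult:
    restored_ok: int      # restores whose bytes matched what was saved
    restored_total: int   # restores attempted in this stage
    gateway_mbps: float   # blob throughput measured through the gateway
    hosted_mbps: float    # same workload against GitHub's hosted cache


def may_promote(result: StageResult, min_speedup: float = 1.0) -> bool:
    """Promote only if every restore was correct and throughput did not regress."""
    if result.restored_total == 0 or result.hosted_mbps <= 0:
        return False  # no evidence collected, so do not promote
    correct = result.restored_ok == result.restored_total
    fast_enough = result.gateway_mbps / result.hosted_mbps >= min_speedup
    return correct and fast_enough
```

Correctness is treated as all-or-nothing here because a single corrupted restore is worse than a cache miss; the throughput bar is the part that would need tuning per fleet.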
Future work
- Cache v1 protocol support if telemetry warrants it.
- Gateway HA: multi-replica with sticky routing or Redis-backed in-flight multipart state.
- Per-account quota and metrics roll-up across both fleets for a unified customer view.
- Cache analytics surfaced to customers (hit rate, size, savings).
- Evaluate Kura (our Rust distributed cache mesh) as a cross-node backing layer for warm cache beyond a single node, if node-local stickiness proves insufficient.
- Per-repo sticky cache volumes for workloads that re-tarball large on-disk state every run (Docker layer cache, incremental builds). See Relationship to sticky cache volumes for how they complement this proposal and for a rough future shape; they warrant their own RFC.
Open questions
- Should per-fleet cache quota be exposed to customers as-is, or rolled up into an account-wide number from day one?
- Do we make the gateway blob hostname Azure-shaped to keep the SDK’s parallel-block path, or accept serialized transfers initially and revisit on measured throughput?
- What is the right per-tenant size cap and eviction policy beyond the 7-day TTL, and should it scale with plan?
- On the macOS fleet, do we self-host SeaweedFS on a Scaleway instance plus Block Storage, or point the gateway at Scaleway’s managed Object Storage? The latter is materially cheaper and still in-region, and the gateway is identical either way. See Expected storage cost.
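To make the size-cap and eviction question concrete, one possible policy is sketched below: expire entries past the TTL first, then evict least-recently-used entries until the tenant is back under its cap. This is an illustration only. It assumes the 7-day TTL is measured from last access, and `Entry`, `select_evictions`, and `cap_bytes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL, assumed to count from last access


@dataclass
class Entry:
    key: str
    size: int           # bytes
    last_access: float  # unix seconds


def select_evictions(entries: List[Entry], cap_bytes: int, now: float) -> List[str]:
    """Return the keys to delete for one tenant: expired first, then LRU."""
    expired = [e for e in entries if now - e.last_access > TTL_SECONDS]
    live = [e for e in entries if now - e.last_access <= TTL_SECONDS]
    evict = [e.key for e in expired]

    # Walk live entries from least to most recently used until under the cap.
    total = sum(e.size for e in live)
    for e in sorted(live, key=lambda e: e.last_access):
        if total <= cap_bytes:
            break
        evict.append(e.key)
        total -= e.size
    return evict


DAY = 24 * 3600
now = 10 * DAY
entries = [
    Entry("old", 100, now - 8 * DAY),  # past the TTL
    Entry("a", 60, now - 3 * DAY),
    Entry("b", 60, now - 1 * DAY),
]
print(select_evictions(entries, cap_bytes=100, now=now))  # ['old', 'a']
```

Scaling with plan would then only mean choosing `cap_bytes` per tenant; the selection logic stays the same.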
References
- Blacksmith cache architecture: Reverse engineering GitHub Actions cache to make it fast (Blacksmith)
- actions/toolkit cache internals: packages/cache/src/internal/{config,uploadUtils,downloadUtils}.ts and shared/cacheTwirpClient.ts
- actions/toolkit issue #1051 (non-GitHub-hosted caching): Add support for non-GitHub-hosted caching for self-hosted runners
- SeaweedFS: seaweedfs/seaweedfs (GitHub)
- Tuist Runners architecture (internal): server/lib/tuist/runners.ex, infra/runners-controller, infra/tart-kubelet, infra/runner-image, infra/linux-runner-image