Summary
This RFC proposes a per-customer Prometheus-compatible /metrics endpoint that
customers can scrape to pull live build, test, cache, and CLI command
metrics out of Tuist, and a Grafana app plugin that bundles a datasource, a
guided config page, and curated dashboards covering build performance, test
performance, test reliability, cache effectiveness, and CLI usage across
both Xcode and Gradle projects.
The scrape path is fed by :telemetry events emitted on build, test, and
cache lifecycle transitions, aggregated in-memory per account. ClickHouse stays
authoritative for the in-app historical views; the Prometheus stream is the
live sampled feed customers run through their own observability stack.
Motivation
Tuist already computes rich analytics (build durations, test durations, cache
hit rates, flakiness) and surfaces them in the dashboard at tuist.dev. For many
teams that is enough. For teams with a centralized observability practice, it
is not:
- One pane of glass. Platform teams want build and test health on the
same Grafana board as deploy health, API latency, and error rates. Today
they cannot.
- Alerting. Customers want to alert on build-time regressions, cache-hit
drops, and flakiness spikes using their existing Prometheus alert manager
and on-call rotation, not a second notification surface.
- Customer-controlled retention and correlation. Tuist retains its own
metrics for a long time, but customers often want build and test signals
living next to their other operational data in a TSDB they control, with
retention policies they set, so they can join build health against deploy
events, incident timelines, or capacity plans without crossing system
boundaries.
- Data residency. Enterprise customers occasionally need build health
signals to stay inside their own infrastructure. Pulling via scrape lets
them do that.
The ask in its simplest form: give us a URL and a token, we will take it from
there. A Prometheus scrape endpoint plus a Grafana integration is the smallest
surface that satisfies all four.
Prior Art
Fly.io per-app Prometheus
Fly exposes a token-authenticated scrape URL per application. One target per
resource, HMAC-style bearer token, Prometheus text format. The customer drops
it into their scrape config and is done. We adopt the token-per-tenant model
directly: no per-scrape query parameters, no tenant header, just Authorization: Bearer <account-token> resolving to one account.
Grafana Mimir multi-tenant exposition
Mimir partitions data via the X-Scope-OrgID request header. The tenancy
signal is out-of-band from query parameters and orthogonal to labels. We apply
the same principle: the authenticated subject is the tenancy signal, and
account_id never appears as a label in the exposed output. Customers see
their data as if it were single-tenant.
telemetry_metrics_prometheus_core
The idiomatic Elixir library for turning :telemetry events into Prometheus
exposition. It is ETS-backed under the hood with a global registry. We reuse
the telemetry vocabulary (counter, sum, distribution) and the OpenMetrics
output shape but own the registry so we can index it by account_id cheaply.
Grafana App Plugin model
Official integrations for Kubernetes, GitHub, MongoDB, and others ship as
Grafana app plugins. The plugin bundles:
- A datasource (usually a thin wrapper around Prometheus)
- A config page that collects credentials and emits a scrape-target snippet
- A set of curated dashboards
We follow the same structure. The dashboards are the headline feature, the
datasource is a wrapper around Prometheus-compatible storage, and the config
page writes the scrape config for Grafana Agent / Alloy users.
Proposal
Endpoint
GET /api/accounts/:account_handle/metrics
Response body is OpenMetrics text exposition (Content-Type:
application/openmetrics-text; version=1.0.0; charset=utf-8). The endpoint
returns a snapshot of the in-memory aggregator at scrape time. Prometheus
records the evolution across scrapes on its side.
Per-project slicing is exposed as a project label on every metric, not as a
separate endpoint. This keeps the customer’s scrape config to a single target
per Tuist account. Label cardinality is bounded by the number of projects in
the account.
Authentication
| Property | Value |
|---|---|
| Header | Authorization: Bearer <account_token> |
| Token type | Account token only (project tokens are not accepted) |
| Required scope | metrics:read |
| Authorization action | :metrics_read on the account |
| Rate limit key | metrics:auth:account:{account_id}, reusing TuistWeb.RateLimit.InMemory |
| Rate limit shape | 1 request per 10 seconds per account. Aligns with a 15s Prometheus scrape interval with headroom for retries. |
A new metrics:read scope is added to the existing fine-grained account-token
model at account_tokens_controller.ex.
Customers create a token with that scope, drop it into their scrape config,
and scrape.
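The rate-limit shape above (one request per 10 seconds per account) can be sketched as a fixed-window limiter. This is an illustrative Python analogue only; the real implementation reuses TuistWeb.RateLimit.InMemory, and the class and method names here are hypothetical:

```python
import time

class ScrapeRateLimiter:
    """Allows one scrape per window (10s) per account, mirroring the
    metrics:auth:account:{account_id} key shape described above."""

    def __init__(self, window_seconds=10, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_allowed = {}  # account_id -> timestamp of last allowed scrape

    def allow(self, account_id):
        now = self.clock()
        last = self.last_allowed.get(account_id)
        if last is not None and now - last < self.window:
            return False  # caller would respond 429 Too Many Requests
        self.last_allowed[account_id] = now
        return True
```

A 10s window under a recommended 15s scrape interval leaves headroom for one retry per interval without letting two Prometheus replicas double the load.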
Exposition format
We emit OpenMetrics text. It is a CNCF-incubated standardization of the
Prometheus text format; both formats coexist and are negotiated via the
Accept header. OpenMetrics adds stricter typing, UNIT metadata, _created
timestamps, and exemplars, which we want to keep open as a future option (see
Open Questions). Prometheus and every Prometheus-compatible TSDB accept it
out of the box, and we fall back to the classic text/plain; version=0.0.4
format when the client does not advertise OpenMetrics in its Accept.
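The Accept negotiation described above reduces to a small check. A minimal sketch (hypothetical helper; the actual Phoenix controller logic may differ):

```python
OPENMETRICS = "application/openmetrics-text; version=1.0.0; charset=utf-8"
CLASSIC = "text/plain; version=0.0.4"

def negotiate_content_type(accept_header):
    """Serve OpenMetrics when the scraper advertises it in Accept,
    otherwise fall back to the classic Prometheus text format."""
    if accept_header and "application/openmetrics-text" in accept_header:
        return OPENMETRICS
    return CLASSIC
```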
Every metric is exported with a project label. Additional labels are kept
small and bounded (see Cardinality below).
# HELP tuist_xcode_build_duration_seconds Duration of an Xcode build run, including all phases
# TYPE tuist_xcode_build_duration_seconds histogram
# UNIT tuist_xcode_build_duration_seconds seconds
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="30"} 12
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="60"} 48
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="120"} 103
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="300"} 140
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="+Inf"} 142
tuist_xcode_build_duration_seconds_sum{project="ios-app",scheme="App",is_ci="true",status="success"} 12894.5
tuist_xcode_build_duration_seconds_count{project="ios-app",scheme="App",is_ci="true",status="success"} 142
# HELP tuist_xcode_build_runs_total Total number of Xcode build runs observed
# TYPE tuist_xcode_build_runs_total counter
tuist_xcode_build_runs_total{project="ios-app",scheme="App",is_ci="true",status="success"} 142
tuist_xcode_build_runs_total{project="ios-app",scheme="App",is_ci="true",status="failure"} 7
# HELP tuist_gradle_build_runs_total Total number of Gradle build runs observed
# TYPE tuist_gradle_build_runs_total counter
tuist_gradle_build_runs_total{project="android-app",module=":app",is_ci="true",status="success"} 88
# HELP tuist_xcode_cache_hits_total Xcode binary cache hits
# TYPE tuist_xcode_cache_hits_total counter
tuist_xcode_cache_hits_total{project="ios-app"} 3412
# HELP tuist_xcode_cache_misses_total Xcode binary cache misses
# TYPE tuist_xcode_cache_misses_total counter
tuist_xcode_cache_misses_total{project="ios-app"} 509
# HELP tuist_module_cache_hits_total Module cache hits
# TYPE tuist_module_cache_hits_total counter
tuist_module_cache_hits_total{project="ios-app"} 780
# HELP tuist_module_cache_misses_total Module cache misses
# TYPE tuist_module_cache_misses_total counter
tuist_module_cache_misses_total{project="ios-app"} 50
# HELP tuist_gradle_cache_hits_total Gradle build cache hits
# TYPE tuist_gradle_cache_hits_total counter
tuist_gradle_cache_hits_total{project="android-app"} 2104
# HELP tuist_gradle_cache_misses_total Gradle build cache misses
# TYPE tuist_gradle_cache_misses_total counter
tuist_gradle_cache_misses_total{project="android-app"} 811
# HELP tuist_command_invocations_total CLI command invocations
# TYPE tuist_command_invocations_total counter
tuist_command_invocations_total{project="ios-app",command="generate",is_ci="false",status="success"} 419
tuist_command_invocations_total{project="ios-app",command="install",is_ci="true",status="success"} 87
tuist_command_invocations_total{project="ios-app",command="cache warm",is_ci="true",status="success"} 142
# EOF
Metric set (initial release)
Metrics fall into three groups:
- tuist_xcode_* for Xcode-specific build, test, and cache metrics.
- tuist_gradle_* for the Gradle equivalents.
- tuist_command_* for CLI command invocations, which are cross-cutting
and apply to both build systems.
The Xcode and Gradle namespaces use their respective systems’ native
vocabularies for labels (scheme for Xcode, module for Gradle) rather than
a forced abstraction. The reconciliation section below explains why.
Xcode metrics:
| Metric | Type | Labels | Source telemetry event |
|---|---|---|---|
| tuist_xcode_build_runs_total | counter | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :build, :completed] |
| tuist_xcode_build_duration_seconds | histogram | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :build, :completed] |
| tuist_xcode_test_runs_total | counter | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :test_run, :completed] |
| tuist_xcode_test_run_duration_seconds | histogram | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :test_run, :completed] |
| tuist_xcode_test_cases_total | counter | project, status (passed/failed/skipped), xcode_version, swift_version | [:tuist, :xcode, :test_case, :completed] |
| tuist_xcode_cache_hits_total | counter | project | [:tuist, :xcode, :cache, :hit] |
| tuist_xcode_cache_misses_total | counter | project | [:tuist, :xcode, :cache, :miss] |
| tuist_module_cache_hits_total | counter | project | [:tuist, :module, :cache, :hit] |
| tuist_module_cache_misses_total | counter | project | [:tuist, :module, :cache, :miss] |
Gradle metrics:
| Metric | Type | Labels | Source telemetry event |
|---|---|---|---|
| tuist_gradle_build_runs_total | counter | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :build, :completed] |
| tuist_gradle_build_duration_seconds | histogram | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :build, :completed] |
| tuist_gradle_test_runs_total | counter | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :test_run, :completed] |
| tuist_gradle_test_run_duration_seconds | histogram | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :test_run, :completed] |
| tuist_gradle_test_cases_total | counter | project, status (passed/failed/skipped), gradle_version, jvm_version | [:tuist, :gradle, :test_case, :completed] |
| tuist_gradle_cache_hits_total | counter | project | [:tuist, :gradle, :cache, :hit] |
| tuist_gradle_cache_misses_total | counter | project | [:tuist, :gradle, :cache, :miss] |
CLI command metrics (cross-cutting):
| Metric | Type | Labels | Source telemetry event |
|---|---|---|---|
| tuist_command_invocations_total | counter | project, command, is_ci, status | [:tuist, :command, :completed] |
| tuist_command_duration_seconds | histogram | project, command, is_ci, status | [:tuist, :command, :completed] |
The command label carries the full command path as a single dot- or
space-separated string: generate, install, build, test,
cache warm, cache clean, registry login, and so on. Subcommands are
kept in the same label rather than split into a separate subcommand
label because they are not semantically independent dimensions; a user
who runs tuist cache warm is not doing a cache with a warm
modifier, they are running the specific cache warm command. Keeping
them flat makes PromQL straightforward:
rate(tuist_command_invocations_total{command="cache warm"}[5m]).
CLI command metrics do not live under either build-system namespace
because commands are the user’s entry point, not an artifact of a
specific build system. tuist generate produces output for Xcode today
but may target other systems in the future; tuist install is
system-agnostic; tuist build delegates to whichever system the project
is configured for. Tracking them separately from build / test metrics
lets customers correlate their own invocation patterns with the
downstream build and test results.
Histogram buckets for durations use a fixed, shared schedule so customers can
aggregate across projects without losing fidelity:
[1, 2, 5, 10, 20, 30, 60, 120, 180, 300, 600, 900, 1200, 1800, 2700, 3600]
This covers 1 second through 60 minutes, which matches the range the dashboard
currently shows for both builds and test runs.
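Bucket assignment against that shared schedule follows Prometheus's cumulative-bucket semantics (every bucket whose upper bound covers the observation is incremented, plus the implicit +Inf bucket). An illustrative sketch, not the production aggregator:

```python
import bisect

# The fixed, shared bucket schedule from above, in seconds.
BUCKETS = [1, 2, 5, 10, 20, 30, 60, 120, 180, 300, 600, 900, 1200, 1800, 2700, 3600]

def observe(histogram, duration_seconds):
    """Record one duration: bump every finite bucket with le >= value,
    the implicit +Inf bucket, and the _sum / _count series."""
    idx = bisect.bisect_left(BUCKETS, duration_seconds)
    for i in range(idx, len(BUCKETS)):
        histogram["buckets"][i] += 1
    histogram["inf"] += 1
    histogram["sum"] += duration_seconds
    histogram["count"] += 1

h = {"buckets": [0] * len(BUCKETS), "inf": 0, "sum": 0.0, "count": 0}
observe(h, 90)    # first finite bucket hit is le="120"
observe(h, 5000)  # beyond 3600s: only +Inf
```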
Reconciling Xcode and Gradle
The namespaces are split per build system, and each namespace uses its
system’s native label vocabulary. Xcode metrics have a scheme label.
Gradle metrics have a module label. There is no generic target label and
no build_system discriminator. The build system is implicit in the metric
name.
This is deliberate. A scheme is not a module; forcing both into one
abstraction surfaces wrong-shaped questions (and wrong-shaped alerts) more
often than it helps. Splitting the namespaces makes every single-system
dashboard unambiguous, lets the two systems’ schemas evolve independently,
and removes the temptation to average metrics across systems that are not
directly comparable.
The tradeoff is cross-system aggregation. Platform teams managing mixed iOS
and Android projects need one extra PromQL step, either a metric-name regex:
sum by (project) ({__name__=~"tuist_(xcode|gradle)_build_runs_total"})
or an explicit union:
sum(rate(tuist_xcode_build_runs_total[5m]))
+ sum(rate(tuist_gradle_build_runs_total[5m]))
Both are common Prometheus idioms. We judge the clarity of single-system
dashboards to be worth more than the one-line savings on cross-system ones.
Labels that do cross namespaces by convention:
| Label | Values | Applies to |
|---|---|---|
| project | customer-supplied project identifier | All metrics |
| is_ci | true / false | Build, test, and command metrics |
| status | success / failure | Build, test-run, and command metrics |
| status (test case) | passed / failed / skipped | Test-case metrics |
Keeping these identical across namespaces is the concession that lets
customers write a regex over metric names and still group the results
meaningfully. CLI command metrics (tuist_command_*) sit outside both
build-system namespaces because invocations are the user’s entry point,
not an artifact of a specific system. They use the same shared labels
(project, is_ci, status) so they aggregate cleanly alongside the
per-system metrics when customers want a “total activity” panel.
System-specific version labels are in from day one.
Xcode metrics carry xcode_version and swift_version. Gradle metrics
carry gradle_version and jvm_version. The values drift slowly (teams
migrate gradually and typically run 2-3 versions concurrently), so
cardinality stays bounded, and these are exactly the labels customers
need when they want to answer “did our build time regress after the
Xcode 16 upgrade?” or “is the old Gradle version slower than the new
one?” The cache metrics stay version-free in v1: Tuist’s cache already
keys on version internally, and version-scoped cache analysis is niche
enough to live in ClickHouse for now.
Other system-specific dimensions deferred to a follow-up.
Xcode-only (configuration, platform) and Gradle-only (variant,
flavor) labels add cardinality and are more situational than versions.
Revisit based on customer feedback.
Reconciling cache types
Tuist has three distinct cache layers in production today. Each gets its own
metric name so the semantics are unambiguous and the metric name itself
documents the cache layer:
| Metrics | What it is | Build system | What a hit saves |
|---|---|---|---|
| tuist_xcode_cache_hits_total / tuist_xcode_cache_misses_total | Xcode binary cache: compiled framework / library / bundle artifacts reused across machines | Xcode | Compile and link time for the cached target |
| tuist_module_cache_hits_total / tuist_module_cache_misses_total | Module cache: cached output of tuist generate for projects and their dependencies | Xcode | Project generation time |
| tuist_gradle_cache_hits_total / tuist_gradle_cache_misses_total | Gradle build cache: cached task outputs keyed on task inputs | Gradle | Task execution time |
The three layers operate at different points in the pipeline and on different
inputs. Folding them into one generic tuist_cache_* metric family with a
cache_type label would invite averaging across kinds that are not directly
comparable (a 50% binary-cache hit rate is a very different situation from a
50% module-cache hit rate). Separate metric names make that mistake harder.
Each cache gets two monotonic counters, one for hits and one for misses,
mirroring the pattern Bazel’s own observability uses (remote_cache_hits
and friends) and the standard shape you see in nginx, varnish, and other
Prometheus-ecosystem caches. Customers compute hit rate over any window with
rate():
sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
/ (
sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
+ sum by (project) (rate(tuist_xcode_cache_misses_total[5m]))
)
This gives customers full flexibility over the window (5m for live
dashboards, 24h for trend panels, 30d for capacity planning) without us
having to pick one for them. Adding a future cache layer means adding a new
pair of metric names rather than a new label value, which is a deliberate
and small cost to keep the naming unambiguous.
Aggregator architecture
┌─────────────────────────────────────────────────────┐
│ Build / test / cache runtime code paths             │
│                                                     │
│ :telemetry.execute(                                 │
│   [:tuist, :xcode, :build, :completed],             │
│   %{duration: 142.5},                               │
│   %{account_id, project_id, scheme, is_ci, status}) │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│ Tuist.Metrics.Aggregator (:telemetry handler)       │
│ ETS table :tuist_metrics                            │
│   key: {account_id, metric_name, label_tuple}       │
│   val: counter | {histogram_buckets, sum, count}    │
└──────────────┬──────────────────────────────────────┘
               │ (read-only on scrape)
               ▼
┌─────────────────────────────────────────────────────┐
│ TuistWeb.API.MetricsController                      │
│ 1. Resolve account from Bearer token                │
│ 2. ETS match on {account_id, :_, :_}                │
│ 3. Render OpenMetrics text                          │
│ 4. Emit with Cache-Control: no-store                │
└─────────────────────────────────────────────────────┘
The ETS table is a single :named_table, :public, :set with
write_concurrency: true and read_concurrency: true. Counter bumps use
:ets.update_counter/3, which is atomic and lockless. Histogram observations
bump the appropriate bucket counter, the sum (as a scaled integer to stay
atomic), and the count in a single update_counter call against a tuple of
positions.
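An illustrative Python analogue of that ETS layout, with the same scaled-integer trick for the histogram sum (the names here are hypothetical; the Elixir side does this with :ets.update_counter/3 against a tuple of positions):

```python
from collections import defaultdict

SCALE = 1000  # store the histogram sum in milliseconds so it stays an integer

table = defaultdict(int)  # {(account_id, metric, labels, field): integer}

def bump_counter(account_id, metric, labels):
    table[(account_id, metric, labels, "value")] += 1

def observe_duration(account_id, metric, labels, bucket_index, seconds):
    # One logical update touching the bucket, the scaled sum, and the
    # count, mirroring the single update_counter call described above.
    key = (account_id, metric, labels)
    table[key + (("bucket", bucket_index),)] += 1
    table[key + ("sum_scaled",)] += int(seconds * SCALE)
    table[key + ("count",)] += 1
```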
A :telemetry handler is attached at application startup for each of the
events listed above. The handler is synchronous and does constant work (O(1)
ETS updates), so it does not need a backing queue. If future metrics require
heavier work per event, we can introduce a small per-node aggregator process
with a mailbox.
Cache hit and miss events are emitted on separate telemetry paths
([:tuist, :<system>, :cache, :hit] and [:tuist, :<system>, :cache, :miss])
and update the corresponding _hits_total or _misses_total counter
directly from the telemetry handler. No gauge recomputer is needed: hit
rate is derived on the client side via PromQL rate().
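That hit/miss wiring reduces to a direct event-to-counter map. A sketch (illustrative only; the real handler lives in Tuist.Metrics.Aggregator and writes to ETS):

```python
# Map each cache telemetry event path to its exported counter name.
CACHE_COUNTERS = {
    ("tuist", "xcode", "cache", "hit"): "tuist_xcode_cache_hits_total",
    ("tuist", "xcode", "cache", "miss"): "tuist_xcode_cache_misses_total",
    ("tuist", "module", "cache", "hit"): "tuist_module_cache_hits_total",
    ("tuist", "module", "cache", "miss"): "tuist_module_cache_misses_total",
    ("tuist", "gradle", "cache", "hit"): "tuist_gradle_cache_hits_total",
    ("tuist", "gradle", "cache", "miss"): "tuist_gradle_cache_misses_total",
}

def handle_cache_event(event, metadata, counters):
    """O(1) work per event: resolve the counter name, bump it once."""
    key = (CACHE_COUNTERS[event], metadata["account_id"], metadata["project"])
    counters[key] = counters.get(key, 0) + 1
```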
Cluster-aware aggregation and deploy behavior
The server runs on a single node in steady state, and deploys use a
rolling strategy: the new node comes up and becomes healthy, the load
balancer adds it to the pool, and only then does the old node begin
draining. Every deploy briefly puts two nodes behind the LB, and today’s
single-node steady state may not be tomorrow’s. Rather than design for
single-node and paper over the deploy overlap separately, the aggregator
is cluster-aware from day one. The same mechanism handles both the
rolling-deploy overlap and any future permanent multi-node topology, and
both reduce to the same implementation in Erlang.
Design: scrape-time RPC fan-in.
Each node maintains its own local ETS table. On scrape, the receiving
node fans out to every peer in the cluster and merges snapshots:
Scrape arrives on node A
        │
        ▼
:erpc.multicall([node() | Node.list()], Tuist.Metrics.Aggregator,
                :snapshot_for, [account_id], 2_000)
        │
        ▼  (parallel)
Each node reads its local ETS filtered by account_id
        │
        ▼
Receive {:ok, snapshot_N} or {:error, reason} per node
        │
        ▼
Merge snapshots by summing counter and histogram-bucket values per
{metric, labels} key
        │
        ▼
Render OpenMetrics text, return response
Counter merges are straightforward because counters and histogram bucket
counters are commutative sums. Distributed Erlang (already in place for
Tuist’s Phoenix / libcluster setup) gives us the node discovery and the
RPC primitive for free.
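Because the merge is a commutative sum per {metric, labels} key, it can be sketched in a few lines. The flat snapshot shape assumed here (counter values keyed by metric and label tuple, histograms decomposed into per-bucket entries) is illustrative:

```python
from collections import Counter

def merge_snapshots(snapshots):
    """Sum per-node snapshots keyed by (metric, labels). Counters and
    cumulative histogram buckets both merge by addition, so the result
    is independent of which node each event landed on."""
    merged = Counter()
    for snapshot in snapshots:
        if snapshot is None:  # unreachable peer: serve a partial view
            continue
        merged.update(snapshot)
    return dict(merged)

# Two nodes saw disjoint slices of the same account's traffic:
node_a = {("tuist_xcode_build_runs_total", ("ios-app", "success")): 90}
node_b = {("tuist_xcode_build_runs_total", ("ios-app", "success")): 52,
          ("tuist_xcode_cache_hits_total", ("ios-app",)): 7}
```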
What this buys us.
- Rolling deploys have no oscillation. During the overlap window, a
scrape landing on either node returns the merged view from both. The
per-node counter splits are invisible to Prometheus, so rate() sees
continuous growth instead of an oscillating series.
- Future multi-node is free. Scaling from one steady-state node to N
is a config change, not an architecture rewrite. Scrape cost grows as
O(N), and with a 2s per-peer timeout the design is comfortable through
single-digit node counts.
- Cluster cost is modest. :erpc.multicall is a small message to each
peer and a parallel wait, and the 1-request-per-10s rate limit absorbs
bursts when multiple Prometheus replicas scrape the same account in
the same window.
Failure handling.
If a peer is unreachable (GC pause, rolling deploy half-complete, transient
network blip), :erpc returns {:error, _} for that node. The scrape
handler returns the merged results from the reachable nodes, sets an
X-Tuist-Metrics-Partial: <node> response header, and emits a warning
log. Failing the scrape outright would convert a partial outage into a
total metrics blackout, which is worse than serving a brief partial view
that Prometheus’s own up{} metric and histogram_quantile() stability
already cover.
The three windows on a rolling deploy.
─── pre-handover ─── │ ─── overlap ─── │ ─── post-handover ───
                     ▲                 ▲
             new node healthy,   old node exits after
             LB adds it to pool  graceful drain
- Pre-handover window. Only the old node serves. Fan-in returns the
old node's ETS.
- Overlap window. Both nodes are in the LB pool. Telemetry events
split between them according to LB routing. Fan-in merges both ETS
tables on every scrape, so Prometheus sees the combined total,
regardless of which node the scrape landed on. No oscillation, no
reset.
- Post-handover window. The old node is removed from the LB pool,
drains, and exits. Events in the old node's ETS that were not captured
by a scrape during the overlap are lost to Prometheus (ClickHouse
still has them). Graceful shutdown, below, is the mitigation.
Graceful shutdown with extended drain.
When the LB removes the old node from the pool, the application stays
alive for a configurable grace period (default 30s, two full recommended
15s scrape intervals) before the BEAM exits. During
this grace period:
- :telemetry handlers continue to run, draining any in-flight events.
- The node stays in the distributed Erlang cluster, so fan-in from the
new node's scrapes still reaches it and picks up its final ETS state.
- The /metrics endpoint on the old node also stays up, so even if the
LB happens to send Prometheus a scrape back to it during the drain,
that works too.
- No new CLI uploads arrive because the LB has removed the instance.
This costs 30s per deploy but ensures the old node’s last window of
events is captured before its ETS goes away.
Keeping the overlap window short.
The overlap is a property of the deploy orchestration, not of this
system, but because fan-in removes the oscillation concern, the overlap
length only matters as an upper bound on the unscraped-event loss window
on the old node. A few knobs worth calling out for ops:
- The readiness check for the new node returns healthy quickly (the
/metrics endpoint responds with an empty-but-valid payload even with
zero accumulated events).
- LB weighting shifts traffic to the new node aggressively once it is
healthy, so the old node's share drops quickly.
- The old node's graceful drain begins as soon as the LB removes it,
not after a further delay.
With these in place the overlap is typically 10-30 seconds.
What stays durable, what stays transient.
| Layer | Durable across deploys | Notes |
|---|---|---|
| ClickHouse: raw build / test / cache records | Yes | Authoritative historical record. In-app dashboard reads from here. |
| ETS: aggregated counters and histogram buckets | No | Reset on every deploy. Prometheus rate() handles the reset. |
| Prometheus (customer side): time series | Yes (in customer’s TSDB) | Reset-aware queries, data retained per customer’s policy. |
Customers reconstructing long-horizon totals use increase(...[30d]),
which sums reset-aware deltas across the whole range. This is the
Prometheus-idiomatic pattern and works the same with or without deploy
resets.
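Reset handling is why this works: Prometheus treats any sample smaller than its predecessor as a counter reset and counts the post-reset value as a fresh delta. A simplified sketch of that logic (ignoring the range-boundary extrapolation the real increase() applies):

```python
def increase(samples):
    """Sum reset-aware deltas over a window of counter samples.
    A drop (e.g. a deploy restarting the in-memory aggregator) is
    treated as a reset: the post-reset value counts in full."""
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        total += curr if curr < prev else curr - prev
    return total

# 100 -> 140, deploy resets the aggregator to 0, then 0 -> 25:
# the series never shows a negative spike, just 40 + 0 + 25.
```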
Customer-facing guidance we ship with the docs.
- Do use rate(), irate(), and increase() for any counter-based
panel or alert. All three handle resets.
- Do not alert on the absolute value of a _total counter, since it
will drop to zero on deploy and the alert would fire spuriously.
- Do use histogram_quantile(rate(..._bucket[5m])) for latency
percentiles. rate() on the bucket counters is reset-aware.
- Avoid comparing a _sum or _count snapshot directly to its value
an hour ago. Use increase(..._sum[1h]) / increase(..._count[1h]) for
mean latency.
Mitigations considered and deferred.
- Checkpoint ETS to disk on shutdown, restore on startup. Eliminates
the reset boundary but adds a disk write at SIGTERM and a read at boot,
both of which are new failure modes (partial writes, schema drift,
version mismatch on rolling back a deploy). Not worth the complexity
for a stream Prometheus already handles.
- Snapshot final ETS to Redis or ClickHouse on shutdown. Same concern
plus a new cross-service dependency on the shutdown path. Deferred.
- Back the aggregator with ClickHouse-computed rollups for continuity.
Continuous reconciliation against ClickHouse is a heavier design that
makes sense if we ever promise “no counter resets, ever” as a product
guarantee. Today we do not.
Cross-node fan-in plus extended-drain ships in v1. Between them they
eliminate in-flight oscillation and bound the unscraped-event loss window
to sub-second. If customers still surface concrete pain from
deploy-induced resets (for example, alerts firing on resets or dashboards
showing visible discontinuities despite rate()), persistence is the
follow-up.
Cardinality budget
Cardinality is the only way this design blows up. Rules we enforce:
| Rule | Why |
|---|---|
| No commit SHA, branch name, user ID, or test case name as a label | Unbounded or high-cardinality dimensions belong in the dashboard’s ClickHouse view, not in Prometheus |
| project label: typically 1-2 projects per account | Healthy |
| scheme label (Xcode): dozens per project | Healthy |
| module label (Gradle): can be hundreds per project in larger monorepos | Largest cardinality driver, worth monitoring |
| command label (CLI): ~20-30 distinct (sub)command names | Bounded by the CLI surface itself |
| xcode_version, swift_version, gradle_version, jvm_version: included | Drift slowly (2-3 concurrent versions typical), high signal for migration tracking and version-vs-version comparisons |
| configuration, platform, variant, flavor: deferred | More situational than versions, revisit with feedback |
| Histograms share one global bucket schedule | Prevents per-project bucket explosions |
Rough cardinality, per account, for the build histogram:
- Xcode namespace: projects * schemes * is_ci * status * xcode_version
* swift_version * buckets ≈ 2 * 30 * 2 * 2 * 3 * 3 * 17 ≈ 37k
- Gradle namespace (large monorepo): projects * modules * is_ci *
status * gradle_version * jvm_version * buckets ≈ 1 * 300 * 2 * 2 * 3
* 3 * 17 ≈ 184k
In practice the version multipliers are much smaller than the raw product
because a given module or scheme uses one version at a time with rare
transitions, so effective cardinality is closer to 1.5-2x the version-free
baseline. Gradle monorepos are the biggest driver to watch; if we see
customers with thousands of modules, we may need a module allowlist or
prefix-based aggregation. Revisit based on real scrape sizes once we have
customers onboarded.
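The worst-case arithmetic above, spelled out as a back-of-envelope check (17 series per histogram: 16 finite buckets plus +Inf; _sum and _count are excluded here, as in the estimates above):

```python
from math import prod

BUCKET_SERIES = 16 + 1  # 16 finite buckets plus the implicit +Inf

# Xcode: projects, schemes, is_ci, status, xcode_version, swift_version
xcode = prod([2, 30, 2, 2, 3, 3]) * BUCKET_SERIES    # ≈ 37k
# Gradle monorepo: projects, modules, is_ci, status, gradle_version, jvm_version
gradle = prod([1, 300, 2, 2, 3, 3]) * BUCKET_SERIES  # ≈ 184k
```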
Delivery behavior
| Property | Value |
|---|---|
| Content type | application/openmetrics-text; version=1.0.0; charset=utf-8 |
| Fallback content type | text/plain; version=0.0.4 when Accept does not include OpenMetrics |
| Recommended scrape interval | 15 seconds |
| Minimum scrape interval | 10 seconds (enforced via rate limit) |
| Response cache | None. The ETS read is already O(series count per account), and caching introduces a staleness window that breaks rate(). |
| Counter resets on deploy | Expected. Prometheus’s rate() and increase() handle resets correctly. |
| Data retention on the Tuist side | None beyond the live aggregator. Customers own their history in their TSDB. |
Grafana plugin
Config page prompts for:
- Tuist account handle
- An account token with metrics:read scope
- Target Prometheus-compatible datasource already configured in Grafana

On save, the plugin:
- Calls GET /api/accounts/:account_handle/metrics once to validate the token
- Emits a scrape config snippet the customer drops into Grafana Agent / Alloy:

scrape_configs:
  - job_name: tuist
    scrape_interval: 15s
    metrics_path: /api/accounts/<account>/metrics
    scheme: https
    static_configs:
      - targets: ["tuist.dev"]
    authorization:
      type: Bearer
      credentials: "<account_token>"

- Provisions the bundled dashboards into the selected datasource
Dashboards use standard PromQL. Xcode and Gradle get separate dashboards
(or separate rows on a combined dashboard) because the underlying label
vocabulary differs. Example panel queries on the Xcode build performance
dashboard:
| Panel | PromQL |
|---|---|
| Build rate (5m) | sum by (project) (rate(tuist_xcode_build_runs_total[5m])) |
| Build failure ratio (5m) | sum by (project) (rate(tuist_xcode_build_runs_total{status="failure"}[5m])) / sum by (project) (rate(tuist_xcode_build_runs_total[5m])) |
| p50 build duration | histogram_quantile(0.5, sum by (project, le) (rate(tuist_xcode_build_duration_seconds_bucket[5m]))) |
| p90 build duration | histogram_quantile(0.9, sum by (project, le) (rate(tuist_xcode_build_duration_seconds_bucket[5m]))) |
| Mean build duration | sum by (project) (rate(tuist_xcode_build_duration_seconds_sum[5m])) / sum by (project) (rate(tuist_xcode_build_duration_seconds_count[5m])) |
Gradle panels mirror the Xcode ones with tuist_gradle_* metric names. For
mixed-stack customers who want one board covering both, the metric-name
regex idiom works:
sum by (project) ({__name__=~"tuist_(xcode|gradle)_build_runs_total"})
Example panels on the cache effectiveness dashboard:
| Panel | PromQL |
|---|---|
| Xcode binary cache hit rate (5m) | `sum by (project) (rate(tuist_xcode_cache_hits_total[5m])) / (sum by (project) (rate(tuist_xcode_cache_hits_total[5m])) + sum by (project) (rate(tuist_xcode_cache_misses_total[5m])))` |
| Module cache hit rate (5m) | `sum by (project) (rate(tuist_module_cache_hits_total[5m])) / (sum by (project) (rate(tuist_module_cache_hits_total[5m])) + sum by (project) (rate(tuist_module_cache_misses_total[5m])))` |
| Gradle cache hit rate (5m) | `sum by (project) (rate(tuist_gradle_cache_hits_total[5m])) / (sum by (project) (rate(tuist_gradle_cache_hits_total[5m])) + sum by (project) (rate(tuist_gradle_cache_misses_total[5m])))` |
| Xcode cache miss throughput (5m) | `sum by (project) (rate(tuist_xcode_cache_misses_total[5m]))` |
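The hit-rate expression repeats in panels and in any cache-hit-drop alert, so a Prometheus recording rule can precompute it once and both surfaces reference the recorded series. A sketch, assuming the metric names above; the rule and alert names and the 0.5 threshold are illustrative:

```yaml
groups:
  - name: tuist-cache
    rules:
      # Precompute the 5m per-project hit rate; dashboards and alerts
      # then query the recorded series instead of the full ratio.
      - record: project:tuist_xcode_cache_hit_rate:ratio5m
        expr: |
          sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
            /
          (sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
            + sum by (project) (rate(tuist_xcode_cache_misses_total[5m])))
      # Illustrative: warn when the hit rate stays below 50% for 15 minutes.
      - alert: TuistCacheHitRateLow
        expr: project:tuist_xcode_cache_hit_rate:ratio5m < 0.5
        for: 15m
        labels:
          severity: warning
```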
Example panels on the CLI usage dashboard:
| Panel | PromQL |
|---|---|
| Invocations by command (5m) | `sum by (command) (rate(tuist_command_invocations_total[5m]))` |
| Command failure rate | `sum by (command) (rate(tuist_command_invocations_total{status="failure"}[5m])) / sum by (command) (rate(tuist_command_invocations_total[5m]))` |
| CI vs local invocation split | `sum by (is_ci, command) (rate(tuist_command_invocations_total[5m]))` |
| p90 command duration | `histogram_quantile(0.9, sum by (command, le) (rate(tuist_command_duration_seconds_bucket[5m])))` |
Plugin distribution: the Grafana plugin catalog requires plugins to be
signed, and catalog submission takes Grafana-side review time. We submit
for catalog signing ahead of the launch so the release lands with the
plugin already available through the in-app catalog. If the signing
review is not complete by launch, we ship as an unsigned GitHub release
in the interim (Grafana allows explicit opt-in via the
GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS config) and swap to the
signed catalog entry as soon as it is approved.
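For the interim unsigned release, the customer-side opt-in is a one-line Grafana setting. A sketch as a docker-compose excerpt; `tuist-app` is a placeholder for whatever plugin ID we register with Grafana:

```yaml
# docker-compose.yml excerpt: allow-list the unsigned Tuist plugin.
# GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS takes a comma-separated
# list of plugin IDs; "tuist-app" is a placeholder ID.
services:
  grafana:
    image: grafana/grafana
    environment:
      GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS: "tuist-app"
```

The equivalent `grafana.ini` setting is `allow_loading_unsigned_plugins` under `[plugins]`. Once the signed catalog entry lands, the allow-list entry can simply be removed.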
Alternatives Considered
1. JSON query API instead of a scrape endpoint
Expose a structured JSON query API (filters, time ranges, grouping) that a
bespoke Grafana datasource plugin consumes. More flexible for ad-hoc slicing
(per-scheme p99 over the last year). Rejected for the initial release:
- Puts ClickHouse on the critical path of every dashboard render, with
  unbounded query shapes coming from customers. The in-app views have
  bounded shapes we control; a customer dashboard does not.
- Requires a fully custom Grafana datasource plugin. Scrape reuses the
  Prometheus datasource every Grafana installation already has.
- Reinvents time-series querying badly. Prometheus already does this well,
  and customers already know PromQL.
We can add a JSON query API later as a separate surface if customers ask for
ad-hoc slices that the scrape snapshot cannot express.
2. ClickHouse-backed scrape endpoint
Scrape handler queries ClickHouse (possibly through a rollup materialized
view) on every request. Conceptually simpler: no aggregator, no telemetry
plumbing, just a SELECT. Rejected:
- Couples every scrape to ClickHouse availability. A ClickHouse hiccup
  creates a metrics outage for every customer.
- Wrong semantics for Prometheus counters. ClickHouse rollups give
  instantaneous window counts, not monotonic counters. A rollup reporting
  40 builds in each five-minute window renders as a flat series, so
  `rate()` over it reads near zero instead of the true build rate.
- Per-scrape query load scales with customer count times scrape frequency.
  Easy to DoS ourselves.
The hybrid (ClickHouse rollup for histograms, ETS for counters) is also
dismissed: two sources of truth for the same data and all the rollup
problems above.
3. telemetry_metrics_prometheus_core with filter-at-scrape
Use the off-the-shelf library and include account_id as a label on every
metric. At scrape time, filter the rendered text to rows matching the
authenticated account and strip the label. Rejected:
- The global registry grows with tenant count. Every account's metrics sit
  in memory on every node, and the library was not designed for this.
- Filtering text post-render is fragile (string parsing, escaping edge
  cases). The registry has a structured view, but the library does not
  expose per-label queries.
- One bad label on one internal telemetry event would leak that label into
  every customer's metrics.
The tenant-aware ETS aggregator is more code but is easier to reason about
and has a natural isolation boundary.
4. Redis-backed aggregator via Redix
Store counters and histograms in Redis so every node in a multi-node server
deployment sees the same aggregate without cluster fan-in. Rejected for now:
- Network hop on every telemetry event. Build and test events are high
  frequency and latency-sensitive, and Tuist emits them on hot paths.
- Redis becomes a critical-path dependency for metrics emission, and a
  Redis blip turns into a metrics outage.
- The current server runs on few enough nodes that per-node ETS plus
  scrape-time fan-in is tractable (see Open Questions).
If the server ever scales to enough nodes that per-scrape RPC fan-in becomes
expensive, Redis-backed aggregation is the natural next step.
5. Per-project endpoints instead of one account endpoint with a project
label
`GET /api/accounts/:account/projects/:project/metrics` as the sole endpoint.
Rejected because it forces the customer’s scrape config to grow with their
project count, each with its own target and token. The single-target shape
matches how customers already think about Tuist (one account, many projects)
and matches the plugin’s one-time-setup promise. Per-project filtering is
still available in Grafana via the project label.
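That per-project filtering can be wired once as a Grafana dashboard variable using the standard Prometheus datasource variable query (metric name taken from the panels above):

```
label_values(tuist_xcode_build_runs_total, project)
```

Panels then filter with `{project=~"$project"}`, so a single scrape target still yields per-project boards without touching the scrape config.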
Open Questions
- Histograms and customer bucket preferences. A shared global bucket
schedule keeps cardinality bounded but gives customers no say in the
fidelity of their p99. A customer with a 10-second median build cannot
distinguish p99=11s from p99=15s. Worth exposing a second, finer
short-duration histogram? Or leave it for v2?