Per-account `/metrics` endpoint and Grafana integration

Summary

This RFC proposes a per-customer Prometheus-compatible /metrics endpoint that
customers can scrape to pull live build, test, cache, and CLI command
metrics out of Tuist, and a Grafana app plugin that bundles a datasource, a
guided config page, and curated dashboards covering build performance, test
performance, test reliability, cache effectiveness, and CLI usage across
both Xcode and Gradle projects.

The scrape path is fed by :telemetry events emitted on build, test, cache,
and CLI command lifecycle transitions, aggregated in-memory per account.
ClickHouse stays
authoritative for the in-app historical views; the Prometheus stream is the
live sampled feed customers run through their own observability stack.

Motivation

Tuist already computes rich analytics (build durations, test durations, cache
hit rates, flakiness) and surfaces them in the dashboard at tuist.dev. For many
teams that is enough. For teams with a centralized observability practice, it
is not:

  1. One pane of glass. Platform teams want build and test health on the
    same Grafana board as deploy health, API latency, and error rates. Today
    they cannot.
  2. Alerting. Customers want to alert on build-time regressions, cache-hit
    drops, and flakiness spikes using their existing Prometheus alert manager
    and on-call rotation, not a second notification surface.
  3. Customer-controlled retention and correlation. Tuist retains its own
    metrics for a long time, but customers often want build and test signals
    living next to their other operational data in a TSDB they control, with
    retention policies they set, so they can join build health against deploy
    events, incident timelines, or capacity plans without crossing system
    boundaries.
  4. Data residency. Enterprise customers occasionally need build health
    signals to stay inside their own infrastructure. Pulling via scrape lets
    them do that.

The ask in its simplest form: give us a URL and a token, we will take it from
there. A Prometheus scrape endpoint plus a Grafana integration is the smallest
surface that satisfies all four.

Prior Art

Fly.io per-app Prometheus

Fly exposes a token-authenticated scrape URL per application. One target per
resource, HMAC-style bearer token, Prometheus text format. The customer drops
it into their scrape config and is done. We adopt the token-per-tenant model
directly: no per-scrape query parameters, no tenant header, just Authorization: Bearer <account-token> resolving to one account.

Grafana Mimir multi-tenant exposition

Mimir partitions data via the X-Scope-OrgID request header. The tenancy
signal is out-of-band from query parameters and orthogonal to labels. We apply
the same principle: the authenticated subject is the tenancy signal, and
account_id never appears as a label in the exposed output. Customers see
their data as if it were single-tenant.

telemetry_metrics_prometheus_core

The idiomatic Elixir library for turning :telemetry events into Prometheus
exposition. It is ETS-backed under the hood with a global registry. We reuse
the telemetry vocabulary (counter, sum, distribution) and the OpenMetrics
output shape but own the registry so we can index it by account_id cheaply.

Grafana App Plugin model

Official integrations for Kubernetes, GitHub, MongoDB, and others ship as
Grafana app plugins. The plugin bundles:

  1. A datasource (usually a thin wrapper around Prometheus)
  2. A config page that collects credentials and emits a scrape-target snippet
  3. A set of curated dashboards

We follow the same structure. The dashboards are the headline feature, the
datasource is a wrapper around Prometheus-compatible storage, and the config
page writes the scrape config for Grafana Agent / Alloy users.

Proposal

Endpoint

GET /api/accounts/:account_handle/metrics

Response body is OpenMetrics text exposition (Content-Type:
application/openmetrics-text; version=1.0.0; charset=utf-8). The endpoint
returns a snapshot of the in-memory aggregator at scrape time. Prometheus
records the evolution across scrapes on its side.

Per-project slicing is exposed as a project label on every metric, not as a
separate endpoint. This keeps the customer’s scrape config to a single target
per Tuist account. Label cardinality is bounded by the number of projects in
the account.

Authentication

| Property | Value |
| --- | --- |
| Header | Authorization: Bearer <account_token> |
| Token type | Account token only (project tokens are not accepted) |
| Required scope | metrics:read |
| Authorization action | :metrics_read on the account |
| Rate limit key | metrics:auth:account:{account_id}, reusing TuistWeb.RateLimit.InMemory |
| Rate limit shape | 1 request per 10 seconds per account. Aligns with a 15s Prometheus scrape interval with headroom for retries. |

A new metrics:read scope is added to the existing fine-grained account-token
model at account_tokens_controller.ex.
Customers create a token with that scope, drop it into their scrape config,
and scrape.
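A quick way to sanity-check a freshly minted token before wiring up Prometheus (the account handle below is a placeholder, and the host comes from the scrape-target shape this RFC describes):

```shell
# Ad-hoc check that a metrics:read token resolves and the exposition renders.
# "acme" is a placeholder account handle.
curl -H "Authorization: Bearer $TUIST_ACCOUNT_TOKEN" \
     -H "Accept: application/openmetrics-text" \
     https://tuist.dev/api/accounts/acme/metrics
```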

Exposition format

We emit OpenMetrics text. It is a CNCF-incubated standardization of the
Prometheus text format; both formats coexist and are negotiated via the
Accept header. OpenMetrics adds stricter typing, UNIT metadata, _created
timestamps, and exemplars, which we want to keep open as a future option (see
Open Questions). Prometheus and every Prometheus-compatible TSDB accept it
out of the box, and we fall back to the classic text/plain; version=0.0.4
format when the client does not advertise OpenMetrics in its Accept.

Every metric is exported with a project label. Additional labels are kept
small and bounded (see Cardinality below).

# HELP tuist_xcode_build_duration_seconds Duration of an Xcode build run, including all phases
# TYPE tuist_xcode_build_duration_seconds histogram
# UNIT tuist_xcode_build_duration_seconds seconds
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="30"} 12
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="60"} 48
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="120"} 103
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="300"} 140
tuist_xcode_build_duration_seconds_bucket{project="ios-app",scheme="App",is_ci="true",status="success",le="+Inf"} 142
tuist_xcode_build_duration_seconds_sum{project="ios-app",scheme="App",is_ci="true",status="success"} 12894.5
tuist_xcode_build_duration_seconds_count{project="ios-app",scheme="App",is_ci="true",status="success"} 142

# HELP tuist_xcode_build_runs_total Total number of Xcode build runs observed
# TYPE tuist_xcode_build_runs_total counter
tuist_xcode_build_runs_total{project="ios-app",scheme="App",is_ci="true",status="success"} 142
tuist_xcode_build_runs_total{project="ios-app",scheme="App",is_ci="true",status="failure"} 7

# HELP tuist_gradle_build_runs_total Total number of Gradle build runs observed
# TYPE tuist_gradle_build_runs_total counter
tuist_gradle_build_runs_total{project="android-app",module=":app",is_ci="true",status="success"} 88

# HELP tuist_xcode_cache_hits_total Xcode binary cache hits
# TYPE tuist_xcode_cache_hits_total counter
tuist_xcode_cache_hits_total{project="ios-app"} 3412
# HELP tuist_xcode_cache_misses_total Xcode binary cache misses
# TYPE tuist_xcode_cache_misses_total counter
tuist_xcode_cache_misses_total{project="ios-app"} 509

# HELP tuist_module_cache_hits_total Module cache hits
# TYPE tuist_module_cache_hits_total counter
tuist_module_cache_hits_total{project="ios-app"} 780
# HELP tuist_module_cache_misses_total Module cache misses
# TYPE tuist_module_cache_misses_total counter
tuist_module_cache_misses_total{project="ios-app"} 50

# HELP tuist_gradle_cache_hits_total Gradle build cache hits
# TYPE tuist_gradle_cache_hits_total counter
tuist_gradle_cache_hits_total{project="android-app"} 2104
# HELP tuist_gradle_cache_misses_total Gradle build cache misses
# TYPE tuist_gradle_cache_misses_total counter
tuist_gradle_cache_misses_total{project="android-app"} 811

# HELP tuist_command_invocations_total CLI command invocations
# TYPE tuist_command_invocations_total counter
tuist_command_invocations_total{project="ios-app",command="generate",is_ci="false",status="success"} 419
tuist_command_invocations_total{project="ios-app",command="install",is_ci="true",status="success"} 87
tuist_command_invocations_total{project="ios-app",command="cache warm",is_ci="true",status="success"} 142

# EOF

Metric set (initial release)

Metrics fall into three groups:

  • tuist_xcode_* for Xcode-specific build, test, and cache metrics.
  • tuist_gradle_* for the Gradle equivalents.
  • tuist_command_* for CLI command invocations, which are cross-cutting
    and apply to both build systems.

The Xcode and Gradle namespaces use their respective systems’ native
vocabularies for labels (scheme for Xcode, module for Gradle) rather than
a forced abstraction. The reconciliation section below explains why.

Xcode metrics:

| Metric | Type | Labels | Source telemetry event |
| --- | --- | --- | --- |
| tuist_xcode_build_runs_total | counter | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :build, :completed] |
| tuist_xcode_build_duration_seconds | histogram | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :build, :completed] |
| tuist_xcode_test_runs_total | counter | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :test_run, :completed] |
| tuist_xcode_test_run_duration_seconds | histogram | project, scheme, is_ci, status, xcode_version, swift_version | [:tuist, :xcode, :test_run, :completed] |
| tuist_xcode_test_cases_total | counter | project, status (passed/failed/skipped), xcode_version, swift_version | [:tuist, :xcode, :test_case, :completed] |
| tuist_xcode_cache_hits_total | counter | project | [:tuist, :xcode, :cache, :hit] |
| tuist_xcode_cache_misses_total | counter | project | [:tuist, :xcode, :cache, :miss] |
| tuist_module_cache_hits_total | counter | project | [:tuist, :module, :cache, :hit] |
| tuist_module_cache_misses_total | counter | project | [:tuist, :module, :cache, :miss] |

Gradle metrics:

| Metric | Type | Labels | Source telemetry event |
| --- | --- | --- | --- |
| tuist_gradle_build_runs_total | counter | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :build, :completed] |
| tuist_gradle_build_duration_seconds | histogram | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :build, :completed] |
| tuist_gradle_test_runs_total | counter | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :test_run, :completed] |
| tuist_gradle_test_run_duration_seconds | histogram | project, module, is_ci, status, gradle_version, jvm_version | [:tuist, :gradle, :test_run, :completed] |
| tuist_gradle_test_cases_total | counter | project, status (passed/failed/skipped), gradle_version, jvm_version | [:tuist, :gradle, :test_case, :completed] |
| tuist_gradle_cache_hits_total | counter | project | [:tuist, :gradle, :cache, :hit] |
| tuist_gradle_cache_misses_total | counter | project | [:tuist, :gradle, :cache, :miss] |

CLI command metrics (cross-cutting):

| Metric | Type | Labels | Source telemetry event |
| --- | --- | --- | --- |
| tuist_command_invocations_total | counter | project, command, is_ci, status | [:tuist, :command, :completed] |
| tuist_command_duration_seconds | histogram | project, command, is_ci, status | [:tuist, :command, :completed] |

The command label carries the full command path as a single
space-separated string: generate, install, build, test,
cache warm, cache clean, registry login, and so on. Subcommands are
kept in the same label rather than split into a separate subcommand
label because they are not semantically independent dimensions: a user
who runs tuist cache warm is not running cache with a warm
modifier; they are running the specific cache warm command. Keeping
them flat makes PromQL straightforward:
rate(tuist_command_invocations_total{command="cache warm"}[5m]).

CLI command metrics do not live under either build-system namespace
because commands are the user’s entry point, not an artifact of a
specific build system. tuist generate produces output for Xcode today
but may target other systems in the future; tuist install is
system-agnostic; tuist build delegates to whichever system the project
is configured for. Tracking them separately from build / test metrics
lets customers correlate their own invocation patterns with the
downstream build and test results.

Histogram buckets for durations use a fixed, shared schedule so customers can
aggregate across projects without losing fidelity:

[1, 2, 5, 10, 20, 30, 60, 120, 180, 300, 600, 900, 1200, 1800, 2700, 3600]

This covers 1 second through 60 minutes, which matches the range the dashboard
currently shows for both builds and test runs.
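To make the cumulative-bucket semantics concrete, here is a small Python model of how one observation lands in this schedule (the production aggregator is Elixir; this sketch is illustrative only):

```python
import bisect

# The fixed, shared bucket schedule from this RFC (upper bounds, in seconds).
BUCKETS = [1, 2, 5, 10, 20, 30, 60, 120, 180, 300,
           600, 900, 1200, 1800, 2700, 3600]

def new_histogram():
    # One counter per finite bucket plus +Inf, a running sum, and a count --
    # the same triple the _bucket / _sum / _count series expose.
    return {"buckets": [0] * (len(BUCKETS) + 1), "sum": 0.0, "count": 0}

def observe(hist, duration_s):
    """Record one duration. Prometheus buckets are cumulative: a value
    increments every bucket whose upper bound is >= the value, plus +Inf."""
    idx = bisect.bisect_left(BUCKETS, duration_s)
    for i in range(idx, len(BUCKETS) + 1):
        hist["buckets"][i] += 1
    hist["sum"] += duration_s
    hist["count"] += 1

hist = new_histogram()
for d in (12.0, 45.0, 95.0, 250.0):
    observe(hist, d)

# le="60" counts every observation at or under 60s: the 12s and 45s runs.
assert hist["buckets"][BUCKETS.index(60)] == 2
```

Because the schedule is global, two projects' `_bucket` series can be summed directly before `histogram_quantile()` without bucket-boundary mismatches.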

Reconciling Xcode and Gradle

The namespaces are split per build system, and each namespace uses its
system’s native label vocabulary. Xcode metrics have a scheme label.
Gradle metrics have a module label. There is no generic target label and
no build_system discriminator. The build system is implicit in the metric
name.

This is deliberate. A scheme is not a module; forcing both into one
abstraction surfaces wrong-shaped questions (and wrong-shaped alerts) more
often than it helps. Splitting the namespaces makes every single-system
dashboard unambiguous, lets the two systems’ schemas evolve independently,
and removes the temptation to average metrics across systems that are not
directly comparable.

The tradeoff is cross-system aggregation. Platform teams managing mixed iOS
and Android projects need one extra PromQL step, either a metric-name regex:

sum by (project) ({__name__=~"tuist_(xcode|gradle)_build_runs_total"})

or an explicit union:

sum(rate(tuist_xcode_build_runs_total[5m]))
  + sum(rate(tuist_gradle_build_runs_total[5m]))

Both are common Prometheus idioms. We judge the clarity of single-system
dashboards to be worth more than the one-line savings on cross-system ones.

Labels that do cross namespaces by convention:

| Label | Values | Applies to |
| --- | --- | --- |
| project | customer-supplied project identifier | All metrics |
| is_ci | true / false | Build, test, and command metrics |
| status | success / failure | Build, test-run, and command metrics |
| status (test case) | passed / failed / skipped | Test-case metrics |

Keeping these identical across namespaces is the concession that lets
customers write a regex over metric names and still group the results
meaningfully. CLI command metrics (tuist_command_*) sit outside both
build-system namespaces because invocations are the user’s entry point,
not an artifact of a specific system. They use the same shared labels
(project, is_ci, status) so they aggregate cleanly alongside the
per-system metrics when customers want a “total activity” panel.

System-specific version labels are in from day one.
Xcode metrics carry xcode_version and swift_version. Gradle metrics
carry gradle_version and jvm_version. The values drift slowly (teams
migrate gradually and typically run 2-3 versions concurrently), so
cardinality stays bounded, and these are exactly the labels customers
need when they want to answer “did our build time regress after the
Xcode 16 upgrade?” or “is the old Gradle version slower than the new
one?” The cache metrics stay version-free in v1: Tuist’s cache already
keys on version internally, and version-scoped cache analysis is niche
enough to live in ClickHouse for now.
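As an illustration, with xcode_version in place the upgrade question above reduces to one grouped quantile query (standard PromQL; window and quantile are placeholders):

```promql
# p90 build duration, split by Xcode version, over the last day of samples
histogram_quantile(
  0.9,
  sum by (xcode_version, le) (rate(tuist_xcode_build_duration_seconds_bucket[1d]))
)
```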

Other system-specific dimensions deferred to a follow-up.
Xcode-only (configuration, platform) and Gradle-only (variant,
flavor) labels add cardinality and are more situational than versions.
Revisit based on customer feedback.

Reconciling cache types

Tuist has three distinct cache layers in production today. Each gets its own
metric name so the semantics are unambiguous and the metric name itself
documents the cache layer:

| Metrics | What it is | Build system | What a hit saves |
| --- | --- | --- | --- |
| tuist_xcode_cache_hits_total / tuist_xcode_cache_misses_total | Xcode binary cache: compiled framework / library / bundle artifacts reused across machines | Xcode | Compile and link time for the cached target |
| tuist_module_cache_hits_total / tuist_module_cache_misses_total | Module cache: cached output of tuist generate for projects and their dependencies | Xcode | Project generation time |
| tuist_gradle_cache_hits_total / tuist_gradle_cache_misses_total | Gradle build cache: cached task outputs keyed on task inputs | Gradle | Task execution time |

The three layers operate at different points in the pipeline and on different
inputs. Folding them into one generic tuist_cache_* metric family with a
cache_type label would invite averaging across kinds that are not directly
comparable (a 50% binary-cache hit rate is a very different situation from a
50% module-cache hit rate). Separate metric names make that mistake harder.

Each cache gets two monotonic counters, one for hits and one for misses,
mirroring the pattern Bazel’s own observability uses (remote_cache_hits
and friends) and the standard shape you see in nginx, varnish, and other
Prometheus-ecosystem caches. Customers compute hit rate over any window with
rate():

sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
  / (
    sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
    + sum by (project) (rate(tuist_xcode_cache_misses_total[5m]))
  )

This gives customers full flexibility over the window (5m for live
dashboards, 24h for trend panels, 30d for capacity planning) without us
having to pick one for them. Adding a future cache layer means adding a new
pair of metric names rather than a new label value, which is a deliberate
and small cost to keep the naming unambiguous.

Aggregator architecture

          ┌─────────────────────────────────────────────────────┐
          │ Build / test / cache runtime code paths             │
          │                                                     │
          │   :telemetry.execute(                               │
          │     [:tuist, :xcode, :build, :completed],           │
          │     %{duration: 142.5},                             │
          │     %{account_id, project_id, scheme, is_ci,        │
          │       status})                                      │
          └──────────────┬──────────────────────────────────────┘
                         │
                         ▼
          ┌─────────────────────────────────────────────────────┐
          │ Tuist.Metrics.Aggregator (:telemetry handler)       │
          │   ETS table :tuist_metrics                          │
          │   key:  {account_id, metric_name, label_tuple}      │
          │   val:  counter | {histogram_buckets, sum, count}   │
          └──────────────┬──────────────────────────────────────┘
                         │  (read-only on scrape)
                         ▼
          ┌─────────────────────────────────────────────────────┐
          │ TuistWeb.API.MetricsController                      │
          │   1. Resolve account from Bearer token              │
          │   2. ETS match on {account_id, :_, :_}              │
          │   3. Render OpenMetrics text                        │
          │   4. Emit with Cache-Control: no-store              │
          └─────────────────────────────────────────────────────┘

The ETS table is a single :named_table, :public, :set with
write_concurrency: true and read_concurrency: true. Counter bumps use
:ets.update_counter/3, which is atomic and lockless. Histogram observations
bump the appropriate bucket counter, the sum (as a scaled integer to stay
atomic), and the count in a single update_counter call against a tuple of
positions.
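A Python sketch of that storage shape, with a plain dict standing in for the ETS table (module and field names here are hypothetical, not the real code):

```python
# Illustrative model of the aggregator table above. In the real system
# :ets.update_counter/3 bumps the integer positions atomically.

SCALE = 1_000   # durations kept as scaled integers (ms) so updates stay atomic
N_BUCKETS = 17  # 16 finite bounds plus the +Inf bucket

table = {}

def bump_counter(account_id, metric, labels, by=1):
    """Counter metrics: one integer per {account_id, metric, labels} key."""
    key = (account_id, metric, labels)
    table[key] = table.get(key, 0) + by

def observe_histogram(account_id, metric, labels, duration_s, bucket_index):
    """Histogram metrics: bump a bucket, the scaled sum, and the count in
    one update -- the tuple-of-positions pattern described above."""
    key = (account_id, metric, labels)
    row = table.setdefault(key, [0] * N_BUCKETS + [0, 0])  # buckets ++ sum ++ count
    row[bucket_index] += 1
    row[N_BUCKETS] += int(duration_s * SCALE)
    row[N_BUCKETS + 1] += 1

labels = ("ios-app", "App", "true", "success")
bump_counter(1, "tuist_xcode_build_runs_total", labels)
observe_histogram(1, "tuist_xcode_build_duration_seconds", labels, 42.5, bucket_index=6)

# Scrape path: a read-only match on the account prefix of the key.
snapshot = {k: v for k, v in table.items() if k[0] == 1}
```

The key shape is what makes the per-account scrape cheap: one match on the first tuple element pulls every series the exposition needs.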

A :telemetry handler is attached at application startup for each of the
events listed above. The handler is synchronous and does constant work (O(1)
ETS updates), so it does not need a backing queue. If future metrics require
heavier work per event, we can introduce a small per-node aggregator process
with a mailbox.

Cache hit and miss events are emitted on separate telemetry paths
([:tuist, :<system>, :cache, :hit] and [:tuist, :<system>, :cache, :miss])
and update the corresponding _hits_total or _misses_total counter
directly from the telemetry handler. No gauge recomputer is needed: hit
rate is derived on the client side via PromQL rate().

Cluster-aware aggregation and deploy behavior

The server runs on a single node in steady state, and deploys use a
rolling strategy: the new node comes up and becomes healthy, the load
balancer adds it to the pool, and only then does the old node begin
draining. Every deploy briefly puts two nodes behind the LB, and today’s
single-node steady state may not be tomorrow’s. Rather than design for
single-node and paper over the deploy overlap separately, the aggregator
is cluster-aware from day one. The same mechanism handles both the
rolling-deploy overlap and any future permanent multi-node topology, and
both reduce to the same implementation in Erlang.

Design: scrape-time RPC fan-in.
Each node maintains its own local ETS table. On scrape, the receiving
node fans out to every peer in the cluster and merges snapshots:

Scrape arrives on node A
        │
        ▼
:erpc.multicall([node() | Node.list()], Tuist.Metrics.Aggregator,
                :snapshot_for, [account_id], 2_000)
        │
        ▼  (parallel)
Each node reads its local ETS filtered by account_id
        │
        ▼
Receive {:ok, snapshot_N} or {:error, reason} per node
        │
        ▼
Merge snapshots by summing counter and histogram-bucket values per
{metric, labels} key
        │
        ▼
Render OpenMetrics text, return response

Counter merges are straightforward because counters and histogram bucket
counters are commutative sums. Distributed Erlang (already in place for
Tuist’s Phoenix / libcluster setup) gives us the node discovery and the
RPC primitive for free.
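A Python sketch of the merge step, with each reply standing in for one node's :erpc result (tags and shapes are illustrative; the real implementation is Elixir):

```python
def merge_snapshots(replies):
    """Merge per-node snapshots by summing values per {metric, labels} key.
    Counters and histogram bucket counters are commutative sums, so merge
    order does not matter. Unreachable peers degrade to a partial view
    instead of failing the scrape."""
    merged, partial_from = {}, []
    for node, (tag, payload) in replies:
        if tag == "error":
            partial_from.append(node)  # surfaced via X-Tuist-Metrics-Partial
            continue
        for key, value in payload.items():
            merged[key] = merged.get(key, 0) + value
    return merged, partial_from

key = ("tuist_xcode_cache_hits_total", ("ios-app",))
replies = [
    ("old@host", ("ok", {key: 3000})),    # draining node's counters
    ("new@host", ("ok", {key: 412})),     # freshly deployed node's counters
    ("gone@host", ("error", "timeout")),  # unreachable peer
]
merged, partial = merge_snapshots(replies)
assert merged[key] == 3412
assert partial == ["gone@host"]
```

Prometheus only ever sees the merged total, so a counter split across two nodes during a deploy is indistinguishable from a single-node counter.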

What this buys us.

  • Rolling deploys have no oscillation. During the overlap window, a
    scrape landing on either node returns the merged view from both. The
    per-node counter splits are invisible to Prometheus, so rate() sees
    continuous growth instead of an oscillating series.
  • Future multi-node is free. Scaling from one steady-state node to N
    is a config change, not an architecture rewrite. Scrape cost grows as
    O(N), and with a 2s per-peer timeout the design is comfortable through
    single-digit node counts.
  • Cluster cost is modest. :erpc.multicall is a small message to each
    peer and a parallel wait. A scrape cache (10s TTL per account) absorbs
    bursts when multiple Prometheus replicas scrape the same account in
    the same window.

Failure handling.
If a peer is unreachable (GC pause, rolling deploy half-complete, transient
network blip), :erpc returns {:error, _} for that node. The scrape
handler returns the merged results from the reachable nodes, sets an
X-Tuist-Metrics-Partial: <node> response header, and emits a warning
log. Failing the scrape outright would convert a partial outage into a
total metrics blackout, which is worse than serving a brief partial view
that Prometheus’s own up{} metric and histogram_quantile() stability
already cover.

The three windows on a rolling deploy.

  ─── pre-handover ─── │ ─── overlap ─── │ ─── post-handover ───
                       ▲                  ▲
                       new node healthy,  old node exits after
                       LB adds it to pool graceful drain
  1. Pre-handover window.
    Only the old node serves. Fan-in returns the old node’s ETS.

  2. Overlap window.
    Both nodes are in the LB pool. Telemetry events split between them
    according to LB routing. Fan-in merges both ETS tables on every
    scrape, so Prometheus sees the combined total, regardless of which
    node the scrape landed on. No oscillation, no reset.

  3. Post-handover window.
    The old node is removed from the LB pool, drains, and exits. Events
    in the old node’s ETS that were not captured by a scrape during the
    overlap are lost to Prometheus (ClickHouse still has them). Graceful
    shutdown, below, is the mitigation.

Graceful shutdown with extended drain.
When the LB removes the old node from the pool, the application stays
alive for a configurable grace period (default 30s, slightly longer than
2x the recommended 15s scrape interval) before the BEAM exits. During
this grace period:

  • :telemetry handlers continue to run, draining any in-flight events.
  • The node stays in the distributed Erlang cluster, so fan-in from the
    new node’s scrapes still reaches it and picks up its final ETS state.
  • The /metrics endpoint on the old node also stays up, so even if the
    LB happens to send Prometheus a scrape back to it during the drain,
    that works too.
  • No new CLI uploads arrive because the LB has removed the instance.

This costs 30s per deploy but ensures the old node’s last window of
events is captured before its ETS goes away.

Keeping the overlap window short.
The overlap is a property of the deploy orchestration, not of this
system, but because fan-in removes the oscillation concern, the overlap
length only matters as an upper bound on the unscraped-event loss window
on the old node. A few knobs worth calling out for ops:

  • Readiness check for the new node returns healthy quickly (the /metrics
    endpoint responds with an empty-but-valid payload even with zero
    accumulated events).
  • LB weighting shifts traffic to the new node aggressively once it is
    healthy, so the old node’s share drops quickly.
  • The old node’s graceful drain begins as soon as the LB removes it,
    not after a further delay.

With these in place the overlap is typically 10-30 seconds.

What stays durable, what stays transient.

| Layer | Durable across deploys | Notes |
| --- | --- | --- |
| ClickHouse: raw build / test / cache records | Yes | Authoritative historical record. In-app dashboard reads from here. |
| ETS: aggregated counters and histogram buckets | No | Reset on every deploy. Prometheus rate() handles the reset. |
| Prometheus (customer side): time series | Yes (in customer’s TSDB) | Reset-aware queries, data retained per customer’s policy. |

Customers reconstructing long-horizon totals use increase(...[30d]),
which sums reset-aware deltas across the whole range. This is the
Prometheus-idiomatic pattern and works the same with or without deploy
resets.

Customer-facing guidance we ship with the docs.

  • Do use rate(), irate(), and increase() for any counter-based
    panel or alert. All three handle resets.
  • Do not alert on the absolute value of a _total counter, since it
    will drop to zero on deploy and the alert would fire spuriously.
  • Do use histogram_quantile(rate(..._bucket[5m])) for latency
    percentiles. rate() on the bucket counters is reset-aware.
  • Avoid comparing a _sum or _count snapshot directly to its value
    an hour ago. Use increase(..._sum[1h]) / increase(..._count[1h]) for
    mean latency.
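Following that guidance, an alert on cache-hit degradation (one of the motivating use cases) stays reset-safe. The threshold and windows below are illustrative placeholders, not recommendations:

```yaml
groups:
  - name: tuist-build-health
    rules:
      - alert: XcodeBinaryCacheHitRateLow
        # rate() on both counters keeps this robust to deploy-time resets.
        expr: |
          sum by (project) (rate(tuist_xcode_cache_hits_total[30m]))
            / (
              sum by (project) (rate(tuist_xcode_cache_hits_total[30m]))
              + sum by (project) (rate(tuist_xcode_cache_misses_total[30m]))
            )
          < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Binary cache hit rate below 50% for {{ $labels.project }}"
```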

Mitigations considered and deferred.

  • Checkpoint ETS to disk on shutdown, restore on startup. Eliminates
    the reset boundary but adds a disk write at SIGTERM and a read at boot,
    both of which are new failure modes (partial writes, schema drift,
    version mismatch on rolling back a deploy). Not worth the complexity
    for a stream Prometheus already handles.
  • Snapshot final ETS to Redis or ClickHouse on shutdown. Same concern
    plus a new cross-service dependency on the shutdown path. Deferred.
  • Back the aggregator with ClickHouse-computed rollups for continuity.
    Continuous reconciliation against ClickHouse is a heavier design that
    makes sense if we ever promise “no counter resets, ever” as a product
    guarantee. Today we do not.

Cross-node fan-in plus extended-drain ships in v1. Between them they
eliminate in-flight oscillation and bound the unscraped-event loss window
to sub-second. If customers still surface concrete pain from
deploy-induced resets (for example, alerts firing on resets or dashboards
showing visible discontinuities despite rate()), persistence is the
follow-up.

Cardinality budget

Cardinality is the only way this design blows up. Rules we enforce:

| Rule | Why |
| --- | --- |
| No commit SHA, branch name, user ID, or test case name as a label | Unbounded or high-cardinality dimensions belong in the dashboard’s ClickHouse view, not in Prometheus |
| project label: typically 1-2 projects per account | Healthy |
| scheme label (Xcode): dozens per project | Healthy |
| module label (Gradle): can be hundreds per project in larger monorepos | Largest cardinality driver, worth monitoring |
| command label (CLI): ~20-30 distinct (sub)command names | Bounded by the CLI surface itself |
| xcode_version, swift_version, gradle_version, jvm_version: included | Drift slowly (2-3 concurrent versions typical), high signal for migration tracking and version-vs-version comparisons |
| configuration, platform, variant, flavor: deferred | More situational than versions, revisit with feedback |
| Histograms share one global bucket schedule | Prevents per-project bucket explosions |

Rough cardinality, per account, for the build histogram:

  • Xcode namespace: projects * schemes * is_ci * status * xcode_version * swift_version * buckets ≈ 2 * 30 * 2 * 2 * 3 * 3 * 17 ≈ 37k
  • Gradle namespace (large monorepo): projects * modules * is_ci * status * gradle_version * jvm_version * buckets ≈ 1 * 300 * 2 * 2 * 3 * 3 * 17 ≈ 184k

In practice the version multipliers are much smaller than the raw product
because a given module or scheme uses one version at a time with rare
transitions, so effective cardinality is closer to 1.5-2x the version-free
baseline. Gradle monorepos are the biggest driver to watch; if we see
customers with thousands of modules, we may need a module allowlist or
prefix-based aggregation. Revisit based on real scrape sizes once we have
customers onboarded.
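The arithmetic above can be reproduced with a few lines of Python. Treat this as a back-of-envelope estimator; the "effective" version multiplier encodes the RFC's assumption, not a measurement:

```python
from math import prod

def series(projects, slices, flags=(2, 2), versions=(3, 3), buckets=17):
    """Worst-case series count for one duration histogram: every label axis
    multiplies, including the 16 finite buckets plus +Inf."""
    return prod((projects, slices, *flags, *versions, buckets))

xcode = series(projects=2, slices=30)    # slices = schemes
gradle = series(projects=1, slices=300)  # slices = modules, large monorepo
assert xcode == 36_720    # ~= 37k
assert gradle == 183_600  # ~= 184k

# Effective cardinality: a given scheme/module runs roughly one toolchain
# version at a time, so (3, 3) collapses toward (2, 1) in practice.
effective_gradle = series(1, 300, versions=(2, 1))
```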

Delivery behavior

| Property | Value |
| --- | --- |
| Content type | application/openmetrics-text; version=1.0.0; charset=utf-8 |
| Fallback content type | text/plain; version=0.0.4 when Accept does not include OpenMetrics |
| Recommended scrape interval | 15 seconds |
| Minimum scrape interval | 10 seconds (enforced via rate limit) |
| Response cache | A short per-account snapshot cache (10s TTL, matching the minimum scrape interval) absorbs bursts from multiple Prometheus replicas; see the fan-in section. The underlying ETS read is O(series count per account). |
| Counter resets on deploy | Expected. Prometheus’s rate() and increase() handle resets correctly. |
| Data retention on the Tuist side | None beyond the live aggregator. Customers own their history in their TSDB. |

Grafana plugin

Config page prompts for:

  1. Tuist account handle
  2. An account token with metrics:read scope
  3. Target Prometheus-compatible datasource already configured in Grafana

On save, the plugin:

  1. Calls GET /api/accounts/:account_handle/metrics once to validate the token
  2. Emits a scrape config snippet the customer drops into Grafana Agent / Alloy:
    scrape_configs:
      - job_name: tuist
        scrape_interval: 15s
        metrics_path: /api/accounts/<account>/metrics
        scheme: https
        static_configs:
          - targets: ["tuist.dev"]
        authorization:
          type: Bearer
          credentials: "<account_token>"
    
  3. Provisions the bundled dashboards into the selected datasource

Dashboards use standard PromQL. Xcode and Gradle get separate dashboards
(or separate rows on a combined dashboard) because the underlying label
vocabulary differs. Example panel queries on the Xcode build performance
dashboard:

| Panel | PromQL |
| --- | --- |
| Build rate (5m) | sum by (project) (rate(tuist_xcode_build_runs_total[5m])) |
| Build failure ratio (5m) | sum by (project) (rate(tuist_xcode_build_runs_total{status="failure"}[5m])) / sum by (project) (rate(tuist_xcode_build_runs_total[5m])) |
| p50 build duration | histogram_quantile(0.5, sum by (project, le) (rate(tuist_xcode_build_duration_seconds_bucket[5m]))) |
| p90 build duration | histogram_quantile(0.9, sum by (project, le) (rate(tuist_xcode_build_duration_seconds_bucket[5m]))) |
| Mean build duration | sum by (project) (rate(tuist_xcode_build_duration_seconds_sum[5m])) / sum by (project) (rate(tuist_xcode_build_duration_seconds_count[5m])) |

Gradle panels mirror the Xcode ones with tuist_gradle_* metric names. For
mixed-stack customers who want one board covering both, the metric-name
regex idiom works:

sum by (project) ({__name__=~"tuist_(xcode|gradle)_build_runs_total"})

Example panels on the cache effectiveness dashboard:

  Xcode binary cache hit rate (5m):
    sum by (project) (rate(tuist_xcode_cache_hits_total[5m]))
      / (sum by (project) (rate(tuist_xcode_cache_hits_total[5m])) + sum by (project) (rate(tuist_xcode_cache_misses_total[5m])))
  Module cache hit rate (5m):
    sum by (project) (rate(tuist_module_cache_hits_total[5m]))
      / (sum by (project) (rate(tuist_module_cache_hits_total[5m])) + sum by (project) (rate(tuist_module_cache_misses_total[5m])))
  Gradle cache hit rate (5m):
    sum by (project) (rate(tuist_gradle_cache_hits_total[5m]))
      / (sum by (project) (rate(tuist_gradle_cache_hits_total[5m])) + sum by (project) (rate(tuist_gradle_cache_misses_total[5m])))
  Xcode cache miss throughput (5m):
    sum by (project) (rate(tuist_xcode_cache_misses_total[5m]))
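
The hit-rate queries share one shape, and its zero-traffic edge is worth noting. A small Python model (illustrative only) of hits / (hits + misses) over the rates: when there is no traffic in the window, the PromQL division yields NaN and the panel goes blank, mirrored here by returning None.

```python
def hit_rate(hits_per_s: float, misses_per_s: float):
    """Hit rate as computed by the dashboard queries:
    hits / (hits + misses), both taken as 5m rates.

    With zero traffic in the window the PromQL division is NaN and
    Grafana shows no data; this sketch returns None in that case.
    """
    total = hits_per_s + misses_per_s
    return None if total == 0 else hits_per_s / total

print(hit_rate(8.0, 2.0))  # -> 0.8
print(hit_rate(0.0, 0.0))  # -> None (no traffic, no defined hit rate)
```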

Example panels on the CLI usage dashboard:

  Invocations by command (5m):
    sum by (command) (rate(tuist_command_invocations_total[5m]))
  Command failure rate:
    sum by (command) (rate(tuist_command_invocations_total{status="failure"}[5m]))
      / sum by (command) (rate(tuist_command_invocations_total[5m]))
  CI vs local invocation split:
    sum by (is_ci, command) (rate(tuist_command_invocations_total[5m]))
  p90 command duration:
    histogram_quantile(0.9, sum by (command, le) (rate(tuist_command_duration_seconds_bucket[5m])))

Plugin distribution: the Grafana plugin catalog requires plugins to be
signed, and catalog submission takes Grafana-side review time. We submit
for catalog signing ahead of the launch so the release lands with the
plugin already available through the in-app catalog. If the signing
review is not complete by launch, we ship as an unsigned GitHub release
in the interim (Grafana allows explicit opt-in via the
GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS config) and swap to the
signed catalog entry as soon as it is approved.

Alternatives Considered

1. JSON query API instead of a scrape endpoint

Expose a structured JSON query API (filters, time ranges, grouping) that a
bespoke Grafana datasource plugin consumes. More flexible for ad-hoc slicing
(per-scheme p99 over the last year). Rejected for the initial release:

  1. Puts ClickHouse on the critical path of every dashboard render, with
    unbounded query shapes coming from customers. The in-app views have
    bounded shapes we control; a customer dashboard does not.
  2. Requires a fully custom Grafana datasource plugin. Scrape reuses the
    Prometheus datasource every Grafana installation already has.
  3. Reinvents time-series querying badly. Prometheus already does this well,
    and customers already know PromQL.

We can add a JSON query API later as a separate surface if customers ask for
ad-hoc slices that the scrape snapshot cannot express.

2. ClickHouse-backed scrape endpoint

Scrape handler queries ClickHouse (possibly through a rollup materialized
view) on every request. Conceptually simpler: no aggregator, no telemetry
plumbing, just a SELECT. Rejected:

  1. Couples every scrape to ClickHouse availability. A ClickHouse hiccup
    creates a metrics outage for every customer.
  2. Wrong semantics for Prometheus counters. ClickHouse rollups give
    instantaneous window counts, not monotonic counters. rate() over a
    window-count series reports garbage.
  3. Per-scrape query load scales with customer count times scrape frequency.
    Easy to DoS ourselves.

The hybrid (ClickHouse rollup for histograms, ETS for counters) is also
dismissed: two sources of truth for the same data and all the rollup
problems above.

3. telemetry_metrics_prometheus_core with filter-at-scrape

Use the off-the-shelf library and include account_id as a label on every
metric. At scrape time, filter the rendered text to rows matching the
authenticated account and strip the label. Rejected:

  1. The global registry grows with tenant count. Every account’s metrics sit
    in memory on every node, and the library was not designed for this.
  2. Filtering text post-render is fragile (string parsing, escaping edge
    cases). The registry has a structured view but the library does not expose
    per-label queries.
  3. One bad label on one internal telemetry event would leak that label into
    every customer’s metrics.

The tenant-aware ETS aggregator is more code but is easier to reason about
and has a natural isolation boundary.

4. Redis-backed aggregator via Redix

Store counters and histograms in Redis so every node in a multi-node server
deployment sees the same aggregate without cluster fan-in. Rejected for now:

  1. Network hop on every telemetry event. Build and test events are high
    frequency and latency-sensitive, and Tuist emits them on hot paths.
  2. Redis becomes critical path for metrics emission, and a Redis blip turns
    into a metrics outage.
  3. The current server runs on few enough nodes that per-node ETS plus
    scrape-time fan-in is tractable (see Open Questions).

If the server ever scales to enough nodes that per-scrape RPC fan-in becomes
expensive, Redis-backed aggregation is the natural next step.

5. Per-project endpoints instead of one account endpoint with a project label

GET /api/accounts/:account/projects/:project/metrics as the sole endpoint.
Rejected because it forces the customer’s scrape config to grow with their
project count, each with its own target and token. The single-target shape
matches how customers already think about Tuist (one account, many projects)
and matches the plugin’s one-time-setup promise. Per-project filtering is
still available in Grafana via the project label.

Open Questions

  1. Histograms and customer bucket preferences. A shared global bucket
    schedule keeps cardinality bounded but gives customers no say in the
    fidelity of their p99. A customer with a 10-second median build cannot
    distinguish p99=11s from p99=15s. Worth exposing a second, finer
    short-duration histogram? Or leave it for v2?
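
To make the fidelity loss concrete, here is a small Python model of histogram_quantile() (simplified linear interpolation; the coarse bucket bounds are hypothetical). Two workloads whose slowest build differs by four seconds produce identical bucket counts, so the estimated tail quantile is identical too.

```python
import bisect

def histogram_quantile(q, bounds, cumulative):
    # Simplified PromQL histogram_quantile(): find the bucket holding
    # the target rank, then interpolate linearly inside that bucket.
    rank = q * cumulative[-1]
    i = bisect.bisect_left(cumulative, rank)
    lo = bounds[i - 1] if i else 0.0
    below = cumulative[i - 1] if i else 0.0
    return lo + (bounds[i] - lo) * (rank - below) / (cumulative[i] - below)

# Hypothetical coarse shared bucket schedule: le = 1, 5, 10, 30 seconds.
BOUNDS = [1.0, 5.0, 10.0, 30.0]

def bucket_counts(durations):
    return [sum(d <= b for d in durations) for b in BOUNDS]

fast_tail = [0.8] * 99 + [11.0]  # slowest build truly 11 s
slow_tail = [0.8] * 99 + [15.0]  # slowest build truly 15 s

# Both outliers land in the same (10, 30] bucket...
assert bucket_counts(fast_tail) == bucket_counts(slow_tail)
# ...so the estimated tail quantile cannot tell the workloads apart:
q_fast = histogram_quantile(0.995, BOUNDS, bucket_counts(fast_tail))
q_slow = histogram_quantile(0.995, BOUNDS, bucket_counts(slow_tail))
print(q_fast, q_slow)  # both interpolate to 20.0
```

A second, finer short-duration histogram would shrink the bucket widths in exactly the range where interpolation error dominates; whether that is worth the extra cardinality is the open question.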

Thanks for putting this together :clap:. The questions I had were resolved by the end of it, like why not use ClickHouse for persistence, or how clustering and deployments interact with the information we hold in memory. Just a minor thing:

With this solution we’ll use :telemetry for two things: internal service telemetry, whose events also start with tuist_, and user-facing metrics. I’d consider using telemetry event names that make it obvious whether an event is internal or user-facing.

Additionally, since the docs now live in Elixir, the page that documents all available metrics could be generated dynamically from Elixir data structures into Markdown.

Good point, we can adjust the user-facing telemetry to include an additional metrics namespace, i.e., tuist_metrics_*.

Good idea, I’ll do that.