Automations evolution

Summary

We propose a unified automation system for Tuist, inspired by Buildkite Workflows and Shopify Functions. Automations replace all current hardcoded project settings (flaky detection, quarantine, Slack alerts, performance alert rules) with a single, extensible model.

Each automation has a type (the detection strategy), configuration (thresholds, windows), trigger actions (what to do when the condition is met), and recovery actions (what to do when it clears). The system ships with a set of pre-built automation types covering test flakiness and project health, with a path to fully custom automations (JavaScript/WASM) in the future.

Motivation

Today, managing test health in Tuist requires configuring several disconnected settings:

  • Flaky detection: a toggle + threshold in the project settings
  • Quarantine: a separate toggle that couples to flaky detection
  • Slack alerts: another toggle + channel, only for flaky tests
  • Performance alerts: a completely different alert_rules system with its own UI

This leads to several problems:

  1. Rigidity: Teams cannot customize when tests get muted vs. skipped based on severity. A team wanting “mute if flakiness rate < 50%, skip if >= 50%” has no way to express this.
  2. Duplication: The flaky detection pipeline, quarantine logic, and alert rules all implement variations of the same pattern (condition → action → recovery) but share no infrastructure.
  3. Limited composability: You cannot combine actions (e.g., “mark as flaky AND send Slack AND mute the test” in one rule). Each behavior is a separate toggle.
  4. Hard to extend: Adding a new signal (e.g., “test duration regression”) or a new action (e.g., “send webhook”) requires new project fields and custom workers.

Teams onboarding to test insights have diverse needs. Rather than building one-off solutions for each, we need a system that is structured enough to cover common cases with a UI, but extensible enough to handle custom logic in the future.

Prior Art

Buildkite Test Engine Workflows

Buildkite Workflows is the closest model to what we want. Each workflow has:

  • A detection strategy (e.g., “Passed on retry”, “Transition count”) with configurable parameters (window, threshold, branch)
  • Actions: composable list of actions (change test state, add label, send Slack, send webhook, create Linear issue)
  • Recovery condition with recovery actions

The detection strategy determines when to act. The actions determine what to do. They’re independent – any strategy can have any combination of actions. This is the core pattern we adopt.

Grafana Alerting

Grafana Alerting contributes the alarm lifecycle model:

  • State machine: Normal → Pending → Alerting → Recovering → Normal
  • Pending period: Condition must hold for a minimum time before firing (prevents flapping)
  • Keep-firing-for: Delays recovery to prevent rapid fire-resolve cycles
  • Contact points (Slack, webhooks, PagerDuty) as action targets
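
This lifecycle can be sketched as a small transition table (a simplified sketch of the model above; the event names are illustrative, not Grafana's actual API):

```javascript
// Simplified alert lifecycle following the Grafana-style states above.
// Event names (conditionMet, pendingElapsed, ...) are illustrative only.
const TRANSITIONS = {
  normal:     { conditionMet: "pending" },
  pending:    { pendingElapsed: "alerting", conditionCleared: "normal" },
  alerting:   { conditionCleared: "recovering" },
  recovering: { keepFiringElapsed: "normal", conditionMet: "alerting" },
};

// Return the next state, staying in the current state
// when no transition matches the event.
function nextState(current, event) {
  return (TRANSITIONS[current] || {})[event] || current;
}
```

The pending period and keep-firing-for delay are what make `pending` and `recovering` distinct states rather than instant transitions.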

Shopify Functions

Shopify Functions demonstrates how to extend a platform with user-provided code:

  • Functions declare their input via a GraphQL query (the platform pre-fetches only what’s needed)
  • Functions return declarative output (the platform executes the declared operations)
  • Functions run in a sandbox (WASM) with strict resource limits

We plan to adopt this pattern in a future phase for fully custom automations, where users can write JavaScript compiled to WASM that returns the same typed action array as the built-in automations.

Proposal

This RFC builds on the mute and skip quarantine modes RFC, which introduces a test case state field (enabled, muted, skipped). Automations can change this state via the change_state action.

The Model

An automation is attached to a project and has:

  1. Automation type: The detection strategy (e.g., flakiness_rate, build_duration_deviation)
  2. Config: Type-specific parameters (thresholds, windows, baselines)
  3. Trigger actions: A list of typed actions to execute when the condition is first met
  4. Recovery: An optional recovery condition + actions to execute when the condition clears

The platform evaluates automations on a time-based cadence (configurable per automation). On each tick, the platform fetches the relevant data from ClickHouse, evaluates the condition, and executes actions on state transitions (not on every tick).

Automation Types

Test Automations

| Type | What it detects | Config |
| --- | --- | --- |
| flaky_run_count | Tests with N+ unique flaky run groups in a window | threshold, window |
| flakiness_rate | Tests above X% flakiness in a window | threshold, window |

Project Health Automations

| Type | What it detects | Config | Scope |
| --- | --- | --- | --- |
| build_duration_regression | Build duration increased by X% vs previous window | metric (p50/p90/p99/avg), deviation, window (executions) | scheme, environment |
| test_duration_regression | Test run duration increased by X% vs previous window | metric (p50/p90/p99/avg), deviation, window (executions) | scheme, environment |
| cache_hit_rate_regression | Cache hit rate decreased by X% vs previous window | deviation, window (executions) | environment |
| bundle_size_regression | Bundle size increased by X% vs previous bundle | metric (install_size/download_size), deviation | git_branch, bundle_name |

These match the existing alert_rules system. Each compares a current window against a previous window (or the current bundle against the previous bundle) and fires when the deviation exceeds the configured percentage. New automation types (e.g., absolute thresholds) can be added over time without changing the data model.
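
The comparison these regression types share can be sketched as follows (the helper name and the percentage formula are assumptions for illustration, not the shipped implementation):

```javascript
// Compare a current window's metric against the previous window and report
// whether the relative change exceeds the configured deviation percentage.
// For types where "worse" means a decrease (e.g. cache hit rate),
// pass direction = "decrease".
function exceedsDeviation(previous, current, deviationPercent, direction = "increase") {
  if (previous === 0) return false; // no baseline to compare against
  const changePercent = ((current - previous) / previous) * 100;
  return direction === "increase"
    ? changePercent >= deviationPercent
    : -changePercent >= deviationPercent;
}
```

For example, a p90 build duration going from 100s to 125s is a 25% increase, so it trips a 20% deviation threshold, while 110s (a 10% increase) does not.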

Actions

Actions are composable – any automation type can have any combination of actions. Each action is a typed object:

| Type | Fields | Description | Idempotency |
| --- | --- | --- | --- |
| change_state | state | Change entity state (e.g., test case: muted/skipped/enabled) | No-op if already in that state |
| add_label | label | Add a label to the entity (e.g., “flaky”, “slow”, “regression”) | No-op if label already present |
| remove_label | label | Remove a label from the entity | No-op if label not present |
| send_slack | channel, message, cooldown | Send a Slack notification | Deduplicated per automation+entity, respects cooldown |
| send_webhook | url, cooldown | POST to a webhook URL | Deduplicated per automation+entity, respects cooldown |

Labels are an extensible tagging system. The current is_flaky boolean on test cases becomes derived from the presence of the “flaky” label – add_label("flaky") sets it, remove_label("flaky") clears it. This generalizes to any label (“slow”, “regression”, “critical”, or custom labels), and extends to other subjects in the future.
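
A minimal sketch of that derivation, assuming labels are stored as a set per entity (the helper names are hypothetical):

```javascript
// Labels live in a set on the entity; is_flaky is derived, never stored directly.
function applyLabelAction(labels, action) {
  const next = new Set(labels);
  if (action.type === "add_label") next.add(action.label);       // no-op if already present
  if (action.type === "remove_label") next.delete(action.label); // no-op if absent
  return next;
}

// The former is_flaky boolean, now just a label check.
const isFlaky = (labels) => labels.has("flaky");
```

Because the flag is derived, any automation (or a future custom function) that manipulates labels automatically keeps it consistent.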

When multiple automations target the same entity, state has a severity hierarchy: skipped > muted > enabled. A change_state("skipped") always takes precedence over change_state("muted"), regardless of which automation fires first. The platform applies the highest severity state across all active automations.
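
Resolving the effective state across triggered automations might look like this (the resolution helper is a sketch; only the severity order is specified by this RFC):

```javascript
// Severity order for test case states: skipped > muted > enabled.
const SEVERITY = { enabled: 0, muted: 1, skipped: 2 };

// Given the states requested by all currently triggered automations,
// apply the highest-severity one, defaulting to "enabled".
function resolveState(requestedStates) {
  return requestedStates.reduce(
    (winner, state) => (SEVERITY[state] > SEVERITY[winner] ? state : winner),
    "enabled"
  );
}
```

With the two example rules later in this RFC, a test at 60% flakiness triggers both, and `resolveState(["muted", "skipped"])` yields `"skipped"`.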

State-setting actions (change_state, add_label, remove_label) are inherently idempotent. Event actions (send_slack, send_webhook) are deduplicated by the platform: they fire on state transitions (first time the condition is met, or first time after recovery), not on every evaluation tick. The optional cooldown field (e.g., "24h") controls re-notification for ongoing conditions like build regressions.
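
The transition-plus-cooldown rule for event actions can be sketched as (a sketch under assumed state names; the real deduplication lives in the platform):

```javascript
// Decide whether an event action (send_slack, send_webhook) should fire.
// It fires on the transition into "triggered", or, for ongoing conditions,
// once the optional cooldown has elapsed since the last notification.
function shouldNotify({ previousState, currentState, lastSentAt, cooldownMs, now }) {
  const transitioned = previousState !== "triggered" && currentState === "triggered";
  if (transitioned) return true;
  if (currentState !== "triggered" || cooldownMs == null) return false;
  return lastSentAt == null || now - lastSentAt >= cooldownMs;
}
```

Without a cooldown, an ongoing condition notifies exactly once per trigger/recovery cycle; with `"24h"`, it re-notifies daily while the condition persists.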

The platform provides a url field on every entity, so Slack messages can include dashboard links:

*Test skipped due to high flakiness*
*Test:* testLogin
*Flakiness rate:* 52%
<https://tuist.dev/org/project/tests/test-cases/abc|View in dashboard>
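
The `${...}` placeholders used in message templates throughout this RFC could be expanded with a small interpolation step (a sketch; the actual templating mechanism is not specified here):

```javascript
// Expand ${...} placeholders in a message template from entity fields.
// Unknown placeholders are left intact rather than silently erased.
function renderMessage(template, fields) {
  return template.replace(/\$\{(\w+)\}/g, (match, key) =>
    key in fields ? String(fields[key]) : match
  );
}
```

For example, `renderMessage("Test ${name} muted (${flakinessRate}%)", { name: "testLogin", flakinessRate: 52 })` produces `"Test testLogin muted (52%)"`.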

Recovery

Recovery is optional per automation. When enabled, it has:

  • Condition: When the alarm should clear (e.g., “14 days without trigger”, “metric drops below X”)
  • Actions: What to do on recovery (same action types as trigger actions)

Recovery is evaluated on each tick for automations currently in the “triggered” state. When the recovery condition is met, the recovery actions are executed and the automation returns to its normal state for that entity.

Examples

Flaky test detection + auto-mute + Slack notification:

{
  "name": "Flaky test detection",
  "enabled": true,
  "automation_type": "flakiness_rate",
  "config": {
    "threshold": 10,
    "window": "30d"
  },
  "cadence": "5m",
  "trigger_actions": [
    { "type": "add_label", "label": "flaky" },
    { "type": "change_state", "state": "muted" },
    { "type": "send_slack", "channel": "#test-alerts", "message": "Test ${name} muted (flakiness: ${flakinessRate}%)\n<${url}|View in dashboard>" }
  ],
  "recovery": {
    "enabled": true,
    "config": {
      "days_without_trigger": 14
    }
  },
  "recovery_actions": [
    { "type": "remove_label", "label": "flaky" },
    { "type": "change_state", "state": "enabled" }
  ]
}

Auto-skip highly flaky tests (separate rule):

{
  "name": "Auto-skip highly flaky",
  "enabled": true,
  "automation_type": "flakiness_rate",
  "config": {
    "threshold": 50,
    "window": "30d"
  },
  "cadence": "5m",
  "trigger_actions": [
    { "type": "change_state", "state": "skipped" },
    { "type": "send_slack", "channel": "#test-alerts", "message": "Test ${name} skipped (flakiness: ${flakinessRate}%)\n<${url}|View in dashboard>" }
  ],
  "recovery": {
    "enabled": true,
    "config": {
      "days_without_trigger": 14
    }
  },
  "recovery_actions": [
    { "type": "change_state", "state": "enabled" }
  ]
}

Build duration regression:

{
  "name": "Build p90 regression",
  "enabled": true,
  "automation_type": "build_duration_regression",
  "config": {
    "metric": "p90",
    "deviation": 20,
    "window": 100,
    "scheme": "",
    "environment": "ci"
  },
  "cadence": "10m",
  "trigger_actions": [
    { "type": "send_slack", "channel": "#builds", "message": "p90 build duration regressed by ${deviation}%", "cooldown": "24h" }
  ]
}

Input Query

Each automation type has a known set of data it needs from ClickHouse. In Phase 1, the platform handles this internally – each type maps to a specific ClickHouse query. There is no user-facing input declaration for built-in types.

In Phase 2 (custom automations), each automation will declare its input needs via a GraphQL query, following the Shopify Functions pattern. We chose GraphQL over alternatives (flat JSON field lists, nested JSON, custom DSL) because it naturally handles field selection, parameterized fields, and association traversal with mature tooling. See the “Future: Custom Automations” section for details.

Evaluation

Automations run on a time-based cadence (e.g., every 5 minutes, every hour). On each tick:

  1. The platform fetches the relevant metrics from ClickHouse for the automation type (batched per project)
  2. For each entity (test case, build, etc.), evaluates the condition
  3. If the condition is newly met (state transition from normal to triggered):
    • Executes trigger actions (enqueues Oban jobs)
    • Records the triggered state in automation_states (ClickHouse)
  4. If the automation is in triggered state and recovery is enabled:
    • Evaluates the recovery condition
    • If met: executes recovery actions, updates state to recovered
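
The per-tick flow above can be sketched as follows (the `platform` helpers are illustrative stand-ins for the real services, not actual APIs):

```javascript
// One evaluation tick for a single automation. The platform argument stands
// in for the real services: fetchMetrics (ClickHouse query), loadState and
// saveState (automation_states), enqueueActions (Oban jobs).
function evaluateTick(automation, platform) {
  for (const entity of platform.fetchMetrics(automation)) {
    const state = platform.loadState(automation, entity); // "normal" | "triggered"

    if (state === "normal" && platform.evaluateCondition(automation, entity)) {
      // Newly met: actions fire on this transition only, not on every tick.
      platform.enqueueActions(automation.trigger_actions, entity);
      platform.saveState(automation, entity, "triggered");
    } else if (state === "triggered" && automation.recovery && automation.recovery.enabled) {
      if (platform.evaluateRecovery(automation, entity)) {
        platform.enqueueActions(automation.recovery_actions, entity);
        platform.saveState(automation, entity, "normal");
      }
    }
  }
}
```

Note that an entity already in the triggered state produces no repeated trigger actions; only recovery is evaluated for it.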

In the future, we may add event-based evaluation (e.g., “evaluate immediately after a test run completes”) as an optimization.

Dashboard

The Automations page shows all automations as cards, grouped by category (Tests, Project Health):

Tests
+---------------------------------------------------------------+
| Flaky test detection                              [on/off]    |
| flakiness_rate >= 10% in last 30 days                         |
| Actions: add_label(flaky), change_state(muted), send_slack    |
| Recovery: 14 days -> remove_label(flaky), change_state(enabled)|
|                                          [Edit] [Test] [Logs] |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
| Auto-skip highly flaky                            [on/off]    |
| flakiness_rate >= 50% in last 30 days                         |
| Actions: change_state(skipped), send_slack                    |
| Recovery: 14 days -> change_state(enabled)                    |
|                                          [Edit] [Test] [Logs] |
+---------------------------------------------------------------+

Project Health
+---------------------------------------------------------------+
| Build p90 regression                              [on/off]    |
| p90 deviation >= 20% above ci_default_branch                  |
| Actions: send_slack (cooldown: 24h)                           |
|                                          [Edit] [Test] [Logs] |
+---------------------------------------------------------------+

                         [+ New automation]

Edit opens a type-specific builder UI with the condition parameters, action list, and recovery config. Test evaluates the automation against current data and shows what would trigger.

“+ New automation” offers the available automation types grouped by category.

Backward Compatibility

  • Existing projects with auto_mark_flaky_tests / auto_quarantine_flaky_tests / flaky_test_alerts_enabled settings will have automations generated from their current configuration. No behavioral change.
  • Existing alert_rules for build/test duration, cache, and bundle size will be migrated to automations.
  • Old project settings fields and the alert_rules table are kept temporarily and eventually removed.

Default Automations

New projects get one default automation:

  • Flaky test detection (enabled): flakiness_rate >= 1% in 30 days, actions: [add_label(“flaky”)], recovery: 14 days → [remove_label(“flaky”)]

Future: Custom Automations

The built-in automation types cover common patterns, but teams will eventually need logic that can’t be expressed through configuration alone (e.g., “skip only if flaky AND the test was added in the last 7 days AND it’s not in the critical suite”).

When this happens, users can eject to code – converting the automation to a JavaScript function compiled to WebAssembly via Javy (the same toolchain Shopify uses for Shopify Functions). The ejection is a one-way door, with a confirmation dialog making this clear.

The custom function receives pre-fetched data (declared via a GraphQL input query) and returns the same typed action array as built-in automations. This means the action execution infrastructure (state changes, labels, Slack, webhooks, idempotency, cooldowns) is shared between both modes.

// Custom automation: complex flaky test handling
function evaluate(testCase) {
  const actions = [];

  if (testCase.flakinessRate >= 50 && testCase.recentRuns.some(r => r.status === "failure")) {
    actions.push({ type: "add_label", label: "flaky" });
    actions.push({ type: "change_state", state: "skipped" });
    actions.push({ type: "send_slack", channel: "#alerts", message: `${testCase.name} skipped` });
  } else if (testCase.flakinessRate >= 10) {
    actions.push({ type: "add_label", label: "flaky" });
    actions.push({ type: "change_state", state: "muted" });
  }

  return actions;
}

Key design decisions for custom automations (to be detailed in a future RFC):

  • Runtime: JavaScript compiled to WASM via Javy, executed server-side via Wasmex (wrapping wasmtime), testable in the browser via native WASM. We considered Lua (via Luerl) but chose JS/WASM for developer familiarity, stronger sandboxing, and the Shopify-validated toolchain.
  • Input declaration: GraphQL queries, following the Shopify Functions pattern. Chosen over flat JSON field lists (too limited), nested JSON (reinvents GraphQL without tooling), and custom DSLs (maintenance burden).
  • Storage: JS source in Postgres, compiled .wasm in S3 (Tigris). Compilation is instant on save.
  • Sandboxing: WASM hardware-level memory isolation, fuel-based execution limits, wall clock timeouts.

Why Built-in Automations First

We considered starting with fully custom automations (JS/WASM) from day one, but chose to ship built-in automation types with a structured JSON representation first. The JSON maps directly to our database model, making it straightforward to build, validate, and evolve.

There are two main reasons for this approach:

  1. Dedicated builder UI: Each built-in automation type gets a purpose-built UI that’s intuitive to use. Users configure thresholds, select Slack channels, and toggle recovery through dropdowns and inputs – not a code editor. This is the right experience for the majority of teams whose needs fit the common patterns.

  2. Platform-owned evolution: By owning the built-in automation types, we can iterate on the common set of automations that we see useful across teams – adding new types, refining defaults, improving evaluation logic – without requiring users to update their code. The structured JSON representation is a stable contract that we can extend while maintaining backward compatibility.

At the same time, this puts the foundation in place (the typed action array, the evaluation engine, the state tracking) to eventually allow teams to completely customize this layer by ejecting to code when the built-in types aren’t enough.

Since the JSON representation is a complete description of an automation, it can eventually support import/export – allowing teams to share automations across projects, version-control them alongside their code, or generate them via LLMs. This is not something we’ll build initially, but the structured format makes it straightforward to add later.

Input declaration: GraphQL queries

I don’t think we should be implementing GraphQL. We already have a REST API + MCP + LiveView; increasing our API surface even more will be hard to manage.

JavaScript compiled to WASM via Javy, executed server-side via Wasmex (wrapping wasmtime), testable in the browser via native WASM.

Sandboxing: WASM hardware-level memory isolation

I’d be very interested in more detail on this before forming an opinion. What infrastructure does this run on? Why this over something like Dynamic Workers?

WASM functions would be running on our own infra. The main benefits are:

  • we fully own this piece instead of delegating this to a third party (also beneficial to self-hosting customers)
  • WASM allows teams to eventually compile the function in a language of their choice. As we try to stay close to the individual build systems, teams might appreciate this.

The main benefit of GraphQL is that the consumer can choose what they need. Especially if we want to include analytical data, having a way to declaratively tell our server what the function needs is crucial. But before we get there, we can start with a set of subjects the function can observe, each with a predefined structure. GraphQL would only make sense if those predefined queries started to get expensive and we wanted consumers to be more selective about what they need.

Thanks for putting this together :folded_hands:

I’m on board with the solution, phasing, and layering. Once we iterate to something more programmable, it’ll feel like an ejection to the user, but all we’ll end up doing is allowing them to drill down a layer and get to the engine primitives that make this possible.

Another example you might want to include as a reference is Ona and their automations, which I believe embrace a .yaml format:

tasks:
  install:
    name: Install dependencies
    command: npm ci
    triggeredBy:
      - postDevcontainerStart

I’d keep an eye on how this space (agentic automation) evolves, because I believe we’ll see patterns emerge around how to codify automation that includes non-deterministic agentic steps. I also think this primitive has the potential to be shareable by the community as our platform gains more capabilities (e.g., as we expose more data and signals).