Summary
We propose a unified automation system for Tuist, inspired by Buildkite Workflows and Shopify Functions. Automations replace all current hardcoded project settings (flaky detection, quarantine, Slack alerts, performance alert rules) with a single, extensible model.
Each automation has a type (the detection strategy), configuration (thresholds, windows), trigger actions (what to do when the condition is met), and recovery actions (what to do when it clears). The system ships with a set of pre-built automation types covering test flakiness and project health, with a path to fully custom automations (JavaScript/WASM) in the future.
Motivation
Today, managing test health in Tuist requires configuring several disconnected settings:
- Flaky detection: a toggle + threshold in the project settings
- Quarantine: a separate toggle that couples to flaky detection
- Slack alerts: another toggle + channel, only for flaky tests
- Performance alerts: a completely different `alert_rules` system with its own UI
This leads to several problems:
- Rigidity: Teams cannot customize when tests get muted vs. skipped based on severity. A team wanting “mute if flakiness rate < 50%, skip if >= 50%” has no way to express this.
- Duplication: The flaky detection pipeline, quarantine logic, and alert rules all implement variations of the same pattern (condition → action → recovery) but share no infrastructure.
- Limited composability: You cannot combine actions (e.g., “mark as flaky AND send Slack AND mute the test” in one rule). Each behavior is a separate toggle.
- Hard to extend: Adding a new signal (e.g., “test duration regression”) or a new action (e.g., “send webhook”) requires new project fields and custom workers.
Teams onboarding to test insights have diverse needs. Rather than building one-off solutions for each, we need a system that is structured enough to cover common cases with a UI, but extensible enough to handle custom logic in the future.
Prior Art
Buildkite Test Engine Workflows
Buildkite Workflows is the closest model to what we want. Each workflow has:
- A detection strategy (e.g., “Passed on retry”, “Transition count”) with configurable parameters (window, threshold, branch)
- Actions: composable list of actions (change test state, add label, send Slack, send webhook, create Linear issue)
- Recovery condition with recovery actions
The detection strategy determines when to act. The actions determine what to do. They’re independent – any strategy can have any combination of actions. This is the core pattern we adopt.
Grafana Alerting
Grafana Alerting contributes the alarm lifecycle model:
- State machine: Normal → Pending → Alerting → Recovering → Normal
- Pending period: Condition must hold for a minimum time before firing (prevents flapping)
- Keep-firing-for: Delays recovery to prevent rapid fire-resolve cycles
- Contact points (Slack, webhooks, PagerDuty) as action targets
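As a rough sketch (state names from Grafana; the timer handling is simplified, and the parameter names `pendingFor` / `keepFiringFor` are illustrative, not Grafana's API), the lifecycle can be modeled as a transition function:

```javascript
// Simplified Grafana-style alarm lifecycle. `heldFor` is how long the
// current state has persisted; thresholds are in the same time unit.
function nextState(state, conditionMet, heldFor, { pendingFor, keepFiringFor }) {
  switch (state) {
    case "normal":
      return conditionMet ? "pending" : "normal";
    case "pending":
      if (!conditionMet) return "normal"; // flapped before firing
      return heldFor >= pendingFor ? "alerting" : "pending";
    case "alerting":
      return conditionMet ? "alerting" : "recovering";
    case "recovering":
      if (conditionMet) return "alerting"; // re-fired before clearing
      return heldFor >= keepFiringFor ? "normal" : "recovering";
  }
}
```

The pending period suppresses one-tick blips before firing, and keep-firing-for suppresses rapid fire-resolve cycles on the way back down – the same flapping protection the automation system's state transitions need.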
Shopify Functions
Shopify Functions demonstrates how to extend a platform with user-provided code:
- Functions declare their input via a GraphQL query (the platform pre-fetches only what’s needed)
- Functions return declarative output (the platform executes the declared operations)
- Functions run in a sandbox (WASM) with strict resource limits
We plan to adopt this pattern in a future phase for fully custom automations, where users can write JavaScript compiled to WASM that returns the same typed action array as the built-in automations.
Proposal
This RFC builds on the mute and skip quarantine modes RFC, which introduces a test case state field (enabled, muted, skipped). Automations can change this state via the change_state action.
The Model
An automation is attached to a project and has:
- Automation type: The detection strategy (e.g., `flakiness_rate`, `build_duration_regression`)
- Config: Type-specific parameters (thresholds, windows, baselines)
- Trigger actions: A list of typed actions to execute when the condition is first met
- Recovery: An optional recovery condition + actions to execute when the condition clears
The platform evaluates automations on a time-based cadence (configurable per automation). On each tick, the platform fetches the relevant data from ClickHouse, evaluates the condition, and executes actions on state transitions (not on every tick).
Automation Types
Test Automations
| Type | What it detects | Config |
|---|---|---|
| `flaky_run_count` | Tests with N+ unique flaky run groups in a window | `threshold`, `window` |
| `flakiness_rate` | Tests above X% flakiness in a window | `threshold`, `window` |
Project Health Automations
| Type | What it detects | Config | Scope |
|---|---|---|---|
| `build_duration_regression` | Build duration increased by X% vs. previous window | `metric` (p50/p90/p99/avg), `deviation`, `window` (executions) | `scheme`, `environment` |
| `test_duration_regression` | Test run duration increased by X% vs. previous window | `metric` (p50/p90/p99/avg), `deviation`, `window` (executions) | `scheme`, `environment` |
| `cache_hit_rate_regression` | Cache hit rate decreased by X% vs. previous window | `deviation`, `window` (executions) | `environment` |
| `bundle_size_regression` | Bundle size increased by X% vs. previous bundle | `metric` (install_size/download_size), `deviation` | `git_branch`, `bundle_name` |
These match the existing `alert_rules` system. Each compares a current window against a previous window (or the current bundle against the previous one) and fires when the deviation exceeds the configured percentage. New automation types (e.g., absolute thresholds) can be added over time without changing the data model.
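The shared regression rule can be sketched as a single comparison (the function name, the `direction` option, and the rounding-free arithmetic are illustrative, not the actual implementation):

```javascript
// Shared regression check: compare a metric over the current window against
// the previous window and fire when the relative change exceeds the
// configured deviation percentage.
function regressionFired(currentValue, previousValue, deviationPercent, opts = {}) {
  if (previousValue <= 0) return false; // no baseline to compare against
  // "increase" covers duration/size regressions; "decrease" covers
  // metrics where a drop is bad, like cache hit rate.
  const direction = opts.direction ?? "increase";
  const delta = direction === "increase"
    ? currentValue - previousValue
    : previousValue - currentValue;
  const deviation = (delta / previousValue) * 100;
  return deviation >= deviationPercent;
}

// p90 build duration went from 100s to 125s with a 20% threshold: fires.
regressionFired(125, 100, 20); // → true
// Cache hit rate dropped from 80% to 76% with a 10% threshold: 5% change, no fire.
regressionFired(76, 80, 10, { direction: "decrease" }); // → false
```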
Actions
Actions are composable – any automation type can have any combination of actions. Each action is a typed object:
| Type | Fields | Description | Idempotency |
|---|---|---|---|
| `change_state` | `state` | Change entity state (e.g., test case: muted/skipped/enabled) | No-op if already in that state |
| `add_label` | `label` | Add a label to the entity (e.g., “flaky”, “slow”, “regression”) | No-op if label already present |
| `remove_label` | `label` | Remove a label from the entity | No-op if label not present |
| `send_slack` | `channel`, `message`, `cooldown` | Send a Slack notification | Deduplicated per automation+entity, respects cooldown |
| `send_webhook` | `url`, `cooldown` | POST to a webhook URL | Same as `send_slack` |
Labels are an extensible tagging system. The current `is_flaky` boolean on test cases becomes derived from the presence of the “flaky” label: `add_label("flaky")` sets it, `remove_label("flaky")` clears it. This generalizes to any label (“slow”, “regression”, “critical”, or custom labels), and extends to other subjects in the future.
When multiple automations target the same entity, state has a severity hierarchy: skipped > muted > enabled. A change_state("skipped") always takes precedence over change_state("muted"), regardless of which automation fires first. The platform applies the highest severity state across all active automations.
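The severity rule reduces to picking the highest-ranked requested state; a minimal sketch (function and constant names are illustrative):

```javascript
// Severity hierarchy for test case state: skipped > muted > enabled.
const SEVERITY = { enabled: 0, muted: 1, skipped: 2 };

// requestedStates: the `state` values from all active change_state actions
// targeting the same entity. The highest-severity state wins, regardless
// of which automation fired first.
function resolveState(requestedStates) {
  if (requestedStates.length === 0) return "enabled";
  return requestedStates.reduce((a, b) => (SEVERITY[a] >= SEVERITY[b] ? a : b));
}

// A mute rule and a skip rule both active on the same test: skipped wins.
resolveState(["muted", "skipped"]); // → "skipped"
```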
State-setting actions (change_state, add_label, remove_label) are inherently idempotent. Event actions (send_slack, send_webhook) are deduplicated by the platform: they fire on state transitions (first time the condition is met, or first time after recovery), not on every evaluation tick. The optional cooldown field (e.g., "24h") controls re-notification for ongoing conditions like build regressions.
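The deduplication rule for event actions can be sketched as follows (the field names `lastSentAtMs` / `nowMs` and the two-state model are illustrative, not the real schema):

```javascript
// Decide whether an event action (send_slack / send_webhook) should fire.
// Fires on the normal -> triggered transition; for ongoing conditions it
// only re-fires once the optional cooldown has elapsed.
function shouldNotify({ previousState, currentState, lastSentAtMs, cooldownMs, nowMs }) {
  const transitioned = previousState !== "triggered" && currentState === "triggered";
  if (transitioned) return true; // first tick the condition is met
  if (currentState !== "triggered") return false;
  // Ongoing condition: silent unless a cooldown is configured and elapsed.
  if (cooldownMs == null || lastSentAtMs == null) return false;
  return nowMs - lastSentAtMs >= cooldownMs;
}
```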
The platform provides a url field on every entity, so Slack messages can include dashboard links:
*Test skipped due to high flakiness*
*Test:* testLogin
*Flakiness rate:* 52%
<https://tuist.dev/org/project/tests/test-cases/abc|View in dashboard>
Recovery
Recovery is optional per automation. When enabled, it has:
- Condition: When the alarm should clear (e.g., “14 days without trigger”, “metric drops below X”)
- Actions: What to do on recovery (same action types as trigger actions)
Recovery is evaluated on each tick for automations currently in the “triggered” state. When the recovery condition is met, the recovery actions are executed and the automation returns to its normal state for that entity.
Examples
Flaky test detection + auto-mute + Slack notification:
{
"name": "Flaky test detection",
"enabled": true,
"automation_type": "flakiness_rate",
"config": {
"threshold": 10,
"window": "30d"
},
"cadence": "5m",
"trigger_actions": [
{ "type": "add_label", "label": "flaky" },
{ "type": "change_state", "state": "muted" },
{ "type": "send_slack", "channel": "#test-alerts", "message": "Test ${name} muted (flakiness: ${flakinessRate}%)\n<${url}|View in dashboard>" }
],
"recovery": {
"enabled": true,
"config": {
"days_without_trigger": 14
}
},
"recovery_actions": [
{ "type": "remove_label", "label": "flaky" },
{ "type": "change_state", "state": "enabled" }
]
}
Auto-skip highly flaky tests (separate rule):
{
"name": "Auto-skip highly flaky",
"enabled": true,
"automation_type": "flakiness_rate",
"config": {
"threshold": 50,
"window": "30d"
},
"cadence": "5m",
"trigger_actions": [
{ "type": "change_state", "state": "skipped" },
{ "type": "send_slack", "channel": "#test-alerts", "message": "Test ${name} skipped (flakiness: ${flakinessRate}%)\n<${url}|View in dashboard>" }
],
"recovery": {
"enabled": true,
"config": {
"days_without_trigger": 14
}
},
"recovery_actions": [
{ "type": "change_state", "state": "enabled" }
]
}
Build duration regression:
{
"name": "Build p90 regression",
"enabled": true,
"automation_type": "build_duration_regression",
"config": {
"metric": "p90",
"deviation": 20,
"window": 100,
"scheme": "",
"environment": "ci"
},
"cadence": "10m",
"trigger_actions": [
{ "type": "send_slack", "channel": "#builds", "message": "p90 build duration regressed by ${deviation}%", "cooldown": "24h" }
]
}
Input Query
Each automation type has a known set of data it needs from ClickHouse. In Phase 1, the platform handles this internally – each type maps to a specific ClickHouse query. There is no user-facing input declaration for built-in types.
In Phase 2 (custom automations), each automation will declare its input needs via a GraphQL query, following the Shopify Functions pattern. We chose GraphQL over alternatives (flat JSON field lists, nested JSON, custom DSL) because it naturally handles field selection, parameterized fields, and association traversal with mature tooling. See the “Future: Custom Automations” section for details.
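As an illustration of the Phase 2 shape, a custom automation might declare its input like this (a hypothetical query – the schema, field names, and arguments are not yet designed and are shown only to convey the Shopify Functions pattern):

```graphql
query AutomationInput($windowDays: Int!) {
  testCase {
    name
    url
    labels
    flakinessRate(windowDays: $windowDays)
    recentRuns(limit: 50) {
      status
      durationMs
    }
  }
}
```

The platform would pre-fetch exactly this selection and pass the result to the function, so user code never queries ClickHouse directly.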
Evaluation
Automations run on a time-based cadence (e.g., every 5 minutes, every hour). On each tick:
- The platform fetches the relevant metrics from ClickHouse for the automation type (batched per project)
- For each entity (test case, build, etc.), evaluates the condition
- If the condition is newly met (state transition from normal to triggered):
  - Executes trigger actions (enqueues Oban jobs)
  - Records the triggered state in `automation_states` (ClickHouse)
- If the automation is in triggered state and recovery is enabled:
  - Evaluates the recovery condition
  - If met: executes recovery actions, updates state to recovered
In the future, we may add event-based evaluation (e.g., “evaluate immediately after a test run completes”) as an optimization.
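The tick described above can be sketched end-to-end; the `deps` helpers (`fetchMetrics`, `loadState`, `conditionMet`, etc.) are hypothetical stand-ins for the ClickHouse queries, state table, and Oban jobs, injected here so the control flow is visible:

```javascript
// One evaluation tick for a single automation. Actions run only on state
// transitions, never on every tick.
async function evaluateTick(automation, deps) {
  const entities = await deps.fetchMetrics(automation); // batched per project
  for (const entity of entities) {
    const prev = await deps.loadState(automation, entity); // "normal" | "triggered"
    if (prev === "normal" && deps.conditionMet(automation, entity)) {
      // Newly met: execute trigger actions once, record the transition.
      await deps.runActions(automation.trigger_actions, entity);
      await deps.saveState(automation, entity, "triggered");
    } else if (prev === "triggered" && automation.recovery?.enabled) {
      if (deps.recoveryMet(automation, entity)) {
        await deps.runActions(automation.recovery_actions, entity);
        await deps.saveState(automation, entity, "normal");
      }
    }
  }
}
```

Because the loop branches on the stored previous state, an ongoing condition produces no repeated action executions – only the cooldown logic on event actions can re-notify.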
Dashboard
The Automations page shows all automations as cards, grouped by category (Tests, Project Health):
Tests
+---------------------------------------------------------------+
| Flaky test detection [on/off] |
| flakiness_rate >= 10% in last 30 days |
| Actions: add_label(flaky), change_state(muted), send_slack |
| Recovery: 14 days -> remove_label(flaky), change_state(enabled)|
| [Edit] [Test] [Logs] |
+---------------------------------------------------------------+
| Auto-skip highly flaky [on/off] |
| flakiness_rate >= 50% in last 30 days |
| Actions: change_state(skipped), send_slack |
| Recovery: 14 days -> change_state(enabled) |
| [Edit] [Test] [Logs] |
+---------------------------------------------------------------+
Project Health
+---------------------------------------------------------------+
| Build p90 regression [on/off] |
| p90 deviation >= 20% above ci_default_branch |
| Actions: send_slack (cooldown: 24h) |
| [Edit] [Test] [Logs] |
+---------------------------------------------------------------+
[+ New automation]
Edit opens a type-specific builder UI with the condition parameters, action list, and recovery config. Test evaluates the automation against current data and shows what would trigger.
“+ New automation” offers the available automation types grouped by category.
Backward Compatibility
- Existing projects with `auto_mark_flaky_tests` / `auto_quarantine_flaky_tests` / `flaky_test_alerts_enabled` settings will have automations generated from their current configuration. No behavioral change.
- Existing `alert_rules` for build/test duration, cache, and bundle size will be migrated to automations.
- The old project settings fields and the `alert_rules` table are kept temporarily and eventually removed.
Default Automations
New projects get one default automation:
- Flaky test detection (enabled): `flakiness_rate` >= 1% in 30 days, actions: [add_label(“flaky”)], recovery: 14 days → [remove_label(“flaky”)]
Future: Custom Automations
The built-in automation types cover common patterns, but teams will eventually need logic that can’t be expressed through configuration alone (e.g., “skip only if flaky AND the test was added in the last 7 days AND it’s not in the critical suite”).
When this happens, users can eject to code – converting the automation to a JavaScript function compiled to WebAssembly via Javy (the same toolchain Shopify uses for Shopify Functions). The ejection is a one-way door, with a confirmation dialog making this clear.
The custom function receives pre-fetched data (declared via a GraphQL input query) and returns the same typed action array as built-in automations. This means the action execution infrastructure (state changes, labels, Slack, webhooks, idempotency, cooldowns) is shared between both modes.
// Custom automation: complex flaky test handling
function evaluate(testCase) {
const actions = [];
if (testCase.flakinessRate >= 50 && testCase.recentRuns.some(r => r.status === "failure")) {
actions.push({ type: "add_label", label: "flaky" });
actions.push({ type: "change_state", state: "skipped" });
actions.push({ type: "send_slack", channel: "#alerts", message: `${testCase.name} skipped` });
} else if (testCase.flakinessRate >= 10) {
actions.push({ type: "add_label", label: "flaky" });
actions.push({ type: "change_state", state: "muted" });
}
return actions;
}
Key design decisions for custom automations (to be detailed in a future RFC):
- Runtime: JavaScript compiled to WASM via Javy, executed server-side via Wasmex (wrapping wasmtime), testable in the browser via native WASM. We considered Lua (via Luerl) but chose JS/WASM for developer familiarity, stronger sandboxing, and the Shopify-validated toolchain.
- Input declaration: GraphQL queries, following the Shopify Functions pattern. Chosen over flat JSON field lists (too limited), nested JSON (reinvents GraphQL without tooling), and custom DSLs (maintenance burden).
- Storage: JS source in Postgres, compiled .wasm in S3 (Tigris). Compilation is instant on save.
- Sandboxing: WASM hardware-level memory isolation, fuel-based execution limits, wall clock timeouts.
Why Built-in Automations First
We considered starting with fully custom automations (JS/WASM) from day one, but chose to ship built-in automation types with a structured JSON representation first. The JSON maps directly to our database model, making it straightforward to build, validate, and evolve.
There are two main reasons for this approach:
- Dedicated builder UI: Each built-in automation type gets a purpose-built UI that’s intuitive to use. Users configure thresholds, select Slack channels, and toggle recovery through dropdowns and inputs – not a code editor. This is the right experience for the majority of teams whose needs fit the common patterns.
- Platform-owned evolution: By owning the built-in automation types, we can iterate on the common set of automations we see as useful across teams – adding new types, refining defaults, improving evaluation logic – without requiring users to update their code. The structured JSON representation is a stable contract that we can extend while maintaining backward compatibility.
At the same time, this puts the foundation in place (the typed action array, the evaluation engine, the state tracking) to eventually allow teams to completely customize this layer by ejecting to code when the built-in types aren’t enough.
Since the JSON representation is a complete description of an automation, it can eventually support import/export – allowing teams to share automations across projects, version-control them alongside their code, or generate them via LLMs. This is not something we’ll build initially, but the structured format makes it straightforward to add later.