Summary
This RFC proposes adding code coverage as a first-class Tuist platform capability, then using the collected coverage evidence to power coverage-backed Test Impact Analysis (TIA) for Gradle projects and non-generated Xcode projects.
The goal is to reduce test feedback time without depending on generated-project metadata or opaque machine-learning predictions. Instead of asking Tuist to guess which tests are relevant, we collect evidence about which source files each test, suite, or module covers. A future test run can then skip tests that have previously passed against the same covered and tracked files.
This is inspired by Datadog Test Impact Analysis, which uses per-test or per-suite code coverage to determine whether tests can be skipped. It is intentionally different from Develocity Predictive Test Selection, which uses a server-side predictive model trained on historical code changes and test outcomes.
The proposed path is:
- Add coverage ingestion and dashboards for Xcode and Gradle.
- Store coverage evidence at file granularity.
- Design the selection service and product experience for Xcode and Gradle together.
- Start with Xcode and Gradle simulation paths; implementation can land one build system at a time based on instrumentation maturity, performance, and safety.
- Keep the existing hash-based selective testing for generated projects.
Motivation
Tuist already helps teams speed up test feedback through generated-project selective testing, test sharding, test insights, and quarantine. The current selective testing approach works well when Tuist owns enough of the graph to hash test targets and their transitive dependencies. That is a strong model for generated Xcode projects, but it does not naturally extend to:
- Non-generated Xcode projects, where Tuist does not own the complete project graph.
- Gradle projects, where test selection should integrate with Gradle’s project, task, and test filtering model.
- Poorly modularized codebases, where target-level selection still runs too many tests.
There is also a separate product opportunity around code coverage. Users already expect CI/test platforms to answer questions like:
- Did coverage go up or down on this pull request?
- Which modules or files are poorly covered?
- Which tests are expensive relative to the coverage they provide?
- Which changed files are not covered by any tests?
If Tuist is going to collect coverage anyway, coverage-backed TIA is a natural second layer. It gives us a more explainable and safer path than starting with ML-based predictive test selection.
Current State
Tuist already has several pieces that make this direction realistic:
- Test history is stored for both `xcode` and `gradle` build systems in `test_runs`, `test_cases`, and `test_case_runs`.
- The Gradle plugin already collects test module, suite/class, and case/method results through `TuistTestInsights`.
- The Gradle plugin already applies suite-level filters for sharding with `Test.filter.includeTestsMatching`.
- Xcode result bundles are already parsed locally or server-side through the `.xcresult` processing pipeline.
- Generated Xcode projects already support scheme/test-action coverage settings in the project model.
- The current selective testing docs explicitly call out that target-level hashing cannot detect in-code dependencies between tests and sources.
What is missing:
- A platform-level coverage data model.
- Coverage ingestion for Xcode and Gradle.
- A stable mapping from test identity to covered files.
- Changed-file and file-hash snapshots per run.
- A server endpoint that returns selected/skipped tests before execution.
- A simulation/reporting mode that proves the selection would have been safe before enforcing skips.
Proposed Solution
Product Shape
Introduce two related capabilities:
- Code Coverage
  - Ingest coverage from Xcode and Gradle test runs.
  - Show project, module, file, branch, and pull request coverage.
  - Show coverage deltas in pull/merge request comments.
  - Expose coverage data through APIs and MCP tools.
- Test Impact Analysis
  - Use coverage evidence to skip tests that are unlikely to be affected by the current change.
  - Explain every skip with the covered files and previous passing evidence.
  - Fall back to running tests when evidence is missing, stale, or ambiguous.
  - Provide simulation mode before enforcement.
Terminology
This RFC uses “Test Impact Analysis” rather than “Predictive Test Selection” because the first implementation should be evidence-based rather than probabilistic. We can add predictive ranking later, but the first trust story should be:
This test passed before when the files it covers had the same content, so running it again is not expected to add signal.
Coverage Evidence
Coverage evidence is an observation collected from a concrete test or coverage run. It records that a test scope covered a repository-relative source file at a specific commit, when that file had a specific Git blob hash and reported coverage.
For example:
In test run R, at commit C,
suite CalculatorTests covered Sources/Calculator/Add.swift,
where Add.swift had Git blob hash H,
and the coverage tool reported the covered lines/counts for that file.
Tuist can use this evidence later to decide whether a test is skippable. If a suite previously passed and all files covered by that suite still have the same hashes, the suite can become a skip candidate. If any covered file changed, or if no passing evidence exists, Tuist should run the suite.
This is intentionally called evidence rather than truth: coverage tools report what happened during one run under one configuration. Coverage evidence can be stale, incomplete, or missing non-code dependencies such as fixtures, snapshots, environment variables, generated resources, or network responses.
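To make the evidence shape concrete, here is a minimal sketch of what a stored record could contain. The type and field names are illustrative assumptions, not a proposed schema:

```swift
/// Illustrative sketch of a coverage-evidence record (hypothetical names, not a schema proposal).
struct CoverageEvidence: Codable {
    /// Identity of the test scope that produced the evidence.
    enum Scope: Codable {
        case test(suite: String, name: String)
        case suite(String)
        case module(String)
    }

    let runID: String            // test run that observed the coverage
    let commitSHA: String        // commit the run executed against
    let scope: Scope             // e.g. suite "CalculatorTests"
    let filePath: String         // repository-relative covered file
    let gitBlobOID: String       // content identity of the file at that commit
    let hashAlgorithm: String    // "sha1" or "sha256", so repositories stay distinguishable
    let coveredLines: [Int]      // or execution counts, depending on the tool
    let passed: Bool             // only passing evidence makes a test skippable
}
```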
Coverage Collection
Coverage tools usually collect coverage by instrumenting the code under test, running the tests, and exporting execution counts keyed by source files and lines. The instrumentation can happen at compile time, at bytecode/class-load time, or through a language runtime hook.
Examples:
- Xcode/LLVM coverage instruments the built products and writes coverage into the test result bundle, which can be inspected with `xccov`.
- JaCoCo instruments JVM class files, commonly through a Java agent, and writes execution data that can be rendered as XML/HTML reports.
- Istanbul/nyc instruments JavaScript code and emits reports such as LCOV.
- Some SaaS tools do not generate coverage themselves; they observe or upload coverage produced by the language ecosystem and normalize it into their own model.
Run-level coverage is enough for dashboards, coverage trends, and PR coverage deltas. Test Impact Analysis needs more granular evidence: ideally each test should be mapped to the files it covered. Suite-level or module-level evidence can be useful as a fallback, but it is less precise and will select more tests than necessary. Tools typically get per-test evidence by integrating with the test framework and either snapshotting/resetting coverage counters around test events, collecting coverage in isolated test processes, or using framework/runtime support to associate coverage with the active test.
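As one illustration of the snapshot/reset approach, the following sketch attributes coverage to a single test by diffing counters captured before and after it ran. `readCoverageCounters` is a hypothetical stand-in for whatever the instrumentation actually exposes:

```swift
/// Hypothetical counter snapshot: repository-relative file path -> line number -> execution count.
typealias CoverageSnapshot = [String: [Int: Int]]

/// Stand-in for reading the instrumentation's current counters (hypothetical API).
func readCoverageCounters() -> CoverageSnapshot { [:] }

/// Attribute coverage to a single test by diffing snapshots taken before and after it ran.
func coverageDelta(before: CoverageSnapshot, after: CoverageSnapshot) -> CoverageSnapshot {
    var delta: CoverageSnapshot = [:]
    for (file, lines) in after {
        for (line, count) in lines {
            let previous = before[file]?[line] ?? 0
            if count > previous {
                delta[file, default: [:]][line] = count - previous
            }
        }
    }
    return delta
}

// Around each test event (framework hooks not shown):
// let before = readCoverageCounters()
// runTest()
// let perTestCoverage = coverageDelta(before: before, after: readCoverageCounters())
```

The cost described above comes from exactly this pattern: every test boundary pays for a snapshot, a diff, and often reduced parallelism.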
The main constraint should be client-side cost, not server-side complexity. Tuist can afford rich server-side processing, storage, and indexing if that produces better selection quality. What we should avoid is making every CLI or Gradle plugin invocation substantially slower, more memory hungry, or less parallel by default.
“Expensive” therefore means client-side overhead: instrumentation can slow test execution, counters may need to be snapshotted or reset around each test scope, coverage artifacts and uploads get larger, some frameworks may need isolated or serial execution, and unsupported framework behavior can force conservative fallbacks. Datadog acknowledges this cost: its docs say excluded branches still collect per-test coverage, but the performance impact is mitigated by collecting coverage only when Datadog detects enough new coverage information to offset the cost. Datadog’s Swift Testing support is also serial-only, which is a concrete example of per-test coverage constraining parallelism.
For Tuist, per-test coverage evidence should be the north star for selection quality. Suite-level or module-level evidence should be treated as compatibility and performance fallbacks. Aggregate coverage should remain useful on its own, while TIA should only use granular coverage evidence when it is reliable enough to explain why a test was skipped.
Coverage Normalization
Tools like SonarQube use generic coverage XML as an import format. The format intentionally keeps the model small: files contain coverable lines, each line has a covered boolean, and lines can optionally include branch counts. This makes it useful for showing cross-language coverage in one product, but it is not a lossless representation of every ecosystem’s native coverage data.
Tuist should use a generic normalized model for dashboarding and cross-build-system queries, but it should not use SonarQube-style generic XML as the canonical internal representation for Test Impact Analysis.
If Tuist converted native coverage into a small generic format too early, it could lose data that matters later:
- Execution counts, not just covered/uncovered booleans.
- Target/module/function/method hierarchy from `xccov`.
- Suite/test attribution needed by TIA.
- Git blob identity for covered files.
- Coverage source and tool version.
- Native branch/condition detail beyond simple per-line branch totals.
- Distinction between observed coverage and carried-forward coverage.
The preferred design is to ingest ecosystem-native coverage through adapters, normalize it into a Tuist-owned coverage model that preserves the fields we need, and optionally keep the original coverage artifact or an extracted raw representation for reprocessing. Tuist can still export or display generic coverage views, but generic coverage should be an output/view, not the only stored truth.
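A minimal sketch of what such a Tuist-owned model could preserve, with illustrative names only:

```swift
/// Illustrative normalized coverage model that keeps more than covered/uncovered booleans.
struct NormalizedCoverage: Codable {
    struct Line: Codable {
        let number: Int
        let executionCount: Int          // counts, not just a boolean
        let branchesCovered: Int?        // optional native branch/condition detail
        let branchesTotal: Int?
    }
    struct File: Codable {
        let path: String                 // repository-relative
        let gitBlobOID: String?          // content identity when resolvable
        let lines: [Line]
    }
    struct Scope: Codable {              // target/module/suite/test hierarchy
        let kind: String                 // "target", "module", "suite", "test"
        let name: String
        let files: [File]
    }

    let tool: String                     // e.g. "xccov", "jacoco"
    let toolVersion: String?
    let origin: String                   // "observed" vs. "carried-forward"
    let scopes: [Scope]
}
```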
Coverage Reporting With Selected Tests
Selective testing changes the meaning of a coverage report. If only selected tests run, the coverage reported by the client reflects only the tests that executed in the current process. Tuist’s platform coverage should not expose that partial view as the primary coverage result.
Tuist should make this explicit by distinguishing:
- Observed coverage: coverage produced by tests that actually ran in the current run.
- Carried-forward coverage: previous passing coverage evidence reused for skipped tests whose covered files are unchanged.
- Reported coverage: observed coverage plus valid carried-forward coverage for skipped tests.
Coverage dashboards, coverage APIs, PR comments, and coverage gates should use reported coverage. In other words, whenever selective testing skips tests, Tuist should combine the coverage reported by the client with valid carried-forward coverage from the skipped tests. This keeps coverage trends meaningful even when not every test executes in the current run.
The observed report can remain available as diagnostic metadata because it is the only coverage measured in the current process, but it should not be the primary platform coverage view. Reported coverage must still be explainable and conservative: Tuist should only carry coverage forward when the skipped test has passing evidence and the covered files have the same Git blob hashes. Changed files should only receive current observed coverage; if no selected test covers a changed file, Tuist should surface that as a coverage gap rather than borrowing stale coverage.
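A minimal sketch of the combination rule, assuming per-file sets of covered line numbers and leaving evidence-validity checks (passing evidence, unchanged blob hashes) to the caller:

```swift
/// Covered line numbers per repository-relative file path.
typealias FileCoverage = [String: Set<Int>]

/// Reported coverage = observed coverage plus carried-forward coverage for skipped tests.
/// Changed files only receive current observed coverage; stale coverage is never borrowed.
func reportedCoverage(observed: FileCoverage,
                      carriedForward: FileCoverage,
                      changedFiles: Set<String>) -> FileCoverage {
    var reported = observed
    for (file, lines) in carriedForward {
        guard !changedFiles.contains(file) else { continue }
        reported[file, default: []].formUnion(lines)
    }
    return reported
}
```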
Full test runs remain the gold standard for refreshing coverage evidence and validating reported coverage.
File Hashes
For source-controlled files, the preferred file hash should be the Git blob object ID for that file at the commit being reported. The client can derive this from Git, for example by resolving the covered repository-relative path against the run commit. This gives Tuist the exact content identity that Git recorded for the file, without inventing a separate hashing scheme.
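For example, a client could resolve that identity by asking Git for the blob object at `<commit>:<path>`. The sketch below shells out to `git rev-parse` and assumes the repository and commit are available locally:

```swift
import Foundation

struct GitLookupError: Error {}

/// Resolve the Git blob object ID for a repository-relative path at a given commit,
/// e.g. gitBlobOID(repository: repoRoot, commit: "HEAD", path: "Sources/Calculator/Add.swift").
func gitBlobOID(repository: URL, commit: String, path: String) throws -> String {
    let process = Process()
    process.currentDirectoryURL = repository
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = ["git", "rev-parse", "\(commit):\(path)"]

    let stdout = Pipe()
    process.standardOutput = stdout
    try process.run()
    process.waitUntilExit()
    guard process.terminationStatus == 0 else { throw GitLookupError() }

    let data = stdout.fileHandleForReading.readDataToEndOfFile()
    return String(decoding: data, as: UTF8.self).trimmingCharacters(in: .whitespacesAndNewlines)
}
```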
Datadog’s public documentation suggests a conservative Git-centered model rather than arbitrary filesystem hashing. Test Impact Analysis requires Git to be available, analyzes commit history with past coverage, and compares the current commit to previous commits where the covered and tracked files are identical. Non-code files that can affect tests are modeled as repository-relative tracked files; if any tracked file changes, Datadog runs all tests.
Tuist should follow the same shape for the initial implementation:
- Only Git-addressable, repository-relative files should be eligible as covered-file evidence for skipping.
- Non-code inputs that live in the repository, such as dependency manifests, Dockerfiles, Makefiles, snapshot fixtures, or generator configuration, should be configured as tracked-file globs.
- If a tracked file changes, Tuist should run all tests.
- If a coverage path points to a generated file, temporary file, or file outside the repository root, Tuist should treat that evidence as incomplete and avoid using it to skip tests.
We can add explicit runtime input hashing later, but that should be a separate design. It would need clear rules for when the file exists, how the current run computes the same digest before test selection, and how users opt into the risk. It should not be part of the Datadog-style MVP.
The stored evidence should include the hash algorithm/source so future Git SHA-256 repositories and Git SHA-1 repositories remain distinguishable.
Git Metadata and Upload Timing
Datadog appears to split this into two streams:
- Git metadata is collected from CI provider environment variables and the local `.git` directory. Their source-code integration can also sync repository metadata through `datadog-ci git-metadata upload`, which sends the repository URL, current commit SHA, and tracked file paths. Older tracer documentation also describes Intelligent Test Runner support as generating and uploading Git packfiles.
- Test and coverage data is collected by the test instrumentation during the test session and sent to Datadog. Before execution, the library asks the backend for skippable tests using the current repository/commit metadata. During or after execution, the library uploads coverage evidence so future commits can be evaluated.
For Tuist, the equivalent should be explicit:
- At test-selection time, before tests run, the client sends the current Git context: repository URL, current commit SHA, base commit SHA when available, branch/ref, and changed tracked files.
- At coverage-ingestion time, after tests run, the client uploads coverage evidence keyed by the same Git context.
- The client should compute Git blob IDs for repository-relative covered files and include them with the coverage evidence, unless we add a separate Git metadata sync endpoint that gives the server enough Git object data to derive those identities itself.
- If the server cannot prove that the covered files from a previous passing run are identical to the current commit, it should run the test.
Changed Files and Tracked Files
For each TIA-enabled run, the client should send:
- Current commit SHA.
- Base commit SHA when available.
- Changed file paths between base and head.
- Current hashes for changed files.
- Candidate tests/suites/modules.
- Build system and invocation context.
The server should treat a test as skippable only when it can prove that all files relevant to the test are unchanged compared with a previous passing run. If a changed file is not covered by any known test, Tuist should surface that as a coverage insight rather than pretending the change is safe.
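A minimal sketch of the selection request, with illustrative field names rather than a proposed API:

```swift
/// Illustrative selection request sent before tests run (hypothetical field names).
struct SelectionRequest: Codable {
    struct ChangedFile: Codable {
        let path: String          // repository-relative
        let gitBlobOID: String?   // nil for deleted files
    }

    let repositoryURL: String
    let commitSHA: String
    let baseCommitSHA: String?
    let branch: String?
    let changedFiles: [ChangedFile]
    let candidates: [String]      // candidate tests/suites/modules, e.g. "CalculatorTests"
    let buildSystem: String       // "xcode" or "gradle"
    let invocationContext: [String: String]  // scheme, destination, Gradle task, etc.
}
```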
Selection Algorithm
For each candidate test or suite:
- Always select if it is new.
- Always select if it recently failed.
- Always select if it is recently flaky.
- Always select if it is explicitly marked as unskippable.
- Always select if no passing coverage evidence exists.
- Always select if any covered/tracked file has changed since the last passing evidence.
- Otherwise, mark it as skippable.
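Expressed as a sketch of the decision order, with boolean inputs standing in for server-side lookups against test history and coverage evidence:

```swift
enum Decision {
    case select(reason: String)
    case skip(reason: String)
}

/// Illustrative decision order for one candidate test or suite.
func decide(isNew: Bool, recentlyFailed: Bool, recentlyFlaky: Bool, unskippable: Bool,
            hasPassingEvidence: Bool, coveredOrTrackedFileChanged: Bool) -> Decision {
    if isNew { return .select(reason: "new test") }
    if recentlyFailed { return .select(reason: "recently failed") }
    if recentlyFlaky { return .select(reason: "recently flaky") }
    if unskippable { return .select(reason: "marked unskippable") }
    if !hasPassingEvidence { return .select(reason: "no passing coverage evidence") }
    if coveredOrTrackedFileChanged { return .select(reason: "covered or tracked file changed") }
    return .skip(reason: "unchanged covered files with previous passing evidence")
}
```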
The server response should include:
- Selected tests/suites.
- Skipped tests/suites.
- Skip reason.
- Last passing evidence commit/run.
- Estimated time saved.
- Coverage gaps for changed files.
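A minimal sketch of that response shape, again with illustrative field names:

```swift
/// Illustrative selection response (hypothetical field names).
struct SelectionResponse: Codable {
    struct Skipped: Codable {
        let identifier: String
        let reason: String             // e.g. "unchanged covered files with previous passing evidence"
        let lastPassingCommitSHA: String
        let lastPassingRunID: String
    }

    let selected: [String]             // tests/suites to run
    let skipped: [Skipped]
    let estimatedTimeSavedSeconds: Double
    let coverageGaps: [String]         // changed files not covered by any known test
}
```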
Xcode Implementation
Xcode should be planned from the beginning, not treated as a later extension. It is also a strong candidate for the first implementation slice because Tuist’s core audience is Xcode users, and non-generated Xcode projects are one of the main motivations for this RFC.
Reasons:
- Xcode coverage is available through test schemes, test plans, `.xcresult`, and `xccov`.
- Total run-level coverage is useful as a product feature immediately.
- Tuist already parses Xcode result bundles locally and server-side.
- Non-generated projects need this feature most, but they also provide less graph metadata than generated projects.
The Xcode path should start with coverage ingestion and simulation before TIA enforcement:
- Parse Xcode coverage from `.xcresult` using `xccov`; use `xcrun xccov view --report --json` for aggregate target/file/function summaries and `xcrun xccov view --archive` for line-level execution data when needed.
- Display coverage in the dashboard and PR comments.
- Add test enumeration from `.xctest` products or `.xctestrun`.
- Experiment with per-test coverage collection, with suite-level coverage as a fallback when per-test attribution would make the client run too slowly or require unacceptable serialization.
- Run Xcode TIA in simulation mode once coverage evidence is good enough to explain selections.
- If reliable, add Xcode TIA at the finest safe granularity for the framework: test first, suite fallback.
- Keep generated-project hash-based selective testing as the preferred deterministic path when available.
For aggregate Xcode coverage, `xccov` should be the source of truth. It is Xcode’s purpose-built coverage tool, it can read `.xcresult` bundles directly, and it exposes both report-level summaries and line-level archive data. This matches the broader ecosystem: SonarQube’s current Swift/Xcode 13.3+ guidance uses `xccov-to-sonarqube-generic.sh` to convert `xccov` output into SonarQube’s generic coverage format, and other tools commonly convert `xccov` output into LCOV or tool-specific formats.
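As an example, a processor could decode the aggregate report produced by `xcrun xccov view --report --json`. The field subset shown here (`targets`, `files`, `lineCoverage`, `coveredLines`, `executableLines`) is an assumption about the JSON report shape and should be verified against the Xcode versions we support:

```swift
import Foundation

/// Assumed subset of the xccov JSON report; verify against the Xcode version in use.
struct XccovReport: Codable {
    struct Target: Codable {
        let name: String
        let lineCoverage: Double
        let files: [File]
    }
    struct File: Codable {
        let path: String
        let lineCoverage: Double
        let coveredLines: Int
        let executableLines: Int
    }
    let lineCoverage: Double
    let targets: [Target]
}

/// Decode a report previously exported with: xcrun xccov view --report --json Result.xcresult > coverage.json
func parseXccovReport(at path: String) throws -> XccovReport {
    let data = try Data(contentsOf: URL(fileURLWithPath: path))
    return try JSONDecoder().decode(XccovReport.self, from: data)
}
```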
`xcresulttool` should remain useful for result-bundle data outside coverage, such as test summaries, attachments, and logs. A Swift parser in the existing xcresult processor should only become necessary if `xccov` is too slow, unavailable in the processor environment, or does not expose the coverage granularity needed for TIA.
For non-generated Xcode projects, TIA should not require Tuist to parse the complete Xcode project graph. The first version should rely on coverage evidence, changed files, test enumeration, and Xcode’s existing `-only-testing`/`.xctestrun` filtering mechanisms.
Gradle Implementation
Gradle should be planned alongside Xcode as a first-class implementation track. It may be the lower-friction path for enforced skipping because the Gradle plugin already has suite discovery and filtering primitives, but the product plan should not depend on proving Gradle first before starting the Xcode path.
Reasons:
- JaCoCo is widely adopted and already works well with Java/Kotlin/JVM projects.
- The Tuist Gradle plugin already collects test suites and cases.
- Gradle’s `Test.filter.includeTestsMatching` is already used by Tuist sharding.
- Android teams are an explicit target audience for this feature.
Gradle should also aim for per-test coverage evidence where the JVM/test-framework hooks make it cheap enough. The first enforceable slice may still operate at suite/class granularity, matching the existing sharding model, if method-level discovery or attribution adds too much client-side overhead. Method-level selection can come later once we know the cost profile.
The Gradle plugin should:
- Enable or discover JaCoCo reports when coverage collection is enabled.
- Upload coverage reports and test results to Tuist.
- Before executing tests, ask Tuist for a selection plan.
- Apply include filters for selected test classes.
- Record skipped tests as “not selected” or an equivalent state.
- Periodically run all tests or a random exploration sample to detect stale coverage evidence.
Non-Goals
- Replacing hash-based selective testing for generated projects.
- Building a Develocity-style ML predictive model in the first version.
- Guaranteeing test selection for every test framework and coverage format.
- Treating aggregate coverage as sufficient evidence to skip individual tests.
- Making coverage thresholds mandatory for all projects.
- Supporting Android device/instrumentation tests in the first Gradle TIA version.
User Experience
Xcode
For non-generated projects, users should not need to adopt Tuist manifests. A future command could look like:
tuist xcodebuild test \
-workspace App.xcworkspace \
-scheme App \
-destination "platform=iOS Simulator,name=iPhone 16" \
-enableCodeCoverage YES
In the first phase this uploads coverage and test insights. With TIA simulation enabled, Tuist can report what it would have selected or skipped. In a later enforced mode, Tuist can add -only-testing filters after fetching a selection plan.
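In enforced mode, the filter construction could look like the following sketch, where `SelectionPlan` is a hypothetical model of the server response:

```swift
/// Hypothetical model of the selection plan returned by the server.
struct SelectionPlan {
    /// Selected test identifiers in xcodebuild form: "Target", "Target/Suite", or "Target/Suite/testCase".
    let selected: [String]
}

/// Translate a selection plan into xcodebuild -only-testing arguments.
func onlyTestingArguments(for plan: SelectionPlan) -> [String] {
    plan.selected.map { "-only-testing:\($0)" }
}

// Example:
// onlyTestingArguments(for: SelectionPlan(selected: ["AppTests/CalculatorTests"]))
// -> ["-only-testing:AppTests/CalculatorTests"]
```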
Gradle
tuist {
testInsights {
enabled = true
}
coverage {
enabled = true
}
testImpactAnalysis {
enabled = true
mode = "simulation" // "simulation", "enforced", "off"
granularity = "suite"
}
}
CI output:
Tuist: Test Impact Analysis simulation
Tuist: 842 test suites analyzed
Tuist: 517 would be selected
Tuist: 325 would be skipped
Tuist: estimated serial time saved: 18m 42s
Tuist: no predicted missed failures in the last 30 simulated runs
Enforced mode:
Tuist: Test Impact Analysis selected 517 of 842 test suites
Tuist: skipped 325 suites with previous passing coverage evidence
Tuist: selected 12 recently failed suites and 4 recently flaky suites
Trust and Safety
False skips are the main product risk. The system should be conservative by default.
Required safeguards:
- Simulation mode before enforcement.
- Always-run rules for new, changed, failed, flaky, and unskippable tests.
- Full-run or remaining-tests jobs later in CI.
- Random exploration samples to prevent feedback bias.
- Clear “why skipped” explanations.
- Fallback to running everything when coverage evidence is missing or stale.
- Project-level controls for mode, branch, CI provider, and minimum history.
The dashboard should make missed failures inspectable. A user should be able to open a skipped test and see:
- Why Tuist considered it skippable.
- Which files it covered in the last passing evidence.
- Which commit/run supplied the evidence.
- Which changed files were outside its coverage set.
Trade-offs
Advantages
- Builds toward a coverage product users already want.
- More explainable than ML-based prediction.
- Works for Gradle and non-generated Xcode without requiring generated metadata.
- Gives useful value before test skipping is enabled.
- Allows richer server-side selection logic while keeping client-side collection conservative.
- Fits existing Tuist test insights, sharding, quarantine, and PR-comment surfaces.
Disadvantages
- Coverage collection adds runtime overhead.
- Per-test coverage can be client-side expensive and framework-specific.
- Xcode per-test coverage may be significantly harder than Gradle coverage.
- Coverage evidence can miss non-code dependencies such as fixtures, network calls, environment variables, and generated files.
- Selective runs make total coverage reporting more nuanced.
- The data volume may be large for repositories with many files and tests.
References
- Datadog Test Impact Analysis: https://docs.datadoghq.com/tests/test_impact_analysis/how_it_works/
- Datadog Test Impact Analysis for Swift: https://docs.datadoghq.com/tests/test_impact_analysis/setup/swift/
- Datadog Code Coverage: https://docs.datadoghq.com/tests/code_coverage/
- Develocity Predictive Test Selection: Develocity Predictive Test Selection User Manual, Develocity Documentation 2026.1
- Meta Predictive Test Selection paper: https://arxiv.org/abs/1810.05286
- Ekstazi regression test selection: https://github.com/gliga/ekstazi
- STARTS static regression test selection: https://github.com/TestingResearchIllinois/starts
- SonarQube Swift coverage parameters: Test coverage parameters, SonarQube Server documentation
- SonarQube Swift coverage community guide: “[Coverage & Test Data] Generate Reports for Swift”, Sonar Community
- xccov manual page: xccov(1)
- SonarQube generic test data format: Generic test data, SonarQube Server documentation