Getting the tuning of this just right was actually quite the ordeal. A brute force approach with a large application is miserably slow. Notionally, you run tuist graph -d -f json --no-open, then use XcodeGraph to parse the result.
You then conduct a breadth-first search of all local dependencies and run the same command for each of them, keeping track of what you’ve visited. Every time you discover a dependency, you add it to the sparse checkout. You’ll eventually crawl the whole dependency graph, iteratively adding folders as you go, thus enabling the feature.
This does rely on the fact that tuist graph is capable of outputting valid JSON even when the project can’t build (because dependencies aren’t there) and it requires the user to specify a starting project as the root node of the graph.
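The crawl itself is a textbook breadth-first traversal. Here’s a minimal sketch; in the real flow each node’s edges come from parsing the output of tuist graph -d -f json --no-open with XcodeGraph, so the `dependencies(of:)` closure here is a hypothetical stand-in for that step:

```swift
// Breadth-first crawl of the local dependency graph starting at a
// user-specified root project. `dependencies` is a hypothetical
// stand-in for "run tuist graph for this project and parse the result".
func crawlDependencies(
    startingAt root: String,
    dependencies: (String) -> [String]
) -> [String] {
    var visited: Set<String> = [root]
    var queue: [String] = [root]
    var order: [String] = []
    while !queue.isEmpty {
        let project = queue.removeFirst()
        order.append(project) // in the real tool: add this folder to the sparse checkout
        for dep in dependencies(project) where !visited.contains(dep) {
            visited.insert(dep)
            queue.append(dep)
        }
    }
    return order
}
```

The visited set is what keeps a diamond-shaped graph (two features both depending on Core) from being crawled twice.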
While this totally works, it’s miserably slow (30+ minutes) on a huge project. Step one was to simply parallelize the work of spitting out the graph and adding to the sparse index.
Parallelization did help a lot. Using Swift Concurrency with chunking based on the number of processor cores, it brought the sparse checkout down to ~2 minutes on our very large project.
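The chunked fan-out looks roughly like this, a sketch assuming some per-chunk `worker` that shells out to tuist and adds results to the sparse index. Chunking by core count (rather than spawning one task per project) keeps the task group from drowning in scheduling overhead:

```swift
import Foundation

// Split the work into one chunk per active processor core and run the
// chunks concurrently with a task group. `worker` is a hypothetical
// closure standing in for "generate graphs for these projects".
func processInParallel<T: Sendable, R: Sendable>(
    _ items: [T],
    worker: @escaping @Sendable ([T]) async -> [R]
) async -> [R] {
    let chunkCount = max(1, ProcessInfo.processInfo.activeProcessorCount)
    let chunkSize = max(1, (items.count + chunkCount - 1) / chunkCount)
    let chunks = stride(from: 0, to: items.count, by: chunkSize).map {
        Array(items[$0..<min($0 + chunkSize, items.count)])
    }
    return await withTaskGroup(of: [R].self) { group in
        for chunk in chunks {
            group.addTask { await worker(chunk) }
        }
        var results: [R] = []
        for await chunkResults in group { results += chunkResults }
        return results
    }
}
```

Note that results arrive in completion order, not submission order, which is fine here because sparse-checkout additions are order-independent.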
However, that’s a ridiculously long time for a checkout, so that brought me to a final optimization…not generating the graph. Here’s a rough sketch of how it works…
The CLI uses a storage mechanism (in my case I chose GRDB). The idea is that you can use some kind of key-value store to retrieve a previously cached project graph if the project file hasn’t changed.
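The cache boils down to: key = a fingerprint of the Project.swift file, value = its serialized graph. GRDB backs this with a table on disk in my CLI; an in-memory dictionary is enough to sketch the lookup flow, with `loadGraph` as a hypothetical stand-in for the slow path:

```swift
import Foundation

// A key-value graph cache sketch. The real implementation persists this
// in a GRDB-backed SQLite table; the shape of the lookup is the same.
struct GraphCache {
    private var storage: [String: Data] = [:] // fingerprint -> encoded graph

    mutating func graph(
        forFingerprint fingerprint: String,
        orGenerate loadGraph: () -> Data
    ) -> Data {
        if let cached = storage[fingerprint] {
            return cached // project file unchanged: skip graph generation entirely
        }
        let fresh = loadGraph() // cache miss: generate the graph and remember it
        storage[fingerprint] = fresh
        return fresh
    }
}
```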
There are a couple of ways to detect whether the project file has changed. One is to read it and compute a checksum (or hash). I’ve actually got support for both: my CRC32 checksum algorithm is a bit faster than SHA-256, which matters across a large number of project files. However, both approaches share a decided downside: you actually have to read the file contents.
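For change detection you don’t need cryptographic strength, just a stable fingerprint, which is why the CRC32 trade-off is reasonable. A minimal bitwise CRC32 (same polynomial zlib uses) looks like this; a table-driven variant would be the faster production choice:

```swift
// Bitwise CRC32 (reflected, polynomial 0xEDB88320), the same checksum
// family zlib uses. Good enough for "did this file change?" questions.
func crc32(_ bytes: [UInt8]) -> UInt32 {
    var crc: UInt32 = ~0
    for byte in bytes {
        crc ^= UInt32(byte)
        for _ in 0..<8 {
            crc = (crc >> 1) ^ (crc & 1 == 1 ? 0xEDB88320 : 0)
        }
    }
    return ~crc
}
```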
There’s another way to get clever, which is to use the Git object ID for a file…Git already did the work for us. This isn’t quite as straightforward as you’d think, because there’s a difference between what’s in the index, what’s staged, and what’s in the working tree, and the order matters when gathering that object identifier. Basically, you want what’s on disk to win if it’s an unstaged change, what’s staged to win if there’s anything there…and what’s in the Git index to win otherwise (note that git ls-files is thankfully unaffected by sparse checkout).
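That precedence can be captured as a small pure function. This is a sketch with the three possible sources modeled as optionals; in the real flow the working-tree OID would come from hashing the file yourself (git hash-object computes the same blob ID) when the file has unstaged changes, and the others from querying git. All the parameter names are illustrative:

```swift
// Resolve which object ID represents the file's current contents.
// Precedence: working tree (if the file has unstaged changes), then
// whatever is staged, then the committed entry in the index.
func resolveObjectID(
    hasUnstagedChange: Bool,
    workingTreeOID: String?,
    stagedOID: String?,
    indexOID: String?
) -> String? {
    if hasUnstagedChange, let oid = workingTreeOID {
        return oid // what's on disk wins for unstaged edits
    }
    if let oid = stagedOID {
        return oid // staged content wins next
    }
    return indexOID // otherwise fall back to the index
}
```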
These optimizations brought my sparse checkout time down to about 30s…which still wasn’t good enough in my opinion, so the next optimization pass was about batching.
You can use git ls-files to get the object hash for everything matching Project.swift, then batch read from the database to find all cache hits for those files at once. Once you’ve got all those graphs, all you’ve got to do is identify related dependencies that you weren’t able to batch read. In the best case, nobody changed a Project.swift definition and there are only two sparse-checkout commands: one to make sure the Tuist folder and other minimal required folders exist, and one to add all the dependencies.
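Turning git ls-files output into cache keys is mostly string parsing. With git ls-files -s, each line has the shape `<mode> <oid> <stage>\t<path>`, so a sketch of extracting path-to-OID pairs for the manifests looks like this (a single batched database read against all of these OIDs then yields every cache hit in one query):

```swift
// Parse `git ls-files -s` output into path -> blob object ID, filtered
// to Project.swift manifests. Each line: "100644 <oid> 0\t<path>".
func projectFileOIDs(fromLsFilesOutput output: String) -> [String: String] {
    var oids: [String: String] = [:]
    for line in output.split(separator: "\n") {
        let parts = line.split(separator: "\t", maxSplits: 1)
        guard parts.count == 2 else { continue }
        let path = String(parts[1])
        guard path.hasSuffix("Project.swift") else { continue }
        let meta = parts[0].split(separator: " ")
        guard meta.count == 3 else { continue }
        oids[path] = String(meta[1]) // the blob object ID
    }
    return oids
}
```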
On a warm cache this brought my sparse checkout time down to 4s…finally something reasonable for the benefits received out the other end.