Sparse checkout based on a dependency graph

Newer versions of Git ship with a feature called sparse checkout. In monorepos this can be really nice: if I’ve got several apps that may or may not share dependencies and the repo is really large, I can sparse checkout just the folders I care about and build just the apps that matter.

I recently built a tool at my company for exactly this purpose. It ties into a CLI we provide to our developers. So for example you can run ios generate Apps/MyApp --sparse and it’ll do a sparse checkout of just the folders you need for that app.

This not only speeds up actions like git status in large repos (which can take quite a while depending on the repo size), it also decreases the repo size on disk…which has actually been a limiting factor for some of our developers.

Getting the tuning of this just right was actually quite the ordeal. A brute force approach with a large application is miserably slow. Notionally, you run tuist graph -d -f json --no-open, then use XcodeGraph to parse the result.

You then conduct a breadth-first search of all local dependencies and run the same for each of them, keeping track of what you’ve visited. Every time you discover a dependency, you add it to the sparse checkout. You’ll eventually crawl the whole dependency graph, iteratively adding folders as you go, thus enabling the feature.
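
The crawl above is a plain breadth-first search over the dependency graph. Here’s a minimal Python sketch of that bookkeeping; `local_dependencies` is a hypothetical callback standing in for the real step that runs tuist graph in a folder and returns the local dependency folders it references.

```python
from collections import deque

def crawl_dependencies(root, local_dependencies):
    """Breadth-first search of the dependency graph starting at `root`.

    `local_dependencies` is a hypothetical callback that would run
    `tuist graph -d -f json --no-open` inside a folder and return the
    local dependency folders its manifest references.

    Returns every folder that needs to be in the sparse checkout.
    """
    visited = set()
    queue = deque([root])
    while queue:
        folder = queue.popleft()
        if folder in visited:
            continue
        visited.add(folder)
        # In the real tool, each newly discovered folder is added to the
        # sparse checkout here so its Project.swift exists for the next hop.
        for dep in local_dependencies(folder):
            if dep not in visited:
                queue.append(dep)
    return visited
```

The visited set is what prevents diamond-shaped graphs (two apps sharing a module) from being crawled more than once.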

This does rely on the fact that tuist graph is capable of outputting valid JSON even when the project can’t build (because dependencies aren’t there) and it requires the user to specify a starting project as the root node of the graph.

While this totally works, it’s miserably slow (30+ minutes) on a huge project. Step one was to simply parallelize the work of spitting out the graph and adding to the sparse index.

Parallelization did help a lot. Using smart chunking based on the number of processor cores and Swift Concurrency, it brought the sparse checkout down to ~2 minutes on our very large project.
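
The shape of that chunking is simple: split the pending folders into one chunk per core and fan the graph-generation work out across them. The real tool does this with Swift Concurrency; this is just an illustrative Python sketch with a stand-in `generate_graph` callback.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunked(items, chunk_count):
    """Split `items` into at most `chunk_count` roughly even chunks."""
    size = max(1, -(-len(items) // chunk_count))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_in_parallel(folders, generate_graph):
    """Run `generate_graph` over chunks of folders, one chunk per worker.

    `generate_graph` is hypothetical here; in the real tool each call
    shells out to tuist and adds discovered folders to the sparse index.
    """
    chunks = chunked(folders, os.cpu_count() or 1)
    results = []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        for chunk_result in pool.map(lambda c: [generate_graph(f) for f in c], chunks):
            results.extend(chunk_result)
    return results
```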

However, that’s a ridiculously long time for a checkout, so that brought me to a final optimization…not generating the graph. Here’s a rough sketch of how it works…

The CLI uses a storage mechanism (in my case I chose GRDB). The idea is that you can use some kind of key-value store to retrieve a previously cached project graph if the project file hasn’t changed.

There are a couple of ways to detect whether the project file has changed. One is to read it and compute a checksum (or hash). I’ve actually got support for both: I think my CRC32 checksum algorithm is a bit faster than SHA-256, which matters across a large number of project files. However, both share a decided downside: you actually have to read the file contents.
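
Both fingerprints amount to “bytes in, short stable key out.” A quick Python sketch of the two options (the real tool is Swift, but the trade-off is identical): CRC32 is cheap but not collision-resistant, SHA-256 is slower but effectively collision-free.

```python
import hashlib
import zlib

def crc32_fingerprint(data: bytes) -> str:
    """Fast, non-cryptographic checksum; fine as a change-detection key."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")

def sha256_fingerprint(data: bytes) -> str:
    """Slower, but collision-resistant; safer if keys must never collide."""
    return hashlib.sha256(data).hexdigest()
```

Either one works as the cache key for a Project.swift graph; the cost that can’t be avoided this way is the read of the file itself.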

There’s another way to get clever: use the Git object ID for a file, since Git already did the work for us. This isn’t quite as straightforward as you’d think, because there’s a difference between what’s in the index, what’s staged, and what’s in the working tree, and the order matters when gathering that object identifier. Basically, you want what’s on disk to win if it’s an unstaged change, what’s staged to win if there’s anything there, and what’s in the Git index to win otherwise (note that git ls-files is thankfully unaffected by sparse checkout).
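
For the on-disk fallback, you don’t even need to invoke Git: in a SHA-1 repository, a blob’s object ID is just the SHA-1 of a small header plus the file contents, so you can compute the same ID Git would report and compare it against what git ls-files says. (SHA-256 repositories use a different formula; this sketch assumes the common SHA-1 case.)

```python
import hashlib

def git_blob_oid(data: bytes) -> str:
    """Compute the Git blob object ID for file contents, the same way Git
    does for a SHA-1 repo: sha1(b"blob <size>\\0" + contents)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

This is equivalent to `git hash-object <file>`, which is handy when you want the working-tree version to win over the index without shelling out per file.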

These optimizations brought my sparse checkout time down to about 30s…which still wasn’t good enough in my opinion, so the next optimization pass was about batching.

You can use git ls-files to get the object hash for everything matching Project.swift, then batch read from the database to find all cache hits for that particular Project.swift file. Once you’ve got all those graphs, all you’ve got to do is identify related dependencies that you weren’t able to batch read. In the best case, nobody changed a Project.swift definition and there are only two sparse-checkout commands: one to make sure the Tuist folder and other minimal required folders exist, and one to add all the dependencies.
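
The batch step is just two passes: parse the `git ls-files -s` output into path → object ID, then partition into cache hits and misses in one database read. Here’s a Python sketch of that; the `cache` mapping stands in for the GRDB lookup.

```python
def parse_ls_files(output: str) -> dict:
    """Parse `git ls-files -s` output, where each line looks like:
    "<mode> <oid> <stage>\\t<path>". Returns path -> blob object ID."""
    oids = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        meta, path = line.split("\t", 1)
        _mode, oid, _stage = meta.split()
        oids[path] = oid
    return oids

def split_cache_hits(oids: dict, cache: dict):
    """Partition into (hits, misses): hits map path -> cached graph,
    misses are paths whose graph must be regenerated."""
    hits = {p: cache[o] for p, o in oids.items() if o in cache}
    misses = [p for p, o in oids.items() if o not in cache]
    return hits, misses
```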

On a warm cache this brought my sparse checkout time down to 4s…finally something reasonable for the benefits received out the other end.

Do note that sparse-checkout actually suffers with glob-style patterns. It’s better to give it exact paths, because otherwise you can force it into O(N × M) lookups (where N is the number of patterns and M is the number of files in the repo).
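
A small Python illustration of why that matters (the paths here are made up): glob patterns must be tested against every file, while exact directory prefixes reduce to a hash lookup per file.

```python
from fnmatch import fnmatch

repo_files = [f"Modules/M{i}/Sources/File.swift" for i in range(1000)]

# Glob-style: every pattern is tested against every file -> O(N * M).
patterns = ["Modules/M1/*", "Modules/M2/*"]
glob_matches = [f for f in repo_files for p in patterns if fnmatch(f, p)]

# Exact directory paths (what cone mode wants): one set lookup per file.
wanted_dirs = {"Modules/M1", "Modules/M2"}
exact_matches = [f for f in repo_files if f.split("/Sources/")[0] in wanted_dirs]
```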

Optimizations I tried that turned out to be bad ideas:

  • I attempted to just write the sparse-checkout file myself at .git/info/sparse-checkout and then run git sparse-checkout reapply, on the assumption that it’d do better when given everything all at once. Do note that this doesn’t suffer from the ARG_MAX problem of calling git sparse-checkout add with too large a list of files. However, this was surprisingly slower than incremental adds from the batches of dependencies.
  • I really tried to make this somehow work from any folder in the repo…but the problem is that if you execute this from repoRoot/myModule and the sparse checkout then removes that folder, no other git commands will work.

Optimizations I still want to explore:

  • There’s a potential way to get that 4s lower by re-using a sparse-checkout file. Imagine a case where I regularly sparsely check out some group of folders. Instead of starting over each time from a clean state and rebuilding the sparse index, you could discover the deltas between what is checked out and what needs to be checked out, rewrite the file, then reapply, with some heuristic based on how many changes are necessary.
  • It could be possible/reasonable to have a daemon running that captures all Project.swift changes so the cache is constantly warm. That could also be total overkill, hard to say without a bit of exploration. It doesn’t quite help with changes you pull in from others, though.
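
The delta idea from the first bullet above is just set arithmetic over folder paths; a tiny sketch:

```python
def sparse_deltas(current: set, desired: set):
    """Given the folders currently in .git/info/sparse-checkout and the
    folders the dependency graph says we need, return (to_add, to_remove)
    rather than rebuilding the whole sparse index from scratch."""
    return desired - current, current - desired
```

A heuristic could then pick between incremental `git sparse-checkout add` calls and a full rewrite-plus-reapply based on how big those two sets are.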

Do note that it is necessary to keep a minimum set of required files checked out. Targets often rely on non-obvious files outside their module tree: shared scripts (e.g., ./Scripts), xcconfig hierarchies, derived source generators, assets in shared bundles, plugins, and repo-level config (.swiftlint.yml, .clang-format, “Resources/Branding/*”). If any of those are omitted, the graph may be “correct,” yet the build fails or Xcode behaves oddly.

For repos using submodules (I feel bad for them), extra work would be necessary to fetch them appropriately. This is probably best thought of as a “hook” that runs before a sparse checkout is performed, rather than an explicit action, since folks might be using other non-standard tools (like git-subrepo instead of submodules). Arguably the same is true for Git LFS, although in practice LFS installs git hooks that should make it a non-issue.

I also didn’t mention this, but sparse checkout should use cone mode and the sparse index (e.g., git sparse-checkout init --cone --sparse-index) for maximum effectiveness here.

Hi @Tyler-Keith-Thompson :wave:

This is quite impressive work, thanks for sharing it. I was not aware of sparse-checkout, but we might need to use it since we embraced a monorepo setup and will experience slowness in our git operations at some point.

I have some questions regarding your setup.

Before you run tuist graph, I assume you have to clone the manifests first? Do you have all of them in some conventional location (e.g. the root) such that you can just clone that folder and run your CLI?

I was very surprised to see that the processing of the graph took 30+ minutes. We do some operations in graphs internally, and they are usually in the range of seconds, so I’d be curious to know more about where that time is being spent.

If it’s useful, we have a command, tuist hash, where you can get fingerprints for all the targets in the graph. I was wondering if we can extend that command namespace to output fingerprints in a format that you can use to uniquely identify the values cached in the local database. Am I right assuming then that developers have a global database in their system? What happens in the case of remote environments like CI?

This is a very smart idea. We thought about doing something like that too for detecting when a project needs to be re-generated or a cache needs to be warmed. So it’d happen in the background in developers’ environments.

BTW, if you explore this path, you might find this tool useful. They provide a nice DX around defining and running daemons scoped to a project.

Before you run tuist graph, I assume you have to clone the manifests first?

So imagine a hypothetical command tuist sparse-checkout Apps/MyApp. It doesn’t need to get all the manifests first; it just needs Apps/MyApp/Project.swift (although sparse-checkout in cone mode is folder based, so all of Apps/MyApp). It then gets that graph. Once it has it, it does a breadth-first search of all local dependencies the graph called out, which involves checking those folders out.

This is why it took so long with a huge project. It’s doing tuist graph a large number of times (in my case, hundreds).

If it’s useful, we have a command, tuist hash, where you can get fingerprints for all the targets in the graph.

Maybe, but speed and batching are both very important. If this has to be run on a per-file basis, it could negatively impact performance. I’d also have to play with it to see if it works with targets that aren’t yet in the repo as the breadth-first search continues.

Am I right assuming then that developers have a global database in their system?

My prototype has one in the repo, but git-ignored. Since it’s “just” SQLite, there’s no reason it couldn’t be hosted and used as a distributed cache. Alternatively, of course, it may be possible to reuse your current distributed cache in Tuist. We’d need more data to know whether it’s even worth it, though. A partial cache miss should be on the order of seconds and a full cache miss could be on the order of minutes…depending. I’ve got no sense of how often that’d happen; there’d need to be widespread Project.swift changes. Certainly possible after somebody hasn’t pulled from a repo in a long time.

Thanks for calling out pitchfork! I didn’t know about it, but I’ll take a look and see what it has to offer.

Thanks a lot @Tyler-Keith-Thompson for your clarification. Some people seemed quite interested in this, so once you are happy with the setup, I’d be happy to help make Tuist have a built-in solution for this.