Improve debugging of CLI issues

Need/problem

When CLI issues occur, it’s important that developers have an easy way to debug what has happened – and to be able to provide the Tuist team with the necessary context.

We’ve already taken a big step in that direction by storing log files on disk and surfacing their path when an error occurs. However, this is still not perfect:

  • The log file is not accessible when the error occurs on CI – in that case, developers would need to upload the Tuist state directory as an artifact.
  • Developers need to copy and re-upload the log file when sharing logs with their colleagues or the Tuist team – another point of friction.

Detailed design

To fix these friction points, I’m proposing to upload the logs automatically when an error occurs.

The steps would be as follows:

  • An error occurs
  • The log file is stored on disk (already implemented)
  • The log file is zipped and uploaded to the Tuist storage under a conventional path derived from the run (command event) id – see the sketch below
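A minimal sketch of that last step, assuming a hypothetical `StorageClient` abstraction and path convention (neither exists in the codebase as-is):

```swift
import Foundation

// Hypothetical storage abstraction; the real client would talk to Tuist's storage.
protocol StorageClient {
    func upload(file: URL, to remotePath: String) async throws
}

// Zips the log file and uploads it under a path derived from the run
// (command event) id, so the server can locate the logs for a given run.
func uploadLogs(at logFile: URL, runId: String, storage: StorageClient) async throws {
    let zippedLogs = logFile.deletingPathExtension().appendingPathExtension("zip")

    // Zip by shelling out to the system `zip`; the real implementation could
    // equally use a Swift zip library.
    let zip = Process()
    zip.executableURL = URL(fileURLWithPath: "/usr/bin/zip")
    zip.arguments = ["-j", zippedLogs.path, logFile.path]
    try zip.run()
    zip.waitUntilExit()

    // Conventional path keyed by the run id.
    try await storage.upload(file: zippedLogs, to: "runs/\(runId)/logs.zip")
}
```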

On CI, we already wait for the run analytics to be uploaded – so we’d additionally wait for the logs to be uploaded before posting the link to the run detail.
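The two uploads could run concurrently and both be awaited before the link is printed – a sketch with hypothetical upload functions:

```swift
import Foundation

// Stubs standing in for the real uploads (names are hypothetical).
func uploadRunAnalytics(runId: String) async throws { /* ... */ }
func uploadZippedLogs(runId: String) async throws { /* ... */ }

// On CI, kick off both uploads concurrently and wait for both to finish
// before surfacing the link to the run detail.
func finishRunOnCI(runId: String) async throws {
    async let analytics: Void = uploadRunAnalytics(runId: runId)
    async let logs: Void = uploadZippedLogs(runId: runId)
    _ = try await analytics
    _ = try await logs
    print("Run detail ready for run \(runId)") // exact link format omitted
}
```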

Locally, we stop execution without waiting for the run detail to be uploaded, so that we don’t prolong the CLI’s run in interactive environments; instead, we upload the analytics in the background as part of a future execution. That does mean, however, that we can’t immediately post the link to the run detail with the logs.

To upload the run along with the logs, we can either:

  • Keep things as they are – we only show the local path, and the logs get uploaded at some point in the future.
  • Add a new command that uploads the latest run in the local queue – such as `tuist analytics update` (sketched below).
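If we went with the second option, the command could be a thin wrapper that flushes the most recent queued run. A sketch using swift-argument-parser, where the queue and uploader types are made up:

```swift
import ArgumentParser

// Illustrative stand-ins for the real queue and upload machinery.
struct QueuedRun { let id: String }
enum RunQueue { static func latest() throws -> QueuedRun? { nil } }
struct RunUploader { func upload(_ run: QueuedRun) async throws {} }

// Hypothetical `tuist analytics update` subcommand that flushes the most
// recent run (and its logs) from the local queue.
struct AnalyticsUpdateCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        commandName: "update",
        abstract: "Uploads the latest run in the local analytics queue."
    )

    func run() async throws {
        guard let latestRun = try RunQueue.latest() else {
            print("No pending runs to upload.")
            return
        }
        try await RunUploader().upload(latestRun)
        print("Uploaded run \(latestRun.id).")
    }
}
```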

While not directly related to this RFC, I’d also propose removing the system of event queues and instead uploading the events in the background, as we already do for inspecting builds. That way, uploading events doesn’t depend on developers running another Tuist command.
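One way to get that fire-and-forget behavior is to re-invoke the CLI in a detached child process that performs the upload – a sketch only, which may well differ from how build inspection handles it today:

```swift
import Foundation

// Fire-and-forget: re-invoke the CLI in a detached child process that
// performs the upload, so the foreground command can exit immediately.
// The `internal upload-run` invocation is hypothetical.
func scheduleBackgroundUpload(runId: String) throws {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: CommandLine.arguments[0])
    process.arguments = ["internal", "upload-run", runId]
    // Detach from the parent's stdio so nothing leaks into the user's terminal.
    process.standardOutput = FileHandle.nullDevice
    process.standardError = FileHandle.nullDevice
    try process.run()
    // Intentionally not calling waitUntilExit(): the upload continues after
    // the foreground command returns.
}
```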

Drawbacks

The logs can potentially be quite large – zipped logs of over 100 MB would probably not be out of the ordinary. That means increased costs for our storage and some additional time spent on CI finishing the upload of the logs. I’d argue the benefits outweigh these costs, at least on CI.

Unresolved questions

I’d argue that the CI flow is certainly a good improvement. As for the local environment, I’m not sure whether uploading the logs in the background or with a dedicated command would improve the flow much compared to sharing the file from disk directly with colleagues or the Tuist team.

@vojtechvrbka mentioned that Develocity has something similar – would you mind looking at how they deal with failures in local environments?

I think this is a great initiative that we should definitely pursue. On the operational side, we should have a plan for how we want to deal with abuse of large uploads like this (this also applies to previews). That could be a storage limit, a toggle that lets us disable uploads once a certain threshold is reached, or a toggle we only enable when someone sends a support request – I’m not sure which. It doesn’t need to be implemented right away, but we should at least consider that someone could spam-upload large files once we allow large request body sizes for these uploads.

What if we stream the logs to the server as soon as the CLI starts executing? The client can establish a WebSocket connection with the server and use it to stream stdout and stderr as they come. In the CLI, it’d be another logger.
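A sketch of what such a logger could look like on the client, built on URLSessionWebSocketTask and swift-log – the endpoint URL and wiring are assumptions:

```swift
import Foundation
import Logging

// A swift-log handler that forwards every log line over a WebSocket so the
// server can live-stream the run. The endpoint URL is made up.
struct WebSocketLogHandler: LogHandler {
    var metadata: Logger.Metadata = [:]
    var logLevel: Logger.Level = .debug
    private let socket: URLSessionWebSocketTask

    init(runId: String) {
        let url = URL(string: "wss://tuist.dev/runs/\(runId)/logs")! // illustrative endpoint
        socket = URLSession.shared.webSocketTask(with: url)
        socket.resume()
    }

    subscript(metadataKey key: String) -> Logger.Metadata.Value? {
        get { metadata[key] }
        set { metadata[key] = newValue }
    }

    func log(
        level: Logger.Level,
        message: Logger.Message,
        metadata: Logger.Metadata?,
        source: String,
        file: String,
        function: String,
        line: UInt
    ) {
        // Best-effort delivery: streaming must never fail or block the command.
        socket.send(.string("[\(level)] \(message)")) { _ in }
    }
}
```

The CLI could then bootstrap it alongside the existing handlers via LoggingSystem.bootstrap.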

It changes the model slightly, since we’d need to create the run when the command starts rather than when it completes, but I think it’s worth investing in: we’d be able to live-stream builds happening anywhere, aligning with the developer experience that Dagger Cloud provides. I’m also trying to push the Swift team in that direction.

We could certainly do that, but the debugging logs are really large – I wonder if it could turn out to be somewhat expensive for us to store all of them on the server. If we decide to do that, we might want to be more selective about which logs we send over.

Storage is quite affordable these days, and it’ll only get cheaper. What we can do is run some napkin math on customer data and estimate the cost. We don’t need to keep the logs forever – we could, say, keep a window of the past week.
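For instance, with made-up but plausible inputs, the math is straightforward:

```swift
// Napkin math with made-up inputs – replace with real customer numbers.
let failedRunsPerDay = 1_000.0
let zippedLogSizeGB = 0.1        // ~100 MB of zipped logs per run
let retentionDays = 7.0          // "window of the past week"
let pricePerGBMonth = 0.023      // rough object-storage rate, USD

let liveDataGB = failedRunsPerDay * zippedLogSizeGB * retentionDays  // 700 GB
let monthlyCostUSD = liveDataGB * pricePerGBMonth                    // ≈ $16/month
print("~\(Int(liveDataGB)) GB live, ~$\(monthlyCostUSD) per month")
```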