Taking Tuist QA to the next level

We’re getting close to unlocking our first Tuist QA workflow:

  • PR is created with your changes
  • tuist share is run as part of the CI
  • You run the Tuist QA by adding a comment such as /tuist qa Test my featureX
  • Tuist spins up a new agent that tests the Preview based on the prompt
  • Once the agent is finished, Tuist posts the QA run summary, including:
    • what was tested
    • issues found
    • screenshots as it tests the functionality

While we will initially spend a lot of time improving the core functionality (refining the base prompt, iterating on the output, etc.), we also think it’s good to start thinking ahead. I will outline a couple of features that I feel would make Tuist QA significantly better – but I’m happy to hear your ideas for what you’d find useful.

App context

There is some context that, while not required, will make the agent more efficient across runs.

App description

The larger the app, the more difficult it will be for the agent to reliably test a feature based on the testing prompt. Features are named in specific ways, and they can be buried deep inside the app where they’re hard for the agent to find. We will need to explore which techniques lead to the best results, but initially, I’m thinking of two different pieces of app context:

  • Overall description of the app – what it’s used for, what the individual app domains are called, etc.
  • Detailed description of where features are located

We can try generating both with the AI, but it might be better if the former is manually written instead.

The latter, however, should be generated by an agent that takes the human-supplied description of the app and then goes ahead and explores it. Based on that exploration, it would produce a summary of how best to navigate the app and what all of its features are. This exploration run would be quite long for large apps, but it would make subsequent QA runs much more efficient. I see this exploration being done regularly, such as once per day, to keep the description relatively fresh.

Login data

Even though the agent could probably figure out how to sign up and sign in on every run, this is definitely not optimal. We’ll need a way to specify the login credentials and how to use them (such as “Sign in by email, username: xx, password: 12345”).

We can also explore whether we can supply the credentials from the command line instead of the agent re-running the same sign-in flow over and over again.
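As a rough sketch of what “credentials plus usage instructions” could look like when rendered into the agent’s prompt – all names here (`LoginContext`, `to_prompt`) are hypothetical, not part of any Tuist API:

```python
from dataclasses import dataclass

# Hypothetical sketch: the shape of the stored login data is an
# assumption, not the actual Tuist schema.
@dataclass
class LoginContext:
    method: str    # e.g. "email"
    username: str
    password: str

def to_prompt(ctx: LoginContext) -> str:
    """Render stored credentials as sign-in instructions for the QA agent."""
    return (
        f"To sign in, use the {ctx.method} flow with "
        f"username '{ctx.username}' and password '{ctx.password}'."
    )
```

Keeping the credentials as structured data (rather than free text) would also make it easier to later inject them via the command line instead of the prompt.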

Triggering QA outside of PRs

While our initial focus is on triggering Tuist QA from the PR, we definitely see the value in triggering QA also directly from the Tuist dashboard or from the Tuist app.

Tuist Previews should have a button to trigger Tuist QA along with the prompt. Additionally, each Preview detail should link to Tuist QA runs associated with that preview.

Tuist QA insights

We should have a page similar to our Previews or Bundles where we:

  • List all QA runs as they happen
  • Show Tuist QA analytics for a given time frame:
    • Number of runs
    • Average time it takes to run Tuist QA
    • App issues found
    • … – which other time-frame analytics are useful will become more obvious once teams start using Tuist QA more actively
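As a minimal sketch, the time-frame metrics above boil down to a simple aggregation over stored run records. The record shape here is an assumption, not the actual Tuist schema:

```python
from statistics import mean

# Illustrative QA run records for a given time frame (hypothetical shape).
runs = [
    {"duration_s": 310, "issues_found": 2},
    {"duration_s": 450, "issues_found": 0},
    {"duration_s": 380, "issues_found": 1},
]

def qa_analytics(runs):
    """Aggregate the dashboard metrics listed above from raw QA runs."""
    return {
        "run_count": len(runs),
        "avg_duration_s": mean(r["duration_s"] for r in runs),
        "issues_found": sum(r["issues_found"] for r in runs),
    }
```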

Tuist QA detail

We should start with surfacing basic metadata about the QA run:

  • preview that was tested
  • triggered by
  • started at
  • duration

We should also surface the same information we’re surfacing in the GitHub PR:

  • Summary
  • Steps taken
  • Screenshots taken

Agent replay

Once we surface the most important information from the agent, the next step will be an agent replay – a chat-like replay including artifacts such as screenshots and the majority of the agent logs as it navigates the app.

This will help folks understand better how exactly the agent tested the app and debug their prompts.

Additionally, we should record the session and show a simulator view with the recording, so you can always see exactly what was on the screen while the agent was testing the app.

Live sessions

This is not too different from the agent replay – but we’d be streaming what the agent is doing live, including the simulator screen.

Human-in-the-loop

This is definitely something more long-term, but once we can show what the agent is doing live, we can also let a human redirect the agent as it’s running the tests. This is particularly useful for understanding which prompts work best, or when the feature has a wide scope and testing is more exploratory.

Automatically triggering Tuist QA

Another step will be automatically triggering the Tuist QA.

Deriving what to test based on the PR description

If the PR includes a PR description with a summary of what this feature adds, we can automatically run Tuist QA without engineers specifying a specific prompt. We can either take a conventional approach of what should be included in the PR description (such as ## Tuist QA instructions) or we can derive that with AI. In that case, we would trigger Tuist QA only if the AI had a high degree of certainty that the PR description describes well enough what should be tested.

Common scenarios

We see the first iteration of Tuist QA as focused on testing PR-specific changes rather than re-running a given set of tests on every PR or merge to main. But we think Tuist QA could eventually also run more often for repetitive tasks, so teams don’t have to maintain their own UI test suite – and these tests could easily be written by non-engineers, too. This is something I’d leave for later, once we nail the more dynamic UI test replacement with our first iteration.

Gathering more information from runs

If there’s an issue, it’s important to provide as much context as possible. Right now, we’re limiting ourselves to the agent interactions and screenshots, but we can certainly expand this to:

  • gathering network logs
  • integrating with libraries like TCA to track the app state
  • … and more

Feedback

This post highlights the overall direction of Tuist QA. We’re still early, so a lot of this is subject to change – and the space is moving fast.

If there are specific areas of Tuist QA that you’d like us to explore, we’d love to hear those.

Overall, any feedback is appreciated :heart:


Thanks for writing this up. I think QA has huge potential to help teams release work confidently at a low cost. Adding some comments:

I can see a TUIST.md or QA.md that contains this information. The Claude Code CLI has an /init command that people can run, so I can see a similar workflow where the agent would spend some time navigating the app and capturing all possible workflows to speed up future runs.

Name-wise, I wonder if we should use the following terminology instead of “insights”:

  • Feature: Tuist QA
  • Individual execution: QA Session
  • Result: Report

I’d play with asking the agent to categorize errors, such that developers can filter and sort using the category (e.g. UI misalignments, broken flows, erroring flows). As we know more about the types of errors that happen, that’s information that we can pass to the agent before starting the session.
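A minimal sketch of how categorized issues could be filtered and tallied for the dashboard – the record shape is an assumption, and the category labels just follow the examples above:

```python
from collections import Counter

# Hypothetical issue records as reported by the QA agent.
issues = [
    {"category": "UI misalignment", "summary": "Button clipped on small screens"},
    {"category": "broken flow", "summary": "Checkout dead-ends after payment"},
    {"category": "UI misalignment", "summary": "Label overlaps icon in settings"},
]

def by_category(issues, category):
    """Filter QA-reported issues down to a single category."""
    return [i for i in issues if i["category"] == category]

def category_counts(issues):
    """Tally issues per category, e.g. for sorting in the dashboard."""
    return Counter(i["category"] for i in issues)
```

The per-category counts are also exactly the kind of signal that could be fed back to the agent before the next session starts.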

What other information can we collect without a development SDK? I assume the logs? I’d also check if we can get the UserDefaults and the Keychain values.


Profiles

Additionally, I was wondering whether we should have the concept of “profiles”. For example, if you are interested in taking a “design” or “linguistic” angle on the review, you can select that profile, and we’ll pass additional context to the agent so that it focuses on those dimensions. Or maybe it’s fine to just say focus on everything, since the agent will navigate the app anyway.

That could be an alternative, but it would mean we’d need to have access to the source code. Since the QA feature doesn’t require access to the source code and I’d expect organizations to be more cautious about third parties that gain access to the source code, I would still leave an option to include a description of the project in the dashboard.

Sounds good. I’m wondering whether a QA Run would be more in line with the rest of our dashboard than a session.

I would expect user defaults to be accessible – keychain most likely, too.

Right, this is definitely where the agent will be really useful, such as: “Test this feature with the highest font accessibility settings. Focus on UI issues.”

We can start with ensuring the agent has tools to change the font size or rotate the device – and the LLM should be smart enough to use these tools if they are available.

That makes sense. I didn’t think about the repository access. In that case, yeah, having that context in the DB and providing a simple editing UI sounds like the most sensible step to take.