Thousands of servers hacked due to insecurely deployed Ray AI framework

Researchers warn that thousands of servers have been compromised over the past seven months because Ray, an open-source compute framework used to distribute machine learning and AI workloads, lacks authentication by default. The framework’s developers don’t recognize the lack of built-in authentication as a vulnerability, since it’s an intentional and documented design decision, but this hasn’t stopped organizations from exposing deployments to the internet.

“Thousands of companies and servers running AI infrastructure are exposed to the attack through a critical vulnerability that is under dispute and thus has no patch,” researchers from runtime application security firm Oligo said in a report this week. “This vulnerability allows attackers to take over the companies’ computing power and leak sensitive data.”

So far Oligo has identified compromised servers from organizations in many industry sectors including education, cryptocurrency, biopharma, and video analytics. Many of the Ray servers had command history enabled, meaning attackers could easily discover sensitive secrets that were used in previous commands on those servers.

Ray is often used to run workloads for training, serving, and tuning AI models, and some of those jobs include Python scripts and bash commands that can contain credentials needed to integrate with third-party services. “An ML-OPS environment consists of many services that communicate with each other, inside the same cluster and between clusters,” the researchers said. “When used for training or fine-tuning, it usually has access to datasets and models, on disk or in remote storage, such as an S3 bucket. Oftentimes, models or datasets are the unique, private intellectual property that differentiates a company from its competitors.”
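To illustrate how credentials end up in command history and job entrypoints, the following is a minimal Python sketch that flags secret-looking strings in a list of commands. The patterns, the `find_secrets` helper, and the sample commands are illustrative assumptions rather than anything taken from the Oligo report, and the key shown is the placeholder value AWS uses in its own documentation.

```python
import re

# Illustrative patterns only; real secret scanners use far larger rule sets.
SECRET_PATTERNS = {
    # AWS access key IDs for long-term credentials start with "AKIA".
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Generic "name=value" credentials passed inline on a command line.
    "inline_credential": re.compile(
        r"(?i)\b(password|passwd|token|secret|api[_-]?key)\s*=\s*\S+"
    ),
}


def find_secrets(command: str) -> list[str]:
    """Return the names of all patterns that match the given command."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(command)]


# Hypothetical history entries, similar to what a job entrypoint might contain.
history = [
    "python train.py --epochs 10",
    "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE aws s3 sync s3://example-bucket ./data",
    "python serve.py --registry-url https://example.invalid/models?token=abc123",
]

# Keep only the commands where at least one pattern matched.
flagged = {cmd: hits for cmd in history if (hits := find_secrets(cmd))}

for cmd, hits in flagged.items():
    print(f"{hits}: {cmd}")
```

Anyone who can read job definitions or shell history on a cluster can run the same kind of search, which is why secrets are better injected through a secrets manager than typed into commands.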

An intended feature with security implications

Last year, security researchers from Bishop Fox found and reported five vulnerabilities in the Ray framework. Anyscale, the company that maintains the software, decided to patch four of them (CVE-2023-6019, CVE-2023-6020, CVE-2023-6021, and CVE-2023-48023) in version 2.8.1, but claimed that the fifth one, assigned CVE-2023-48022, was not really a vulnerability and left it unfixed.

That’s because CVE-2023-48022 stems directly from the fact that the Ray dashboard and client API do not implement authentication controls. Any attacker who can reach the API endpoints can submit new jobs, delete existing jobs, retrieve sensitive information, and essentially achieve remote command execution.
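To make the missing control concrete, here is a minimal Python sketch that builds, but does not send, a job submission request of the kind the Ray Jobs REST API accepts. The `/api/jobs/` path and port 8265 reflect Ray's documented defaults as best understood here and should be treated as assumptions. `build_job_request` is a hypothetical helper, and the address is localhost on purpose.

```python
import json
import urllib.request


def build_job_request(dashboard_url: str, entrypoint: str) -> urllib.request.Request:
    """Build (without sending) a job submission request for a Ray dashboard.

    Assumption: jobs are submitted with POST to <dashboard>/api/jobs/ and a
    JSON body whose "entrypoint" field is the shell command to run.
    """
    payload = json.dumps({"entrypoint": entrypoint}).encode("utf-8")
    return urllib.request.Request(
        url=dashboard_url.rstrip("/") + "/api/jobs/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A harmless command aimed at a local test cluster; nothing is sent here.
req = build_job_request("http://127.0.0.1:8265", "echo hello")

print(req.get_method(), req.full_url)
print("Authorization header present:", req.has_header("Authorization"))
```

The request carries no token, password, or client certificate, because the API does not ask for one. Network reachability is therefore the only barrier between a client and command execution on the cluster.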

The problem is that for a framework whose main goal is to facilitate the execution of workloads across compute clusters, “remote command execution” is essentially a feature, and the lack of authentication is also by design. “Due to Ray’s nature as a distributed execution framework, Ray’s security boundary is outside of the Ray cluster,” Anyscale said in its advisory. “That is why we emphasize that you must prevent access to your Ray cluster from untrusted machines (e.g., the public internet). This is why the fifth CVE (the lack of authentication built into Ray) has not been addressed, and why it is not in our opinion a vulnerability, or even a bug.”

The Ray documentation clearly states that “Ray expects to run in a safe network environment and to act upon trusted code” and that it’s the responsibility of developers and platform providers to ensure those conditions for safe operation. However, as we’ve seen with other technologies that lacked authentication by default, users don’t always follow best practices, and insecure deployments make their way onto the internet sooner or later. Anyscale doesn’t want users to rely on a single control like authentication inside Ray instead of isolating the entire framework and its clusters with external controls, but it has nevertheless decided to work on adding an authentication mechanism in future versions.

Insecure-by-default configurations

Until then, however, many organizations are likely to continue to unwittingly expose such servers to the internet because, according to Oligo, many deployment guides and repositories for Ray, including some of the official ones, come with insecure deployment configurations. Misconfigurations are also made easier by the fact that, by default, the Ray dashboard and the Jobs API bind to 0.0.0.0, which means they listen on all available network interfaces on a system, and open port forwarding in the firewall to all of them.
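What a bind address implies can be checked with Python's standard `ipaddress` module. The sketch below uses a hypothetical `classify_bind_address` helper to separate the unspecified address (0.0.0.0, or :: for IPv6), which listens on every interface, from a loopback address, which is reachable only from the same machine. It is an illustration of the concept, not part of Ray.

```python
import ipaddress


def classify_bind_address(host: str) -> str:
    """Describe how widely a service bound to `host` is listening.

    Expects a literal IP address; a hostname such as "localhost" raises
    ValueError because no DNS resolution is attempted.
    """
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:
        # 0.0.0.0 or :: means every network interface on the machine.
        return "all-interfaces"
    if addr.is_loopback:
        # 127.0.0.0/8 or ::1 is reachable only from the local host.
        return "loopback-only"
    # Any other address listens on the one interface that owns it.
    return "single-interface"


for host in ("0.0.0.0", "127.0.0.1", "::", "10.0.0.5"):
    print(host, "->", classify_bind_address(host))
```

A service on "all-interfaces" is reachable from whatever networks the host is attached to, so firewall rules and network isolation become the only protection. Ray's `ray start` command is understood to accept a `--dashboard-host` option for choosing this address; check the documentation for the installed version before relying on it.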

“AI experts are NOT security experts—leaving them potentially dangerously unaware of the very real risks posed by AI frameworks,” the researchers said. “Without authorization for Ray’s Jobs API, the API can be exposed to remote code execution attacks when not following best practices.”
