Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments

Critical severity vulnerability CVE-2024-0132 affecting NVIDIA Container Toolkit and GPU Operator presents high risk to AI workloads and environments.

Executive summary 

Wiz Research has uncovered a critical security vulnerability, CVE-2024-0132, in the widely used NVIDIA Container Toolkit, which provides containerized AI applications with access to GPU resources. This impacts any AI application – in the cloud or on-premise – that is running the vulnerable container toolkit to enable GPU support. 

The vulnerability enables attackers who control a container image executed by the vulnerable toolkit to escape from that container and gain full access to the underlying host system, posing a serious risk to sensitive data and infrastructure. 

On September 26, NVIDIA released a security bulletin along with a patched version of the affected product. Thank you to the entire NVIDIA team that worked with us throughout the disclosure process. We greatly appreciate their transparency, responsiveness, and collaboration during this engagement. 

In this post, we will provide a high-level overview of the discovery and its implications. Given the prevalence and sensitivity of this bug, we will save some of the technical details for a future installment, omitting exploit information for now so that impacted organizations have time to address the vulnerability. 

Organizations using the NVIDIA Container Toolkit are strongly encouraged to update the affected package to the latest version (v1.16.2), prioritizing container hosts that might run untrusted container images.  

Impact 

Wiz Research discovered a container-escape vulnerability (CVE-2024-0132) affecting the widely used NVIDIA Container Toolkit library. It allows an attacker who controls a container image run by the Toolkit to escape that container and gain full access to the underlying host.  

The urgency with which you should fix the vulnerability depends on the architecture of your environment and the level of trust you place in running images. Any environment that allows the use of third-party container images or AI models – either internally or as a service – is at higher risk, given that this vulnerability can be exploited via a malicious image.  

A few illustrative examples: 

  • Single-tenant compute environments: If a user downloads a malicious container image from an untrusted source (as a result of a social engineering attack, for example), the attacker could then take over the user’s workstation. 

  • Orchestrated environments: In shared environments like Kubernetes (K8s), an attacker with permission to deploy a container could escape that container and gain access to data and secrets of other applications running on the same node – or even on the same cluster – thereby affecting the entire environment. 

While the second scenario is applicable to any organization running a shared compute model, it is especially relevant for AI service providers that allow customers to run their own GPU-enabled container images. In this case, the vulnerability becomes even more dangerous. An attacker could deploy a harmful container, break out of it, and use the host machine’s secrets to target the cloud service’s control systems. This could give the attacker access to sensitive information, like the source code, data, and secrets of other customers using the same service. 

Who and what is affected? 

Background: what is NVIDIA Container Toolkit? 

Running GPUs in a shared compute environment allows sharing a single GPU across different workloads and potentially different users. To enable native GPU access from within the container environment, NVIDIA built a set of drivers and tools that are deployed on the container host and integrate with the container runtime. 

The NVIDIA Container Toolkit is the industry standard of this integration, facilitating seamless GPU utilization within containerized environments. In recent years, the toolkit has become increasingly popular, paralleling the explosive growth in AI and container technologies. 

This library is widely adopted as the go-to NVIDIA-supported solution for GPU access within containers. Moreover, it comes pre-installed in many AI platforms and virtual machine images (such as AMIs), as it's a common infrastructure requirement for AI applications. 
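
To illustrate this integration, here is a minimal sketch (not taken from NVIDIA's documentation; the image tag and the use of the Docker SDK for Python are illustrative assumptions) of how a containerized workload typically requests GPU access, which the NVIDIA Container Toolkit then fulfills by wiring the host's GPU devices and driver libraries into the container:

```python
# Illustrative sketch: request all host GPUs for a container, the equivalent of
# `docker run --gpus all ...`. The NVIDIA Container Toolkit's runtime hook is what
# actually exposes the GPUs inside the container. Image tag is a hypothetical example.
import docker

client = docker.from_env()

output = client.containers.run(
    "nvidia/cuda:12.4.1-base-ubuntu22.04",   # hypothetical image tag
    command="nvidia-smi",
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])  # -1 = all GPUs
    ],
    remove=True,
)
print(output.decode())
```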

 The NVIDIA GPU Operator is a Kubernetes operator that automatically deploys and manages the NVIDIA Container Toolkit in Kubernetes clusters. Its widespread adoption in GPU-enabled Kubernetes environments significantly expands the footprint of the NVIDIA Container Toolkit, making it present in more containerized GPU workloads across various organizations. 

Affected Components: 

  • NVIDIA Container Toolkit: All versions up to and including v1.16.1 

  • NVIDIA GPU Operator: All versions up to and including 24.6.1 

Note: The vulnerability does not impact use cases where Container Device Interface (CDI) is used. 

Mitigation

Affected organizations should upgrade to the latest versions of the NVIDIA Container Toolkit (v1.16.2) and the NVIDIA GPU Operator (v24.6.2).
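
As a quick triage aid, the following sketch (assuming the `nvidia-ctk` CLI is installed and on PATH, and that its version output contains a standard `major.minor.patch` string) flags hosts still running a vulnerable Container Toolkit release:

```python
# Minimal version-check sketch; parsing of the CLI output is a simplification.
import re
import subprocess

FIXED_VERSION = (1, 16, 2)  # first patched release per NVIDIA's advisory

def installed_toolkit_version():
    """Parse the semantic version from `nvidia-ctk --version` output."""
    out = subprocess.run(
        ["nvidia-ctk", "--version"], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
    if match is None:
        raise RuntimeError(f"could not parse a version from: {out!r}")
    return tuple(int(part) for part in match.groups())

if __name__ == "__main__":
    version = installed_toolkit_version()
    status = "patched" if version >= FIXED_VERSION else "VULNERABLE (<= v1.16.1)"
    print(f"NVIDIA Container Toolkit {'.'.join(map(str, version))}: {status}")
```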

Patching is highly recommended for container hosts running vulnerable versions of the Container Toolkit, prioritizing hosts that are likely to run containers built from images originating in untrusted sources. Runtime validation can further narrow patching efforts to instances where the toolkit is actually in use, as sketched below. 
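
One simple form of such runtime validation, assuming Docker with its default configuration path (other runtimes such as containerd keep this in their own config files), is to check whether the NVIDIA runtime is actually registered on the host:

```python
# Sketch: detect whether the NVIDIA runtime is wired into Docker on this host.
# The config path is a common default and may differ per distribution.
import json
from pathlib import Path

DAEMON_JSON = Path("/etc/docker/daemon.json")

def nvidia_runtime_configured() -> bool:
    """Return True if the 'nvidia' runtime is registered with Docker on this host."""
    if not DAEMON_JSON.exists():
        return False
    config = json.loads(DAEMON_JSON.read_text())
    return "nvidia" in config.get("runtimes", {}) or config.get("default-runtime") == "nvidia"

if __name__ == "__main__":
    if nvidia_runtime_configured():
        print("NVIDIA runtime configured -- prioritize this host for patching")
    else:
        print("NVIDIA runtime not found in the Docker daemon config")
```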

Note that Internet exposure is not a relevant factor for triaging this vulnerability, as the affected container host does not need to be publicly exposed in order to load a malicious container image. Instead, initial access vectors may include social engineering attempts against developers; supply chain scenarios such as an attacker with prior access to a container image repository; and containerized environments allowing external users to load arbitrary images (whether by design or due to a misconfiguration). 

Why research the NVIDIA Container Toolkit? 

In the course of our work investigating AI service providers (Hugging Face, Replicate, SAP AI Core, and others), Wiz researchers have identified that these providers tend to run AI models and training procedures as containers in shared compute environments, where multiple applications from different customers share the same GPU device. This insight raised an interesting research question: Could the shared GPU device potentially allow access to the AI models, prompts, or datasets of other customers? This led us to investigate NVIDIA’s Kernel modules, SDK, and runtime tools. 

When we encountered the NVIDIA Container Toolkit, we discovered a wide attack surface for container breakout vulnerabilities, which could potentially allow us to escape our container in the service and access the data of other customers sharing the same GPU resources. This discovery led us to set aside our GPU-focused research and dive deeper into the helper tools NVIDIA provides to its customers. 

The attack flow 

The attack has three main stages: 

  1. Creating a malicious image: The attacker crafts a specially designed image to exploit CVE-2024-0132. (Note: Specific technical details about exploiting this vulnerability are not provided at this stage, for the reasons mentioned earlier.)

  2. Gaining full access to the file system: The attacker runs the malicious image on the target platform. This can be performed either directly (for example, in services allowing shared GPU resources) or indirectly through a supply chain or social engineering attack (e.g., a user running an AI image from an untrusted source). By exploiting the vulnerability, the attacker gains the ability to mount the entire host file system, obtaining full read access to the underlying host. This gives the attacker full visibility into the underlying infrastructure and potentially allows access to other customers' confidential data. 

  3. Complete host takeover: With this access, the attacker can now reach the container runtime's Unix sockets (docker.sock/containerd.sock). These sockets can be used to execute arbitrary commands on the host system with root privileges, effectively taking control of the machine; this is a well-known attack path for containerized systems (a benign check for reachable runtime sockets is sketched after this list). Note that while the vulnerability initially grants only read access to the filesystem, an attacker can exploit a nuance in Unix socket behavior: in Linux, sockets remain writable even when mounted with read-only permissions.   
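
For defenders, a benign way to illustrate the runtime-socket exposure described in step 3 is to check, from a given context, whether these sockets are even reachable. The socket paths below are common defaults (an assumption), and the sketch performs no exploit, only a connect attempt:

```python
# Defensive sketch: report whether container runtime sockets are reachable from the
# current context. Inside a container, a reachable socket is the pivot described above.
import socket
from pathlib import Path

RUNTIME_SOCKETS = [
    "/var/run/docker.sock",               # Docker
    "/run/containerd/containerd.sock",    # containerd
]

def socket_reachable(path: str) -> bool:
    """Attempt a plain connect to a Unix socket; no commands are sent."""
    if not Path(path).exists():
        return False
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(1)
            sock.connect(path)  # connecting works even if the path was mounted read-only
        return True
    except OSError:
        return False

if __name__ == "__main__":
    for path in RUNTIME_SOCKETS:
        print(f"{path}: {'REACHABLE' if socket_reachable(path) else 'not reachable'}")
```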

Disclosure timeline 

  • September 1, 2024 – Wiz Research reports the vulnerability to the NVIDIA Product Security Incident Response Team (PSIRT). 

  • September 3, 2024 – NVIDIA acknowledges the report. 

  • September 26, 2024 – NVIDIA fixes the reported vulnerability and ships a patched version. 

Key Takeaways  

This vulnerability once again highlights that the real and immediate security risk to AI applications today comes from the AI infrastructure and tooling underneath them.  

While the hype concerning AI security risks tends to focus on futuristic AI-based attacks, “old-school” infrastructure vulnerabilities in the ever-growing AI tech stack remain the immediate risk that security teams should prioritize and protect against. 

This practical attack surface is the result of the fast-paced introduction of new AI tools and services, and hence it is vital that security teams work closely with their AI engineers, gaining visibility into the architecture, tooling, and AI models used. Specifically, as we see in the case of this vulnerability, it is important to build a mature pipeline for running AI models with full control over the source and integrity of the models themselves. 
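
As a small illustration of that kind of control over model integrity, the following sketch (the file path and digest are hypothetical placeholders) gates a model artifact on a pinned SHA-256 digest before it enters the pipeline:

```python
# Sketch of a simple integrity gate for model artifacts; path and digest are placeholders.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file and return its hex-encoded SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model_artifact(path: Path, expected_sha256: str) -> None:
    """Refuse to use a model file whose digest does not match the pinned value."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"integrity check failed for {path}: got {actual}")

# Hypothetical usage:
# verify_model_artifact(Path("models/model.bin"), "<pinned sha256 digest>")
```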

Additionally, this research highlights, not for the first time, that containers are not a strong security barrier and should not be relied upon as the sole means of isolation. When designing applications, especially multi-tenant applications, we should always “assume a vulnerability” and ensure at least one strong isolation barrier, such as virtualization (as explained in the PEACH framework). Wiz Research has written about this issue extensively; you can read more in our previous research blogs on Alibaba Cloud, IBM, Azure, Hugging Face, Replicate, and SAP.

Note that this blog post omits some technical details. In short order we will publish a “part two” that shares more technical information related to this discovery. We are holding off on disclosing those details for the time being in order to give organizations time to evaluate and mitigate this vulnerability in their environments.  
