Every few months, it seems, a company has a security incident in which users are shown another user’s account when they log in. Many of these incidents are assumed to be due to web caching misconfigurations. While some may be easily understood by software engineers in hindsight, others have been caused by request collapsing, a feature of caching services that can result in unexpected behavior. Even when the HTTP header Cache-Control: no-cache is used, this form of caching can still occur, which may result in sensitive data destined for one user being returned to multiple other users.
In this post, I will explore this problem and explain how to avoid it.
What is web caching?
Web caching is an optimization technique for avoiding transferring the same data multiple times, often to reduce latency, or for avoiding generating the same response to equivalent requests over and over. This may mean keeping a copy of a file on a server near a major city so that requests from that city don’t all have to travel to a server that is potentially on another continent, or it could mean keeping a copy of a response to a request that involves a lot of work to prepare.
For example, the homepage for a web server might be hosted out of one AWS region, such as us-east-1, but Amazon CloudFront might be used to cache parts of that homepage in its 600+ locations. This works great when a request arrives for /images/logo.png, as all users should see the same thing. But when a request comes in for /api/user_account_details, the response should be different for each user, so it should not be cached.
How does this go wrong?
I’ve been keeping track of security incidents that match the symptoms of a web caching problem (link), with 14 incidents recorded, of which 11 have happened in the past 4 years. The evidence is frequently just people posting on Reddit that when they log in to their accounts, they see what looks like someone else’s account, so I can’t verify with certainty that these incidents are due to web caching. However, sometimes companies acknowledge the incident, and sometimes they even publish post-mortems that identify the specific cause. Two of those incidents specifically named the unexpected behavior of request collapsing as the root cause (1, 2).
Cache policies define what content to cache, using a cache key to determine when to return the same response. By default, CloudFront’s cache key is the domain name and the URL path. The cache policy then defines which cache keys should result in caching. In our previous example, requests for /images/logo.png should be cached, while requests matching /api/* should not.
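To make this concrete, here is a minimal sketch in Python of how a cache key and a cache policy interact. It is purely illustrative: the function names are mine, and a real CloudFront cache policy is configuration on the distribution, not code.

```python
from urllib.parse import urlsplit

# Illustrative only: CloudFront's default cache key is the domain name
# plus the URL path.
def cache_key(url: str) -> tuple[str, str]:
    parts = urlsplit(url)
    return (parts.netloc, parts.path)

# The policy decides which cache keys may be cached; here anything
# under /api/ is excluded.
def is_cacheable(key: tuple[str, str]) -> bool:
    _, path = key
    return not path.startswith("/api/")

assert is_cacheable(cache_key("https://example.com/images/logo.png"))
assert not is_cacheable(cache_key("https://example.com/api/user_account_details"))
```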
Web caching misconfigurations are difficult to test for because you need to use multiple users and know which content should differ for each of them. The timing between requests must also be considered. Cache policies are difficult to check programmatically because a cache policy that is correct for one application is likely wrong for another.
Here is a simple example of how a cache works, where the first request results in a copy of the response being cached, so that the second request does not need to go to the origin (the web server).
In the above diagram, if a request for /api/user_account_details were cached and returned to multiple users, you would have a security incident, because sensitive data for the first user would be sent to the second user. To prevent that from happening, the cache policy excludes certain cache key patterns from being cached.
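In code form, that flow might look roughly like the following sketch, using a hypothetical in-memory cache in front of an origin function, with /api/ paths excluded from caching:

```python
# Hypothetical in-memory cache in front of an origin function.
cache: dict[str, bytes] = {}

def fetch_from_origin(path: str) -> bytes:
    return f"response for {path}".encode()  # stand-in for the real origin

def handle_request(path: str) -> bytes:
    if path.startswith("/api/"):        # excluded by the cache policy
        return fetch_from_origin(path)
    if path in cache:                   # second request: served from the cache
        return cache[path]
    response = fetch_from_origin(path)  # first request: goes to the origin
    cache[path] = response              # keep a copy for later requests
    return response
```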
Often the origin web server will include headers on its responses that define how they should be cached, so that developers can maintain the code and the cache directives in one place. The origin might therefore return a response that tells the cache server Cache-Control: no-cache or similar guidance. CloudFront will follow this directive and not cache the file, but if request collapsing occurs, the header is effectively ignored.
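As an illustration of keeping the directives with the code, a Flask origin might mark a per-user endpoint as uncacheable like this (the endpoint and response fields are made up for the example):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/user_account_details")
def user_account_details():
    resp = jsonify({"user": "alice", "balance": 42})  # per-user data
    # Tell any downstream cache not to reuse this response.
    resp.headers["Cache-Control"] = "no-cache"
    return resp
```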
What is request collapsing?
Request collapsing, sometimes also referred to as “request coalescing”, occurs when more than one request for the same cache key arrives before the response to the first request has been returned. The cache server recognizes that multiple identical requests could overwhelm the origin, so instead of forwarding each request as it arrives, it waits for the response to the first request. Once that response comes in, it sends it to all of the requestors.
For many, this aligns with expectations: it prevents thundering herd problems that could overwhelm the origin, which is largely the benefit of using a cache server in the first place. Where it becomes unexpected is in how Amazon CloudFront handles the cache policy under these circumstances. Most would expect that if the response contains Cache-Control: no-cache, only the original requestor should receive that response, and the delayed requests should then go to the origin. Instead, the delayed requests receive the same response that wasn’t supposed to be cached! This is explained in the AWS documentation here.
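Here is a minimal single-process sketch of that behavior. It is my own illustration, not CloudFront’s implementation: concurrent requests for the same key wait on the first in-flight fetch and all receive its response, even though that response carries Cache-Control: no-cache.

```python
import threading

in_flight: dict[str, threading.Event] = {}
results: dict[str, tuple[dict, bytes]] = {}
lock = threading.Lock()

def fetch_from_origin(key: str) -> tuple[dict, bytes]:
    # Stand-in for the origin: user-specific data that should never be shared.
    return {"Cache-Control": "no-cache"}, f"data for the first requester of {key}".encode()

def collapsed_get(key: str) -> tuple[dict, bytes]:
    with lock:
        event = in_flight.get(key)
        leader = event is None
        if leader:
            event = threading.Event()
            in_flight[key] = event
    if leader:
        # First request for this key: go to the origin.
        results[key] = fetch_from_origin(key)
        with lock:
            del in_flight[key]
        event.set()
        return results[key]
    # Collapsed request: wait for and reuse the leader's response,
    # despite the no-cache header. This is the pitfall.
    event.wait()
    return results[key]
```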
How can you avoid these incidents?
To avoid request collapsing on requests that should not be cached by Amazon CloudFront, you have the following choices:
Use the managed cache policy “CachingDisabled”, which, as its name implies, disables all caching for the paths matched by the associated cache behavior (see the sketch after this list).
Set the minimum TTL for the cache behavior to 0 AND configure the origin to send an HTTP header such as Cache-Control: no-cache for each object that should not be cached. You must do BOTH. If you rely on the HTTP header alone, your cache policy will appear to function correctly for many caching-related tests, until you make simultaneous requests that result in request collapsing. This is the most important part of the blog post to note, as it is unexpected to many.
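For the first option, here is a minimal boto3 sketch that looks up the managed policy’s ID. It assumes the policy’s name as listed by the API contains “CachingDisabled”, and it elides the distribution update itself:

```python
import boto3

cloudfront = boto3.client("cloudfront")

# Look up the ID of the managed "CachingDisabled" cache policy.
items = cloudfront.list_cache_policies(Type="managed")["CachePolicyList"].get("Items", [])
caching_disabled_id = next(
    item["CachePolicy"]["Id"]
    for item in items
    if "CachingDisabled" in item["CachePolicy"]["CachePolicyConfig"]["Name"]
)

# Attach this ID as the CachePolicyId of the cache behavior that matches
# the paths that must never be cached (e.g. /api/*). Updating a distribution
# means round-tripping the full config with get_distribution_config /
# update_distribution, which is elided here.
print(caching_disabled_id)
```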
When testing your cache, you should make requests from different users not only sequentially but also simultaneously. A cache and the origin server are tightly intertwined, so changes to either could result in caching working incorrectly and could potentially lead to a security incident. For example, if the cache policy prevented caching of /api/, but the origin changed the APIs to use the path /apiv2/, then sensitive data might be cached and a security incident could occur. For additional best practices on using CloudFront, refer to the AWS documentation.
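As an illustration, a test along the following lines fires requests for the same path as two different users at the same moment and fails if one user receives the other’s body. It is only a sketch: the URL and session tokens are placeholders for your application’s own.

```python
import concurrent.futures

import requests

URL = "https://example.com/api/user_account_details"  # placeholder
TOKENS = {"alice": "token-a", "bob": "token-b"}       # placeholder credentials

def fetch(user: str) -> str:
    resp = requests.get(URL, headers={"Authorization": f"Bearer {TOKENS[user]}"})
    return resp.text

# Fire both requests as close to simultaneously as possible, so the second
# arrives while the first is still in flight and may be collapsed.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = {user: pool.submit(fetch, user) for user in TOKENS}
    bodies = {user: f.result() for user, f in futures.items()}

assert bodies["alice"] != bodies["bob"], "possible request collapsing / cache leak!"
```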