Cloudbleed Retrospective
Notes from the last few days of incident response
These are my notes from helping companies respond to the Cloudbleed incident over the last few days. It was a highly unusual incident response effort that caused work for security teams globally.
What happened?
Between late September 2016 and February 2017, Cloudflare had a bug that caused random memory to leak into HTTP responses. This resulted in leaked sessions, passwords, and web content being stored and cached by mass crawlers, notably search engines (first noticed by Google).
What was the window of exposure?
Varying information about the exposure window circulated. It was best interpreted as a “full window” (September 22 until February 18, according to Google) and a “high impact” window (February 13 until February 18, according to Cloudflare).
The greatest period of impact was from February 13 to February 18, with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests). (link)
Tavis pointed out on Twitter a longer timeframe over which this behavior may have exhibited itself. It’s unknown how significant this longer tail is, so it’s probably better to use Tavis’s longer timeframe when assessing what you may have leaked.
(Note: this window now appears in the Cloudflare blog in their timeline.)
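For a rough sense of scale, a back-of-envelope calculation helps; the 1-in-3,300,000 rate is Cloudflare’s figure for the high-impact window only, and the request volume below is a made-up illustration, not real data.

```python
# Back-of-envelope exposure estimate. The leak rate is Cloudflare's
# published figure for the high-impact window (Feb 13-18); the request
# volume is hypothetical, not real traffic data.
LEAK_RATE = 1 / 3_300_000                # ~0.00003% of requests

requests_during_window = 50_000_000      # hypothetical traffic through Cloudflare
expected_leaky_responses = requests_during_window * LEAK_RATE
print(f"Responses potentially containing leaked memory: {expected_leaky_responses:.1f}")
# -> roughly 15 for this hypothetical volume
```

The longer timeframe only raises that number, which is part of why the conservative read later in these notes is to assume you were impacted.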
What was leaked?
Both the volume and the content of the leaks were very much estimations drawn from samples. The Cloudflare blog was heavily reliant on data and investigations from search engine companies:
The infosec team worked to identify URIs in search engine caches that had leaked memory and get them purged. With the help of Google, Yahoo, Bing and others, we found 770 unique URIs that had been cached and which contained leaked memory. Those 770 unique URIs covered 161 unique domains. The leaked memory has been purged with the help of the search engines. (link)
And a “150 customers impacted” metric began floating around security teams, likely from this quote:
Prince: Yeah, there were about 150 websites that we were able to discover had some private data that was indexed by Google. (link)
These samples don’t actually matter much, except for establishing a minimum numerator for how much we know has been leaked and captured. The denominator, the total amount actually leaked and captured, will be hugely elusive when determining impact.
Here’s why:
There is an incredible number of non-search-engine crawlers that scour the web. One should assume we have no clue what they’re up to, how long they’re caching data, who can access that data, and so on. So it should not be assumed that search engines can protect this data by purging it, unless you have an omniscient perspective on what is crawled and cached on the internet, and by whom.
This led to an incredible amount of frustration for incident responders, who would much prefer to respond to specific facts instead of fluctuating, broad estimations on what may have leaked.
Was *I* impacted?
A goal for an incident responder is to determine if, and how, they were specifically impacted by an incident. Misinformation spread around several IR groups.
The statement below was relayed across blogs, Slack channels, and responders:
Three of Cloudflare’s features (email obfuscation, Server-side Excludes and Automatic HTTPS Rewrites) were not properly implemented with the parser, causing random chunks of data to become exposed. (link)
This was widely misinterpreted by security teams.
IR teams were basing their impact assessments on whether they used these features. These features were irrelevant. The platform spewed random data across the Cloudflare user base, whether you used these features or not.
As an example, while your configuration may not have been leaky, another customer using such a Cloudflare configuration would have been leaking your data. This is the nature of a memory leak on a shared platform.
Worse, at some point, this incorrect impact advice became part of Cloudflare messaging through their support channels and customer phone calls.
Once a few teams caught word of this, it seems to have been given as feedback to Cloudflare and was corrected over the course of maybe a few hours, with follow-up calls from engineers.
A critical eye from an incident response team would have noticed this inconsistency, due to the blog post clearly stating:
Because Cloudflare operates a large, shared infrastructure an HTTP request to a Cloudflare web site that was vulnerable to this problem could reveal information about an unrelated other Cloudflare site.
Additionally, Cloudflare has been directly messaging customers as to whether they have been impacted or not. This is likely based on whether their data appeared in sample leaks from search engines, and should not have been taken as a complete “all clear” for an IR team.
Remember, you have no clue how much crawling and caching has taken place over this timeframe, outside of what search engines have done.
If you were a Cloudflare user, the safest assumption to make was that you were impacted. This would be the expectation of your customers, who are clearly able to see that you are using Cloudflare. The only exception is if you literally have no sensitivity in your content, like a static Jekyll site.
What do I do if I was a Cloudflare customer?
First, you have to make the assumption that sessions leaked. These would have appeared in HTTP headers and web content, which were directly identified as forms of leaked data in Cloudflare’s and Google’s analysis. Destroying sessions and forcing a re-authentication would help mitigate this. If you deliver other secret tokens or sessions in HTTP responses, you should consider those too.
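For illustration, here is a minimal sketch of a bulk session purge, assuming a hypothetical app that keeps sessions in Redis under a “session:” key prefix; the store, prefix, and connection details are all assumptions and will differ per stack.

```python
# Sketch: bulk-invalidate sessions so every user is forced to
# re-authenticate. Assumes a hypothetical app that keeps sessions in
# Redis under a "session:" key prefix; adapt to your real session store.
import redis

r = redis.Redis(host="localhost", port=6379)

def destroy_all_sessions(prefix="session:", batch=1000):
    cursor = 0
    while True:
        # SCAN iterates the keyspace without blocking Redis the way KEYS would.
        cursor, keys = r.scan(cursor=cursor, match=prefix + "*", count=batch)
        if keys:
            r.delete(*keys)
        if cursor == 0:
            break

destroy_all_sessions()
```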
Second, you have to consider whether you should force a password reset for users who registered or authenticated within the windows of exposure. A forced password reset is a highly aggressive mitigation, and it may make more sense to advise users to reset these passwords themselves. If you are able to identify at-risk users through these windows of exposure, you may be able to limit your messaging to that set instead of notifying a whole user base that may not have exposure.
If every impacted website forced a password reset today, it would be a nightmarish experience for end users everywhere. This leans me further toward pushing users to do this themselves with a notification.
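If you do try to scope the notification, the query itself is simple; here is a sketch, assuming a hypothetical users table with created_at and last_login_at columns stored as ISO-8601 text, and using the full September 22 to February 18 window to be conservative.

```python
# Sketch: scope the reset notification to users who registered or
# authenticated during the exposure window. The "users" table, column
# names, and ISO-8601 text timestamps are assumptions about a
# hypothetical schema.
import sqlite3

WINDOW_START = "2016-09-22"
WINDOW_END = "2017-02-19"    # upper bound chosen to cover all of Feb 18

conn = sqlite3.connect("app.db")
at_risk = conn.execute(
    """
    SELECT id, email FROM users
    WHERE created_at BETWEEN ? AND ?
       OR last_login_at BETWEEN ? AND ?
    """,
    (WINDOW_START, WINDOW_END, WINDOW_START, WINDOW_END),
).fetchall()

for user_id, email in at_risk:
    # Placeholder for whatever notification pipeline you actually use.
    print(f"Would prompt a password reset for {email} (user {user_id})")
```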
Third, you may want to have a plan in the event that sensitive content (like a private message) may have leaked. This will be near impossible to prove or find examples of, but you can see an example of this leaked content in the form of an OKCupid message in the Google P0 thread. Because of the significant amount of attention on this incident, it’s likely that more examples will be found. Large scrapers like the Wayback Machine may hold examples of leaked content, as may other scrapers that no one has thought to check yet.
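One concrete place to start is the Internet Archive’s CDX API, which lists captures of a domain over a date range; a sketch follows, where example.com is a placeholder and this covers only one well-known scraper among many.

```python
# Sketch: list Wayback Machine captures of your domain during the
# exposure window, as one starting point for hunting cached leaked
# content. "example.com" is a placeholder.
import requests

resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",
        "from": "20160922",
        "to": "20170218",
        "output": "json",
        "limit": "200",
    },
    timeout=30,
)
# The CDX API returns an empty body when there are no captures.
rows = resp.json() if resp.text.strip() else []
# The first row of the JSON output is a header row.
for row in rows[1:]:
    timestamp, original_url = row[1], row[2]
    print(timestamp, original_url)
```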
Lastly, offering a multifactor option to users helps keep a segment of your user base calm. Suggesting to your user base that strong, unique-per-website passwords are favored over passwords shared across sites will help you fall in line with the rest of the web’s messaging around this.
What do I do if I was *not* a Cloudflare customer?
You should consider any partnerships, vendors, cloud products and other integrations you may have with companies that use Cloudflare, and check in on their response.
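A rough way to triage that list is to look for Cloudflare-associated response headers; here is a sketch with placeholder vendor domains, noting that the absence of these headers is not proof a vendor is unaffected.

```python
# Sketch: a rough triage of which vendors sit behind Cloudflare, based
# on response headers. The vendor domains are placeholders, and a
# missing header is not proof a vendor was unaffected (they may front
# only some endpoints with Cloudflare).
import requests

vendors = ["vendor-one.example", "vendor-two.example"]  # placeholders

for domain in vendors:
    try:
        resp = requests.get(f"https://{domain}", timeout=10)
    except requests.RequestException as exc:
        print(f"{domain}: request failed ({exc})")
        continue
    server = resp.headers.get("Server", "")
    if resp.headers.get("CF-RAY") or "cloudflare" in server.lower():
        print(f"{domain}: looks Cloudflare-fronted (Server={server!r})")
    else:
        print(f"{domain}: no obvious Cloudflare headers")
```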
Developer integrations or automation you may have with companies using Cloudflare are at risk of breaking, should any of your dependencies rely on an API fronted by Cloudflare. As an example, a developer integration you use may suddenly force a rotation of its API keys if the company is feeling risk averse.
Most companies will likely send a warning message and request that you change these proactively.
Additionally, you may want to roll these secrets yourself, even if you’re not messaged about it.
What should I do, personally?
There is wide, far-reaching messaging from social media and tech blogs telling individuals to “reset all your passwords.” This doesn’t hurt, but it’s only really necessary for sites that use Cloudflare. It’s not practical to expect the public to filter this recommendation down to Cloudflare sites only, so maybe it’s best not to qualify what should be reset and to suggest doing it all.
At the moment, the assessment is that these leaks will be incredibly hard to target attacks with. This is up for some debate, and dependent on what we learn about where this data was accessed at scale by crawlers less obvious than search engines. It’s widely known that security companies and others do a significant amount of this kind of crawling.
It’s really not harmful to change the passwords you’re most concerned about. Multifactor authentication is proven over and over again to be a good idea, and it would help mitigate any damage to someone whose data was leaked in this event.
What if I am Cloudflare?
Cloudflare has already fixed this specific issue. They were fast and thorough in communication and incident response. Most of the public criticisms are nitpicks and complaints made in a panic. They’ve already suggested a fuzzing-driven audit of this area of their application, and I imagine they’ll also explore other, infrastructure-level ways to limit the risk of memory leaks in HTTP responses.
My critique would be that the impact assessments based on sample data from search engines, the customer calls about vulnerable configurations, and the cache scrubbing were a bit misleading, but forgivable considering the speed of their overall response.
However, a conservative approach to risk wouldn’t have really taken these tidbits into account anyway, since a response team would have likely started aggressive mitigations instead of looking for reasons to limit their response. It’s always better to have some kind of data, though.
That leaves my critiques on their response as very minor. Pretty good job, Cloudflare, and I hope it’s at least a few years before we see something of this size again.
@magoo
I write security articles on Medium.