Revisiting the Bluekeep Forecast

A follow-up retrospective on Forecasting Bluekeep

Aug 1, 2019

Any industry that values prediction needs to be rigorous in understanding prediction failures. This applies to every industry concerned with risk, from meteorology to the intelligence community, and it includes us in cyber security as well.

Sometimes meteorologists forecast events that don’t materialize. Or they fail to adequately convey threats that do materialize, so no one acts. They call these busts.

High profile busts attract lots of retrospection. We should do this, too!

Busted forecasts are the raw materials for systematic improvement of risk measurement. Our profession tries to predict unwanted events. I use a repeatable and testable process for this that allows me to relay subjective predictive statements like the following on behalf of a large panel of security professionals who participate.

Reducing Error in Forecasts

All forecasts that are not 100% or 0% will have error. It’s important to note that uncertain statements are expected to be somewhat wrong in the first place. We need to invest our time in finding where the error came from, whether we have a little or a lot.

This forecast ended up with lots of error.

Here’s the panel data, for those still following from the previous essay:

With the advantage of hindsight, putting 100% on “In August, or later, if at all” was the optimal answer. Instead, we only felt comfortable assigning 27.85% to this outcome, which is close to total uncertainty (25%). The calibrated panel did worse (22%).

The whole panel gets a Brier score of 0.7061 (calibrated: 0.8133). That’s only a little bit better than what agnostic guessing (🤷) would receive (0.75). Lower is better, so the calibrated panel scored worse than guessing.
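For readers who want to check the 0.75 baseline, here is a minimal sketch of the multi-category Brier score. It assumes the forecast had four mutually exclusive options, which is what the 25% and 0.75 figures imply, and that the last option is the one that occurred. The function name `brier_score` is my own for illustration. The panel figures above may also be averages of individual scores, so this sketch does not reproduce them.

```python
def brier_score(forecast, outcome_index):
    """Multi-category Brier score: sum of squared errors against what happened.

    forecast: probabilities for each option (must sum to 1).
    outcome_index: position of the option that actually occurred.
    0.0 is a perfect forecast; higher is worse.
    """
    if abs(sum(forecast) - 1.0) > 1e-9:
        raise ValueError("forecast probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


# Assumption: four options, with the last one being
# "In August, or later, if at all" (the outcome in hindsight).
agnostic = [0.25, 0.25, 0.25, 0.25]
perfect = [0.0, 0.0, 0.0, 1.0]

print(brier_score(agnostic, 3))  # 0.75
print(brier_score(perfect, 3))   # 0.0
```

Spreading belief evenly across four options always scores 0.75 no matter which option occurs, which is why it works as the "shrug" baseline.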

What information would have helped us get close to 100%? How do we get better at finding it?

We should retrospective our BlueKeep predictions.

We’re beyond the window in which we expected to see BlueKeep in the wild, but I (and most others) still think its appearance is inevitable. It’s possible to decouple the unwanted event from the prediction and revisit the analysis and decisions we made ahead of the event.

It’s important to systematically improve our ability to produce narrower predictions surrounding high risk vulnerabilities and their ability to go “wild”, whether we see them materialize or not.

There are two categories of effort we need to undertake to truly understand how we can improve our predictions of imminent and large scale events.

First: The prediction.
What information did we need to increase our certainty of an unwanted outcome?

Second: The decisions.
Did the industry act properly with the information we had?

What I want to avoid is opting out of analysis by repeatedly declaring that this is just an uncertain space. That’s known! That’s the job. Our job is to increase certainty in an uncertain space. Our efforts influence uncertainty.

Where can we track down areas for improvement?

First: Improving the prediction

I want our industry subject matter experts to take the following areas as a task and rigorously explore how we could do better from their worldview of the industry.

What data should we have taken more seriously?
What data was not available… but could have been?
What data wasn’t as valuable as we thought it was?
What data do we wish was available from others in the industry who are not producing it or making it available?

Malware: We closely compared a potential BlueKeep malware campaign to EternalBlue’s relationship with WannaCry. The EternalBlue leak (April 14, 2017) preceded the WannaCry attacks (May 12, 2017) by about a month. We’re a bit over two months out, now. So, what’s different?

What are the explainable characteristics of BlueKeep that might influence malware adoption?
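The timeline comparison above is simple date arithmetic, sketched below. The EternalBlue and WannaCry dates come from this essay. The BlueKeep start date is my assumption: I use May 14, 2019, the day Microsoft patched CVE-2019-0708, as the disclosure date, and this essay’s publication date as “now”.

```python
from datetime import date

# Dates stated in the essay.
eternalblue_leak = date(2017, 4, 14)
wannacry_attacks = date(2017, 5, 12)

# Assumption: BlueKeep "disclosure" is the CVE-2019-0708 patch release.
bluekeep_disclosure = date(2019, 5, 14)
essay_published = date(2019, 8, 1)

eternalblue_gap = (wannacry_attacks - eternalblue_leak).days
bluekeep_elapsed = (essay_published - bluekeep_disclosure).days

print(eternalblue_gap)   # 28 days from leak to WannaCry
print(bluekeep_elapsed)  # 79 days since disclosure, and counting
```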

Exploit: Public exploit knowledge for BlueKeep seemed to spread more slowly and sporadically than it did for EternalBlue. It didn’t drop all at once. There seemed to be meaningful differences in exploitation criteria. Did we underestimate any exploit development barriers? Is secrecy a meaningful factor? How is this different from EternalBlue? How much does it matter?

Compared to other vulnerabilities that have found their way into the wild, what are the most significant characteristics of BlueKeep’s disclosure that impact its timeline?

Economic: Is anything substantially different in malicious incentives to research, develop, and exploit this vulnerability? Has this contributed to the slowness of its development? Is ransomware still paying off? Does that impact the extent of in the wild attacks?

How can we better understand if economic factors have a substantial effect on in the wild behavior?

Internet Vulnerability: There seemed to be substantial exposure to exploitation, and it was diminishing only slowly. Did this contribute meaningfully to the velocity of “in the wild” events after disclosure?

Are there better ways to model exposure as it relates to “in the wild” behavior?

Risk Measurement: There may be significant bias in the structure of my forecast itself. It’s designed to represent a now, soon, or later approach to exploitation. Ideally, I want more information to make these options even more narrow.

I can speak for this. I want more organizations to gather and publish security event data openly. It is critical risk measurement infrastructure. Like this and this. It would be a lot easier to structure forecasts knowing how fast bugs go from disclosure to the wild.

Second: Improving decisions with information

Our informative data (% belief of “in the wild exploitation”) leads to normative declarations (“We should patch ASAP”).

I don’t think we failed on decisions. The industry screamed “patch” and there’s still a possibility that this will be exploited in the future. BlueKeep is not a storm that blows over; we’re still not out of the woods.

Alternatively, by finding and understanding our sources of uncertainty, we can refine our certainty about in the wild activity for future vulnerabilities.

That’s what we want. But, it’s hard to estimate what information is in the most demand without a retrospective discussion.

Going forward

I think that the wide scale warnings and journalism surrounding BlueKeep exploitation were appropriate given the uncertain information available.

It makes me wonder: what kinds of awesome, proactive work could we do if we could develop higher certainty for extremely close range events? For instance, if we can become sophisticated enough to hit >90% beliefs for imminent events… what sorts of emergency mitigations would we unlock as an industry or organization?

In the past, we’ve seen domain name industry cooperation around automated Conficker takedowns, ISP containment around worms and mailers, etc. These become easier to organize if the outcomes of these efforts are near certain. It might become more reasonable for leadership to take the typically unusual step of taking things offline before patching, akin to an evacuation which is only done with reasonable certainty.

We stand a better chance against large scale disasters if we retrospective our previous predictions and improve on them. We also avoid burnout and can kill “chicken little” predictions where we end up sounding alarms too frequently.

We’ll get better with regular lookbacks and I think BlueKeep is a good one to do.