Scoring a risk forecast
Quantitative measurement of wrong-ness
I recently organized a panel that issued a risk forecast about the Bluekeep vulnerability and the probability that it will be exploited in the short term.
This forecast will soon have to be scored to determine how wrong the panel was. Spoiler: it will be wrong; it’s just a matter of how wrong.
I will describe a couple of the approaches used to score and interpret forecast measurements. To keep things simple, we’ll use the most familiar forecast we know… tomorrow’s weather.
This scenario will serve as our example for the rest of the essay:
It will rain in downtown San Francisco on June 19th, 2019.
Let’s say our forecast on the 18th was 1% Yes. Today is after the 19th, and it didn’t rain on the 19th. So, we’re scoring our forecast on the 18th with today’s knowledge.
First, we start with our “Brier Score”.
1. Calculating and understanding a Brier Score
A Brier Score allows us to measure and monitor the error of our forecasts. It’s described as the sum of the squared error of outcomes (simple calculator math).
We can look at the outcomes in our forecast as 1% Yes and 99% No.
Additionally, we know that Yes was False because it didn’t rain, leaving No as the True outcome. This gets coded as a 1 (correct outcome) or 0 (incorrect outcome) in our scoring method.
It can be confusing to juggle “Yes and No” and “True and False” at the same time. Here’s a visual to help:
I ultimately memorize it as:
(outcome - belief)² + … = Brier Score
Append the squared error of as many outcomes as you are tracking and sum them up.
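Here’s a minimal Python sketch of that sum, scored against our rain example (the function name is just for illustration, not a standard library call):

```python
def brier_score(beliefs, outcomes):
    """Sum of squared error between stated beliefs and coded outcomes."""
    return sum((outcome - belief) ** 2 for belief, outcome in zip(beliefs, outcomes))

# Our June 19th rain forecast: 1% Yes, 99% No.
beliefs = [0.01, 0.99]

# It didn't rain, so Yes is coded 0 and No is coded 1.
outcomes = [0, 1]

# (0 - 0.01)^2 + (1 - 0.99)^2 = 0.0002
print(brier_score(beliefs, outcomes))
```

For comparison, a confident 99% Yes on that same day would have scored (0 - 0.99)² + (1 - 0.01)² = 1.9602.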
Brier Score and Intuition
A lower score is better. The more wrong a forecast is, the higher the Brier Score will be. We want the scores of any forecast source (person, machine, panel, etc.) to progressively shrink over time, showing improvement in our methods.
A perfect score is 0. A total bust is 2.
The first simple benchmark is to be better than random guessing, just like a coin flip. Completely uncertain strategies have a Brier Score that we can benchmark quantitatively. It looks like this:
In this two-outcome forecast, you want a forecast source to at least perform better than a 0.5 score on average. Otherwise you could have relied on a coin instead. This benchmark changes with the number of outcomes. For instance, the indifference benchmark for four outcomes is 0.75.
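As a sketch of where those benchmark numbers come from: an indifferent forecast spreads belief evenly across all outcomes, and exactly one outcome ends up coded as 1.

```python
def indifference_benchmark(n_outcomes):
    """Brier Score of a forecast that assigns equal belief to every outcome."""
    belief = 1 / n_outcomes
    # One outcome occurs (coded 1); the remaining n - 1 do not (coded 0).
    return (1 - belief) ** 2 + (n_outcomes - 1) * (0 - belief) ** 2

print(indifference_benchmark(2))  # 0.5  -- the coin-flip benchmark
print(indifference_benchmark(4))  # 0.75 -- the four-outcome benchmark
```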
A second simple benchmark is to compare scores to simple forecast models.
You know a few already!
“Always bet on the home team”, or “always bet on the incumbent politician”, or “yesterday’s weather is tomorrow’s forecast”.
If you have an expensive process for forecasting a scenario, it had better be cheaper or reduce error more than what you can get from a simple rule of thumb. This sort of model thinking can beat expensive, quantitative methods, as long as the simple model has also been tested.
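As a rough illustration of that comparison, here’s a sketch with entirely made-up track records, scoring a hypothetical expensive model against the “yesterday’s weather” rule of thumb by average Brier Score:

```python
def brier_score(beliefs, outcomes):
    return sum((o - b) ** 2 for b, o in zip(beliefs, outcomes))

def average_brier(track_record):
    """Average Brier Score over (belief in rain, did it rain) pairs."""
    scores = [brier_score([p, 1 - p], [rained, 1 - rained]) for p, rained in track_record]
    return sum(scores) / len(scores)

# Made-up histories: each pair is (belief that it will rain, 1 if it rained else 0).
expensive_model = [(0.2, 0), (0.1, 0), (0.7, 1), (0.3, 0), (0.1, 0)]
rule_of_thumb = [(0.0, 0), (0.0, 0), (0.0, 1), (1.0, 0), (0.0, 0)]  # yesterday's weather

print(average_brier(expensive_model))  # about 0.096
print(average_brier(rule_of_thumb))    # 0.8 -- the expensive model earns its keep here
```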
Lastly, it’s reasonable to compare two forecast sources with each other if they’re forecasting the same scenario, but not two forecast sources forecasting different scenarios. Here’s an analogy:
You can’t compare a 100 meter dash time with a marathon time, right? These are two different events. But, they’re both measured with time.
Thus, someone forecasting sports will likely have a worse Brier Score on average than someone forecasting tomorrow’s precipitation in the desert. It’s easy to say a sprinter is faster than a marathon runner, but the sprinter’s lower time is not necessarily more impressive. So, directly comparing Brier Scores won’t make sense unless they’re measurements of the same forecast.
2. Calculating the calibration of a forecast source
Calibration relates to how reliable a forecast source is over a volume of forecasts. You can measure it if you track stated belief against actual outcomes over time.
Let’s say you forecast the rain 10 days in a row. Each day you’re 10% sure it will rain.
You can see that it actually did rain on one out of ten days in the above image (Day 5). This means you were wrong 10% of the time.
Because these ten forecasts had 10% certainty assigned to them… we can compare your historical belief with your actual track record. In this example, your calibration is perfect. When you are 10% certain, you are 10% correct: ten percent of the time you said "10% certain", you were correct.
On a line graph, perfect calibration is a 1:1 line, where percentage belief matches historical results.
Adding forecasts to a simple line graph can show calibration over time, and you will spot deviations from that 1:1 line. Here’s a real calibration chart from the 538 blog, which has been forecasting for a decade.
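To show how those chart points get computed, here’s a sketch with made-up forecasts: group forecasts by stated belief, then compare each group’s belief to how often the event actually happened. Perfect calibration would put every row on that 1:1 line.

```python
from collections import defaultdict

# Made-up history: (stated belief that it will rain, 1 if it rained else 0).
history = [(0.1, 0), (0.1, 0), (0.1, 0), (0.1, 0), (0.1, 1),
           (0.3, 0), (0.3, 1), (0.7, 1), (0.7, 1), (0.9, 1)]

# Bucket forecasts by their stated belief.
buckets = defaultdict(list)
for belief, happened in history:
    buckets[belief].append(happened)

# Compare stated belief to the observed frequency in each bucket.
for belief in sorted(buckets):
    observed = sum(buckets[belief]) / len(buckets[belief])
    print(f"said {belief:.0%} -> happened {observed:.0%} of the time")
```

Real charts usually round or bucket beliefs (say, to the nearest 10%) so that each point has enough forecasts behind it.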
Calibration Intuition
It takes a lot of forecasts to chart calibration. That’s why calibration training is so useful. It forces a forecaster to produce the volume needed to chart and observe their deviation from 1:1. An individual quickly calibrates when they see how poorly calibrated they are.
The “expert” forecasting situations I’ve found in practice rarely produce forecasts in the volume needed to chart them nicely. This means that new teams won’t find calibration to be a quickly available measurement tool. It’s immediately useful for training, though.
I haven’t found a widely used, numeric approach to showing calibration. Such approaches exist, but none seems to have emerged as the leading one.
Lastly, precise forecasts (with belief expressed in decimals) are difficult to chart. Most visualization approaches group forecasts by rounding or bucketing them. For extremely rare event forecasting, this might harm the usefulness of calibration charts.
Conclusion
Brier scores and calibration are crucial concepts to understand in forecasting. They’re useful in many situations, but it’s helpful to understand the extent of that usefulness.
Ryan McGeehan writes about security on scrty.io.