Enigma Forecasting Follow-Up
I co-ran a forecasting workshop with Netflix at Enigma 2019. We can now look back and score the panel’s forecasting skill. The scenario we forecasted was written as follows:
What is the likelihood that a “sec-critical” Firefox exploit will be discovered “in the wild” in February 2019?
That scenario was not observed, so it resolves as “No,” and we can score the panel’s forecasts accordingly with Brier scores.
This was a more accessible version of the “expert” panel organized here.
About the panel
Our primary goal was to expose participants to forecasting. We were gathering insight into lightweight forecasting approaches behind the scenes.
The panel average before discussion would have had a larger Brier score of 0.182633, which reduced to 0.06573938 after discussion. The panel began with a ~30% average belief that the scenario would confirm, which fell to ~18% after discussion.
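The scores above are consistent with the original two-sided Brier formulation, which sums squared error over both outcomes and ranges from 0 (perfect) to 2 (maximally wrong). A minimal sketch, using the panel’s rounded average beliefs:

```python
def brier_score(p_yes: float, outcome_yes: bool) -> float:
    """Two-sided Brier score for a binary forecast.

    Sums squared error across both outcomes ("Yes" and "No"),
    so scores range from 0.0 (perfect) to 2.0 (maximally wrong).
    """
    o = 1.0 if outcome_yes else 0.0
    return (p_yes - o) ** 2 + ((1.0 - p_yes) - (1.0 - o)) ** 2

# The scenario resolved "No", so lower belief in "Yes" scores better.
print(brier_score(0.30, False))  # pre-discussion average  -> ~0.18
print(brier_score(0.18, False))  # post-discussion average -> ~0.065
```

Note that lowering the "Yes" belief from 30% to 18% cuts the error by almost two thirds, because the penalty grows with the square of the miss.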
The following scatterplots show the individual beliefs before discussion (scatterplot 1) and after (scatterplot 2). For reference, I’ve included what it might look like if panelists had guessed randomly (scatterplot 3).
A larger-than-normal panel and anonymous forecasts felt beneficial to the process. The brief training we gave may have assisted the forecasts as well.
Unscreened, “walk in” participants were likely a double-edged sword. We enforced no formal calibration training, although a few of the panelists had undergone it previously.
We provided some “base rate” data in discussions, which could be viewed as introducing bias. This included exploit advisory data, browser market data, and an odds table to translate verbal frequencies into percentages. We prepared the data we figured would be required in order to keep things timely.
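The odds table we handed out isn’t reproduced here, but as a purely hypothetical sketch of what such a translation can look like (the phrases and numbers below are illustrative, loosely in the spirit of Sherman Kent’s “words of estimative probability”):

```python
# Hypothetical odds table -- phrases and values are illustrative,
# not the workshop's actual handout.
VERBAL_ODDS = {
    "almost certainly not": 0.05,
    "probably not": 0.25,
    "chances about even": 0.50,
    "probably": 0.75,
    "almost certainly": 0.95,
}

def to_percentage(phrase: str) -> float:
    """Translate a verbal frequency into a point probability."""
    return VERBAL_ODDS[phrase.lower()]

print(to_percentage("probably not"))  # -> 0.25
```

A table like this mostly serves to keep panelists from anchoring on vague words that different people read as very different probabilities.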
Panelists were aware that we structured the scenario with a delayed judgement, so a discovery observed only in hindsight could still resolve the scenario as “Yes,” and they could account for that in their forecasts. This seems like a useful technique and didn’t harm much, though it comes at the expense of a delayed measurement.
Tools from the Delphi method, in which conversation happens in a structured manner, could have helped smooth out contributions from overly enthusiastic participants who overwhelmed discussion at times. Allowing “walk in” participants from a public conference risks letting overzealous contributors dominate the conversation. Our “strength in numbers” approach, along with anonymous forecasts, mitigated this by drowning them out.
The most vocal individual ended up having the greatest forecast error, at 1.62. We normally wouldn’t be able to observe this (all forecasts were anonymous), but they boldly shared their forecast at the end. We felt the panel method would be a great mitigation if this were a workplace discussion with a sole decision maker, and structuring it more like Delphi would have mitigated it further.
Overall this is a useful example of a casual forecasting session. Panelists weren’t screened or trained, and some rules were broken. We polled people during and after the session on whether the process was useful, and even the most vocal contributors who disputed interpretations of the subject matter or process were enthusiastic that the overall exercise was useful, which was very encouraging.
The data gathered from the Enigma panel suggested a once-every-few-quarters rate of exploitation in Firefox. The measurement approach felt useful for moderators and panelists, and we gathered some insight into improving the approach in the future.
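As a rough sanity check (my own back-of-envelope math, not something the panel computed), a “once every few quarters” rate can be translated into a one-month probability by treating discoveries as a Poisson process:

```python
import math

def monthly_probability(events_per_quarter: float) -> float:
    """P(at least one event in a given month), assuming discoveries
    arrive as a Poisson process with the given quarterly rate."""
    rate_per_month = events_per_quarter / 3.0
    return 1.0 - math.exp(-rate_per_month)

# "Once every few quarters" -- e.g. one event per three quarters:
print(round(monthly_probability(1 / 3), 3))  # -> 0.105
```

A rate in that range implies a roughly 10% chance of a qualifying discovery in any given month, which sits comfortably below the panel’s pre-discussion ~30% and near its post-discussion ~18%.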
Thanks to Netflix and Enigma for having me!