Incident Response: Writing a Playbook
An incident response plan should consider the “first time” reader, who may never have expected to be responding to an incident.
I prefer to keep plans concise and relevant to a wide variety of incidents without heavyweight prescription. “Who’s in charge?”, “Who do we call?”, and “Who can help?” are all panic-reducing answers in the first hour of an incident.
To keep the main incident response plan simple, some teams augment their plan with playbooks (or runbooks) that act as helpful manuals for more specific situations. Playbooks focus on step-by-step directions for a well-scoped incident task.
As a result, you’ll have an incident response plan that is applicable in most incidents, and incident response playbooks that are applicable to specific incidents.
The naming convention is irrelevant; it’s the separation that helps support an individual stuck in a crisis by keeping documentation crisp, concise, and relevant. A plan brings resources together with clear roles and communication, and playbooks support some of the actions that might take place, keeping irrelevant information out of the way.
What playbooks should I build?
Deciding which playbooks make sense to build is an exercise in prediction: what types of incidents are you likely to have? This is an exercise in understanding your risks.
Playbooks do not always need to be technical. For instance, a sales and support team might have a runbook for triaging emergency phone calls to every single customer after a public crisis. The playbook could instruct:
- Approval sources for sensitive messaging.
- Triage method to divvy up customer contacts across a support team.
- Emergency public FAQ and blog authoring & hosting.
- Conditions for “top customers” who need an executive phone call first.
An overall risk assessment can help surface risks that are specific to your organization. From there, you want to find scenarios you could feasibly be in that would otherwise never be questioned, like “Who makes this decision?”, “Who has access to this system?”, or “Who needs to know that this happened?”
What are some more examples?
These are runbooks that I’ve written or come across, along with a few wish-list items that would have been useful in past incidents. Consider them with your own risks in mind, and assess whether authoring your own version for your own environment makes sense.
A volatile termination. A rushed decision to terminate an employee or founder is expected to go very poorly. This individual’s access is so pervasive that a rigorous effort to enumerate it must happen. Manual review of accounts, expenses, contracts, and network activity is expedited, along with a device seizure. HR and legal may need to review signed employment agreements.
An account is hijacked. An employee has had login credentials or a session hijacked for a platform that you don’t own (Facebook, Gmail, Twitter, etc.). I have written an example here. Getting the victim back online and removing an adversary’s persistent access can be different on every platform.
We have to rotate a secret. The biggest culprits of leaked secrets among development teams are their own copy-paste buffers and repositories. If a credential were to leak into the wild, what are the steps involved with rotating it? Who has this knowledge? What follow-up investigation would have to occur if you believe it was exploited?
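Part of answering those questions is confirming where a credential actually appears. As a minimal sketch, the following Python scans text for credential-like strings; the rule names and regexes are illustrative assumptions (real scanners such as gitleaks or truffleHog ship far more rules), though the AWS access key ID format shown is a well-known convention.

```python
import re

# Illustrative detection rules; a real playbook would reference your
# organization's actual scanner and its rule set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_string) pairs for credential-like strings."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((name, match))
    return hits
```

Running this across repositories, chat exports, or ticket history can tell you how far a leaked secret has spread before you decide on the rotation steps.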
Email is compromised. When an individual’s email is compromised, you usually follow up with a large triage effort to discover risks that could surface in the press. Additionally, technical attacks could arise from the leaked email. Did the victim say anything embarrassing? Did they email private keys to their co-workers? Were password resets issued and accounts hijacked?
Break glass, shut down. Some systems may seem too legacy, too profitable, or too complicated to confidently disable if they’re exposed or compromised by an attack. Instead of attempting a step-by-step guide for a system no one fully understands or could shut down alone, you may want to focus on the approval steps, severity guidelines, and expectations for user notification in a forced outage. Consensus on something this sensitive can leave a group of people afraid to own the decision, and having a firm policy can avoid damaging hesitation or premature action.
Investigations based on an “indicator of compromise”. When you have confirmed evidence of a breach, you’ll want to leap around an organization like a SWAT team to clear it out. When an individual’s access is compromised, your starting indicator might be a username. You might also start with an IP address, domain name, malware hash, a specific URL, an attachment name, etc. Each bit of evidence generates follow-up tasks toward other indicators. Playbooking this investigation, or at least documenting the data sources you’ll need (for example, sample Splunk queries and the phone number of the administrator who owns a system), can help bring extra sets of eyes into an incident response effort.
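The circular “indicator leads to indicator” expansion described above can be sketched as a breadth-first pivot. This is a toy Python model, not any real SIEM’s API: the log records and field names (`user`, `src_ip`, `file_hash`) are illustrative assumptions standing in for query results from a system like Splunk.

```python
from collections import deque

# Toy log records standing in for SIEM query results.
LOGS = [
    {"user": "alice", "src_ip": "203.0.113.7", "file_hash": "abc123"},
    {"user": "bob", "src_ip": "203.0.113.7", "file_hash": "def456"},
    {"user": "bob", "src_ip": "198.51.100.2", "file_hash": "def456"},
]

def pivot(seed_field: str, seed_value: str) -> set[tuple[str, str]]:
    """Breadth-first pivot: expand each indicator into the other
    indicators that co-occur with it in the same log records."""
    seen = {(seed_field, seed_value)}
    queue = deque(seen)
    while queue:
        field, value = queue.popleft()
        for record in LOGS:
            if record.get(field) != value:
                continue
            for f, v in record.items():
                if (f, v) not in seen:
                    seen.add((f, v))
                    queue.append((f, v))
    return seen
```

Starting from a single username, the pivot surfaces the IP addresses, other users, and file hashes connected to it; a playbook would replace each expansion step with the real query and the owner to call for each data source.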
A really scary vulnerability. Consider Meltdown, Heartbleed, or vendor incidents like Cloudbleed or OneLogin’s issues. Complicated mitigation efforts with high uncertainty may need some project structure to ensure they’re managed to the end. Executive sign-off on a playbook may help peel resources away from other business efforts, even if they’re also urgent. Having a designated project manager for an incident can help organize the fact finding tasks along with mitigation steps that are taken.
This describes the “playbook” aspect of an opinionated approach to incident response plans. I believe that incident response plans should be extremely readable and useful, and a few high value playbooks should augment them.
Ryan McGeehan writes about security on medium.