Blips and Downtime: Incident Management 101

Content by DevOps.com

Nothing gets the blood pumping like seeing an IT incident alert flash across the screen. Is the network traffic anomaly a potential intrusion? Are we about to have an outage? Is it just a random blip on the radar? The implications of these different potential incidents vary wildly, and each requires a different set of response tactics.

Responding to incidents both large and small is essential for every team. Especially in a remote-first world, it is more important than ever to ensure that all developers are in the same war room to tackle the problem as efficiently and effectively as possible. It’s also critical to have a single view that aggregates data from all of the systems and services used to monitor and respond to incidents. But not every team has invested in a fully fledged incident management strategy.

In fact, it’s not uncommon for incident management to be completely ad hoc. Teams wait for something to happen and then take a “dive and catch” approach to incidents. Other teams have some processes in place, but those plans may be missing key components. In both cases, teams struggle with slower time to resolution and often waste resources while responding to incidents. Critical stakeholders often suffer from unclear communication around an incident.

Depending on the type of incident, the consequences can be challenging to overcome. Running afoul of a compliance law, like GDPR, can result in business-ending financial penalties. Incidents that impact uptime can keep an organization from meeting SLA terms and potentially cost it customers and revenue. Regardless of the size of your company or customer base, formulating an effective incident response begins with people.

People are the Cornerstone of Incident Response

Companies should assess their business needs and the risk profile of their applications and services. With those barometers in place, leaders can then determine the appropriate staff levels to ensure 24/7/365 incident management coverage. The size and breadth of the team will depend on the size of the company and the structure of response plans, but the team should attempt to cover a few areas.

Ideally, a team has at least three responders.

  1. Incident owner. This is the team’s quarterback. They coordinate everything, make sure the team follows runbooks, involve the correct people and remove roadblocks.
  2. Incident investigator. The investigator has their hands on the keyboard. They run the investigation, pull and analyze logs, review metrics and generally try to determine how to best mitigate the issue.
  3. Incident communicator. Communicators focus almost exclusively on two tasks: documenting the incident via an “incident timeline” and communicating with stakeholders when appropriate. Smaller teams may not have the personnel to designate a communicator for every incident. On those teams, the incident owner will often assume this role as well. It’s critical that a complete timeline is built during the incident. It’s nearly impossible to put together a properly documented incident timeline after the fact. Critical details will get lost or forgotten, and your retrospectives and root cause analyses (RCAs) will be low quality.
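As a rough illustration of building the timeline during the incident rather than afterward, here is a minimal Python sketch. The class names, fields and output format are assumptions made for this example, not part of any particular incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    timestamp: datetime
    author: str  # e.g. "owner", "investigator", "communicator"
    note: str


@dataclass
class IncidentTimeline:
    title: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str,
               when: Optional[datetime] = None) -> TimelineEntry:
        # Stamp the entry as it is recorded unless an explicit time is given.
        entry = TimelineEntry(when or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by time so entries logged out of order still read correctly.
        lines = [self.title]
        for e in sorted(self.entries, key=lambda e: e.timestamp):
            lines.append(f"{e.timestamp:%Y-%m-%d %H:%M} UTC  {e.author}: {e.note}")
        return "\n".join(lines)
```

Because each entry is timestamped when it is recorded, the communicator only has to write a short note at each step, and the rendered timeline is ready for the retrospective.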

Runbooks Offer Muscle Memory During Incident Management

The incident response team, no matter how experienced, will need a set of runbooks: compilations of the processes and steps to take when incidents arise. Runbooks help guide teams through everything from analyzing and mitigating common problems to post-action investigations. They represent the process portion of incident response and act as a guidebook for the team when the pressure is high.

Typically, an alert system will ping the on-call engineer about a metric that doesn’t look right. The runbook will help that on-call person determine if the alert should be categorized as an incident or not. When that person determines the alert indeed represents an incident, they must estimate the severity of the incident. Different teams may use different definitions for severity, but here is an example:

Severity one – Production down or highly degraded.
Severity two – Serious bug impacting customers or system stability.
Severity three – Important bug. Requires tracking until mitigation, but likely not a 24/7 workstream.
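A runbook’s triage step could be encoded along these lines. This is only a sketch: the `Alert` fields are hypothetical yes/no questions an on-call engineer might answer, and the mapping simply mirrors the three example severities above. Teams with different definitions would substitute their own.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    ONE = 1    # Production down or highly degraded
    TWO = 2    # Serious bug impacting customers or system stability
    THREE = 3  # Important bug; tracked until mitigation, likely not 24/7


@dataclass
class Alert:
    # Hypothetical triage answers, not fields from any real alerting tool.
    production_down: bool = False
    highly_degraded: bool = False
    customers_impacted: bool = False
    stability_at_risk: bool = False
    needs_tracking: bool = False


def classify(alert: Alert) -> Optional[Severity]:
    """Return the most severe matching level, or None if this is not an incident."""
    if alert.production_down or alert.highly_degraded:
        return Severity.ONE
    if alert.customers_impacted or alert.stability_at_risk:
        return Severity.TWO
    if alert.needs_tracking:
        return Severity.THREE
    return None
```

Checking the most severe conditions first means an alert that matches several levels is always assigned the highest applicable severity.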

Runbooks can also guide teams through analysis of the situation. After a root cause is determined, the runbook will remind teams to perform an immediate post-action investigation to see whether specific processes worked well, whether gaps exist, and whether communications were sent out to customers or other stakeholders.

Tools are Important; Tying Them all Together is Crucial

People and processes are important for responsive incident management, but without the right tools, even seasoned incident-management veterans may struggle with the complexity and pressure of an incident. Tooling can prevent a lot of mistakes and omissions when the heat is on.

There are numerous platforms that address different components of incident management response. Depending on the size of the organization, teams will need some or all of the following:

  • On-call rotation management
  • Observability and alerting
  • Ticketing system
  • Communication system

One of the biggest challenges in incident management is tying together the data and actions spread across this broad toolset. During a severity one incident, teams will likely rotate people into and out of the incident based on their geographic location and time zone. How does the next person in the rotation come up to speed on what’s been done so far? How do we document a detailed timeline that can be used for an RCA investigation later? How can we provide an up-to-the-minute view of the status of an incident for key stakeholders? Ideally, teams will have a single place where all of this data is collected and managed and where the data can be filtered and exported at the end of the incident.
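To make the idea of a single collected view more concrete, the following Python sketch merges events from several tools into one ordered timeline that can be filtered and exported. The `Event` shape, the source labels and the function names are all assumptions made for illustration. It also assumes every tool reports timestamps as ISO 8601 strings in UTC with the same precision, so that plain string ordering matches time ordering.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    timestamp: str  # ISO 8601, UTC (assumed uniform across all sources)
    source: str     # hypothetical label such as "alerting", "ticketing", "chat"
    summary: str


def merge_events(*feeds: Iterable[Event]) -> List[Event]:
    # Combine every feed into one list ordered by time.
    merged = [event for feed in feeds for event in feed]
    return sorted(merged, key=lambda e: e.timestamp)


def filter_events(events: Iterable[Event],
                  source: Optional[str] = None,
                  since: Optional[str] = None) -> List[Event]:
    # Keep events from one source and/or at or after a given time.
    return [
        e for e in events
        if (source is None or e.source == source)
        and (since is None or e.timestamp >= since)
    ]


def export_json(events: Iterable[Event]) -> str:
    # Serialize the timeline for an RCA document or a stakeholder report.
    return json.dumps([asdict(e) for e in events], indent=2)
```

A responder rotating into the incident could call `filter_events(timeline, since=...)` with the time of the last handoff to see only what has changed, while `export_json` covers the end-of-incident export.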

Effective Incident Management: People, Processes and Tools

Is that traffic spike a sign of impending doom, or just another meaningless bump? Without the proper people, processes and tooling, answering that question becomes very difficult. Unfortunately, the wrong answer can lead to disastrous consequences. Only with the right set of tools and processes can companies develop an incident response system that provides maximum stability and keeps customers happy.
