
How To Recover From an Internet Outage

By Mike Hicks
| 12 min read

Summary

Let’s do outage response right. Here are five key steps NetOps teams can take to minimize an Internet outage’s impact and guard against future issues.


Outages are one of the most stressful situations a NetOps team will deal with. Under immense pressure to get services back up and running, teams can make hasty diagnoses that exacerbate an outage, or learn the wrong lessons and face a similar problem down the line.

Outages can happen in any environment. It’s how quickly you identify, mitigate, and resolve the problem that sets apart major user-facing impacts from more minor ones. Here, we’re going to take you through the process of recovering from an outage step by step, exploring the factors that should be considered at every stage.

The Five Phases of Effective Internet Outage Response

There are generally five main steps in any outage response:

  1. Identifying the outage

  2. Immediate mitigation

  3. Forensic analysis

  4. Post-incident review

  5. Preventative measures

Let’s discuss each of these steps to help you deliver quality digital experiences for your customers and make sure your next outage is less likely to become a full-blown crisis.

1. Identifying the Outage

The moment an outage is first reported is a critical one. It’s important not to jump to assumptions based on limited signals, but to accurately identify the issue, understand its impact, and determine which services/regions are affected. That is, of course, easier said than done when the dashboard’s flashing red or the phone lines are ringing with people demanding to know what’s going on.

It’s critical that you don’t take the wrong mitigation steps at the onset of an outage. The first step is to confirm that an outage has actually occurred, so you’re not chasing a false alarm or an isolated issue. For example, we’ve seen authentication system failures cause outage conditions in which valid login details were rejected. Users concluded their accounts had been compromised and reset their passwords, which didn’t resolve the situation but did add to the volume of network traffic.

Therefore, it’s important to use all available resources to accurately determine the source of an issue—whether it's within your own infrastructure, a third-party service, or somewhere else in the network.

An initial impact assessment is also important. Is the outage affecting all users or only a subset? Is it affecting a particular region? Knowing exactly who’s impacted can help you to prioritize the response effort.
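To make this concrete, here’s a minimal sketch of how probe results from multiple vantage points might be classified, separating a genuine global or regional outage from a false alarm. The function name, region labels, and 50% failure threshold are illustrative assumptions, not part of any particular monitoring product.

```python
from collections import defaultdict

def classify_outage(probe_results, fail_threshold=0.5):
    """Classify outage scope from synthetic probe results.

    probe_results: list of (region, ok) tuples, one per probe.
    Returns "none", "isolated", "regional", or "global".
    """
    by_region = defaultdict(list)
    for region, ok in probe_results:
        by_region[region].append(ok)

    # A region is "failing" when at least fail_threshold of its probes fail.
    failing = {
        region for region, checks in by_region.items()
        if checks.count(False) / len(checks) >= fail_threshold
    }
    if not failing:
        # Scattered individual failures are likely a false alarm or an
        # isolated client-side issue, not an outage.
        return "isolated" if any(not ok for _, ok in probe_results) else "none"
    if failing == set(by_region):
        return "global"
    return "regional"

probes = [("us-east", False), ("us-east", False), ("eu-west", True)]
print(classify_outage(probes))  # → "regional"
```

A classification like this also feeds the impact assessment directly: a "regional" result tells you which subset of users to prioritize.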


Check out the Internet Outage Survival Kit for more insights on mitigating the impacts of outages.

2. Immediate Mitigation

Once the cause of an outage has been isolated and confirmed, the attention switches to mitigation. What is the most effective way to get services up and running again?

You’ll want to activate your incident response teams to manage the situation and plot the best way forward. Communication is critical at this stage, too, keeping customers, employees, and other stakeholders informed about the outage and the steps being taken to resolve it.

This is also the time to consider temporary fixes, solutions that will allow you to restore services while you work on a more long-term answer. This might involve switching to backup systems, rolling back software patches, or rerouting traffic to overcome a connectivity issue.
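The rerouting and backup options above share a simple underlying idea: prefer systems in a fixed priority order and fall back to the first healthy one. A minimal sketch, with hypothetical upstream names:

```python
def pick_upstream(upstreams, health):
    """Return the first healthy upstream in preference order.

    upstreams: list of names, primary first.
    health: dict of name -> bool from the latest health checks.
    """
    for name in upstreams:
        if health.get(name, False):
            return name
    # Nothing healthy: escalate to a human rather than route blindly.
    return None

print(pick_upstream(["primary-dc", "backup-dc"],
                    {"primary-dc": False, "backup-dc": True}))  # → "backup-dc"
```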

Don’t lose sight of the fact that the best answer might be to do nothing. For example, if a data center has been taken offline by a localized power failure that, according to the provider’s restoration estimate, will be resolved soon, but executing your disaster recovery plan would take three hours, it may be more prudent to sit tight than to switch to a disaster recovery site. The impact on users must also be assessed: If the power outage affects only a small subset of your customers because your workload is geographically distributed, it may be wiser to ride it out while helping affected customers as much as possible.
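That trade-off reduces to simple arithmetic: fail over only when the provider’s restoration estimate exceeds your own cutover time, and only when enough users are affected to justify the disruption. A hedged sketch; the 25% impact threshold is an illustrative assumption, not a recommendation:

```python
def should_fail_over(eta_restore_min, dr_cutover_min,
                     affected_fraction, impact_threshold=0.25):
    """Decide whether switching to the DR site beats waiting out the outage.

    eta_restore_min: provider's estimated minutes until service is restored.
    dr_cutover_min: minutes needed to execute the disaster recovery plan.
    affected_fraction: share of users impacted (0.0 to 1.0).
    """
    if affected_fraction < impact_threshold:
        # Small blast radius: ride it out and support affected users directly.
        return False
    return eta_restore_min > dr_cutover_min

print(should_fail_over(30, 180, 0.9))  # → False: power is back long before cutover
```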

NetOps teams often feel the pressure to be seen doing something in times of crisis, but keeping calm and using all the available information to identify the quickest route to recovery is vital.

3. Forensic Analysis

The cause of the outage has been identified, and a fix is being implemented. Now, it’s time to investigate further so that you can guard against similar problems in the future. This involves understanding the sequence of events that led to the outage.

Start with data collection. Gather logs, system states, and any other relevant evidence from before, during, and after the outage. You must review contextual, correlated data across the entire service delivery chain.

Next, confirm the root cause or causes. In a complex delivery chain, multiple triggers can combine to create an issue. Don’t assume that the one you identified in the initial assessment was the sole cause of the problem. Deeper root causes might include security breaches, hardware failures, or network problems.
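One practical step in that investigation is merging per-system logs into a single time-ordered incident timeline, which often reveals that a deeper cause preceded the user-visible symptom. A minimal sketch using the Python standard library; the event data is purely illustrative:

```python
import heapq

def build_timeline(*sources):
    """Merge per-system event streams into one chronological timeline.

    Each source is a list of (timestamp, system, message) tuples,
    already sorted by timestamp within that source.
    """
    return list(heapq.merge(*sources, key=lambda event: event[0]))

timeline = build_timeline(
    [(100, "auth", "login failure rate spikes")],
    [(95, "network", "BGP route withdrawn"), (110, "network", "route restored")],
)
for ts, system, msg in timeline:
    print(ts, system, msg)  # network event at t=95 precedes the auth symptom
```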

This is also the stage for a more thorough impact assessment. How many users were affected? What was the precise duration of the downtime? Were there any downstream impacts on customers or employees?

Only when you’ve got a full forensic analysis of what went wrong can you set about making sure this problem doesn’t rear its head again. For example, it might be possible to use a degree of automation based on early warning signals to help prevent a similar problem from impacting users. Don’t rush to implement such solutions before fully understanding the underlying cause(s), however.

4. Post-incident Review

With all the evidence collected on what went wrong and how it was fixed, it’s prudent to conduct a post-incident review.

First, you should compile an incident report, documenting full details of the outage, the cause, the scale of the impact, and the eventual resolution. Don’t leave out details of mistakes made during the recovery process, such as acting on false alarms or applying bad fixes. The point of the report is to provide learnings for future incidents, so that the same mistakes may be avoided if a similar outage occurs.

Part of this process should include a team debrief with the incident response team, discussing what went well and which processes could be improved. This might also help surface preventative measures that can be applied at the next stage.

Finally, remember to solicit feedback from stakeholders affected by the outage: customers, employees, and senior management. Do they feel they were kept well informed during the outage? What was the business impact? How did it affect third-party systems? Don’t look exclusively through a NetOps lens; consider the impact across the entire company.

5. Preventative Measures

Having fully reviewed the outage and its impact, it’s time to turn attention to measures that could prevent similar incidents in the future—not the quick fixes that got you out of this hole, but architectural or process changes that could make a permanent difference to your network’s resilience.

An example of a process change might be a retail company adopting a staggered approach to implementing security updates from a vendor so that only 50% of its point-of-sale terminals are updated at a time. Using this approach, if any problems with the patches did occur, only half of a store’s terminals would be out of action. A process change like this could help improve business continuity.
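The wave-by-wave logic is straightforward to sketch. The 50% fraction mirrors the example above but is otherwise an arbitrary assumption:

```python
def staggered_batches(terminals, fraction=0.5):
    """Split terminals into update waves so only `fraction` are patched at once.

    Returns a list of batches; each batch is updated and verified
    before the next one begins.
    """
    batch_size = max(1, int(len(terminals) * fraction))
    return [terminals[i:i + batch_size]
            for i in range(0, len(terminals), batch_size)]

waves = staggered_batches([f"pos-{n}" for n in range(10)])
print(len(waves))  # → 2: half the store's terminals stay in service per wave
```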

If an outage was triggered by a patch or configuration change from a SaaS vendor, which they then rolled back to restore service, chances are your network’s going to face the same issue when they attempt to reimplement the change in a few days’ time. So what steps will you need to take to mitigate any impact the next time?

It might be that bigger changes are required. Do you need to change your network architecture? Your cloud provider? Your ISP or who they peer with? Taking an evidence-based approach to identifying the cause of an outage can help you make the business case for such major changes.

Lastly, remember to identify any opportunities for continuous improvement. Were there flaws in the way you responded to the outage that can be used to update your response plans? Did the outage identify any gaps in staff knowledge that could be addressed with training? Sometimes, the solutions aren’t technological but human.

Making a Full Recovery

If you don’t want your NetOps team to be constantly firefighting, following a structured outage recovery process is essential.

Not all outages are avoidable, and not all will have an obvious cause. But every outage is a learning opportunity. They will all leave breadcrumbs and clues that can feed into your root cause analysis, potentially preventing problems further down the line.

Only if you do the work to fully identify, review, and remediate an outage will you reap the rewards when a similar incident strikes in the future.


Spot outages in no time with ThousandEyes Internet Insights. Learn more.
