The Anatomy of an Outage

By Mike Hicks
12 min read

Summary

Outages often stem from the failure of a critical component in a distributed system. To prevent disruptions, it's essential to thoroughly analyze and identify vulnerabilities, take a proactive approach to managing risks, and optimize systems.


When you hear the word “outage,” you might assume that we’re talking about the total collapse of a service, that every part of the service delivery chain has suddenly become unavailable. However, when you dig into the anatomy of most outages, it’s rare (though not impossible) to discover a total wipeout.

Recently, we’ve noticed an uptick in what we call “functional failures,” where maybe only a single part of a service has failed, but it’s having downstream effects that disrupt the entire delivery chain. That might be a payment gateway failing or a problem with a user authentication process. These components are often provided by third parties and are beyond the service provider’s control, but they can have a devastating impact on service delivery.

It's only by examining the anatomy of such outages that you can begin to mitigate them—both from a service provider’s point of view and from a customer’s. Let’s explore how a deeper understanding of how outages occur can help lower your long-term risks of failure.

The Rise of Functional Failures

If you’re a regular listener to The Internet Report podcast, you’ll have heard us talk increasingly about functional failures at some of the world’s biggest Internet services in recent months. Those failures often appear to the user as the service being “down.” But, when we get under the surface of these outages, we discover that it’s often only a small part of the application that’s failed, such as a third-party plugin, a payment gateway, or the authentication provider.

We’re in the age of distributed architecture, where services are made up of many different component parts that must work in harmony. This setup has many advantages, including optimized service and reduced latency. If you’re an Internet retailer, for example, you probably don’t want to be running your own payment gateway. Why would you? It’s not your core expertise, it’s a security minefield, and it’s almost always cheaper to deploy a third-party option that doesn’t come with any development or ongoing maintenance costs. Even the world’s biggest tech companies rely on third-party components for parts of their apps. It’s not technology for technology’s sake; it’s often a way to boost performance and reduce overheads.

But there is an element of risk, too. These components can become a single point of failure. If a user can’t authenticate with a social network, for example, it doesn’t matter if the messaging element of the service is still working perfectly. They can’t get through the front door. 

Similarly, we’ve seen outages where authentication is fine, and everything with the app appears to be working normally. But the functional failure of a backend service means that when a user actually tries to take an action, like send a message or purchase an item, the service fails.
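To see why one failed component can take down the whole experience, here is a minimal sketch of the arithmetic. It assumes that every component in the chain is required and that components fail independently of each other, which is a simplification. The component names and availability figures are hypothetical, chosen only for illustration.

```python
import math


def chain_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(component_availabilities)


# Hypothetical figures, for illustration only.
components = {
    "authentication": 0.999,
    "messaging backend": 0.999,
    "payment gateway": 0.995,
}

overall = chain_availability(components.values())
print(f"End-to-end availability: {overall:.4%}")
```

Under these assumptions, the chain is always less available than its weakest component, which is why a single third-party dependency can define the user's experience of the whole service.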

Although most service providers will have an element of redundancy built into their services, sometimes the risk/reward ratio makes a failover system difficult to justify. For example, deploying a backup payment gateway is an expensive, technically complex solution that many service providers might decide to do without. Similarly, consumers rarely buy two of the same type of service in case one falters.

But that doesn’t mean there’s nothing to be done but shrug our shoulders and accept the occasional outage—either from the service provider’s side or the end user’s.

Understanding an Outage

So, what can you do in these situations? First, you need to understand the anatomy of an outage.

You have to start by identifying the fault domain (i.e., where the problem actually occurred and who is responsible and accountable for resolving the issue). Was it a third-party service or the data center hosting that service that’s at fault, for example?

However, we’re not here purely to apportion blame. You need to understand the impact of that failure and the downstream effects it might create in the future, should there be a repeat. Was there a complete loss of service, or did it just hamper performance? What were the consequences of increased latency in a particular part of the service chain? Only once you understand the full consequences of an outage can you start to think about how it might be avoided, or its impact lessened, in the future.

As a business, you will need to decide on where and when redundancy makes sense. As we said previously, it’s highly unlikely you'll ever have two payment gateways because it’s usually not financially viable. However, you might decide to explore having two CDN providers or splitting them across regions. That will come at a compute cost, but that will be part of the risk assessment.

There’s a risk assessment to be made by customers of these services, too. How much does a failure of this type cost and how much would it cost to remediate? As a customer of a cloud service provider, for instance, you might have to make a decision about whether to invest in two individual availability zones. That requires a cost analysis: What is the additional compute cost? How are you going to keep data in sync between the two? Does it require multiple CDNs? Does it make more financial sense to pull back into a tier-two cloud provider?
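One rough way to frame that cost analysis is to compare the expected yearly cost of outages with the yearly cost of the redundancy that would reduce them. The sketch below is only an illustration: the function names, the `risk_reduction` parameter, and every figure in it are assumptions, and a real assessment would also weigh factors such as reputational damage and data synchronization overhead.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from outages of this type."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(outages_per_year, hours_per_outage, cost_per_hour,
                        redundancy_cost_per_year, risk_reduction):
    """Return True if the loss avoided exceeds the cost of redundancy.

    risk_reduction is the assumed fraction (0.0 to 1.0) of outage cost
    that the redundant setup would eliminate.
    """
    baseline = expected_annual_outage_cost(
        outages_per_year, hours_per_outage, cost_per_hour
    )
    avoided = baseline * risk_reduction
    return avoided > redundancy_cost_per_year


# Hypothetical inputs: two 3-hour outages a year at 10,000 per hour,
# and a second availability zone assumed to remove 90% of that risk.
print(redundancy_pays_off(2, 3, 10_000, redundancy_cost_per_year=40_000,
                          risk_reduction=0.9))
```

If the redundancy costs more per year than the loss it is expected to prevent, accepting the occasional outage may be the rational choice.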

All of this comes back to having a full understanding of the entire service delivery chain. You need to know why this outage occurred, how likely it is to appear in the future, what your level of risk is, and whether you want to take action to avoid it happening again.

You want to avoid the “find fault, fix fault” mentality and put yourself in a position of continuous improvement, where you can anticipate potential problems and mitigate them before they occur.


Catch "The Anatomy of an Outage" podcast episode to learn more about outage anatomy and explore some real-world case studies.

Identifying the Key Information 

So, we’re trying to get ourselves out of this reactive state and into a preventative mode where the organization continually looks for ways to improve performance across the network. To do that, we’ve got to make sure we’re focusing on the right data. 

It’s easy to fall into the trap of trying to stay on top of absolutely everything, pulling down every single metric you can lay your hands on. That’s not only time-consuming, but can be enormously expensive. And even if you can afford to collect all that data, you’re soon wading into needle-in-a-haystack territory.

The key is to identify the pertinent information that will make a difference to your business. That information lets you optimize operations for the future because you fully understand the risks, as well as the costs and benefits of addressing any potential problems.

In my career, I worked with one household name in the gambling industry. The company had decided to move operations out of its own data center and into the cloud—a “lift and shift” of their infrastructure, as they say. The company was told that simply replicating its current setup in a cloud architecture would reduce costs and improve performance.

However, when they had completed the move, the firm was surprised to find that performance actually decreased, despite moving to more modern hardware, because they hadn’t matched the architecture to the requirements of the new cloud infrastructure. Any kind of delay in app performance is critical in the gambling industry because customers are placing wagers on live events and want to react instantly to changes in the game. But it was taking twice as long to place bets on the new infrastructure as on the old, potentially driving customers away.

To compound the issue, the company's compute costs were increasing because they had migrated an on-premises build to the cloud without optimizing the backend for cloud infrastructure.

To fix the problem, they had to step back and examine the situation afresh. They rearchitected the code for the cloud platform to bring the costs back down, but not at the expense of performance. In fact, they needed to improve efficiency. To do that, they had to split the service and bring the cloud infrastructure geographically closer to the user base so that customers wouldn’t face such a long delay when placing bets.

Identifying the Breaking Points

This experience comes back to our point about proactively seeking improvement. The most important thing an organization can do is identify the weak spots within this environment. Where are the potential bottlenecks?

It’s not necessarily about identifying every single point of failure; it’s about finding where you're most likely to have performance degradation. If you can identify that, you have most likely also found the part that’s going to break, because the leading indicator of any outage is packet loss. Loss is what brings degradation and retransmission, increased latency, and all the other major indicators of a failure.

But, if you can identify these bottlenecks before something breaks, it gives you a target for optimization. It’s the place where you're probably going to get the most bang for your buck in terms of IT spending. And, it’s one less day you’ll spend looking at metrics, wondering what went wrong.
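As a minimal sketch of what hunting for bottlenecks could look like in practice, the example below ranks segments of a delivery chain by a simple degradation score built from loss and latency. The `SegmentSample` type, the scoring formula, the `loss_weight` value, and the sample measurements are all hypothetical; they stand in for whatever metrics and thresholds matter to your own service.

```python
from dataclasses import dataclass


@dataclass
class SegmentSample:
    name: str
    loss_pct: float             # observed packet loss, in percent
    latency_ms: float           # observed latency
    baseline_latency_ms: float  # latency when the segment is healthy


def degradation_score(sample, loss_weight=10.0):
    """Higher scores mean a segment is drifting further from healthy.

    The weighting is an arbitrary assumption: each percent of loss counts
    as much as a 10% rise in latency over the baseline.
    """
    latency_increase = max(0.0, sample.latency_ms / sample.baseline_latency_ms - 1.0)
    return loss_weight * sample.loss_pct + 100.0 * latency_increase


def rank_bottlenecks(samples):
    """Return samples ordered from most to least degraded."""
    return sorted(samples, key=degradation_score, reverse=True)


# Hypothetical measurements, for illustration only.
samples = [
    SegmentSample("cdn", loss_pct=0.0, latency_ms=40, baseline_latency_ms=40),
    SegmentSample("auth-provider", loss_pct=0.5, latency_ms=120, baseline_latency_ms=100),
    SegmentSample("payment-gateway", loss_pct=2.0, latency_ms=300, baseline_latency_ms=150),
]

for sample in rank_bottlenecks(samples):
    print(sample.name, round(degradation_score(sample), 1))
```

The segment at the top of the list is the candidate for optimization, long before it becomes the subject of an outage report.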
