The Internet Report

Configuration Change Trouble & Other 2024 Outage Trends

By Mike Hicks | 16 min read

Summary

Configuration changes were behind many 2024 outages. Explore this and other recent outage trends—and how ITOps teams should plan accordingly in 2025.


This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.


Internet Outages & Trends

Across 2024, we analyzed insights from the over 650 billion data points that ThousandEyes collects every day—and a pair of topline trends emerged. First, the number and proportion of outages attributed to cloud infrastructure increased throughout the year. Second, we saw a notable number of outages caused by misconfigurations during the year. We’ll spend the first part of this week’s episode analyzing both of these trends in a bit more detail.

We will also go under the hood of recent incidents at OpenAI and Google Cloud Pub/Sub, and we’ll dive deeper into an incident that impacted Netflix at the end of last year.

Read on to learn more.


Cloud Service Provider Outages Continue To Rise

In a late-December episode, we highlighted that the ratio of outages attributed to Internet service providers (ISPs) compared to cloud service providers (CSPs) began to shift in early 2024, progressing throughout the year. Now that the final 2024 figures are in, we can confirm that the 2024 ratio of ISP:CSP outages was 73:27, compared to 83:17 in 2023, and 89:11 in 2022.

The rebalancing we’re seeing could signal a shift in the outage landscape. Looking at ISP and CSP outages collectively, the CSP share rose significantly over the course of the year, from 17% to 27%, while the ISP share declined to 73%, down from the 83% recorded at the same time last year.

While ISPs still account for the majority of outages, the growing number of CSP outages cannot be overlooked, highlighting a critical change in trends regarding Internet stability and reliability.
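For clarity on how these splits are expressed: each year's ratio is simply the ISP and CSP share of their combined outage count. The raw counts in the sketch below are invented placeholders chosen only to reproduce the published percentages.

# Hypothetical counts chosen only to reproduce the published ISP:CSP splits;
# ThousandEyes reports the percentages, not these raw figures.
def isp_csp_split(isp: int, csp: int) -> tuple[float, float]:
    total = isp + csp
    return 100 * isp / total, 100 * csp / total

for year, (isp, csp) in {"2022": (890, 110), "2023": (830, 170), "2024": (730, 270)}.items():
    isp_pct, csp_pct = isp_csp_split(isp, csp)
    print(f"{year}: ISP {isp_pct:.0f}% : CSP {csp_pct:.0f}%")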

Figure 1. Ratio of ISP to CSP outages across 2024

Accidental Misconfigurations Trending for Clouds and Apps

Accidentally misconfiguring routing policies has long been a cause of network incidents.

Over the past year, configuration-related errors were a problem in the cloud, as well—notably for two Azure incidents in January and July and for Salesforce in October. Configuration mishaps also became an issue within applications themselves—the CrowdStrike sensor update incident was attributed to an issue with a single configuration file. Moreover, a series of issues with ChatGPT appeared to be related to user experience improvements, and, in the case of Square, a deployed configuration could not be interpreted by Android devices.

Modern software engineering practices could be at the root of many configuration issues that are affecting cloud services and applications.

The first such practice is the growing prevalence of continuous integration and continuous delivery (CI/CD), which has become essential in contemporary software engineering. CI/CD allows product and engineering teams to deploy updates and enhancements at an accelerated pace, facilitating more frequent, incremental changes to applications.

This approach enables organizations to respond quickly to user feedback and market demands. However, in my opinion, the speed of these updates often comes at the cost of comprehensive end-to-end testing. As teams rush to deploy new features, there is often insufficient time to validate changes across the entire application, which can lead to unforeseen issues. Moreover, a constantly changing codebase makes application behavior less predictable from one deployment to the next, creating daily challenges for teams as they strive to maintain functionality while shipping new features.
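To make that concrete, here is a purely hypothetical sketch (not any vendor's actual pipeline) of one common guardrail: a validation step that rejects a configuration before a CI/CD pipeline is allowed to deploy it. The file layout, keys, and limits below are invented for illustration.

# Hypothetical pre-deployment configuration check. Validates a JSON config
# against a few invariants before a CI/CD stage is allowed to promote it;
# not any vendor's actual process.
import json
import sys

REQUIRED_KEYS = {"service", "region", "max_connections", "rollout_percent"}

def validate_config(path: str) -> list[str]:
    errors = []
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if not 0 < cfg.get("rollout_percent", 0) <= 100:
        errors.append("rollout_percent must be between 1 and 100")
    if cfg.get("max_connections", 0) <= 0:
        errors.append("max_connections must be positive")
    return errors

if __name__ == "__main__":
    problems = validate_config(sys.argv[1])
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the pipeline stage instead of deploying a bad config
    print("config OK")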

The second configuration-related trend is the expedited rollout of services and the distributed architecture of applications. Digital applications comprise numerous components that must work in harmony to produce a cohesive user experience. Typically, different agile teams develop these components, which may run on a mix of in-house and third-party infrastructure. In this collaborative environment, it is common for teams to focus on enhancing their specific modules without fully understanding the potential impacts of their changes on the broader system. This limited visibility can lead to situations where seemingly isolated modifications inadvertently disrupt the functionality of other interconnected components, ultimately resulting in outages that could have been prevented with better communication and oversight.

AI (and specifically AIOps) will have a role to play in catching misconfiguration errors. This is particularly the case for the network, where misconfiguration errors appear less prevalent, at least compared to those impacting cloud services or applications. AI may also help to detect and remediate individual errors within a specific cloud service, development silo, or application component.

However, overall, the service delivery chain for even your average digital experience involves significant complexities and moving parts. Changes to various aspects or components, combined with increased scale, distribution, and reliance on dependencies, continue to amplify the effects of a modification in one part of this chain. These changes can have unpredictable effects on the entire service delivery system. As a result, misconfiguration-related outages will continue to occur.

As the implementation of configuration changes and modifications becomes increasingly automated, individual changes can result in numerous outcomes. Although testing can be conducted, it is impractical to test every possible scenario fully. Instead, monitoring the service as a whole provides the best opportunity to mitigate disruptions and user impact. This approach also offers valuable insight into faults or degraded components, allowing for appropriate actions, such as rolling back a change, redirecting to alternative resources, or implementing a contingency plan.
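As a rough sketch of what acting on whole-of-service monitoring could look like, the snippet below compares error rates before and after a change and decides whether to roll back. The metrics are stubbed in rather than pulled from a real monitoring API, and the threshold is an arbitrary placeholder.

# Hypothetical sketch: compare error rates before and after a change and
# decide whether to roll back. In practice the numbers would come from your
# monitoring platform, not hard-coded values.
def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def should_roll_back(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    # Roll back if the post-change error rate exceeds the pre-change rate
    # by more than the tolerance (2 percentage points by default).
    return current - baseline > tolerance

baseline_rate = error_rate(errors=120, requests=50_000)   # before the change
current_rate = error_rate(errors=2_400, requests=48_000)  # after the change

if should_roll_back(baseline_rate, current_rate):
    print("post-change error rate is elevated: roll back")
else:
    print("change looks healthy")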


For more on outage trends and lessons from some of the most notable outages of 2024, join us for the Top Outages of 2024 webinar.

OpenAI Outage

On December 26, OpenAI services including ChatGPT, Sora, and its API experienced issues due to a power failure at a data center operated by one of its cloud providers. The impact manifested as “internal server errors” for users trying to access these services, although the front-facing impact may have been somewhat mitigated by the holiday timing. Still, the duration of the incident—just shy of eight hours for ChatGPT, during the daytime for those on Pacific Standard Time—was likely a significant cause of concern for Operations teams and for the company, which has pledged “a major infrastructure initiative” in response.

A key focus for OpenAI will likely be making sure that an issue with a single infrastructure cluster or hosting provider can't disrupt critical backend services—such as database access—outside of that region, or do so “for an extended period.” In a post-incident report, OpenAI said its databases “are globally replicated but region-wide failover currently requires manual intervention from the hosting cloud provider.” Manual failover processes were initiated, but OpenAI suggested this did not work well in practice: “Our scale elongated the mitigation time.”

The company said it will start working to improve its failover processes and capabilities in the coming weeks.

Google Cloud Pub/Sub Disruption

On January 8, Google Cloud experienced a misconfiguration-related incident that impacted multiple regions of Pub/Sub, Cloud Logging, and BigQuery Data Transfer Service for close to 75 minutes. Pub/Sub is messaging middleware used for streaming analytics, data integration pipelines, (micro)service integration, and other tasks.

Google engineers traced the root cause of the incident to a “bad service configuration change” to a regional database that stores “information about published messages and the order in which those messages were published for ordered delivery.” The configuration change “unintentionally over-restricted the permission to access this database” and “was rolled out to multiple regions,” which conflicted with internal rollout processes. Google added that the problem was not picked up in pre-production “due to a mismatch in the configuration between the two environments.”
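For readers less familiar with the ordered delivery feature referenced here, the snippet below shows how a publisher typically opts into message ordering with Google Cloud's Python Pub/Sub client. The project, topic, and ordering key are placeholders, and the example is purely illustrative; it isn't drawn from Google's incident report.

# Illustrative use of Pub/Sub ordered delivery with the official Python client.
# Project, topic, and ordering key are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "my-topic")

# Messages that share an ordering key are delivered to subscribers in the
# order they were published; tracking that order is the kind of metadata the
# affected regional database stored.
for i in range(3):
    future = publisher.publish(
        topic_path, data=f"event-{i}".encode("utf-8"), ordering_key="customer-123"
    )
    print(future.result())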

While the configuration change was rolled back, a “latent bug” in Pub/Sub itself prolonged the disruption and required engineers to make further repairs.

Lessons From a Netflix Incident

Eagle-eyed readers—or boxing fans—will recall the issues with the record-breaking streaming of the Jake Paul vs. Mike Tyson and Katie Taylor vs. Amanda Serrano boxing events in late November. A number of users on the livestream reported buffering, freezing, and laggy performance.

Since we first covered this incident on the blog, I’ve studied it further and wanted to highlight a few additional takeaways for ITOps teams (also see our full analysis here). Our observations include that:

  • Different regions experienced disruptions at varied times: users on U.S. ISPs saw increased error rates at the start of the main event, while users on Australian ISPs faced issues throughout the broadcast, although errors were not widespread.

  • No single provider or network path had significant issues.

  • The issues suggest congestion at appliances running Netflix’s content delivery network, Open Connect.

The incident offers some valuable lessons for ITOps teams seeking to assure flawless digital experiences during major events (notable sporting events, big sales like Black Friday, etc.). First, it’s vital to understand your application’s full service delivery chain so you can determine the most relevant signals and vantage points to monitor, enabling you to quickly identify issues and possible areas for optimization. Also map out the requirements of normal day-to-day service, as well as the special event or season you’re preparing for. With this knowledge, your team can make informed decisions about what processes to implement so users have quality experiences during these critical moments.
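As a simplified illustration of the “signals and vantage points” idea, a basic approach is to probe the same endpoint from several regions and compare availability and response time. The probe URLs below are placeholders, and dedicated monitoring agents do considerably more than this sketch.

# Hypothetical sketch: poll the same service from several vantage points and
# compare availability and latency. Endpoints are placeholders.
import time
import urllib.request

VANTAGE_POINTS = {
    "us-east": "https://us-east.probe.example.com/check",
    "eu-west": "https://eu-west.probe.example.com/check",
    "ap-southeast": "https://ap-southeast.probe.example.com/check",
}

def probe(url: str) -> tuple[bool, float]:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start

for region, url in VANTAGE_POINTS.items():
    ok, elapsed = probe(url)
    print(f"{region}: {'OK' if ok else 'ERROR'} in {elapsed * 1000:.0f} ms")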


By the Numbers

In addition to our earlier discussion of the 2024 outage trends, let’s close with our usual deep dive into the global trends that ThousandEyes observed over recent weeks (December 16, 2024 - January 12, 2025) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.

  • During this four-week period, the total number of global outages initially showed a downward trend before rising again. In the first week, ThousandEyes observed a 4% decrease in outages, dropping from 205 to 197. This downward trend continued into the following week (December 23-29), with the number of outages decreasing significantly from 197 to 76, marking a 61% drop compared to the previous week.

  • However, this downward trend ended in the third week, as outages began to rise again, increasing from 76 to 148, representing a 95% increase compared to the previous week. The following week (January 6-12), outages doubled, rising from 148 to 296 (the week-over-week arithmetic is recreated in the short sketch after this list).

  • This pattern was not fully reflected in the outages observed in the United States. There was a slight increase during the first week (December 16-22), with outages rising by 2%. However, the following week saw a significant reversal, as outages decreased dramatically from 111 to 28, representing a 74% decrease compared to the previous week. In the next two weeks, the upward trend returned, with outages initially rising 179%, jumping from 28 to 78 during the week of December 30 - January 5. Outages then increased again the next week, rising from 78 to 117, a 50% increase.

  • From December 16-29, an average of 51% of all network outages occurred in the United States. This marks a decrease from the 58% reported during the previous period, December 2-15. As we headed into the new year, U.S. outages accounted for 44% of all outages from December 30 - January 12. On average, throughout 2024, U.S.-centric outages represented 42% of total global outages, up from 38% observed in 2023.

  • In December, a total of 724 outages were observed globally, reflecting a 14% decrease from the 840 outages recorded in November. In the United States, outages also declined, dropping from 501 in November to 408 in December. This trend is consistent with previous years, as total outages typically decrease from November to December, both globally and in the U.S., due to the holiday period.
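If you want to recreate the week-over-week percentages quoted above, the arithmetic is straightforward; the counts in the sketch below are the global figures from this section.

# Week-over-week change in global outage counts; 205 is the week prior to
# the reporting period, then December 16-22 onward through January 6-12.
global_outages = [205, 197, 76, 148, 296]

def pct_change(prev: int, curr: int) -> float:
    return 100 * (curr - prev) / prev

for prev, curr in zip(global_outages, global_outages[1:]):
    print(f"{prev} -> {curr}: {pct_change(prev, curr):+.0f}%")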

Figure 2. Global and U.S. network outage trends over eight recent weeks
