The Top Internet Outages of 2024: Analyses and Takeaways

As we head into 2025, digital resilience is a top priority for IT Operations teams around the globe. When outages happen, it’s how you identify and recover from them that makes the big difference for your users and your business. And beyond that, consistent proactive optimization is essential to both elevate digital experiences for users and guard against potential problems before they impact your customers.

The biggest outages of 2024 offer plenty of lessons for ITOps teams charged with improving digital resilience in their business. Recurring themes emerged, most notably the number of outages caused by configuration changes or by failures in automated systems.

Here, we go through some of the most notable outages and disruptions of 2024, identifying key takeaways to help you assure great digital experiences for your users in 2025.

Microsoft Teams Service Disruption (January 26)
Meta Outage (March 5)
Atlassian Confluence Disruption (March 26)
Google.com Outage (May 1)
CrowdStrike Sensor Update Incident (July 19)
Cloudflare Disruption (September 16)
Microsoft Outage (November 25)
OpenAI Outage (December 11)

Dive Deeper: Watch the Top Outages of 2024 webinar now.

Microsoft Teams Service Disruption | January 26, 2024

Microsoft Teams was disrupted for more than seven hours in January, when a problem inside Microsoft’s own network affected the collaboration service.

Frozen apps, login errors, and users left hanging in meeting waiting rooms were some of the symptoms reported during the disruption, which began early in the workday for many Americans.

ThousandEyes’ own observations during the incident indicated that the failure was consistent with issues in Microsoft’s own network. Failover didn’t appear to relieve the issue for many users, although further “network and backend service optimization efforts” did eventually restore service.

Explore This Disruption in ThousandEyes | Read More

Meta Outage | March 5, 2024

On March 5, Meta experienced an outage that prevented users from accessing services including Facebook, Instagram, Messenger, and Threads. While the platform appeared to be reachable, many users were unable to proceed beyond the login or authentication process.

Shortly after the outage began, Meta confirmed that it was experiencing problems with its login services. The issue was likely caused by a failure in one of the dependencies that the login system relies on. ThousandEyes observations also point to a backend cause, as Meta’s systems appeared reachable and network paths connecting to the services didn’t display any significant network conditions that could have led to the outage.

This outage serves as a reminder that issues with just one part of the application delivery chain can render the whole service functionally unusable. It’s crucial to have full visibility into your whole digital delivery chain to help you identify any drops in performance or functionality.

Read Analysis

Atlassian Confluence Disruption | March 26, 2024

In late March, workspace application Atlassian Confluence experienced issues, resulting in customers having problems accessing the service and receiving HTTP 502 bad gateway errors.

While this was a relatively short outage, lasting just over an hour, ThousandEyes’ analysis revealed it affected users all over the globe. Tracing the network paths to the application’s frontend web servers, hosted in AWS, showed that the problem lay in the application backend rather than in network connectivity.

This is one of those outages where relying on error messages would only give you half the story. Identifying the root cause requires you to consider factors such as any third-party dependencies. Being able to rule out issues with a cloud hosting provider, say, gets you one step closer to identifying the real problem.
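The reasoning above, checking the network path first and only then reading the HTTP response, can be sketched as a small triage function. This is a minimal illustration, not a ThousandEyes API: the ProbeResult shape, the triage function, the category names, and the 5% loss threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """One synthetic test result (hypothetical shape, for illustration only)."""
    packet_loss_pct: float       # loss measured on the path to the frontend server
    path_complete: bool          # did the path trace reach the frontend server?
    http_status: Optional[int]   # None if no HTTP response came back


def triage(result: ProbeResult, loss_threshold_pct: float = 5.0) -> str:
    """Return a rough first guess at where a failure sits."""
    # Rule out (or confirm) the network before trusting the HTTP status.
    if not result.path_complete or result.packet_loss_pct >= loss_threshold_pct:
        return "network"
    # The frontend was reachable, so anything below points past the network.
    if result.http_status is None:
        return "server-unresponsive"
    if result.http_status >= 500:
        return "backend"
    if result.http_status >= 400:
        return "client-or-policy-error"
    return "ok"


# A clean path plus an HTTP 502 suggests a backend problem, as in this incident.
print(triage(ProbeResult(packet_loss_pct=0.0, path_complete=True, http_status=502)))
```

A result like "backend" is still only a starting point: it narrows the search to the application and its dependencies without naming the failing component.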

Explore This Disruption in ThousandEyes

Google.com Outage | May 1, 2024

In early May, Google.com experienced a global disruption lasting around an hour, during which users encountered HTTP 502 error messages instead of the expected search results.

The HTTP 502 status code indicates that a server acting as a gateway or proxy received an invalid response from the upstream server it contacted. It can also be a sign of overwhelming levels of traffic, but there was no reason to suspect that Google was suddenly struggling under demand: no extraordinary event had occurred that would trigger such an influx of search traffic.

ThousandEyes analysis revealed a “lights on/lights off” scenario, where service suddenly dropped, suggesting a problem with backend name resolution or something connected to policy/security verification, rather than an issue with the search engine itself.
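One way to tell a “lights on/lights off” failure from a gradual degradation is to look at the largest drop in availability between consecutive test rounds. The sketch below is illustrative only: the function names, the sample series, and the 50-point threshold are assumptions, not values taken from the ThousandEyes analysis.

```python
def largest_step_drop(availability):
    """Largest fall in availability (percentage points) between consecutive rounds."""
    return max(
        (earlier - later for earlier, later in zip(availability, availability[1:])),
        default=0.0,
    )


def looks_like_lights_off(availability, step_threshold=50.0):
    """True if availability collapsed in a single step rather than eroding over time."""
    return largest_step_drop(availability) >= step_threshold


# Hypothetical availability readings (percent) from successive test rounds.
sudden = [100.0, 100.0, 2.0, 0.0, 100.0]
gradual = [100.0, 90.0, 80.0, 70.0, 60.0]

print(looks_like_lights_off(sudden))
print(looks_like_lights_off(gradual))
```

A sudden collapse followed by a sudden recovery is more typical of a control or dependency failure than of load building up, which is the distinction the analysis draws here.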

Explore This Outage in ThousandEyes | Read More

CrowdStrike Sensor Update Incident | July 19, 2024

Organizations in Australia and New Zealand began experiencing issues on Friday, July 19, at mid-afternoon. A range of industries and major brands simultaneously reported outages as their Windows machines reportedly got stuck in a boot loop that ultimately resulted in the BSOD (Blue Screen of Death). The impact quickly spread to other geographies, causing problems with airline booking systems, grocery stores, and hospital services. And these were just the tip of the iceberg.

Initial responsibility for the widespread outage was thought to lie with Microsoft, but a different common denominator emerged: CrowdStrike, a managed detection and response (MDR) service used to protect Windows endpoints from attack.

CrowdStrike published guidance on actions and workarounds for IT administrators, along with an early technical post-incident report that attributed the incident to an issue with a single configuration file that “triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems.” Recovery wasn’t a simple task, requiring IT staff to physically attend machines to get them functional. At one point, Microsoft reported that as many as 15 reboots per machine might be needed.

Read More

Cloudflare Disruption | September 16, 2024

Cloudflare is one of the world’s biggest CDN providers, so when it catches a cold, other well-known services start sneezing.

Cloudflare’s September 16 outage lasted for around two hours, and affected applications such as Zoom and HubSpot. The ThousandEyes platform showed the impact on these third-party applications clearly, with agents in the United States, Canada, and India all failing to connect to the various applications during the outage.

This is a good example of how you can avert the “is it just me?” problem. By tracking the entire service delivery process of your applications, you can follow the network paths taken by your apps—and the suppliers they are connected to.

Explore This Disruption in ThousandEyes | Read Analysis

Microsoft Outage | November 25, 2024

Microsoft’s late November outage, which affected services such as Outlook Online, occurred in two parts and wasn’t always easy to spot.

Problems emerged around 2 AM (UTC), with symptoms such as timeouts, resolution failures, and the occasional HTTP 503 error message. The problems were intermittent and not always obvious to end users, with the service sometimes presenting as slow or laggy.

The issue appeared to be resolved within an hour or so, but four hours later problems emerged again, and this time with greater severity. ThousandEyes observed an increase in packet loss at the edge of the Microsoft network and increased congestion connecting to services.

Microsoft later explained that the problem stemmed from a configuration change that caused an “influx of retry requests routed through servers.” The outage was resolved by performing “manual restarts on a subset of machines that [were] in an unhealthy state.”

Read Analysis

OpenAI Outage | December 11, 2024

We almost made it through an entire year of outages without mentioning AI. Almost.

OpenAI’s December outage affected ChatGPT and the new generative video service, Sora. Users witnessed partial page loads, with requests for further information prompting HTTP 403 error messages.

ThousandEyes observations pointed to backend application issues, which OpenAI later confirmed, revealing that a new telemetry service deployment had “unintentionally overwhelmed the Kubernetes control plane,” causing cascading failures.

Explore This Outage in ThousandEyes | Read More

Key Takeaways From 2024

You’ll notice that most of the major outages of 2024 stemmed either from a backend configuration change with unintended consequences or from the failure of an automated system.

ITOps teams have limited control over faulty configuration changes made by service providers. However, they can enhance their overall visibility into service delivery paths, which allows them to quickly identify the source of any errors when they occur. This approach provides valuable insights into faults or degraded components, enabling teams to take appropriate actions, such as rolling back changes, redirecting to alternative resources, or implementing contingency plans. By thoroughly understanding their service delivery chains, teams can also regularly optimize services to improve digital experiences and enhance digital resilience.

As we have observed in several significant outages of 2024, error messages typically provide only a hint about what has happened; they cannot in isolation identify the cause. If 2024’s major outages deliver one lesson, it’s that your digital resilience depends on knowing what’s gone wrong—or what could potentially go wrong—even before the service providers themselves acknowledge an issue.

More Outage Insights

For more on these outages and important lessons learned, watch the Top Outages of 2024 webinar on-demand. And to stay updated throughout the year on Internet health and outage news, subscribe to The Internet Report podcast on Apple Podcasts, Spotify, SoundCloud, or wherever you get your podcasts.

To experience how ThousandEyes can help you improve digital resilience, start your free trial today.