
Transaction time dropped to zero during the outage, while front-edge services remained available, with no significant network loss, indicating application-side issues.

March 17, 2025
Workday Outage
What Happened?

On March 17, ThousandEyes observed an incident impacting Workday users attempting to access the service. Throughout the approximately one-hour disruption, users received a Workday-generated “service unavailable” message, indicating a potential issue with the application backend. Further pointing to an application-related cause, ThousandEyes saw no network problems connecting to Workday’s frontend web servers during the disruption.

Learning

During an outage, viewing signals in isolation, instead of considering all data points holistically, can lead to incorrect assumptions about the cause. While it might be tempting to jump to conclusions about a disruption’s source—especially when the incident coincides with a major event that seems an obvious culprit, such as Workday’s R1 feature release—doing so can prompt inappropriate mitigation strategies and hinder resolution. In this case, the Workday outage ended up having no connection to the feature release, according to the company’s statement.

Geo Impact
GLOBAL

A series of disruptions affecting social media platform X on March 10. See more in the ThousandEyes platform (no login required).

March 10, 2025
X Outage
What Happened?

On March 10, social media platform X experienced a series of disruptions during an 8.5-hour period. These incidents rendered the service inaccessible to some users worldwide. During the disruptions, ThousandEyes observed network conditions characteristic of a DDoS attack, including significant traffic loss conditions, which would have hindered users from reaching the application.

Learning

Having contextual visibility throughout your entire service delivery chain allows your IT team to quickly identify and clarify areas of responsibility. This insight helps enable informed decision-making regarding the necessary steps or mitigation processes to implement. Furthermore, with comprehensive visibility, teams can evaluate the effectiveness of their mitigation efforts. Understanding whether these efforts are working or inadvertently worsening the situation is vital.

Geo Impact
GLOBAL

Slack users around the globe received HTTP 500 errors. See more in the ThousandEyes platform (no login required).

February 26, 2025
Slack Outage
What Happened?

On February 26, ThousandEyes detected a service disruption affecting global users of Slack that lasted for over nine hours. Network connectivity to the application’s frontend servers remained intact; however, users attempting to access the application received server errors, suggesting a backend application issue.

Learning

Verifying the performance of an entire service delivery chain for a service that’s experiencing issues is crucial for pinpointing the specific fault domain. This understanding guides the actions your team takes to help ensure the continuity or recovery of the impacted service. Possible responses may include switching to an alternative system, implementing mitigations on your side, or deciding to do nothing and waiting for the issue to resolve itself.

Geo Impact
GLOBAL

ChatGPT application experiencing page load timeouts and HTTP service unavailable errors, with no corresponding network conditions. See more in the ThousandEyes platform (no login required).

January 23, 2025
ChatGPT Outage
What Happened?

OpenAI's ChatGPT application experienced an outage that lasted around three hours, preventing users worldwide from accessing the service. Observations from ThousandEyes during the outage indicated issues with the backend services. While there were problems loading site content and errors such as "service unavailable" and "bad gateway," no network issues affecting the connection to the ChatGPT frontend web servers were detected.

Learning

In the event of an outage, it’s crucial to consider all data points and interpret the signals within their appropriate context. This approach will help you establish whether the issue stems from the network or the application backend. Keep in mind that a single signal, such as a timeout error, should be regarded as just a symptom, not a definitive diagnosis. To pinpoint the root cause effectively, it’s important to analyze multiple signals from both the network and the application.
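
To make this concrete, here is a minimal, hypothetical triage sketch in Python that pairs a network-layer signal (a TCP handshake) with an application-layer signal (an HTTP status) before pointing at a fault domain. The target example.com is a placeholder for the affected service, and the decision thresholds are illustrative assumptions, not a ThousandEyes implementation.

```python
# Minimal triage sketch: collect a network-layer signal (TCP handshake) and an
# application-layer signal (HTTP status) before pointing at a fault domain.
# example.com is a placeholder for the affected service.
import socket
import urllib.error
import urllib.request

HOST = "example.com"  # placeholder target

def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Network-layer signal: does a TCP handshake complete?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_status(url: str, timeout: float = 5.0):
    """Application-layer signal: what does the front end answer?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, e.g., 500/502/503
    except OSError:
        return None       # no usable answer at all

net_ok = tcp_reachable(HOST)
status = http_status(f"https://{HOST}/")

if not net_ok:
    print("TCP handshake failed: suspect the network path")
elif status is None or status >= 500:
    print(f"Network OK but HTTP status {status}: suspect the application backend")
else:
    print(f"HTTP {status}: both layers look healthy from this vantage point")
```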

Geo Impact
GLOBAL

All regions affected; the issue manifested as HTTP 500 and server timeout errors. See more in the ThousandEyes platform (no login required).

December 11, 2024
Meta Outage
What Happened?

On December 11, Meta services, including Facebook and Instagram, experienced an outage. ThousandEyes detected internal server errors and timeouts, which may indicate issues with Meta's backend services. During the incident, network connectivity to Meta’s frontend web servers appeared to remain unaffected.

Learning

Identifying commonalities during an outage can be helpful in determining the cause. For instance, when an outage impacts multiple services from the same company across various regions, it can help narrow down the potential issue. In such cases, it’s more likely to be linked to a backend service problem.

Geo Impact
GLOBAL

Partial page load, with requests for further information met with an HTTP 403 response. See more in the ThousandEyes platform (no login required).

December 11, 2024
OpenAI Outage
What Happened?

On December 11, OpenAI experienced an outage that prevented users from using ChatGPT and Sora services. The company reported that the issue was the result of a “new telemetry service deployment” that “unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems.” ThousandEyes observations during the outage also pointed to backend application issues; while there were site content loading problems, no network issues affecting the connection to the ChatGPT frontend web servers were detected.

Learning

Understanding all the dependencies in your service delivery chain improves monitoring, helping you better understand the impact and nature of issues that occur when deploying new services, features, or updates. This knowledge allows you to take the most appropriate and efficient recovery steps, such as reverting changes or applying workarounds, to minimize the impact on users.

Geo Impact
GLOBAL

The impact of the Microsoft outage shown using ThousandEyes Internet Insights.

November 25, 2024
Microsoft Outage
What Happened?

Microsoft users in multiple regions encountered difficulties with some of the company’s services, including Outlook Online, due to a prolonged outage. Initially, the incident manifested as intermittent timeout and application errors, but the outage’s scope increased about five hours after its start, and again about 10.5 hours in. During the outage, ThousandEyes observed various conditions, including server errors, timeouts, and packet loss.

Learning

Intermittent issues often show as slow performance, making them difficult to identify. ITOps teams need a clear performance baseline to spot deviations that indicate outages. When disruptions happen, quickly identifying the source is crucial, starting by ruling out non-issues. By leveraging this information with other data, teams can better understand the outage's cause and communicate effectively with users.
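
As a rough illustration of that baselining idea, the sketch below flags response-time samples that drift from a rolling baseline. The window size and three-sigma threshold are arbitrary assumptions for demonstration, not tuned recommendations.

```python
# Toy baseline sketch: flag response-time samples that deviate from a rolling
# baseline. Window size and the three-sigma threshold are illustrative
# assumptions, not tuned recommendations.
from collections import deque
from statistics import mean, stdev

WINDOW = 50       # samples that form the baseline
THRESHOLD = 3.0   # deviations beyond this many sigmas get flagged

baseline = deque(maxlen=WINDOW)

def check(latency_ms: float) -> str:
    verdict = "ok"
    if len(baseline) >= 10:  # need some history before judging
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(latency_ms - mu) > THRESHOLD * sigma:
            verdict = f"DEVIATION: {latency_ms:.0f} ms vs baseline {mu:.0f} ms"
    if verdict == "ok":
        baseline.append(latency_ms)  # don't let spikes poison the baseline
    return verdict

# Steady ~80 ms samples, then an intermittent spike that should be flagged
for sample in [78, 81, 83, 80, 79, 82, 77, 80, 81, 79, 80, 420]:
    print(check(sample))
```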

Geo Impact
GLOBAL

Global users seeing timeouts when attempting to connect to a Workday instance hosted in Portland. See more in the ThousandEyes platform (no login required).

October 1, 2024
Workday Outage
What Happened?

On October 1, ThousandEyes observed an outage impacting users connecting to Workday, specifically customers whose instances are hosted in Workday’s WD5 data center in Portland, Oregon. Impacted users intermittently experienced server timeouts, which suggests application-side issues.

Learning

Understanding your entire service delivery chain, including all dependencies, can help identify single aggregation points that, under fault conditions, could impact an entire service. This allows you to either remove them by building in redundancy or prepare mitigation plans in the event that they do encounter issues.

Geo Impact
GLOBAL

Salesforce users receiving HTTP server error messages when attempting to connect to the Salesforce service. See more in the ThousandEyes platform (no login required).

October 1, 2024
Salesforce Outage
What Happened?

On October 1, Salesforce instances hosted in AWS experienced an outage in which customers encountered timeouts, slow performance, and inaccessibility of the service, indicating application-related issues.

Learning

When an outage occurs, it is crucial to quickly identify the fault domain. This not only helps determine who is responsible for resolution but also guides the steps that can be taken to address the issue.

Geo Impact
GLOBAL

The Cloudflare incident impacted various companies and regions around the globe. See more in the ThousandEyes platform (no login required).

September 16, 2024
Cloudflare Outage
What Happened?

On September 16, Cloudflare experienced an approximately two-hour performance incident that led to reachability issues for some applications leveraging Cloudflare’s CDN and networking services. Impacted applications included Zoom and HubSpot.

Learning

It's crucial to have a deep understanding of your entire service delivery process, including the applications you depend on and the suppliers they are connected to. This will enable you to respond effectively and offer solutions to minimize or alleviate any issues that may arise.

Geo Impact
GLOBAL

From ThousandEyes observations, the Microsoft incident only impacted a subset of users connecting to Microsoft’s network via AT&T. See more in the ThousandEyes platform (no login required).

September 12, 2024
Microsoft Incident
What Happened?

On September 12, some users experienced issues reaching Microsoft services like Microsoft 365. During the approximately 1.5-hour incident, ThousandEyes observed significant network packet loss, as well as connection timeouts, within Microsoft’s network. From what ThousandEyes saw, the problems only impacted a subset of users connecting to Microsoft’s network via AT&T.

Learning

When an issue occurs, it's important to quickly determine who is affected—whether it's all users or a certain group. It's also helpful to identify if the issue is impacting any specific services. This will not only help determine who is accountable for resolution, but also what mitigation steps can be taken, such as moving to alternate systems or altering network entry points.

Geo Impact
GLOBAL

Outage in Google’s U.K. network detected by Cisco ThousandEyes. See more in the ThousandEyes platform (no login required).

August 12, 2024
Google Outage
What Happened?

On August 12, high levels of traffic loss within Google’s network in the United Kingdom impacted connectivity for some users trying to reach various Google services, including Gmail and Google Meet. Google acknowledged that the issue was due to the failure of both the primary and backup power feeds caused by a substation switchgear failure impacting their europe-west2 region. Users with workloads or services within that region appeared most affected; however, ThousandEyes also observed some impact on global connectivity.

Learning

Distributing resources across various zones and regions helps lower the risk of an entire infrastructure outage affecting all resources at the same time. As teams strive for resilient infrastructure, it’s also wise to maintain independent visibility throughout the entire digital delivery chain so any problems can be quickly resolved before they have impact.

Geo Impact
UNITED KINGDOM / MULTI-REGION

Elevated loss was observed in Microsoft’s network when attempting to access LinkedIn. See more in the ThousandEyes platform (no login required).

August 5, 2024
LinkedIn Outage
What Happened?

On August 5, Microsoft experienced an incident that affected the availability of LinkedIn for some users around the globe in a disruption that lasted a little over an hour. The outage manifested as elevated packet loss in Microsoft’s network, as well as DNS resolution timeouts and HTTP errors. ThousandEyes observed some residual network latency issues after the reported resolution; however, they did not appear to prevent users from interacting with LinkedIn services.

Learning

During the outage, the connectivity and timeout issues were intermittent and impacted users unevenly. Intermittent issues can be particularly time consuming to identify and address, especially when they involve various kinds of symptoms. It's important to analyze these symptoms collectively rather than in isolation to avoid a misdiagnosis.

Geo Impact
GLOBAL

Network disruptions impacting reachability and performance of Microsoft Azure. See more in the ThousandEyes platform (no login required).

July 30, 2024
Microsoft Azure Disruption
What Happened?

Microsoft experienced global network disruptions that impacted the reachability and performance of some Microsoft services, including the Azure portal. During the incident, ThousandEyes detected network degradation in parts of the Microsoft network. After about two hours, the incident appeared to be mostly resolved. Microsoft identified the initial trigger event as a distributed denial-of-service (DDoS) attack, but said it appears that “an error in the implementation of [Microsoft’s] defenses amplified the impact of the attack rather than mitigating it.”

Learning

Sometimes the cause of a disruption is multifaceted, with multiple factors—including your own mitigation efforts. When a disruption occurs, it's crucial to make sure any actions taken to remediate the issue are working as expected and not inadvertently making the problem worse.

Geo Impact
GLOBAL

Web service responding with an HTTP 500 Internal Server error when trying to retrieve content from backend resources running on Windows hosts.

July 19, 2024
CrowdStrike Sensor Update Incident
What Happened?

On July 19, a software update issue with CrowdStrike's security software caused widespread outages for various organizations, including airlines, banks, and hospitals. CrowdStrike stated that the problem originated from a single configuration file, which led to a logic error, resulting in system crashes and blue screens of death (BSOD) on affected Windows systems. Due to the extensive range of impacted services and the various outage conditions affecting endpoints, servers, and applications, network issues were considered as a potential cause. While Internet connectivity problems are often the common thread behind simultaneous outages across multiple apps and services, in this case, that was not a factor.

Learning

While digital services rely heavily on networks, web-based apps, and cloud services, the recent CrowdStrike incident serves as a reminder that other influences are also at play. It's crucial to efficiently pinpoint the source of a disruption, and a key part of this process is identifying what isn't causing the problem. Ensuring a smooth digital experience involves more than just the network—customers need to consider everything from the device to the app.

Geo Impact
GLOBAL

ThousandEyes observed global users connected to Starlink experiencing the impacts of the network outage.

May 29, 2024
Starlink Outage
What Happened?

On May 29, ThousandEyes observed network outage conditions impacting Starlink, with users connecting from the U.S., Europe, and Australia unable to access the Internet through the service for approximately 45 minutes. Many users would have experienced the outage as DNS timeouts when they tried to reach sites and services, since the network outage would have prevented the reachability of any Internet service, including DNS resolvers.
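
A quick way to tell a resolver problem apart from a total loss of connectivity like this one is to pair a DNS lookup with a direct-to-IP check. The sketch below is a rough illustration that assumes the dnspython package; the resolver and target IPs are well-known public addresses used purely as examples.

```python
# Sketch separating "DNS is timing out" from "the whole network path is down,"
# assuming the dnspython package. The resolver and target IPs are well-known
# public addresses used purely as examples.
import socket
import dns.resolver    # pip install dnspython
import dns.exception

def dns_ok(name: str = "example.com", server: str = "8.8.8.8") -> bool:
    """Can a public resolver answer within two seconds?"""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    try:
        resolver.resolve(name, "A", lifetime=2.0)
        return True
    except dns.exception.DNSException:
        return False

def raw_ip_ok(ip: str = "1.1.1.1", port: int = 443) -> bool:
    """Bypass DNS entirely: can we reach a well-known IP directly?"""
    try:
        with socket.create_connection((ip, port), timeout=2.0):
            return True
    except OSError:
        return False

dns_up, ip_up = dns_ok(), raw_ip_ok()
if not dns_up and not ip_up:
    print("Both DNS and direct-to-IP checks fail: likely an access network outage")
elif not dns_up:
    print("Only DNS fails: likely a resolver problem, not the network path")
else:
    print("Connectivity looks fine from this vantage point")
```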

Learning

Redundancy matters—it’s worth considering having a backup network provider to minimize disruptions if your primary ISP experiences temporary issues. This can be particularly important for hybrid and remote workers and may warrant consideration of backup network services, such as 5G MiFi or a mobile hotspot.

Geo Impact
GLOBAL

DNS disruptions impacting Salesforce reachability as visualized by ThousandEyes Internet Insights.

May 16, 2024
Salesforce DNS Disruption
What Happened?

On May 16, Salesforce experienced a disruption that intermittently impacted some customers’ ability to reach the service for more than four hours, caused by intermittent failures from a third-party DNS service provider.

Learning

ITOps teams may encounter challenges when trying to diagnose intermittent issues. It is crucial to thoroughly analyze all the data within your complete digital supply chain, including third-party dependencies, and to establish ongoing monitoring for convenient comparison of conditions.

Geo Impact
GLOBAL

The disruption impacted multiple Meta services, including Instagram, Facebook, and WhatsApp. See more in the ThousandEyes platform (no login required).

May 14, 2024
Meta Services Disruption
What Happened?

On May 14, Meta services, including Facebook, Instagram, and others, experienced a 3.5-hour disruption that impacted some global users attempting to access the applications. The cause appeared to be a backend issue, with network paths clear and web servers responding. The severity of the problems was sporadic, with the number of affected servers fluctuating. The disruption was mostly resolved for users by 2:25 AM (UTC).

Learning

When diagnosing a service disruption, it’s helpful to assess three things: 1) whether you’re dealing with a full-fledged outage or just a degradation (intermittent issues are typically a good clue it’s more aligned to a degradation), 2) the radius of impact, and 3) how the disruption is impacting the application’s functionality (is it just slow to respond, or completely unusable?).
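
Those three questions can be codified as a small, toy heuristic. The sketch below is illustrative only; the threshold and labels are assumptions, not a formal classification scheme.

```python
# Toy heuristic codifying the three questions above. Thresholds and labels are
# illustrative assumptions, not a formal classification scheme.
from dataclasses import dataclass

@dataclass
class Probe:
    region: str        # where the test ran
    ok: bool           # did the request succeed?
    latency_ms: float  # how long did it take?

def triage(probes: list[Probe], slow_ms: float = 2000.0) -> str:
    failed = [p for p in probes if not p.ok]
    slow = [p for p in probes if p.ok and p.latency_ms > slow_ms]
    if not failed and not slow:
        return "healthy"
    # 1) Full-fledged outage, or just a degradation?
    kind = "outage" if len(failed) == len(probes) else "degradation"
    # 2) Radius of impact
    regions = sorted({p.region for p in failed + slow})
    # 3) Functional impact: unusable, or merely slow?
    impact = "unusable for some users" if failed else "slow to respond"
    return f"{kind}; affected regions: {regions}; impact: {impact}"

# Example: one region down, one slow, one healthy -> a degradation
print(triage([Probe("us", True, 150.0),
              Probe("eu", True, 3200.0),
              Probe("apac", False, 0.0)]))
```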

Geo Impact
GLOBAL

Multiple regions experienced HTTP errors when trying to access google.com. See more in the ThousandEyes platform (no login required).

May 1, 2024
google.com Outage
What Happened?

ThousandEyes observed a global disruption of google.com on May 1 that lasted nearly an hour. During the incident, many users attempting to perform a search on google.com appeared to receive HTTP 502 Bad Gateway errors, indicating an issue reaching a backend service. The problem seemed to lie with the connectivity or linkage between google.com and the search engine, rather than with the search engine itself, as other services utilizing the search engine's functions appeared unaffected.

Learning

When issues arise, quickly determine the domain from which the issue stems (backend, front-end service, local environment, network connectivity, CDN, etc.) and assess the extent of the disruption. Identify which functions or services are impacted. This will allow you to make informed decisions about what actions to take to minimize the impact on your users—preferably before they are impacted at all.

Geo Impact
GLOBAL

Locations around the globe appeared impacted by the X outage. See more in the ThousandEyes platform (no login required).

April 29, 2024
X Outage
What Happened?

On April 29, X (formerly Twitter) experienced an outage that lasted about one hour and appeared to prevent some users from interacting with the social media platform. During the outage, requests to the application seemed to be timing out, suggesting an issue with the application backend. The network also didn’t appear to be experiencing problems, further indicating that the issue lay with backend systems.

Learning

In the event of an outage, it is crucial to consider all data points and interpret the signals within their context to identify whether the issue is related to the network or the application backend. It is important to keep in mind that a single signal, like a timeout error, should be seen as a symptom and not the diagnosis. To accurately isolate the root cause, you will need to examine multiple signals from across the network and application.

Geo Impact
GLOBAL

ThousandEyes observed outage conditions in Microsoft’s UK network. See more in the ThousandEyes platform (no login required).

April 5, 2024
Microsoft Azure Outage
What Happened?

Around 8:50 AM (UTC), ThousandEyes observed degraded availability for customers using Azure services that depend on Azure Front Door. These customers would have likely experienced intermittent degraded performance, latency, and/or timeouts when attempting to access services hosted in the United Kingdom South region, which appeared to be centered on nodes located in London, England.

Learning

Once you identify where an outage is occurring, it may be possible to route around it or to an alternate location to avoid disruption. Determine which location and which workloads your services are leveraging so that you can easily identify an alternate route if needed.

Geo Impact
EMEA

Network paths to WhatsApp services appeared to be clear during the disruption.

April 3, 2024
WhatsApp Service Disruption
What Happened?

On April 3, Meta’s WhatsApp experienced global service disruptions that impacted users’ ability to send or receive messages successfully. ThousandEyes data indicates that the service disruptions were not due to network issues connecting to Meta’s frontend web servers. Network paths to WhatsApp services appeared clear during the incident, suggesting that the issue was on the application's backend.

Learning

Efficiently identifying which parts of your service delivery chain are functioning correctly is an important step in troubleshooting and figuring out the root cause of an issue.

Geo Impact
GLOBAL

The Atlassian Confluence disruption impacted users across the globe. See more in the ThousandEyes platform (no login required).

March 26, 2024
Atlassian Confluence Disruption
What Happened?

At approximately 1:20 PM (UTC) on March 26, ThousandEyes observed a global disruption of team workspace application Atlassian Confluence. The application began responding to access requests with 502 Bad Gateway server errors, suggesting an application backend issue. Network paths to the application’s frontend web servers, which are hosted in AWS, were clear throughout the incident. By 2:35 PM (UTC), the service had completely recovered.

Learning

Error messages can give you some clues about what went wrong, but without context, these codes are like breadcrumbs, leading you in a direction but not giving the full picture. Isolating the true root cause requires looking at a variety of factors in context, including any third-party providers that your application or network infrastructure rely on.

Geo Impact
GLOBAL

Global users were unable to access the LinkedIn web application starting at 8:45 PM (PST) on March 6. See more in the ThousandEyes platform (no login required).

March 6, 2024
LinkedIn Outage
What Happened?

On March 6, ThousandEyes detected an almost two-hour service disruption for global users of LinkedIn that manifested as service unavailable error messages, suggesting a backend application issue.

Learning

Applications can be complex with many dependencies, creating a network of interdependent services that can encounter issues and fail at any time. When issues arise, it's important to identify the responsible party and determine if it's within your control. Additionally, having a backup plan is essential to ensure business continuity in case of any unexpected failures.

Geo Impact
GLOBAL

Comcast network outage impacting the reachability of various services. See more in the ThousandEyes platform (no login required).

March 5, 2024
Comcast Outage
What Happened?

On March 5, ThousandEyes observed outage conditions in parts of Comcast’s network, which impacted the reachability of many applications and services, including Webex, Salesforce, and AWS. The nearly two-hour outage seems to have affected traffic as it traversed Comcast’s network backbone in Texas, including traffic coming from states such as California and Colorado.

Learning

When services you rely on experience issues, a good first question to investigate is whether the problem lies with you, that application, or a third-party provider. Then you can implement appropriate backup plans and remediation efforts.

Geo Impact
GLOBAL

Users around the globe experienced login failures when attempting to access the Facebook application.

March 5, 2024
Meta Outage
What Happened?

On March 5, starting at approximately 15:00 UTC, Meta services, including Facebook and Instagram, experienced an approximately two-hour disruption that prevented users from accessing those apps. ThousandEyes observed that Meta’s web servers remained reachable; however, users attempting to log in received error messages, suggesting problems with a backend service, such as authentication, might have caused the disruption.

Learning

Even the most robust system can be vulnerable to disruptions and errors. Complete visibility of the service and its associated service delivery chain is crucial for identifying any decrease in performance or functionality. This enables quick issue resolution and helps establish processes for reducing the impact of current and future issues.

Geo Impact
GLOBAL

ThousandEyes observed a more than seven-hour disruption that impacted Microsoft Teams. See more in the ThousandEyes platform (no login required).

January 26, 2024
Microsoft Teams Service Disruption
What Happened?

During a more than seven-hour disruption, Microsoft Teams users around the globe experienced service failures that impacted their ability to use the collaboration tool. ThousandEyes didn’t observe any packet loss connecting to the Microsoft Teams edge servers; however, ThousandEyes did observe application layer failures that are consistent with reported issues within Microsoft’s network that may have prevented the service’s edge servers from reaching application components on the backend.

Learning

During the disruption, Microsoft Teams appeared partially available, but some users experienced issues with core functions like messaging and calling. As a result, some may have initially assumed the problem was on their end, not Microsoft’s. Taking a look at the full end-to-end service delivery chain matters for accurate troubleshooting.

Geo Impact
GLOBAL

100% packet loss observed during the Optus outage.

November 8, 2023
Optus Outage
What Happened?

On November 8, Optus, a major network provider in Australia, experienced an outage that affected both their mobile and broadband network services across the country, impacting over 10 million people and thousands of businesses. ThousandEyes observed a 100% packet loss in Optus’ network, indicating that users were unable to access Internet-connected services. The disruption began at 4:00 AM AEDT and lasted until approximately 12:00 PM to 1:00 PM AEDT, after which connectivity slowly began to return. Service levels eventually normalized for most users by 2:00 PM AEDT.

Learning

Even major providers experience outages sometimes, and when they do, the impacts can be far reaching. Even if your business isn’t directly affected, another service you rely on might be. It’s essential to be able to quickly identify the source of any issues you’re encountering so you can take the right steps to minimize the impact on your own customers.

Geo Impact
AUSTRALIA

OneLogin disruption experienced by multiple locations.
See more in the ThousandEyes platform (no login required).

October 26, 2023
OneLogin Disruption
What Happened?

OneLogin, an identity and access management solution provider, experienced an intermittent service disruption that lasted approximately two hours, with some users receiving HTTP 5XX responses. While packet loss was observed in the path towards the end of the incident, it was likely a secondary symptom or a result of the recovery. The problem appeared to be application-related, given the HTTP 5XX errors that were received.

Learning

Sometimes issues like packet loss can be a secondary symptom or result of outage recovery, rather than a direct result of the incident. Beyond the network layer, also measure and assess upper-layer application performance to properly narrow down the root cause.

Geo Impact
NORTH AMERICA

Global availability issues for Square services.

September 8, 2023
Square Outage
What Happened?

On September 8, contactless payments terminal and service provider Square experienced system connectivity-related issues that led to businesses being unable to process transactions. The disruption hit multiple Square services, and took 18.5 hours to clear completely. ThousandEyes observed intermittent dropouts and 503 “service unavailable” errors, indicative of an internal routing issue or similar backend system problem. Square confirmed that backend issues were the cause: The outage impacted their DNS and was caused by changes to their internal network software.

Learning

Square is taking a number of steps to guard against future issues, including expanding offline payment capabilities. Backup systems like this are critical for minimizing the impact of outages.

Geo Impact
GLOBAL

During the incident, ThousandEyes observed higher-than-normal page load times for global users trying to reach Slack, with the increase in page load time coinciding with incomplete page loads.
See more in the ThousandEyes platform (no login required).

August 2, 2023
Slack Outage
What Happened?

On August 2, the communication platform Slack suffered a two-hour outage resulting in several issues, such as users being unable to share screenshots or upload files, and images appearing blurred or grayed out. Slack reported that the root cause was a “routine database cluster migration” that accidentally reduced database capacity to the point at which it could not support a regularly scheduled job that was running.

Learning

Run in isolation, there was nothing wrong with the cluster migration or the other scheduled job; it was only when they ran in parallel that a problem manifested. While it’s unclear what communication the Slack team had in place, the incident is a reminder that good coordination between teams can help guard against issues when working on complex, distributed web-based applications or services.

Geo Impact
GLOBAL

Global locations failing to access an application hosted within AWS.
See more in the ThousandEyes platform (no login required).

June 13, 2023
AWS Outage
What Happened?

On June 13, Amazon Web Services (AWS) experienced an incident that impacted a number of services in the US-EAST-1 region. During the incident, which lasted more than 2 hours, ThousandEyes observed an increase in response times, server timeouts, and HTTP 5XX errors affecting the availability of applications hosted within AWS.

Learning

Many services offered by cloud providers have fundamental architectural dependencies on one another. Companies using cloud services should understand the relationships in their digital ecosystem, regardless of whether those relationships are services or networks.

Geo Impact
GLOBAL

The Concur outage impacted users in various geographies across the globe.
See more in the ThousandEyes platform (no login required).

April 5, 2023
Concur Outage
What Happened?

Around 3:20 PM (UTC), Concur faced an outage that lasted for approximately three hours. During this time, some users around the world experienced problems accessing the app. Users in certain locations experienced HTTP timeouts, and others experienced spiking page load times. Recovery began around 6:25 PM (UTC) and was complete around 6:35 PM (UTC).

Learning

To provide your customers with the seamless digital experience you want to give them, it’s crucial to quickly identify the root cause of any disruption that arises. Then you can take appropriate measures to mitigate the outage’s impact—and guard against similar disruptions in the future.

Geo Impact
GLOBAL

Site hosted within the Virgin Media UK network (AS 5089) is unreachable to users during the incident.
See more in the ThousandEyes platform (no login required).

April 4, 2023
Virgin Media UK Outages
What Happened?

On April 4, Virgin Media UK (AS 5089) experienced two outages that impacted the reachability of its network and services to the global Internet for hours. The two outages shared similar characteristics, including the withdrawal of BGP routes to its network, traffic loss, and intermittent periods of service recovery.

Learning

When making a BGP change, it's important to understand the effect on the data plane, including the impact to network traffic paths, IP performance, and service reachability.
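
One lightweight way to verify that a BGP change took effect as intended is to query a public route-collector API and then confirm with data-plane tests. The sketch below assumes RIPE NCC's public RIPEstat routing-status endpoint, uses RIPE's own 193.0.0.0/21 prefix as a placeholder, and reads response fields defensively since the schema may differ.

```python
# Control-plane spot check after a BGP change, using RIPE NCC's public RIPEstat
# API (no authentication required). 193.0.0.0/21 is a placeholder prefix;
# fields are read with .get() in case the response schema differs.
import json
import urllib.request

PREFIX = "193.0.0.0/21"  # substitute the prefix you just changed

url = f"https://stat.ripe.net/data/routing-status/data.json?resource={PREFIX}"
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp).get("data", {})

# How widely do route collectors see the prefix right now?
print("visibility:", data.get("visibility"))
print("last seen: ", data.get("last_seen"))
# A control-plane "announced" state is not enough: follow up with data-plane
# tests (ping, traceroute, HTTP checks) to confirm traffic actually flows.
```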

Geo Impact
UK

Reddit users experienced content loading issues.
See more in the ThousandEyes platform (no login required).

March 14, 2023
Reddit Outage
What Happened?

On March 14, starting at approximately 19:05 UTC, ThousandEyes observed an outage impacting global users of Reddit, a popular online platform that’s home to thousands of communities. Content loading issues and HTTP 503 service unavailable errors indicated a backend or internal app problem. The outage lasted a total of five hours, according to Reddit’s official status page.

Learning

Just because a site is reachable doesn’t mean it’s working as desired. Robust visibility is needed to be able to identify any backend problems that might be responsible for challenges users are encountering.

Geo Impact
GLOBAL

Multiple geographies and applications were impacted by the Akamai outage, including LastPass, SAP Concur, and Salesforce.
See more in the ThousandEyes platform (no login required).

March 7, 2023
Akamai Outage
What Happened?

At approximately 20:20 UTC, a number of applications served by Akamai Edge experienced sporadic availability issues for approximately 1.5 hours. ThousandEyes observed occasional packet loss and timeout conditions that might have prevented some users from accessing or interacting with apps that rely on Akamai's CDN service. Affected applications included Salesforce, the Azure console, SAP Concur, and LastPass, among others. The issue seemed to have mostly been resolved by around 21:55 UTC, but some applications and users continued encountering intermittent issues for a longer period of time.

Learning

It’s important to deeply understand your end-to-end service delivery chain, not only the apps you rely on but also the providers they use.

Geo Impact
GLOBAL

Twitter users globally experienced access and service disruption.
See more in the ThousandEyes platform (no login required).

March 6, 2023
Twitter Outage
What Happened?

At approximately 16:45 UTC on March 6, Twitter experienced a one-hour service disruption that prevented many Twitter users around the world from accessing its app or following links. While the app was reachable from a network standpoint, users were receiving HTTP 403 forbidden errors, indicating a backend application issue.

Learning

It is important to continuously monitor the performance of the applications that you rely on. This will help you identify small and large performance degradations as they happen, which could, cumulatively, have a similar impact on your digital experience as a complete outage. For instance, on March 6, Twitter experienced a service-wide disruption, which was their only such event over the preceding six months. However, during that same period, instances of partial service degradation increased.

Geo Impact
GLOBAL

The Microsoft Outlook outage, as seen on the ThousandEyes Internet Outages Map.
View more in the ThousandEyes platform (no login required).

February 7, 2023
Microsoft Outlook Outage
What Happened?

Microsoft Outlook users across the globe had trouble accessing the email service due to an outage that lasted about 1.5 hours, with intermittent problems persisting for several more. ThousandEyes data identified elevated server response timeouts and slow page loading, conditions indicative of an application-related issue.

Learning

Even large, established providers experience outages sometimes. With proper visibility, your team can quickly discern the source of a problem (is it on your end or theirs?) and take the right steps to respond.

Geo Impact
GLOBAL

During the Okta incident, some users received HTTP 403 forbidden errors.
See more in the ThousandEyes platform (no login required).

February 3, 2023
Okta Outage
What Happened?

Beginning at approximately 6:10 PM (UTC), ThousandEyes observed a disruption to Okta availability for some global users. Throughout the incident, HTTP 403 forbidden errors were seen, indicating a backend service issue rather than Internet or network problems connecting to Okta. The incident appeared to resolve for most users approximately 30 minutes later, around 6:40 PM (UTC).

Learning

As a single sign-on (SSO) service, for many organizations, Okta serves as the “front door” to various other apps they rely on, so when Okta’s not available, teams may have difficulty accessing other apps as well. In the complex digital ecosystem that companies rely on today, it’s important to understand all your dependencies and to have backup plans in place should the need arise.

Geo Impact
GLOBAL

Spike in Microsoft service outages, as seen through ThousandEyes Internet Insights.

January 25, 2023
Microsoft Outage
What Happened?

Microsoft experienced a significant global disruption that impacted connectivity to many of its services, including Azure, Teams, Outlook, and SharePoint. The outage was triggered by an external BGP change by Microsoft that impacted connected service providers. The bulk of the incident lasted about 1.5 hours, but residual connectivity issues were observed into the following day.

Learning

Change always carries risk, but with diligent process, change reviews, prepared rollback plans, and quality assurance testing before and after any change, you can greatly reduce your chances of disruption.

Geo Impact
GLOBAL

Path visualization showing packet loss for some AWS customer traffic as it transited between us-east-2 and a number of global locations.
See more in the ThousandEyes platform (no login required).

December 5, 2022
AWS Outage
What Happened?

ThousandEyes observed significant packet loss between a number of global locations and AWS' us-east-2 region for more than an hour. The event affected end users connecting through their ISPs to services hosted in that region.

Learning

With public cloud, it’s important to monitor not just the applications themselves but also the cloud infrastructure components, including individual cloud regions and cloud availability zones and any dependent cloud software services.
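
A simple starting point is to probe an endpoint in each region you depend on, so a single-region problem stands out. The sketch below measures TCP connect time to AWS's regional EC2 API hostnames (ec2.<region>.amazonaws.com), a pattern chosen here purely for illustration; substitute your own region-specific dependencies.

```python
# Per-region reachability sketch: time a TCP handshake to a service endpoint in
# each cloud region you depend on. The ec2.<region>.amazonaws.com hostnames are
# AWS's regional EC2 API endpoints, used here purely as probe targets.
import socket
import time

REGIONS = ["us-east-1", "us-east-2", "eu-west-1"]  # regions you rely on

for region in REGIONS:
    host = f"ec2.{region}.amazonaws.com"
    start = time.monotonic()
    try:
        with socket.create_connection((host, 443), timeout=3.0):
            elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{region}: reachable in {elapsed_ms:.0f} ms")
    except OSError as err:
        print(f"{region}: FAILED ({err})")
```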

Geo Impact
GLOBAL

WhatsApp experienced intermittent packet loss during the incident.

October 25, 2022
WhatsApp Outage
What Happened?

A two-hour global WhatsApp outage left users unable to send or receive messages. The incident was related to backend application service failures rather than a network failure, and it occurred during peak hours in India, where the app is hugely popular.

Learning

An immediate feedback loop should be available to people and teams making changes to production systems so mistakes can be identified and rectified. Having data that can help rule out the network as the culprit when a production system error occurs can speed up the resolution of technical issues.

Geo Impact
GLOBAL

Sudden spike to 100% packet loss appears for traffic destined for a Zscaler proxy.

October 25, 2022
Zscaler Outage
What Happened?

Customers using Zscaler Internet Access (ZIA) experienced connectivity failures or high latency in reaching Zscaler proxies. Because Secure Service Edge (SSE) implementations typically proxy web traffic and critical business SaaS tools, apps like Salesforce, ServiceNow, and Microsoft 365 could have been made unreachable for some customers by this incident. The most significant packet loss lasted approximately 30 minutes and connectivity was fully restored for all user locations in about 3.5 hours.

Learning

SSE is another piece of the Internet puzzle to consider when things go awry. Having network-agnostic data for complex scenarios like this can enable quicker attribution and remediation.
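
One way to get that attribution quickly is to time the same request sent directly and through the proxy path. The sketch below assumes the requests package; both the URL and the proxy address are placeholders for your own environment.

```python
# Compare the same request sent directly vs. through the proxy path, assuming
# the requests package. URL and proxy address are placeholders; a failure or a
# large latency gap on the proxy leg points at the proxy layer, not the app.
import time
import requests  # pip install requests

URL = "https://www.example.com/"                       # a service users depend on
PROXIES = {"https": "http://proxy.corp.example:8080"}  # placeholder SSE proxy

def timed_get(proxies=None) -> str:
    start = time.monotonic()
    try:
        r = requests.get(URL, proxies=proxies, timeout=5)
        return f"HTTP {r.status_code} in {(time.monotonic() - start) * 1000:.0f} ms"
    except requests.RequestException as err:
        return f"failed ({type(err).__name__})"

print("direct:   ", timed_get())
print("via proxy:", timed_get(PROXIES))
```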

Geo Impact
GLOBAL

The Zoom outage as seen in the ThousandEyes platform.

September 15, 2022
Zoom Outage
What Happened?

A brief Zoom outage left users around the world unable to log in or join meetings. In some cases, users already in meetings were kicked out of them. The root cause appeared to be in Zoom’s backend systems, around their ability to resolve, route, or redistribute traffic.

Learning

Sometimes the app itself is causing issues rather than the network. Having visibility into which it is can prevent confusion and finger-pointing during root cause analysis.

Geo Impact
GLOBAL

Outage renders Google domain properties unreachable in several countries.
See more in the ThousandEyes platform (no login required).

August 9, 2022
Google Search & Google Maps Outage
What Happened?

Google Search and Google Maps became unavailable to users worldwide for approximately 60 minutes, with those attempting to reach the services receiving error messages. Users from the United States to Australia, Japan to South Africa could not load sites or execute functions. Applications dependent on Google's software functions also stopped working during this rare outage.

Learning

It’s important to monitor not just your application frontends but also the performance-critical dependencies that power your app.

Geo Impact
GLOBAL

Affected interfaces in the AWS network (AS 16509).

July 28, 2022
AWS Outage
What Happened?

An AWS Availability Zone power failure caused an AWS outage that impacted applications such as Webex by Cisco, Okta, and Splunk. Not all users or services were affected equally, however. Some services that had physical redundancy in place remained operational, while other services and apps took up to three hours to recover.

Learning

For cloud-delivered applications and services, some level of redundancy, either geographic or physical, should be factored in. Additionally, contingency or failover plans should be understood, documented, and rehearsed so that you’re ready to execute should the need arise.

Geo Impact
GLOBAL

100% packet loss observed for locations connecting to a Rogers customer.

July 8, 2022
Rogers Communications Outage
What Happened?

An issue with Rogers’ internal routing affected millions of users and many critical services across Canada. Rogers withdrew its prefixes due to the internal routing issue, which meant the Tier 1 provider was unreachable across the Internet for nearly 24 hours.

Learning

No provider is immune to outages, no matter how large. Understand all of your dependencies—even indirect ones—and build in redundancy for every critical service dependency identified. Consider a backup network provider so you can reduce the impact of any one ISP experiencing a disruption in service.

Geo Impact
CANADA
April 5, 2022
Atlassian Outage
What Happened?

Atlassian's Jira, Confluence, and Opsgenie are three products that many developer teams rely on. Due to a maintenance script error, these services experienced a days-long outage that impacted roughly 400 of Atlassian's customers.

Learning

Companies shouldn’t rely on status pages alone to understand application or service outages. If status pages go down or don’t share enough detail, it can be difficult to understand if or how your organization is being impacted.
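
In practice, that means pairing the vendor's status page with your own synthetic check. The sketch below assumes a Statuspage-style /api/v2/status.json endpoint (a common convention for hosted status pages); both URLs are placeholders.

```python
# Pair the vendor's status page with your own synthetic check. The
# /api/v2/status.json path follows the Statuspage convention; both URLs are
# placeholders for a real vendor page and a real service endpoint.
import json
import urllib.request

STATUS_URL = "https://status.example.com/api/v2/status.json"  # placeholder
SERVICE_URL = "https://service.example.com/"                  # placeholder

def vendor_indicator() -> str:
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
            # Statuspage reports "none" when all components are operational
            return json.load(resp).get("status", {}).get("indicator", "unknown")
    except OSError:
        return "unreachable"

def own_check_passes() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

vendor, mine = vendor_indicator(), own_check_passes()
print(f"vendor indicator: {vendor}; own check passed: {mine}")
if vendor == "none" and not mine:
    print("Status page is green but our own check fails: investigate and escalate.")
```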

Geo Impact
GLOBAL

Start of BGP advertisement of 104.244.42.0/24 by RTComm (AS 8342).

March 28, 2022
Twitter Outage
What Happened?

Twitter was rendered unreachable for approximately 45 minutes after a Russian Internet and satellite communications provider blackholed traffic by announcing one of Twitter’s prefixes. BGP misconfigurations are not uncommon. However, they can be used to block traffic in a targeted way, and it’s not always easy to tell when the situation is accidental versus intentional.

Learning

While your company might have RPKI implemented to fend off BGP threats, it's possible that your telco hasn't. That's something to consider when selecting ISPs.
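
To check whether a given origin/prefix pair validates under RPKI, you can query a public validation API. The sketch below assumes RIPEstat's rpki-validation endpoint and plugs in Twitter's AS13414 origin and the 104.244.42.0/24 prefix from this incident, reading fields defensively since the response schema may differ.

```python
# RPKI origin validation check via RIPEstat's public rpki-validation endpoint.
# AS13414 / 104.244.42.0/24 are the origin AS and prefix from this incident;
# response fields are read defensively in case the schema differs.
import json
import urllib.request

ORIGIN_AS = "AS13414"        # Twitter's origin AS
PREFIX = "104.244.42.0/24"   # the prefix announced by RTComm (AS 8342)

url = ("https://stat.ripe.net/data/rpki-validation/data.json"
       f"?resource={ORIGIN_AS}&prefix={PREFIX}")
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp).get("data", {})

print("RPKI status:", data.get("status"))  # e.g., "valid", "invalid", "unknown"
for roa in data.get("validating_roas", []):
    print("  ROA:", roa)
```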

Geo Impact
GLOBAL
February 25, 2022
British Airways Outage
What Happened?

British Airways experienced an online services outage that caused hundreds of flight cancellations and disruptions in the airline's operations. While the network paths to the airline’s online services (and servers) were reachable, the server and site responses were timing out. The issue was likely due to a central backend repository that multiple front-facing services rely on.

Learning

Architecting backends that avoid single points of failure can reduce the likelihood of a single fault cascading into a service-wide outage.

Geo Impact
GLOBAL

More Outage Insights

The Internet Report Podcast

Tune in for a podcast covering what’s working and what’s breaking on the Internet—and why.

In-Depth Outage Analyses

Keep your finger on the pulse of Internet health and notable outage events with expertly crafted blogs from ThousandEyes’ Internet Intelligence team.