
Transaction time dropped to zero during the outage, while front-edge services remained available, with no significant network loss, indicating application-side issues.
On March 17, ThousandEyes observed an incident impacting Workday users attempting to access the service. Throughout the approximately one-hour disruption, users received a Workday-generated “service unavailable” message, indicating a potential issue with the application backend. Further pointing to an application-related issue, ThousandEyes didn’t see any network issues connecting to Workday’s frontend web servers during the disruption.
During an outage, viewing signals in isolation, instead of considering all data points holistically, can lead to incorrect assumptions about the cause. While it might be tempting to jump to conclusions about a disruption’s source—especially when the incident coincides with a major event that seems an obvious culprit, such as Workday’s R1 feature release—doing so can prompt inappropriate mitigation strategies and hinder resolution. In this case, the Workday outage ended up having no connection to the feature release, according to the company’s statement.


A series of disruptions affecting social media platform X on March 10. See more in the ThousandEyes platform (no login required).
On March 10, social media platform X experienced a series of disruptions during an 8.5-hour period. These incidents rendered the service inaccessible to some users worldwide. During the disruptions, ThousandEyes observed network conditions characteristic of a DDoS attack, including significant traffic loss conditions, which would have hindered users from reaching the application.
Having contextual visibility throughout your entire service delivery chain allows your IT team to quickly identify and clarify areas of responsibility. This insight helps enable informed decision-making regarding the necessary steps or mitigation processes to implement. Furthermore, with comprehensive visibility, teams can evaluate the effectiveness of their mitigation efforts. Understanding whether these efforts are working or inadvertently worsening the situation is vital.


Slack users around the globe received HTTP 500 errors. See more in the ThousandEyes platform (no login required).
On February 26, ThousandEyes detected a service disruption affecting global users of Slack that lasted for over nine hours. Network connectivity to the application’s frontend servers remained intact; however, users attempting to access the application received server errors, suggesting a backend application issue.
Verifying the performance of an entire service delivery chain for a service that’s experiencing issues is crucial for pinpointing the specific fault domain. This understanding guides the actions your team takes to help ensure the continuity or recovery of the impacted service. Possible responses may include switching to an alternative system, implementing mitigations on your side, or deciding to do nothing and waiting for the issue to resolve itself.


ChatGPT application experiencing page load timeouts and HTTP service unavailable errors, with no corresponding network issues observed. See more in the ThousandEyes platform (no login required).
OpenAI's ChatGPT application experienced an outage that lasted around three hours, preventing users worldwide from accessing the service. Observations from ThousandEyes during the outage indicated issues with the backend services. While there were problems loading site content and errors such as "service unavailable" and "bad gateway," no network issues affecting the connection to the ChatGPT frontend web servers were detected.
In the event of an outage, it’s crucial to consider all data points and interpret the signals within their appropriate context. This approach will help you establish whether the issue stems from the network or the application backend. Keep in mind that a single signal, such as a timeout error, should be regarded as just a symptom, not a definitive diagnosis. To pinpoint the root cause effectively, it’s important to analyze multiple signals from both the network and the application.
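To make that concrete, here is a minimal sketch (in Python, with a hypothetical hostname) of how a probe might gather both signals before drawing a conclusion: it checks whether the frontend completes a TCP handshake and, separately, what the application itself returns, so a timeout or 5xx error is read alongside the network signal rather than on its own.

import socket
import urllib.error
import urllib.request

# Hypothetical target; substitute the frontend hostname you actually monitor.
HOST = "app.example.com"
URL = f"https://{HOST}/"

def network_reachable(host, port=443, timeout=3.0):
    """Network-layer signal: can we complete a TCP handshake with the frontend?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def application_status(url, timeout=5.0):
    """Application-layer signal: what status does the service itself return?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code            # a 4xx/5xx still means the server answered
    except (urllib.error.URLError, OSError):
        return None                # no HTTP answer at all

net_ok = network_reachable(HOST)
status = application_status(URL)

if not net_ok:
    print("Symptoms point to the network/connectivity fault domain")
elif status is None or status >= 500:
    print("Frontend reachable, but the application/backend is failing")
else:
    print(f"Service responding normally (HTTP {status})")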


All regions affected; the issue manifested as HTTP 500 and server timeout errors. See more in the ThousandEyes platform (no login required).
On December 11, Meta services, including Facebook and Instagram, experienced an outage. ThousandEyes detected internal server errors and timeouts, which may indicate issues with Meta's backend services. During the incident, network connectivity to Meta’s frontend web servers appeared to remain unaffected.
Identifying commonalities during an outage can be helpful in determining the cause. For instance, when an outage impacts multiple services from the same company across various regions, it can help narrow down the potential issue. In such cases, it’s more likely to be linked to a backend service problem.


Partial page load and request for further information is met with HTTP 403 response. See more in the ThousandEyes platform (no login required).
On December 11, OpenAI experienced an outage that prevented users from using ChatGPT and Sora services. The company reported that the issue was the result of a “new telemetry service deployment” that “unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems.” ThousandEyes observations during the outage also pointed to backend application issues; while there were site content loading problems, no network issues affecting the connection to the ChatGPT frontend web servers were detected.
Understanding all the dependencies in your service delivery chain improves monitoring, helping you better understand the impact and nature of issues that occur when deploying new services, features, or updates. This knowledge allows you to take the most appropriate and efficient recovery steps, such as reverting changes or applying workarounds, to minimize the impact on users.


The impact of the Microsoft outage shown using ThousandEyes Internet Insights.
Microsoft users in multiple regions encountered difficulties with some of the company’s services, including Outlook Online, due to a prolonged outage. Initially, the incident manifested as intermittent timeout and application errors, but the outage’s scope increased about five hours after its start, and again about 10.5 hours in. During the outage, ThousandEyes observed various conditions, including server errors, timeouts, and packet loss.
Intermittent issues often show as slow performance, making them difficult to identify. ITOps teams need a clear performance baseline to spot deviations that indicate outages. When disruptions happen, quickly identifying the source is crucial, starting by ruling out non-issues. By leveraging this information with other data, teams can better understand the outage's cause and communicate effectively with users.
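As one way to make that baseline concrete, the sketch below (illustrative window size and threshold, assuming you already collect periodic response-time samples) flags measurements that deviate sharply from a rolling median using the median absolute deviation.

import statistics
from collections import deque

class BaselineDetector:
    """Flag response-time samples that deviate sharply from a rolling baseline."""

    def __init__(self, window=200, threshold=5.0):
        self.samples = deque(maxlen=window)   # recent measurements
        self.threshold = threshold            # how many MADs counts as anomalous

    def observe(self, response_ms):
        """Return True if the new sample looks anomalous versus the baseline."""
        anomalous = False
        if len(self.samples) >= 30:           # need enough history to judge
            median = statistics.median(self.samples)
            mad = statistics.median(abs(s - median) for s in self.samples) or 1.0
            anomalous = abs(response_ms - median) / mad > self.threshold
        self.samples.append(response_ms)
        return anomalous

# Example: a sudden jump from ~120 ms to ~900 ms would be flagged.
detector = BaselineDetector()
for ms in [118, 121, 125, 119, 130] * 10 + [900]:
    if detector.observe(ms):
        print(f"Deviation from baseline: {ms} ms")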


Global users seeing timeouts when attempting to connect to a Workday instance hosted in Portland. See more in the ThousandEyes platform (no login required).
On October 1, ThousandEyes observed an outage impacting users connecting to Workday, specifically customers whose instances are hosted in Workday’s WD5 data center in Portland, Oregon. Impacted users intermittently experienced server timeouts, which suggests application-side issues.
Understanding your entire service delivery, including all dependencies, can help identify single aggregation points that, under fault conditions, could impact an entire service. This allows you to either remove them by building in redundancy or prepare mitigation plans in the event that they do encounter issues.


Salesforce users receiving HTTP server error messages when attempting to connect to the Salesforce service. See more in the ThousandEyes platform (no login required).
On October 1, Salesforce instances hosted in AWS experienced an outage in which customers encountered timeouts, slow performance, and inaccessibility of the service, indicating application-related issues.
When an outage occurs, it is crucial to quickly identify the fault domain. This not only helps determine who is responsible for resolution but also guides the steps that can be taken to address the issue.


The Cloudflare incident impacted various companies and regions around the globe. See more in the ThousandEyes platform (no login required).
On September 16, Cloudflare experienced an approximately two-hour performance incident that led to reachability issues for some applications leveraging Cloudflare’s CDN and networking services. Impacted applications included Zoom and HubSpot.
It's crucial to have a deep understanding of your entire service delivery process, including the applications you depend on and the suppliers they are connected to. This will enable you to respond effectively and offer solutions to minimize or alleviate any issues that may arise.


From ThousandEyes observations, the Microsoft incident only impacted a subset of users connecting to Microsoft’s network via AT&T. See more in the ThousandEyes platform (no login required).
On September 12, some users experienced issues reaching Microsoft services like Microsoft 365. During the approximately 1.5-hour incident, ThousandEyes observed significant network packet loss, as well as connection timeouts, within Microsoft’s network. From what ThousandEyes saw, the problems only impacted a subset of users connecting to Microsoft’s network via AT&T.
When an issue occurs, it's important to quickly determine who is affected—whether it's all users or a certain group. It's also helpful to identify if the issue is impacting any specific services. This will not only help determine who is accountable for resolution, but also what mitigation steps can be taken, such as moving to alternate systems or altering network entry points.


Outage in Google’s U.K. network detected by Cisco ThousandEyes. See more in the ThousandEyes platform (no login required).
On August 12, high levels of traffic loss within Google’s network in the United Kingdom impacted connectivity for some users trying to reach various Google services, including Gmail and Google Meet. Google acknowledged that the issue was due to the failure of both the primary and backup power feeds caused by a substation switchgear failure impacting their europe-west2 region. Users with workloads or services within that region appeared most affected; however, ThousandEyes also observed some impact on global connectivity.
Distributing resources across various zones and regions helps lower the risk of an entire infrastructure outage affecting all resources at the same time. As teams strive for resilient infrastructure, it’s also wise to maintain independent visibility throughout the entire digital delivery chain so any problems can be quickly resolved before they have impact.


Elevated loss was observed in Microsoft’s network when attempting to access LinkedIn. See more in the ThousandEyes platform (no login required).
On August 5, Microsoft experienced an incident that affected the availability of LinkedIn for some users around the globe in a disruption that lasted a little over an hour. The outage manifested as elevated packet loss in Microsoft’s network, as well as DNS resolution timeouts and HTTP errors. ThousandEyes observed some residual network latency issues after the reported resolution; however, they did not appear to prevent users from interacting with LinkedIn services.
During the outage, the connectivity and timeout issues were intermittent and impacted users unevenly. Intermittent issues can be particularly time-consuming to identify and address, especially when they involve various kinds of symptoms. It's important to analyze these symptoms collectively rather than in isolation to avoid a misdiagnosis.


Network disruptions impacting reachability and performance of Microsoft Azure. See more in the ThousandEyes platform (no login required).
Microsoft experienced global network disruptions that impacted the reachability and performance of some Microsoft services, including the Azure portal. During the incident, ThousandEyes detected network degradation in parts of the Microsoft network. After about two hours, the incident appeared to be mostly resolved. Microsoft attributed the initial trigger event to a distributed denial-of-service (DDoS) attack, but said it appears that “an error in the implementation of [Microsoft’s] defenses amplified the impact of the attack rather than mitigating it.”
Sometimes the cause of a disruption is multifaceted, with multiple factors—including your own mitigation efforts. When a disruption occurs, it's crucial to make sure any actions taken to remediate the issue are working as expected and not inadvertently making the problem worse.


Web service responding with an HTTP 500 Internal Server error when trying to retrieve content from backend resources running on Windows hosts.
On July 19, a software update issue with CrowdStrike's security software caused widespread outages for various organizations, including airlines, banks, and hospitals. CrowdStrike stated that the problem originated from a single configuration file, which led to a logic error, resulting in system crashes and blue screens of death (BSOD) on affected Windows systems. Due to the extensive range of impacted services and the various outage conditions affecting endpoints, servers, and applications, network issues were considered a potential cause. While Internet connectivity problems are often the common thread behind simultaneous outages across multiple apps and services, in this case, that was not a factor.
While digital services rely heavily on networks, web-based apps, and cloud services, the recent CrowdStrike incident serves as a reminder that other influences are also at play. It's crucial to efficiently pinpoint the source of a disruption, and a key part of this process is identifying what isn't causing the problem. Ensuring a smooth digital experience involves more than just the network—customers need to consider everything from the device to the app.


ThousandEyes observed global users connected to Starlink experiencing the impacts of the network outage.
On May 29, ThousandEyes observed network outage conditions impacting Starlink, with users connecting from the U.S., Europe, and Australia unable to access the Internet through the service for approximately 45 minutes. Many users would have experienced the outage as DNS timeouts when they tried to reach sites and services, since the network outage would have prevented the reachability of any Internet service, including DNS resolvers.
Redundancy matters—it’s worth considering having a backup network provider to minimize disruptions if your primary ISP experiences temporary issues. This can be particularly important for hybrid and remote workers and may warrant consideration of backup network services, such as 5G MiFi or a mobile hotspot.
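If you do keep a backup path, it helps to verify it regularly. The following sketch uses hypothetical local source addresses for the primary and backup interfaces and runs the same reachability probe over each path, so you can confirm the fallback works before you actually need it.

import socket

# Hypothetical local addresses of the primary ISP link and the backup
# (e.g., 5G MiFi or mobile hotspot) interface on this machine.
PATHS = {
    "primary": "192.0.2.10",
    "backup": "198.51.100.20",
}
TARGET = ("www.example.com", 443)   # any well-known endpoint works

def reachable_via(source_ip, target=TARGET, timeout=3.0):
    """Attempt a TCP handshake to the target, forcing a specific source address."""
    try:
        with socket.create_connection(target, timeout=timeout,
                                      source_address=(source_ip, 0)):
            return True
    except OSError:
        return False

for name, src in PATHS.items():
    status = "OK" if reachable_via(src) else "FAILED"
    print(f"{name} path via {src}: {status}")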


DNS disruptions impacting Salesforce reachability as visualized by ThousandEyes Internet Insights.
On May 16, Salesforce experienced a disruption that intermittently impacted some customers’ ability to reach the service for more than four hours, caused by intermittent failures from a third-party DNS service provider.
ITOps teams may encounter challenges when trying to diagnose intermittent issues. It is crucial to thoroughly analyze all the data within your complete digital supply chain, including third-party dependencies, and to establish ongoing monitoring for convenient comparison of conditions.
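A minimal sketch of that kind of ongoing check is shown below. It assumes the dnspython package is available and uses illustrative resolver addresses and a hypothetical record name; querying the same record through several resolvers makes a failing third-party DNS dependency stand out quickly.

import dns.resolver  # assumes the dnspython package (pip install dnspython)

RECORD = "login.example.com"        # hypothetical record your service depends on
RESOLVERS = {
    "provider-a": "8.8.8.8",
    "provider-b": "1.1.1.1",
}

def query(nameserver, name=RECORD):
    """Resolve `name` through a single, explicit resolver and return the answers."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    resolver.lifetime = 3.0          # overall timeout per query
    return [r.to_text() for r in resolver.resolve(name, "A")]

for label, ns in RESOLVERS.items():
    try:
        print(f"{label} ({ns}): {query(ns)}")
    except Exception as exc:         # timeouts, SERVFAIL, NXDOMAIN, etc.
        print(f"{label} ({ns}): FAILED ({exc.__class__.__name__})")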


The disruption impacted multiple Meta services, including Instagram, Facebook, and WhatsApp. See more in the ThousandEyes platform (no login required).
On May 14, Meta services, including Facebook, Instagram, and others, experienced a 3.5-hour disruption that impacted some global users attempting to access the applications. The cause appeared to be a backend issue, with network paths clear and web servers responding. The severity of the problems was sporadic, with the number of affected servers fluctuating. The disruption was mostly resolved for users by 2:25 AM (UTC).
When diagnosing a service disruption, it’s helpful to assess three things: 1) whether you’re dealing with a full-fledged outage or just a degradation (intermittent issues are typically a good clue that it’s a degradation), 2) the radius of impact, and 3) how the disruption is impacting the application’s functionality (is it just slow to respond or completely unusable?).
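Those three questions can often be answered mechanically from probe data. The sketch below works from a hypothetical set of per-location probe results and reports whether the pattern looks like a full outage or a degradation, along with a rough blast radius.

from dataclasses import dataclass

@dataclass
class ProbeResult:
    location: str
    succeeded: bool       # did the transaction complete?
    latency_ms: float     # how long it took (or timed out at)

# Hypothetical probe results gathered from monitoring agents.
results = [
    ProbeResult("us-east", False, 30000.0),
    ProbeResult("us-west", True, 850.0),
    ProbeResult("eu-west", True, 140.0),
    ProbeResult("ap-south", False, 30000.0),
]

BASELINE_MS = 200.0
failed = [r for r in results if not r.succeeded]
slow = [r for r in results if r.succeeded and r.latency_ms > 3 * BASELINE_MS]

if len(failed) == len(results):
    verdict = "full outage"
elif failed or slow:
    verdict = "degradation (intermittent/partial impact)"
else:
    verdict = "healthy"

print(f"Verdict: {verdict}")
print(f"Blast radius: {len(failed) + len(slow)}/{len(results)} locations affected")
print("Impacted locations:", [r.location for r in failed + slow])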


Multiple regions experienced HTTP errors when trying to access google.com. See more in the ThousandEyes platform (no login required).
ThousandEyes observed a global disruption of google.com on May 1 that lasted nearly an hour. During the incident, many users attempting to perform a search on google.com appeared to receive HTTP 502 Bad Gateway errors, indicating an issue reaching a backend service. The problem seemed to lie with the connectivity or linkage between google.com and the search engine, rather than with the search engine itself, as other services utilizing the search engine's functions appeared unaffected.
When issues arise, quickly determine the domain from which the issue stems (backend, front-end service, local environment, network connectivity, CDN, etc.) and assess the extent of the disruption. Identify which functions or services are impacted. This will allow you to make informed decisions about what actions to take to minimize the impact on your users—preferably before they are impacted at all.


Locations around the globe appeared impacted by the X outage. See more in the ThousandEyes platform (no login required).
On April 29, X (formerly Twitter) experienced an outage that lasted about one hour and appeared to prevent some users from interacting with the social media platform. During the outage, requests to the application seemed to be timing out, suggesting an issue with the application backend. The network also didn’t appear to be experiencing problems, further indicating that the issue lay with backend systems.
In the event of an outage, it is crucial to consider all data points and interpret the signals within their context to identify whether the issue is related to the network or the application backend. It is important to keep in mind that a single signal, like a timeout error, should be seen as a symptom and not the diagnosis. To accurately isolate the root cause, you will need to examine multiple signals from across the network and application.


ThousandEyes observed outage conditions in Microsoft’s UK network. See more in the ThousandEyes platform (no login required).
Around 8:50 AM (UTC), ThousandEyes observed degraded availability for customers using Azure services that depend on Azure Front Door. These customers would have likely experienced intermittent degraded performance, latency, and/or timeouts when attempting to access services hosted in the United Kingdom South region, which appeared to be centered on nodes located in London, England.
Once you identify where an outage is occurring, it may be possible to route around it or to an alternate location to avoid disruption. Determine which location and which workloads your services are leveraging so that you can easily identify an alternate route if needed.


Network paths to WhatsApp services appeared to be clear during the disruption.
On April 3, Meta’s WhatsApp experienced global service disruptions that impacted users’ ability to send or receive messages successfully. ThousandEyes data indicates that the service disruptions were not due to network issues connecting to Meta’s frontend web servers. Network paths to WhatsApp services appeared clear during the incident, suggesting that the issue was on the application's backend.
Efficiently identifying which parts of your service delivery chain are functioning correctly is an important step in troubleshooting and figuring out the root cause of an issue.


The Atlassian Confluence disruption impacted users across the globe. See more in the ThousandEyes platform (no login required).
At approximately 1:20 PM (UTC) on March 26, ThousandEyes observed a global disruption of team workspace application Atlassian Confluence. The application began responding to access requests with 502 Bad Gateway server errors, suggesting an application backend issue. Network paths to the application’s frontend web servers, which are hosted in AWS, were clear throughout the incident. By 2:35 PM (UTC), the service had completely recovered.
Error messages can give you some clues about what went wrong, but without context, these codes are like breadcrumbs, leading you in a direction but not giving the full picture. Isolating the true root cause requires looking at a variety of factors in context, including any third-party providers that your application or network infrastructure rely on.


Global users were unable to access the LinkedIn web application starting at 8:45 PM (PST) on March 6. See more in the ThousandEyes platform (no login required).
On March 6, ThousandEyes detected an almost two-hour service disruption for global users of LinkedIn that manifested as service unavailable error messages, suggesting a backend application issue.
Applications can be complex with many dependencies, creating a network of interdependent services that can encounter issues and fail at any time. When issues arise, it's important to identify the responsible party and determine if it's within your control. Additionally, having a backup plan is essential to ensure business continuity in case of any unexpected failures.


Comcast network outage impacting the reachability of various services. See more in the ThousandEyes platform (no login required).
On March 5, ThousandEyes observed outage conditions in parts of Comcast’s network, which impacted the reachability of many applications and services, including Webex, Salesforce, and AWS. The nearly two-hour outage appears to have affected traffic as it traversed Comcast’s network backbone in Texas, including traffic coming from states such as California and Colorado.
When services you rely on experience issues, a good first question to investigate is whether the problem lies with you, that application, or a third-party provider. Then you can implement appropriate backup plans and remediation efforts.


Users around the globe experience login failures when attempting to access the Facebook application.
On March 5, starting at approximately 15:00 UTC, Meta services, including Facebook and Instagram, experienced an approximately two-hour disruption that prevented users from accessing those apps. ThousandEyes observed that Meta’s web servers remained reachable; however, users attempting to log in received error messages, suggesting problems with a backend service, such as authentication, might have caused the disruption.
Even the most robust system can be vulnerable to disruptions and errors. Complete visibility of the service and its associated service delivery chain is crucial for identifying any decrease in performance or functionality. This enables quick issue resolution and helps establish processes for reducing the impact of current and future issues.


ThousandEyes observed a more than seven-hour disruption that impacted Microsoft Teams. See more in the ThousandEyes platform (no login required).
During a more than seven-hour disruption, Microsoft Teams users around the globe experienced service failures that impacted their ability to use the collaboration tool. ThousandEyes didn’t observe any packet loss connecting to the Microsoft Teams edge servers; however, ThousandEyes did observe application layer failures that are consistent with reported issues within Microsoft’s network that may have prevented the service’s edge servers from reaching application components on the backend.
During the disruption, Microsoft Teams appeared partially available, but some users experienced issues with core functions like messaging and calling. As a result, some may have initially assumed the problem was on their end, not Microsoft’s. Taking a look at the full end-to-end service delivery chain matters for accurate troubleshooting.


100% packet loss observed during the Optus outage.
On November 8, Optus, a major network provider in Australia, experienced an outage that affected both its mobile and broadband network services across the country, impacting over 10 million people and thousands of businesses. ThousandEyes observed 100% packet loss in Optus’ network, indicating that users were unable to access Internet-connected services. The disruption began at 4:00 AM AEDT and lasted until approximately 12:00 PM to 1:00 PM AEDT, after which connectivity slowly began to return. Service levels eventually normalized for most users by 2:00 PM AEDT.
Even major providers experience outages sometimes, and when they do, the impacts can be far reaching. Even if your business isn’t directly affected, another service you rely on might be. It’s essential to be able to quickly identify the source of any issues you’re encountering so you can take the right steps to minimize the impact on your own customers.


OneLogin disruption experienced by multiple locations.
See more in the ThousandEyes platform (no login required).
OneLogin, an identity and access management solution provider, experienced an intermittent service disruption that lasted approximately two hours, with some users receiving HTTP 5XX responses. While packet loss was observed in the path towards the end of the incident, it was likely a secondary symptom or a result of the recovery. The problem appeared to be application-related, given the HTTP 5XX errors that were received.
Sometimes issues like packet loss can be a secondary symptom or result of outage recovery, rather than a direct result of the incident. Beyond the network layer, also measure and assess upper-layer application performance to properly narrow down the root cause.
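One way to assess both layers at once is sketched below, using a hypothetical hostname: it separately times DNS resolution, the TCP/TLS handshake, and the HTTP response, so an application-side error isn't mistaken for a path problem, and packet loss isn't automatically blamed for a backend failure.

import http.client
import socket
import ssl
import time

HOST = "status.example.com"   # hypothetical service under investigation

def timed(fn):
    """Run fn() and return (result, elapsed milliseconds)."""
    start = time.monotonic()
    value = fn()
    return value, (time.monotonic() - start) * 1000

# 1. DNS resolution time
addrinfo, dns_ms = timed(lambda: socket.getaddrinfo(HOST, 443, proto=socket.IPPROTO_TCP))
ip = addrinfo[0][4][0]

# 2. TCP + TLS handshake time
ctx = ssl.create_default_context()
def handshake():
    raw = socket.create_connection((ip, 443), timeout=5)
    return ctx.wrap_socket(raw, server_hostname=HOST)
tls_sock, connect_ms = timed(handshake)
tls_sock.close()

# 3. Application (HTTP) response time and status
def http_get():
    conn = http.client.HTTPSConnection(HOST, timeout=5, context=ctx)
    conn.request("GET", "/")
    resp = conn.getresponse()
    resp.read()
    conn.close()
    return resp.status
status, http_ms = timed(http_get)

print(f"DNS {dns_ms:.0f} ms | TCP+TLS {connect_ms:.0f} ms | HTTP {status} in {http_ms:.0f} ms")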


Global availability issues for Square services.
On September 8, contactless payments terminal and service provider Square experienced system connectivity-related issues that led to businesses being unable to process transactions. The disruption hit multiple Square services, and took 18.5 hours to clear completely. ThousandEyes observed intermittent dropouts and 503 “service unavailable” errors, indicative of an internal routing issue or similar backend system problem. Square confirmed that backend issues were the cause: The outage impacted their DNS and was caused by changes to their internal network software.
Square is taking a number of steps to guard against future issues, including expanding offline payment capabilities. Backup systems like this are critical for minimizing the impact of outages.


During the incident, ThousandEyes observed higher-than-normal page load times for global users trying to reach Slack, with this increase in page load time coinciding with incomplete page load.
On August 2, the communication platform Slack suffered a two-hour outage resulting in several issues, such as users being unable to share screenshots or upload files, and images appearing blurred or grayed out. Slack reported that the root cause was a “routine database cluster migration” that accidentally reduced database capacity to the point at which it could not support a regularly scheduled job that was running.
Run in isolation, neither the cluster migration nor the scheduled job would have caused a problem; it was only when they ran in parallel that one manifested. While it’s unclear what communication the Slack team had in place, the incident is a reminder that good coordination between teams can help guard against issues when working on complex, distributed web-based applications or services.


Global locations failing to access an application hosted within AWS.
See more in the ThousandEyes platform (no login required).
On June 13, Amazon Web Services (AWS) experienced an incident that impacted a number of services in the US-EAST-1 region. During the incident, which lasted more than 2 hours, ThousandEyes observed an increase in response times, server timeouts, and HTTP 5XX errors affecting the availability of applications hosted within AWS.
Many services offered by cloud providers have fundamental architectural dependencies on one another. Companies using cloud services should understand the relationships in their digital ecosystem, regardless of whether those relationships are services or networks.


The Concur outage impacted users in various geographies across the globe.
See more in the ThousandEyes platform (no login required).
Around 3:20 PM (UTC), Concur faced an outage that lasted for approximately three hours. During this time, some users around the world experienced problems accessing the app. Users in certain locations experienced HTTP timeouts, and others experienced spiking page load times. Recovery began around 6:25 PM (UTC) and was complete around 6:35 PM (UTC).
To provide your customers with the seamless digital experience you want to give them, it’s crucial to quickly identify the root cause of any disruption that arises. Then you can take appropriate measures to mitigate the outage’s impact—and guard against similar disruptions in the future.


Site hosted within the Virgin Media UK network (AS 5089) is unreachable to users during the incident.
See more in the ThousandEyes platform (no login required).
On April 4, Virgin Media UK (AS 5089) experienced two outages that impacted the reachability of its network and services to the global Internet for hours. The two outages shared similar characteristics, including the withdrawal of BGP routes to its network, traffic loss, and intermittent periods of service recovery.
When making a BGP change, it's important to understand the effect on the data plane, including the impact to network traffic paths, IP performance, and service reachability.
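A simple before-and-after data-plane check is sketched below; it assumes a Unix-like host with the traceroute utility installed and uses a hypothetical address inside the affected prefix. Capturing the forwarding path and reachability on both sides of a BGP change makes unintended data-plane effects easier to spot.

import socket
import subprocess
import time

TARGET = "203.0.113.1"   # hypothetical address inside the prefix being changed

def forwarding_path(target):
    """Capture the current forwarding path (requires the traceroute utility)."""
    out = subprocess.run(["traceroute", "-n", "-w", "2", target],
                         capture_output=True, text=True, timeout=120)
    return out.stdout

def reachable(target, port=443):
    try:
        with socket.create_connection((target, port), timeout=3):
            return True
    except OSError:
        return False

# Snapshot the data plane before the BGP change...
before_path = forwarding_path(TARGET)
before_ok = reachable(TARGET)

input("Apply the BGP change, then press Enter to re-check...")
time.sleep(30)   # allow some time for convergence

after_path = forwarding_path(TARGET)
after_ok = reachable(TARGET)

print("Reachability:", "OK -> OK" if before_ok and after_ok else f"{before_ok} -> {after_ok}")
if before_path != after_path:
    print("Forwarding path changed; review hop-by-hop output:")
    print(after_path)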


Reddit users experienced content loading issues.
See more in the ThousandEyes platform (no login required).
On March 14, starting at approximately 19:05 UTC, ThousandEyes observed an outage impacting global users of Reddit, a popular online platform that’s home to thousands of communities. Content loading issues and HTTP 503 service unavailable errors indicated a backend or internal app problem. The outage lasted a total of five hours, according to Reddit’s official status page.
Just because a site is reachable doesn’t mean it’s working as desired. Robust visibility is needed to be able to identify any backend problems that might be responsible for challenges users are encountering.


Multiple geographies and applications were impacted by the Akamai outage, including LastPass, SAP Concur, and Salesforce.
See more in the ThousandEyes platform (no login required).
At approximately 20:20 UTC, a number of applications served by Akamai Edge experienced sporadic availability issues for approximately 1.5 hours. ThousandEyes observed occasional packet loss and timeout conditions that might have prevented some users from accessing or interacting with apps that rely on Akamai's CDN service. Some of the affected applications include Salesforce, Azure console, SAP Concur, LastPass, and others. The issue seemed to have mostly been resolved by around 21:55 UTC, but some applications and users continued encountering intermittent issues for a longer period of time.
It’s important to deeply understand your end-to-end service delivery chain, not only the apps you rely on but also the providers they use.


Twitter users globally experienced access and service disruption.
See more in the ThousandEyes platform (no login required).
At approximately 16:45 UTC on March 6, Twitter experienced a one-hour service disruption that prevented many Twitter users around the world from accessing its app or opening links. While the app was reachable from a network standpoint, users received HTTP 403 forbidden errors, indicating a backend application issue.
It is important to continuously monitor the performance of the applications that you rely on. This will help you identify small and large performance degradations as they happen, which could, cumulatively, have a similar impact on your digital experience as a complete outage. For instance, on March 6, Twitter experienced a service-wide disruption, which was their only such event over the preceding six months. However, during that same period, instances of partial service degradation increased.


The Microsoft Outlook outage, as seen on the ThousandEyes Internet Outages Map.
View more in the ThousandEyes platform (no login required).
Microsoft Outlook users across the globe had trouble accessing the email service due to an outage that lasted about 1.5 hours, with intermittent problems persisting for several more. ThousandEyes data identified elevated server response timeouts and slow page loading, which are indicative of an application-related issue.
Even large, established providers experience outages sometimes. With proper visibility, your team can quickly discern the source of a problem (is it on your end or theirs?) and take the right steps to respond.


During the Okta incident, some users received HTTP 403 forbidden errors.
See more in the ThousandEyes platform (no login required).
Beginning at approximately 6:10 PM (UTC), ThousandEyes observed a disruption to Okta availability for some global users. Throughout the incident, HTTP 403 forbidden errors were seen, indicating a backend service issue rather than Internet or network problems connecting to Okta. The incident appeared to resolve for most users approximately 30 minutes later, around 6:40 PM (UTC).
As a single sign-on (SSO) service, for many organizations, Okta serves as the “front door” to various other apps they rely on, so when Okta’s not available, teams may have difficulty accessing other apps as well. In the complex digital ecosystem that companies rely on today, it’s important to understand all your dependencies and to have backup plans in place should the need arise.


Spike in Microsoft service outages, as seen through ThousandEyes Internet Insights.
Microsoft experienced a significant global disruption that impacted connectivity to many of its services, including Azure, Teams, Outlook, and SharePoint. The outage was triggered by an external BGP change made by Microsoft that impacted connected service providers. The bulk of the incident lasted about 1.5 hours, but residual connectivity issues were observed into the following day.
Change always carries risk, but with diligent process, change reviews, prepared rollback plans, and quality assurance testing before and after any change, you can greatly reduce your chances of disruption.


Path visualization showing packet loss for some AWS customer traffic as it transited between us-east-2 and a number of global locations.
See more in the ThousandEyes platform (no login required).
ThousandEyes observed significant packet loss between a number of global locations and AWS' us-east-2 region for more than an hour. The event affected end users connecting through their ISPs to cloud-hosted services in that region.
With public cloud, it’s important to monitor not just the applications themselves but also the cloud infrastructure components, including individual cloud regions and cloud availability zones and any dependent cloud software services.
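As an illustration of watching the infrastructure layer alongside the application, the sketch below probes per-region endpoints in parallel; the hostnames are hypothetical placeholders for whatever regional endpoints your architecture actually depends on.

import socket
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regional endpoints your workloads depend on.
REGION_ENDPOINTS = {
    "us-east-2": "service.us-east-2.example.com",
    "us-west-2": "service.us-west-2.example.com",
    "eu-west-1": "service.eu-west-1.example.com",
}

def probe(region_host):
    """TCP-connect to one regional endpoint and report latency or failure."""
    region, host = region_host
    start = time.monotonic()
    try:
        with socket.create_connection((host, 443), timeout=3):
            return region, f"OK ({(time.monotonic() - start) * 1000:.0f} ms)"
    except OSError as exc:
        return region, f"FAILED ({exc})"

with ThreadPoolExecutor(max_workers=len(REGION_ENDPOINTS)) as pool:
    for region, status in pool.map(probe, REGION_ENDPOINTS.items()):
        print(f"{region}: {status}")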


WhatsApp experienced intermittent packet loss during the incident.
A two-hour global WhatsApp outage left users unable to send or receive messages. The incident was related to backend application service failures rather than a network failure, and it occurred during peak hours in India, where the app is hugely popular.
An immediate feedback loop should be available to people and teams making changes to production systems so mistakes can be identified and rectified. Having data that can help rule out the network as the culprit when a production system error occurs can speed up the resolution of technical issues.


Sudden spike to 100% packet loss appears for traffic destined to a Zscaler proxy.
Customers using Zscaler Internet Access (ZIA) experienced connectivity failures or high latency in reaching Zscaler proxies. Because Secure Service Edge (SSE) implementations typically proxy web traffic and critical business SaaS tools, apps like Salesforce, ServiceNow, and Microsoft 365 could have been made unreachable for some customers by this incident. The most significant packet loss lasted approximately 30 minutes and connectivity was fully restored for all user locations in about 3.5 hours.
SSE is another piece of the Internet puzzle to consider when things go awry. Having network-agnostic data for complex scenarios like this can enable quicker attribution and remediation.


The Zoom outage as seen in the ThousandEyes platform.
A brief Zoom outage left users around the world unable to log in or join meetings. In some cases, users already in meetings were kicked out of them. The root cause appeared to be in Zoom’s backend systems, around their ability to resolve, route, or redistribute traffic.
Sometimes the app itself is causing issues rather than the network. Having visibility into which it is can prevent confusion and finger-pointing during root cause analysis.


Outage renders Google domain properties unreachable in several countries.
See more in the ThousandEyes platform (no login required).
Google Search and Google Maps became unavailable to users worldwide for approximately 60 minutes, with those attempting to reach the services receiving error messages. Users from the United States to Australia, Japan to South Africa could not load sites or execute functions. Applications dependent on Google’s software functions also stopped working during this rare outage.
It’s important to monitor not just your application frontends but also the performance-critical dependencies that power your app.


Affected interfaces in the AWS network (AS 16509).
An AWS Availability Zone power failure caused an AWS outage that impacted applications such as Webex by Cisco, Okta, and Splunk. Not all users or services were affected equally, however. Some services that had physical redundancy in place remained operational, while other services and apps took up to three hours to recover.
For cloud-delivered applications and services, some level of redundancy, either geographic or physical, should be factored in. Additionally, contingency or failover plans should be understood, documented, and rehearsed so that you’re ready to execute should the need arise.


100% packet loss observed for locations connecting to a Rogers customer.
An issue with Rogers’ internal routing affected millions of users and many critical services across Canada. Rogers withdrew its prefixes due to the internal routing issue, which meant the Tier 1 provider was unreachable across the Internet for nearly 24 hours.
No provider is immune to outages, no matter how large. Understand all of your dependencies—even indirect ones—and build in redundancy for every critical service dependency identified. Consider a backup network provider so you can reduce the impact of any one ISP experiencing a disruption in service.


Atlassian's Jira, Confluence, and Opsgenie are three products that many developer teams rely on. Due to a maintenance script error, these services experienced a days-long outage that impacted roughly 400 of Atlassian's customers.
Companies shouldn’t rely on status pages alone to understand application or service outages. If status pages go down or don’t share enough detail, it can be difficult to understand if or how your organization is being impacted.


Start of BGP advertisement of 104.244.42.0/24 by RTComm (AS 8342).
Twitter was rendered unreachable for approximately 45 minutes after a Russian Internet and satellite communications provider blackholed traffic by announcing one of Twitter’s prefixes. BGP misconfigurations are not uncommon. However, they can be used to block traffic in a targeted way, and it’s not always easy to tell when the situation is accidental versus intentional.
While your company might have RPKI implemented to fend off BGP threats, it's possible that your telco won't. Something to consider when selecting ISPs.


British Airways experienced an online services outage that caused hundreds of flight cancellations and disruptions in the airline's operations. While the network paths to the airline’s online services (and servers) were reachable, the server and site responses were timing out. The issue was likely due to a central backend repository that multiple front-facing services rely on.
Architecting backends to avoid single points of failure can reduce the likelihood of a cascading chain of failures.

More Outage Insights
The Internet Report Podcast
Tune in for a podcast covering what’s working and what’s breaking on the Internet—and why.
EXPLORE THOUSANDEYES
Request a free trial or book a demo of the ThousandEyes Internet and digital experience visibility platform.
In-Depth Outage Analyses
Keep your finger on the pulse of Internet health and notable outage events with expertly crafted blogs from ThousandEyes’ Internet Intelligence team.