This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.
Internet Outages & Trends
In our analysis of outages, we often observe multiple parties experiencing problems simultaneously. Sometimes, a single shared dependency, such as a network interconnection point, a cloud service provider, or an open-source library, becomes the point of failure across multiple service delivery chains. In other cases, there may seem to be a common denominator, but the issues are not directly linked and simply coexist. These coincidental cases can lead to incorrect attribution, which in turn can result in ineffective mitigation and resolution attempts.
Last week, we encountered two such incidents that, despite appearing to share a common theme, were in fact separate events.
First, due to a problem with a U.S. region of Microsoft Azure, some companies experienced issues with Azure services dependent on Azure Storage in part of the United States.
Then, a content configuration update made by cybersecurity technology company CrowdStrike, as part of its regular operations, led to a widespread, global series of IT outages. These outages left airlines and airports, banks, transportation operators, restaurants, supermarkets, and more with fleets of Windows machines displaying the BSOD—Blue Screen of Death.
Because both the Azure and CrowdStrike incidents occurred around the same time and involved Microsoft products, some initially thought the disruptions were related; however, they turned out to be independent, unrelated events.
Read on to learn about these incidents:
CrowdStrike Sensor Update Incident
It was mid-afternoon on Friday, July 19 in Australia and New Zealand when organizations first began to experience problems. A range of industries and major brands simultaneously reported outages as their Windows machines got stuck in a boot loop that ultimately resulted in the BSOD. The impact quickly spread to other geographies, causing problems with airline booking systems, grocery stores, and hospital services. And these were just the tip of the iceberg.
Perhaps nothing was more illustrative of the impact than a timelapse video of flight movements over the U.S. after the Federal Aviation Administration (FAA) issued a ground stop at the request of several airlines.
Initial responsibility for the widespread outage was thought to lie with Microsoft, as the affected systems and devices were all Microsoft-based, and the company had experienced an outage in its Central US region hours earlier that day. The disruption affected systems running Windows Client or Windows Server, along with services that may have been running on those systems, such as Active Directory (AD).
However, as analyses by IT administrators started to filter through to Internet forums and social media, a different common denominator emerged: CrowdStrike, a managed detection and response (MDR) service used to protect Windows endpoints from attack.
CrowdStrike provided its first public statement on the incident, helping to clear up the initial confusion over where the issue lay. Following this initial statement, CrowdStrike published guidance on actions and workarounds for IT administrators, and an early technical post-incident report that attributed the incident to an issue with a single configuration file that “triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems.”
The CrowdStrike incident is notable for a few reasons:
- First, the scale and breadth of the outages that flowed from this single shared dependency carried potential impacts for digital economies worldwide. It was branded as the “largest IT outage in the world.” There may be some short-term memory at play in that label, but it was certainly one of the largest in recent memory: a 2021 incident involving a DDoS mitigation service also comes to mind, as does a very similar 2010 incident that led to boot loops and bricked PCs.
- Second, recovery was not a simple task, requiring IT staff to physically attend to machines to get them functional again. At one point, Microsoft reported that up to 15 reboots per machine might be needed.
The Need for Trusted Visibility
During a large-scale outage, the first question is rarely "Is there an outage?" but more often "What is causing this?" closely followed by "Who is responsible for fixing this?" The speed at which these questions can be answered depends on the depth, completeness, and accuracy of the available information. Operating without complete information for extended periods can lead to assumptions about what might be wrong, often based on a single symptom, such as "services can't be reached."
Given the wide range of services impacted and the broad set of outage conditions manifested across endpoints, servers, and applications (only some of which IT might have access to and knowledge of), it’s not surprising that the network may have been suspected. Internet connectivity issues are often the common factor linking outages that occur simultaneously across many different apps and services.
ThousandEyes observed the impact of the CrowdStrike incident in two ways:
- As traffic drops at the “front door” of apps and services hosted on Microsoft Windows servers. Because the servers were not functioning, they were not able to receive and respond to incoming traffic, leading to traffic loss as shown in Figure 2 below.
- As server timeouts and HTTP 500 server errors for apps and services hosted on content delivery networks (CDNs) or other infrastructure that were unable to reach backend servers running Microsoft Windows (see Figure 3). A simple probe that separates these two signatures is sketched after this list.
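To make the distinction concrete, here is a minimal sketch, using hypothetical URLs and a generic Python HTTP client rather than ThousandEyes tooling: connection timeouts or resets point to servers that are down at the "front door," while HTTP 5xx responses indicate an edge that is reachable but cannot get to its backend.

```python
import requests  # generic HTTP client; any probe tool could play this role

# Hypothetical endpoints, for illustration only
PROBE_URLS = [
    "https://app.example.com/health",
    "https://cdn-fronted.example.com/health",
]

def classify(url: str) -> str:
    """Return a rough fault-domain hint for a single HTTP probe."""
    try:
        resp = requests.get(url, timeout=5)
    except requests.exceptions.Timeout:
        return "timed out: no answer at the front door (server likely down)"
    except requests.exceptions.ConnectionError:
        return "connection refused/reset: server not accepting traffic"
    if 500 <= resp.status_code < 600:
        return f"HTTP {resp.status_code}: edge reachable, backend likely failing"
    return f"HTTP {resp.status_code}: responding normally"

for url in PROBE_URLS:
    print(url, "->", classify(url))
```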
Based on this information, IT teams could have narrowed down the fault domain and potential cause by combining observations to trace and identify the root of the issues. This process starts with the network and, by extension, the Internet: it represents the largest domain in the delivery chain and, as a result, the one where traffic spends the majority of its time. That makes it not only the best place to start troubleshooting, but also a powerful filter; confirming the network isn't the issue immediately rules out a significant number of potential causes.
In this case, the network showed no significant issues or degradation across multiple regions and hosting locations, indicating that the network was not the problem. Both of the scenarios observed above, however, are highly correlated with server issues, such as a server becoming unresponsive or application code on the server no longer functioning properly.
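As a rough illustration of that "rule out the network first" step, here is a sketch (an assumed approach with hypothetical hostnames, not ThousandEyes' methodology) that measures TCP connect latency to the same service across regions; clean, consistent connect times shift suspicion away from the network path and toward the server or application layer.

```python
import socket
import time

# Hypothetical hosts in different regions, for illustration only
TARGETS = [
    ("app-us-central.example.com", 443),
    ("app-us-east.example.com", 443),
    ("app-eu-west.example.com", 443),
]

for host, port in TARGETS:
    start = time.monotonic()
    try:
        # Time only the TCP handshake; app-layer problems won't show up here
        with socket.create_connection((host, port), timeout=3):
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{host}: TCP connect in {elapsed_ms:.1f} ms (path looks healthy)")
    except OSError as exc:
        print(f"{host}: connect failed ({exc}); investigate the network path")
```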
This incident highlights that there's much more that goes into a digital experience than some might realize. It only takes one component, or even just one function, to fail or degrade in order to bring the entire service delivery chain to a halt. When a disruption happens, it’s important to efficiently determine the source—and a critical step in this process is identifying what isn't causing the problem. Assuring digital experiences involves more than just the network—customers need to consider everything from the device to the app.
Adding to the confusion was the fact that Microsoft had experienced an issue in one of its U.S. Azure regions, which coincidentally also hosted a number of customer Windows servers.
Azure Outage
Mere hours before the impacts of the CrowdStrike incident were felt, Microsoft experienced an unrelated issue that affected access to various Azure services and customer accounts configured with single-region service in the Central US region. This outage ran from 9:56 PM (UTC) on July 18 to 12:15 PM (UTC) on July 19, and its close timing with the CrowdStrike incident may have caused some confusion, leading the larger global IT outage to be mistakenly attributed to Microsoft. Although Microsoft systems were affected during the CrowdStrike incident, that event was completely unrelated to the Azure incident.
According to a status update, some customers experienced issues with multiple Azure services, including “failures of service management operations and connectivity or availability of services.” The network path toward the Central US region appeared unaffected; however, connectivity into the region itself appeared impaired, with forwarding loss observed at the ingress points to the affected region. The issue mainly affected customers with resources operating in the Central US region: users configured for single-region availability there experienced availability issues, while users with multiple read regions and a single write region in Central US likely experienced performance degradation. Among those impacted were Confluent, Elastic Cloud, and Microsoft 365.
Microsoft’s status update also identified the underlying cause as a configuration change that impacted connectivity to backend services, specifically storage clusters and compute resources. This in turn triggered automated mitigation that repeatedly restarted the affected services.
With the loss of access to the storage clusters affecting Azure SQL database access for some users, Microsoft initiated mitigation workflows that involved geo-failover of customer databases to assist with recovery.
In situations like this where digital experiences are impacted, it’s crucial to understand what is related and what isn’t in order to avoid wasting time and resources. In this scenario, understanding which services or functions are impacted, where they are served from, and which areas are affected provides a solid foundation for making informed decisions about mitigation or future optimization.
Workday Outage
On July 6, ThousandEyes observed an uptick in HTTP failures, along with increased redirect and page load times, on Workday for about three hours. From ThousandEyes’ observations, there appeared to be an issue with initiating the authentication process, rather than with the authentication attempt itself. ThousandEyes’ analysis of page loads and an end-to-end transaction showed that the system recognized there was an issue and used a temporary redirect (denoted by a 302 HTTP status code) to point users to a static maintenance page, confirming a backend service issue within Workday.
Not only did the redirect time increase, but there was also a decrease in SSL time, which is often indicative of an issue with authentication processes. Throughput likely dropped as well, and the resulting apparent improvement in response time indicated that the full page wasn't being loaded. The point is that while each of these signals could be interpreted in different ways in isolation, combining them gives a much more precise picture of what happened.
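As an illustration, here is a minimal sketch, assuming a generic login URL rather than Workday's actual endpoints, of how the redirect signal can be checked directly: request the page without following redirects and inspect the status code, the Location header, and the elapsed time.

```python
import requests

LOGIN_URL = "https://login.example.com/auth"  # hypothetical endpoint

# Don't follow redirects, so the 302 itself stays visible
resp = requests.get(LOGIN_URL, allow_redirects=False, timeout=10)
location = resp.headers.get("Location", "")

print("Status:", resp.status_code)                       # 302 = temporary redirect
print("Location:", location)                             # e.g., a static maintenance page
print("Elapsed:", round(resp.elapsed.total_seconds(), 3), "s")

if resp.status_code == 302 and "maintenance" in location.lower():
    print("Users are being redirected to a maintenance page: backend issue likely")
```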
Grammarly Incidents
Grammarly experienced several incidents across July 13, 15, and 18. The latter two were described as “service interruptions” that manifested as login and functional issues, such as “not being able to see suggestions or corrections.”
During the disruptions, ThousandEyes observed an increase in page load times, while the underlying network showed no significant loss or increase in latency across paths from various geographies. This helped validate that network integrity and the connection to the application were likely not adversely affecting the digital experience. The elevated page load time, combined with an increase in response time, was a good indication that some backend services were not responding. This narrows the fault domain, enabling an informed decision about what process, workflow, or actions to take next.
The July 13 incident, meanwhile, was directly attributed to an issue with Azure. Azure reported that the Azure OpenAI (AOAI) service has an automation system that is implemented regionally but uses a global configuration to manage the lifecycle of certain backend resources. A change was made to update this configuration to delete unused resources in an AOAI internal subscription. The subscription had a quota on the number of storage accounts, and the unused accounts were intended to be cleaned up to relieve pressure on that quota.
However, the resource group also contained other resources, such as backend Managed Instance Resource endpoints used for the deployment, management, and use of OpenAI models. These critical backend resources, and the resource group itself, were not correctly tagged to exclude them from the automated cleanup operations. When the automation kicked off, the critical backend resources were unintentionally deleted, causing AOAI online endpoints to go offline and become unable to serve customer requests.
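The failure mode described above is essentially a missing guardrail, so as a simplified sketch (using a hypothetical resource model rather than the actual Azure SDK), cleanup automation can be written to delete only resources explicitly tagged as safe to remove and to flag everything else for review:

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    """Hypothetical stand-in for a cloud resource and its tags."""
    name: str
    tags: dict = field(default_factory=dict)

def cleanup(resources: list[Resource]) -> None:
    """Delete only resources explicitly opted in to cleanup; flag the rest."""
    for res in resources:
        if res.tags.get("cleanup") == "allowed":
            print(f"Deleting {res.name} (explicitly tagged as safe to delete)")
            # delete_resource(res)  # hypothetical deletion call
        else:
            print(f"Skipping {res.name}: not tagged for cleanup, flagging for review")

cleanup([
    Resource("unused-storage-account-01", {"cleanup": "allowed"}),
    Resource("aoai-managed-endpoint-prod"),  # critical backend resource, untagged
])
```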
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over the past two weeks (July 8-21):
- Reversing the recent downtrend, outage numbers trended upward throughout this period. From July 8 to 14, outages increased 21% compared to the previous week, rising from 151 to 183. The upward trend continued the following week, with outages rising slightly from 183 to 187 between July 15 and 21, a 2% increase over the previous week.
- The United States also saw an upward trend, though with much larger increases than observed globally. There was a 52% increase from July 8-14, followed by a further 12% increase from July 15-21. The rise in outages globally and in the U.S. during this time is somewhat expected and aligns with ThousandEyes' typical observations for this time of year.
- Due to the significant increase in outages in the United States, U.S.-centric outages accounted for over 40% of all observed global outages. In the two-week period from July 8 to July 21, 48% of network outages occurred in the United States, compared to 34% in the previous two weeks (June 24 to July 7).