ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, data we use to analyze outages and other incidents. The following analysis of the Microsoft outage on November 25, 2024, is based on our extensive monitoring, as well as ThousandEyes’ global outage detection service, Internet Insights. See how the outage unfolded in this analysis; more updates will be added as we have them.
Outage Analysis
On November 25, ThousandEyes observed a prolonged outage impacting some Microsoft services, including Outlook Online. The outage appeared to be intermittent, occurring in two main phases. Our analysis of the symptoms indicated that the root cause was likely a backend issue, which Microsoft later confirmed, noting that the problems stemmed from "a change that caused an influx of retry requests routed through servers, impacting service availability."
This outage is a helpful reminder for ITOps teams of best practices for troubleshooting intermittent issues, the importance of knowing your baseline performance, and why it’s critical to look holistically at all available clues when diagnosing the cause of an outage.
We’ll discuss these key takeaways later in this blog post, but first let’s take a closer look at how the Microsoft outage unfolded.
Microsoft Outage Phase 1: Intermittent Issues and a Backend Problem
First observed around 2:00 AM (UTC), the outage initially appeared intermittent, with the impact seemingly restricted to a small number of regions. The outage manifested with symptoms such as timeout errors, resolution failures, and, in some cases, HTTP 503 status codes, indicating that the backend service or system was unavailable.
Notably, the path to the edge servers did not show any adverse network conditions that could explain these timeouts, such as increased packet loss rates at the edge. Taken together, these signals point to backend services as the most likely source of the problem. In other words, although the services were not always responsive, the receive-side errors and server-side status codes indicated that the service front end was reachable, but that subsequent requests for components, objects, or other services were not consistently fulfilled. The intermittent nature of the problem also meant that it was not always obvious to end users, often presenting as slow or lagging responses.
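To make this kind of reasoning concrete, here is a minimal sketch (not ThousandEyes tooling, and the endpoint and thresholds are purely illustrative) of how a single HTTP probe can be bucketed into network-level failures, timeouts, and backend errors. A 5xx response arriving over a clean network path is the pattern described above: the front end answered, but the backend could not serve the request.

```python
# Minimal sketch: classify one HTTP probe as a network-level,
# timeout, or backend failure. Endpoint and timeout are illustrative.
import requests

def classify_probe(url: str, timeout: float = 10.0) -> str:
    """Return a rough failure category for a single HTTP probe."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.exceptions.ConnectionError:
        return "network: connection/DNS failure (front end unreachable)"
    except requests.exceptions.Timeout:
        return "timeout: no response before the deadline"
    if resp.status_code >= 500:
        # A 5xx over a clean path usually points at the backend:
        # the front end is reachable but cannot fulfill the request.
        return f"backend: HTTP {resp.status_code} from a reachable front end"
    return f"ok: HTTP {resp.status_code}"

if __name__ == "__main__":
    print(classify_probe("https://outlook.office.com/owa/"))
```

Run periodically from multiple vantage points, results like these can be correlated with path measurements (packet loss, latency) to separate network problems from application problems.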
Microsoft Outage Phase 2: Packet Loss Increases and More Regions Impacted
The issue initially appeared to clear around 3:05 AM (UTC) but reappeared around 7:00 AM (UTC), manifesting again as timeouts and service unavailability errors. The second instance appeared to impact more regions than the first, with the number of affected servers rising and falling in a cyclical pattern, behavior that can suggest a backend request load issue.
As the second occurrence of the outage progressed, ThousandEyes observed an increase in packet loss at the edge of the Microsoft network, alongside the timeout and service unavailable errors. The loss consistently appeared at the egress of the Microsoft network and was elevated compared to the first disruption, though it never reached a sustained 100% across all paths and tests during the period. This pattern may be attributable to increased congestion when connecting to the services, coupled with an inability to reach or connect to backend services.
Around 9:00 AM (UTC), Microsoft acknowledged issues affecting Exchange Online access and the Microsoft Teams calendar. The company announced that a fix was initiated at about 2:00 PM (UTC), which involved performing "manual restarts on a subset of machines in an unhealthy state." Shortly after this, the number of reported errors rose significantly, with more servers impacted. At 5:25 PM (UTC), Microsoft reported that “targeted restarts are progressing slower than anticipated for most affected users.”
Microsoft later provided further insight into the root cause, explaining that the problems stemmed from "a change that caused an influx of retry requests routed through servers, impacting service availability."
To address the issues, Microsoft implemented optimizations aimed at enhancing the processing capabilities of its infrastructure. With these adjustments in place, service appeared to be gradually restored. This aligns with ThousandEyes' observations, which included a series of timeout-related errors in which services failed to respond, as well as HTTP 503 (service unavailable) and HTTP 404 (not found) errors. These errors indicated that, while communication with the front-end server was established, the server could not locate or reach the requested resource.
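Microsoft attributed the incident to an influx of retry requests. As a general illustration only (this is not Microsoft's client or service code), bounding retries and adding exponential backoff with jitter is a common way for clients to avoid amplifying load on an already-degraded backend:

```python
# Illustrative sketch: bounded retries with exponential backoff and full jitter,
# a common pattern for avoiding retry storms against a degraded backend.
import random
import time

import requests

def get_with_backoff(url: str, max_attempts: int = 5, base_delay: float = 0.5):
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:
                return resp  # success, or a client error we should not retry
        except requests.exceptions.RequestException:
            pass  # transport-level failure; treat as retryable
        # Full jitter spreads retries out in time, so a fleet of clients
        # does not hammer the backend in lockstep after each failure.
        delay = random.uniform(0, base_delay * (2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
```

Without a cap and without jitter, every failed request immediately becomes another request, which is exactly the kind of self-reinforcing load that can keep a struggling backend from recovering.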
Lessons and Takeaways
Intermittent issues like the ones Microsoft experienced can be challenging to pinpoint, as they often present as slow or laggy performance. ITOps teams should have a clear understanding of their normal baseline performance so that they can more easily detect deviations that may indicate an outage. And it doesn’t take much to cause an outage: if just one component, or even a single function, fails or degrades, the entire service delivery chain can be halted.
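As one simple illustration of baselining (a minimal sketch, not ThousandEyes' detection method), a rolling mean and standard deviation over recent latency samples can flag sudden deviations worth investigating:

```python
# Minimal sketch: flag response-time samples that deviate sharply from a
# rolling baseline. Window size and threshold are illustrative; tune to your traffic.
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 60, threshold: float = 3.0):
    history = deque(maxlen=window)

    def check(response_ms: float) -> bool:
        """Return True if the sample deviates sharply from the recent baseline."""
        is_anomaly = False
        if len(history) >= window:
            baseline, spread = mean(history), stdev(history)
            is_anomaly = response_ms > baseline + threshold * max(spread, 1.0)
        history.append(response_ms)
        return is_anomaly

    return check

# Usage: feed periodic measurements; alert when check() returns True.
check = make_detector(window=4)
for sample in [120, 118, 125, 122, 900]:  # illustrative latencies in ms
    if check(sample):
        print(f"latency {sample} ms deviates from baseline")
```

The specific statistic matters less than having a baseline at all: without one, intermittent degradation that presents as "slow but working" is easy to miss.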
When a disruption occurs, it's crucial to identify the source efficiently—and a vital step in this process is establishing what isn't causing the problem. By combining this knowledge with the other data points you have access to, you can start to gain a clearer understanding of the outage’s cause, allowing you to quickly decide on the next steps and communicate effectively with your users.
[November 25, 2024, 2:00 PM PT]
ThousandEyes has been observing a prolonged outage impacting some Microsoft services, including Outlook Online. The incident began as intermittent timeout and application errors starting at around 2:00 AM (UTC) on November 25. The scope of the incident appeared to increase at approximately 7:00 AM (UTC) and again at 12:40 PM (UTC). During the incident, various conditions were observed, including server errors, timeouts, and packet loss. While the incident appears to have been partially resolved, we are still seeing issues for some users attempting to access affected services.