This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.
Internet Outages & Trends
Backend problems can often be difficult for ITOps teams to diagnose due to the complexity of web-based architecture that underpins the frontend, customer-facing experience.
To maintain resilience and reduce downtime, it’s important to monitor the performance of your entire service delivery chain and quickly pinpoint problematic patterns. It’s about having the data and being able to view it in context, so that your team can make the right decision based on all available evidence.
In certain situations, it may be more beneficial to switch to an alternative system that can temporarily handle the workload while the primary service is being repaired. Implementing mitigations on your side, such as adjusting configurations, optimizing performance parameters, or applying patches, can also be effective strategies. Alternatively, there may be instances where you decide to do nothing and just allow the issue to resolve itself, especially if it seems temporary.
The decision will depend a lot on the nature of the impacted service: It’s a lot easier to move to an alternative web-based collaboration or transcription service, for example, than it is to change a monitoring dashboard for an IT environment.
In every case, a structured approach to evaluating the situation and potential responses enhances your team's ability to manage service interruptions effectively and efficiently.
Let’s explore four recent backend-related disruptions that impacted Slack, Microsoft 365, Grafana Cloud, and Otter.ai—and the lessons they impart.
Read on to learn more, or use the links below to jump to the sections that most interest you:
Slack Outage
On February 26, around 3:00 PM (UTC) or 7:00 AM (PST), ThousandEyes detected a service disruption affecting Slack users. The issue initially presented as timeouts, despite stable network connectivity, suggesting a problem with the backend application. Initially, the impact was sporadic and not concentrated in any specific geographical area. However, as the disruption worsened, the affected area expanded, and users began encountering HTTP 500 server errors. This escalation turned the issue into a global problem, lasting over nine hours and rendering the collaboration tool inaccessible for much of the business day across the United States.
At the outage’s height, a significant percentage of Slack users experienced issues with various features, including sending and receiving messages, using workflows, loading channels or threads, and logging into Slack. Many of these features were degraded or, in some cases, completely unusable.
Interestingly, it initially appeared that the situation was improving, with some areas still operational. There were periods where Slack seemed to be operating as expected, indicated by the castellation effect in the timeline just prior to the severe downtime event (see Figure 1 below).

As the global impact grew, ThousandEyes began to observe instances of HTTP 500 internal server errors in response to requests. An HTTP 500 error indicates that the web server has encountered an internal issue and cannot complete the request. The causes of an HTTP 500 error can vary, but some of the most common are server configuration errors, errors in application code, database errors, and issues related to permissions or authorization.
The HTTP 500 errors, along with other signals observed by ThousandEyes, indicated that the outage likely originated from an issue with a backend server. There were no accompanying problems in the network paths, such as increased latency or packet loss.
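To illustrate the diagnostic pattern described above, here’s a minimal sketch of a probe that separates network-layer failures from application-layer HTTP 5xx responses. The URL, timeout, and labels are placeholder assumptions for illustration, not ThousandEyes tooling or Slack’s actual endpoints.

```python
import urllib.error
import urllib.request

def classify_probe(url: str, timeout: float = 5.0) -> str:
    """Return a rough fault-domain label for a single HTTP probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"healthy (HTTP {resp.status})"
    except urllib.error.HTTPError as err:
        # The server answered, so the network path and frontend are reachable.
        if err.code in (502, 503, 504):
            return f"backend forwarding issue suspected (HTTP {err.code})"
        if err.code == 500:
            return "backend application or database issue suspected (HTTP 500)"
        return f"server responded with HTTP {err.code}"
    except urllib.error.URLError as err:
        # No HTTP response at all: DNS, TCP, TLS, or timeout problems point
        # toward the network rather than the application.
        return f"network-layer issue suspected ({err.reason})"
    except OSError as err:
        return f"network-layer issue suspected ({err})"

if __name__ == "__main__":
    # Placeholder URL; substitute the service endpoint you monitor.
    print(classify_probe("https://example.com/health"))
```

A probe like this, run from multiple vantage points, mirrors the distinction that mattered in this outage: clean network paths combined with 5xx responses point at the backend rather than the network.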

As noted, during the disruption there appeared to be no coinciding network conditions that could account for the application behavior observed. This held across all observed paths from multiple regions around the globe, each of which appeared to terminate at, and be served by, an instance local to the requesting agent’s geographic location.
This observation matters because, during an outage of this kind, we look for commonalities or aggregation points, such as a specific impacted region, a content delivery network (CDN), or a provider. Here, however, all observed network paths appeared intact and the error condition was consistent across regions, indicating that the source of the issue was likely a function or feature common and central to the service’s operation.

ThousandEyes observed that the issue affected all platforms, including the desktop client, mobile app, and web browser. The absence of network-related problems, the lack of HTTP 502 or HTTP 503 errors (often indicative of backend forwarding issues), the lack of obvious authentication errors, and the specific functions that were impacted together suggested that the common factor may have been the database.
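That “look for commonalities” reasoning can be sketched in a few lines. The observations below are invented for illustration; the point is simply that an error appearing uniformly across every region and platform, while network paths stay clean, points toward a shared backend component.

```python
from collections import defaultdict

# Hypothetical observations; real data would come from your monitoring agents.
observations = [
    {"region": "us-east", "platform": "web", "error": "HTTP 500"},
    {"region": "eu-west", "platform": "desktop", "error": "HTTP 500"},
    {"region": "ap-south", "platform": "mobile", "error": "HTTP 500"},
    {"region": "us-west", "platform": "web", "error": "HTTP 500"},
]

def errors_by(dimension: str) -> dict:
    """Group the observed errors by a dimension such as region or platform."""
    groups = defaultdict(set)
    for obs in observations:
        groups[obs[dimension]].add(obs["error"])
    return dict(groups)

for dimension in ("region", "platform"):
    groups = errors_by(dimension)
    uniform = len({frozenset(errors) for errors in groups.values()}) == 1
    verdict = "same error everywhere" if uniform else "errors are localized"
    print(f"{dimension}: {verdict} -> {groups}")

# If every region and every platform reports the same error while network
# paths stay clean, the fault domain is likely a shared backend component
# (for example, a central database) rather than a regional POP, CDN, or ISP.
```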
This was later confirmed by Slack, which identified the cause of the issue as a maintenance action in their database systems. This issue, combined with a latency defect in their caching system, resulted in an overload of heavy traffic directed at the database. Consequently, approximately 50% of instances relying on this database became unavailable.
To address the situation, Slack took several steps to reduce the heavy load on the database system and implemented a fix for the root cause of the overload. By around 5:32 PM (UTC) / 9:32 AM (PST), some sessions showed improved health metrics for affected Slack features. Finally, by 12:13 AM (UTC) / 4:13 PM (PST), the issue was resolved for all affected users.
This Slack outage underscores that when it comes to backend issues—or indeed, any type of issue—it’s important to monitor the performance of your entire service delivery chain, so that you can quickly pinpoint the specific fault domain when a problem pops up. With this understanding, you can take the right steps to mitigate the user impact and resolve the issue. These steps may include switching to a backup system or taking other mitigation actions on your end. And in some cases, you may discern that it’s best to simply wait for the issue to resolve itself.
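To make that switch/mitigate/wait judgment a little more concrete, here’s a hedged sketch of how such a decision might be structured. The signal fields and thresholds are assumptions invented for illustration; they are not ThousandEyes tooling or a prescription for any particular service.

```python
from dataclasses import dataclass

@dataclass
class IncidentSignal:
    error_rate: float       # fraction of failed transactions in the last window
    minutes_elapsed: int    # how long the degradation has persisted
    backup_available: bool  # is a viable alternative service ready to use?

def recommend(signal: IncidentSignal) -> str:
    """Map incident signals to one of the three responses discussed above."""
    if signal.error_rate < 0.05:
        return "wait: impact is minor and may resolve on its own"
    if signal.backup_available and signal.minutes_elapsed > 30:
        return "switch: move users to the backup or alternative service"
    return "mitigate: adjust configurations, shed load, or apply patches"

# Example: a sustained, severe outage with a backup available.
print(recommend(IncidentSignal(error_rate=0.4, minutes_elapsed=45, backup_available=True)))
```

The point isn’t the specific numbers; it’s that the decision becomes much easier when the fault domain is already known from monitoring data.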
Microsoft 365 Outage
On March 1, Microsoft 365 customers reported experiencing problems with various services, including Outlook.
During the disruption, ThousandEyes observed that while network connectivity to the application’s frontend servers appeared to be functioning properly, the issue seemed to affect authentication processes across multiple Microsoft services. This indicated that the problem lay within the backend.
Microsoft later confirmed this, attributing the issue to a “problematic code change.” Once the change was reverted, the issue was resolved and service was restored.

Grafana Cloud Disruption
A subset of users of the observability service Grafana Cloud experienced “longer than expected load times” for multiple days, starting on February 24.
The issue appeared to affect only instances hosted in AWS environments, impacting both existing instances and the ability to create new ones. Additionally, Grafana engineers said most issues were experienced “in the prod-us-east-0 and prod-eu-west-2 regions” of AWS.

The fact that only AWS-hosted environments were experiencing the issue, and that users were having trouble creating new instances, suggests that the problem was likely linked to a specific service or functionality within that environment.
The issue manifested as degraded performance and slower-than-expected load times. Users attempting to spin up new instances in these AWS-hosted environments would likely have run into similar difficulties, a further sign that the degradation affected day-to-day operations as well as existing workloads.
In this case, we were not dealing with an application-wide issue, a network issue, or even a problem with the cloud provider. Instead, the problem centered on a specific function uniquely related to the interaction between Grafana and the hosted instances. When diagnosing issues, looking for commonalities not only helps in identifying the cause of the problem but also informs the mitigation strategies, workarounds, or processes that are put in place to alleviate it.
A Pair of Otter.ai Outages
On both February 24 and 26, Otter.ai, an AI meeting assistant, experienced an outage that affected users’ ability to interact with the system. During the disruptions, users reported encountering HTTP 502 Bad Gateway errors in their browsers, as well as occasional HTTP 500 Internal Server Error responses.
The 5xx errors indicated that the frontend system was available but wasn’t able to execute requests properly. Specifically, the presence of the 502 Bad Gateway error suggests that a gateway or proxy in front of the application received an invalid or incomplete response from the upstream server and couldn’t fulfill the request. A 502 can also occur when security measures block communication between the gateway server and the upstream server.
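To show where a 502 originates, here’s a toy gateway written for this article: the gateway itself stays reachable, but when its upstream doesn’t answer, it returns 502 Bad Gateway to the client. The UPSTREAM address, port, and behavior are assumptions for illustration and not a representation of Otter.ai’s architecture.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.error
import urllib.request

UPSTREAM = "http://127.0.0.1:9000"  # placeholder backend address for this sketch

class Gateway(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with urllib.request.urlopen(UPSTREAM + self.path, timeout=3) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.end_headers()
                self.wfile.write(body)
        except urllib.error.HTTPError as err:
            # The upstream answered with an error; pass its status through.
            self.send_response(err.code)
            self.end_headers()
            self.wfile.write(err.read())
        except (urllib.error.URLError, OSError):
            # The gateway itself is reachable, but the backend is not, so the
            # client sees 502 Bad Gateway instead of a network-level timeout.
            self.send_response(502)
            self.end_headers()
            self.wfile.write(b"Bad Gateway: upstream did not respond")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Gateway).serve_forever()
```

In other words, seeing a 502 tells you the frontend tier is up and talking to you; the failure sits behind it.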
Anecdotally, while users could access the login page, they were unable to authenticate with the service. This aligns with the 502 errors, which point to backend service issues. It’s also noteworthy that while other systems reportedly functioned correctly, the accessibility issues persisted in the web interface, further suggesting that the problem was associated with the authentication process.
The timing of the outages suggests that they may have been planned to coincide with off-peak hours, minimizing potential disruption for most users worldwide. The occurrence of these two outages in a similar timeframe could indicate that the team was retrying changes or maintenance work that had been attempted previously.
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed over recent weeks (February 17 - March 2) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.
- During this period, after a slight initial decline, the total number of global outages resumed the upward trend we’ve seen since early February. In the period’s first week (February 17 - 23), ThousandEyes reported a small 0.2% decrease, with outages decreasing from 398 to 397. However, the following week (February 24 - March 2) saw a return to an upward trend, as the number of outages increased from 397 to 447, representing a 13% rise compared to the previous week.
- Outages in the United States followed a different pattern. Initially, outages increased slightly, rising from 196 to 199, a 2% increase compared to the previous week. However, during the second week (February 24 - March 2), outages dropped from 199 to 189, reflecting a 5% decrease.
- From February 17 to March 2, an average of 46% of all network outages occurred in the United States, down from the 55% reported during the previous period (January 27 to February 2). This figure of 46% continues a trend observed throughout 2024, where U.S.-centric outages accounted for at least 40% of all recorded outages.
- Looking at the month-over-month numbers, in February, there were 1,595 outages observed globally, marking a 15% increase from the 1,382 outages recorded in January. In the United States, outages rose from 657 in January to 811 in February, a 23% increase. This trend aligns with patterns observed in previous years, as total outages have typically increased from January to February, both globally and in the U.S. (A quick sketch of how these percentage changes are calculated follows this list.)
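As referenced above, here’s a quick sketch showing how the week-over-week and month-over-month percentage changes quoted in this section follow from the outage counts. The counts are the figures reported above; the helper function itself is just an illustration.

```python
def pct_change(previous: int, current: int) -> float:
    """Percentage change from the previous count to the current count."""
    return (current - previous) / previous * 100

# Outage counts as quoted in this section.
figures = {
    "Global, week over week (Feb 17-23 vs. Feb 24-Mar 2)": (397, 447),
    "U.S., week over week (Feb 17-23 vs. Feb 24-Mar 2)": (199, 189),
    "Global, January vs. February": (1382, 1595),
    "U.S., January vs. February": (657, 811),
}

for label, (previous, current) in figures.items():
    print(f"{label}: {previous} -> {current} ({pct_change(previous, current):+.0f}%)")
```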
