
The Internet Report

ServiceNow, Microsoft & Workday Outages, Explained

By Mike Hicks
16 min read
Listen to The Internet Report on Apple Podcasts, Spotify, or SoundCloud.

Summary

A recent certificate problem impacted ServiceNow, and other issues prevented users from accessing key cloud services including Microsoft 365, Azure Virtual Desktop, and Workday.


This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.


Internet Outages & Trends

Enterprise users of cloud-based services experienced a range of issues over the past fortnight that temporarily stood between them and their data.

For a subset of ServiceNow users, an expired root certificate led to connectivity, performance, and functionality issues. Some Microsoft 365 users found that their legitimate credentials simply didn’t work, leaving them seemingly stuck in an authentication loop, while Azure Virtual Desktop users in parts of the U.S. could not make new connections, perform management actions, or access resources due to a configuration management issue. Lastly, an issue impacting a shared resource prevented a number of Workday customers from interacting with their instances.

For cloud service providers, the variety of problems highlights that digital service delivery chains are only as strong as their weakest link. If just one component of the chain encounters issues, it can disrupt the entire service. Meanwhile, for users, the variety of problems that caused these outages, along with the varying amount of status information available in each case, highlights the importance of maintaining independent visibility across the range of cloud services used in the average enterprise.

Read on to learn more about these outages and recent trends, or skip ahead to the sections that most interest you.


ServiceNow Outage

About 600 ServiceNow customers were reportedly impacted by the expiry of a root certificate—and an unsuccessful attempt to update it—on September 23.

The outage impacted a Management, Instrumentation, and Discovery (MID) Server—a crucial component of ServiceNow that “enables communication and the movement of data between a ServiceNow instance and external applications, data sources, and services.” It supports various use cases such as integrations, orchestration, and discovery.

Understanding the MID Server’s role is key to comprehending the impact of the incident. A subset of customers reportedly experienced connectivity problems between cloud instances and MID Servers, as well as other performance and functionality issues.

SSL server certificates, which are used by web servers providing content through HTTPS URLs, are issued by certification authorities and must be digitally signed. When a root certificate expires, it can lead to service and connectivity interruptions. In this specific case, it appears that ServiceNow was aware of the impending certificate expiration but encountered issues when trying to replace the expiring certificate.
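
For teams that want early warning before a certificate expires, a lightweight check can be scripted against any HTTPS endpoint. The sketch below is a minimal Python example using only the standard library; the hostname is a placeholder rather than a ServiceNow endpoint, and a production check would typically cover the full certificate chain, not just the leaf certificate an endpoint presents.

```python
# Minimal sketch: warn when the TLS certificate presented by an endpoint is close to expiry.
# The hostname is a placeholder. Note that if the chain is already invalid (e.g., an expired
# root), the handshake itself fails, which is exactly why proactive checks matter.
import socket
import ssl
import time

def days_until_expiry(hostname: str, port: int = 443) -> float:
    """Return the number of days until the certificate presented by hostname:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])  # e.g., 'Jun  1 12:00:00 2026 GMT'
    return (expires_at - time.time()) / 86400

if __name__ == "__main__":
    remaining = days_until_expiry("example.com")
    status = "WARNING" if remaining < 30 else "OK"
    print(f"{status}: certificate expires in {remaining:.0f} days")
```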

The outage serves as a reminder of the critical role each function in a digital ecosystem or end-to-end delivery chain plays in maintaining seamless operations. An application or service is only as strong as its weakest link. For engineering and ITOps teams, that means committing to continuous testing, optimization, and improvement mechanisms to try to ensure systems hit uptime and performance expectations. In spite of these efforts, errors or unforeseen circumstances may still arise from time to time, leaving engineering teams with learnings they can apply to guard against similar issues in the future.

Certificate expiry or renewal is a theme that we’ve discussed before, and certificate-related incidents are often more complex than a simple omission. In the past year, we’ve seen certificate changes at an upstream provider impact third-party services downstream, and erroneous manual changes to a certificate lead to service inaccessibility.

Learnings from that latter incident still apply today: “It only takes a degradation or outage in one component to have a flow-on impact, potentially taking out the entire service. Given this reality, some teams may choose to invest in tools that provide them visibility and early warning into things like soon-to-expire certificates—or put other strategies in place to guard against such issues.”

Microsoft 365 Outage

On September 24, Microsoft also experienced an outage where issues with one component of the digital service delivery chain—the authentication step—rendered the service unusable for some users. Starting around 9:10 AM (UTC), users got stuck in an “authentication loop” that prevented them from accessing some Microsoft services, including Microsoft365.com and office.com.

Figure 1. Microsoft’s X post acknowledging the issues was posted at 9:10 AM (UTC)

The use of the term "authentication loop" implies that users were able to reach the authentication screen and enter their credentials, only to encounter an error or have the action canceled.

Figure 2. Connection request redirected back to sign on, before timing out

The sign-on request appeared to be redirected back to a sign-on page rather than being passed on for authentication, resulting in timeouts that prevented users from authenticating and accessing online collaboration services.
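
One way a synthetic test could surface this kind of behavior is to follow redirects manually and flag when the client is bounced back to a URL it has already visited. The sketch below is illustrative only: the start URL is a placeholder, and a real Microsoft 365 sign-in flow involves additional steps such as form posts and token exchanges.

```python
# Minimal sketch: follow a sign-in redirect chain manually and flag a loop.
# The start URL is a placeholder; real sign-in flows include POSTs, cookies, and tokens.
import requests

def detect_redirect_loop(start_url: str, max_hops: int = 15, timeout: int = 10) -> bool:
    """Return True if the redirect chain revisits a URL or never terminates."""
    session = requests.Session()
    seen = set()
    url = start_url
    for _ in range(max_hops):
        resp = session.get(url, allow_redirects=False, timeout=timeout)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return False  # chain terminated normally (e.g., a 200 on the signed-in page)
        location = resp.headers.get("Location", "")
        url = requests.compat.urljoin(url, location)  # Location may be a relative URL
        if url in seen:
            return True  # bounced back to a URL already visited: likely a sign-in loop
        seen.add(url)
    return True  # hop budget exhausted without terminating: treat as a loop

if __name__ == "__main__":
    if detect_redirect_loop("https://www.office.com/login"):
        print("Possible authentication loop detected")
```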

Figure 3. Multiple regions around the globe exhibited timeouts

During the outage, ThousandEyes’ observations of network paths showed no specific evidence of conditions that would contribute to the inability to log on. In other words, traffic was able to reach the associated services, but requests were not completing, indicating that the issue lay within the backend systems.
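
That distinction, reachable over the network but failing at the application layer, can be approximated with a simple two-step probe: confirm that a TCP connection succeeds, then check whether an HTTP request completes in time. The snippet below is a rough sketch with a placeholder hostname, not a reproduction of the ThousandEyes tests.

```python
# Minimal sketch: distinguish "can't reach the service" from "reachable but not responding".
# The hostname and path are placeholders.
import socket
import requests

def classify_failure(host: str, path: str = "/", timeout: int = 10) -> str:
    # Step 1: transport check. Can we open a TCP connection at all?
    try:
        with socket.create_connection((host, 443), timeout=timeout):
            pass
    except OSError:
        return "network: TCP connection failed"
    # Step 2: application check. Does the service return a response in time?
    try:
        resp = requests.get(f"https://{host}{path}", timeout=timeout)
        return f"ok: HTTP {resp.status_code}"
    except requests.exceptions.Timeout:
        return "application: reachable over the network, but the request timed out"
    except requests.exceptions.RequestException as exc:
        return f"application: request failed ({exc.__class__.__name__})"

if __name__ == "__main__":
    print(classify_failure("www.example.com"))
```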

Figure 4. ThousandEyes observed no coinciding network conditions, indicating that the issue lay with the application and associated backend services

This type of authentication failure—where users are stuck in a loop, entering correct credentials only to see the attempt fail—may lead users to believe that they have entered the wrong password or that their credentials have been compromised. However, in this scenario, the credentials did not appear to be rejected per se; users were simply asked to re-enter them.

At all points in the service delivery chain, including the authentication step, it’s important to recognize the signs of normal or abnormal situations and then take appropriate action to address, resolve, or find a workaround for the issue. It’s also critical to have a deep understanding of all the services that make up that chain. For example, does a certain provider handle authentication natively, or does it rely on a third-party authentication service (e.g., logging in with your Google or Facebook account)? This understanding can help your team efficiently pinpoint the source of an outage.

Azure Virtual Desktop Outage

Between 6:46 PM (UTC) and 8:36 PM (UTC) on September 16, a subset of Azure Virtual Desktop users in various U.S. regions “experienced failures to access their list of available resources, make new connections, or perform management actions.” The biggest impact was to customers whose configuration and metadata were stored in the East US 2 region of Azure.

According to Microsoft, the problem arose due to a degradation with a SQL database that stores configuration data and an associated process that replicates that configuration data from the primary database to “multiple secondary copies.” Internal telemetry alerts indicated that one of the secondary database replicas was operating “several hours behind” the primary. Engineers manually intervened to complete the failover, and that resolved the issues.
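
Replication lag of this kind is often watched with a heartbeat pattern: a job on the primary updates a timestamp row, and a monitor reads that row from each replica to see how stale it is. The sketch below assumes a hypothetical heartbeat table and a generic DB-API connection; it is an illustration of the idea, not Microsoft's internal telemetry.

```python
# Minimal sketch of a heartbeat-based replication lag check. Assumes a hypothetical table:
#   CREATE TABLE heartbeat (id INT PRIMARY KEY, written_at TIMESTAMP);
# that a job on the primary updates every few seconds, with timestamps stored in UTC.
# replica_conn is any DB-API-compatible connection to a read replica.
from datetime import datetime, timezone

LAG_ALERT_SECONDS = 300  # alert if a replica falls more than 5 minutes behind

def replica_lag_seconds(replica_conn) -> float:
    """Read the heartbeat row on a replica and return how stale it is, in seconds."""
    cur = replica_conn.cursor()
    cur.execute("SELECT written_at FROM heartbeat WHERE id = 1")
    (written_at,) = cur.fetchone()
    if written_at.tzinfo is None:  # treat naive timestamps as UTC
        written_at = written_at.replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - written_at).total_seconds()

def check_replica(name: str, replica_conn) -> None:
    lag = replica_lag_seconds(replica_conn)
    if lag > LAG_ALERT_SECONDS:
        print(f"ALERT: replica {name} is {lag / 3600:.1f} hours behind the primary")
    else:
        print(f"Replica {name} lag: {lag:.0f} seconds")
```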

The issue affected customers in the Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, and West US 3 regions of Azure. Additionally, end users with connections established via any of these U.S. regions may have also experienced problems.

It appeared that there was nothing customers could have done to prevent or minimize the impact of this specific incident. Microsoft stated that the issue stemmed from a deterioration in the geo database serving the U.S., causing failures in accessing service resources. This incident serves as a reminder that seemingly unrelated processes can also lead to system failures, beyond what would traditionally be considered the typical service delivery chain.

Workday Outage

On October 1, Workday experienced an outage that impacted customers whose production tenants are hosted in Workday's WD5 data center in Portland, Oregon. The outage affected customers trying to interact with their instances. Customers intermittently experienced server timeouts, which suggests application-side issues.

Figure 5. Users globally seeing timeouts attempting to connect to a Workday tenant hosted in Portland, OR

ThousandEyes tests to tenants in the WD5 data center revealed a number of errors in the receive phase, which measures the time required to receive a response from the server: essentially, the time from the first byte to the last byte of the payload, encompassing the loading of all components, some of which may involve calls to backend services or resources. The receive-phase errors suggest that the disruption originated in backend services or systems. In simpler terms, the site itself could be reached, but users would not have been able to interact with it. Notably, no significant adverse network conditions, such as high loss or excessive latency, were observed on connections to WD5 tenants during the outage.
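
For teams measuring this themselves, the receive phase can be approximated by timing how long it takes to stream a response body after the response headers arrive. The sketch below uses the Python requests library with a placeholder URL; it is a rough approximation, not the methodology behind the ThousandEyes tests.

```python
# Minimal sketch: split a request into a "wait" phase (connection setup plus server response
# time) and a "receive" phase (streaming the body, roughly first byte to last byte of the
# payload). The URL is a placeholder.
import time
import requests

def timed_fetch(url: str, timeout: int = 30) -> dict:
    start = time.monotonic()
    resp = requests.get(url, stream=True, timeout=timeout)  # returns once headers arrive
    headers_received = time.monotonic()
    body = b"".join(resp.iter_content(chunk_size=8192))     # read the full payload
    last_byte = time.monotonic()
    return {
        "status": resp.status_code,
        "wait_s": round(headers_received - start, 3),
        "receive_s": round(last_byte - headers_received, 3),
        "bytes": len(body),
    }

if __name__ == "__main__":
    print(timed_fetch("https://www.example.com/"))
```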

Figure 6. Network paths to a WD5 tenant not displaying any adverse network conditions during the outage

Workday confirmed that the issue was impacting production tenants in the WD5 data center. They pointed to an issue with a shared resource. The intermittent nature of the issue and associated timeouts, along with the fact that it only affected tenants in WD5, indicates that the shared resource was not only associated with that facility but most likely involved in distributing and delivering the requests to the backend services.

Figure 7. Workday acknowledged an issue with connections to production tenants at WD5

Workday posted a status update in their community forum, stating that they were adjusting the configuration of a shared service to address intermittent latency and "connection closed" errors impacting integrations. Subsequently, connectivity to services began to return.

Reliance Jio Outage

On September 17, Reliance Jio customers across multiple areas of India faced significant connectivity issues, including mobile Internet access problems and frequent call drops. The outage seemed to affect some users more than others, with reports of inconsistent Internet connectivity. Many users mentioned that Jio's services were particularly impacted in Mumbai.

According to a report from Reuters, which cited a source with "direct knowledge of the matter," the cause of the outage was a fire in one of Reliance's data centers. Specifics about the data center experiencing issues were not shared.

ThousandEyes observed forwarding loss for traffic and connections on nodes located in Mumbai.

Figure 8. The outage impacted multiple downstream partners and customers across India, with forwarding loss centered in Mumbai

This Reliance Jio outage is the second major outage reportedly caused by a data center fire this September. On September 10, Alibaba Cloud also experienced an outage when lithium batteries in a Singapore data center exploded, leading to a fire and elevated temperatures.


By the Numbers

Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over two recent weeks (September 16-29):

  • The downward trend observed since mid-August came to an end during the period of September 16-29, as the total number of global outages increased. There was a 5% increase in the first week, with outages rising from 170 to 178. This trend continued into the following week, with outages increasing from 178 to 192 between September 23 and 29, marking an 8% increase compared to the previous week (see the quick calculation after this list).

  • The United States followed a different pattern, with outages decreasing throughout the fortnight. Initially, outages dropped by 26% during the first week of the period (September 16-22). There was a further decrease in the following week (September 23-29), with outages falling from 64 to 58, a 9% decrease.

  • Only 33% of all network outages occurred in the United States during this period, compared to 50% in the previous two weeks (September 2-15). This represents a deviation from the previously observed pattern, where U.S.-centric outages typically accounted for at least 40% of all observed outages.
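
The week-over-week percentages above follow from simple ratio arithmetic on the outage counts quoted in this section, as the short snippet below reproduces.

```python
# Reproduce the week-over-week changes from the outage counts quoted above.
def pct_change(prev: int, curr: int) -> float:
    return (curr - prev) / prev * 100

print(f"Global, Sep 16-22: 170 -> 178 = {pct_change(170, 178):+.0f}%")  # about +5%
print(f"Global, Sep 23-29: 178 -> 192 = {pct_change(178, 192):+.0f}%")  # about +8%
print(f"U.S.,   Sep 23-29:  64 -> 58  = {pct_change(64, 58):+.0f}%")    # about -9%
```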

Figure 9. Global and U.S. network outage trends over the eight weeks from August 5 to September 29
