
The Internet Report

Talking Proactive Optimization, ChatGPT Issues & More

By Mike Hicks
23 min read

Summary

Powerful things happen when ITOps teams move beyond a break-fix approach and lean into proactive optimization. Tune in for more on this and recent disruptions at OpenAI’s ChatGPT, Grammarly, Bluesky, and others.


This is the Internet Report, where we analyze outages and trends across the Internet from the previous two weeks through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.

Internet Outages & Trends

Given our focus on analyzing outages and disruptions, it’s natural that much of the discussion in this blog series revolves around break-fixes. Visibility has a key role to play here: It often takes an independent view of the end-to-end service delivery chain to understand all the components within it, and which of them is the likely cause of a degradation or outage. That visibility enables Operations teams to shorten the mean time to detect and respond to an incident by pinpointing exactly which component of the service is having problems, so its owner or maintainer can be identified and a resolution sought. We saw this exact situation unfold with the recent Bluesky disruptions: The Ops team was confronted with multiple signals and needed a way to cut through them to determine what was causing issues for the social media service.

But in a world where change is constant, visibility also has a much more important use case: It can be used to locate potential bottlenecks and areas for optimization and improvement, helping teams deliver better and better digital experiences. This data-driven intelligence is essential for application and service teams to identify performance improvement opportunities and measure how their efforts enhance the user experience. Streamlining one small part of a complex process could shave seconds off the total transaction time; do this for every part of the process, and the efficiency savings can quickly add up.

In recent weeks, OpenAI’s ChatGPT appeared to be undergoing this type of optimization. Following a series of disruptions over the space of a week that manifested as a range of error codes displayed to users, we observed material improvements in page load times. Based on ThousandEyes’ observations of ChatGPT over the last few weeks, it appears the service may have undergone configuration changes and re-architecture in pursuit of performance improvements, and, with independent visibility of the exercise, it seems the efforts paid off.

Read on to learn more about what happened at ChatGPT, as well as other incidents at companies including Grammarly, Bluesky, and Netflix, or jump ahead to the sections that most interest you.


ChatGPT Disruptions

ChatGPT was impacted by a degradation and subsequent disruptions on consecutive days in early November (November 7 and 8), both confirmed by OpenAI on its status page. The root of the November 7 incident may correspond to work that started days earlier, aimed at reconfiguring aspects of the generative AI service to improve performance. The incident raises important questions about best practices for proactively identifying optimization opportunities, as well as for making sure rollouts are successful and any unexpected hiccups are quickly mitigated.

We’ll discuss that more later, but first let’s dive deeper into what we saw at ChatGPT.

On November 4, three days before OpenAI acknowledged the November 7 issue, news reports noted unreliable services as users experienced 404 errors. These initial problems went officially unacknowledged; however, when the issues appeared to return days later, ChatGPT did release a formal statement about the disruption.

One possible explanation for these two disruptions is that whatever occurred on November 4 was preparatory work for what would be implemented during November 7. ThousandEyes’ observations indicate it’s possible that ChatGPT was making some changes to optimize for user experience.

After November 7, ThousandEyes saw an interesting pattern that reflected a significant reduction in page load times for ChatGPT interactions. While significant reductions in page load times can sometimes be indicative of an outage, that doesn’t appear to be what was happening here. When page load times drop during an outage, this frequently coincides with a reduction in the number of page elements or objects. While it does then appear that the page is loading faster, it does so in an incomplete state. These conditions are also temporary, with page load times returning to normal once whatever is causing page elements not to load, or to load incorrectly, is remediated.

However, in this case, things played out differently for ChatGPT. Page load times decreased substantially and then maintained that new level, suggesting that the action was likely deliberate. The ChatGPT page structure seems to have changed, with the object count within the page reduced from 72 to 12. An inspection of the page components suggested that this might have been an attempt to enhance the user experience. Certainly, this change improved performance across a number of ThousandEyes’ tests and customer organizations.

Figure 1. ThousandEyes observed decreased page load times for ChatGPT
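For readers who want to experiment with this kind of measurement themselves, here is a minimal sketch (not ThousandEyes’ methodology) that times a page fetch and counts the static object references in the returned HTML. It assumes the third-party `requests` and `beautifulsoup4` packages and uses a placeholder URL:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical target; substitute the page you want to profile.
URL = "https://www.example.com/"

start = time.monotonic()
response = requests.get(URL, timeout=10)
elapsed = time.monotonic() - start

# Count the static object references (images, external scripts, stylesheets)
# in the returned HTML as a rough proxy for page object count.
soup = BeautifulSoup(response.text, "html.parser")
objects = (
    len(soup.find_all("img"))
    + len(soup.find_all("script", src=True))
    + len(soup.find_all("link", rel="stylesheet"))
)

print(f"HTTP {response.status_code}: fetched in {elapsed:.2f}s, "
      f"{objects} referenced objects")
```

A real browser waterfall would also capture dynamically fetched resources, which is why synthetic browser tests give a fuller picture than a raw HTML fetch.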

As we mentioned at the start of our discussion of this disruption, the ChatGPT incident raises important questions about best practices for proactively identifying optimization opportunities and making sure those optimizations are as successful as possible. When companies make the type of proactive service optimizations that ChatGPT appears to have been making here, temporary disruptions may occur as the team rolls out and refines the changes. In these situations, it’s valuable to have deep visibility into the full end-to-end service delivery chain, both to identify opportunities for optimization in the first place and to guard against any problems the proposed updates might cause. And despite the best-laid plans, unexpected issues can still pop up, so it’s also important to be able to quickly catch and mitigate any surprise problems.

And what about the November 8 disruption? Did that also appear to be related to performance enhancement efforts?

On November 8, ThousandEyes observed instances where ChatGPT returned an HTTP 200 response code even though the page appeared unable to retrieve the resources it needed, with error codes present in the response content.

Figure 2. ChatGPT acknowledged the November 8 issue, noting that ChatGPT was unavailable for 24 minutes
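This pattern, where a service returns HTTP 200 but the page content signals an error, is a useful reminder that status codes alone don’t establish health. As a rough sketch, a synthetic check might validate the body as well; the URL and error markers below are hypothetical placeholders:

```python
import requests

# Hypothetical endpoint and error markers; adjust for the service being tested.
URL = "https://chat.example.com/"
ERROR_MARKERS = ("something went wrong", "internal error", "error code")

response = requests.get(URL, timeout=10)
body = response.text.lower()

# A 200 status alone is not proof of health: also fail the check if the
# page content contains known error strings.
healthy = response.status_code == 200 and not any(m in body for m in ERROR_MARKERS)
print("healthy" if healthy else f"unhealthy (HTTP {response.status_code})")
```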

While the November 8 incident appeared, on the surface, similar to the two previous disruptions on November 4 and 7, it may not actually have been related. OpenAI has provided a post-incident write-up on the November 8 incident, noting that “the root cause was a configuration change to the load-balancing configuration for a downstream service which ChatGPT depends on. This config change activated a latent bug in the logic to actuate the config, rapidly increasing server worker’s memory usage which led to all of them crashing.”

Figure 3. During the incident, ThousandEyes observed ChatGPT’s availability completely dropping off

The configuration change was rolled back to restore service. OpenAI said that in coming weeks it “will significantly refactor [their] configuration delivery systems to prevent this class of outage from happening again.”
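OpenAI hasn’t published the details of its configuration delivery systems, but the general safeguard it describes, applying a change and automatically restoring the previous configuration if health regresses, can be sketched at a high level. The `apply_config` and `health_check` callables below are hypothetical placeholders:

```python
from typing import Callable

# Illustrative only: apply_config and health_check stand in for whatever a
# real config-delivery system provides; this is not OpenAI's implementation.

def guarded_rollout(
    new_config: dict,
    old_config: dict,
    apply_config: Callable[[dict], None],
    health_check: Callable[[], bool],
) -> bool:
    """Apply new_config, then verify health; restore old_config on failure."""
    try:
        apply_config(new_config)
    except Exception:
        apply_config(old_config)
        return False

    if not health_check():
        # Health regressed after the change (e.g., error rates or worker
        # memory usage spiked), so roll back to the known-good config.
        apply_config(old_config)
        return False

    return True
```

A production system would add validation before apply, staged rollout, and alerting on top of this, which is presumably the kind of refactoring OpenAI has in mind.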

Grammarly Issue

On November 8, Grammarly experienced an issue that impacted some users' ability to log in, edit their documents, or access their account settings. Additionally, some users could not see suggestions or corrections from Grammarly.

The fact that the issues were related to specific functions suggested that the problem was on the application side rather than network-related. Additionally, users could still connect to the associated server, further confirming that there were no significant network issues involved in the disruption.

Figure 4. Grammarly said the service interruption created various issues for users, including difficulties with logging in, editing documents, or accessing account settings

The issue appeared to be intermittent and affected only some users. This may have been related to the fact that different users were accessing different functions across varied workflows, meaning they were interacting with the application in different sequences.

Upon further analysis, ThousandEyes observed inconsistencies and fluctuations in page load times across different regions. While there was a slight increase in response time, the most significant observation during the disruption was the increase in page load time.

When analyzing the causes of the page load issues, it became apparent that the additional page load time was associated with "blocked time." To avoid repeatedly re-establishing connections, browsers keep a connection to the server open for a period after the page is fetched, reusing it to retrieve additional resources (such as images, JavaScript, and CSS) in parallel. Typically, an idle connection remains open for about 10 seconds before being closed. When all available connections are busy, however, new requests enter a wait state until an in-flight fetch completes or the timer expires, at which point the request essentially times out. In other words, blocked time is time the web browser spends waiting for other requests to finish before it can issue a new one.

Since some of these functions, such as checking for updates, require dynamic content, any issue with retrieving the requested resources can prolong the execution time to the point of failure. This was evident during the disruption, indicating that the issues lay in resource retrieval.
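Blocked time of this kind is visible in a browser waterfall. As a rough illustration, the sketch below reads a HAR file (as exported from browser developer tools or a synthetic test) and flags requests that spent a long time queued; the file path and one-second threshold are arbitrary placeholders:

```python
import json

# Placeholder path to a HAR export of the page load being investigated.
HAR_PATH = "pageload.har"

with open(HAR_PATH, encoding="utf-8") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    timings = entry["timings"]
    blocked = max(timings.get("blocked", -1), 0)  # -1 means "not measured"
    wait = max(timings.get("wait", -1), 0)
    if blocked > 1000:  # flag requests queued for more than a second
        url = entry["request"]["url"]
        print(f"{url[:70]}  blocked={blocked:.0f}ms  wait={wait:.0f}ms  "
              f"total={entry['time']:.0f}ms")
```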

Figure 5. During the disruption, ThousandEyes observed extended times for resource retrieval

Figure 6. Before and after the disruption, ThousandEyes observed shorter resource retrieval times

Bluesky Disruptions

Bluesky, a social media platform that has been growing rapidly of late, had a pair of issues in mid-November. The first manifested as slow-loading feeds and notifications from the social media app; the second was attributed to third-party network issues, coinciding with Bluesky’s “highest traffic day ever.”

On the application side, messages like “Invalid Handle” suggested an inability to connect to all backend services. Users reported experiencing issues like links not working, random errors, or blank screens. This suggests that while the service’s front door could be reached, requests were not being fully serviced, or the service was operating in a read-only mode.

The root cause of the network issues is a bit more uncertain. Though one carrier did report issues around the same time, it’s unclear whether this had a flow-on impact on Bluesky, or whether its network-related problems were located elsewhere in the end-to-end service delivery chain. When multiple signals indicate potential points of failure, it can be difficult to determine which one is causing the functional failures or slowdowns for users, and therefore which needs to be addressed first. Only by verifying and clarifying all components can an Operations team truly know how to proceed with mitigation or remediation.

OVHcloud Peering Problems

On October 30 at 1:16 PM (UTC), a number of OVHcloud customers experienced issues ranging from increased latency to packet loss. This situation had repercussions for several telecom providers at various levels.


Explore this incident further in the ThousandEyes platform (no login required).

Figure 7. The OVHcloud incident impacted the Washington D.C. region

Figure 8. During the incident, ThousandEyes observed paths withdrawn

The root of the problem was traced back to a faulty configuration in a peer network, specifically Worldstream, which is utilized by OVHcloud. The peer mistakenly announced the full Internet routing table via its peering session instead of just its own routes. This led to a large volume of traffic being directed towards Worldstream’s peering, ultimately causing congestion on its links. The issue was mitigated after 24 minutes.

ThousandEyes observed excessive packet loss at the OVHcloud ASN.

Figure 9. Excessive packet loss seen at OVHcloud ASN

In addition, we observed data attempting to flow to the OVHcloud-hosted environment dropping across multiple providers, indicating that the issue likely lay within the OVHcloud network. This, along with the excessive packet loss across various providers and the withdrawal of routes, aligns with what OVHcloud has stated.

Figure 10. ThousandEyes observed forwarding loss for paths into OVH

In the aftermath of this incident and through a thorough analysis, the OVHcloud Network team discovered that a maximum prefix limit configuration had not been applied on one of the peering links with Worldstream. This oversight allowed the full table to be learned, contributing to the issue. According to OVHcloud, the configuration has since been updated to prevent a recurrence of this problem.

To further enhance the security of the OVHcloud network, a comprehensive review of the configurations on peering devices has been conducted. Additionally, automatic mitigation measures have been implemented, which will trigger an automatic shutdown of the peer session if the number of announced routes exceeds a predetermined limit.

In summary, OVHcloud determined that the incident stemmed from the faulty configuration in the Worldstream peer network, along with the lack of a necessary configuration on one of the peering links.
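OVHcloud hasn’t described how its automatic mitigation is implemented, but the idea, shutting down a peering session when the number of received routes exceeds a predetermined limit, can be sketched. The `get_received_prefix_count` and `shutdown_session` functions below are hypothetical hooks into whatever router API or automation platform is in use:

```python
# Illustrative sketch of a max-prefix safeguard on a peering session.
# get_received_prefix_count() and shutdown_session() are hypothetical hooks,
# not real library calls.

MAX_PREFIXES = 5_000  # expected ceiling for this peer's announcements

def enforce_prefix_limit(peer: str, get_received_prefix_count, shutdown_session) -> None:
    received = get_received_prefix_count(peer)
    if received > MAX_PREFIXES:
        # A peer leaking far more routes than expected (for example, a full
        # Internet table) has its session shut down before the resulting
        # traffic shift can congest the peering links.
        shutdown_session(peer)
        print(f"peer {peer}: {received} prefixes exceeds {MAX_PREFIXES}, session shut down")
```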

Salesforce Outage

On November 15, Salesforce experienced a global outage that impacted five data centers. From ThousandEyes’ observations, the issue appeared to impact the backend, resulting in a series of server-side errors, including "page/resource not found" and timeouts. The timeouts appeared as "connection reset by peer," which usually occurs when the peer crashes. However, this can also happen due to applications or frameworks that do not close their TCP connections properly. The inconsistency suggested a domino effect that caused greater load and ultimately led to a system crash.
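“Connection reset by peer” is what a client sees when the remote end aborts a TCP connection rather than closing it cleanly. The small, self-contained demo below (entirely unrelated to Salesforce’s actual stack) reproduces the error locally by having a server close its socket with a zero linger timeout, which sends a RST:

```python
import socket
import struct
import threading
import time

PORT = 50007  # arbitrary local port for the demo

def abortive_server() -> None:
    """Accept one connection, then abort it so the client sees a RST."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    # SO_LINGER with a zero timeout makes close() send a RST instead of a
    # normal FIN, mimicking a peer that crashes or is torn down abruptly.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()
    srv.close()

threading.Thread(target=abortive_server, daemon=True).start()
time.sleep(0.2)  # give the server a moment to start listening

with socket.create_connection(("127.0.0.1", PORT)) as sock:
    try:
        sock.sendall(b"hello")
        print("received:", sock.recv(4096))
    except (ConnectionResetError, BrokenPipeError):
        print("connection reset by peer")
```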


Explore the Salesforce outage further in the ThousandEyes platform (no login required).

Figure 11. Page load and availability were impacted with no coinciding network conditions, pointing toward server-side or backend issues

As the service was restored, ThousandEyes also observed 5xx internal server error response codes. Interestingly, though the issue followed a scheduled change to the environment, Salesforce reported that the change was a contributing factor that triggered the outage, not the root cause. As a result, simply rolling back the change wouldn’t have resolved the outage, because the underlying issue it had triggered would still be present. It appears that the changes triggered high levels of network traffic, which subsequently impacted database stability.

This outage highlights the importance of considering all available data points when diagnosing an issue’s cause. While at first glance, the scheduled changes might have seemed like the root cause, the situation was actually more complex.

Netflix Issues

Netflix’s hosting of the Jake Paul vs. Mike Tyson and Amanda Serrano vs. Katie Taylor events broke records for the streaming company. The Paul vs. Tyson event is reportedly the most-streamed global sporting event, with a peak of 65 million concurrent streams, and the Taylor-Serrano match ranked as the most-watched professional women’s sports event in U.S. history.

This weight of interest also seemed to break infrastructure—or at least push capacity past thresholds—as a number of users experienced buffering, freezing, and laggy performance. There hasn’t been any official explanation for the problems, only the announcement that the event set a new mark for Netflix use at any one time.

Verizon Fios Disruption

Verizon experienced Internet issues on November 12 on its Fios service, a fiber optic-based network that offers bundled telecommunications and TV to subscribers.

ThousandEyes observed some disruption, although it seemed to be limited to specific regional areas and to customers within those areas. Given the localized nature of the issues, they were most likely the result of a configuration issue or a problem with some aggregated common infrastructure. Restoration times also support this theory, indicating that the issue was addressed through software adjustments. By contrast, when disruptions are caused by damage to physical infrastructure like fiber cables, a truck roll is usually required to make repairs such as resplicing, which results in significantly longer resolution times.

Figure 12. ThousandEyes observed localized packet loss across parts of Verizon’s network

While the disruption primarily affected customers directly connected to Fios, traffic that was flowing to or across peering or access points related to the Fios environment may also have been impacted. Although disruptions and user connectivity issues seemed to be limited to Fios broadband subscribers, the challenges in passing through or returning traffic were evident.


By the Numbers

Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over recent weeks (November 4-17):

  • The total number of global outages initially decreased. In the first week of the period, ThousandEyes observed a 5% drop in the number of outages, from 187 to 178. However, this downward trend was short-lived, as the following week—November 11 to November 17—saw a significant increase in outages. They rose from 178 to 250, a 40% increase compared to the previous week (a quick calculation of these week-over-week changes appears after this list).

  • During this period, the United States did not experience a decline; instead, outages increased by 30% in the first week (November 4-10). This was followed by an even larger surge the following week, with outages rising from 112 to 161, representing a 44% increase compared to the previous week.

  • From November 4 to November 17, an average of 64% of all network outages occurred in the United States, a significant increase from 42% in the previous period (October 21 to November 3). This is the highest level seen this calendar year, but it aligns with a pattern often seen in 2024, where U.S.-centric outages typically account for at least 40% of all recorded outages.
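As a quick check of the week-over-week changes cited above, the percentage figures can be reproduced directly from the outage counts:

```python
# Reproducing the week-over-week outage changes from the bullets above.
def pct_change(before: int, after: int) -> float:
    return (after - before) / before * 100

print(f"Global, Nov 4-10:  {pct_change(187, 178):+.0f}%")  # about -5%
print(f"Global, Nov 11-17: {pct_change(178, 250):+.0f}%")  # about +40%
print(f"U.S.,   Nov 11-17: {pct_change(112, 161):+.0f}%")  # about +44%
```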

Figure 13. Global and U.S. network outage trends over eight recent weeks (September 23 to November 17)
