

Netflix Broadcast Disruption: Lessons for Major Live Events

By Mike Hicks | 14 min read

Summary

The issues Netflix experienced during the Jake Paul vs. Mike Tyson boxing event offer valuable lessons for ITOps teams seeking to assure flawless digital experiences.


“Dead air”—the unsettling silence that sometimes replaces regular content—remains a significant concern for traditional TV and radio stations. In those critical moments of nothingness, viewers may be tempted to switch channels or turn off their devices completely. As a result, broadcasters have always been keen to avoid any interruptions in programming or transmission.

In the online streaming business, the parallel dilemma appears as glitches, whether stemming from network issues or backend service failures. Viewers may experience pixelated images, buffering delays, or streams that abruptly stop altogether.

The significance of these lost seconds or minutes varies depending on the type of event being streamed. For instance, a sudden glitch during a live sports event could mean the difference between witnessing a world record being set or missing the moment entirely.

This kind of issue reportedly occurred during the Jake Paul vs. Mike Tyson boxing event streamed live on Netflix, with many users reporting problems such as poor video quality, pixelation, stuttering, and even outright crashes leading up to and during the main event.

Although this wasn't Netflix's first attempt at live streaming, the fight was reportedly the most-streamed global sporting event, and the high number of simultaneous connection requests may have contributed to the issues.

This Netflix event offers valuable lessons for ITOps teams seeking to assure flawless digital experiences during major events, whether it be a much-anticipated sporting event or a Black Friday sale. We'll cover those lessons later in this blog, but first let's look at what happened and why.

How does Netflix serve content?

Before discussing what happened, it’s important to understand how services like Netflix serve content to their users.

Netflix relies on its own content delivery network (CDN), called Open Connect, to manage and deliver 100% of its video traffic to end users. At the heart of Open Connect are specially designed server appliances known as Open Connect Appliances (OCAs). Netflix deploys OCAs at Internet exchange points (IXPs) in key markets worldwide. These OCAs connect with local Internet service providers (ISPs) through either public or private peering arrangements. Netflix also offers an embedded deployment option, allowing OCAs to be set up directly within ISP networks.

These appliances store encoded video and image files and deliver them to client devices via HTTP/HTTPS. For generic, scalable computing needs, Netflix turns to Amazon Web Services (AWS) for all activities that take place before a user hits “play.” This encompasses everything from application interface logic and content discovery to recommendation algorithms and transcoding.

OCA servers continuously report their status back to the Open Connect control plane services within AWS. This data includes health metrics, BGP routes, and the files stored on the servers. The control plane services analyze the information provided by the OCAs to direct clients (via a URL) to the optimal OCAs based on file availability, server health, and network proximity to the user. Moreover, these services manage file storage, add new files to OCAs, optimize storage and hashing performance, and handle telemetry collection and analysis relevant to the playback experience.
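To make this steering decision more concrete, here is a minimal Python sketch of the kind of selection the control plane performs when directing a client to an OCA. The data model, field names, and tie-breaking rule are illustrative assumptions based on the description above, not Netflix's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class OCA:
    url: str                # playback URL handed back to the client
    healthy: bool           # health status reported to the control plane
    has_file: bool          # whether the requested title is cached on this appliance
    network_distance: int   # lower = closer to the client (e.g., embedded in the same ISP)

def pick_oca(candidates: list[OCA]) -> str | None:
    """Illustrative selection: the OCA must be healthy and hold the file;
    among eligible appliances, the one nearest the client wins."""
    eligible = [oca for oca in candidates if oca.healthy and oca.has_file]
    if not eligible:
        return None  # a real control plane would fall back to another region
    return min(eligible, key=lambda oca: oca.network_distance).url

# Example: an OCA embedded in the client's ISP beats one at a more distant IXP.
candidates = [
    OCA("https://oca-ixp.example/title", healthy=True, has_file=True, network_distance=2),
    OCA("https://oca-isp.example/title", healthy=True, has_file=True, network_distance=0),
    OCA("https://oca-ixp2.example/title", healthy=True, has_file=False, network_distance=1),
]
print(pick_oca(candidates))  # -> https://oca-isp.example/title
```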

Does the same process apply to livestream content?

Typical Netflix content, such as movies and TV shows, is essentially pre-loaded onto the regional OCAs, with each geographical location receiving the content appropriate to its audience. This fill takes place during a fill window scheduled outside of local peak hours. According to Netflix, this fill window is typically between 2 AM and 2 PM local time, and all OCAs within a provider must begin filling at the same time.
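As a rough illustration of how an off-peak fill window might be enforced, the short Python sketch below checks whether an OCA's local time falls inside the 2 AM to 2 PM window described above. The window boundaries come from that description; everything else is illustrative.

```python
from datetime import datetime, time

# Fill window boundaries described above, expressed in the OCA's local time.
FILL_START = time(2, 0)   # 2 AM
FILL_END = time(14, 0)    # 2 PM

def in_fill_window(local_now: datetime) -> bool:
    """True if new content may be pushed to the OCA at this local time (illustrative)."""
    return FILL_START <= local_now.time() < FILL_END

print(in_fill_window(datetime(2024, 11, 15, 3, 30)))  # True: 3:30 AM, off-peak
print(in_fill_window(datetime(2024, 11, 15, 20, 0)))  # False: 8 PM, peak viewing hours
```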

However, when dealing with a live stream, the source data needs to be processed and transcoded before being forwarded to the requesting OCAs in real time, or as close to it as possible.

So, what happened during the event?

ThousandEyes conducted tests across a distributed global network from various user locations to measure streaming performance from the Netflix caches (the OCAs), which typically reside within ISP networks. These tests gauge video download speed, video startup time, and TCP connection time (essentially, measures of latency) across a range of providers. These insights indicate how efficiently content can be delivered when requested by a user.
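For a sense of what such a test measures, here is a simplified Python sketch that times the TCP handshake, approximates startup time as time to first byte, and derives a rough download rate for a test URL. It is an illustrative approximation of these metrics, not the ThousandEyes agent, and the endpoint shown is a placeholder.

```python
import socket
import time
import urllib.request

def measure_stream(url: str, host: str, port: int = 443, chunk_bytes: int = 1_000_000) -> dict:
    """Rough, illustrative measurement of TCP connect time, startup time,
    and download throughput for a video segment URL."""
    # TCP connection time: duration of the three-way handshake only.
    t0 = time.monotonic()
    with socket.create_connection((host, port), timeout=10):
        tcp_connect_s = time.monotonic() - t0

    # Startup time (time to first byte) and download speed over one chunk.
    t1 = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        first_byte = resp.read(1)
        startup_s = time.monotonic() - t1
        body = resp.read(chunk_bytes)
    elapsed = time.monotonic() - t1
    download_mbps = (len(first_byte) + len(body)) * 8 / elapsed / 1e6
    return {"tcp_connect_s": tcp_connect_s, "startup_s": startup_s, "download_mbps": download_mbps}

# Placeholder endpoint; a real test would target the OCA URL returned to the client.
print(measure_stream("https://example.com/segment.ts", "example.com"))
```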

Between 12 and 1 AM (UTC) on November 16, or 7 to 8 PM (EST) on November 15, ThousandEyes’ tests revealed a significant increase in issues, which manifested as “Operation timed out with 0 bytes received” errors across multiple regions. This timing coincided with the start of the livestream and the beginning of the event at 1 AM (UTC) / 8 PM (EST), suggesting potential congestion issues on the OCAs serving the content. The situation improved considerably from 1 to 3 AM (UTC) / 8 to 10 PM (EST), during which error rates and timeouts appeared to decrease.

However, another notable spike in errors was observed between 3 and 4 AM (UTC) / 10 and 11 PM (EST), this time appearing as “0 bytes received” in addition to HTTP errors such as 500, 502, and 503. The presence of these 5xx errors indicates that the requested data or service was unavailable or could not be served, which could be due to congestion or delays in data transmission, potentially pointing to delay or load issues with the encoded data during delivery to the OCAs.
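To make the distinction between these failure modes concrete, a single test result could be bucketed roughly as shown below. The categories loosely mirror the errors described above; the function itself is illustrative and not ThousandEyes' actual classification scheme.

```python
def classify_result(status_code: int | None, bytes_received: int, timed_out: bool) -> str:
    """Bucket one streaming test outcome (illustrative only)."""
    if timed_out and bytes_received == 0:
        return "Operation timed out with 0 bytes received"  # nothing served before the timeout
    if status_code is not None and 500 <= status_code <= 599:
        return f"HTTP {status_code}"                         # service could not serve the request
    if bytes_received == 0:
        return "0 bytes received"                            # connected, but no content delivered
    return "OK"

print(classify_result(None, 0, timed_out=True))          # Operation timed out with 0 bytes received
print(classify_result(503, 0, timed_out=False))          # HTTP 503
print(classify_result(200, 1_000_000, timed_out=False))  # OK
```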

Was this felt consistently across regions?

Different regions appeared to experience the disruptions differently depending on the local time of day. For example, several Australian ISPs encountered issues throughout the broadcasts, while various U.S. ISPs saw increased error rates toward the start of the main event. However, these errors were not widespread across their entire service areas, with an average of 25% of tests reporting errors. Since no single provider or network path exhibited significant issues, the common factor appeared to be centered on the OCAs themselves.

Why did this disruption occur?

During peak demand, 25% of tests experienced errors.

When a test encounters a content error, it first attempts to reconnect. Most of the time, it then successfully streams at a lower bit rate after stepping down. However, if it is still unable to receive a positive confirmation from the OCA, it reports a NETWORK_CONTENT_ERROR. Notably, from 3 to 4 AM (UTC) / 10 to 11 PM (EST), around 25% of tests were unable to stream without stalling. This coincided with what would be perceived as peak demand during the main event.
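The reconnect-and-step-down behavior described above can be sketched roughly as follows. The bitrate ladder, callback signature, and result labels are illustrative assumptions drawn from that description, not the actual test client.

```python
from typing import Callable

# Illustrative bitrate ladder, highest first (kbps).
BITRATE_LADDER_KBPS = [8000, 4000, 2000, 1000]

def stream_with_step_down(try_stream: Callable[[int], bool]) -> str:
    """Reconnect and step down the bitrate until the stream succeeds.

    `try_stream(bitrate_kbps)` should return True on a successful stream
    and False (or raise a connection error) otherwise.
    """
    for attempt, bitrate in enumerate(BITRATE_LADDER_KBPS):
        try:
            if try_stream(bitrate):
                return "SUCCESS" if attempt == 0 else "SUCCESSFUL_AFTER_STEP_DOWN"
        except (ConnectionError, TimeoutError):
            pass  # reconnect at the next lower bitrate
    # No positive confirmation from the OCA at any bitrate.
    return "NETWORK_CONTENT_ERROR"

# Example: a congested cache that can only sustain the lowest bitrate.
print(stream_with_step_down(lambda kbps: kbps <= 1000))  # SUCCESSFUL_AFTER_STEP_DOWN
print(stream_with_step_down(lambda kbps: False))         # NETWORK_CONTENT_ERROR
```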

Connectivity was not the main issue.

There did not appear to be any significant changes in connect times, prebuffering metrics, or maximum streamable bitrates in the successful tests; it was largely an all-or-nothing situation. In other words, when a connection was established, data was streamed successfully, albeit in some cases at a reduced rate. There was a slight dip in download speed, but it was not significant enough to have caused the disruption, and most broadband providers generally provision well above what is necessary for streaming Netflix.

Congestion on the OCA was likely the cause.

In summary, the bottleneck appeared to be related to content availability as it was requested from the OCAs during the broadcast period. The fact that this issue didn’t occur across every test suggests it is less likely to have stemmed from problems in delivering content to the OCAs themselves. In other words, there was still a high proportion of successful stream requests, just at a lower bit rate, which may point to contention and congestion on the OCAs. Initial signs of problems appeared shortly after midnight (UTC) and peaked between 3 and 4 AM (UTC) on November 16, or 10 to 11 PM (EST) on November 15, with the greater proportion of errors observed in the U.S., which, given the broadcast time, likely had the largest concentration of users.

This time-of-day factor was observed in both the Australian and New Zealand tests and coincided with the start of the broadcast. Additionally, slightly elevated rates of "Successful After Step-Down" tests were noted. The highest rate of content errors appeared between 3 and 4 AM (UTC) / 10 and 11 PM (EST), followed by an increase in "Successful After Step-Down" instances from 4 to 6 AM (UTC) / 11 PM to 1 AM (EST). Performance returned to normal after 7 AM (UTC) / 2 AM (EST). This is significant because it coincided with local content windows for this time zone, which may have further contributed to congestion conditions for users in this region.

Figure 1. Highest error rate percentage observed between 3 and 4 AM (UTC) / 10 and 11 PM (EST)

The observed errors suggest that congestion and content availability were the most likely causes of the degradation. However, it is important to note that during the outage, no systemic network problems were observed globally or with individual providers. In conclusion, the situation seems to indicate a combination of congestion at the OCAs, as suggested by the rates of “Successful After Step-Down,” along with a significant increase in content errors during peak broadcast hours.

Lessons Learned

Applications are rarely discrete these days. Instead, we are looking at a service composed of a combination of applications, protocols, functions, and dependencies. Each component must interact and operate seamlessly with the others to deliver the best digital experience. It is important to understand these components along with the characteristics and objectives of the complete service delivery chain. This understanding helps us determine the most relevant signals and vantage points to monitor, enabling us to identify not only when something goes awry but also, more importantly, potential areas for optimization.

While the time of day is always an important consideration, other factors—such as change windows and backups—may negatively impact the ability to assure a digital service. The actual characteristics of the service may require a different schedule or playbook. In this case, while a broadcast may be available for replay, the primary selling point was a simultaneous live global broadcast of a sporting event. The requirements for delivery, beyond the normal service provided, changed accordingly.

This doesn’t necessarily mandate a requirement to change the architecture; after all, it is impractical to scale out specifically for isolated events. However, understanding these characteristics and requirements measured against a baseline allows for informed decisions about what actions or processes should be implemented and ultimately executed.
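As a simple illustration of measuring against a baseline, the sketch below flags metrics that drift beyond a set tolerance from their normal values; the metric names, baseline figures, and threshold are assumptions for illustration only.

```python
# Baseline values captured during normal operation (illustrative numbers).
BASELINE = {"error_rate_pct": 2.0, "startup_s": 1.2, "download_mbps": 25.0}

def deviations(current: dict, tolerance: float = 0.25) -> dict:
    """Return metrics that drift more than `tolerance` (as a fraction) from baseline."""
    flagged = {}
    for metric, base in BASELINE.items():
        value = current.get(metric)
        if value is None or base == 0:
            continue
        drift = abs(value - base) / base
        if drift > tolerance:
            flagged[metric] = {"baseline": base, "current": value, "drift": round(drift, 2)}
    return flagged

# During the peak of a live event, an elevated error rate stands out immediately.
print(deviations({"error_rate_pct": 25.0, "startup_s": 1.3, "download_mbps": 22.0}))
```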


For more insight like this, follow The Internet Report podcast. We’ll keep you up-to-date on the latest Internet outages and relevant news.

