This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On this week’s episode, we discuss a recent BGP-related outage at a major public cloud provider, as well as the news that Cogent Communications has rolled out RPKI in an effort to strengthen its BGP route security. We’re also joined by Kemal Sanjta, principal engineer on our customer success team and our resident expert on Internet routing and security, to chat about these events. Catch this week’s episode here to dive into BGP with us.
Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.
Show Links:
- FCC has “serious doubts” that SpaceX can deliver latencies under 100ms
- Ajit Pai doubts Elon Musk’s SpaceX broadband-latency claims
- Elon Musk on Twitter
- Cisco Live Session: Expanding the Internet for the Future, Today: Supporting First Responders and Society at Large - DLBPOS-17
- NAMEX Session: Panel on Internet Exchange Points and Internet Resilience
Catch up on past episodes of The Internet Report here.
Follow Along with the Transcript
Angelique Medina:
Hello everyone. This is The Internet Report, where we uncover what's working and what's breaking on the Internet, and why. The big story from last week was IBM Cloud's global outage, which prevented its customers, and its customers' customers, from reaching its services. The cloud-wide outage took place during normal business hours in the United States and lasted approximately two hours. In other news, Cogent Communications jumped on the BGP security bandwagon and successfully rolled out RPKI. It's now filtering invalid announcements for routes covered by ROAs. This is a big deal because Cogent is a major transit provider, particularly in the US, and we've seen in the past how its failure to filter illegitimate announcements has led to major outages, such as when it propagated routes that the co-location provider Safe Host leaked to China Telecom, which led to some traffic destined for Facebook getting blackholed.
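For readers who want to see what that filtering means in practice, here's a minimal sketch of RPKI route origin validation, following the valid/invalid/not-found semantics of RFC 6811. The ROA data, prefixes, and ASNs below are purely illustrative (documentation prefixes and private-use ASNs), hard-coded rather than pulled from a live validator, and this is a conceptual sketch rather than any particular vendor's implementation.

```python
# Minimal sketch of RPKI route origin validation (RFC 6811 semantics),
# using illustrative, hard-coded ROA data rather than a live validator feed.
import ipaddress
from dataclasses import dataclass

@dataclass
class ROA:
    prefix: str      # e.g. "203.0.113.0/24"
    max_length: int  # longest more-specific the prefix holder authorizes
    origin_asn: int  # ASN authorized to originate the prefix

def validate(announced_prefix: str, origin_asn: int, roas: list[ROA]) -> str:
    """Return 'valid', 'invalid', or 'not-found' for a BGP announcement."""
    announced = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Does this ROA cover the announced prefix at all?
        if announced.subnet_of(roa_net):
            covered = True
            # To match, the origin must be authorized and the announcement
            # must not be more specific than the ROA's max length.
            if origin_asn == roa.origin_asn and announced.prefixlen <= roa.max_length:
                return "valid"
    return "invalid" if covered else "not-found"

# Hypothetical example: documentation prefix, private-use ASNs.
roas = [ROA("203.0.113.0/24", 24, 64500)]
print(validate("203.0.113.0/24", 64500, roas))   # valid
print(validate("203.0.113.0/25", 64500, roas))   # invalid (exceeds maxLength)
print(validate("203.0.113.0/24", 64511, roas))   # invalid (wrong origin)
print(validate("198.51.100.0/24", 64500, roas))  # not-found (no covering ROA)
```

An announcement is "valid" only if a covering ROA authorizes both the origin ASN and the prefix length; a leaked more-specific or a wrong origin lands in "invalid," which is exactly what a transit provider doing origin validation can now drop.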
Angelique Medina:
Finally, the FCC is relenting somewhat in its dust-up with SpaceX over the company's attempt to have its satellite Internet service, Starlink, classified as "low latency." The FCC will now allow SpaceX to proceed in making its claim that its service belongs in that performance tier. This will go a long way toward enabling SpaceX to competitively bid for a piece of the $16 billion, spread over 10 years, that's up for grabs to deliver Internet service to rural parts of the United States. This is by no means a complete victory for SpaceX, as the FCC is still highly skeptical of its performance claims and laid out its doubts in its response to the company. Is the FCC saving rural Americans from notoriously slow satellite connectivity, or depriving them of a fast and futuristic Internet service that uses peer-to-peer-connected low-orbit satellites and a proprietary protocol that, to quote Elon Musk, "will be much simpler than IPv6 and have a smaller packet overhead"? More on that later in the show. But first up, we're going to go under the hood on the IBM Cloud outage and show you exactly how it all unfolded.
Angelique Medina:
Last Tuesday, June 9th, between approximately 5:50 PM and 8:20 PM Eastern time, IBM Cloud experienced a network-wide outage that disrupted the reachability of services hosted within the public cloud provider. The impact of the outage was global, preventing users from reaching hosted services regardless of where they were located. ThousandEyes observed the outage from hundreds of external vantage points distributed around the world. Almost simultaneously, these vantage points were no longer able to reach IBM Cloud-hosted services. Looking deeper into the underlying issue, we can see high levels of packet loss indicating that the disruption was network-related.
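As a rough illustration of the measurement idea (and not how ThousandEyes agents actually work), the sketch below estimates a failure rate from a single vantage point by repeatedly attempting TCP connections to a target. A real platform does this from many distributed agents and also measures per-hop packet loss along the path; the target hostname here is a placeholder.

```python
# Toy single-vantage sketch: repeatedly attempt TCP connections to a service
# and report the failure rate as a coarse proxy for reachability. This is not
# true per-packet loss measurement, and real monitoring runs from many
# distributed vantage points.
import socket

def failure_rate(host: str, port: int = 443, attempts: int = 20, timeout: float = 2.0) -> float:
    failures = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection succeeded; close immediately
        except OSError:
            failures += 1  # timeout, refusal, or DNS failure all count
    return failures / attempts

# Placeholder target; substitute the service endpoint you care about.
print(f"estimated failure rate: {failure_rate('example.com'):.0%}")
```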
Angelique Medina:
The packet loss appeared to primarily occur on infrastructure operated by SoftLayer Technologies, which IBM acquired in 2013 and later folded into IBM Cloud. Interestingly, traffic destined for IBM Cloud was not getting dropped, while traffic from the destination back to user locations was either completely or partially dropped within IBM Cloud's network. By approximately 8:20 PM, network traffic began to flow normally, with service reachability restored almost simultaneously across global regions. The last major cloud network outage also took place in June. On June 2nd last year, GCP unintentionally executed a maintenance update for all of the server clusters controlling parts of its US network. As a result, the impacted network infrastructure was essentially headless and no longer able to route traffic internally. Even though the infrastructure itself was fine, it had no way of knowing where to send traffic because its control plane was offline. Traffic inbound to affected parts of GCP's network was subsequently dropped at its edge, and the available parts of its US network experienced congestion as they were forced to take on greater traffic loads.
Angelique Medina:
The GCP outage last year took approximately four hours to resolve, while IBM Cloud was able to restore network service in just over two hours. There was no indication that the IBM Cloud issue was caused by a similar internal control plane failure. The fact that traffic appeared to route properly to internal destinations from external networks during the incident supports that assessment. It also excludes the possibility of a BGP hijack, which involves announcing a prefix that doesn't belong to you, since in this instance traffic was still getting routed appropriately to IBM Cloud. Early indications based on IBM Cloud's public statements point to a BGP route leak from one of its peers and, potentially, issues with a third-party networking partner. As we've covered in many past reports, BGP incidents can have a wide-reaching impact on the reachability of services over the Internet.
Angelique Medina:
In the IBM Cloud incident, three factors point to a BGP route leak or mishap: 1) the global impact; 2) the high, but not always one hundred percent, packet loss, indicating traffic constraint or impedance; and 3) the fact that traffic egressing, rather than ingressing, the cloud provider was getting dropped. Such is the delicate nature of delivering services over the Internet. Next up, Archana Kesavan sat down with our guest this week to discuss Internet route security.
Archana Kesavan:
Thank you, Angelique. This week's guest is Kemal Sanjta, Principal Engineer on our Customer Success team and our resident expert on Internet routing and security. Before ThousandEyes, Kemal held network engineering roles at Amazon and Facebook. Kemal, thank you so much for being on the show.
Kemal Sanjta:
Thanks for having me. It's my pleasure.
Archana Kesavan:
So Kemal, in light of some of these recent outages, one common theme that emerges is the fragility of the Internet, right? A massive network of networks delicately tied together by this fascinating protocol, BGP, which from what I hear can wreak havoc on Internet businesses and digital businesses if not handled properly. Why is that the case? What about BGP makes the Internet so fragile?
Kemal Sanjta:
Well, the protocol itself was invented at a time when security and security concerns were not the focus; the focus was on making it work. And similarly to how they didn't believe the Internet was going to take off, which resulted in the exhaustion of the IPv4 address space and the later invention of IPv6, they did not envision that security might be a big concern. In general, there's a lot of inherent trust between the operators participating on the Internet, as a result of the fact that there are no enforced security mechanisms.
Archana Kesavan:
Interesting. So how does the Internet, right, and this inherent trust system, or maybe I should say lack thereof, negatively impact online businesses today?
Kemal Sanjta:
Well, there are two different ways: the first one is route leaks, and the second one is hijacks. While they are quite similar in what they actually result in, they are pretty distinct. RFC 7908 defines a route leak as the propagation of routing announcements beyond their intended scope. Quite often, those are the unintended consequence of a configuration change on an edge router that went badly, or they are an effect of devices such as BGP optimizers, which are basically traffic engineering devices that are supposed to make better routing decisions internally within the ASN. When those advertisements leak out to upstream providers and get propagated further, we have quite negative effects. By contrast, BGP hijacks are malicious attempts that have very similar effects.
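A quick aside on why leaked more-specifics, such as those generated by BGP optimizers, are so disruptive: routers forward traffic on the longest matching prefix, so a leaked /24 beats the legitimate, shorter aggregate wherever it propagates. The sketch below uses illustrative documentation prefixes and private-use ASNs and is not modeled on any specific incident.

```python
# Why a leaked more-specific steals traffic: forwarding uses the longest
# matching prefix, so an errant /24 wins over the legitimate aggregate.
import ipaddress

# Illustrative routing table (documentation prefixes, private-use ASNs).
routes = {
    "203.0.113.0/24": "AS64511 (leaked more-specific)",
    "203.0.112.0/22": "AS64500 (legitimate aggregate)",
}

def best_route(destination: str) -> str:
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), owner)
        for prefix, owner in routes.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    # Longest prefix (largest prefix length) wins.
    net, owner = max(matches, key=lambda m: m[0].prefixlen)
    return f"{destination} -> {net} via {owner}"

print(best_route("203.0.113.10"))  # follows the leaked /24
print(best_route("203.0.112.10"))  # still follows the legitimate /22
```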
Archana Kesavan:
So what are some cases of route leaks going bad, right? Something you said was that an edge router within an ISP could have a misconfiguration, and that can impact the services that depend on that ISP. So what are some cases of route leaks going bad?
Kemal Sanjta:
Well, I believe the most popular example would be the one from 2008, in which Pakistan's national telecom advertised IP address space that belonged to YouTube. As a result, YouTube traffic was redirected to Pakistan, and unfortunately they didn't have the capacity or resources to serve the content that was requested. So basically, once the prefix was propagated, YouTube was pretty much inaccessible for its users. That's an example of an upstream leak, but more recently, Cloudflare had a quite big issue as a result of Verizon propagating more-specific prefixes for other networks that were generated by a BGP optimizer owned by a quite small company. That particular route leak caused quite big problems for Cloudflare users.
Archana Kesavan:
Those are some really powerful examples, right? Because in both cases, we saw large swaths of the Internet effectively shut down because of these route leaks. What are some of the most common myths about BGP route leaks and hijacks?
Kemal Sanjta:
There's a misconception that all BGP route leaks are malicious. While they have quite negative effects, they quite often happen, as previously mentioned, as a result of configuration mistakes causing unintended effects. Companies at which those happen quite often get a lot of negative press, take a financial hit, and so on, right? So it's not just that they are causing a negative effect for others; they are causing negative effects for themselves. And as previously mentioned, these are usually just configuration mistakes.
Archana Kesavan:
Cool, that was helpful. Thanks, Kemal. One final question before we wrap up: what is your biggest BGP pet peeve?
Kemal Sanjta:
I would say the fact that RPKI adoption is going so slowly. RPKI has quite big benefits and, so far, a quite proven track record, right? Operators such as AT&T and KPN implemented it even before the previously mentioned event that affected Cloudflare, and during that event they did not drop Cloudflare's traffic, which means it clearly worked. However, even with all of that, with this quite proven track record, we still quite often need to push some of the tier-one providers to implement it, right? So for example, Cloudflare, and I believe it was called out on the show, started a webpage on which they are calling out the providers that haven't implemented it. I don't think that should have to be the case, right?
Archana Kesavan:
But there is progress: last week we saw Cogent Communications jump on the BGP security bandwagon and successfully roll out RPKI. So hopefully this becomes the norm rather than the exception going forward.
Kemal Sanjta:
I fully agree, and there's hope that it's going to happen, that broader adoption is going to happen.
Archana Kesavan:
Right. With that, Kemal, thank you as always. You've been absolutely great to host, and thanks for all the knowledge you shared.
Kemal Sanjta:
Thank you very much.
Archana Kesavan:
This week, we covered a major cloud outage, RPKI adoption by a large transit provider, and SpaceX's battle with the FCC. What should you take away from last week, and what should you pay attention to this week? Angelique and I give you our top Internet takeaways. Outages on the Internet and in the cloud are inevitable. If you're moving to the cloud and relying on the Internet now more than ever, factor outages into your application design and architect for resiliency, recovery, and redundancy. Wondering how to contribute toward making BGP safe? Check out Cloudflare's isbgpsafeyet.com and see if your ISP is RPKI compliant. Educate and share with your community to build a safer Internet.
Archana Kesavan:
What am I looking forward to this week? There are two events that particularly caught my attention. One is Cisco Live, a two-day virtual event on the 16th and 17th of June. In particular, I'm looking forward to the session on June 16th at 1:00 PM on expanding the Internet, which involves AT&T, Cox Communications, Verizon, and Facebook. The second event is a much smaller-scale digital event, again virtual, put together by the Nautilus Mediterranean eXchange, known as NAMEX. On the 18th, they have a panel on Internet exchange points and resiliency. So that's it. That's what I'll be busy doing this week.
Angelique Medina:
My first takeaway comes from one of our headlines earlier. It has to do with SpaceX's claim that it can achieve ground-to-satellite round-trip times of approximately 20 milliseconds. Even if that were the case, Internet performance is much more complex than the results of a simple ping test, as the bit of back-of-the-envelope math below illustrates. Most Internet users are interested in content: Netflix, gaming, and so on. Traditional ISPs have learned that content is king, and they've spent years working with content providers, whether they be cloud providers, gaming companies, or media companies, to optimize content delivery. So unless SpaceX tucks Netflix servers onto the underside of its satellites, it's going to be hard-pressed to deliver Internet service that is meaningful from a user's perspective. My second takeaway is just a suggestion on what to tune in for this week. Paul Vixie, who is a DNS luminary and a primary author of BIND, will be speaking at NAMEX this week in a mysteriously titled session, "The Art of the Impossible." Those of you who follow the DNS community closely may wonder if the title should have been "The Art of Being Impossible." Either way, it's something I plan to tune into, and I recommend you do as well.
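To put that 20-millisecond figure in context, here is the back-of-the-envelope math. The orbital altitude (roughly 550 km) is our assumption rather than a number from the episode, and the calculation covers only straight-line propagation delay; real round trips add longer slant paths, ground segments, routing, queuing, and processing.

```python
# Back-of-the-envelope propagation delay for a low-Earth-orbit hop.
# Assumptions (not from the episode): ~550 km altitude, satellite directly
# overhead, vacuum speed of light. Real paths are longer and add routing,
# queuing, and processing delay on top.
SPEED_OF_LIGHT_KM_S = 299_792
LEO_ALTITUDE_KM = 550
GEO_ALTITUDE_KM = 35_786  # traditional geostationary satellites, for contrast

leo_one_way_ms = LEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S * 1000
leo_rtt_ms = 2 * leo_one_way_ms  # up to the satellite and back down
geo_rtt_ms = 2 * GEO_ALTITUDE_KM / SPEED_OF_LIGHT_KM_S * 1000

print(f"LEO one-way ground-to-satellite: {leo_one_way_ms:.2f} ms")  # ~1.8 ms
print(f"LEO ground-satellite-ground RTT: {leo_rtt_ms:.2f} ms")      # ~3.7 ms
print(f"GEO ground-satellite-ground RTT: {geo_rtt_ms:.0f} ms")      # ~239 ms
```

Even a near-ideal LEO hop leaves most of a real-world round trip to everything that happens on the ground, which is the point about content delivery above.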
Angelique Medina:
That's our show. Please remember to subscribe and follow us on Twitter. And as always, if you have any questions, or feedback, or guests that you'd like to see featured, feel free to drop us a note at internetreport@thousandeyes.com. That's also where you can claim the free t-shirt if you've subscribed to the show. Just send us your address and your t-shirt size, and we'll get that right over for you. Also, stay tuned over the next few days as we're going to be reopening registration for our virtual summit, The State of the Internet, which is now scheduled for July 16th. We have a great lineup of speakers scheduled, and we'd love to see you there. Until next time, have a great week.