An IXP And A Streaming Music Provider Walk Into An Outage Bar

Watch on YouTube - The Internet Report - Ep. 20: Aug 17 – Aug 23, 2020

This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet, and why. On this week’s episode, Archana and I cover some recent outages that made headlines. This includes the Spotify outage, caused by an expired TLS certificate, that prevented users from accessing its platform. We also cover a widespread outage at Cogent during (what seems to be) a maintenance window. Then, we go “under the hood” on the prolonged outage at an IXP on August 18th to understand exactly what infrastructure was affected and which downstream providers were subsequently impacted. We’re also joined by our guest, Prabhnit Singh, who currently leads ThousandEyes’ Internet & WAN product line, to discuss why we’re seeing an increased number of outages caused by expired TLS certificates and to cover some examples of past high-profile outages.

Show Links

Follow along and explore this week's outage analysis within ThousandEyes -- no login required!

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.

Follow Along with the Transcript

Angelique Medina:
This is the Internet Report, where we uncover what's working and what's breaking on the internet, and why.

Angelique Medina:
This was an interesting week last week because we had a number of fairly high-profile incidents that impacted not only consumers, but also business users, as well. So there was a pretty significant Spotify outage on August 19th, which was a Wednesday. And that happened fairly early in the morning from the standpoint of Eastern Time. And so there were a lot of users, for example, in the UK, that complained about not being able to use the service.

Archana Kesavan:
Right. And again, the reason we only saw a lot of users in the UK complain was because it was fairly early, but the outage actually was more global in nature. And, as observed by some interesting customers, it looks like there was an expired TLS certificate that actually resulted in the outage.

Angelique Medina:
Yeah, absolutely. And we have a great interview later on in the show with someone who's going to talk around TLS certificates, some trends there, and why we're seeing more outage incidents that are being triggered from those TLS certificates expiring. So that's going to be really interesting.

Angelique Medina:
I also thought another interesting thing about this particular event, as you brought up, which is that we saw, for example, if you were to go to Downdetector, you saw that a lot of the complaints were showing the hotspot to be the UK and then maybe a little bit on the East Coast. And that's, again, it's like one of these things, it's a little misleading because the outage was really widespread and it was really just that users were noticing and complaining about the incident in those areas because of the time zone they were in, that they were actually up and awake during that time.

Archana Kesavan:
Right. And again, one of the interesting things that we noticed is once we got wind of the outage, and actually after, we started monitoring a particular domain, and interestingly, we actually did see a certificate renewal the morning of August 19th. So while Spotify has not come out with a formal root cause, we were able to confirm what some of their customers speculated by looking at the platform.

Angelique Medina:
Yeah. And considering some of the incidents that we've seen in the past, this was a fairly quick resolution. They were able to effectively turn around this update pretty quick.

Archana Kesavan:
And that is something interesting our guest speaker today is going to talk about as well. So definitely stay tuned for the expert section today.

Angelique Medina:
Awesome. And then another one that really impacted a lot of service providers, and then of course, because it impacted service providers, it would have had a downstream impact on users, but this was a very prolonged outage at Equinix's facility in London, their Docklands facility. Apparently there was a power outage that impacted both of their supposedly independent sources of power, so they lost both feeds. This lasted like the whole day.

Archana Kesavan:
Almost 12 hours. Yeah.

Angelique Medina:
Yeah. Yeah, so between 12 and 15 hours. Not all of their customers were impacted in the same way, which is something we'll talk about in a bit when we get into this a little bit more, but that was a pretty significant incident that happened on August 18th, starting in the early hours of the morning.

Archana Kesavan:
And talking about ISP outages, we had another, yet another Cogent outage on Wednesday, August 19th. It happened pretty late in the night, so we didn't really see an impact of the outage because it happened around, I think like 10:50 Pacific. So it was pretty much across in LA, Boston. So we saw it widespread from the perspective of areas that were impacted, but because it was so late in the night, we did not necessarily see any user impact. Most likely a maintenance window given the timeframe.

Angelique Medina:
Yeah. Yeah. And we see that fairly frequently with a large transit provider like Cogent, when they make changes to their network and they do have these maintenance events, they will simultaneously impact a lot of different locations that are really distributed. So oftentimes, not only the US, but their EMEA infrastructure will be impacted as well. So that was something that we saw here, but just interesting to note kind of how they operate effectively, right?

Archana Kesavan:
Right.

Angelique Medina:
And if you're in Europe or if you're in the US, you can expect kind of a similar time frame in terms of when maintenance is done.

Archana Kesavan:
All right. So we're going under the hood on the Equinix outage. We saw some really interesting shares. Like Angelique was saying, this was a really long outage, like almost 12 to 15 hours, and we see an array of ISPs impacted. So yeah, let's get into that really quick now.

Angelique Medina:
Yeah. So we're looking just kind of at a tweet that was sent out by Equinix regarding the outage. Again, it took place at their Docklands facility LD8, and it was a power outage. And so it impacted their customers there, as well as other providers, including other exchange providers, like LINX also reported that they were impacted as well because they maintain infrastructure within this facility and also peer and offer various services through that facility as well.

Angelique Medina:
What's interesting is that at the time of the outage, so we're looking here kind of around 3:25 UTC, which would have been around 3:20, or, excuse me, 4:20 AM BST.

Archana Kesavan:
4:20 AM BST. Yeah.

Angelique Medina:
So this was basically at the start of the outage, we see, just simultaneously, a number of different service providers, for example, TeliaNet, we see Cogent, NTT, and Level 3, and if we drop down into a little more detail on each one of these, we can see, for example, this is the infrastructure that is in the UK, and the same with Cogent as well. So obviously they have a lot of infrastructure, and again, infrastructure within the UK is impacted. A good contrast to what we mentioned earlier in terms of when Cogent has a planned incident, it's usually distributed across a number of different sites. And we don't see that here. We really see that this is contained to kind of the blast radius of this Equinix power outage.

Archana Kesavan:
And actually, that's a really good point that you mentioned, because from a timing perspective, this did happen, local time, pretty early on, right? Like 4:30. So while one could think that this is a maintenance window, another way to kind of characterize that kind of level set is it's not that distributed, right?

Angelique Medina:
Yeah. Yeah. Yeah, and it's also useful to kind of look at this incident within the context of like, what do their maintenance windows typically look like? When do they typically take place? And this doesn't look like that at all, just based on what we've seen with Cogent in the past. So having that context about even individual providers and what their typical operations look like is really useful.

Angelique Medina:
And then, looking at NTT, again, the UK, so this is ...

Archana Kesavan:
What I really like about this snapshot, Angelique, is it shows kind of not just the breadth of the outages, but like in one place, you can see all the providers that are impacted, right? And that typically happens at like a colo like Equinix and in a location like London, that impact is like really widespread.

Angelique Medina:
Absolutely. Yeah. And this is just one particular interval of time. We also picked up lots of kind of packet loss and an outage for British Telecom and other service providers as well, so not just the ones we're seeing here.

Angelique Medina:
But it's interesting because there were some of the customers of Equinix who reported that they were affected by this outage. It didn't look like every one of their customers was impacted the same way. Some of them lost connectivity to their infrastructure across the board, and then others, just some of their racks, for example. And so we can see here, for instance, that some service providers and users connecting from certain locations, we only saw the effect for a very brief period of time at the onset of the incident, right?

Archana Kesavan:
Yeah, you're right. And then if you filter here on Northampton, right, just to kind of reduce the scope of what you're seeing here, that red dot, which is a hundred percent packet loss, is very clearly seen in the Equinix facility. And to your point, this looks like a blip relative to how long this outage lasted, and this could be because not all providers were affected, not all power was impacted.
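
The difference between a brief blip and a prolonged outage can be made concrete by grouping consecutive 100% packet loss samples into windows. The following is a minimal sketch under stated assumptions: the sample data, the five-minute interval, and the function name `outage_windows` are hypothetical and are not taken from the episode's measurements or from any ThousandEyes API.

```python
from datetime import datetime, timedelta, timezone


def outage_windows(samples, threshold=100.0):
    """Group consecutive samples at or above `threshold` into (start, end) windows.

    `samples` is a time-ordered list of (timestamp, loss_percent) tuples.
    """
    windows = []
    run_start = None
    last_ts = None
    for ts, loss in samples:
        if loss >= threshold:
            if run_start is None:
                run_start = ts  # first lossy sample opens a window
            last_ts = ts
        elif run_start is not None:
            windows.append((run_start, last_ts))  # a clean sample closes it
            run_start = None
    if run_start is not None:
        windows.append((run_start, last_ts))  # series ended mid-outage
    return windows


# Hypothetical loss readings taken every five minutes (not real data).
first = datetime(2020, 8, 18, 3, 20, tzinfo=timezone.utc)
losses = [0, 100, 0, 0, 100, 100, 100, 0]
samples = [(first + timedelta(minutes=5 * i), loss) for i, loss in enumerate(losses)]

windows = outage_windows(samples)
for start, end in windows:
    print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))
```

A window whose start and end are the same sample corresponds to the kind of blip described here, while a window spanning many samples corresponds to a sustained loss of connectivity.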

Angelique Medina:
Right. And it may also be that some of the providers were able to recover from having that particular facility go down. And that's also something interesting to kind of bear in mind if you're an enterprise and you see that certain service providers are maybe more resilient in the face of say kind of a peering connection going down, that might be something also to consider when you kind of manage and evaluate your vendors.

Archana Kesavan:
And then from a backup perspective, right? Not just from like where you're located, you have redundancy …

Angelique Medina:
Right. Yeah. Yeah.

Archana Kesavan:
But even from a power backup perspective, the ones who were really affected were really unfortunately unlucky because they lost both the primary and the backup, right? So that's why the recovery took long. I think this is an example where actually you see the impact for that 12 hours.

Angelique Medina:
Well, it's also interesting, too, because I think that brings up a question of what is resiliency and talking to your providers, because it sounded like, based on some of the statements of their customers, I think it was GigaNet that put out a statement, again, they mentioned like both the A and B power sources. Well, typically those, for resiliency purposes, are completely independent of one another, and if one goes down, the other is meant to not be affected in the same way, but they were both affected. So I think there are a lot of questions that this brings up in terms of how resilient were they really in the first place?

Angelique Medina:
So this was interesting because we also saw, again, across a broad set of providers, Level 3, Cogent, and others, that the outage was very prolonged. So in contrast to that little blip where some users were only impacted for a little while, we see here starting at around 3:25 UTC, which is 4:25 British Summer Time, which is exactly at the start of the outage, so there's a really nice marker here, we see this a hundred percent packet loss, and this continues for hours. And we see that it doesn't resolve until like 12 hours later, right? So that is really interesting that it was that prolonged for some providers, for some users connecting through that service.
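
The UTC-to-local conversion mentioned here can be checked with a few lines of Python. This is a small sketch that assumes UK local time in August is UTC+1 (BST) and models it as a fixed offset rather than a full time zone database lookup; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: UK clocks in August run one hour ahead of UTC (BST).
BST = timezone(timedelta(hours=1), "BST")

# Start of the packet loss as quoted in the discussion: 3:25 UTC on August 18th.
outage_start_utc = datetime(2020, 8, 18, 3, 25, tzinfo=timezone.utc)
outage_start_local = outage_start_utc.astimezone(BST)

print(outage_start_local.strftime("%Y-%m-%d %H:%M %Z"))  # 2020-08-18 04:25 BST
```

A fixed offset keeps the example self-contained; for dates across the whole year, a real time zone such as `Europe/London` would be needed to handle the switch between GMT and BST.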

Archana Kesavan:
I was going to say that, just talking about redundancy, there was this other outage I think a couple of months ago, on Google, I think GCP, which was also related to a power outage, but it then raised that question of redundancy because it affected like multiple availability zones in one specific region of theirs. So if you're thinking about redundancy, I mean, you have to think about it from various different perspectives while you're architecting your application, not just from at the app level, but the physical infrastructure and then your peerings. So it's not an easy task.

Angelique Medina:
No. No. Yeah, there's a lot of hidden dependencies too. And sometimes, just knowing who your providers' providers are is useful as well. Because, for example, LINX said that their own customers were impacted because LINX had a dependency on Equinix. So it can really be like this kind of nesting doll of dependencies and you kind of have to understand what your exposure is. So you certainly have to go a few degrees deep.

Archana Kesavan:
Deeper. Yeah.

Angelique Medina:
So again, this was pretty interesting. Equinix said that by around 9:50 local time, it was completely resolved. There's some indications that it was resolved a lot earlier for customers. And we can see here that, in this particular incident, it lasted around 12 hours, which is, from an outage standpoint, pretty significant.

Archana Kesavan:
Pretty significant, especially because it's a working day, it's right in the middle of the day.

Angelique Medina:
Yeah, absolutely.

Archana Kesavan:
Yeah, it was impactful.

Angelique Medina:
Yeah, totally. So that was a really kind of interesting incident. Now we're going to kind of pivot and go back to the Spotify outage because there's a lot of lessons that can be gleaned from this particular incident. So Archana, you're going to go on a deep dive with Prab Singh, and you guys are going to talk a lot about interesting stuff related to different trends and different browsers, and kind of the evolution of TLS certs and what's changing and all of that. So stick around and catch that interview.

Archana Kesavan:
In our expert spotlight section, this week we have Prab Singh, senior product manager at ThousandEyes. Hey, Prab.

Prabhnit Singh:
Hi, Archana. How are you?

Archana Kesavan:
Doing good. And thanks for being on the show.

Prabhnit Singh:
Yeah. I'm really excited. Thanks for inviting me. I'm really excited to have a conversation on certificates.

Archana Kesavan:
Yeah. So for those of you who don't know Prab, he is our senior product manager at ThousandEyes, and he's a 10-year networking professional and is currently leading our Internet & WAN product line. And one of the reasons that Prab's on the show today is in relation to the outage that we just discussed, which happened last week, the Spotify outage that was related to a TLS certificate expiring. And Prab's been doing a lot of work in that area. So Prab, talking about like recovery rate, how long does it take to reissue a certificate and be back up and running?

Prabhnit Singh:
Yeah, I think it really depends. It can be pretty quick. I think most of the problems that we've seen around certificate expiration have really been around detection. Depending upon what type of application it is, if it's a client that users use to be able to access a service, it's a bit harder to detect as an end user that can essentially put this on Twitter and all of a sudden you know as a signal that there's a problem going on.

Prabhnit Singh:
So applications that are being accessed via clients may not have the error reporting capabilities that perhaps browsers do. With Apple or Safari or Chrome or Firefox, we all know that they're, especially if that comes up where a certificate is not trusted and do you really want to proceed, and then you have to kind of, the user really has to do that. And browsers are doing that because they want to protect users from these spoofing problems.

Prabhnit Singh:
So I think a lot of the time that's spent around certificate expirations and how do you remediate this issue is around detection, and how long it takes will depend on the application you access, whether it's browser-based or client-based. But the renewal process itself, if it's email-based validation and you have the right authority, can take as little as a few minutes to 30 minutes, where you go to the certificate authority's website, you generate a certificate signing request, you validate that you have an email and that domain gets sent an email validation, and then you install the new cert.

Archana Kesavan:
Okay. Yeah. What would your recommendations be for enterprises to kind of prevent this from happening?

Prabhnit Singh:
Yeah, I mean, I think overall, there are some short term benefits, right, that we can start to really put an automation in place that are not user centric, but almost service centric, where they're tied to an organization. And if there are renewal emails that are being sent, they're being sent to not particular users, but particular sort of mailers where users can be really alerted. Over time though, I think that's just like a bandaid. Over time, it would probably be moving towards automation, complete automation of certificates, where certificates are automatically renewed via vendors. Today, like we talked about, but even there's a certain point starting to provide those.

Prabhnit Singh:
So I think the overall recommendation is really going to be moving toward automation as much as possible and removing any manual intervention that is required that can cause blips like this on the Internet.

Archana Kesavan:
Right. That makes sense. All right, Prab, thanks so much for being on the show. This was really good information, so thanks again.

Prabhnit Singh:
Yeah. Appreciate it. Good chatting with you and good luck.

Archana Kesavan:
Thank you.

Angelique Medina:
That was a great interview, Archana. Lots of interesting stuff related to why we're seeing more of these incidents happening, some very high profile outages that have resulted from just not keeping your cert updated.

Archana Kesavan:
It seems like a very simple thing that IT teams should take care of, right?

Angelique Medina:
It is, yeah. Right.

Archana Kesavan:
Like you said, if it's expiring. And what was really interesting for me was how there is kind of like this tug of war between browsers trying to push something and then the certificate authority trying to say, "Once a year is too much, so we should do once in two years."

Angelique Medina:
Right.

Archana Kesavan:
Yeah. So that was interesting. And I think Prab also touched upon the global aspect of automation, right? Like how do you avoid this? And yeah, sure, you could make sure automation is in place and your certificate is constantly updated, but we are not there, and in the interim, what can you do to prevent this is to monitor and then to make sure you are not a victim to these outages.
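
The kind of monitoring described here can be as simple as checking how many days remain before a certificate's expiry date. The following is a minimal sketch using only the Python standard library; the function names, the 30-day warning threshold, and the example dates are assumptions made for illustration, not details from the Spotify incident or from any ThousandEyes product.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl getpeercert() results,
    for example "Jan 31 00:00:00 2030 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / SECONDS_PER_DAY


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to a server and return the notAfter field of its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Hypothetical dates, with "now" pinned so the result is reproducible.
pinned_now = ssl.cert_time_to_seconds("Jan 01 00:00:00 2030 GMT")
remaining = days_until_expiry("Jan 31 00:00:00 2030 GMT", now=pinned_now)
print(remaining)  # 30.0
```

In practice, a scheduled job could call `fetch_not_after` for each monitored hostname and alert when `needs_renewal` returns true. Because the default context verifies certificates, a certificate that has already expired makes the handshake fail with `ssl.SSLCertVerificationError`, which is itself a signal worth alerting on.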

Angelique Medina:
Absolutely. That's really critical. One other kind of interesting thing that you bring up there is that this is sort of another example of, in this case, the browser providers, so more of like the app level folks kind of dictating how the infrastructure needs to work. And they're really strong-arming. And we see that repeatedly from application providers, cloud providers, that are starting to really dictate how the Internet works.

Archana Kesavan:
Right. Right. And Safari started it, Chrome and Firefox followed. I mean, that's basically the majority of your browsers, right? So they do have …

Angelique Medina:
A lot of power.

Archana Kesavan:
The power to twist arms if they need to. So something similar to what we're seeing on the Apple and Epic Games that's going on.

Angelique Medina:
That's right. Yeah. So lots to watch. We'll keep the popcorn warm and see what happens.

Angelique Medina:
All right. Well, that's all we have time for this week, so don't forget to subscribe, and if you do subscribe, we have a little prize for you. Just send an email to InternetReport@thousandeyes.com with your address and your t-shirt size, and we have a really great t-shirt that we'll send out to you. And until next time, take care.

Archana Kesavan:
All right. Bye guys.
