The Internet Report

Ep. 10: It’s ALWAYS DNS!

By Angelique Medina
22 min read

Summary


Watch on YouTube - The Internet Report - Ep. 10: May 25 – May 31, 2020

On this week’s episode of the Internet Report, I’m joined by my colleague, Michael Batchelder (aka Binky), to discuss a DNS-related service disruption that affected users trying to access Amazon.com. We also talk about a recently discovered DNS vulnerability that could leave DNS providers susceptible to DNS amplification DDoS attacks. If you’re curious about what went wrong with Amazon’s service last week and want to know more about the role of DNS and why it’s so important, don’t miss this episode.

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.


Catch up on past episodes of The Internet Report here.

Follow Along with the Transcript

Angelique Medina:
Welcome to the Internet Report, where we unpack all of the interesting events, outages and news from last week on the Internet. We had a really interesting outage event that happened last week at Amazon.com, and we're going to cover that today with our special guest, Michael Batchelder. Michael is a principal customer engineer here at ThousandEyes, where he helps customers to monitor their networks and applications. He's kind of a nerd about DNS, so it's very good that we have him on the show this week. Prior to ThousandEyes, Michael worked at Jet Propulsion Laboratory and SolarCity, which is now part of Tesla, as well as F5 Networks. Michael is known to his friends and coworkers as Binky, and that is how I will refer to him today. So Binky, welcome.

Michael Batchelder:
Hi Angelique. Nice to be with you.

Angelique Medina:
So before we get started, just a little bit of housekeeping. For those of you who are not subscribed to the show, we're available on YouTube, and we're on all of the major podcast channels. So anywhere you like to subscribe to your podcasts, we're available. So make sure that you do that. And also just some news on the upcoming virtual summit we have scheduled for June 18th. We will very shortly be posting registration links to sign up for that. We have some amazing guests lined up, including Geoff Huston from APNIC. He'll be speaking along with speakers from CenturyLink, Akamai, Verizon Media and many others. So this is going to be a really interesting event to attend. It's a half-day event. So again, to register, we'll have a link in the show notes on our blog. So check that out. So on to the interesting event of the week, and that was, as I mentioned before, the Amazon website outage that happened. I believe this was on a Wednesday, and it happened about midday, so around, well, just after 12:00 PM Pacific time.

Michael Batchelder:
Right. So, actually, it was the Thursday rather than the Wednesday.

Angelique Medina:
Oh, excuse me. Yes, Thursday. So that was May 28th.

Michael Batchelder:
Yes.

Angelique Medina:
Yeah. So this happened on Thursday, and it lasted about 30 minutes. And anytime that a site with as much traffic as Amazon is offline for even a small period of time, it generates a lot of chatter. And that's what happened. We saw that there was suddenly a major spike in people reporting that the site was unavailable. And so we're going to talk about what happened and how this unfolded, because in fact, as Binky will show and pull up here, the website or the web servers themselves were available. So, the site itself was operational, but the problem was that users simply weren't able to reach the site because of a DNS resolution issue.

Michael Batchelder:
That's right. So as you can see here, displaying a ThousandEyes test to ‘www.Amazon.com’, this is a ThousandEyes page load test. So, it's essentially the same thing as would happen if you go to the URL in the browser. So we have agents, ThousandEyes agents all over the world, running this test to ‘https://www.Amazon.com’. And as of Thursday, May 28th at 3:00 PM Eastern time or 11:00 PM, sorry, 11:00 AM …

Angelique Medina:
… 12 PM.

Michael Batchelder:
12 PM Pacific time, sorry, everything seemed to be fine. There were a couple agents here that are reporting somewhat slower loading of the page, but that's not particularly problematic. Everything generally looks good. And then if I move forward in time slightly, we'll start to see problems. Some of the agents now are indicating that they are unable to load this URL, and I'll even jump one more round ahead in the test. And we'll see that every agent now is having a problem as the red dots are indicating. So if we look a little bit deeper, here's a list of all the agents running the test, and we'll take a look at what the error text shows us. It says “unable to resolve DNS,” and that's going to be the case for most of these. So everywhere from Hong Kong to Cape Town, South Africa, to Chicago, Illinois, we're going to see this error message, unable to resolve DNS.

Angelique Medina:
So DNS is important because that's effectively the first step to reaching a site. DNS is basically the translation between a human-readable name and an IP address. And so anytime that you want to reach a site like Amazon or Twitter, you need to do a lookup and you need to request a DNS record. And then when you get a response back, you can then connect to the IP address that you've received in the response. Now, what's interesting here, and if you go to HTTP server, you can also see kind of just this broad impact. It's pretty much across the board. This is a canonical or C name record, ‘www.Amazon.com’.
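
As a rough illustration of the lookup-then-connect sequence described above, here is a minimal Python sketch. It wraps the standard library's `socket.getaddrinfo` so that a failed lookup surfaces as an "unable to resolve DNS" error, similar in spirit to the message the agents reported. The `fake_resolver` function and the `shop.example` name and addresses are hypothetical placeholders so the example runs without touching the network; they are not Amazon's records or ThousandEyes code.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, unique IP addresses for hostname.

    Raises LookupError if the name cannot be resolved, which is the
    point at which a browser would give up before ever connecting.
    """
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        raise LookupError(f"unable to resolve DNS for {hostname}: {exc}") from exc
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({sockaddr[0] for *_, sockaddr in results})


def fake_resolver(hostname, port, proto=0):
    """Hypothetical stand-in for the system resolver (assumption, for offline use)."""
    table = {"shop.example": ["192.0.2.10", "192.0.2.11"]}
    if hostname not in table:
        raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")
    return [
        (socket.AF_INET, socket.SOCK_STREAM, proto, "", (ip, port))
        for ip in table[hostname]
    ]


# The apex name resolves; the www name has no record in this toy table.
print(resolve("shop.example", resolver=fake_resolver))
try:
    resolve("www.shop.example", resolver=fake_resolver)
except LookupError as err:
    print(err)
```

Calling `resolve("some.host")` without the `resolver` argument would use the real system resolver instead of the stand-in.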

Michael Batchelder:
Right. So why don't I bring out my terminal window and use a tool called Dig to display the DNS record details for ‘www.Amazon.com’. So what you can see is when a browser or, in this case my command line tool Dig, asks for the IP address of ‘www.Amazon.com’ in DNS, instead of getting an IP address back directly, it gets this DNS record called the C name. And there are four C names for ‘www.Amazon.com’. And they point to a couple of different DNS... They point to multiple DNS records, C name records.
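
To make the alias behavior concrete, the following Python sketch follows a chain of CNAME records until it reaches A records, which is roughly what a resolver does on the client's behalf. The zone contents are invented placeholders (the `shop.example` and `cdn.example` names and the documentation-range addresses are assumptions), and the in-memory dictionary stands in for real DNS queries.

```python
# Hypothetical records keyed by (name, record type). Not Amazon's real data.
ZONE = {
    ("www.shop.example", "CNAME"): ["edge.cdn.example"],
    ("edge.cdn.example", "CNAME"): ["pool-1.cdn.example"],
    ("pool-1.cdn.example", "A"): ["192.0.2.10", "192.0.2.11"],
    ("shop.example", "A"): ["198.51.100.5"],
}


def resolve_a(name, zone, max_hops=8):
    """Follow CNAME aliases from name until A records are found.

    Returns (chain, addresses), where chain lists every name visited.
    Raises LookupError if the chain dead-ends or is longer than max_hops.
    """
    chain = [name]
    for _ in range(max_hops):
        current = chain[-1]
        addresses = zone.get((current, "A"))
        if addresses:
            return chain, addresses
        alias = zone.get((current, "CNAME"))
        if not alias:
            raise LookupError(
                f"unable to resolve DNS: no A or CNAME record for {current}"
            )
        chain.append(alias[0])
    raise LookupError(f"CNAME chain from {name} exceeds {max_hops} hops")


chain, addresses = resolve_a("www.shop.example", ZONE)
print(" -> ".join(chain), addresses)
```

If any link in the chain is missing or misconfigured, the lookup fails even though the servers behind the final addresses are healthy.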

Angelique Medina:
Right. We have Akamai, we have Amazon themselves, which is either CloudFront, but also possibly hosted by their Route 53 DNS service. What's interesting is that, so at the time of the outage, the DNS record was... Basically, there was no response to the DNS request. At the same time, we didn't hear from Amazon’s Route 53 customers or from Akamai's customers about them having any similar issues. So it wasn't that there was an infrastructure issue where the DNS servers weren't available. It appeared to be something related to the configuration of the record itself or something along those lines. Because both Akamai and Route 53 services were up.

Michael Batchelder:
Correct. So if we had tests to the actual DNS servers in Akamai or Route 53, we wouldn't see any problems with the network getting to them. We wouldn't see any problems with the server generally responding. This appears to be an issue specifically with the configuration of the ‘Amazon.com’ DNS zone, and maybe more specifically with a few records in that zone.

Angelique Medina:
Well, it really specifically seemed to be a ‘www.Amazon.com’ because at the same time that this was happening in looking at the availability of the ‘Amazon.com’ A record, so that's not a C, that's an A record, their apex domain, that was... And actually, if we just look at the trace test first, we can see that there is in fact, you are able to get a mapping to IP addresses, and this is the exact same time, that same period in which there was no response for the C name record ‘www.Amazon.com’, the other record was resolving. But yet, yeah.

Michael Batchelder:
So if you tried to do a DNS lookup for just ‘Amazon.com’, there would not have been a problem. This test shows you that we were able to get results, get IP addresses back when we asked for the IP address of ‘Amazon.com’.

Angelique Medina:
That's right. And you can even see here that there, just as a side note, they are hosted by a totally different provider. They're hosted by UltraDNS, and you can see some Dyn servers as well. And Dyn is I believe now part of Neustar and their UltraDNS service. But what's interesting about this is that even though the record here was available, even if you went to ‘Amazon.com’, you still would not have been able to reach the site. And so why is that?

Michael Batchelder:
So when you get to ‘Amazon.com’, the webserver at that IP address is actually going to tell you in your HTTP, in the response to your HTTP request, it's going to tell you, I want you to go and issue a second request, ask for ‘www.Amazon.com’. So we can see that if we look at the waterfall diagram in this particular test for ‘Amazon.com’. We'll see this successful request, which the server says, I'm redirecting you. That's the 301 response code. I am redirecting you to ‘www.Amazon.com’. So go make a request for ‘www.Amazon.com’.
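
The interaction between the redirect and the failed lookup can be sketched as a small simulation. This is only a toy model under stated assumptions: the `DNS` and `REDIRECTS` tables, the `load` function and the `shop.example` names are hypothetical, and no real HTTP or DNS traffic is involved.

```python
# Hypothetical state during the outage: the apex resolves, the www name does not.
DNS = {"shop.example": "198.51.100.5"}
# The apex web server answers every request with a 301 pointing at the www name.
REDIRECTS = {"shop.example": "www.shop.example"}


def load(hostname, dns, redirects, max_redirects=5):
    """Simulate a page load and return the list of steps taken.

    Each hop needs a DNS answer first; a host with no entry in
    redirects is treated as serving the page with 200 OK.
    """
    steps = []
    for _ in range(max_redirects + 1):
        ip = dns.get(hostname)
        if ip is None:
            steps.append(f"{hostname}: unable to resolve DNS")
            return steps
        target = redirects.get(hostname)
        if target is None:
            steps.append(f"{hostname} ({ip}): 200 OK")
            return steps
        steps.append(f"{hostname} ({ip}): 301 -> {target}")
        hostname = target
    steps.append("too many redirects")
    return steps


for step in load("shop.example", DNS, REDIRECTS):
    print(step)
```

In this model the first request succeeds with a 301, and the page load then fails on the second lookup, which matches the sequence described in the waterfall view.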

Angelique Medina:
Yeah. And you can even see this if you were to select the header here for, basically, the 301 header and you see the response. It's basically saying, okay, go to ‘www.Amazon.com’. So the A record for ‘Amazon.com’, for a variety of reasons, it's not kind of optimal that that be a C name. So this is basically their way to get around that to just do an HTTP redirect to the C name record. And we can maybe talk a little bit about why they would do that.

Michael Batchelder:
Sure. So this concept of a C name, which is an alias, allows you to alias one name to another, like ‘www.Amazon.com’ to any of the four that we see on the screen right now. That's a normal thing to do in DNS. And we can talk about a couple of reasons why you would do that. But just to show you what the ‘Amazon.com’ record looks like ... this is a little different. There is no C name when you ask DNS for the A record of ‘Amazon.com’.

Michael Batchelder:
And the reason for that is back in the origins of DNS when the protocol was first designed, the rules stated that you could not have, or, sorry, I should say the rule stated that you had to have at least a couple types of records available at the apex of your domain, the apex being something like ‘Amazon.com’ or ‘thousandeyes.com’, the top of the domain. You needed to have a couple of different records available for DNS to work. And the DNS rules also said that when you use a C name, that is the only type of record available at that particular name. So if we had C name at ‘Amazon.com’, we couldn't have any other record type, and that would break the first rule. So those two rules were in conflict with each other, and the loser was the C name. You could never have a C name at the apex of your domain.
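
The two rules and their conflict can be expressed as a small zone check. In this hedged Python sketch, the "couple of record types" required at the apex are taken to be SOA and NS, and a CNAME is treated as incompatible with any other record type at the same name. The `check_zone` function and the example records are hypothetical, and the sketch deliberately ignores the exception that DNSSEC-related records get in practice.

```python
def check_zone(apex, records):
    """Return a list of rule violations for a zone.

    records is an iterable of (name, record_type) tuples.
    Rule 1: the apex must carry SOA and NS records.
    Rule 2: a name holding a CNAME may hold no other record type
            (simplified: DNSSEC record types are not special-cased).
    """
    types_by_name = {}
    for name, rtype in records:
        types_by_name.setdefault(name, set()).add(rtype)

    problems = []
    missing = {"SOA", "NS"} - types_by_name.get(apex, set())
    if missing:
        problems.append(f"{apex}: apex is missing {', '.join(sorted(missing))}")
    for name, types in sorted(types_by_name.items()):
        if "CNAME" in types and len(types) > 1:
            others = ", ".join(sorted(types - {"CNAME"}))
            problems.append(f"{name}: CNAME cannot coexist with {others}")
    return problems


# A layout like the one discussed: A record at the apex, CNAME at www.
ok_zone = [
    ("shop.example", "SOA"), ("shop.example", "NS"), ("shop.example", "A"),
    ("www.shop.example", "CNAME"),
]
# Putting a CNAME at the apex collides with the mandatory SOA and NS records.
bad_zone = [("shop.example", "SOA"), ("shop.example", "NS"), ("shop.example", "CNAME")]

print(check_zone("shop.example", ok_zone))
print(check_zone("shop.example", bad_zone))
```

Dropping SOA and NS to make room for the CNAME would simply trade the second violation for the first, which is why the alias loses at the apex.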

Angelique Medina:
Right. And, in this case, kind of using the C name here basically enables Amazon to perform traffic engineering, so they can basically use this mechanism to better distribute users to their edge or their web servers. They can just simply kind of route them as they see fit. So that's one of the benefits of using a C name.

Michael Batchelder:
Right. So it gets you a couple of things. It can, through load balancing mechanisms, it can facilitate improved performance and it can facilitate resiliency redundancy for the site. So essentially Amazon, by sending you quickly from ‘Amazon.com’ at the apex back to ‘www.Amazon.com’, as we see here, is just kind of quickly getting around the restriction of no C name at the apex. It's worth saying that DNS providers have found ways to get around this rule. Their solutions are somewhat proprietary at this point, or at least some of them are, but the rule is being bent if not broken in ways that are acceptable. So there might be ways that a company might avoid this problem, but this is the way Amazon chose to set up this particular domain structure, domain name structure.

Angelique Medina:
Yeah. So the end result then was, regardless of which domain you kind of typed into your web browser, the site wouldn't have been available. And so, I mean, this really speaks to kind of the criticality of the DNS. I mean, it's such a foundational system, basically mapping users to their destinations. And even if the destinations are available, it doesn't matter because you don't have any way of knowing where the destination is. And so again, it's like one of these foundational things. And another interesting thing that kind of popped up recently, it may not have been last week, but I think it was the week before. There was some kind of news about a recently kind of discovered vulnerability in the DNS, which of course is, as we mentioned, really, really critical.

Michael Batchelder:
Right. So researchers at Tel Aviv University essentially outlined in a paper, an academic paper, a way to create a distributed denial of service attack on the DNS. Distributed denial of service, DDoS, being a way to use large numbers of attacking devices, usually referred to as bots, to generate traffic of some nature that will crash a service, like the DNS. There was a famous example back in 2016. An attack using the Mirai botnet, which was made up largely of Internet of Things devices, things like security cameras, which have to be Internet-connected to send their video streams up to the cloud. So that Internet connectivity was exploited, vulnerabilities in the devices were exploited in order to turn these devices into a botnet. And the botnet was used to attack a DNS provider called Dyn. Essentially, the Dyn servers were overwhelmed with traffic. A massive amount of traffic was flooding their servers, causing their DNS servers to be incapable of responding to queries.

Angelique Medina:
Right. And so this impacted a lot of companies, like Slack and I think even Amazon was impacted, as well. I mean, just a huge number of major companies that have very well-trafficked websites, where their revenue depends on their site. They were not findable during parts of, and in some cases the entire duration of, this outage. So, there are very kind of clear reasons why somebody would want to target DNS because, particularly if it's a large provider, somebody who hosts a number of DNS records, it could have a pretty sizable blast radius. So definitely check out the details on this particular vulnerability that was uncovered recently. And of course, the researchers, before they made this public, they went through proper channels and made the major DNS providers aware of this vulnerability and how it could be exploited before they put out their information more broadly, which is on a website that we'll have a link to in the show notes.

Angelique Medina:
So make sure that you check that out. So with that, really, really happy to have you on today, Binky, talking about DNS, which is one of our favorite topics. So again, we'll have all the notes up in the blog and don't forget to subscribe to the show and also register for the upcoming virtual summit. We also have our “working from home” t-shirt that we're giving away if you subscribe to the show. So do that, and then send an email to InternetReport@thousandeyes.com, and send your size and your address, and we'll get a t-shirt over to you. And also if you have DNS related questions that you want to ask, Binky's your guy, and you can send those questions over to the show as well. So with that, thank you for joining us.

Michael Batchelder:
Thank you for having me.
