On Friday, March 2nd we saw a severe outage across Amazon AWS’ US-East-1 region, located in Ashburn, VA. This outage impacted Amazon’s very own Alexa along with multiple apps and services hosted within the IaaS provider, including Slack, Twilio and Atlassian JIRA. Although the infrastructure recovered very quickly from what was a weather-related power outage, we saw prolonged and cascading impacts on many software applications and services running on AWS.
The outage primarily affected customers relying on AWS Direct Connect, a service that offers dedicated connectivity between the AWS cloud and enterprise networks. Internet access, on the other hand, recovered quickly, and Amazon’s suggested workaround was to use its IPSec VPN service over the Internet.
So what does this mean for you as an enterprise? Should you subscribe to the private interconnect options available from the various cloud providers? The short answer is that in many cases you should, but this is not a silver bullet. You still need visibility, and you still need to design enough redundancy and fallback options into your application.
Options for Cloud Connectivity
IaaS and PaaS providers like Amazon AWS, Microsoft Azure and Google Cloud Platform allow you to create virtual server instances on demand. These instances live in a virtual private cloud (VPC), which typically sits on an isolated private network. So how do you talk to applications living on these cloud platforms?
The first option is to assign public IPs to these servers, so they can communicate with the wider Internet. This is great for external access, typically the web layer of public-facing apps, but not so great for your internal database servers. The second option is to build an IPSec VPN tunnel from your enterprise network into the cloud provider, and make the private address space routable within your enterprise. This works well for microservices architectures and internal apps that will only be accessed from within the corporate network. However, IPSec VPN tunnels require expensive encryption hardware (typically found embedded inside most modern-day firewalls), and can introduce unwanted latency into the application flows. Also, this option relies on the Internet as the underlying transport. Yes, the same Internet that carries videos of cats riding around on Roombas and the latest season of Game of Thrones. More on this later.
The third option involves establishing some kind of private connection between your enterprise network and the cloud provider, so that your cloud network addresses are routable from within your enterprise networks, and vice versa. AWS calls this Direct Connect, Microsoft Azure calls it ExpressRoute, and Google calls it Cloud Interconnect. Each platform has variations in access methods and redundancy, but all accomplish essentially the same thing: they allow your cloud resources to be routable from within your enterprise network.
All three services involve establishing a connection to, and peering with your cloud provider at one of many available exchange points. Some offer service provider partnerships that allow you to connect even without being at the exchange point. These connections can be capped at a certain bandwidth tier, or can be uncapped and billed based on actual usage. Most offer redundancy options as well, so that a failure on one link or router will not impact the connection.
Comparing Dedicated Connectivity with “Plain Internet”
So how do private connections compare with IPSec tunnels over the Internet? They do offer several advantages:
- Performance: At a certain bandwidth level, IPSec VPN tunnels get prohibitively expensive, and end up throttling your cloud bandwidth. Private connections allow you to seamlessly scale up as your bandwidth needs grow.
- Consistency: You get better control over the network paths, and they are less likely to change with time, unlike the Internet which is highly dynamic.
- Cost: You typically pay less per Gbps of Direct Connect bandwidth relative to Internet bandwidth.
However, private peering connections are not a silver bullet. One of the biggest advantages of the Internet is its resiliency. The high degree of connectedness ensures that data will usually find a path to get from point A to point B. In fact, there are many paths to choose from. However, Internet routing protocols do not always find you the fastest path, or the best one. And you have to share this path with a lot of other traffic streams.
Make sure that your private peering connections do not turn into a single point of failure, as we witnessed with many applications on March 2nd. These applications failed to detect and recover from the loss of back-end connectivity. The Internet is still a great fallback path that is always available and can help you maintain service availability and business continuity.
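To make the detect-and-recover idea concrete, here is a minimal Python sketch of an application that tries its preferred back-end path first and falls back to an alternate one when the connection fails. The function name `connect_with_fallback`, the endpoint list, and the timeout value are all illustrative assumptions, not anything AWS or the affected applications actually use. In many real deployments this failover is handled at the routing layer rather than in application code, so treat this only as an illustration of the principle.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]


def connect_with_fallback(
    endpoints: Sequence[Endpoint],
    timeout: float = 2.0,
    connector: Callable[..., socket.socket] = socket.create_connection,
) -> Tuple[Endpoint, socket.socket]:
    """Try each endpoint in order and return the first one that connects.

    `endpoints` is ordered by preference, e.g. the address reached over the
    private connection first, then one reached over a VPN or the Internet.
    (Hypothetical ordering; adapt to your own topology.)
    """
    last_error: Optional[OSError] = None
    for endpoint in endpoints:
        try:
            # A short timeout lets us detect a dead path quickly instead of hanging.
            return endpoint, connector(endpoint, timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember why this path failed, then try the next
    raise ConnectionError(f"no path reachable: {last_error}")
```

A caller might pass something like `[("10.0.0.5", 5432), ("203.0.113.10", 5432)]`, where both addresses are placeholders for the same service reached over two different paths. The `connector` parameter exists only so the logic can be exercised without a real network.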
This also underscores the need for visibility and monitoring of your cloud applications at multiple layers of the protocol stack. Without this visibility, it becomes very difficult to determine the scope and root cause of an outage like this. We witnessed on Friday how a seemingly short-lived outage in the AWS infrastructure turned into a much longer outage impacting hosted applications. The cloud is a complex distributed system that is incredibly hard to debug. You don’t own the infrastructure, but you still own the outcome.
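As a rough illustration of what checking multiple layers can look like, the Python sketch below probes name resolution, then the TCP connection, then an HTTPS request, and reports the first layer that fails. The names `first_failing_layer` and `default_checks` are hypothetical, and this is a simplified assumption of how such a probe could be structured, not a description of any particular monitoring product.

```python
import socket
import urllib.request
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], object]]


def first_failing_layer(checks: List[Check]) -> str:
    """Run checks from the lowest layer up; return the first failing layer, or 'ok'."""
    for layer, check in checks:
        try:
            check()
        except Exception:
            # Any error at this layer stops the probe: higher layers depend on it.
            return layer
    return "ok"


def default_checks(host: str, port: int = 443, timeout: float = 3.0) -> List[Check]:
    """Build DNS, TCP and HTTP checks for a host (illustrative defaults)."""

    def dns() -> None:
        socket.getaddrinfo(host, port)  # raises socket.gaierror if resolution fails

    def tcp() -> None:
        socket.create_connection((host, port), timeout=timeout).close()

    def http() -> None:
        # urlopen raises on connection errors and on 4xx/5xx responses
        urllib.request.urlopen(f"https://{host}/", timeout=timeout).close()

    return [("dns", dns), ("tcp", tcp), ("http", http)]
```

Calling `first_failing_layer(default_checks("example.com"))` returns the name of the first layer that failed. A result of `"tcp"` with working DNS, for example, would point at the network path rather than the application itself.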
Get Visibility Into Your Cloud Connectivity
Operations teams typically spend over 70% of their time figuring out where a problem lies, and only then can they begin to implement a fix. In the cloud, this ratio can get even worse unless you have sufficient insight into the correlation between application performance, network paths, and Internet routing. That’s where ThousandEyes comes in. We deliver modern, cloud-aware network monitoring that cuts through the haze and lets you find root causes fast. To learn how to cut your cloud troubleshooting time down to seconds and keep on top of your cloud connectivity, request a demo or sign up for a free trial of ThousandEyes, and watch one of our webinars.