On May 17th, 2016, we started seeing a string of network and BGP routing issues spread across continents. A deeper look revealed correlated behaviour that pointed to a possible fault in the SEA-ME-WE 4 (SMW-4) submarine communications cable, affecting Tata Communications and Telecom Italia Sparkle (TISparkle). The SMW-4 cable is approximately 19,000 kilometers long and, along with SEA-ME-WE 3 and IMEWE, provides the primary Internet backbone between South East Asia, the Middle East, Europe and the Indian subcontinent.
The EuroTrip
Beginning at 6:34 AM Pacific on May 17th, several ThousandEyes Cloud Agents in Europe (Palermo and Bucharest) started displaying severe packet loss. Before the issue started, both of these agents were transiting the TISparkle network through Bucharest and Paris before exiting Europe to reach the Netflix service being monitored in Dubai. Feel free to follow along by using our shared link to the Netflix test.

To narrow down exactly where the loss occurred, we used Path Visualization, which showed 100% packet loss at the TISparkle node in Bucharest, as shown in Figure 2.
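The idea of pinning loss to a hop can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how Path Visualization works internally: the `Hop` type, the `first_lossy_hop` helper, the hop labels and the loss values are all assumptions made up for the example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical per-hop record: (hop label, forwarding loss in percent).
Hop = Tuple[str, float]


def first_lossy_hop(path: Sequence[Hop], threshold: float = 100.0) -> Optional[str]:
    """Return the label of the first hop whose loss meets the threshold.

    Returns None when no hop on the path reaches the threshold.
    """
    for label, loss_pct in path:
        if loss_pct >= threshold:
            return label
    return None


# Illustrative values only, not measurements from the outage.
example_path = [
    ("agent", 0.0),
    ("access-router", 0.0),
    ("tisparkle-bucharest", 100.0),
    ("tisparkle-paris", 100.0),
]

print(first_lossy_hop(example_path))
```

Reporting the first hop at the threshold matters because every hop behind a fully lossy node also appears lossy, even if it is healthy.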

For the next couple of hours we observed intermittent connectivity accompanied by fluctuating packet loss from these agents. For a brief period, packets were rerouted via TeliaNet and Level 3, and then again via TISparkle and Level 3. Interestingly, packets were now being routed through the TISparkle Frankfurt PoP instead of the Paris PoP, and exiting Europe through Level 3. BGP Visualization aligns well here: it shows the direct routes to the Netflix prefixes through TISparkle being withdrawn and replaced by routes that reach TISparkle by way of Level 3.

The data seen so far, with 100% loss at certain TISparkle nodes and traffic being rerouted through TISparkle Frankfurt while completely avoiding TISparkle Paris, suggests a possible connectivity issue in the France region. The SMW-4 underwater cable connecting the Indian subcontinent to Europe lands at Palermo, Italy and Marseille, France, which points to a possible fault in that part of the transcontinental cable.
Passage to India
Around the same time that circuits in Europe tripped, a few of our tests originating from the Indian subcontinent also started seeing packet loss that affected services like Craigslist and Salesforce.
The issues started around 5:30 AM Pacific and were intermittent until 8:30 AM Pacific. Path Visualization showed 100% packet loss at the SMW-4 egress point in India, located in Mumbai.

We went back a few time slices before the problem to see what the original path looked like and, not surprisingly, the next hop after the Mumbai node was the SMW-4 Europe egress location at Marseille. The highlighted nodes in Path Visualization, seen in Figure 4 above and Figure 5 below, show the unchanged part of the network. The high latency between the Mumbai and Marseille nodes is also an indication that this is a transcontinental link.
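The latency argument can be made concrete with a rough sketch. It relies on one well-known approximation, that light travels through optical fiber at roughly 200,000 km/s, or about 200 km per millisecond. Everything else here is an assumption for illustration: the helper names `min_rtt_ms` and `rtt_jumps`, the 50 ms threshold, and the hop labels and round-trip times, which are not measurements from this event.

```python
from typing import List, Sequence, Tuple

# Approximate propagation speed in optical fiber (about two thirds of c).
FIBER_KM_PER_MS = 200.0


def min_rtt_ms(route_km: float) -> float:
    """Lower bound on round-trip time over a fiber route of the given length."""
    return 2.0 * route_km / FIBER_KM_PER_MS


def rtt_jumps(
    hops: Sequence[Tuple[str, float]], min_jump_ms: float = 50.0
) -> List[Tuple[str, str, float]]:
    """Return (previous hop, hop, RTT increase) wherever RTT rises sharply.

    A large increase between adjacent hops hints at a long-haul link.
    The 50 ms default is an arbitrary threshold chosen for this sketch.
    """
    jumps = []
    for (prev_label, prev_rtt), (label, rtt) in zip(hops, hops[1:]):
        delta = rtt - prev_rtt
        if delta >= min_jump_ms:
            jumps.append((prev_label, label, delta))
    return jumps


# Illustrative round-trip times in milliseconds, not real measurements.
example_hops = [("mumbai-a", 12.0), ("mumbai-b", 14.0), ("marseille", 120.0)]

print(rtt_jumps(example_hops))
print(min_rtt_ms(1000.0))
```

A jump that is at least as large as the propagation bound for the suspected cable route is consistent with a submarine segment, although queuing and asymmetric return paths can also inflate per-hop latency.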

This is in line with the issues that were seen in Europe. There was most likely a cable fault in the SMW-4 link in the France-Italy region that affected services globally.
South of the Border
Until now the manifestations of a faulty cable link are consistent. We see network and routing issues that affect Europe and India. So far so good. However, it doesn’t stop there. The symptoms of the issue start percolating all the way to Latin America!
We had a couple of tests to a telecom service hosted in Slovakia from a global set of Cloud Agents. Locations in Latin America were routing packets through TISparkle’s local PoPs via the Miami PoP, which serves as the egress point of TISparkle’s transatlantic link to Italy.
At the same time that agents in Europe were seeing issues, Cloud Agents in Santiago, Chile and Cordoba, Argentina started experiencing 100% packet loss at the local TISparkle PoP.


Within a few minutes of the packet loss appearing, the network self-adjusted and started routing packets from the local TISparkle PoPs in Latin America to the TISparkle node in New York, completely bypassing the Miami-Italy link. This was very similar to what we saw in Europe, where the Paris TISparkle PoPs were bypassed.
Tying the Knots
Did a minor cable fault have multiple ripple effects across half the globe? Well, it definitely seems like it. Let’s, however, put things into perspective. We mapped out a partial TISparkle network along with the SMW-4 cable route to show how a minor cable fault in one part of the world possibly affected three different continents. Once we mapped it out, things started to fall into place.

If you look at Figure 8, the transcontinental and terrestrial backbones overlap in the Western European region. Columbus III connects Latin America to Western Europe and SMW-4 connects the Indian subcontinent to Western Europe. A possible cable fault around the overlapping area simultaneously affected the TISparkle and Tata Communications networks and further percolated to the TISparkle network in Latin America via Miami.
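The overlap reasoning can be expressed as a small counting exercise: tally which network segments appear on the failing paths, and discount any segment that also carries a working path. The sketch below is a toy version of that idea, not the method used to build Figure 8. The `suspect_segments` helper and every segment label are hypothetical, loosely named after the locations in this post rather than taken from the real topology.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def suspect_segments(
    failing_paths: Iterable[Sequence[str]],
    healthy_paths: Iterable[Sequence[str]] = (),
) -> List[Tuple[str, int]]:
    """Rank segments by how many failing paths traverse them.

    Segments that also appear on a healthy path are excluded, since a
    working path through a segment suggests that segment is not at fault.
    """
    healthy = set().union(*(set(path) for path in healthy_paths))
    counts: Counter = Counter()
    for path in failing_paths:
        counts.update(set(path) - healthy)
    return counts.most_common()


# Toy paths with made-up segment labels (assumptions, not the real topology).
failing = [
    ["mumbai", "smw4-subsea", "western-europe"],
    ["santiago", "miami", "transatlantic", "western-europe"],
    ["bucharest", "paris", "western-europe"],
]
healthy = [["santiago", "new-york"]]

print(suspect_segments(failing, healthy)[0])
```

With these toy inputs the only segment shared by all three failing paths is the Western Europe one, which mirrors the conclusion drawn from the map. A real analysis would need actual per-link path data and would have to account for segments that fail only partially.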
This was definitely an interesting event that spanned multiple locations. With the right tools and insight, you too can get a head start on seeing problems in your network and quickly addressing them. Sign up for a free trial of ThousandEyes and start viewing your network like never before.