In October 2020, we announced early access to Multi-service Views, the next step in our evolution of data visualization and correlation. Since then, we’ve had many customer conversations, and more than 50 enterprise organizations have begun to see the value of Multi-service Views. Today, we’re excited to announce that Multi-service Views is now available for all existing and new customers. We’ve learned a lot and improved the capability during this period, and we’ll continue making improvements that give more time back to our users. This blog post highlights three real-world problems we’ve seen customers solve faster using Multi-service Views:
- Is there a problem with my VPN?
- Is it the network or the Hadoop cluster?
- What is the scope of an Internet outage across my services?
Is there a problem with my VPN?
Given the situation most of us around the world find ourselves in, it’s no surprise what’s top of mind for network administrators: VPN performance. As we all work from home, our virtual connection to the corporate network has become critical to accomplishing our day-to-day tasks. What happens when you receive user complaints that simply say “VPN is slow,” but you’ve received no alerts from your infrastructure monitoring system? The issue could lie in many places: the user’s local network, their last mile, a broader Internet issue, or the next-hop ISP of the VPN concentrator. To close this VPN visibility gap, enterprises have been using ThousandEyes Cloud Agents to run health checks against their Internet-facing VPN concentrators and look for Internet issues that could be impacting their remote employees. Many large companies today operate a globally distributed set of VPN concentrators, each serving users in its local region. How do you get a regional or global view of ISP connectivity to your VPN servers?
Let’s look at a real-world example. In this scenario, a user has created a Multi-service View combining all external VPN endpoints within a region as a starting point for answering the general question: “Is there a problem with the VPN?” The test selector on the top left shows “4 selected tests,” representing the four VPN endpoints that serve users in South Africa. In the days after Christmas 2020, we see availability drops for the VPN in South Africa. As we dig into the data using the new grouping options for multi-test analysis, grouping by agent shows that the problem originates from the Johannesburg vantage point, which reports a TCP Connect error. This correlates with packet loss observed from Johannesburg across all four VPN endpoints, pointing to a common problem that impacts every VPN server in South Africa as seen from that vantage point.
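If you prefer to work with the raw data, this kind of per-agent grouping can also be approximated with a short script against the ThousandEyes v6 REST API. The sketch below is a minimal example only: the test IDs and credentials are hypothetical placeholders for the four VPN endpoint tests, and the `agentName` and `loss` fields reflect the typical shape of the `net/metrics` response, which you should verify against your own account.

```python
"""Group network metrics from several tests by agent to spot a common bad vantage point.

Minimal sketch against the ThousandEyes v6 REST API; the test IDs and credentials
are hypothetical, and response field names should be verified for your account.
"""
from collections import defaultdict

import requests

API = "https://api.thousandeyes.com/v6"
AUTH = ("user@example.com", "YOUR_API_TOKEN")    # basic-auth email + API token (placeholder)
VPN_TEST_IDS = [111111, 222222, 333333, 444444]  # the four VPN endpoint tests (hypothetical IDs)

loss_by_agent = defaultdict(list)
for test_id in VPN_TEST_IDS:
    # Pull the last 12 hours of end-to-end network metrics for this test.
    resp = requests.get(f"{API}/net/metrics/{test_id}.json",
                        params={"window": "12h"}, auth=AUTH)
    resp.raise_for_status()
    for point in resp.json().get("net", {}).get("metrics", []):
        loss_by_agent[point["agentName"]].append(point.get("loss", 0.0))

# An agent whose average loss stands out across *all* tests points to a vantage-point
# problem (e.g., Johannesburg) rather than an issue with a single VPN server.
for agent, samples in sorted(loss_by_agent.items(),
                             key=lambda kv: -(sum(kv[1]) / len(kv[1]))):
    avg = sum(samples) / len(samples)
    print(f"{agent:25s} avg loss {avg:5.1f}% over {len(samples)} samples")
```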
Next, to understand why there seem to be problems accessing the VPN through Johannesburg, a user would look at the path visualization. Without Multi-service Views, she would have to look at connectivity to each VPN server independently and cross-reference IP addresses and network names to identify a common root cause. Instead, Multi-service Views give us broader context on the root cause of the VPN problem by cross-referencing and visualizing the issue across all of the VPN server tests. In the example below, we see a common problem in the 168.209.x.x network that provides connectivity to all four VPN servers, reducing the overall time to root cause. Also, notice the IP 196.223.14.10, which represents the bottleneck through which all four VPN servers are accessed.
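The same cross-referencing can be done programmatically by collecting the hop IPs each test traverses (for example, from each test’s path visualization data) and intersecting them. The sketch below is purely illustrative: the hop sets are hypothetical stand-ins that mirror the South Africa scenario, not real export data.

```python
"""Find the network hops shared by every VPN endpoint test.

Illustrative sketch: in practice the hop IPs per test would come from each test's
path visualization data; the sets below are hypothetical stand-ins for the
South Africa example.
"""
from functools import reduce

# Hop IPs traversed by each of the four VPN endpoint tests (hypothetical data).
hops_per_test = {
    "vpn-za-1": {"196.223.14.10", "168.209.1.1", "168.209.4.9", "10.1.1.1"},
    "vpn-za-2": {"196.223.14.10", "168.209.1.1", "168.209.7.2", "10.2.2.1"},
    "vpn-za-3": {"196.223.14.10", "168.209.2.5", "10.3.3.1"},
    "vpn-za-4": {"196.223.14.10", "168.209.2.5", "10.4.4.1"},
}

# Exact hops present in every test are shared dependencies; a problem there
# affects all VPN servers at once (here, the 196.223.14.10 bottleneck).
common_hops = reduce(set.intersection, hops_per_test.values())
print("Hops shared by all tests:", sorted(common_hops))

# Grouping by the first two octets catches a shared upstream network
# (e.g., 168.209.x.x) even when the exact router differs per test.
shared_networks = reduce(
    set.intersection,
    ({".".join(ip.split(".")[:2]) + ".x.x" for ip in hops} for hops in hops_per_test.values()),
)
print("Shared networks:", sorted(shared_networks))
```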
Is it the network or the Hadoop cluster?
Another place Multi-service Views show their potential is for customers whose application nodes are clustered together and who need a comprehensive view of the network supporting that environment. In the example below, you see an existing view of two data centers accessing one of the Hadoop cluster nodes across an inter-data center link. There’s a period of over two hours of intermittent packet loss observed from both data center agents. The accompanying path visualization also shows a potential point in the network that’s seeing forwarding loss (highlighted in red) and is worth investigating.
However, alerts were generated for other Hadoop cluster nodes as well. To understand whether these alerts are interconnected or independent occurrences, this user combines the tests to all ten Hadoop cluster nodes into a single view to reveal the complete picture. The consolidated view clearly highlights the commonality in the network connectivity path and a potential bottleneck, with just two routers providing connectivity to all ten Hadoop cluster nodes. Varying amounts of loss in the middle of the network across the 2.5-hour time period meant that we needed to keep looking. Since this test traverses large parts of an internal network, ThousandEyes Enterprise Agents can be leveraged to gather infrastructure metrics through SNMP, helping us correlate common infrastructure problems like high throughput, errors, and discards with packet loss.
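As a rough illustration of that correlation step, the sketch below joins two exported time series, packet loss from the combined tests and interface counters collected via SNMP, and computes their correlation. The CSV filenames and column names are hypothetical assumptions; substitute whatever export format you use.

```python
"""Correlate packet loss from the combined Hadoop tests with SNMP interface metrics.

Minimal sketch assuming two exported CSV time series (filenames and columns are
hypothetical): per-round loss from the Multi-service View, and interface counters
collected via SNMP by an Enterprise Agent, sampled on the same interval.
"""
import pandas as pd

# Columns assumed: timestamp, loss_pct
loss = pd.read_csv("hadoop_tests_loss.csv", parse_dates=["timestamp"])
# Columns assumed: timestamp, in_bps, out_bps, errors, discards
iface = pd.read_csv("core_router_interface.csv", parse_dates=["timestamp"])

# Align the two series on timestamp and correlate loss against each interface metric.
merged = loss.merge(iface, on="timestamp", how="inner").set_index("timestamp")
correlations = merged.corr(numeric_only=True)["loss_pct"].drop("loss_pct")
print(correlations.sort_values(ascending=False))

# A strong positive correlation with in_bps/out_bps (rather than errors or discards)
# points at congestion on the inter-DC link instead of a faulty interface.
```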
Looking at device-level views, we see an interesting trend. Device-level views allow us to observe aggregate metrics for only those devices in the infrastructure that are traversed in the Multi-service View. Having narrowed the device focus, we observe a four-fold increase in input and output throughput that correlates with the packet loss time window. As we dug further, this data prompted the network team to re-evaluate their bandwidth needs: the inter-DC links used for replication supported a maximum of 10 Gbps, while daily bursts of data transfers reached anywhere between 30 Gbps and 40 Gbps, causing packet loss.
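The underlying arithmetic is simple enough to sanity-check in a few lines: any interval where replication demand exceeds line rate will queue and eventually drop traffic. The numbers in the sketch below are hypothetical, chosen to mirror the 10 Gbps link and 30-40 Gbps bursts described above.

```python
"""Flag intervals where replication demand exceeds inter-DC link capacity.

Small sketch with hypothetical numbers mirroring the scenario above: a 10 Gbps
link and daily replication bursts in the 30-40 Gbps range.
"""
LINK_CAPACITY_GBPS = 10.0

# Aggregate replication demand sampled every five minutes (hypothetical values, in Gbps).
samples = [
    ("02:00", 2.1), ("02:05", 8.7), ("02:10", 34.0),
    ("02:15", 39.5), ("02:20", 31.2), ("02:25", 6.4),
]

for ts, demand_gbps in samples:
    if demand_gbps > LINK_CAPACITY_GBPS:
        # Demand above line rate gets queued and eventually dropped, which is
        # what the tests observed as packet loss during these windows.
        ratio = demand_gbps / LINK_CAPACITY_GBPS
        print(f"{ts}: {demand_gbps:.1f} Gbps demand on a {LINK_CAPACITY_GBPS:.0f} Gbps "
              f"link ({ratio:.1f}x oversubscribed)")
```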
What is the scope of an Internet outage across my services?
Another interesting example of how Multi-service Views help build a more complete picture is during Internet outages. In the example below, we see a test to a particular URL that saw intermittent packet loss over roughly two hours. When we look at the single test (Figure 8), the Internet outage swim lane gives us context that this is perhaps related to a larger Internet problem. Looking at the path visualization for this single test, we see 96% forwarding loss at a node in Microsoft’s network (104.44.22.60). However, combining this data with other tests gives us a more definitive and accurate picture of the source of this outage. Figure 9 shows data from the Rio de Janeiro Cloud Agent across seven other tests that all saw intermittent problems during the same time period. Besides showing the problematic nodes in the path visualization more clearly, we also get a more complete account of the problem’s duration, which lasted three hours.
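Measuring the full scope of an outage like this comes down to merging the loss windows seen by each affected test into a single timeline. The sketch below shows that interval-merging step with hypothetical time windows, in which any one test accounts for only part of the event while the union across tests spans three hours.

```python
"""Merge per-test loss windows to measure the full span of an Internet outage.

Sketch with hypothetical intervals: a single test shows only part of the event,
but the union across the affected tests from one agent covers the full three hours.
"""
from datetime import datetime, timedelta

FMT = "%H:%M"

def window(start, end):
    """Parse an HH:MM pair into a (start, end) datetime tuple on a dummy date."""
    return (datetime.strptime(start, FMT), datetime.strptime(end, FMT))

# Time ranges with elevated loss per test, as seen from one Cloud Agent (hypothetical).
loss_windows = [
    window("13:10", "14:20"),  # test A
    window("13:40", "15:05"),  # test B
    window("14:50", "16:05"),  # test C
    window("13:05", "13:55"),  # test D
]

# Merge overlapping intervals into the outage as experienced across all services.
merged = []
for start, end in sorted(loss_windows):
    if merged and start <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], end)
    else:
        merged.append([start, end])

total = sum((end - start for start, end in merged), timedelta())
for start, end in merged:
    print(f"outage window: {start:%H:%M} - {end:%H:%M}")
print(f"total outage duration: {total}")
```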
As enterprises across market segments rapidly adopt cloud and API-based services, it becomes increasingly important to collect and visualize data across the wide range of dependencies that play a critical role in delivering an optimal end user experience. Multi-service Views aim to make complex environments easy to understand, and we hope they help our users solve problems faster. Users have long compared ThousandEyes’ ability to compare data across time to a “network DVR.” With Multi-service Views, we hope to extend this “network DVR” across tests to build a more comprehensive movie of your network dependencies. To see how Multi-service Views work, watch the video above or try them from within your ThousandEyes account starting today.