Do We Need to Rethink Network Monitoring?

By Kemal Sanjta | 9 min read

Regardless of whether you’re a knowledgeable computer user or not, there is a high probability that you’ve heard of or have even used traceroute and ping. People often ping google.com to test if the "Internet works" and use traceroute to find out more about their network performance. These two essential troubleshooting utilities have served us well for quite a long time.

However, as the complexity of computer networks has increased, some deficiencies in those tools have become apparent. For example, traceroute can fail to discover nodes or report false links, which can send troubleshooting in the wrong direction. Ping works pretty well, but it relies heavily on ICMP, which these days is quite often blocked or heavily policed.

These deficiencies inspired people to write better utilities. That’s how we got Paris traceroute, which solves the majority of the issues seen in traditional traceroute. Innovation didn’t stop there: we got tools such as mtr, which network engineers commonly reach for when troubleshooting packet loss. There’s also Dublin traceroute, which can peek beyond NAT boundaries, and even complete suites of utilities like NLNOG Ring. The list goes on.

Challenges with How Problems Get Detected

All these tools kick in during the troubleshooting cycle, once issues have been discovered. Issues are initially found in various ways. In the worst-case scenario, customers notice problems first, but often it is a network monitoring solution that detects them and sends notifications. Network monitoring solutions have long relied on classic "sources of truth" like Syslog and SNMP. More recently, with the rise of the Network Reliability Engineering (NRE) approach, developers noticed that many important network metrics and counters weren’t exposed, so they started building newer collection methods that establish a remote session with the target device, execute specific commands, and store the results in backend solutions for analysis. These methods tend to be largely automated. Along the same lines, many popular networking vendors have implemented gRPC and streaming telemetry solutions.
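
As a rough illustration, a minimal version of this kind of command-based collection might look like the sketch below. It uses the netmiko library purely as an example; the device address, credentials, and command are placeholders rather than anything from a real environment.

```python
# Minimal sketch of command-based collection over SSH, using netmiko as an
# example library. The platform, address, credentials, and command below are
# placeholders, not values from any real environment.
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",   # assumed platform; adjust to your vendor
    "host": "192.0.2.10",         # documentation address, replace with a real device
    "username": "monitor",
    "password": "example-password",
}

with ConnectHandler(**device) as conn:
    # Run a show command and keep the raw text for later parsing and storage.
    output = conn.send_command("show interfaces")

# In a real pipeline this output would be parsed and written to a
# time-series or log backend for analysis.
print(output[:500])
```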

However, there are challenges with all of these methods. SNMP collection may not have access to all the MIBs needed for sufficient visibility, or the monitoring platform may not support non-standard MIBs. Syslog can be configured to report only on certain severities, and as a result important messages can, and often do, get filtered out. The automated approach adopted by Network Reliability Engineering teams has also exposed bothersome limitations in modern platforms. For example, it is quite easy to hit the maximum number of allowed concurrent SSH sessions, and executing commands to gather detailed MPLS LSP statistics can create prohibitively high CPU overhead. Furthermore, all of these mechanisms tax the compute resources that both the management and control planes rely on, and can starve critical control plane functions such as Best Path Selection. Finally, some mechanisms, like gRPC, aren’t widely available on current network infrastructure platforms.
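
To make the session-limit problem concrete, one mitigation is for the collector to cap its own parallelism so it never opens more sessions than a device tolerates. The sketch below is illustrative only; the session budget and the collect_from helper are hypothetical stand-ins for the collection code above.

```python
# Illustrative only: cap how many SSH collection sessions run against one
# device at a time, so the collector does not trip a device-side limit on
# concurrent sessions. MAX_SESSIONS and collect_from() are hypothetical.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_SESSIONS = 2                        # assumed per-device session budget
_session_gate = threading.Semaphore(MAX_SESSIONS)

def collect_from(device: str, command: str) -> str:
    """Placeholder for the SSH collection shown earlier."""
    with _session_gate:                 # block until a session slot is free
        # ... open SSH session, run command, return output ...
        return f"{device}: output of {command!r}"

commands = ["show interfaces", "show mpls lsp statistics"]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda c: collect_from("192.0.2.10", c), commands))
```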

Is the Network Telemetry Accurate?

The Network Reliability Engineering approach, using programming languages like Python and Go and solutions such as Salt, NAPALM, and Ansible, means that much of the discovery and remediation of issues can be executed automatically. But once you gain confidence that automation can keep information flowing properly, it’s only logical to question whether the telemetry generated by vendor equipment is in fact accurate. There are somewhat unusual accuracy issues with data from network equipment, such as bit flips attributed to solar flares (for which no in-depth root cause analysis has ever been provided). More commonly, engineers discover that the metrics they need to aid their troubleshooting simply aren’t available, sometimes only after several hours of being engaged with vendor technical support teams.
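
One inexpensive way to build confidence in counter telemetry is to sanity-check it rather than trust it blindly. The sketch below, with invented sample data, flags interface counters that go backwards between polls, which can indicate a wrap, a reset, or a bad sample.

```python
# A minimal sanity check on counter telemetry: interface byte counters should
# only increase between polls (barring a wrap or reset). Flag anything else
# for investigation instead of trusting it blindly. Sample data is invented.
def suspicious_samples(previous: dict, current: dict) -> list:
    flagged = []
    for ifname, value in current.items():
        prev = previous.get(ifname)
        if prev is not None and value < prev:
            # Could be a counter wrap, a device reboot, or a bad sample.
            flagged.append(ifname)
    return flagged

prev_poll = {"xe-0/0/0": 1_200_345, "xe-0/0/1": 9_876_543}
curr_poll = {"xe-0/0/0": 1_250_000, "xe-0/0/1": 4_321}   # xe-0/0/1 went backwards
print(suspicious_samples(prev_poll, curr_poll))           # ['xe-0/0/1']
```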

Is Automation Enough?

Nobody is going to argue that automation can’t significantly improve response to events and help by remediating frequently repeated incidents that would otherwise consume engineering time. The investment put into automating those events pays off in the form of more time for engineers to spend on innovation.

However, the real question is whether automation alone is enough. Automation has helped, but let’s be honest: events often still go undetected for long periods or, even worse, get spotted by your users first, which brings multiple adverse effects such as loss of confidence in your brand or negative financial impact.

Going Beyond Passive Data Collection

Generally, to alert on a specific event, you need to be aware that it can occur in the first place. That means alerts are codified based on previous occurrences. Unfortunately, that is not how things work in real-life production networks. New events come up, counters may not be available, SNMP may not have a relevant MIB, the data may not be supported by your monitoring solution, or gRPC may not be supported on your platform. More fundamentally, collecting all the data you might possibly need places a lot of strain on the networking devices themselves.
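
To see why codified alerting only catches what someone anticipated, consider a toy rule set like the one below; the metric names and thresholds are invented. A failure mode nobody wrote a rule for simply never fires.

```python
# Illustration of why codified alerting only catches anticipated failures:
# each rule below encodes a known symptom and threshold. Anything the rules
# don't describe passes silently. Metric names and thresholds are invented.
ALERT_RULES = {
    "interface_errors": lambda m: m.get("crc_errors", 0) > 100,
    "bgp_session_down": lambda m: m.get("bgp_established", 1) == 0,
    "high_cpu":         lambda m: m.get("cpu_util", 0.0) > 0.90,
}

def evaluate(metrics: dict) -> list:
    return [name for name, rule in ALERT_RULES.items() if rule(metrics)]

# A failure mode nobody codified (say, silent packet loss on a transit path)
# produces no matching metric and therefore no alert.
print(evaluate({"cpu_util": 0.95, "crc_errors": 3}))   # ['high_cpu']
```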

Passive data types aren’t bad, but they need to be complemented with synthetic, or active, monitoring: sending simulated user traffic (with the same characteristics as real user traffic) to measure critical performance indicators such as packet loss and latency. An active monitoring approach, combined with automation that provides fast response and remediation, is a must, especially now that you rely on so many networks that aren’t directly under your control and from which you can’t collect passive data.
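
A bare-bones active probe can be as simple as repeatedly opening a TCP connection to a target and recording connect latency, counting failures as loss. The sketch below uses a placeholder target and probe count, and it is far simpler than real synthetic monitoring, which shapes traffic to resemble actual user transactions.

```python
# A very small active-monitoring probe: repeatedly open a TCP connection to a
# target and record connect latency, counting failures as loss. The target,
# port, and probe count are placeholders.
import socket
import statistics
import time

TARGET, PORT, PROBES, TIMEOUT = "www.example.com", 443, 20, 2.0

latencies_ms, failures = [], 0
for _ in range(PROBES):
    start = time.monotonic()
    try:
        with socket.create_connection((TARGET, PORT), timeout=TIMEOUT):
            latencies_ms.append((time.monotonic() - start) * 1000)
    except OSError:
        failures += 1
    time.sleep(0.5)                     # pace the probes

loss_pct = 100.0 * failures / PROBES
if latencies_ms:
    print(f"loss={loss_pct:.1f}% median={statistics.median(latencies_ms):.1f}ms")
else:
    print(f"loss={loss_pct:.1f}% (no successful probes)")
```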

A Holistic Approach Is Needed

Whether you work in network or service reliability, teams should adopt a more holistic approach and stop blaming each other. No, the network is not an unlimited resource, as many developers tend to treat it. On the other hand, not every issue is a bug or a service-related failure, as network engineering teams sometimes work hard to prove. Experience teaches that symptoms in one layer of the stack often reflect issues in another, and vice versa. Therefore, it is essential to have full visibility on the service side as well.

All of these efforts, combined, are your chance to evolve your network monitoring to a state where you can reliably identify what an issue is and where it happened, in a timely manner. You need it, and your business expects it.
