It is well understood that packet loss negatively affects network flows. Many network engineers understand the intricacies of TCP, such as how congestion avoidance algorithms work, and TCP questions are quite common on network engineering interview loops. Even so, if you were to ask engineers to quantify the effects of packet loss, you would likely be met with blank stares. Given how well researched and widespread packet loss is, it is striking how little we as a networking community understand its impact.
Here at ThousandEyes, we often discuss the adverse effects of packet loss, especially in our outage analysis blog series, where we frequently show correlations between spikes in packet loss and negative impacts on the applications users rely on. More often than not, a spike in packet loss overlaps with application degradation.
While operators often see sustained spikes in packet loss, which can overlap with application-level performance degradation, it’s my experience that network engineers tend to look past “small” levels of packet loss (say 1 or 2%). In this blog, I’ll explain why this behavior could be problematic and show results from our own research demonstrating how these “small” packet loss issues can potentially have a big impact on users’ experience.
Various Methods TCP Uses To Handle Packet Loss
TCP uses several methods to detect packet loss and recover from it, including duplicate acknowledgements, retransmission timeouts, Explicit Congestion Notification (ECN), Selective Acknowledgements (SACK), and congestion avoidance algorithms. TCP senders typically use packet loss and increased round-trip time as key indicators of congestion.
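On Linux, a quick way to see whether mechanisms like SACK and ECN are enabled is to read the relevant sysctls. This is just an illustrative check, not something we tuned for this research:

```
# Show whether selective acknowledgements and ECN are enabled
# (tcp_sack: 1 = enabled; tcp_ecn: 0 = off, 1 = request and accept, 2 = accept only)
sysctl net.ipv4.tcp_sack net.ipv4.tcp_ecn
```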
Different congestion control algorithms have emerged over the last two decades, tackling the same problem with different approaches. On systems deployed years ago, it was common to find Reno in use. Today, most operating systems use CUBIC as the default, and newer algorithms such as BBR aim to improve things further. Given that packet loss is one of the congestion indicators we are discussing in this post, it's important to specify that we tested in a network environment using CUBIC as the congestion control algorithm.
CUBIC: The Default Congestion Avoidance Algorithm
This research provides results for an environment that uses CUBIC as the default congestion avoidance algorithm. In the future, we plan to revisit this topic with results for other congestion avoidance algorithms.
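On most Linux systems you can confirm which congestion control algorithm is in use, and switch it, via sysctl. A quick sketch; the available algorithms depend on which kernel modules are loaded:

```
# Show the current and available congestion control algorithms
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# Explicitly select CUBIC (already the default on most modern distributions)
sudo sysctl -w net.ipv4.tcp_congestion_control=cubic
```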
As the Internet grew in popularity and computer networks expanded, network engineers quickly realized that some earlier congestion avoidance algorithms, such as Tahoe, ramped up to the available bandwidth more slowly than they should, especially on higher-bandwidth networks. That prompted further research, which found that a cubic growth function has properties that serve congestion avoidance well in high-bandwidth networks.
Initially, the CUBIC congestion algorithm quickly expands the congestion window (the amount of data that can be in flight, unacknowledged, without incurring drops on the path). As it approaches the window size at which drops previously occurred, it slows down. If no drops occur, it resumes growing, slowly at first and then more quickly, to probe for additional capacity. This approach proved to work well in high-bandwidth networks, and while CUBIC has its own set of challenges, it is more efficient than its predecessors.
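For reference, RFC 8312 defines CUBIC's window growth as a cubic function of the time $t$ elapsed since the last congestion event, anchored at $W_{max}$, the window size at which that loss occurred (with the recommended constants $\beta_{cubic} = 0.7$ and $C = 0.4$):

$$W_{cubic}(t) = C\,(t - K)^3 + W_{max}, \qquad K = \sqrt[3]{\frac{W_{max}\,(1 - \beta_{cubic})}{C}}$$

The flat region of the curve around $t = K$ is what makes CUBIC cautious near the previous loss point, while growth is fast both well below and well above it.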
Before diving into the results, we wanted to share some details about the setup we used to conduct these measurements.
Test Methodology: Symmetric and Asymmetric Network Paths
We have five Linux hosts configured to forward packets, each with a 1-gigabit Ethernet interface connected to a switch. Routing is configured statically so that traffic flows from the first host (acting as an iperf3 client) to the last one (acting as an iperf3 server). To achieve this, we configured subinterfaces in different VLANs, which also required corresponding configuration on the switch. The topology we used to conduct the research for a symmetric network path is shown in Figure 1.
The network path in Figure 1 depicts a symmetric network topology: traffic going in the forward direction takes precisely the same path in reverse. This setup closely resembles what network engineers often implement in private networks, where paths are kept symmetrical for ease of troubleshooting, firewall placement, and so on. If an issue appears on a node in the forward path, it will show up on the same hops in the reverse direction, which eliminates the need to run MTR and other classic troubleshooting techniques from both sides, an approach commonly required for asymmetrical network paths such as those on the Internet.
Since most Internet paths are asymmetric, we wanted to conduct the same experiments under those conditions. We used the same five Linux hosts and reconfigured them so that the reverse traffic takes a different path (by using different VLANs). Figure 2 shows the network topology that we used for this purpose.
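For illustration, the configuration on each intermediate host looks roughly like the sketch below. The interface names, VLAN IDs, and addresses are placeholders rather than the exact values from our lab:

```
# Enable IPv4 forwarding on the intermediate host
sudo sysctl -w net.ipv4.ip_forward=1

# Create VLAN subinterfaces toward the previous and next hop
sudo ip link add link eth0 name eth0.10 type vlan id 10
sudo ip link add link eth0 name eth0.20 type vlan id 20
sudo ip addr add 10.0.10.2/24 dev eth0.10
sudo ip addr add 10.0.20.1/24 dev eth0.20
sudo ip link set eth0.10 up
sudo ip link set eth0.20 up

# Static route pointing traffic for the far-end subnet at the next hop
sudo ip route add 10.0.50.0/24 via 10.0.20.2
```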
We are using iperf3 to conduct this research, so we are interested in throughput as the primary metric. Throughput refers to the amount of data successfully transmitted over a network within a specific period, and it is affected by factors such as network congestion and protocol overhead. Unlike bandwidth, which represents the maximum capacity of the channel, throughput reflects the real-world performance and efficiency of the data transfer.
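The measurements themselves are standard iperf3 runs. A representative invocation is shown below; the server address is a placeholder, and each of our runs lasted three hours:

```
# On the last host: run iperf3 in server mode
iperf3 -s

# On the first host: send TCP traffic for three hours, reporting throughput every second
iperf3 -c 10.0.50.10 -t 10800 -i 1
```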
Establishing a Baseline
To establish the baseline for our testing, we started iperf3 without any packet manipulation. Testing was conducted for three hours (10,800 seconds) and resulted in a mean value of 804.6Mbps for throughput in the symmetric network experiment. Detailed results can be observed in Table 1.
| Statistic | Baseline (sym) throughput (Mbps) |
| --- | --- |
| Mean | 804.673506 |
| STD | 13.0217464 |
| Min. | 710 |
| 25% | 799.99 |
| 50% | 809.93 |
| 75% | 810.046 |
| Max. | 830.419 |
Table 1. Baseline results for throughput in the symmetric network experiment
We applied the same approach in an asymmetric network test, and we observed a mean value of 864.13Mbps, an increase of 7.3% compared to the values we observed in the symmetric network experiment. Detailed results for throughput in the asymmetric network experiment are shown in Table 2.
| Statistic | Baseline (asym) throughput (Mbps) |
| --- | --- |
| Count | 10800 |
| Mean | 864.139471 |
| STD | 14.647341 |
| Min. | 720.067 |
| 25% | 859.973 |
| 50% | 869.965 |
| 75% | 870.3815 |
| Max. | 900.002 |
Table 2. Baseline results for throughput in the asymmetric network experiment
Measuring Throughput in a Lossy Environment
Once the baseline was established, we introduced packet loss using the tc ("traffic control") utility. tc configures traffic control in the Linux kernel and has capabilities such as shaping, scheduling, policing, and dropping. It includes an extension called netem ("network emulation") that can add delay, packet loss, duplication, and other impairments to packets leaving a specific network interface. We used netem to introduce a fixed packet loss rate for each measurement.
The Curious Case of 1% Packet Loss Impacts
We started by introducing 1% packet loss on the Linux device marked H3 on the interface towards device H4.
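The netem configuration for this step looks like the following; the interface name is a placeholder for H3's interface facing H4:

```
# On H3: drop 1% of packets egressing toward H4
sudo tc qdisc add dev eth1 root netem loss 1%

# Verify that the qdisc is in place
tc qdisc show dev eth1
```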
Symmetric Test Results
In the symmetric network experiment, our testing shows a stark difference between the baseline and the 1% packet loss probing, with 1% packet loss resulting in a mean value of 235.5Mbps of throughput. On average, 1% of packet loss caused a 70.7% decrease in throughput!
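For clarity, that figure comes directly from the two mean values, with the drop expressed relative to the symmetric baseline:

$$\frac{804.67 - 235.51}{804.67} \approx 0.707$$

The other percentage decreases in this post follow the same relative calculation against the corresponding baseline or comparison run.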
Detailed results can be observed in Table 3. Figure 3 shows the plotted baseline throughput and throughput achieved while testing with 1% packet loss in the symmetric network experiment.
| Statistic | 1% loss (sym) throughput (Mbps) |
| --- | --- |
| Mean | 235.513105 |
| STD | 13.5692798 |
| Min. | 93.967 |
| 25% | 229.667 |
| 50% | 236.635 |
| 75% | 243.596 |
| Max. | 281.886 |
Table 3. Throughput results achieved with 1% packet loss in the symmetric network experiment
Figure 3 illustrates how large the gap is, on average, between the baseline measurements and the throughput achieved while testing with 1% packet loss in the symmetric network experiment.
Asymmetric Test Results
In the asymmetric network experiment, testing with 1% packet loss resulted in 222.49Mbps on average. Compared to the average results we achieved while establishing the baseline in the asymmetric network experiment, this represents a 74.2% decrease in throughput, an even larger drop than in the symmetric network experiment.
| Statistic | 1% loss (asym) throughput (Mbps) |
| --- | --- |
| Mean | 222.493196 |
| STD | 13.7883065 |
| Min. | 51.21 |
| 25% | 214.788 |
| 50% | 222.729 |
| 75% | 230.675 |
| Max. | 280.877 |
Table 4. Throughput results achieved with 1% packet loss in the asymmetric network experiment
Figure 4 shows the difference between the average throughput achieved while establishing the baseline for the asymmetric network experiment and the throughput achieved in the environment with 1% packet loss.
Lastly, comparing the impact of 1% packet loss on throughput in the symmetric and asymmetric networks reveals that the performance of the symmetric network was better in lossy conditions than that of the asymmetric network. We achieved 235.51Mbps testing on the symmetric network, while the same test resulted in 222.49Mbps on the asymmetric network, a 5.5% decrease in throughput on average.
Increasing the Packet Loss
We then increased the packet loss to 2% and measured what happened to the throughput. In the symmetric network experiment, throughput was 175.18Mbps on average, a 78.2% decrease compared to the baseline results achieved in the same network configuration. The throughput we achieved at 2% packet loss represents a 25.6% decrease compared to the test we conducted at 1% packet loss in the same environment.
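Raising the loss rate is a one-line change to the existing netem qdisc (again, the interface name is a placeholder):

```
# On H3: raise the emulated packet loss from 1% to 2%
sudo tc qdisc change dev eth1 root netem loss 2%
```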
Table 5 shows the detailed results. Figure 5 plots the throughput, indicating how large the difference was between testing at 2% packet loss and the baseline.
| Statistic | 2% loss (sym) throughput (Mbps) |
| --- | --- |
| Count | 10800 |
| Mean | 175.186034 |
| STD | 37.47976 |
| Min. | 11.93 |
| 25% | 158.08875 |
| 50% | 190.9065 |
| 75% | 199.863 |
| Max. | 223.724 |
Table 5. Throughput results achieved with 2% packet loss in the symmetric network experiment
Throughput at 2% packet loss in the asymmetric network experiment was 168.02Mbps, as shown in Table 6. This represents an 80.5% decrease in throughput compared to the baseline in the same environment. Figure 6 shows the plot indicating the difference between the baseline and throughput at 2% packet loss in the asymmetric network.
| Statistic | 2% loss (asym) throughput (Mbps) |
| --- | --- |
| Mean | 168.028878 |
| STD | 34.9090933 |
| Min. | 5.965 |
| 25% | 151.1405 |
| 50% | 182.448 |
| 75% | 191.893 |
| Max. | 212.788 |
Table 6. Throughput results achieved with 2% packet loss in the asymmetric network experiment
Comparing the throughput achieved in the asymmetric and symmetric networks at 2% packet loss, the 168.02Mbps in the asymmetric network represents a 4.1% decrease compared to the throughput achieved in the symmetric network.
Overall Results
Next, we wanted to understand the adverse effects of packet loss up to 10% in both symmetric and asymmetric networks. The results are shown in Tables 7 & 8 and Figures 7 & 8 below.
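The sweep can be scripted as a simple loop over loss rates. Below is a minimal sketch; the hostnames, interface name, and server address are placeholders, with the impairment applied on H3 while iperf3 runs between the first and last hosts:

```
#!/usr/bin/env bash
# Sweep emulated packet loss from 1% to 10%, running a timed iperf3 test at each step
for loss in $(seq 1 10); do
  ssh h3 "sudo tc qdisc replace dev eth1 root netem loss ${loss}%"
  iperf3 -c 10.0.50.10 -t 10800 -i 1 --json > "loss_${loss}pct.json"
done

# Remove the impairment when finished
ssh h3 "sudo tc qdisc del dev eth1 root"
```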
| Throughput (Mbps) | 1% | 2% | 3% | 4% | 5% | 6% | 7% | 8% | 9% | 10% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 235.51 | 175.19 | 109.76 | 65.68 | 41.37 | 23.95 | 16.75 | 11.00 | 7.52 | 5.29 |
| STD | 13.57 | 37.48 | 46.68 | 36.09 | 25.48 | 17.31 | 12.16 | 8.40 | 5.97 | 4.33 |
| Min. | 93.97 | 11.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 229.67 | 158.09 | 74.56 | 37.77 | 21.38 | 9.94 | 6.96 | 4.97 | 2.98 | 1.99 |
| 50% | 236.64 | 190.91 | 111.86 | 61.67 | 37.77 | 19.89 | 13.92 | 8.95 | 5.97 | 3.98 |
| 75% | 243.60 | 199.86 | 150.14 | 89.53 | 57.18 | 33.81 | 23.37 | 15.41 | 9.95 | 6.96 |
| Max. | 281.89 | 223.72 | 201.33 | 175.49 | 149.62 | 119.30 | 87.50 | 68.59 | 46.76 | 37.78 |
Table 7. Throughput results achieved in the symmetric network while increasing packet loss up to 10%
| Throughput (Mbps) | 1% | 2% | 3% | 4% | 5% | 6% | 7% | 8% | 9% | 10% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 222.49 | 168.03 | 106.43 | 63.57 | 36.59 | 24.99 | 15.52 | 10.82 | 36.59 | 15.52 |
| STD | 13.79 | 34.91 | 44.62 | 34.81 | 24.44 | 16.93 | 11.58 | 8.26 | 24.44 | 11.58 |
| Min. | 51.21 | 5.97 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 214.79 | 151.14 | 72.57 | 35.80 | 16.90 | 11.93 | 5.97 | 4.97 | 16.90 | 5.97 |
| 50% | 222.73 | 182.45 | 108.35 | 59.66 | 31.84 | 21.87 | 11.94 | 8.95 | 31.84 | 11.94 |
| 75% | 230.68 | 191.89 | 144.67 | 87.00 | 51.70 | 34.79 | 21.87 | 14.92 | 51.70 | 21.87 |
| Max. | 280.88 | 212.79 | 188.91 | 163.07 | 148.64 | 118.81 | 82.03 | 63.64 | 148.64 | 82.03 |
Table 8. Throughput results achieved in the asymmetric network while increasing packet loss up to 10%
Looking at these results, we can safely conclude that even a “small” amount of packet loss can have catastrophic effects on throughput. In this study, 1% packet loss caused a 70.7% decrease in throughput. If this were a customer-facing application, it could translate into a poor user experience. And while throughput continued to decline as the percentage of sustained loss increased, it is evident that most of the damage was done early, at the lowest loss rates.
What does this mean in terms of the actual impact on application performance?