The first week of November was a quiet one on the outage front, and this is entirely what you’d expect.
Outages follow a seasonal trend that sees numbers fluctuate in line with different times of the year and events. With no major calendar events that cause Internet traffic impacts at this time of year, it’s no surprise to see outage numbers trend lower week-on-week.
However, casting our eyes back over the past four weeks, we saw a record number of disruptions and outages for a single month. There were 1,081 ISP outages in the month of October to be precise, eclipsing a previous high of 1,007 outages in June this year. The high water mark before that was 999 in March of last year.
And yet, as we’ve alluded to in past weeks, October was an unusual month in the sense that we really only saw one disruption of any great note. The rest—and we are talking hundreds of disruptions—were either highly localized or passed by relatively unnoticed.
To understand how this could be the case, it’s worth backtracking to February–March of 2020 when the first indications of the pandemic started to show.
As everyone moved to set up remote work processes, network engineers weren’t exempt. Knowing they may not necessarily be able to physically attend sites, they appeared to implement more automation and remote ops practices and systems. However, this did not immediately occur en masse due to a view in some sectors that remote work requirements would be short-lived and that most of us would be back in close quarters with our racked equipment in no time.
Through the back half of 2020, as we now know, companies began to shift their ways of working more permanently. Under broader digital transformation efforts, they accelerated the move to cloud-native and as-a-service operating models, and away from running their own data centers.
Network ops models changed alongside that shift. Automated provisioning and management of infrastructure and network capacity accelerated considerably, to the point that observations of outage and disruption patterns are now more or less a study of machine-to-machine interactions.
Given the nature and unpredictability of outages, it’s difficult to pinpoint exactly when this trend began, but we started to see this more automated outage pattern in early 2021, which is indicative of a more automated approach to rectification and mitigation.
Case in point: very little of what we saw in October 2021 points to any great degree of human involvement or intervention. Instead, what we saw were disruptions or failures followed by what looked like a series of automated corrective actions.
This is apparent from the data on outage duration. In October, outage duration averaged 30 minutes, compared to 35 minutes in June and 39 minutes in March of last year (just comparing months where outage numbers previously hit all-time highs).
But a 30-minute outage in October looked different. It wasn’t a single 30-minute disappearance of routes or presence from the public-facing Internet. Instead, a single 30-minute outage had the appearance of a series of shorter-duration “blocks” that came together to form a single outage.
The uniformity of this block pattern suggests that an issue is being recognized and then corrected automatically, all in the background. It is happening too quickly for someone to be going in and making these changes by hand.
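To make the “blocks” idea concrete, here is a minimal sketch of how a series of short disruption blocks might be grouped into a single outage and checked for uniformity. It is an illustration only, not a description of how the outage data behind this post was produced: the `Block` type, the `group_blocks` and `summarize` helpers, the three-minute `max_gap` threshold, and the sample timings are all hypothetical, and the blocks are assumed not to overlap.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Block:
    """One short disruption block (hypothetical type for this sketch)."""
    start: int  # minutes since an arbitrary reference time
    end: int


def group_blocks(blocks, max_gap=3):
    """Group blocks into outages.

    Consecutive blocks separated by at most `max_gap` minutes are treated
    as part of the same outage. The threshold is an assumed value.
    """
    outages = []
    for block in sorted(blocks, key=lambda b: b.start):
        if outages and block.start - outages[-1][-1].end <= max_gap:
            outages[-1].append(block)
        else:
            outages.append([block])
    return outages


def summarize(outage):
    """Report the overall span of an outage and how uniform its blocks are."""
    durations = [b.end - b.start for b in outage]
    return {
        "span_minutes": outage[-1].end - outage[0].start,
        "block_count": len(outage),
        "mean_block_minutes": mean(durations),
        "block_stdev_minutes": pstdev(durations),
    }


# Made-up timings: five short blocks in quick succession, then an
# unrelated four-minute block an hour later.
observed = [
    Block(0, 5), Block(6, 11), Block(12, 17), Block(18, 23), Block(24, 30),
    Block(90, 94),
]

outages = group_blocks(observed)
summaries = [summarize(o) for o in outages]

for summary in summaries:
    print(summary)
```

In this made-up data, the first five blocks group into one 30-minute outage whose blocks are nearly identical in length. A low spread in block duration is the kind of signal that would point to scripted, repeated corrective actions rather than manual intervention.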
It’s worth noting that the move to automated ops isn’t just pandemic-related: it’s increasingly how tier one ISPs in particular are able to differentiate in a crowded market.
Service providers recognize that the bandwidth game—selling “speeds and feeds”—is a growth- and market-limiting move. Customer purchase agreements are increasingly made on “service quality” terms, particularly as most customer organizations now rely on digital systems, served over infrastructure they do not own, to conduct their business.
ISPs that differentiate in the market are those that offer uptime and availability guarantees. Those guarantees provide customers with confidence that the impact of an outage or disruption will be small or unnoticeable. But to do this, ISPs need automated playbooks in place to be able to remediate or mitigate an issue quickly.
October 2021 provided fresh indications that this is occurring.