The cause of the failure was an incorrectly formulated BGP Flowspec rule issued by CenturyLink, which operates the Level 3 backbone, as part of a security measure. BGP Flowspec is used to filter and redirect traffic, so the error led to serious routing problems inside the provider's network, which in turn affected the stability of the global Internet. Users in the US were hit hardest, of course, but echoes of the problem were felt around the world.
It's important to note that CenturyLink is America's third-largest telecom company, just behind AT&T and Verizon.
BGP Flowspec is defined by the IETF in RFC 5575 and is described as a new type of Network Layer Reachability Information (NLRI) carried over the multiprotocol extensions to BGP (MP-BGP). BGP FlowSpec is an alternative way of dropping DDoS attack traffic from a route, and it is considered a more fine-grained way of fending off an attack than RTBH (Remotely Triggered Black Hole filtering), in which all traffic from the attacking address, or all traffic to the destination address, is blocked. In general, RTBH is a "doomsday weapon" and a last resort for stopping an attack, since using it often gives the attacker exactly what he wanted: one of the addresses is cut off from the network.
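To make the contrast concrete, here is a minimal Python sketch of the RTBH idea; the prefix, addresses, and the use of the well-known blackhole community are illustrative assumptions, not details from the CenturyLink incident.

```python
from ipaddress import ip_address, ip_network

# Assumed scenario: a trigger router has announced the victim's /32 tagged
# with a blackhole community (commonly 65535:666), so edge routers install
# a discard route covering that prefix.
BLACKHOLED_PREFIXES = {ip_network("203.0.113.10/32")}

def edge_router_forwards(dst_ip: str) -> bool:
    """Return False when the destination falls inside a blackholed prefix."""
    return not any(ip_address(dst_ip) in net for net in BLACKHOLED_PREFIXES)

print(edge_router_forwards("203.0.113.10"))  # False: all traffic to the victim is dropped
print(edge_router_forwards("203.0.113.11"))  # True: neighbouring addresses are unaffected
```

This is exactly why RTBH is a last resort: the mitigation itself takes the victim address offline, which is usually what the attacker wanted in the first place.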
BGP FlowSpec is more subtle: it is essentially a firewall filter injected into BGP that matches specific ports and protocols and determines which traffic takes which route. Legitimate ("white") traffic reaches the destination address, while traffic identified as DDoS is dropped from the route. Traffic is matched against at least 12 NLRI components (a simplified sketch of such a filter follows the list):
- Destination Prefix. Specifies the destination prefix for the match.
- Source Prefix. Specifies the source prefix for the match.
- IP protocol. Contains a set of {operator, value} pairs used to match the protocol byte in the IP header.
- Port. Matches the source or destination port of TCP or UDP packets.
- Destination port. Matches the destination port of TCP or UDP packets.
- Source port. Matches the source port of TCP or UDP packets.
- ICMP type.
- ICMP code.
- TCP flags.
- Packet length. Matches the total IP packet length (excluding Layer 2 encapsulation but including the IP header).
- DSCP. The Diffserv Code Point (Class of Service) field.
- Fragment. A bitmask encoding the fragmentation flags (don't-fragment, is-a-fragment, first fragment, last fragment).
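As a rough illustration of how such a filter behaves, here is a minimal Python sketch. It is not a real router implementation: the class, its field names, the sample rule, and the addresses are invented for the example, and only a handful of the components above are modeled, without the {operator, value} ranges a real rule supports.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional

# Illustrative model of a Flowspec-style rule: match on a few NLRI
# components and decide whether a packet is forwarded or discarded.
@dataclass
class FlowspecRule:
    dst_prefix: Optional[str] = None   # destination prefix
    src_prefix: Optional[str] = None   # source prefix
    ip_protocol: Optional[int] = None  # 6 = TCP, 17 = UDP, 1 = ICMP
    dst_port: Optional[int] = None     # destination port
    src_port: Optional[int] = None     # source port
    action: str = "discard"            # e.g. discard, rate-limit, redirect

    def matches(self, pkt: dict) -> bool:
        """A packet matches if it satisfies every component the rule specifies."""
        if self.dst_prefix and ip_address(pkt["dst"]) not in ip_network(self.dst_prefix):
            return False
        if self.src_prefix and ip_address(pkt["src"]) not in ip_network(self.src_prefix):
            return False
        if self.ip_protocol is not None and pkt["proto"] != self.ip_protocol:
            return False
        if self.dst_port is not None and pkt.get("dport") != self.dst_port:
            return False
        if self.src_port is not None and pkt.get("sport") != self.src_port:
            return False
        return True

# Hypothetical rule: drop a UDP flood aimed at one host's DNS port,
# while every other kind of traffic to that host keeps flowing.
rule = FlowspecRule(dst_prefix="203.0.113.10/32", ip_protocol=17, dst_port=53)

attack_pkt = {"src": "198.51.100.7", "dst": "203.0.113.10", "proto": 17, "dport": 53}
web_pkt = {"src": "198.51.100.7", "dst": "203.0.113.10", "proto": 6, "dport": 443}

print(rule.matches(attack_pkt))  # True  -> discarded ("dropped from the route")
print(rule.matches(web_pkt))     # False -> forwarded as usual ("white" traffic)
```

A real Flowspec rule also supports numeric ranges and bitmask operators for each component, and actions beyond a plain discard, such as rate-limiting the flow or redirecting it for scrubbing.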
CenturyLink itself has not published a full-fledged incident report; it only mentions its data center near Ontario. However, the routing failure was serious enough to be noticed not only by ordinary users but also by CloudFlare engineers, who rely on CenturyLink as one of their major upstream providers.
It all started with a spike in 522 errors at 10:03 am UTC on August 30, according to the CloudFlare report.
CloudFlare's automatic failover re-routing system was able to cut the number of errors to 25% of the peak value, but problems with network connectivity and resource availability persisted and were global in nature. All of this took place in the window between 10:03 am UTC, when the outage began, and 10:11 am UTC. In those eight minutes, automation and engineers disconnected CloudFlare's infrastructure from CenturyLink in 48 (!) North American cities and redirected traffic to backup links from other providers.
Obviously, CloudFlare was not the only one doing this. However, it did not completely solve the problem. To illustrate the influence the troubled provider has on the telecom market of the United States and Canada, the company's engineers included the official coverage map of CenturyLink services:
In the US, the provider serves 49 million people, which means that for some of the customers mentioned in the CloudFlare report, and even for entire data centers, CenturyLink is the only available provider.
As a result of CenturyLink's near-total collapse, CloudFlare specialists recorded a 3.5% drop in global Internet traffic. Here is what it looked like on the graph of the six main providers the company works with; CenturyLink is shown in red.
The fact that the failure was global, and not just "a problem in a data center outside Ontario," as the provider itself put it, is evidenced by the volume of BGP routing updates. Normally about 2 MB of BGP updates are observed every 15 minutes, but CloudFlare experts recorded bursts of up to 26 MB (!).
These updates propagate information about changes in route availability, which lets networks respond flexibly to local problems. Update volumes 10-15 times larger than usual mean that almost the entire provider's network went down, or that it was experiencing extremely serious connectivity problems.
CloudFlare believes the failure was caused by an incorrect global BGP Flowspec rule that was received by the vast majority of routers, which then fell into a restart loop while trying to re-establish their connections. This fits the picture of an outage that lasted more than four hours, during which memory and CPU overload on the routers could have caused engineers to lose remote access to a number of nodes and management interfaces.
By the way, this story is far from unique. A little over a year ago, the Internet around the world "went down" through the fault of CloudFlare itself and a failure of its DNS, and the same company candidly admits to having had similar problems with Flowspec seven years ago, after which it abandoned its use.