At around 0:40 on October 5, a failure hit Facebook's systems and took all of them down. The outage affected not only Facebook but also Instagram, WhatsApp, Messenger, and Oculus, which Facebook owns, and the services remained inaccessible until around 7:00 that day. Internet infrastructure company Cloudflare has explained why Facebook went down worldwide.
The failure made Facebook and related services such as WhatsApp and Instagram impossible to reach. DNS resolution of these service names failed, and even some of the infrastructure IP addresses that support the services became unreachable. Cloudflare said it was as if someone had unplugged the cables of Facebook's data centers all at once and disconnected them from the Internet.
Cloudflare opened an internal incident ("Facebook DNS lookup returning SERVFAIL") around 1:51 on October 5, initially suspecting that the Facebook failures pointed to a problem with its own DNS resolver, 1.1.1.1. The investigation instead found that the cause lay in the Border Gateway Protocol (BGP). BGP is a mechanism for exchanging routing information between autonomous systems (ASes), the individual networks that make up the Internet. Simply put, BGP works like a navigation system for the Internet: given a destination, it shows a route to it.
Each AS has an AS number, and every AS must announce its routes to the Internet using BGP. Otherwise, no one can discover the AS or connect to it. Facebook, Instagram, and WhatsApp all use AS number AS32934. Because it operates its own AS, Facebook connects directly to the Internet rather than going through an Internet service provider.
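The effect of announcing and withdrawing routes can be sketched in a few lines of Python. This is only a toy model, not how real BGP routers are implemented: the `RoutingTable` class and the prefix `198.51.100.0/24` (a documentation range) are hypothetical, and only the AS number 32934 comes from the article.

```python
import ipaddress


class RoutingTable:
    """Toy view of the routes one network has learned through BGP."""

    def __init__(self):
        self.routes = {}  # prefix (ip_network) -> origin AS number

    def announce(self, prefix, origin_as):
        """Record that origin_as announced a route to prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Forget the route to prefix, as after a BGP withdrawal."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS of the most specific matching prefix, or None."""
        ip = ipaddress.ip_address(address)
        matches = [p for p in self.routes if ip in p]
        if not matches:
            return None  # no route: the destination cannot be reached
        best = max(matches, key=lambda p: p.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("198.51.100.0/24", 32934)  # hypothetical prefix for AS32934
print(table.lookup("198.51.100.53"))      # a route is known
table.withdraw("198.51.100.0/24")
print(table.lookup("198.51.100.53"))      # no route after the withdrawal
```

Once the only route covering an address is withdrawn, the lookup has nothing to return, which is the sense in which an AS that stops announcing its routes disappears from the Internet.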
Cloudflare keeps track of all the BGP updates and announcements it sees on its global network. That data confirmed the anomaly: a spike in routing changes from Facebook was recorded around 0:40 on the 5th.
Around 1:50, shortly after Facebook's DNS servers went offline, Cloudflare engineers found that 1.1.1.1 could not resolve Facebook.com. A system failure was suspected, and by then Facebook and its related services were effectively disconnected from the Internet.
Cloudflare then found that Facebook had stopped announcing the routes to its DNS prefixes. In other words, at the very least Facebook's DNS servers were unavailable at that point. As a result, Cloudflare's DNS resolver 1.1.1.1 could no longer answer queries asking for the IP addresses of Facebook.com or Instagram.com.
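A rough sketch of why a resolver ends up answering SERVFAIL in this situation follows. It is a simplified illustration, not Cloudflare's resolver logic: the function `resolve`, the server names, and the address `203.0.113.10` (a documentation range) are all hypothetical.

```python
def resolve(name, authoritative_servers, reachable, records):
    """Ask each authoritative server in turn and return (rcode, answer).

    reachable is the set of servers the resolver still has a route to;
    records stands in for the zone data those servers would serve.
    """
    for server in authoritative_servers:
        if server not in reachable:
            continue  # no route to this server, so the query times out
        if name in records:
            return ("NOERROR", records[name])
        return ("NXDOMAIN", None)
    # Every authoritative server was unreachable: the resolver gives up.
    return ("SERVFAIL", None)


servers = ["a.ns.example", "b.ns.example"]  # hypothetical name servers
zone = {"facebook.com": "203.0.113.10"}     # hypothetical address

print(resolve("facebook.com", servers, set(servers), zone))  # routes announced
print(resolve("facebook.com", servers, set(), zone))         # routes withdrawn
```

The name still exists in the zone in both calls. What changes is only whether the resolver can reach a server able to answer, which is why the failure shows up as SERVFAIL rather than as a missing domain.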
According to Cloudflare, Facebook and its related services are so large that an error inevitably leads to delays and repeated attempts that multiply the number of requests by tens of times. In fact, the number of requests for Facebook, WhatsApp, Messenger, and Instagram seen at 1.1.1.1 rose to nearly 30 times the usual level from around 0:40 on the 5th (15:40 UTC on the 4th). With Facebook's DNS servers unreachable, DNS resolvers around the world stopped resolving Facebook-related domains.
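The way failures multiply requests can be illustrated with simple arithmetic. The sketch below assumes that both applications and users retry after a failed lookup. The function `total_queries` and every number in it are hypothetical, chosen only to show how two layers of retries multiply, and are not measurements from the outage.

```python
def total_queries(clients, lookups_per_client, failing,
                  app_retries=0, user_retries=0):
    """Count DNS queries reaching a resolver in one period.

    When lookups fail, each one is retried app_retries times by the
    application, and the user repeats the whole action user_retries times.
    """
    base = clients * lookups_per_client
    if not failing:
        return base
    return base * (1 + app_retries) * (1 + user_retries)


normal = total_queries(clients=1000, lookups_per_client=10, failing=False)
outage = total_queries(clients=1000, lookups_per_client=10, failing=True,
                       app_retries=4, user_retries=5)
print(normal, outage, outage / normal)
```

Because the two retry layers multiply rather than add, even modest retry counts at each layer produce a large overall factor.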
In the wake of the Facebook outage, DNS queries for Twitter and for other platforms such as Signal, Telegram, and TikTok also increased. At 4:52 on the 5th, Facebook's CTO tweeted a sincere apology to everyone affected by the outage of Facebook services, saying that the company was experiencing network issues and that its teams were working as fast as possible to debug and restore service. According to reports, the failure affected the network backbone that interconnects all of Facebook's data centers.
BGP activity from Facebook's network resumed around 6:00 on the 5th, and around 6:20 it was confirmed that 1.1.1.1 could resolve the name Facebook.com again. By around 6:28, Facebook's recovery was confirmed. Regarding the outage, Facebook CEO Mark Zuckerberg said that Facebook, Instagram, WhatsApp, and Messenger were coming back online, apologized for the disruption, and said he knows how much people rely on the services to stay connected with the people they care about.
As for why Facebook's recovery took so long, one security researcher explained that the faulty update could not be corrected remotely, and that the people with physical access did not have network or logical access. Initially, Facebook gave no detailed explanation of how the BGP update that caused the failure came about. The Facebook engineering team later issued a statement saying that the failure was caused by a configuration change on the backbone routers that coordinate network traffic between its data centers.