There was a time when people used to ask their friends, “Are you on Facebook? Let me add you.” Now the question has changed to “Is your Facebook working? Mine is not.” Facebook is one of the biggest social media platforms with more than 3 billion users. On 4th October 2021, Facebook, WhatsApp, and Instagram went down for almost 6 long hours. Users on these platforms panicked and rebooted their devices, hoping that the clock icon on their pending WhatsApp messages would change to tick marks and that the other platforms would come back too.
The outage has been touted as one of the biggest in history, and Mark Zuckerberg reportedly lost a whopping $7 billion as Facebook's shares fell. The global outage made us realize how connected we are to these platforms and how significant the company's role is in our day-to-day lives.
What is the network backbone?
If the outage is to be explained in the simplest words, we can say that one wrong command took down the accounts of more than 3 billion people worldwide and cost Mr. Zuckerberg about $7 billion.
Well, with a company as massive as Facebook, it is not just one command that can cause an outage. It was an unusual combination of errors that brought down Facebook and the other platforms.
The outage was triggered by the system that manages the capacity of Facebook's global backbone network. The backbone is the network that connects all of the company's computing facilities. It consists of tens of thousands of miles of optical fiber laid around the globe and linking the data centers. The data centers come in different forms: some are big buildings full of machines that store data and process heavy computational loads, while smaller facilities connect the backbone to the broader internet.
How exactly did the outage happen?
The Facebook feed loads when you open the app. To make that happen, your device sends a request for data to the closest facility, which passes it over the backbone network to a larger data center. That is where the data is retrieved and processed before it is sent back to your phone over the network.
Routers manage the traffic between the facilities. Maintaining this infrastructure is extensive work handled by engineers, and parts of the backbone are routinely taken offline for maintenance. This can be anything from repairing fiber lines to updating software or adding capacity.
On 4th October, during one such routine maintenance job, a command was issued to assess the availability of the global backbone capacity. It unintentionally took down the connections in the backbone network, which disconnected Facebook data centers across the globe. Facebook's systems are designed to audit such commands and stop a network failure like this from happening, but a bug in the audit tool prevented it from blocking the command, and the disconnection happened.
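The idea of auditing a command before it runs can be shown with a minimal Python sketch. Everything in it is hypothetical (the link names, `audit_command`, `apply_command`); it is not Facebook's actual tool, only an illustration of how a check can block a command that would leave no backbone link up, and how a faulty check lets the same command through.

```python
# Hypothetical model: each backbone link is either up (True) or down (False).

def audit_command(links, links_to_take_down):
    """Return True if at least one link would stay up after the command."""
    remaining = [name for name, up in links.items()
                 if up and name not in links_to_take_down]
    return len(remaining) > 0


def apply_command(links, links_to_take_down, audit=audit_command):
    """Take the requested links down, but only if the audit approves."""
    if not audit(links, links_to_take_down):
        return False  # command blocked, nothing changes
    for name in links_to_take_down:
        if name in links:
            links[name] = False
    return True
```

With a working audit, a command that would take down the last remaining link is rejected. Passing a broken audit that always approves, for example `audit=lambda links, targets: True`, lets the same command disconnect everything, which is roughly the situation described above.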
What was the second issue that happened when data centers were disconnected?
After the total loss of connection with the data centers, a second problem emerged. One of the jobs of the smaller facilities is to respond to DNS queries. DNS is like the address book of the internet: it translates the simple web names we type into the browser into specific server IP addresses. These queries are answered by authoritative name servers that themselves sit at well-known IP addresses, and those addresses are advertised to the rest of the internet through the Border Gateway Protocol (BGP).
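As a toy illustration of the address-book idea, the sketch below maps names to IP addresses with a plain dictionary. The hostnames and addresses are made up (they use the reserved `example.com` domain and the `192.0.2.0/24` documentation range), and `resolve` is a hypothetical helper, not a real DNS client.

```python
# Hypothetical records held by an authoritative name server.
AUTHORITATIVE_RECORDS = {
    "www.example.com": "192.0.2.10",
    "chat.example.com": "192.0.2.20",
}


def resolve(hostname, reachable=True):
    """Return the IP address for a hostname.

    Returns None if the name server cannot be reached or the name is unknown.
    """
    if not reachable:
        return None  # the records still exist, but nobody can ask for them
    # DNS names are case-insensitive and may end with a trailing dot.
    return AUTHORITATIVE_RECORDS.get(hostname.lower().rstrip("."))
```

Calling `resolve("www.example.com")` returns the stored address, while `resolve("www.example.com", reachable=False)` returns `None` even though the record is still in the table.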
To keep operation reliable, the DNS servers disable their BGP advertisements when they cannot communicate with the data centers, because that indicates an “unhealthy connection”. In this outage, the entire backbone was taken out of operation, so every location declared itself unhealthy and withdrew its BGP advertisements. The DNS servers were still running but had become unreachable, which made it impossible for the rest of the internet to find Facebook's servers.
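The chain reaction can be sketched in a few lines of Python. The `DnsSite` class and `reachable_sites` function are hypothetical names used only for illustration, and the model reduces BGP to a single flag per site. It assumes, as described above, that a site stops advertising itself whenever it cannot reach the data centers.

```python
class DnsSite:
    """Hypothetical edge facility that answers DNS queries."""

    def __init__(self, name):
        self.name = name
        self.advertising = True  # is this site's route announced via BGP?

    def health_check(self, backbone_up):
        # A site that cannot reach the data centers treats itself as
        # unhealthy and withdraws its BGP advertisement.
        self.advertising = backbone_up
        return self.advertising


def reachable_sites(sites, backbone_up):
    """Return the names of the sites the internet can still route to."""
    return [site.name for site in sites if site.health_check(backbone_up)]
```

While the backbone is up, every site stays reachable. Once `backbone_up` is `False`, every site withdraws its advertisement at the same time and the list comes back empty, even though each DNS server is still running.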
It happened like a chain reaction and left the engineers baffled. These two obstacles were handled and corrected by Facebook's engineers, who worked round the clock.
This is the simplest explanation we could manage, and we hope it answers your questions about what happened on 4th October and why Facebook went down for so long.
Zindagi Technologies is an IT consulting company whose engineers and technicians have decades of combined experience in planning, designing, and implementing Data Centers. You can give us a call at +919773973971 and we can answer all your queries.
Senior Content Writer