On June 2, a major disruption to the Google network left Google's own services, and various web services built on Google Cloud, unavailable or slow in some regions. Google Cloud has explained the cause of this large-scale outage on its official blog.
Because of the disruption in early June, services provided by Google, such as Google Cloud, YouTube, and G Suite, as well as web services that rely on Google Cloud, including iCloud, experienced problems in the US and parts of Europe. Google's cloud monitoring team (Google 24×7) explained on the official blog that a configuration change intended for servers in one specific region was mistakenly applied to servers in neighboring regions as well. Software configuration errors and a bug also contributed to the incident.
The computers in Google's data centers are divided into several logical clusters. Each cluster has its own dedicated management software, which handles infrastructure changes for disaster recovery, data center maintenance, and the automatic triggering of events. When data center maintenance is set up as an event at Google, it is usually global maintenance; events that manage only the servers in one location are rare.
This time, the event was set up to stop the network control servers in one particular area for maintenance. However, when the maintenance event started at 11:45 on June 2, a software bug caused the setting to be applied so that the servers managing nearby areas stopped at almost the same time. As a result, the servers in the adjacent areas could no longer be used, half of the available network capacity was discarded, and the network became congested.
The Google engineering team began recovery work two minutes after the failure. The recovery was expected to take only a few minutes, but network congestion hampered debugging of the management software, and the software that automates maintenance events was finally stopped 1 hour and 16 minutes later. The settings were then rebuilt and redistributed to the servers at 14:03. Network capacity was restored at 15:19, and all services resumed at 16:10.
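The timeline above can be checked with a little arithmetic. The following is only a rough sketch: it assumes the failure began at the 11:45 start of the maintenance event and that the "1 hour and 16 minutes" is counted from the start of recovery work, neither of which the article states outright. The helper names `hhmm` and `fmt` are made up for this illustration.

```python
from datetime import timedelta


def hhmm(hours, minutes):
    """Represent a clock time as an offset from midnight."""
    return timedelta(hours=hours, minutes=minutes)


def fmt(offset):
    """Format an offset from midnight (or a duration) as H:MM."""
    total_minutes = int(offset.total_seconds()) // 60
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Times reported in the article.
maintenance_start = hhmm(11, 45)
settings_redistributed = hhmm(14, 3)
services_resumed = hhmm(16, 10)

# Assumption: the failure coincides with the start of the maintenance event.
recovery_start = maintenance_start + timedelta(minutes=2)

# Assumption: the 1 h 16 min interval is measured from the start of recovery.
automation_stopped = recovery_start + timedelta(hours=1, minutes=16)

print("Recovery work began:   ", fmt(recovery_start))
print("Automation stopped:    ", fmt(automation_stopped))
print("Settings redistributed:", fmt(settings_redistributed))
print("Total disruption:      ", fmt(services_resumed - maintenance_start))
```

Under these assumptions, the automation software was stopped shortly after 13:00, and the disruption lasted a little under four and a half hours from the start of the maintenance event to the full resumption of service.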
As a result of the outage, YouTube views dropped by 2.5% for one hour, and Google Cloud Storage reported a 30% decrease in traffic. Only a small share of users was affected, but that still meant millions of users were unable to send or receive e-mail.