On Sunday afternoon, Google’s cloud services suffered notable downtime that lasted for several hours and affected third-party apps. Google today detailed the outage’s cause and what it’s doing to prevent future incidents.
Google started by apologizing for the outage that caused “low performance and elevated error rates on several Google services.” On the consumer side, YouTube, Gmail, Drive, and other products were impacted, along with third-party services that rely on Google Cloud, like Apple’s iCloud and Snapchat.
- YouTube measured a 10% drop in global views during the incident
- Google Cloud Storage measured a 30% reduction in traffic
- Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email
- Low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal
The issue was due to a “configuration change” intended for a “small number of servers in a single region” accidentally being applied to a “larger number of servers across several neighboring regions.” This caused the affected Cloud regions to lose more than half of their available network capacity, which resulted in congestion.
The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
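To illustrate the idea Google describes here, shedding larger, less latency-sensitive traffic first so that small latency-sensitive flows still get through, below is a minimal Python sketch. It is only a toy model of priority-based load shedding; the flow names, sizes, and capacity figure are hypothetical and not drawn from Google's actual networking systems.

```python
# Toy model of priority-based load shedding under congestion.
# Illustrative only; not Google's networking stack. All names and
# numbers below are hypothetical.
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    size_gbps: float
    latency_sensitive: bool  # e.g. interactive requests vs. bulk transfers


def shed_load(flows: list[Flow], capacity_gbps: float) -> list[Flow]:
    """Admit latency-sensitive flows first, then bulk traffic, until capacity runs out."""
    admitted: list[Flow] = []
    used = 0.0
    # Smaller, latency-sensitive traffic is preferred over larger bulk flows.
    for flow in sorted(flows, key=lambda f: (not f.latency_sensitive, f.size_gbps)):
        if used + flow.size_gbps <= capacity_gbps:
            admitted.append(flow)
            used += flow.size_gbps
    return admitted


# With capacity cut to well under half, bulk transfers are dropped while
# small, latency-sensitive flows still fit.
flows = [
    Flow("search query", 0.5, True),
    Flow("gmail sync", 1.0, True),
    Flow("youtube video segment", 6.0, False),
    Flow("cloud storage bulk transfer", 10.0, False),
]
print([f.name for f in shed_load(flows, capacity_gbps=4.0)])
# -> ['search query', 'gmail sync']  (the larger bulk flows are shed)
```

This mirrors the behavior in the quoted passage: the congestion is not resolved, but the traffic that remains is the small, urgent kind.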
While the issue was detected “within seconds,” the same network congestion hampered engineers trying to restore the correct configurations. Today’s post alludes to Google bringing on “additional help” to parallelize restoration efforts.
Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.
Google is now working to ensure that this cascading series of events does not occur again:
With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration. We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event.