NOTE FROM SOLOGIC: This summary was provided by Google. We used this summary to create the cause and effect chart.
SUMMARY:
On Friday 5 August 2016, some Google Cloud Platform customers experienced increased network latency and packet loss to Google Compute Engine (GCE), Cloud VPN, Cloud Router and Cloud SQL, for a duration of 99 minutes. If you were affected by this issue, we apologize. We intend to provide a higher level of reliability than this, and we are working to learn from this issue to make that a reality.
DETAILED DESCRIPTION OF IMPACT:
On Friday 5th August 2016 from 00:55 to 02:34 PDT a number of services were disrupted:
Some Google Compute Engine TCP and UDP traffic had elevated latency. Most ICMP, ESP, AH and SCTP traffic inbound from outside the Google network was silently dropped, resulting in existing connections being dropped and new connections timing out on connect.
Most Google Cloud SQL first generation connections from sources external to Google failed with a connection timeout. Cloud SQL second generation connections may have seen higher latency but not failure.
Google Cloud VPN tunnels remained connected; however, there was complete packet loss for data through the majority of tunnels. As Cloud Router BGP sessions traverse Cloud VPN, all sessions were dropped.
All other traffic was unaffected, including internal connections between Google services and services provided via HTTP APIs.
ROOT CAUSE:
While removing a faulty router from service, a new procedure for diverting traffic from the router was used. This procedure applied a new configuration that resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US. As these announcements were highly specific they took precedence over the normal routes to Google's network and caused a substantial proportion of traffic for the affected network ranges to be directed to this one point of presence. This misrouting directly caused the additional latency some customers experienced.
Additionally this misconfiguration sent affected traffic to next-generation infrastructure that was undergoing testing. This new infrastructure was not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter. This blocked traffic on the affected IP addresses that was routed through the affected point of presence to Cloud VPN, Cloud Router, Cloud SQL first generation and GCE on protocols other than TCP and UDP.
REMEDIATION AND PREVENTION:
Mitigation began at 02:04 PDT when Google engineers reverted the network infrastructure change that caused this issue, and all traffic routing was back to normal by 02:34. The system involved was made safe against recurrences by fixing the erroneous configuration. This includes changes to BGP filtering to prevent this class of incorrect announcements.
We are implementing additional integration tests for our routing policies to ensure configuration changes behave as expected before being deployed to production. Furthermore, we are improving our production telemetry external to the Google network to better detect peering issues that slip past our tests.
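The root-cause mechanism described above (a more-specific announcement from a single point of presence taking precedence over the normal routes) and the kind of pre-deployment routing-policy check the remediation mentions, as an added Python example. It is not code from Google or Sologic; the prefixes are documentation ranges, and the PoP names and the audit rule are assumptions made for illustration.

```python
import ipaddress

# Constructed sample data: documentation prefix, hypothetical PoP names.
APPROVED = [ipaddress.ip_network("198.51.100.0/24")]  # aggregate normally announced
ALL_POPS = {"pop-a", "pop-b", "pop-c"}                # PoPs expected to announce it


def best_route(dst, rib):
    """Longest-prefix match: the (prefix, pops) entry a remote router would prefer."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, pops) for net, pops in rib.items() if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen, default=None)


def audit_announcements(rib, approved=APPROVED, all_pops=ALL_POPS):
    """Flag prefixes that are more specific than an approved aggregate and are
    announced from only a subset of the PoPs that carry the aggregate."""
    findings = []
    for net, pops in rib.items():
        for agg in approved:
            if net != agg and net.subnet_of(agg) and set(pops) < set(all_pops):
                findings.append((str(net), str(agg), sorted(pops)))
    return findings


def build_rib(extra=None):
    """Baseline table (aggregate from every PoP) plus any proposed announcements."""
    rib = {APPROVED[0]: set(ALL_POPS)}
    rib.update(extra or {})
    return rib


BASELINE = build_rib()
# Proposed change under test: a /26 inside the aggregate, announced by one PoP only.
MISCONFIGURED = build_rib({ipaddress.ip_network("198.51.100.64/26"): {"pop-c"}})

if __name__ == "__main__":
    for name, rib in (("baseline", BASELINE), ("misconfigured", MISCONFIGURED)):
        net, pops = best_route("198.51.100.70", rib)
        print(f"{name}: 198.51.100.70 -> {net} via {sorted(pops)}")
        print(f"  audit findings: {audit_announcements(rib)}")
```

Run against a proposed configuration before deployment, a non-empty findings list marks a change that would concentrate traffic for part of an aggregate at a subset of sites.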