Google Cloud Breaks UniSuper and Is Kind of Sorry


Welcome! There are quite a few new subscribers since I posted S3 Is Showing Its Age last week. I normally do three different kinds of posts: roundups, interviews, and editorials. This week, I’m doing an editorial on Google’s recent UniSuper outage. Click here to read some of my other popular posts.


I am a big Google Cloud fan. We ran Google Cloud at WePay, my previous employer. I just got through praising Google Cloud Storage in S3 Is Showing Its Age. So it pains me to write this post. But it’s important to talk about a company’s failures as well as their successes. Google Cloud’s response to its recent outage definitely falls into the “failure” category. I see a few problems:

  1. Google still hasn’t figured out enterprise customers

  2. Google’s incident remediation appears cursory

  3. Google Cloud is gaining a reputation for instability

Before we get to the problems, let’s review what happened. The outage I’m referring to impacted UniSuper, an Australian superannuation (retirement) fund. UniSuper’s customer portal (and related user services) went offline for over a week when Google accidentally deleted one of UniSuper’s private clouds. Actual trading was not impacted. The issue was caused by a (you guessed it) rather mundane misconfiguration.

Google operators followed internal control protocols. However, one input parameter was left blank when using an internal tool to provision the customer’s Private Cloud. As a result of the blank parameter, the system assigned a then unknown default fixed 1 year term value for this parameter.

After the end of the system-assigned 1 year period, the customer’s GCVE Private Cloud was deleted. No customer notification was sent because the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool, and not due to a customer deletion request. Any customer-initiated deletion would have been preceded by a notification to the customer.

Both Google and UniSuper have posted incident summaries with more details.
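To make the failure mode concrete, here is a minimal Python sketch of how a blank input can quietly turn into an expiry date. Google has not published the internal tool, so every name here (`PrivateCloud`, `provision_with_silent_default`, `provision_strict`, `DEFAULT_TERM_DAYS`) is hypothetical, and the strict variant is just one possible guard, not what Google actually shipped.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical stand-in for the "default fixed 1 year term" in Google's summary.
DEFAULT_TERM_DAYS = 365


@dataclass
class PrivateCloud:
    name: str
    delete_on: Optional[date]  # None means "never auto-delete"


def provision_with_silent_default(
    name: str, term_days: Optional[int], today: date
) -> PrivateCloud:
    """Failure mode: a blank term quietly becomes a one-year expiry."""
    if term_days is None:
        term_days = DEFAULT_TERM_DAYS
    return PrivateCloud(name, today + timedelta(days=term_days))


def provision_strict(
    name: str, term_days: Optional[int], today: date
) -> PrivateCloud:
    """Safer variant: a blank term is an error, so the operator must choose."""
    if term_days is None:
        raise ValueError("term_days is required; pass 0 for no expiry")
    delete_on = today + timedelta(days=term_days) if term_days > 0 else None
    return PrivateCloud(name, delete_on)
```

With the first function, `provision_with_silent_default("demo", None, date(2022, 5, 1))` produces a cloud scheduled for deletion a year later without anyone having asked for that. The second refuses to proceed until someone states the term explicitly.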

What caught my eye was not the incident, but Google’s response. The tone of their incident write-up is bad. The entire post reads as though Google feels the negative press they received was unfair. Google repeatedly highlights that the issue impacted a single customer in a single region, that the issue was unprecedented, and that the issue impacted only one of UniSuper’s many Google Cloud VMware Engine (GCVE) instances.

Nowhere does Google’s post even apologize. In fact, the only apology I can find is on UniSuper’s website. Instead, Google’s incident write-up ends with a claim that Google Cloud is one of, “the most resilient and stable cloud infrastructure in the world.” They conclude with a link to an “independently validated” report attesting to their stability. While this may be true, I find it completely tone-deaf. So much for customer empathy.

This response is, to me, yet another indication that Google still doesn’t understand its enterprise customers’ needs. Enterprises expect more than a terse single-page remediation with no apology. Enterprises also expect deletions to be staged before hard-deleting anything.
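By staged deletion, I mean something like the following minimal Python sketch: a delete request only marks the resource and notifies the owner, and the permanent purge is allowed only after a grace period. The names (`Resource`, `soft_delete`, `restore`, `can_purge`) and the 30-day window are assumptions for illustration, not a description of how GCVE or any Google system works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

# Assumed recovery window; pick whatever fits the product.
GRACE_PERIOD = timedelta(days=30)


@dataclass
class Resource:
    name: str
    deleted_at: Optional[datetime] = None  # set when soft-deleted


def soft_delete(
    res: Resource, now: datetime, notify: Callable[[str], None]
) -> None:
    """Stage 1: mark the resource deleted and tell the owner; keep the data."""
    res.deleted_at = now
    purge_date = (now + GRACE_PERIOD).date()
    notify(f"{res.name} is scheduled for permanent deletion on {purge_date}")


def restore(res: Resource) -> None:
    """Undo a staged deletion any time before the purge."""
    res.deleted_at = None


def can_purge(res: Resource, now: datetime) -> bool:
    """Stage 2: hard deletion is allowed only after the grace period."""
    return res.deleted_at is not None and now - res.deleted_at >= GRACE_PERIOD
```

The point of the design is that every deletion path, whether customer-initiated or triggered by an internal default, goes through `soft_delete`, so the owner always gets a notification and a window in which `restore` still works.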

Unsurprisingly, Google has earned a dubious entry on UniSuper’s outage FAQ:

Will we stay with Google Cloud moving forward?

UniSuper has and always will take our responsibility to deliver secure, reliable services to our members extremely seriously. Google Cloud is not the only cloud service provider UniSuper utilises, and this planning has ensured our ability to restore services and minimise data loss.

While a full root cause analysis is ongoing, Google Cloud has confirmed this is an isolated one-of-a-kind issue that has not previously arisen elsewhere.

We will assess this incident and ensure we are well positioned to deliver services for our members.

This is not a strong statement of commitment from UniSuper, and I don’t blame them. It’s also an example of a well-worded enterprise-quality statement; something Google could learn from.

We loved Google Cloud at WePay, but I often got indications that Google didn’t understand its enterprise customers. Google ran its cloud product the way it ran its other software. Many security features were missing at the time. And when we spoke to product managers, they would often give us a lesson on how Google did things, rather than listening to our needs.

I recall debugging a production issue with a site reliability engineer (SRE) sitting next to me. We both had the Google Cloud console up, and I asked the SRE to press a button for me on his console. Upon loading the page, we discovered his console didn’t have the button I had—one of us was in an A/B test that the other wasn’t. You don’t want to A/B test user interfaces in an organization when they’re in the middle of a production outage.

About this time, Diane Greene was brought in from VMware to run Google Cloud. We were told she “understood enterprise” and that these issues would be addressed. During (and after) Greene’s tenure, Google Cloud’s security offerings did improve. But she left Google Cloud in 2019. I can’t comment on Google Cloud’s enterprise maturity over the last few years, but Google’s UniSuper incident response doesn’t leave me feeling good.

Google also seems to have done only a superficial level of remediation on the issue. Normally, when an incident like this occurs, you would expect a fairly exhaustive set of actions that span multiple teams and products. Chapter 10 of Google’s own SRE book states:

Rather than only investigating the proximate area of the system failure, the postmortem explores the impact and system flaws across multiple teams.

Google’s remediation summary lists only three steps:

  1. We deprecated the internal tool that triggered this sequence of events. This aspect is now fully automated and controlled by customers via the user interface, even when specific capacity management is required.  

  2. We scrubbed the system database and manually reviewed all GCVE Private Clouds to ensure that no other GCVE deployments are at risk.

  3. We corrected the system behavior that sets GCVE Private Clouds for deletion for such deployment workflows.

This feels very proximate to me. They’ve simply automated the exact deletion flow that led to the outage, made sure that no other private clouds were misconfigured, and “corrected system behavior” for the deletion workflow.

Google could have audited their other deletion flows, their other tools, and their other products. They could have modified deletion workflows to stage deletions. Instead, they did none of this. Why?

I’m not alone in this; Hacker News’s top comment on the subject is all about Google’s weak remediation. Separately, an engineer I was having coffee with today expressed the same sentiment: shock at the lack of depth in Google’s remediation.

Google’s remediation indicates that they don’t understand how damaging this outage was to their brand.

I have always thought of Google Cloud as stable. We certainly had stability issues with Google Cloud at WePay, but it was fairly run-of-the-mill stuff. So I was surprised to see that many users (with extensive multi-cloud experience) regard Google Cloud as less stable.

This is a dangerous sentiment for Google. If Google Cloud is regarded as less stable, enterprises aren’t going to use it. This begs the question, does Google even use it? I asked someone from Waymo recently if they use Google Cloud for their infrastructure. The answer did not instill confidence.

If Google won’t use their own cloud service, what business do they have selling it to others? This is not the signal you want to send to enterprise buyers.


Support this newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.



I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.