The accidental deletion by Google Cloud of Australian superannuation fund UniSuper’s entire cloud subscription, and the subsequent weeks-long service disruption triggered by this unprecedented occurrence, garnered worldwide attention when details emerged last month – and caused system administrators, chief technology officers and executive leadership teams around the world to sit up and take notice.
While disaster was averted thanks to backups held with an as-yet-unconfirmed “additional service provider”, it is worth reflecting on what lessons might be drawn from this unfortunate turn of events for organizations already committed to, or considering the benefits of, a migration to the cloud.
Lesson number one, and certainly the most important: UniSuper had global replication across two Google Cloud geographies – meaning that, like all prudent institutions, they had planned for potential data loss and taken advantage of the kind of replication the cloud offers at the touch of a button. However, this did not protect them, as the misconfiguration impacted both locations simultaneously.
It was UniSuper’s decision to adopt this approach, with backups held at a third party, that saved the day and allowed service to be restored – a rarely used, but clearly still very necessary, failsafe. This backup rotation strategy (often called the ‘3-2-1 principle’) has been around since long before the cloud, first developed in the days when data was backed up to tape and then taken to another physical location.
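For anyone unfamiliar with it, the 3-2-1 rule is simple enough to express in a few lines of code. The sketch below is purely illustrative (the copy records and names are hypothetical, not a real inventory), but it captures the test: at least three copies of the data, on at least two different media, with at least one copy held off-site or with an independent provider.

```python
from dataclasses import dataclass

@dataclass
class BackupCopy:
    """One copy of the data: where it lives and on what kind of medium."""
    location: str   # e.g. "gcp-region-a", "third-party-provider"
    medium: str     # e.g. "cloud-disk", "object-storage", "tape"
    offsite: bool   # held outside the primary provider/site?

def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 different media,
    with at least 1 copy held off-site or with another provider."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

# Replication alone, however convenient, is not enough:
same_cloud_only = [
    BackupCopy("gcp-region-a", "cloud-disk", offsite=False),
    BackupCopy("gcp-region-b", "cloud-disk", offsite=False),
]
print(satisfies_3_2_1(same_cloud_only))   # False

# Add an independent, off-site copy and the rule is satisfied:
with_third_party = same_cloud_only + [
    BackupCopy("third-party-provider", "object-storage", offsite=True),
]
print(satisfies_3_2_1(with_third_party))  # True
```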
All well and good, of course, but why did it happen at all? When the story broke, an expert observer might have been forgiven for questioning how and why there were no controls in place to prevent such a catastrophic event, given the huge emphasis placed on automation and safeguards around provisioning cloud services. This is something we have grown used to, and the instrumentation is usually very good – though apparently not in this case, which makes it all the more surprising.
The evidence suggests that this account deletion happened extremely swiftly. One might conclude that it was scripted in some way, and that the script was replicated almost instantaneously – both sites would have needed to truncate a serious amount of data, drop a serious amount of code, and delete a serious amount of configuration. If true, then a lot had to go wrong to precipitate the outcome we saw. How did nobody notice? Was there a lack of instrumentation, monitoring or alerting?
In their joint statement following the account deletion, UniSuper CEO Peter Chun and Google Cloud CEO Thomas Kurian described it as an “isolated, ‘one-of-a-kind occurrence’”, and subsequent communications maintained this line.1 However, in the last week of May, Google released more details.2 The good news is that – as per previous statements – this particular sequence of events appears to have been a genuine one-off.
The issue was caused by an internal tool used in some deployments of an earlier version of a specific Google Cloud service (GCVE, which provides a way to convert VMware workloads to run in Google Cloud).3 All the processes were followed correctly – but the attribute used to determine the lifetime of the subscription was inadvertently left blank, so a default value of one year was applied. Accordingly, on its one-year anniversary, the subscription effectively ‘ended’ and the account was closed. There were no customer notifications and no alerts, because this was not a customer deletion request.
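To be clear, Google has not published its internal tooling, so the following is nothing more than a minimal sketch of the general failure pattern, using hypothetical names: an optional field that nobody is prompted for quietly falls back to a one-year term, and nothing records that the value was defaulted rather than chosen.

```python
from datetime import date, timedelta

DEFAULT_TERM_DAYS = 365  # hypothetical fallback applied when the field is left blank

def provision_subscription(customer_id: str, term_days: int | None = None) -> dict:
    """Create a subscription record. If the caller omits term_days,
    a one-year default is silently applied."""
    term = term_days if term_days is not None else DEFAULT_TERM_DAYS
    return {
        "customer_id": customer_id,
        "created": date.today(),
        "expires": date.today() + timedelta(days=term),
        # Nothing here flags that the expiry was defaulted rather than
        # chosen, and nothing schedules a customer notification.
    }

# The operator fills in every field they are asked for, but the tool
# never asks about the term, so the default quietly applies.
sub = provision_subscription("example-customer")
print(sub["expires"])  # one year from today; deletion then follows without warning
```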
Setting up this particular type of subscription is no longer possible – an update to the API deprecated the previous version, and deployments of this type are now managed in the usual way via the customer dashboard, with all the expected default alerting.
What is interesting is the apparent lack of testing (or at least, missing test cases) of the original tool, the API it used, the defaults applied to mandatory fields, and the documentation describing the process to be followed. Ultimately this omission led to a bug in production – and a pretty costly one at that. Presumably there was also no process (automatic or otherwise) to inform a customer that a subscription was being deleted.
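Against the hypothetical sketch above, the missing test case is not hard to imagine. The example below assumes the provision_subscription helper from that sketch; it simply asserts that an omitted term should never silently translate into a hard expiry date.

```python
import unittest

from provisioning_sketch import provision_subscription  # the hypothetical helper sketched above

class ProvisioningDefaultsTest(unittest.TestCase):
    def test_omitted_term_must_not_silently_expire(self):
        """Omitting the term should fail loudly (or at least record that a
        default was applied and trigger a notification), never quietly
        schedule an end date."""
        sub = provision_subscription("test-customer")  # term deliberately omitted
        # Against the defaulted behaviour sketched above this assertion fails,
        # which is exactly the signal that was missing before production.
        self.assertIsNone(sub.get("expires"),
                          "an omitted term should not produce a hard expiry date")

if __name__ == "__main__":
    unittest.main()
```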
This then leads on to an assessment of the blurred line between cloud provider and customer. The shared responsibility model has been in place since the inception of the modern cloud, but it really came to the fore during this particular event, manifesting as what Google has described as “shared fate”.4 Both the customer and Google worked together to diagnose and then solve the problem, with the action taken by the customer to protect their assets via a backup to a third party saving the day.
If you are not following this approach at your organisation – whether by using another cloud provider in a multi-cloud architecture or by ensuring regular independent backups – it is time to take note. Most of the benefits of a cloud migration are clear and well documented. Remove the headache of maintaining your hardware, operating systems, networking and so on by transferring those tasks to a specialist provider, leaving your business to focus on what it should prioritise: the functionality and service that make you stand out from the crowd.
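As one possible starting point, a regular independent copy can be as simple as the sketch below: pull an object out of Google Cloud Storage and push it to a bucket held with a second, unrelated provider. The bucket and object names are placeholders, credentials for both providers are assumed to be configured already, and the google-cloud-storage and boto3 client libraries are used.

```python
# pip install google-cloud-storage boto3
from google.cloud import storage   # GCS client (uses Application Default Credentials)
import boto3                       # S3-compatible client for the second provider

GCS_BUCKET = "my-primary-bucket"       # hypothetical names; substitute your own
OFFSITE_BUCKET = "my-offsite-bucket"
OBJECT_NAME = "nightly/members.db.gz"

def copy_to_second_provider(object_name: str) -> None:
    """Download one object from GCS and re-upload it to an independent,
    S3-compatible provider, so a single cloud account deletion cannot
    take every copy with it."""
    gcs = storage.Client()
    blob = gcs.bucket(GCS_BUCKET).blob(object_name)
    local_path = "/tmp/" + object_name.replace("/", "_")
    blob.download_to_filename(local_path)

    s3 = boto3.client("s3")  # endpoint and credentials for the second provider
    s3.upload_file(local_path, OFFSITE_BUCKET, object_name)

if __name__ == "__main__":
    copy_to_second_provider(OBJECT_NAME)
```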
The UniSuper event makes it clear that you should not rely on cloud providers for everything – mistakes are still made, albeit extremely rarely. Ultimately, you still own your content, your data. Without it you are in serious trouble, so look after it. Furthermore, your organisation has ultimate responsibility for both the availability and quality of the services provided to your customers, which is particularly critical given the ‘high stakes’ nature of financial services. So what does this all mean for your organisation and your relationship with the cloud? Is it safer within your four walls after all? I do not think so.
Cloud providers have been working hard to demonstrate the availability, security, compliance and resiliency of their services and the results speak for themselves. This is an extremely rare incident, and cloud providers remain ahead of the game when it comes to keeping services available, patched and secure to a level that others would struggle to match. Levels of automation are significant, ensuring deployment consistency and reliability.
Given the globalised nature of cloud adoption and the workloads deployed, it is fair to say that security incidents are also extremely rare. That said, innovation continues to accelerate at pace, with new services and features being released on a regular cadence, so it is important to be wary and to recognize there will always be some risk, albeit slim, that unique conditions could combine to trigger something similar in future.
So keep on migrating – statistically, it is the right thing to do. However, observing these three guidelines will help ensure everything runs smoothly:
REFERENCES
1 https://www.unisuper.com.au/about-us/media-centre/2024/a-joint-statement-from-unisuper-and-google-cloud
2 https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident
3 https://cloud.google.com/vmware-engine?hl=en
4 https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate