If you follow my work, you know I love to make, and publish, videos on YouTube. I had just released my video about the ZUGU case for the iPad. I went to check my YouTube analytics to see how everyone liked the video. The analytics looked great. I was very excited and then…I was logged out—for no reason.
I couldn’t log back into my YouTube account. Worse, I couldn’t log in to my Gmail or GDrive either. Even social media was down. I thought I had been hacked.
Frantically, I started searching (ironically via google.com) for “Google Server Down?”. It appeared I wasn’t the only person asking this question, as many others were finding they couldn’t log into their Google-related accounts either.
I felt helpless. With Google down, there wasn’t anything I could do. I was cut off from the world and all the 22,000+ subscribers I love to interact with on YouTube.
All I could do was wait.
It was June 2, 2019. Google discovered “an issue” that affected the Google Cloud Platform. Unusually high congestion was causing problems across multiple platforms.
It took engineers several hours to resolve the issue. (Google later stated that the service disruption was caused by a configuration change that was incorrectly applied to more regions than intended. This caused those regions to stop using more than half of their available network capacity, resulting in congestion.)
This made me think about how dependent everyone is on Cloud Services.
Cloud Services are indispensable in today’s online world. Cloud platforms make starting and running an online business more accessible. This has led to a large increase in the variety of companies that use Google Cloud and other public cloud services like Amazon Web Services, Oracle, and Microsoft Azure.
Google’s Cloud Platform is a group of cloud computing services and management tools that include computing, data storage, data analytics, and machine learning. This infrastructure requires a lot of hardware, software, fiber optic cables, engineers, IT staff, etc. in locations spread across the globe.
While some companies use their own private cloud, or data center, to run their business, many are migrating to Google’s public cloud platform. This reduces cost, reduces labor, and increases load speed and security in many cases.
This also means that many of these companies are dependent on one service for their ability to conduct business.
Many of the services we rely on daily depend on Google’s Cloud Platform to stay up and running. Google uses its own cloud platform to host its G Suite apps, including Gmail, as well as Nest and other Google-related sites such as YouTube. These were all affected by the network congestion on June 2nd.
There are a host of companies not owned by Google that also use the Google Cloud Platform.
Some of the companies currently using the Google Cloud Platform include Shopify, Snapchat, Discord, Vimeo, Ubisoft, Home Depot, Spotify, HSBC, Best Buy, Philips, Coca-Cola, HTC, Domino’s, Sony Music, ShareThis, and Feedly. (Users of Shopify, Snapchat, Discord, and Vimeo reported being affected by the June 2nd Google network congestion.)
What can cause a Cloud Platform Outage?
In September of 2018, Microsoft’s Azure had a partial shutdown caused by severe weather. Lightning strikes led to a power outage at one of Azure’s data centers, causing it to overheat.
It took engineers several days to get everything back up and running.
A cloud platform is a complicated and dynamic system. When one part malfunctions there are backups in place to prevent a shutdown. But when more than one factor contributes to the problem it can have a cascading effect on the rest of the system.
As seen in the lightning-related Microsoft Azure incident, these malfunctions can come from many places:
- Hardware failure
- Updating and integrating software
- Power failures
- Not enough capacity or increasing capacity demands
- Human error
- Security flaws
- Natural disaster
A Cloud Service interruption could severely affect companies that heavily rely on cloud access.
Loss of cloud access can be irritating for the average internet user, but for a company that depends on cloud platforms for daily operation, it can be a disaster. With services ranging from computing, data storage, and even payment processing, a loss of cloud platform use would result in immediate financial cost and lost productivity.
A shortlist of some of these unwanted effects includes:
- Brand damage
- Inability to do business as usual/loss of work
- Inability to perform internal processes
- Inability to process credit cards
- Compromised tasks based on cloud computing
- Inability to connect outside of the company, e.g. video chat or email
- A threat to internal and external security systems
- Interruption of data storage/Potential data loss
- Loss of file sharing
- Lost opportunities
Companies that manage their own data centers are advised to have redundancies, backups, and emergency spares already in place and running so that a failure in one part of the system will be virtually unnoticed.
Those precautions don’t make sense for a company using a public cloud platform. But it helps to know that these platforms have their own redundancies, backups, and emergency spares in place that usually keep up with the demand. The only time we notice a breakdown is when it is very large.
When a major shutdown event occurs, what are the emergency routines Cloud Services will carry out?
Each cloud platform service provider has backups to prevent complete system failure, and engineers with debugging software to quickly locate and identify problems to be repaired.
In the case of Google, its data centers are spread out around the world. In each data center, the machines are, according to Google, “segregated into multiple logical clusters which have their own dedicated cluster management software, providing resilience to failure of any individual cluster manager.”
These systems use cluster software that runs in tandem, each being able to perform functions independently. Because of this, separate clustered systems can be down for routine maintenance or a small problem, and the other systems will make up for the loss until the downed system is back online.
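To picture how other clusters can pick up the slack, here is a minimal sketch in Python. It is only an illustration of the general failover idea, not how Google's cluster management software actually works. The `Cluster` class, the `route` function, and the cluster names are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str             # hypothetical cluster label
    healthy: bool = True  # False simulates maintenance or a fault

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def route(request: str, clusters: list[Cluster]) -> str:
    """Try each cluster in order and return the first successful response."""
    for cluster in clusters:
        try:
            return cluster.handle(request)
        except ConnectionError:
            continue  # this cluster is down; let the next one take the request
    raise RuntimeError("all clusters are down")


clusters = [Cluster("cluster-a"), Cluster("cluster-b"), Cluster("cluster-c")]
clusters[0].healthy = False  # simulate routine maintenance on one cluster
result = route("GET /inbox", clusters)
print(result)  # cluster-b served GET /inbox
```

As long as at least one cluster is healthy, the request still gets an answer. The user only notices a problem when every cluster in the list is down at the same time.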
In the event of an unforeseen incident, engineers get an alert that there is a problem. They use debugging tools to locate the problem and determine the root cause. Then they begin the repairs. (They also have engineers ready to travel to the affected site if necessary.)
These incidents are usually resolved in a few hours to a few days, and considering the size of the networks, this is an amazing feat of efficiency.
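To put "a few hours to a few days" in perspective, here is a rough calculation of what an outage does to availability over a full year. The outage lengths are illustrative numbers I picked for the example, not figures published by any provider.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(outage_hours: float) -> float:
    """Share of the year a service was up, given total hours of downtime."""
    return 100 * (1 - outage_hours / HOURS_PER_YEAR)


# Illustrative outage lengths: one short incident and one multi-day incident.
for label, hours in [("4-hour outage", 4), ("3-day outage", 72)]:
    print(f"{label}: {availability_percent(hours):.2f}% availability for the year")
```

Under these assumptions, a single 4-hour outage still leaves a service at about 99.95% availability for the year, while a 3-day outage drops it to about 99.18%.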
We have to face reality.
Cloud Service platforms have made doing business in the information age much easier. Year after year, more companies are migrating their in-house data centers onto cloud-based platforms, and more publicly used websites and apps rely on the cloud to give their applications life.
Although we wish for a world where the Cloud Services never go down, we have to face reality:
Cloud-based data centers are heavily utilized, complex systems. Hardware degrades, software has bugs, lightning happens, and so do cyber attacks. These risks are part of the reality of cloud-based platforms.
Before you decide to rely on a cloud-based platform:
- Determine if your business can survive potential downtime
- Decide if you need to use resiliency plans in your operations
- Design your systems with potential failures in mind
- Don’t rely on any single machine or component of your system; have backups
- Fully vet your cloud service provider to be sure they have the level of safeguards and backups already in place that meet your requirements
Build your entire system on the premise that the cloud platform will have glitches. Then, if you build accordingly, you’ll have a system that’s resilient to those glitches.
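One common way to build for glitches is to retry a failed cloud call a few times, waiting a little longer each time, and then fall back to a backup if it still fails. Here is a minimal sketch of that pattern. The function names (`call_with_retry`, `fetch_from_cloud`, `read_local_copy`) and the retry settings are assumptions made up for this example, and the "cloud call" is a stand-in that always fails so you can see the fallback kick in.

```python
import time


def call_with_retry(primary, fallback, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call primary(); on failure retry with exponential backoff, then use fallback()."""
    for attempt in range(attempts):
        try:
            return primary()
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    return fallback()


def fetch_from_cloud():
    # Hypothetical stand-in for a real cloud call; here it always fails.
    raise ConnectionError("cloud platform unavailable")


def read_local_copy():
    # Hypothetical fallback: serve the last copy you saved locally.
    return "cached data"


# Pass a no-op sleep so the demo finishes instantly.
result = call_with_retry(fetch_from_cloud, read_local_copy, sleep=lambda seconds: None)
print(result)  # cached data
```

In a real system the fallback might be a second provider, a local cache, or a friendly "try again later" page. The point is that the outage is something your code expects and handles.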
And what about the rest of my story?
Everything at Google came back online, but it left me with a lot to think about.
I learned just how dependent we are on cloud services. I learned that as business people and everyday internet users, we are vulnerable to cloud platform shutdowns.
I also learned about the systems and safeguards the Cloud Service providers have in place to keep these services available, and to make our time spent online easier.
While the downtime was distressing, it helped me become aware of where our vulnerabilities are so I won’t be surprised the next time.
What about you?
Were you affected by the recent interruption in Google’s Cloud Platform?
Do you have plans and systems in place to help during outages?
Join the conversation in the Paperless Movement Facebook Group.