This post is part of my 'Learning in public' journey for Microsoft Azure. It summarizes the key concepts of keeping Virtual Machines available in several disaster scenarios, such as local outages or full data center failures.
Resources this concept can be used on
Availability sets are exclusive to Virtual Machines. Availability zones can be used for a variety of services, such as:
- VM Scale Sets
- Load Balancers and Public IP Addresses (Standard SKU only)
- Managed disks
- Azure Backup & Site Recovery
- Several other Azure services (Azure Functions, Cosmos DB, VPN Gateway, ...)
Availability sets and availability zones are mutually exclusive: a virtual machine can't be part of both at the same time. Hence it's important to plan each architecture component accordingly.
- Availability sets provide an uptime SLA of 99.95%. They suit any application that does not require the maximum uptime guarantee.
- Availability zones provide an uptime SLA of 99.99%. They are meant for the most critical components of your infrastructure.
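To get a feeling for what these two percentages mean, here is a small Python sketch that converts an SLA into the downtime it still permits. It assumes a plain 365-day year for simplicity (the function name is made up for illustration, and the way Azure actually measures uptime for SLA purposes may differ).

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (simplifying assumption)


def max_downtime_hours(sla_percent: float) -> float:
    """Return the yearly downtime in hours that an uptime SLA still permits."""
    return HOURS_PER_YEAR * (100 - sla_percent) / 100


for name, sla in [("Availability set", 99.95), ("Availability zones", 99.99)]:
    hours = max_downtime_hours(sla)
    print(f"{name}: {sla}% uptime allows about {hours:.2f} h ({hours * 60:.0f} min) of downtime per year")
```

Under this assumption, 99.95% leaves roughly 4.4 hours of permitted downtime per year, while 99.99% leaves about 53 minutes.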
Availability sets
Availability sets are logical groups that distribute Virtual Machines over several hardware nodes within a data center. In other words: They're different, independent computing instances running the same software stack.
Availability sets protect your architecture against smaller, local failures within the data center, such as a power outage on a single rack. They have two key features: fault domains and update domains.
Fault domains
A fault domain is a group of virtual machines that share the same hardware components, such as power source, cooling unit, and network switch. Simply said: It's a single server rack. Somewhere on the rack, a computer runs your VM.
If your availability set distributes your virtual machines across several server racks, they're less susceptible to power or cooling failures. If one rack goes down, the VMs on the other racks can take over.
One Availability Set can have up to three fault domains.
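To make the rack picture concrete, here is a minimal Python sketch. It assumes five hypothetical VMs (VM1 to VM5) and a simple round-robin placement across three fault domains. The function names are invented for illustration and are not an Azure API; Azure decides the actual placement itself.

```python
def assign_fault_domains(vm_names, fault_domain_count=3):
    """Map each VM to a fault domain (1-based), round-robin. Assumed placement rule."""
    return {vm: index % fault_domain_count + 1 for index, vm in enumerate(vm_names)}


def surviving_vms(placement, failed_domain):
    """Return the VMs that keep running when one fault domain (rack) fails."""
    return [vm for vm, domain in placement.items() if domain != failed_domain]


vms = [f"VM{i}" for i in range(1, 6)]
placement = assign_fault_domains(vms)

print(placement)                                  # which rack hosts which VM
print(surviving_vms(placement, failed_domain=1))  # VMs still up if rack 1 fails
```

With this placement, VM1 and VM4 share fault domain 1, so losing that rack still leaves VM2, VM3 and VM5 running.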
Update domains
An update domain is a group of virtual machines that share the same scheduled update cycle of the underlying hardware. Let's use an example to explain this:
- Assume you create 6 virtual machines in an availability set with 3 update domains
- Then Azure will automatically assign 2 VMs to each update domain
- Let's say: VM1 and VM2 are part of the same update domain, VM3 and VM4 share one and so do VM5 and VM6
- Now assume that the hardware that runs VM1 is scheduled for an update
- Then the hardware that runs VM2 is updated at the same time
- All other VMs remain fully operational and will not be updated until VM1 and VM2 are back in business
- Each update domain receives a 30-minute cooldown time before maintenance on the next one begins
One Availability Set can have up to 20 update domains.
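The six-VM example can be sketched as a small timeline in Python. This is only an illustration: the grouping into consecutive pairs mirrors the example above rather than Azure's real assignment, the 20-minute maintenance duration is a made-up value, and both function names are hypothetical.

```python
RECOVERY_MINUTES = 30  # pause between two update domains, as described above


def group_into_update_domains(vm_names, update_domain_count):
    """Split the VMs into consecutive groups, mirroring the example above."""
    size = -(-len(vm_names) // update_domain_count)  # ceiling division
    return [vm_names[i:i + size] for i in range(0, len(vm_names), size)]


def maintenance_plan(domains, maintenance_minutes):
    """Return (domain number, VMs, start minute, end minute) per update domain."""
    plan, start = [], 0
    for number, vms in enumerate(domains, start=1):
        end = start + maintenance_minutes
        plan.append((number, vms, start, end))
        start = end + RECOVERY_MINUTES  # the next domain waits for the recovery window
    return plan


vms = [f"VM{i}" for i in range(1, 7)]
domains = group_into_update_domains(vms, update_domain_count=3)
for number, members, start, end in maintenance_plan(domains, maintenance_minutes=20):
    print(f"Update domain {number}: {', '.join(members)} offline from minute {start} to {end}")
```

At any point in time, at most one update domain is offline, so four of the six VMs always stay reachable.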
Comparing fault domains and update domains
Together, the two availability set features protect against planned and unplanned outages: update domains cover planned maintenance, fault domains cover unexpected hardware failures. While a fault domain can easily be pictured as a server rack, update domains seem more abstract. Let's put them together with an example.
Assume you built a new piece of software (perhaps Tinder for dogs). You know it's the next big thing and buy 5 virtual machines on Azure. You also want to ensure your potential customers (or their respective owners) can find each other, even when one of your instances goes down.
So you start by creating an availability set with 3 fault domains and 3 update domains.
Let's walk through a few scenarios of what might happen:
- You underestimate the number of dogs looking for love. The cooling unit of Fault Domain 1 has a defect and the rack catches fire due to the sheer amount of traffic to your application.
=> Then VM2, VM3 and VM5 will still be available
- Microsoft Cloud admins notice that the computers running VM1 - VM4 are still units of Commodore 64. They schedule updates to replace the hardware.
=> Then VM1 and VM2 will receive maintenance at the same time
=> Once maintenance is done, VM3 and VM4 receive theirs after a 30-minute grace period
- After six months, you notice your current setup is not nearly strong enough. You decide to scale up to 25 virtual machines and max out the domains of your availability set: 3 fault domains and 20 update domains
=> Azure will automatically try to distribute the VMs as evenly as possible across fault and update domains
=> You will receive 3 fault domains, 2 with 8 and 1 with 9 virtual machines
=> You will receive 20 update domains, 15 with 1 and 5 with 2 virtual machines
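The numbers in the third scenario can be double-checked with a few lines of Python. The sketch below only models "as evenly as possible" as plain integer division with a remainder; `domain_sizes` is a hypothetical helper, not something Azure exposes.

```python
from collections import Counter


def domain_sizes(vm_count, domain_count):
    """Spread vm_count VMs as evenly as possible and return the VM count per domain."""
    base, extra = divmod(vm_count, domain_count)
    # The first `extra` domains take one additional VM each.
    return [base + 1 if i < extra else base for i in range(domain_count)]


fault_domains = domain_sizes(25, 3)
update_domains = domain_sizes(25, 20)

print(fault_domains)            # VMs per fault domain
print(Counter(update_domains))  # how many update domains hold 1 or 2 VMs
```

For 25 VMs this yields one fault domain with 9 and two with 8 VMs, plus five update domains with 2 VMs and fifteen with 1 VM, matching the scenario.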
Let's walk through a fourth scenario: The outage of a whole data center
Availability zones
So it happened. The burning rack from scenario 1 collided with a defective fire sprinkler system. To make things worse, a group of anti-cloud protestors infiltrated the data center where your application runs and pulled a lot of plugs. The damage is so great that the data center cannot operate anymore and is closed for an unspecified amount of time.
Okay, this scenario is absurd. But what can you do against a whole data center being affected by an outage?
This is where availability zones come into play. Availability zones distribute your VMs not across server racks within one data center but across physically separate locations (basically across buildings).
You can view a 3D-map of all Azure regions and availability zones here:
Let's look at the region 'Germany West Central'. It has three availability zones. That means somewhere in the greater Frankfurt area, there are at least three separate data center locations operated by Microsoft, each with its own power, cooling, and networking. Each zone can host some of your VMs. If one zone goes down, the VMs in the other two zones can take over.
Distributing your architecture across several zones provides the highest VM uptime SLA Azure offers: 99.99%. It's a great choice for highly critical components where even the slightest outage would cause severe damage to you or your company.