Calculating availability with cloud service SLAs

When designing a cloud solution, it helps to understand the service-level agreements (SLAs) that cloud providers offer for individual services and how they affect your overall system availability.

For example, a service offering

  • 99.9% uptime can be down for up to about 43 minutes 47 seconds per month
  • 99.99% uptime can be down for up to about 4 minutes 22 seconds per month
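
The downtime figures above follow directly from the uptime percentage. Here is a minimal sketch of that conversion; the function name `monthly_downtime` and the 730-hour month (43,800 minutes) are assumptions made for illustration, and a different month-length convention shifts the result by a few seconds.

```python
def monthly_downtime(sla_percent, minutes_per_month=43_800):
    """Return the downtime an SLA permits per month as (minutes, seconds).

    minutes_per_month defaults to a 730-hour month, which is an assumption;
    providers and calculators do not all use the same month length.
    """
    downtime_seconds = round((1 - sla_percent / 100) * minutes_per_month * 60)
    return divmod(downtime_seconds, 60)


for sla in (99.9, 99.99):
    minutes, seconds = monthly_downtime(sla)
    print(f"{sla}% uptime -> {minutes} min {seconds} s of downtime per month")
```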

Systems are rarely built on a single cloud service. For example, a typical web application is likely to have four layers: Azure Front Door, a Web App, an API and a Database.

Sample Web Application (N-Tier)

Let's assume each cloud service has the following SLA:

  • Azure Front Door -> SLA 99.99%
  • Web App -> SLA 99.95%
  • API (Web App) -> SLA 99.95%
  • Database -> SLA 99.99%

Composite SLA

A system is considered up and running only if all four services are running (let's leave user/app errors out for simplicity). To calculate the overall SLA of the system, we consider the worst case, i.e. the services go down one after another rather than at the same time, as shown below:

Composite SLA 

Overall SLA (aka composite SLA) is calculated simply by multiplying all four SLAs together as follows:

Composite SLA = 99.99% x 99.95% x 99.95% x 99.99% = 99.88%
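
The same multiplication can be sketched in a few lines of Python. The helper name `serial_sla` and the `tiers` dictionary are illustrative only, and the calculation assumes the services fail independently of one another.

```python
from math import prod


def serial_sla(sla_percents):
    """Composite SLA (in percent) for services that must ALL be up."""
    return 100 * prod(s / 100 for s in sla_percents)


# SLA values taken from the list above
tiers = {
    "Azure Front Door": 99.99,
    "Web App": 99.95,
    "API (Web App)": 99.95,
    "Database": 99.99,
}

composite = serial_sla(tiers.values())
print(f"Composite SLA = {composite:.2f}%")  # Composite SLA = 99.88%
```

Note that the composite figure is lower than the weakest individual SLA: every extra service in the chain adds another way for the system to be down.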

Multi-Region Deployments

One way to improve uptime is to deploy the application across multiple regions, which also lets it withstand region-wide outages. Our new architecture would look like this (Azure Front Door is a global service, so it is shared between the regions):

Multi Region Deployment

Now let's calculate the overall system SLA.

We first calculate the SLA for Region 1 and Region 2:

Region 1 SLA = 99.95% x 99.95% x 99.99% = 99.89%
Region 2 SLA = 99.89% (same as region 1)

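Since the two regions run the same services in parallel, the system is unavailable only when both regions are down at the same time. The composite SLA for Region 1 and Region 2 is therefore calculated from their unavailability: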

Composite SLA: for Parallel Deployment (multi-region)
  1. Region 1 Unavailability (R1) = 100 - 99.89 = 0.11%
  2. Region 2 Unavailability (R2) = 100 - 99.89 = 0.11%
  3. Multi-Region Unavailability (OU = (R1 * R2)/100) = (0.11 * 0.11)/100 = 0.000121%
  4. Multi-Region Availability (OA = 100 - OU) = 100 - 0.000121 = 99.999879%

This gives an overall availability of 99.999879% for the two regions running in parallel.
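
The parallel case can be sketched the same way: multiply the unavailabilities instead of the availabilities. The helper name `parallel_sla` is hypothetical, and the sketch assumes the regions fail independently, which a real deployment may not fully satisfy.

```python
def parallel_sla(sla_percents):
    """Availability (in percent) when ANY ONE deployment being up is enough."""
    unavailability = 1.0
    for s in sla_percents:
        unavailability *= 1 - s / 100  # all deployments must be down together
    return 100 * (1 - unavailability)


# Rounded per-region SLA from the calculation above
multi_region = parallel_sla([99.89, 99.89])
print(f"Multi-Region Availability = {multi_region:.6f}%")  # 99.999879%
```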

The last step is to calculate the composite availability including Azure Front Door, which sits in series in front of both regions:

-> SLA for Azure Front Door x SLA for Multi-Region Availability

-> 99.99% x 99.999879% = 99.9899%

This is equivalent to a monthly downtime of about 4 minutes 26 seconds (assuming a month of roughly 43,800 minutes).
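
Putting the steps together, the whole multi-region calculation can be sketched end to end. The helper names `serial` and `parallel` and the 730-hour month are assumptions for illustration. Because the sketch keeps full precision instead of rounding after each step, the last digits may differ slightly from the hand-worked figures.

```python
from math import prod

MINUTES_PER_MONTH = 43_800  # assumption: 730-hour month


def serial(*availabilities):
    """All components must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one component being up is enough: multiply the unavailabilities."""
    return 1 - prod(1 - a for a in availabilities)


# SLAs from the example, expressed as fractions
front_door, web_app, api, database = 0.9999, 0.9995, 0.9995, 0.9999

region = serial(web_app, api, database)      # one region's stack
multi_region = parallel(region, region)      # two identical regions
overall = serial(front_door, multi_region)   # shared Front Door in front

downtime_seconds = round((1 - overall) * MINUTES_PER_MONTH * 60)
minutes, seconds = divmod(downtime_seconds, 60)
print(f"Overall availability: {overall * 100:.4f}%")
print(f"Monthly downtime: {minutes} min {seconds} s")
```

Front Door's own 99.99% SLA caps the result: however many regions are added behind it, the overall figure cannot exceed the SLA of the shared service in front.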

Overall SLA for Azure Front Door x SLA for Multi-Region Availability

Here is a simple spreadsheet you can use to perform SLA calculations (download here)

Sample Spreadsheet