Why is three nines better than four in cloud availability?

In building our TietoEVRY Hybrid IaaS on VMware solution, there has been a lot of discussion about availability SLAs. We're changing quite a few things in this area, so I wanted to take a deeper look at what's different and how to make the best use of the new model.

To look into the topic, let's first go through a few key concepts related to availability.

Availability zone

An availability zone is a single physical location, essentially a data center, which is subdivided into one or more failure domains.
In IaaS scope, the availability zone is the maximum scope for any high availability. Any resiliency action that crosses availability zones is always disaster recovery and is expected to be disruptive by default, meaning an RTO (recovery time objective) greater than zero.

Failure domain

From the facility perspective, failure domains are separate fireproof sections, which are expected to contain local failures in power or cooling, or a fire.
From the cloud perspective, they're clusters, which essentially means they're the scope of high availability. A cloud failure domain can sit within one facility failure domain or span multiple; we'll come to that later.

Difference between 99,9% and 99,99% SLA



The simple difference between the two is that 99,9% is a deployment contained within one facility failure domain, while 99,99% spans three, making the latter resilient to the loss of a facility failure domain (a room). With legacy thinking, or just quick math, the latter might sound better for more critical workloads, but in the cloud way of thinking it's actually the opposite.
99,99% provides more resiliency for the infrastructure, but the infrastructure is not what we actually want to protect. It's the applications.
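
To put the two SLA levels in perspective, here is a minimal Python sketch that converts an availability percentage into an allowed downtime per year. The function name and the 365-day year are assumptions made for illustration; they are not taken from any SLA definition.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, 8760 hours


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year, in hours, that an availability level allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailability = 1 - availability_percent / 100
    return unavailability * HOURS_PER_YEAR


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        hours = annual_downtime_hours(sla)
        print(f"{sla}% allows about {hours:.3f} hours ({hours * 60:.1f} minutes) per year")
```

Under these assumptions 99,9% allows roughly 8.76 hours of downtime per year, and 99,99% roughly 53 minutes.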

Infrastructure vs application resiliency

Looking at it from the application point of view, things look different. On 99,99% the infrastructure availability is exactly 99,99%, but if we spread the application across two different 99,9% failure domains we get something completely different.
When the application is spread across two 99,9% failure domains, it becomes resilient to the loss of a failure domain. Let's do the math for the combined availability of two 99,9% domains, assuming they fail independently:

100% - (100% - 99,9%)^2 = 99,9999%
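
The same calculation can be sketched in Python and generalized to more than two domains. This is only an illustration: it assumes the domains fail independently and that the application can keep serving from any single surviving domain. The function name is made up for this example.

```python
def combined_availability(availability_percent: float, domains: int) -> float:
    """Availability (in percent) of an application spread across several domains.

    Assumptions: domain failures are independent, and the application stays up
    as long as at least one domain is available.
    """
    if domains < 1:
        raise ValueError("at least one domain is required")
    unavailability = 1 - availability_percent / 100
    # The application is down only when every domain is down at the same time.
    return (1 - unavailability ** domains) * 100


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} x 99.9% domains -> {combined_availability(99.9, n):.7f}%")
```

If the failures are correlated, for example through a shared power feed or a shared control plane, the real figure will be lower than this estimate.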

Now why is this? It's because in the 99,99% infrastructure model the application is not necessarily seamlessly protected from all failures. Take the loss of a failure domain as an example: due to random load balancing in the capacity, it's entirely possible that all components of the application are running in the failing domain. They'll all be recovered automatically, yes, but there's a break, which is never good.
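
To get a feel for how likely that is, here is a rough Python sketch. It assumes each component is placed uniformly and independently at random across the rooms, which is a simplification: a real scheduler, especially one with anti-affinity rules, will behave differently. The function name is hypothetical.

```python
from fractions import Fraction


def prob_all_in_failed_domain(components: int, domains: int) -> Fraction:
    """Probability that every component sits in the one domain that fails.

    Assumption: each component is placed uniformly and independently at random
    across the domains, with no placement rules of any kind.
    """
    if components < 1 or domains < 1:
        raise ValueError("components and domains must be positive")
    # Each component lands in the failing domain with probability 1/domains.
    return Fraction(1, domains) ** components


if __name__ == "__main__":
    for k in (1, 2, 3):
        p = prob_all_in_failed_domain(k, 3)
        print(f"{k} component(s) across 3 rooms: {p} ({float(p):.1%})")
```

With two components across three rooms, this simplified model gives a one-in-nine chance that both are in the room that fails. Placing the components in separate failure domains on purpose brings that chance to zero.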

Conclusion

By shifting the availability design out of the infrastructure layer and into the application level, we can reach completely new levels of availability. And this is only the traditional IaaS layer; when we move to modern container deployments and multi-availability-zone or multi-cloud models with global load balancing, we're truly ready for always-on availability.
The highest availability is always reached when your solution is designed to work with failure, instead of trying to create a solution whose components should never fail.

One key part we're also changing is the segregation of capacity and workload SLAs, moving towards the model that is common in cloud delivery and cloud operations delivery. Of course, in contexts where we're also responsible for the workload layer, we still take end-to-end responsibility, just as before. But in that case too, the internal delivery mechanics have changed: cloud delivers an API for managed capacity, and workload operations consumes the API and creates managed workloads. The key driver here is standardization, which is needed to enable end-to-end automation and which in the end produces more efficient, higher-quality results.
