In my experience, most teams moving to Kubernetes do not prioritize Disaster Recovery (DR) and High Availability (HA), on the assumption that these can wait until production. That is understandable given the time and resources required to put high availability and disaster recovery in place.
However, having scalable Kubernetes HA and DR solutions in place from the start is important, because your master node can easily be compromised in several situations, for example during demand spikes or when infrastructure goes down. Other events that can disrupt service include network interruptions, application bugs or, in extremely rare cases, a natural disaster.
Scenarios where Kubernetes backup & DR would be highly beneficial
Kubernetes cluster backup and disaster recovery is particularly useful for these scenarios:
Accidental deployment deletion
When you have multiple developers working on the application, someone might accidentally delete a deployment. If a Kubernetes cluster becomes unrecoverable, having a stable snapshot of a previous state of Kubernetes will accelerate recovery.
Especially in these challenging times, accelerated global deployment is expected for many applications, so it makes sense to plan for multi-geography replication from the outset. But even if you are just moving from staging to production, things can go wrong.
Kubernetes cluster migration
As you scale Kubernetes, you need to factor in cluster migration and how to move your applications across different cloud environments and platforms.
Kubernetes service options
Managed Kubernetes platforms
On managed Kubernetes platforms such as AKS, EKS, GKE, ACK or CCE, you do not have to worry about managing the master node or cluster control plane. For HA, you can focus on how to deploy the application and select the instance type. The cluster can be either of the two options below:
- Single zone cluster – master and worker nodes are created in one of the availability zones within a region
- Multi-zone cluster – master and worker nodes are created in multiple zones within a region. A multi-zone cluster improves the availability of applications deployed on the worker nodes, but not the availability of the cluster itself
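To make the single-zone versus multi-zone distinction concrete, here is a minimal sketch that counts worker nodes per zone and reports whether they span more than one. It relies on the well-known node label `topology.kubernetes.io/zone`; the input shape (a list of dicts with `name` and `labels`) and the node and zone names are assumptions made up for illustration, not output from any particular platform.

```python
from collections import Counter

# Well-known label Kubernetes uses to record the zone a node runs in.
ZONE_LABEL = "topology.kubernetes.io/zone"


def zone_spread(nodes):
    """Count worker nodes per availability zone.

    `nodes` is a list of dicts shaped like {"name": ..., "labels": {...}}.
    This shape is an assumption for the sketch, not a client-library type.
    """
    return Counter(node["labels"].get(ZONE_LABEL, "unknown") for node in nodes)


def is_multi_zone(nodes):
    """Return True when worker nodes sit in at least two distinct zones."""
    zones = set(zone_spread(nodes)) - {"unknown"}
    return len(zones) >= 2


# Hypothetical inventory: node names and zone names are invented.
example_nodes = [
    {"name": "worker-1", "labels": {ZONE_LABEL: "zone-a"}},
    {"name": "worker-2", "labels": {ZONE_LABEL: "zone-b"}},
    {"name": "worker-3", "labels": {ZONE_LABEL: "zone-a"}},
]
```

A check like this only describes where the workers are. It says nothing about the control plane, which is the part a multi-zone cluster does not make highly available.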
Self-managed Kubernetes
Self-managed Kubernetes services provide more flexibility. You have more control over the cluster’s control plane because you handle the deployment and administration of your cluster yourself. For example, you could have different instance types for different nodes. If you are running stateful applications on your Kubernetes cluster with persistent storage, you will need to plan for Kubernetes backup, restore and disaster recovery.
Bare metal Kubernetes (on-prem)
Running Kubernetes on-prem means you’re on your own when it comes to managing all complexities – including etcd, load balancing, availability, auto-scaling, networking, roll-back on faulty deployments, persistent storage, etc. Total control also means more responsibility, hence planning and preparedness for backup, restore and disaster recovery should be prioritized from the beginning.
Putting in place Kubernetes high availability is resource intensive and complex
In practice, putting in place HA for Kubernetes requires a lot of effort, given that you need to replicate resources across multiple environments (replicating and maintaining a minimum of 3 control plane (master) nodes and multiple worker nodes) and carefully consider the topology:
- Stacked control plane nodes (etcd nodes are collocated with control plane nodes) or
- External etcd nodes (etcd runs on separate nodes from the control plane)
Besides being resource intensive, Kubernetes HA is operationally complex and tedious, and it requires expertise, which explains why many teams do not put solutions in place early in their Kubernetes journey.
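The three-node minimum mentioned above follows from how etcd reaches agreement: a write needs a majority of members. The arithmetic can be sketched as follows. The function names are my own and not part of any etcd or Kubernetes API.

```python
def quorum(members):
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members):
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


# Print cluster size, quorum and tolerated failures for a few common sizes.
for size in (1, 2, 3, 5):
    print(size, quorum(size), failure_tolerance(size))
```

With one or two members, no failure can be tolerated. Three members is the smallest size that survives the loss of one, and going to five raises that to two.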
Kubernetes disaster recovery solutions are limited
One popular option for Kubernetes backup and migration is Velero, an open-source tool. Currently, Velero provides the following options:
- Back up persistent volumes
- Back up all objects in a namespace
However, Velero is designed for single tenancy, offers only a read-only mode for multi-cluster use, has no GUI, and does not support overwriting objects during restore. For a more in-depth view, read my colleague Chakradhar Jonagam’s article on how to do backup and disaster recovery in Kubernetes.
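As a rough illustration of the namespace backup option listed above, the sketch below builds the Velero command lines for backing up one namespace and restoring from that backup. It only assembles the arguments and does not run anything. It assumes the `velero` CLI is installed and already configured against your cluster, and the backup name `nightly` and namespace `shop` are hypothetical.

```python
import shlex


def velero_backup_cmd(backup_name, namespace):
    """Build the argv for backing up every object in one namespace."""
    return ["velero", "backup", "create", backup_name,
            "--include-namespaces", namespace]


def velero_restore_cmd(backup_name):
    """Build the argv for restoring from a previously created backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


# Hypothetical names, shown as the shell commands they correspond to.
print(shlex.join(velero_backup_cmd("nightly", "shop")))
print(shlex.join(velero_restore_cmd("nightly")))
```

To actually execute a command, the returned list could be passed to `subprocess.run(..., check=True)`. Keeping command construction separate from execution makes it easy to review what would run before touching a cluster.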
Simplifying Kubernetes disaster recovery and high availability with CAPE
The complexity and resource intensiveness of advanced Kubernetes tasks were what led my teammates at Biqmind to develop CAPE.
CAPE is cloud agnostic and helps you back up and restore your Kubernetes cluster, be it on a managed service, a self-managed platform or bare metal on-prem. CAPE simplifies and abstracts DR and HA tasks into a UI-based, easy-to-use experience.
Some key features I would highlight include:
- Perform on-demand & scheduled backup & restore across multiple Kubernetes clusters
- Take Persistent Volume Claim (PVC) backups from cloud/on-prem storage environments
- Perform multi-cluster & multi-cloud backup & restore
- Deploy Kubernetes applications to multiple clusters
- Implement GitOps-like deployments based on manifests and Helm charts
About the Author
Biqmind Specialist Services Consultant
Santhoshi Gopalakrishnan is a Specialist Services Consultant with Biqmind, helping customers accelerate their cloud-native journeys, specifically in DevOps implementation for major cloud platforms, on-prem and hybrid environments. With 15+ years of experience in IT, she started as a release engineer before transitioning to Sys Ops roles and has extensive experience in designing and implementing CI/CD (both Delivery & Deployment). Santhoshi has worked on large scale cloud and container migration projects across various verticals including telecommunications, technology and financial services.