Scaling Kubernetes Clusters
Introduction
All managed Kubernetes offerings (EKS, AKS, GKE) support cluster auto-scaling, i.e. adding more worker nodes when a node runs out of resources and cannot accommodate additional pods. The problem with this approach is that provisioning new nodes and moving workloads onto them often takes time. By the time this process completes, the existing cluster is saturated and unresponsive, all applications get stuck, and the business is interrupted.
The optimal solution, short of keeping standby, underutilized node pools on the side, is to detect when a worker node reaches a certain threshold before it becomes saturated, e.g. 85%, and then add nodes in time and shift work to them without impacting cluster performance.
Challenges with Kubernetes Scaling
Cluster auto-scaling in Kubernetes works well when your scaling behavior is smooth and roughly linear and you can predict load patterns in advance.
Fig. 1 presents a schematic diagram of how cluster auto-scaling is done in AWS EKS clusters.
Fig. 2 presents the same mechanism on Azure AKS. We focus here on the cluster autoscaler.
But what do you do when you need to scale ad-hoc?
For example, something happens and all mobile users rush to read a certain news item. In this case, you have to scale out at once. What happens in reality is that your Kubernetes nodes become overloaded and the cluster hangs before it is able to add more nodes and move existing and new workload pods onto them. And this impacts your business.
Some web-scale companies have encountered this problem; you can find a good article about it here.
You can solve this problem in one of two ways:
- Provision many nodes in advance to prepare for these burst peaks; ultimately, though, they stay underutilized.
- Scale out your cluster in a reasonable time and add the nodes before such an event happens. When peak time is over, scale in and release the added worker nodes.
For the second option you need to define the following:
- A good KPI or a combination of KPIs that you constantly measure
- A good monitoring solution that can collect these KPIs
- A good policy that knows when to scale out and when to scale in. For example, scaling in and out every time you cross 85% CPU load causes fluctuations and a hysteresis effect. You need a cool-down window to wait between scaling events, and you can also set the scale-in threshold lower than the scale-out threshold, e.g. 75% CPU load (see the sketch after this list).
- A workflow engine that knows what to do when a scaling event occurs. You have to provision new nodes, typically by calling cloud provider APIs to create new VMs with the right permissions, roles, and other attributes, and then joining them to the Kubernetes cluster.
- The scaling process should allow adding a configurable number of nodes to the Kubernetes cluster each time a scaling event occurs.
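To make the policy part concrete, here is a minimal sketch in Python. It is an illustration under assumptions of my own: the 85%/75% thresholds, the 10-minute cool-down window, and the provision_nodes/release_nodes hooks are placeholders, not part of any particular product.

```python
import time

SCALE_OUT_THRESHOLD = 0.85   # scale out above 85% CPU load
SCALE_IN_THRESHOLD = 0.75    # scale in below 75% CPU load (lower, to avoid hysteresis)
COOL_DOWN_SECONDS = 10 * 60  # minimum time between scaling events (assumed value)

_last_scaling_event = 0.0

def provision_nodes(count: int) -> None:
    # Placeholder: a real workflow would call cloud provider APIs here.
    print(f"scale-out: provisioning {count} worker nodes")

def release_nodes(count: int) -> None:
    # Placeholder: a real workflow would drain and remove nodes here.
    print(f"scale-in: releasing {count} worker nodes")

def evaluate_policy(cpu_load: float, nodes_per_event: int = 2) -> None:
    """Decide whether to scale out, scale in, or do nothing."""
    global _last_scaling_event
    if time.time() - _last_scaling_event < COOL_DOWN_SECONDS:
        return  # still inside the cool-down window, skip to avoid fluctuations
    if cpu_load > SCALE_OUT_THRESHOLD:
        provision_nodes(nodes_per_event)
        _last_scaling_event = time.time()
    elif cpu_load < SCALE_IN_THRESHOLD:
        release_nodes(nodes_per_event)
        _last_scaling_event = time.time()
```

A monitoring loop would call evaluate_policy with the latest measured CPU load; the configurable nodes_per_event corresponds to the last requirement in the list above.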
Cloudify’s Solution
Cloudify is an open-source orchestrator which can trigger complex workflows based on defined policies. In this example, the solution is to monitor a set of KPIs and define a policy that triggers the scale-out event.
Fig. 3 presents a policy and the scaling decisions made based on measured end-user response times. In general, scaling can be based on any KPI, e.g. CPU load, or on a combination of KPIs, e.g. CPU load plus end-user response time.
1. Cloudify continuously monitored a given URL and measured its response time (Cloudify's monitoring plugin can be found in this git repo; a simplified sketch of such a loop appears after this list).
2. We simulated a slow response by defining a Flask endpoint that artificially increased the Flask application's response time. The Flask application ran on Kubernetes.
3. Cloudify's policy engine, which collects the KPIs, detected that the response time crossed a pre-configured threshold and triggered a scale-out workflow. Cloudify also started a cool-down counter that does not allow scaling in before a certain number of minutes elapses. This avoids the hysteresis of scaling out and in too often, i.e. fluctuations.
4. The scale-out workflow communicates with the cloud provider APIs, allocates a configurable number of VMs, and adds them to the cluster.
5. When things are back to normal and the cool-down window has elapsed, Cloudify's policy engine triggers a scale-in workflow to remove the extra nodes that were added for the peak. Things are back to normal.
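Here is a minimal monitoring-loop sketch in Python for step 1. It is not the Cloudify plugin linked above; the URL, the 2-second threshold, the polling interval, and the trigger_scale_out hook are all assumptions for illustration.

```python
import time
import requests

URL = "http://my-app.example.com/slow"  # hypothetical endpoint to watch
RESPONSE_TIME_THRESHOLD = 2.0           # seconds (assumed value)
COOL_DOWN_SECONDS = 10 * 60             # assumed cool-down window
POLL_INTERVAL = 30                      # seconds between measurements

def trigger_scale_out() -> None:
    # Placeholder: a real setup would start the scale-out workflow here.
    print("response time threshold crossed, triggering scale-out")

def monitor() -> None:
    last_event = 0.0
    while True:
        start = time.monotonic()
        try:
            requests.get(URL, timeout=30)
            elapsed = time.monotonic() - start
        except requests.RequestException:
            elapsed = float("inf")  # treat errors as a very slow response
        in_cool_down = time.time() - last_event < COOL_DOWN_SECONDS
        if elapsed > RESPONSE_TIME_THRESHOLD and not in_cool_down:
            trigger_scale_out()
            last_event = time.time()
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    monitor()
```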
Fig. 4 shows code snippets of the scaling process and the cool-down window check.
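For readers not using Cloudify, a rough equivalent of the scale-out step against the cloud provider API is sketched below, assuming the worker nodes run in an EKS managed node group (a different setup than the direct VM provisioning described above); the cluster and node group names are placeholders.

```python
import boto3

def scale_out_nodegroup(cluster: str, nodegroup: str, nodes_to_add: int) -> None:
    """Increase the desired size of an EKS managed node group by nodes_to_add."""
    eks = boto3.client("eks")
    current = eks.describe_nodegroup(
        clusterName=cluster, nodegroupName=nodegroup
    )["nodegroup"]["scalingConfig"]
    desired = current["desiredSize"] + nodes_to_add
    eks.update_nodegroup_config(
        clusterName=cluster,
        nodegroupName=nodegroup,
        scalingConfig={
            "minSize": current["minSize"],
            "maxSize": max(current["maxSize"], desired),
            "desiredSize": desired,
        },
    )

# Example call with placeholder names:
# scale_out_nodegroup("my-cluster", "my-nodegroup", nodes_to_add=2)
```

A scale-in workflow would do the opposite, lowering the desired size once the extra nodes have been drained.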
Summary and Conclusions
To summarize: if you need a production Kubernetes system that can scale immediately for burst peaks, you have to create and manage a mechanism that supports it.
There are a few ways to do this, and this blog presented one of them, which keeps the cost relatively small because you pay only for what you consume.
Of course, you can set aside a fleet of standby worker nodes, but most of the time they will be underutilized.
Another approach that I haven't touched on here is to reduce the unit of work and move to serverless. There, you just execute the function, but you will need to make major architectural changes to get there. You can find more on this approach in a good article published by Eitan Yanovsky from Optibus.
Last but not least, in this blog we discussed scaling out and scaling in, but you can also add capacity by scaling up. When you scale out, you can increase the VM sizes at the same time, i.e. add bigger nodes for the burst peak period.
More on scaling and other production-grade Kubernetes topics in the next blogs.