Today I want to walk through a useful way to visualize the relationship between pod and node resources in Kubernetes or, put another way, the difference between the resources your applications request from the underlying infrastructure and the resources that infrastructure actually has to give. By focussing on that difference, we can tease out some cost-optimization principles that help you run Kubernetes applications efficiently.
This post highlights some of the concepts covered in the excellent and newly released guide to running cost-optimized Kubernetes applications on GKE from Google Cloud, which I highly recommend you read in full if you're after some more detail.
How to waste resources on Kubernetes
As companies increase the pace of their migration to the cloud, many are choosing to use shared multi-tenant Kubernetes clusters. The logic goes that running a different cluster per team is sure to lead to significant overhead, both in required cloud infrastructure and in cluster administration.
These multi-tenant clusters typically host applications of all shapes and sizes, from different teams with different skillsets, and often include some legacy applications that have been "lifted and shifted" from a previous life in an on-premises VM.
These applications typically aren't very flexible — they may have been modified to scale horizontally (by running multiple pods), but normally don't take advantage of extra resources by scaling vertically (by using more CPU threads). Over time, this leads to cluster-wide "spikes" in resource usage that may cause applications to become unstable. Faced with this situation, most platform teams will turn to something like the Cluster Autoscaler offered by GKE to add more nodes to the cluster, often leaving a generous overhead in provisioned resources to ensure application stability during a scaling event.
This is a reasonable course of action, but might leave significant savings on the table if done in isolation. To understand why, let's look at a visual representation of the resources in a single cluster node.
With node memory usage on the horizontal axis, and CPU usage on the vertical axis, we can visualize the resources consumed by the first pod scheduled on the node in red.
If we add a few more pods, we can see how they stack, with different pods of different shapes consuming different amounts of resources. Once either the CPU or the RAM gets close to capacity, the Kubernetes scheduler will look elsewhere when attempting to place additional pods in the cluster. If no node has enough room left for a pending pod, you can have the Cluster Autoscaler kick in to add more nodes to the cluster.
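To make the stacking concrete, here is a minimal sketch of a single node filling up. It is a toy model, not the real scheduler: it only checks whether a pod's CPU and memory requests fit in what is left, and the node size and pod shapes are made-up numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu: float      # requested CPU cores
    mem_gib: float  # requested memory in GiB


@dataclass
class Node:
    cpu: float      # CPU cores available to pods
    mem_gib: float  # memory available to pods, in GiB
    pods: list = field(default_factory=list)

    def free(self):
        """Return (free CPU, free memory) after subtracting pod requests."""
        used_cpu = sum(p.cpu for p in self.pods)
        used_mem = sum(p.mem_gib for p in self.pods)
        return self.cpu - used_cpu, self.mem_gib - used_mem

    def fits(self, pod):
        """A pod fits only if BOTH dimensions have enough room."""
        free_cpu, free_mem = self.free()
        return pod.cpu <= free_cpu and pod.mem_gib <= free_mem


def place(pods, node):
    """Place pods in order and return the ones that did not fit."""
    pending = []
    for pod in pods:
        if node.fits(pod):
            node.pods.append(pod)
        else:
            pending.append(pod)
    return pending


# Hypothetical node and pod shapes, chosen only for illustration.
node = Node(cpu=4.0, mem_gib=16.0)
pods = [Pod("a", 1.5, 2.0), Pod("b", 1.0, 1.0), Pod("c", 1.0, 4.0), Pod("d", 1.0, 2.0)]

pending = place(pods, node)
free_cpu, free_mem = node.free()
print("pending:", [p.name for p in pending])
print("free CPU:", free_cpu, "free memory (GiB):", free_mem)
```

In this example the last pod is rejected because the node has run out of CPU, even though more than half of its memory is still free.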
There's another problem though: we've left a lot of memory unused on this node, and we might keep paying for it for a long time, depending on the lifecycle of the pods in your cluster. Depending on the shape of the pods scheduled on the node, the same can happen with CPU instead of RAM.
At this level, this doesn't look too bad, but if you're running a 5,000-node cluster, and you have this situation (to varying degrees) on every node, the costs can add up fast.
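A quick back-of-the-envelope calculation shows how this scales. Every number below is a placeholder I made up for illustration, including the price, so substitute figures from your own cluster and your provider's price list.

```python
def stranded_cost_per_month(nodes, stranded_gib_per_node, price_per_gib_hour, hours=730):
    """Monthly cost of memory that is paid for but can never be scheduled.

    price_per_gib_hour is a hypothetical unit price, not a real list price.
    hours defaults to roughly the number of hours in a month.
    """
    return nodes * stranded_gib_per_node * price_per_gib_hour * hours


# Hypothetical inputs: 5,000 nodes, 9 GiB stranded on each, 0.004 per GiB-hour.
monthly = stranded_cost_per_month(
    nodes=5000,
    stranded_gib_per_node=9,
    price_per_gib_hour=0.004,
)
print(f"stranded memory cost per month: {monthly:,.0f}")
```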
It gets worse — Kubernetes schedules pods based on the resources they request, not the resources they consume. So the red pods we've been looking at so far actually represent the resources requested, not the resources consumed.
If your applications are poorly designed, or developers have routinely requested more resources than they need, the consumed resources might be far lower than the requested resources.
If we stack the actual pod resource usage up for the node, we can see that the wasted resources are dramatically higher than we first thought.
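The two layers of waste can be separated with a little arithmetic. This is a sketch under the assumption that you already have three numbers per resource for a node: what it can allocate, what pods have requested, and what they actually use (for example from `kubectl describe node` and `kubectl top pods`, if a metrics server is installed). The figures are invented.

```python
def utilisation_report(allocatable, requested, used):
    """Split idle capacity into 'never requested' and 'requested but idle'.

    Each argument is a dict with 'cpu' (cores) and 'mem_gib' (GiB) keys.
    """
    report = {}
    for resource in ("cpu", "mem_gib"):
        report[resource] = {
            # Capacity no pod asked for (the gap visible in the first picture).
            "unrequested": allocatable[resource] - requested[resource],
            # Capacity reserved by requests but not actually consumed.
            "requested_but_idle": requested[resource] - used[resource],
            # Share of the node that is doing no work at all.
            "idle_fraction": (allocatable[resource] - used[resource]) / allocatable[resource],
        }
    return report


# Hypothetical node: numbers chosen only to illustrate the calculation.
report = utilisation_report(
    allocatable={"cpu": 4.0, "mem_gib": 16.0},
    requested={"cpu": 3.5, "mem_gib": 7.0},
    used={"cpu": 1.0, "mem_gib": 4.0},
)
for resource, figures in report.items():
    print(resource, figures)
```

Here the node looks nearly full on CPU from the scheduler's point of view, yet three quarters of both CPU and memory are sitting idle.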
Let's look at a few design principles we can adopt to avoid this situation, and more effectively use the resources available in the cluster.
Principles for designing flexible Kubernetes applications
Now that we know how and why node resources often go unused, we can devise a few principles for flexible application design that will help us get the most value from our clusters.
Design your applications to scale, both horizontally and vertically
Pods should ideally be stateless, perform work in discrete chunks, tolerate interruption if possible, and take advantage of multiple CPU threads to scale vertically.
Incentivize developers to set accurate resource requests
In multi-tenant clusters, it's particularly important to ensure individual application owners are incentivized to conserve cluster resources by setting accurate resource requests. If you only measure application performance, you'll drive the wrong behaviours, at the expense of higher total costs.
Match node shape with pod shape
If most of your applications consume twice as much CPU as memory, consider shaping your instances using a similar ratio to limit the amount of wasted resources. If you have different types of applications with very different resource requirements, consider combining separate node pools with node affinity rules to keep those applications apart and make better use of each pool's resources. GKE even has a feature (node auto-provisioning) which will create optimized node pools for you automatically, and which works well for larger clusters.
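One rough way to pick a node shape is to compare the aggregate memory-to-CPU ratio of your pod requests with the ratio of each candidate instance type. The sketch below does exactly that; the shape names and sizes are hypothetical, not real machine types, and a real decision would also weigh price and per-node overhead.

```python
def mem_per_cpu(pods):
    """Aggregate GiB of memory requested per CPU core requested.

    pods is a list of (cpu_cores, mem_gib) request tuples.
    """
    total_cpu = sum(cpu for cpu, _ in pods)
    total_mem = sum(mem for _, mem in pods)
    return total_mem / total_cpu


def closest_shape(pods, shapes):
    """Return the name of the shape whose memory/CPU ratio best matches the pods.

    shapes maps a name to a (cpu_cores, mem_gib) tuple.
    """
    target = mem_per_cpu(pods)
    return min(shapes, key=lambda name: abs(shapes[name][1] / shapes[name][0] - target))


# Hypothetical pod requests and instance shapes, for illustration only.
pod_requests = [(2.0, 1.0), (1.0, 0.5), (4.0, 2.5)]
candidate_shapes = {
    "cpu-heavy": (8, 8),
    "balanced": (8, 32),
    "memory-heavy": (8, 64),
}

ratio = mem_per_cpu(pod_requests)
best = closest_shape(pod_requests, candidate_shapes)
print(f"pods request {ratio:.2f} GiB per core; closest shape: {best}")
```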
Experiment and iteratively tune autoscaling
There is an inherent tradeoff in autoscaling between the amount of over-provisioned resources you have (the "buffer"), and the speed at which autoscaling can react to change. Treat this tradeoff with an experimental mindset, and optimize it over time.
When adding pods using the Horizontal Pod Autoscaler (HPA), set the utilisation target per workload, and choose the smallest buffer that allows you to react to expected spikes.
When adding nodes using the Cluster Autoscaler on GKE, there is no native support for a "buffer" (you can't pre-spin idle nodes in the cluster), so ideally you should set your HPA utilisation targets accordingly. If you expect to scale very quickly, you can work around this with pause pods: low-priority deployments that "hold" space in your cluster until it is required.
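To get a feel for how the buffer and the HPA target relate, here is a small sketch. The replica calculation follows the formula in the Kubernetes HPA documentation (ceiling of current replicas times current metric over target metric), but ignores details such as the tolerance band and stabilisation windows. The target formula is the rule of thumb I remember from the GKE guide, so check it against the guide before relying on it. All input numbers are examples.

```python
import math


def target_utilisation_pct(buffer_pct, expected_growth_pct):
    """Rule-of-thumb HPA target: leave a safety buffer and room for growth.

    buffer_pct: headroom to keep below 100% utilisation.
    expected_growth_pct: load growth expected while new pods are starting.
    """
    return 100 * (100 - buffer_pct) / (100 + expected_growth_pct)


def desired_replicas(current_replicas, current_utilisation_pct, target_pct):
    """Simplified HPA scaling rule (no tolerance or stabilisation window)."""
    return math.ceil(current_replicas * current_utilisation_pct / target_pct)


# Example: keep a 10% buffer and expect 30% growth during scale-up.
target = target_utilisation_pct(buffer_pct=10, expected_growth_pct=30)
print(f"suggested target utilisation: {target:.1f}%")

# The same spike needs more replicas when the target is lower (bigger buffer).
replicas_big_buffer = desired_replicas(10, 90, 60)
replicas_small_buffer = desired_replicas(10, 90, 75)
print("replicas with 60% target:", replicas_big_buffer)
print("replicas with 75% target:", replicas_small_buffer)
```

A lower target means more pods sitting partly idle in normal operation, which is the cost side of the tradeoff described above.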
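As a sketch of what the pause pod approach involves, the snippet below builds the two manifests as plain Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts. The resource name `overprovisioning`, the replica count, and the request sizes are placeholders I chose; the fields themselves are the standard `PriorityClass` and `Deployment` ones.

```python
import json


def pause_pod_manifests(replicas, cpu, memory, name="overprovisioning"):
    """Build a negative-priority class and a Deployment of pause pods.

    The pause pods reserve space; when real workloads need room, the
    scheduler can preempt them, and the evicted pause pods then become
    pending and trigger the Cluster Autoscaler.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": -1,  # lower than the default priority of 0
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "priorityClassName": name,
                    "containers": [
                        {
                            "name": "pause",
                            "image": "registry.k8s.io/pause:3.9",
                            # The requests decide how much space is held.
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                },
            },
        },
    }
    return priority_class, deployment


# Hypothetical sizing: hold space for three 1-core, 2 GiB workloads.
priority_class, deployment = pause_pod_manifests(replicas=3, cpu="1", memory="2Gi")
for manifest in (priority_class, deployment):
    print(json.dumps(manifest, indent=2))
```

How many replicas to run, and how large to make them, is another buffer-versus-cost decision to tune by experiment.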
Use underlying infrastructure savings
Finally, if you're running large multi-tenant clusters, you should absolutely be taking advantage of underlying infrastructure savings offered by your cloud provider.
All the major cloud providers offer some form of committed use discount, which gives a steep discount in exchange for contracted long-term use. You should use these for your "baseline" cluster use. If you don't know what that baseline is, take some time to monitor usage before committing.
If your workloads are fault-tolerant, you can also use preemptible or spot instances in your clusters. They are challenging to use, but offer steep savings over regular instances. These can also be handy for development clusters, though be warned that native Kubernetes availability features like Pod Disruption Budgets and pod grace periods are likely to be ignored when using these instances.
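If you have been recording node counts (or core counts) over time, one simple way to estimate a baseline is to take a low percentile of those samples, so that you only commit to capacity you use almost all the time. The choice of percentile is my own assumption rather than anything prescribed by a provider, and the samples are invented.

```python
def baseline(samples, quantile=0.25):
    """Return a low quantile of observed usage as a conservative baseline.

    samples: usage measurements, e.g. node counts sampled at regular intervals.
    quantile: fraction between 0 and 1; lower values commit to less.
    """
    ordered = sorted(samples)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


# Hypothetical node counts sampled over a monitoring period.
node_counts = [40, 42, 38, 55, 70, 64, 45, 39, 41, 60]

commit = baseline(node_counts, quantile=0.25)
print("nodes to cover with committed use discounts:", commit)
print("peak nodes observed:", max(node_counts))
```

Everything above the baseline is then a candidate for on-demand or spot capacity.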
When separation of concerns is tricky
Kubernetes is a great tool for separating the concerns of developers and operators, but cost-optimization is one of those tricky areas that, done well, doesn't fall neatly into one bucket. However, if you get the incentives right across teams, and build applications in a cloud-native way, there are significant cost savings to be had in running applications using Kubernetes.
This post has really only scratched the surface of cost-optimization for Kubernetes. If you want more details and a practical set of cost-saving techniques for GKE, I'd highly recommend the best practices guide I mentioned earlier.