Saving cloud costs with Kubernetes on AWS


How can I save on cloud costs when working with Kubernetes? There is no single right solution, but this article describes several tools that can help you manage resources more efficiently and reduce your cloud computing bill.



I wrote this article with Kubernetes on AWS in mind, but it applies (almost) equally to other cloud providers. I assume that your cluster(s) already have autoscaling configured (cluster-autoscaler). Removing resources and scaling down deployments will only save money if it also shrinks your fleet of worker nodes (EC2 instances).



This article will cover:

  • Cleaning up unused resources
  • Scaling down outside working hours
  • Using horizontal autoscaling
  • Reducing resource redundancy
  • Using EC2 Spot Instances

Cleaning up unused resources



Working in a fast-paced environment is great. We want technical organizations to accelerate. Faster software delivery also means more PR deployments, preview environments, prototypes, and analytics solutions. All of these can be deployed on Kubernetes. Who has time to manually clean up test deployments? It is easy to forget to delete a week-old experiment. The cloud bill keeps growing because of things we forgot to shut down:










(Henning Jacobs:

"So true:"

quoting Corey Quinn:

"Myth: Your AWS bill is a function of the number of your users.

Fact: Your AWS bill is a function of the number of your engineers."

Ivan Kurnosov (in reply):

"Real fact: Your AWS bill is a function of the number of things you forgot to disable/delete.")





Kubernetes Janitor (kube-janitor) helps keep your cluster clean. The janitor's configuration is flexible, with both cluster-wide and per-resource rules:



  • Cluster-wide rules can define a maximum time-to-live (TTL) for PR/test deployments.
  • Individual resources can be annotated with janitor/ttl, for example to automatically remove a spike/prototype after 7 days (see the sketch below).
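

For example, the TTL annotation can be set directly in a manifest instead of via kubectl; a minimal fragment (the deployment name is hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-prototype
  annotations:
    # kube-janitor deletes this deployment 7 days after creation
    janitor/ttl: 7d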


General rules are defined in a YAML file whose path is passed to kube-janitor via the --rules-file option. Here's an example rule to remove all namespaces with -pr- in the name after two days:



- id: cleanup-resources-from-pull-requests
  resources:
    - namespaces
  jmespath: "contains(metadata.name, '-pr-')"
  ttl: 2d
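

A sketch of how the rules file might be wired into the kube-janitor container (the mount path is an assumption; kube-janitor also has a --dry-run flag that only logs what would be deleted):

containers:
- name: kube-janitor
  image: hjacobs/kube-janitor
  args:
    - --dry-run                        # assumption: log-only mode while testing the rules
    - --rules-file=/config/rules.yaml  # path where the rules ConfigMap is mounted
  volumeMounts:
  - name: config
    mountPath: /config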


The following example requires the application label on Deployment and StatefulSet pods for all new Deployments/StatefulSets in 2020, but still allows running tests without this label for a week:



- id: require-application-label
  # remove deployments and statefulsets without the "application" label
  resources:
    - deployments
    - statefulsets
  # see http://jmespath.org/specification.html
  jmespath: "!(spec.template.metadata.labels.application) && metadata.creationTimestamp > '2020-01-01'"
  ttl: 7d


To run a demo limited to 30 minutes on a cluster where kube-janitor is running:



kubectl run nginx-demo --image=nginx
kubectl annotate deploy nginx-demo janitor/ttl=30m


Another source of growing costs is persistent volumes (AWS EBS). Deleting a Kubernetes StatefulSet does not delete its persistent volume claims (PVCs). Unused EBS volumes can easily run up costs of hundreds of dollars per month. Kubernetes Janitor has a feature to clean up unused PVCs. For example, this rule removes all PVCs that are not mounted by a pod and not referenced by a StatefulSet or CronJob:



# remove all PVCs that are not mounted and not referenced by StatefulSets
- id: remove-unused-pvcs
  resources:
  - persistentvolumeclaims
  jmespath: "_context.pvc_is_not_mounted && _context.pvc_is_not_referenced"
  ttl: 24h


Kubernetes Janitor can help keep your cluster clean and prevent slowly accumulating cloud costs. For deployment and configuration instructions, see the kube-janitor README.



Scaling down outside working hours



Test and staging environments are usually only needed during working hours. Some production applications, such as back-office/admin tools, also need only limited availability and can be turned off at night.



Kubernetes Downscaler (kube-downscaler) allows users and operators to scale the system down during off-hours. Deployments and StatefulSets can be scaled to zero replicas, and CronJobs can be suspended. Kubernetes Downscaler can be configured for the entire cluster, for one or more namespaces, or for individual resources. You can define either "downtime" or, conversely, "uptime" periods. For example, to scale down as much as possible during nights and weekends:



image: hjacobs/kube-downscaler:20.4.3
args:
  - --interval=30
  # do not scale down infrastructure namespaces
  - --exclude-namespaces=kube-system,infra
  # do not scale down kube-downscaler itself or the Postgres Operator, otherwise the Postgres databases could not be scaled back up
  - --exclude-deployments=kube-downscaler,postgres-operator
  - --default-uptime=Mon-Fri 08:00-20:00 Europe/Berlin
  - --include-resources=deployments,statefulsets,stacks,cronjobs
  - --deployment-time-annotation=deployment-time
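

Individual resources can also carry their own schedule via the downscaler/uptime annotation; a sketch:

metadata:
  annotations:
    # this resource is only kept scaled up during the given window
    downscaler/uptime: Mon-Fri 08:00-18:00 Europe/Berlin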


Here's a graph of the cluster's worker nodes scaling down over a weekend:







Scaling down from ~13 to 4 worker nodes certainly makes a sizeable difference on the AWS bill.



But what if I need to work during the cluster's "downtime"? Certain deployments can be permanently excluded from downscaling by adding the annotation downscaler/exclude: true. Deployments can be temporarily excluded with the downscaler/exclude-until annotation and an absolute timestamp in the format YYYY-MM-DD HH:MM (UTC). If necessary, the entire cluster can be scaled back up by deploying a pod with the downscaler/force-uptime annotation, for example an nginx dummy:



kubectl run scale-up --image=nginx
kubectl annotate deploy scale-up janitor/ttl=1h # clean up the deployment automatically after one hour
kubectl annotate pod $(kubectl get pod -l run=scale-up -o jsonpath="{.items[0].metadata.name}") downscaler/force-uptime=true
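

To temporarily exclude a single deployment instead, the annotation can be set in its manifest (the timestamp is a hypothetical example):

metadata:
  annotations:
    # excluded from downscaling until this UTC timestamp
    downscaler/exclude-until: "2020-06-01 08:00"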


See the kube-downscaler README for deployment instructions and additional options.



Using horizontal autoscaling



Many applications/services deal with dynamic load patterns: sometimes their pods are idle, and sometimes they run at full capacity. Operating a fixed fleet of pods sized for peak load is not cost-efficient. Kubernetes supports horizontal autoscaling via the HorizontalPodAutoscaler (HPA) resource. CPU usage is often a good scaling metric:



apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization


Zalando has created a component to easily plug in custom metrics for scaling: the Kube Metrics Adapter (kube-metrics-adapter) is a generic metrics adapter for Kubernetes that can collect and serve custom and external metrics for horizontal pod autoscaling. It supports scaling based on Prometheus metrics, SQS queue lengths, and other sources. For example, to scale a deployment on a custom metric exposed by the application itself as JSON at /metrics, use:



apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  annotations:
    # metric-config.<metricType>.<metricName>.<collectorName>/<configKey>
    metric-config.pods.requests-per-second.json-path/json-key: "$.http_server.rps"
    metric-config.pods.requests-per-second.json-path/path: /metrics
    metric-config.pods.requests-per-second.json-path/port: "9090"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests-per-second
      target:
        averageValue: 1k
        type: AverageValue
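

With this configuration the adapter polls each pod on port 9090 at /metrics and extracts the value addressed by the json-key above, so the application would expose JSON along these lines (a sketch):

{
  "http_server": {
    "rps": 0.8
  }
}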


Configuring horizontal autoscaling with HPA should be one of the default measures for improving the efficiency of stateless services. Spotify has a talk with their experience and recommendations for HPA: Scale your deployments, not your wallet.



Reducing resource redundancy



Kubernetes workloads declare their CPU/memory needs via "resource requests". CPU resources are measured in virtual cores or, more commonly, in millicores: for example, 500m means 50% of a vCPU. Memory resources are measured in bytes, and the usual suffixes can be used, e.g. 500Mi means 500 mebibytes. Resource requests "block" capacity on worker nodes: a pod requesting 1000m of CPU on a node with 4 vCPUs leaves only 3 vCPUs available for other pods. [1]
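

A sketch of such requests in a container spec, using the units from above:

resources:
  requests:
    cpu: 500m      # 50% of one vCPU
    memory: 500Mi  # 500 MiB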



Slack (excess reservation) is the difference between the requested resources and actual usage. For example, a pod that requests 2 GiB of memory but uses only 200 MiB has ~1.8 GiB of memory slack. Slack costs money: roughly estimated, 1 GiB of slack memory costs ~$10 per month. [2]



Kubernetes Resource Report (kube-resource-report) displays slack and can help you identify potential savings:







Kubernetes Resource Report shows slack aggregated by application and team, which helps find places where resource requests can be lowered. The generated HTML report only provides a snapshot of resource usage; you should look at CPU/memory usage over time to determine adequate requests. Here is a Grafana chart for a "typical" CPU-heavy service: all pods use significantly less than the 3 requested CPU cores:







Reducing the CPU request from 3000m to ~400m frees up resources for other workloads and lets the cluster shrink.



“The average CPU utilization of EC2 instances often fluctuates in the single-digit percentage range,” writes Corey Quinn. While right-sizing EC2 instances can be a bad idea, changing some Kubernetes resource requests in a YAML file is easy and can bring huge savings.
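

For example, applying the reduction from the Grafana chart above is a one-line change (a sketch):

resources:
  requests:
    cpu: 400m  # was: 3000m, far above the observed usage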



But do we really want humans changing values in YAML files? No, machines can do this much better! The Kubernetes Vertical Pod Autoscaler (VPA) does exactly that: it adapts resource requests and limits to the workload.
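

A minimal VPA sketch targeting a deployment (the apiVersion may differ depending on your VPA release):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    # "Auto" applies the recommendations; "Off" only records them
    updateMode: "Auto"

Here's an example Prometheus graph of CPU requests (thin blue line) adapted by VPA over time: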







Zalando uses VPA in all of its clusters for infrastructure components. Non-critical applications can also use VPA.



Goldilocks from Fairwinds is a tool that creates a VPA for each deployment in a namespace and then displays the VPA recommendations on its dashboard. It can help developers set the right CPU/memory requests for their applications.
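

As far as I know, Goldilocks is enabled per namespace via a label; treat this as an assumption and check the Goldilocks README:

apiVersion: v1
kind: Namespace
metadata:
  name: my-team  # hypothetical namespace
  labels:
    # assumption: this label tells Goldilocks to create VPAs for this namespace
    goldilocks.fairwinds.com/enabled: "true"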







I wrote a small blog post about VPA in 2019, and recently VPA was discussed in the CNCF End User Community.



Using EC2 Spot Instances



Last but not least: AWS EC2 costs can be reduced by using Spot instances as Kubernetes worker nodes [3]. Spot instances are available at discounts of up to 90% compared to On-Demand prices. Running Kubernetes on EC2 Spot is a good combination: you need to specify several different instance types for higher availability, meaning you can get a larger node for the same or a lower price, and the increased capacity can be used by containerized Kubernetes workloads.



How do I run Kubernetes on EC2 Spot? There are several options: use a third-party service such as SpotInst (now called Spot, don't ask me why), or simply add a Spot AutoScalingGroup (ASG) to your cluster. For example, here's a CloudFormation snippet for a "capacity-optimized" Spot ASG with multiple instance types:



MySpotAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    HealthCheckGracePeriod: 300
    HealthCheckType: EC2
    MixedInstancesPolicy:
      InstancesDistribution:
        OnDemandPercentageAboveBaseCapacity: 0
        SpotAllocationStrategy: capacity-optimized
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref LaunchTemplate
          Version: !GetAtt LaunchTemplate.LatestVersionNumber
        Overrides:
          - InstanceType: "m4.2xlarge"
          - InstanceType: "m4.4xlarge"
          - InstanceType: "m5.2xlarge"
          - InstanceType: "m5.4xlarge"
          - InstanceType: "r4.2xlarge"
          - InstanceType: "r4.4xlarge"
    MinSize: 0
    MaxSize: 100
    Tags:
    # this tag tells cluster-autoscaler that nodes from this ASG carry the Spot label
    - Key: k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot
      PropagateAtLaunch: true
      Value: "true"


Some notes on using Spot with Kubernetes:



  • You need to handle Spot terminations, for example by draining the node when the instance is being shut down
  • Zalando uses a fork of the official cluster-autoscaler with node pool priorities
  • Spot nodes can be forced to accept specific workloads, for example via node labels or taints, as sketched below
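

A sketch of pinning a deployment's pods to Spot nodes via the node label advertised by the ASG tag above (this assumes the nodes actually register with the aws.amazon.com/spot=true label):

spec:
  template:
    spec:
      nodeSelector:
        # schedule these pods only on Spot nodes
        aws.amazon.com/spot: "true"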


Summary



I hope you find some of the presented tools useful for reducing your cloud computing bill. You can also find most of the article's content in my talk at DevOps Gathering 2019 on YouTube and in the slides.



What are your best practices for saving cloud costs on Kubernetes? Please let me know on Twitter (@try_except_).






[1] In fact, fewer than 3 vCPUs will remain usable, since the node's capacity is reduced by reserved system resources. Kubernetes distinguishes between a node's physical capacity and "allocatable" resources (Node Allocatable).



[2] Calculation example: an m5.large instance with 8 GiB of memory costs ~$84 per month (eu-central-1, On-Demand), so blocking 1/8 of the node is approximately ~$10 per month.



[3] There are many more ways to reduce your EC2 bill, such as Reserved Instances, Savings Plans, etc. I won't cover those topics here, but you should definitely look into them!












