How to increase the reaction speed of Kubernetes when cluster nodes fail?

Kubernetes is designed to be robust and resilient to failures, and it can recover automatically. It does this well! However, worker nodes can lose their connection to the cluster or fail for various reasons. In these cases it is important that Kubernetes reacts to the incident quickly.





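When a node fails, the pods running on it become unavailable. So how does Kubernetes find out that a node has failed, and how long does that take?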





To understand how Kubernetes reacts to a node failure, look at how the Kubelet and the Controller Manager interact:





  1. The Kubelet periodically reports its status to the kube-apiserver, at the interval set by --node-status-update-frequency. The default is 10 seconds.





  2. The Controller Manager checks the Kubelet's status at the interval set by --node-monitor-period. The default is 5 seconds.





  3. If the Kubelet's status has not been updated within --node-monitor-grace-period, the Controller Manager considers the Kubelet unhealthy. The default is 40 seconds.





When a node fails, the sequence of events is as follows:





  1. The Kubelet reports its status to the kube-apiserver every --node-status-update-frequency = 10 seconds.





  2. The node fails.





  3. The Controller Manager, which checks the Kubelet's status every --node-monitor-period = 5 seconds, stops seeing status updates from the node.





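  4. The Controller Manager sees that the node is not responding and waits for --node-monitor-grace-period, which is 40 seconds. If the status is still not updated after that, the Controller Manager marks the node as NotReady.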





  5. Kube Proxy removes the endpoints that point to pods on the failed node, so traffic is no longer sent to those pods.





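In this scenario, requests to pods on the failed node can return errors, because those pods keep receiving traffic until the node is marked as down (NotReady), which takes about 45 seconds.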
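The 45-second figure is roughly the grace period plus one monitor period. Here is a minimal sketch of that arithmetic, assuming this simple sum is a good enough upper bound; the function name `worst_case_detection` is hypothetical and used only for illustration.

```shell
#!/bin/bash
# Rough upper bound, in seconds, on the time between a node failure and the
# moment the node is marked NotReady.
# Assumption: bound = node-monitor-grace-period + one node-monitor-period.
worst_case_detection() {
  local grace_period="$1"
  local monitor_period="$2"
  echo $(( grace_period + monitor_period ))
}

# Default values: 40 s grace period, 5 s monitor period
worst_case_detection 40 5
```

The same function can be called with other values to estimate the effect of changing either parameter.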





These timings are controlled by parameters of the Kubelet and the Controller Manager.





To make Kubernetes react faster to node failures, you can change the following parameters:





--node-status-update-frequency: set to 1 second (default: 10 seconds)

--node-monitor-period: set to 1 second (default: 5 seconds)

--node-monitor-grace-period: set to 4 seconds (default: 40 seconds)





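To test this, you can create a Kubernetes cluster with Kind. Below is a Kind Cluster configuration that sets the parameters described above.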





kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  nodeStatusUpdateFrequency: 1s
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    controllerManager:
        extraArgs:
          node-monitor-period: 1s
          node-monitor-grace-period: 4s
- role: worker
      
      



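Next, create an Nginx deployment with two replicas, one on the control-plane node and one on the worker. An Ubuntu pod is also created on the control-plane node to check whether Nginx stays available while the worker is down.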





#!/bin/bash

# create a K8S cluster with Kind
kind create cluster --config kind.yaml 
# create an Ubuntu pod on the control-plane node
kubectl run ubuntu --wait=true --image ubuntu --overrides='{"spec": { "nodeName": "kind-control-plane"}}' sleep 30d
# untaint the control-plane node so that pods can be scheduled on it
# (on newer Kubernetes versions the taint is node-role.kubernetes.io/control-plane)
kubectl taint node kind-control-plane node-role.kubernetes.io/master-
# create Nginx deployment with 2 replicas, one on each node
kubectl create deploy ng --image nginx
sleep 30
kubectl scale deployment ng --replicas 2
# expose the Nginx deployment so that it is reachable on port 80
kubectl expose deploy ng --port 80  --type ClusterIP
# install curl in Ubuntu pod
kubectl exec ubuntu -- bash -c "apt update && apt install -y curl"
      
      



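To test access to Nginx, run curl against the service from the Ubuntu pod on the control-plane node, and at the same time watch the endpoints that belong to the Nginx service.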





# test Nginx service access from Ubuntu pod
kubectl exec ubuntu -- bash -c 'while true ; do echo "$(date +"%T.%3N") - Status: $(curl -s -o /dev/null -w "%{http_code}" -m 0.2 -i ng)" ; done'

# show Nginx service endpoints
# (gdate is GNU date from coreutils on macOS; on Linux use date instead)
while true; do gdate +"%T.%3N"; kubectl get endpoints ng -o json | jq '.subsets[].addresses[].nodeName'; echo "------"; done

      
      



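To simulate a node failure, stop the Kind container that runs the worker node. The script also prints the time at which the node was detected as NotReady.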





#!/bin/bash

# kill Kind worker node
echo "Worker down at $(gdate +"%T.%3N")"
docker stop kind-worker > /dev/null
sleep 15
# show when the node was detected to be down
echo "Worker detected in down state by Control Plane at "
kubectl get event --field-selector reason=NodeNotReady --sort-by='.lastTimestamp' -oyaml | grep time | tail -n1
# start worker node again
docker start kind-worker > /dev/null

      
      



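Running the script shows that the node was stopped at 12:50:22 and that the Controller Manager detected the failure at 12:50:26, about 4 seconds later.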





Worker down at 12:50:22.285
Worker detected in down state by Control Plane at
      time: "12:50:26Z"
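The delay between the two timestamps can also be computed rather than read off by eye. The following is a small sketch, assuming both timestamps are from the same day and the same time zone, with fractions of a second dropped; the helper names `to_seconds` and `detection_delay` are hypothetical.

```shell
#!/bin/bash
# Convert HH:MM:SS[.mmm] to whole seconds since midnight (fraction dropped).
to_seconds() {
  local hms h rest m s
  hms=${1%%.*}                 # strip the fractional part, if any
  h=${hms%%:*}; rest=${hms#*:}
  m=${rest%%:*}; s=${rest#*:}
  # strip one leading zero so values such as "08" are not read as octal
  h=${h#0}; m=${m#0}; s=${s#0}
  echo $(( h * 3600 + m * 60 + s ))
}

# Whole seconds elapsed between two timestamps taken on the same day.
detection_delay() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Timestamps taken from the output above
detection_delay 12:50:22.285 12:50:26
```

With the two timestamps from the run above, this prints 4.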
      
      



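The terminal test shows a similar picture. The service starts returning errors at 12:50:23, because part of the traffic is still routed to the failed node. At 12:50:26.744 the endpoint pointing to the failed node is removed, Kube Proxy stops sending traffic to it, and the service is fully available again.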





...
12:50:23.115 - Status: 200
12:50:23.141 - Status: 200
12:50:23.161 - Status: 200
12:50:23.190 - Status: 000
12:50:23.245 - Status: 200
12:50:23.269 - Status: 200
12:50:23.291 - Status: 000
12:50:23.503 - Status: 200
12:50:23.520 - Status: 000
12:50:23.738 - Status: 000
12:50:23.954 - Status: 000
12:50:24.166 - Status: 000
12:50:24.385 - Status: 200
12:50:24.407 - Status: 000
12:50:24.623 - Status: 000
12:50:24.839 - Status: 000
12:50:25.053 - Status: 000
12:50:25.276 - Status: 200
12:50:25.294 - Status: 000
12:50:25.509 - Status: 200
12:50:25.525 - Status: 200
12:50:25.541 - Status: 200
12:50:25.556 - Status: 200
12:50:25.575 - Status: 000
12:50:25.793 - Status: 200
12:50:25.809 - Status: 200
12:50:25.826 - Status: 200
12:50:25.847 - Status: 200
12:50:25.867 - Status: 200
12:50:25.890 - Status: 000
12:50:26.110 - Status: 000
12:50:26.325 - Status: 000
12:50:26.549 - Status: 000
12:50:26.604 - Status: 200
12:50:26.669 - Status: 000
12:50:27.108 - Status: 200
12:50:27.135 - Status: 200
12:50:27.162 - Status: 200
12:50:27.188 - Status: 200
...
...
------
12:50:26.523
"kind-control-plane"
"kind-worker"
------
12:50:26.618
"kind-control-plane"
"kind-worker"
------
12:50:26.744
"kind-control-plane"
------
12:50:26.878
"kind-control-plane"
------
...
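To find the error window in a longer log, the first and last failed requests can be extracted automatically. This is a sketch that assumes the log has the same format as the curl loop output above, where code 000 means that curl got no HTTP response; the function name `outage_window` is hypothetical.

```shell
#!/bin/bash
# Print the timestamps of the first and last failed request (code 000)
# found in the curl loop output read from stdin.
outage_window() {
  awk '/Status: 000/ { if (first == "") first = $1; last = $1 }
       END { if (first != "") print first, last }'
}

# A few sample lines in the same format as the log above
printf '%s\n' \
  '12:50:23.161 - Status: 200' \
  '12:50:23.190 - Status: 000' \
  '12:50:26.669 - Status: 000' \
  '12:50:27.108 - Status: 200' | outage_window
```

For the sample lines this prints `12:50:23.190 12:50:26.669`. In a real test, the output of the curl loop could be saved to a file and passed to the function on stdin.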
      
      



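With these settings, Kubernetes reacts to a node failure much faster. This comes at a cost, however: Kubernetes has more work to do, because every node now updates its status, which is stored in etcd, once per second. For example, a cluster with 1000 nodes produces 60000 node updates per minute, which may require larger etcd instances or even dedicated nodes for etcd.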
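The update rate is easy to estimate for other cluster sizes. Below is a minimal sketch, assuming every node sends exactly one status update per interval; the function name `updates_per_minute` is hypothetical.

```shell
#!/bin/bash
# Estimated number of node status updates per minute.
# Assumption: each node posts exactly one update per interval.
updates_per_minute() {
  local nodes="$1"
  local interval_seconds="$2"
  echo $(( nodes * 60 / interval_seconds ))
}

updates_per_minute 1000 1    # 1-second interval
updates_per_minute 1000 10   # default 10-second interval
```

For 1000 nodes this gives 60000 updates per minute with a 1-second interval, compared with 6000 at the default 10 seconds.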












