Scaling a Kubernetes cluster up to 7500 nodes


Photo: Carles Rabada, Unsplash.com







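We have scaled Kubernetes clusters to 7500 nodes, producing scalable infrastructure for large models such as GPT-3, CLIP, and DALL·E, as well as for rapid, small-scale iterative research. Scaling a single Kubernetes cluster to this size is rarely done and takes special care, but the payoff is a simple infrastructure that lets machine learning research teams move faster and scale up without changing their code.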














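Since our previous post on scaling to 2500 nodes, we have kept growing our infrastructure to meet researchers' needs and learned many additional lessons along the way. This post summarizes those lessons so that others in the Kubernetes community can benefit from them, and ends with the problems we still face and plan to tackle next.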









. , Kubernetes, . , .







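A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This lets GPUs communicate with one another directly over NVLink, or with the NIC directly using GPUDirect. For many of our workloads, therefore, a single pod occupies an entire node. Contention for NUMA, CPU, or PCIe resources is not a factor in scheduling. Bin-packing and fragmentation are not common problems. Our current clusters have full bisection bandwidth, so we also do not take rack or network topology into account. As a result, although we have many nodes, the load on the scheduler is relatively low.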







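That said, load on kube-scheduler is spiky. A new job may create many hundreds of pods at once, after which churn returns to a relatively low rate.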














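Our biggest jobs run MPI, and all pods within a job participate in a single MPI communicator. If any participating pod dies, the entire job halts and must be restarted. The job checkpoints regularly and, when restarted, resumes from the last checkpoint. We therefore consider the pods semi-stateful: killed pods can be replaced and work can continue, but doing so is disruptive and should be kept to a minimum.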







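We do not rely much on Kubernetes load balancing. We have very little HTTPS traffic and no need for A/B testing, blue/green, or canary deployments. Pods communicate with one another directly on their pod IP addresses, using MPI over SSH, rather than through service endpoints. Service "discovery" is limited: we do a one-time lookup of which pods participate in MPI when the job starts.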







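Most jobs interact with some form of blob storage. They usually either stream dataset shards or checkpoints directly from blob storage, or cache them on a fast local ephemeral disk. We keep a few PersistentVolumes for cases where POSIX semantics are useful, but blob storage is far more scalable and does not require slow detach/attach operations.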







, , , , . «-», , . , . , .









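As the number of nodes and pods in our clusters grew, we found that Flannel had difficulty scaling to the throughput we needed. We switched to the native pod networking technologies for IP configurations of Azure VMSSes and the relevant CNI plugins. This gave our pods host-level network throughput.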







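Another reason we switched to alias-based IP addressing is that our largest clusters can have roughly 200,000 IP addresses in use at any one time. When we tested route-based pod networking, we found significant limits on the number of routes we could use effectively.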







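Avoiding encapsulation increases the demands on the underlying SDN or routing engine, but it keeps our networking setup simple. A VPN or tunnel can be added without any additional adapters. We do not need to worry about packet fragmentation caused by some part of the network having a lower MTU. Network policies and traffic monitoring are straightforward: there is no ambiguity about the source and destination of packets.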







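We use iptables tagging on the host to track network resource usage per namespace and pod. This lets researchers visualize their network usage patterns. Because many of our experiments have distinct internet and pod-to-pod communication patterns, it is often useful to be able to investigate where bottlenecks occur.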







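We use iptables mangle rules to arbitrarily mark packets that match particular criteria. Here are our rules for detecting whether traffic is internal or internet-bound. The FORWARD rules cover traffic from pods, while INPUT and OUTPUT cover traffic from hosts: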







# Rules with no -j target do not alter packets; they only count matches.
# Inbound from the internet: source address outside the private 10.0.0.0/8 range.
iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
# Outbound to the internet: destination address outside 10.0.0.0/8.
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"





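Once packets are marked, iptables starts counters that track the number of bytes and packets matching each rule. You can inspect these counters with iptables itself: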







% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
....
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */
      
      





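We use an open-source Prometheus exporter called iptables-exporter to get these counters into our monitoring system. It is a simple way to track packets that match a variety of conditions.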
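To make the counter listing above concrete, here is a minimal sketch of how tagged rules could be turned into per-label packet and byte totals. It is not the implementation of iptables-exporter. The function name parse_counters is hypothetical, the text after the "iptables-exporter" marker is treated as an opaque label, and the K/M/G/T suffixes are assumed to be decimal multiples.

```python
import re

# Assumption: iptables abbreviates large counters with decimal suffixes.
_SUFFIX = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

# A rule line from `iptables -L -v` that carries an "iptables-exporter" comment:
# packet count, byte count, arbitrary columns, then the /* ... */ comment.
_RULE = re.compile(
    r"^\s*(\d+)([KMGT]?)\s+(\d+)([KMGT]?)\s+.*/\*\s*iptables-exporter\s+(.*?)\s*\*/\s*$"
)


def parse_counters(listing: str) -> dict:
    """Return {label: {"packets": int, "bytes": int}} summed over tagged rules."""
    counters = {}
    for line in listing.splitlines():
        m = _RULE.match(line)
        if m is None:
            continue  # chain headers, column titles, and untagged rules
        pkts, pkts_unit, nbytes, bytes_unit, label = m.groups()
        entry = counters.setdefault(label, {"packets": 0, "bytes": 0})
        entry["packets"] += int(pkts) * _SUFFIX[pkts_unit]
        entry["bytes"] += int(nbytes) * _SUFFIX[bytes_unit]
    return counters


# Sample input copied from the listing in this article.
SAMPLE = """\
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */
"""

if __name__ == "__main__":
    for label, totals in parse_counters(SAMPLE).items():
        print(label, totals)
```

Because the abbreviated counters are rounded, a real collector would more likely read exact values (for example with the -x flag of iptables -L); the sketch keeps the abbreviated form only to match the listing shown here.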














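One somewhat unusual aspect of our network model is that we fully expose the node, pod, and service network CIDR ranges to our researchers. We use a hub and spoke network model and route traffic using the native node and pod CIDR ranges. Researchers connect to the hub and from there can reach any of the individual clusters (the spokes). The clusters themselves cannot talk to one another, which keeps them isolated, with no cross-cluster dependencies that could break failure isolation.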







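We use a NAT host to translate the service network CIDR range for traffic coming from outside the cluster. This setup gives our researchers considerable flexibility in the network configurations they can choose for their experiments.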







API servers



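The Kubernetes API servers and etcd are critical components of a healthy cluster, so we pay special attention to the load on these systems. We use the Grafana dashboards provided by kube-prometheus, along with additional in-house dashboards. We have found it useful to alert on the rate of HTTP 429 (Too Many Requests) and 5xx (Server Error) responses from the API servers as a high-level signal of problems.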














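While some people run API servers inside kube, we have always run them outside the cluster itself. Both etcd and the API servers run on their own dedicated nodes. Our largest clusters run multiple API servers and etcd nodes to spread the load and minimize the impact if one of them goes down. We have had no notable trouble with etcd since we split Kubernetes Events out into their own etcd cluster, as described in our previous post. API servers are stateless and generally easy to run in a self-healing instance group or scale set. We have not yet tried to build self-healing automation for etcd clusters, because incidents have been extremely rare.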







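API servers can use a fair amount of memory, and usage tends to scale linearly with the number of nodes in the cluster. For our cluster with 7500 nodes we observe up to 70 GB of heap in use per API server, so this should remain well within hardware capabilities for the foreseeable future.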














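One large source of load on the API servers was WATCHes on Endpoints. A few services, such as kubelet and node-exporter, have every node in the cluster as a member. Whenever a node was added to or removed from the cluster, this WATCH would fire. Because each node was typically itself watching the kubelet service through kube-proxy, the number of responses and the bandwidth they required grew as N² and became enormous, occasionally 1 GB/s or more. EndpointSlices, introduced in Kubernetes 1.17, were a huge benefit and brought this load down by a factor of 1000.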














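In general we are very careful about any API server requests that scale with the size of the cluster. We try to avoid having DaemonSets interact with the API server. Where every node does need to watch for changes, introducing an intermediary caching service, such as the Datadog Cluster Agent, seems to be a good pattern for avoiding cluster-wide bottlenecks.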







, . , — , , API . , , .







Time-series metrics with Prometheus and Grafana



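We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts. We started with a kube-prometheus deployment, which collects a wide variety of metrics and ships good dashboards for visualization. Over time we have added many of our own dashboards, metrics, and alerts.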







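As we added more and more nodes, we struggled with the sheer volume of metrics Prometheus was collecting. kube-prometheus exposes a lot of useful data, but some of it we never looked at, and some was too granular to collect, store, and query effectively. We use Prometheus rules to drop some of these metrics at ingestion.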







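For a while we struggled with a problem in which Prometheus consumed more and more memory until the container eventually crashed with an Out-Of-Memory (OOM) error. This happened even after we gave the application enormous amounts of memory. Worse, after a crash it would spend many hours on startup replaying write-ahead-log (WAL) files before it was usable again.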







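We eventually tracked down another source of OOMs to an interaction between Grafana and Prometheus, in which Grafana called the Prometheus API /api/v1/series with the query {le!=""} (in effect, "give me all the histogram metrics"). The implementation of /api/v1/series was unbounded in both time and space: for a query with many results it kept consuming ever more memory and time. It also kept growing after the requester had given up and closed the connection. For us there was never enough memory, and Prometheus would eventually crash. We patched Prometheus to run this API within a Context that enforces a timeout, which fixed the problem entirely.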
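To make the /api/v1/series request concrete, here is a minimal sketch of how a client could scope such a query to a time window using the endpoint's match[], start, and end parameters. This is a client-side illustration only, not something the article describes doing, and it does not by itself guarantee bounded memory use on the server. The function name series_url and the host prometheus.example are hypothetical.

```python
from urllib.parse import urlencode


def series_url(base_url: str, matcher: str, start: float, end: float) -> str:
    """Build a /api/v1/series URL restricted to the [start, end] window.

    start and end are Unix timestamps in seconds.
    """
    if end < start:
        raise ValueError("end must not precede start")
    params = [
        ("match[]", matcher),        # series selector, e.g. {le!=""}
        ("start", f"{start:.3f}"),   # lower bound of the time window
        ("end", f"{end:.3f}"),       # upper bound of the time window
    ]
    return f"{base_url.rstrip('/')}/api/v1/series?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host; one-hour window for the histogram selector.
    print(series_url("http://prometheus.example:9090", '{le!=""}',
                     1600000000, 1600003600))
```

A narrower selector (for example, one that also names the metric) reduces the result set further than a time window alone.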







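Prometheus now crashed far less often, but whenever we did need to restart it, WAL replay remained a problem. It often took many hours to replay all the WAL logs before Prometheus was collecting new metrics and serving queries again. With help from Robust Perception, we found that setting GOMAXPROCS=24 gave a big improvement. Prometheus tries to use all cores during WAL replay, and on servers with a large number of cores the resulting contention kills performance.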







(. ).









, . .









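Passive healthchecks run constantly on all nodes. They monitor basic system resources such as network reachability, bad or full disks, and GPU errors. GPUs exhibit problems in a number of different ways, but a common and easy one to detect is an uncorrectable ECC error. Nvidia Data Center GPU Manager (DCGM) tools make it easy to query for this and a number of other Xid errors. One way we track these errors is with dcgm-exporter, which ingests the metrics into Prometheus, our monitoring system. The error appears as the DCGM_FI_DEV_XID_ERRORS metric, set to the most recent error code. In addition, the NVML Device Query API exposes more detailed information about the health and operation of a GPU.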
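As an illustration of how the DCGM_FI_DEV_XID_ERRORS metric might be consumed, the sketch below scans Prometheus exposition text and reports GPUs whose last Xid code is nonzero. It is a minimal sketch, not the healthcheck used by the authors. The function name gpus_with_xid, the label names gpu and Hostname, and the sample values are assumptions made for the example.

```python
import re

# One sample line of the metric: name, {labels}, value.
_SAMPLE_LINE = re.compile(
    r'^DCGM_FI_DEV_XID_ERRORS\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpus_with_xid(exposition: str) -> dict:
    """Map the value of the (assumed) gpu label to its last nonzero Xid code."""
    flagged = {}
    for line in exposition.splitlines():
        m = _SAMPLE_LINE.match(line)
        if m is None:
            continue  # comments and other metrics
        labels = dict(_LABEL.findall(m.group("labels")))
        code = int(float(m.group("value")))
        if code != 0:
            flagged[labels.get("gpu", "unknown")] = code
    return flagged


# Hypothetical scrape output; label names and values are illustrative only.
SAMPLE = '''\
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="node-a"} 48
'''

if __name__ == "__main__":
    print(gpus_with_xid(SAMPLE))
```

In practice the same condition would more likely be written as an alerting rule inside Prometheus; the Python form is used here only to show the logic.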







GPU , GPU.







. , . , , .







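Healthcheck failures can lead to a variety of responses. Once a healthcheck starts failing, the node is automatically cordoned so that no new pods are scheduled on it. For more serious failures, we also attempt a pod eviction, asking all currently running pods to exit immediately. It is still up to the pod itself to decide whether to allow the eviction (configurable via a Pod Disruption Budget). Eventually, either after all pods have terminated or after 7 days have elapsed (part of our SLA), we forcibly terminate the VM.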







Active GPU tests



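Unfortunately, not all GPU problems show up as error codes visible through DCGM. We have built our own library of tests that exercise GPUs to catch additional problems and confirm that the hardware and driver behave as expected. These tests cannot run in the background: they need exclusive use of a GPU for several seconds or minutes.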







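We first run these tests when a node boots, in a system we call preflight. All nodes join the cluster with a preflight taint and label applied, and the taint prevents normal pods from being scheduled on the node. A DaemonSet is configured to run preflight test pods on all nodes with this label. When the test completes successfully, the test itself removes the taint and label, and the node becomes available for general use.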







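We also run these tests periodically during the lifetime of a node, as a CronJob that can land on any available node in the cluster. Which nodes get tested is admittedly somewhat random and uncontrolled, but we have found that over time this provides sufficient coverage with minimal coordination or disruption.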









, , . , Kubernetes . , Kubernetes.







Team taints



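We developed an in-house service called team-resource-manager that has several functions. Its data source is a ConfigMap that specifies tuples of node selector, team label to apply, and allocation amount for every research team that has capacity in a given cluster. It reconciles this against the current nodes in the cluster, tainting the appropriate number of nodes with openai.com/team=teamname:NoSchedule.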









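team-resource-manager also has an admission webhook service, so that as each job is submitted, a corresponding toleration is applied based on the submitter's team membership. Using taints lets us constrain the Kubernetes pod scheduler flexibly, for example by allowing an "any" toleration for lower-priority pods, which lets teams borrow each other's capacity without heavyweight coordination.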
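The taint and toleration pairing can be sketched as follows. This is a minimal illustration of how a mutating admission webhook could add a toleration matching the openai.com/team taint; it is not the code of team-resource-manager. The function names are hypothetical, and how the submitter's team is looked up is left out entirely.

```python
import base64
import json

TEAM_TAINT_KEY = "openai.com/team"  # taint key quoted in the article


def team_toleration(team: str) -> dict:
    """Toleration matching the taint openai.com/team=<team>:NoSchedule."""
    return {"key": TEAM_TAINT_KEY, "operator": "Equal",
            "value": team, "effect": "NoSchedule"}


def toleration_patch(pod: dict, team: str) -> list:
    """JSON Patch operations that add the team toleration to a pod object."""
    tol = team_toleration(team)
    existing = pod.get("spec", {}).get("tolerations")
    if existing is None:
        # No tolerations list yet: create it with the single entry.
        return [{"op": "add", "path": "/spec/tolerations", "value": [tol]}]
    if tol in existing:
        return []  # already tolerated, nothing to change
    # Append to the existing list ("-" addresses the end of an array).
    return [{"op": "add", "path": "/spec/tolerations/-", "value": tol}]


def admission_response(uid: str, patch: list) -> dict:
    """Wrap a patch in an admission.k8s.io/v1 AdmissionReview response."""
    response = {"uid": uid, "allowed": True}
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}


if __name__ == "__main__":
    pod = {"spec": {"containers": []}}  # hypothetical incoming pod
    print(json.dumps(admission_response("example-uid",
                                        toleration_patch(pod, "teamname")), indent=2))
```

A toleration with operator Exists and no value would tolerate every team's taint, which is one way the borrowing behaviour for lower-priority pods could be expressed.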







CPU and GPU balloons



, , ( ) . 0, — . , . — , API, .







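So we introduced a balloon Deployment for both our CPU-only and GPU hosts, containing a ReplicaSet with the maximum number of low-priority pods. These pods occupy resources on a node, so the autoscaler does not consider the node idle. Because they are low priority, however, the scheduler can evict them immediately to make room for real work. (We chose a Deployment rather than a DaemonSet so that the DaemonSet would not be treated as idle workload on a node.)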







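One thing to note: we use pod anti-affinity so that the balloon pods distribute evenly across nodes. Earlier versions of the Kubernetes scheduler had an O(N²) performance problem with pod anti-affinity. This has been fixed since Kubernetes 1.18.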







Gang scheduling



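Our experiments often involve one or more StatefulSets, each handling a different part of the training effort. For optimizers, researchers need every member of the StatefulSet to be scheduled before any training can start (we often use MPI to coordinate between optimizer members, and MPI is sensitive to changes in group membership).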







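By default, however, Kubernetes does not necessarily prioritize fulfilling all the requests of one StatefulSet over another. For example, if two experiments each request 100% of the cluster's capacity, Kubernetes may schedule only half of each experiment's pods instead of all of one experiment or the other, producing a deadlock in which neither experiment can make progress.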







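We tried several approaches that required a custom scheduler, but ran into edge cases that conflicted with how normal pods were scheduled. Kubernetes 1.18 introduced a plugin architecture for the core Kubernetes scheduler, which makes it much easier to add features like this natively. We recently settled on the Coscheduling plugin as a good way to solve the problem.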









Many problems remain as we scale our Kubernetes clusters. A few of them:









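At our scale we have had many difficulties with the storage engine built into Prometheus, which is slow to compact and takes a long time to replay the WAL on every restart. Queries also tend to fail with "query processing would load too many samples" errors. We are migrating to a different storage and query engine that is compatible with Prometheus.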









-. , , .









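We have found Kubernetes to be an exceptionally flexible platform for our research needs. It can scale up to meet the most demanding workloads we have put on it. There are still many areas where it needs improvement, though, and the Supercomputing team at OpenAI will keep exploring how Kubernetes can scale.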







