Photo Carles Rabada, Unsplash.com
Kubernetes 7500 , , GPT-3, CLIP DALL·E, , , . Kubernetes — , , , .
2500 . . , Kubernetes. , .
. , Kubernetes, . , .
. , . GPU NVLink GPUDirect. . NUMA, CPU PCIE . Bin-packing — . (full bisection bandwidth), , . , , .
kube-scheduler , . , .
MPI, MPI-. , , . . stateful — , , .
Kubernetes. HTTPS-, A/B-, blue/green canary . IP- MPI SSH, . Service "discovery" — , MPI .
blob-. , blob- . PersistentVolume POSIX, blob- , .
, , , , . «-», , . , . , .
, , Flannel . IP Azure VMSS CNI. .
IP- , 200 000 IP- . , .
SDN , . VPN . , MTU. — .
iptables , . . , .
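We use iptables mangle rules to mark packets that match particular criteria. Here are our rules for detecting whether traffic is internal or internet-bound. The FORWARD rules cover traffic from pods, while INPUT and OUTPUT cover traffic from the host: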
iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
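Once packets are marked, iptables keeps counters of the packets and bytes that match each rule. These counters can be viewed with iptables itself: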
% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target  prot opt in   out  source        destination
....
1253K  555M         all  --  any  any  anywhere      !10.0.0.0/8   /* iptables-exporter openai traffic=internet-out */
1161K 7937M         all  --  any  any  !10.0.0.0/8   anywhere      /* iptables-exporter openai traffic=internet-in */
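To make the counter listing above easier to work with, here is a minimal Python sketch that sums packet and byte counters per traffic label. It is an illustration only, not how iptables-exporter is implemented. The helper names are hypothetical, and the suffix table assumes iptables prints decimal multiples (K = 10^3, M = 10^6, G = 10^9), so the results are approximations of the abbreviated values.

```python
import re

# Assumed decimal suffixes as printed by `iptables -L -v`.
_SUFFIX = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

# Matches the rule comment format used in the rules shown earlier.
_LABEL = re.compile(r"/\* iptables-exporter openai traffic=(\S+) \*/")


def parse_count(text):
    """Convert an abbreviated counter such as '1253K' to an approximate integer."""
    if text[-1] in _SUFFIX:
        return int(text[:-1]) * _SUFFIX[text[-1]]
    return int(text)


def traffic_counters(listing):
    """Sum (packets, bytes) per traffic label found in rule comments."""
    totals = {}
    for line in listing.splitlines():
        match = _LABEL.search(line)
        fields = line.split()
        if not match or len(fields) < 2:
            continue  # header lines, chain names, unlabeled rules
        label = match.group(1)
        pkts, nbytes = parse_count(fields[0]), parse_count(fields[1])
        old_pkts, old_bytes = totals.get(label, (0, 0))
        totals[label] = (old_pkts + pkts, old_bytes + nbytes)
    return totals


sample = """\
1253K  555M all -- any any anywhere !10.0.0.0/8 /* iptables-exporter openai traffic=internet-out */
1161K 7937M all -- any any !10.0.0.0/8 anywhere /* iptables-exporter openai traffic=internet-in */
"""
print(traffic_counters(sample))
```

For exact values rather than abbreviated ones, iptables can be asked for exact counters with the -x flag, in which case parse_count simply falls through to int().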
- Prometheus iptables-exporter, . .
— CIDR- , . hub and spoke CIDR- . , ( ). — , .
NAT CIDR- , - . , .
API
Kubernetes API etcd — , . Grafana kube-prometheus, . HTTP 429 ( ) 5xx ( ) API.
API kube, . etcd API . API etcd, , . etcd Kubernetes etcd, . API — stateless . etcd, .
API , . 7500 70 API. , .
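Another strain on the API servers has been WATCHes on Endpoints. A few services, such as kubelet and node-exporter, have every node in the cluster as a member. Whenever a node is added to or removed from the cluster, this WATCH fires. Because each node typically watches the kubelet service through kube-proxy, the number of responses and the bandwidth they need grow as N², at times reaching 1 GB/s or more. EndpointSlices, introduced in Kubernetes 1.17, cut this load by a factor of 1000.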
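The N² growth mentioned above can be illustrated with a toy calculation. This is a rough sketch under stated assumptions, not a model of the real API server: the 100 bytes per endpoint is a hypothetical figure, and the slice size of 100 is the default maximum number of endpoints per EndpointSlice. It shows only the change from quadratic to linear scaling and does not attempt to reproduce the reduction quoted in the text.

```python
def endpoints_fanout(nodes, bytes_per_endpoint=100):
    """Bytes sent for one membership change with a single Endpoints object.

    The whole object (one entry per node) is re-sent to every watching node,
    so the cost grows as nodes * nodes.
    """
    return nodes * nodes * bytes_per_endpoint


def endpointslice_fanout(nodes, bytes_per_endpoint=100, slice_size=100):
    """Bytes sent for one membership change with EndpointSlices.

    Only the slice that changed (at most slice_size entries) is re-sent to
    every watching node, so the cost grows linearly with the node count.
    """
    return nodes * min(nodes, slice_size) * bytes_per_endpoint


nodes = 7500
before = endpoints_fanout(nodes)
after = endpointslice_fanout(nodes)
print(before, after, before // after)
```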
API, . , DaemonSet API. , , , Datadog Cluster Agent, .
, . , — , , API . , , .
Prometheus Grafana
Prometheus Grafana , . kube-prometheus, . , .
, , Prometheus. kube-prometheus , , , , . Prometheus, .
- , Prometheus , Out-Of-Memory (OOM). , . , , , write-ahead-log (WAL) .
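We tracked one source of these OOMs down to an interaction between Grafana and Prometheus, in which Grafana called the /api/v1/series API on Prometheus with the query {le!=""} (that is, "give me all the histogram metrics"). The implementation of /api/v1/series was unbounded in both time and space: for a query with many results it kept consuming more memory and time. It also kept growing after the requester had given up and closed the connection. For us there was never enough memory, and Prometheus would eventually crash. We patched Prometheus to run this API inside a Context that enforces a timeout, which fixed the problem entirely.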
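Although Prometheus crashed far less often after that, WAL replay remained a problem whenever we did need to restart it. It often took many hours to replay all the WAL logs before Prometheus was back up, collecting new metrics and serving queries. With help from Robust Perception, we found that setting GOMAXPROCS=24 made a big improvement. Prometheus tries to use all cores during WAL replay, and on servers with a large number of cores the contention kills performance.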
, . .
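Some health checks are passive and always run on every node. They monitor basic system resources such as network reachability, bad or full disks, and GPU errors. GPUs fail in a number of different ways, but a common and easily detected one is the uncorrectable ECC error. Nvidia Data Center GPU Manager (DCGM) makes it much easier to query for this and a number of other Xid errors. One way we track these errors is with dcgm-exporter, which ingests the metrics into Prometheus, our monitoring system. They appear as the DCGM_FI_DEV_XID_ERRORS metric, set to the most recent error code. In addition, the NVML Device Query API exposes more detailed information about the health and operation of a GPU.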
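As an illustration of how the DCGM_FI_DEV_XID_ERRORS metric might be consumed, the following Python sketch scans Prometheus text-format output and lists GPUs whose last Xid code is nonzero. The function name, the label names (gpu, Hostname) and the sample values are assumptions made for the example, and a value of 0 is assumed to mean that no Xid error has been recorded.

```python
import re

# One sample line of the metric in Prometheus text exposition format.
_SAMPLE = re.compile(
    r'^DCGM_FI_DEV_XID_ERRORS\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpus_with_xid(exposition_text):
    """Return (labels, xid_code) for every GPU reporting a nonzero Xid code."""
    failing = []
    for line in exposition_text.splitlines():
        match = _SAMPLE.match(line)
        if not match:
            continue  # comments, other metrics, blank lines
        code = int(float(match.group("value")))
        if code != 0:
            labels = dict(_LABEL.findall(match.group("labels")))
            failing.append((labels, code))
    return failing


# Hypothetical scrape output; label names and values are illustrative only.
scrape = """\
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="node-a"} 48
"""
print(gpus_with_xid(scrape))
```

In practice the same condition would more likely be written as a Prometheus alerting rule; the sketch only shows what such a rule checks.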
GPU , GPU.
. , . , , .
. , cordon, . , , . , ( Pod Disruption Budget). , 7 ( SLA) .
GPU
, GPU DCGM. , GPU . — GPU .
preflight. taint preflight, . DaemonSet preflight- . taint, .
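The preflight flow above (nodes start with a preflight taint, a DaemonSet runs the tests, and the taint is removed on success) can be sketched in Python as a pure function over a Node-shaped dictionary. This is a hedged illustration, not the actual implementation: the taint key example.com/preflight and the check names are hypothetical, and a real test pod would apply the change through the Kubernetes API rather than edit a dictionary.

```python
import copy

# Hypothetical taint key; the real key is not given in the text.
PREFLIGHT_TAINT_KEY = "example.com/preflight"


def finish_preflight(node, check_results):
    """Return a copy of `node` with the preflight taint removed if all checks passed.

    `node` mirrors the shape of a Kubernetes Node object (spec.taints is a
    list of {"key", "value", "effect"} dicts). `check_results` maps a check
    name to True (passed) or False (failed). A node that fails any check is
    returned unchanged, so it stays unschedulable for normal pods.
    """
    updated = copy.deepcopy(node)
    if not check_results or not all(check_results.values()):
        return updated
    taints = updated.get("spec", {}).get("taints", [])
    updated.setdefault("spec", {})["taints"] = [
        t for t in taints if t.get("key") != PREFLIGHT_TAINT_KEY
    ]
    return updated


node = {
    "metadata": {"name": "gpu-node-1"},
    "spec": {
        "taints": [
            {"key": PREFLIGHT_TAINT_KEY, "effect": "NoSchedule"},
            {"key": "openai.com/team", "value": "teamname", "effect": "NoSchedule"},
        ]
    },
}
ready = finish_preflight(node, {"gpu_memory": True, "nvlink": True})
print([t["key"] for t in ready["spec"]["taints"]])
```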
, CronJob . , , , , .
, , . , Kubernetes . , Kubernetes.
Taint
team-resource-manager . ConfigMap, , , . , taint openai.com/team=teamname:NoSchedule.
team-resource-manager (admission webhook service), toleration . taint Kubernetes, toleration .
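To show what adding a toleration at admission time could look like, here is a minimal Python sketch that builds a JSON Patch for a pod so that it tolerates the openai.com/team=teamname:NoSchedule taint. It is not the code of team-resource-manager: the function name is hypothetical, and the lookup from submitter to team is assumed to happen elsewhere.

```python
def toleration_patch(pod, team):
    """Build a JSON Patch (list of operations) adding the team toleration.

    `pod` mirrors the shape of a Kubernetes Pod object. The returned patch is
    empty if the pod already carries the toleration.
    """
    toleration = {
        "key": "openai.com/team",
        "operator": "Equal",
        "value": team,
        "effect": "NoSchedule",
    }
    existing = pod.get("spec", {}).get("tolerations")
    if existing is None:
        # No tolerations field yet: create the list in one operation.
        return [{"op": "add", "path": "/spec/tolerations", "value": [toleration]}]
    if toleration in existing:
        return []
    # "-" appends to the end of an existing JSON array.
    return [{"op": "add", "path": "/spec/tolerations/-", "value": toleration}]


pod = {"metadata": {"name": "train-0"}, "spec": {"containers": []}}
print(toleration_patch(pod, "teamname"))
```

In a real mutating admission webhook, a patch like this would be serialized, base64-encoded, and returned in the AdmissionReview response with the patch type set to JSONPatch.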
CPU and GPU balloons
, , ( ) . 0, — . , . — , API, .
balloon- CPU GPU, ReplicaSet . , , . , , . ( Deployment DaemonSet, DaemonSet .)
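Note that we use pod anti-affinity so that these pods are spread evenly across the nodes. Earlier versions of the Kubernetes scheduler had an O(N²) performance problem with pod anti-affinity. This has been fixed as of Kubernetes 1.18.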
Gang scheduling —
StatefulSet, . , StatefulSet ( MPI , MPI ).
Kubernetes StatefulSet. , 100% , Kubernetes , , , , .
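We tried a few options that required a custom scheduler, but ran into edge cases that conflicted with how normal pods are scheduled. Kubernetes 1.18 introduced a plugin architecture for the core Kubernetes scheduler, which makes it much easier to add features like this natively. We recently settled on the Coscheduling plugin as a good way to solve this problem.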
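The all-or-nothing idea behind gang scheduling can be shown with a small Python sketch: either every pod of the job gets a placement, or none does. This is a simplified illustration and not the algorithm used by the Coscheduling plugin. The function name and the inputs (free GPUs per node, GPUs per pod) are assumptions made for the example.

```python
def place_gang(free_gpus, pods_needed, gpus_per_pod):
    """Place every pod of a gang, or none of them.

    `free_gpus` maps node name to the number of free GPUs. Returns a list of
    node names (one entry per pod) if the whole gang fits, otherwise None so
    that no pod is started and no capacity is held by a partial job.
    """
    if pods_needed == 0:
        return []
    placement = []
    for node, free in sorted(free_gpus.items()):
        slots = free // gpus_per_pod  # pods of this size that fit on the node
        take = min(slots, pods_needed - len(placement))
        placement.extend([node] * take)
        if len(placement) == pods_needed:
            return placement
    return None  # partial fit only: refuse to place anything


free = {"a": 8, "b": 8, "c": 4}
print(place_gang(free, 2, 8))
print(place_gang(free, 3, 8))
```

The second call returns None even though two of the three pods would fit, which is the property an MPI-style job needs: a partially started job would hold GPUs without making progress.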
Kubernetes. :
Prometheus — , WAL . «query processing would load too many samples» — . , Prometheus.
-. , , .
, Kubernetes . . , , , OpenAI Kubernetes.