Accidents as experience #3. How we saved our monitoring during an accident at OVH

In this article, I will describe how the recent accident at OVH affected our monitoring infrastructure, how we solved the problem, and what lessons we learned.





About our monitoring

, «» . :





1. Blackbox- . - endpoint' , . , health page JSON-, , . , - , / .





: HTTP/HTTPS. , , JSON ( status page- ). (, ).





( , OVH).





2. Kubernetes- , . Prometheus + Grafana + Alertmanager, . , (, , Kubernetes Deckhouse), , - (, ).





3. , Kubernetes. , (bare metal) , , . Okmeter ( - - ). OVH.





( , , , .)





, , , - Okmeter. ( Kubernetes-), blackbox- ( , - ).





? ? β€” ?





«» Dead man's switch (DMS). , «»:





  • OK, , « », (Prometheus, Okmeter ..) -.





  • , OK , .





  • , , OK , - ERROR . .





, (10 ): (ERROR) DMS.
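The dead man's switch idea can be illustrated with a minimal Python sketch. It is not the actual DMS implementation: the class name, the source names, and the 10-minute timeout are assumptions made for illustration. Each monitoring source periodically reports OK, and a source that has been silent for longer than the timeout is treated as ERROR.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DeadMansSwitch:
    """Hypothetical DMS: remembers the last OK heartbeat of each source."""

    timeout_seconds: float = 600.0  # assumed 10-minute silence window
    last_ok: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, source: str, now: Optional[float] = None) -> None:
        # Record the moment the source last reported OK.
        self.last_ok[source] = time.time() if now is None else now

    def status(self, source: str, now: Optional[float] = None) -> str:
        # A source that never reported, or has been silent for longer
        # than the timeout, is reported as ERROR.
        now = time.time() if now is None else now
        last = self.last_ok.get(source)
        if last is None or now - last > self.timeout_seconds:
            return "ERROR"
        return "OK"


if __name__ == "__main__":
    dms = DeadMansSwitch()
    dms.heartbeat("okmeter")
    print(dms.status("okmeter"))     # heartbeat was just recorded
    print(dms.status("prometheus"))  # no heartbeat has ever been seen
```

The point of the design is that silence itself is the alarm: the switch does not need to reach the monitoring system to learn that it is down.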





:





  • - DMS, 3:20 .





  • Okmeter , , . , - . , , (blackbox Kubernetes). .





  • ( 8:14) , , Okmeter , .





, Okmeter. - OVH:





  • SBG-2: completely destroyed;





  • SBG-1: partially damaged.





, - OVH . , , .





10 , , Okmeter - - .





, :





  1. ;





  2. , ;





  3. .





DevOps-, CTO . , , Okmeter.





, Okmeter . , ? :





Alert criticality matrix

3 . , , S1 ( ) S9 ( ). S1 β€” blackbox-, . , Okmeter ( . ). S2, (S3 ..).





S1 S2 Okmeter, , . . 





, Okmeter, . 





Okmeter

: S1-S2

, Okmeter? , - , Okmeter, - , 1 2020 .





:





  1. .





    1. ( ).





    2. .





    3. (, ).





  2. .





  3. .





: S3

S3 : , , .





. , ZooKeeper.





Bash

Okmeter, , . : Ansible- , . 10 . - Bash.





:





  1. shell-, bare metal-. ( ) , Okmeter: , severity, .. , . , .





    , API β€” flint (flant integration).





  2. Ansible- , , . Ansible- , , , .





  3. , β€” , .





S1-S2 - S3. ( ) .





3000 .





:





, . :





  , Okmeter , «» , , . 





, : , . : , (DRP) . - , , .





:





  1. , , . , OVH? ...





  2. «» Okmeter, : , .. ( ), «» , .





  3. , : . , .





P.S.

:





  • «„- Okmeter “. „“»;





  • «Accidents as experience #2. Elasticsearch Kubernetes»;





  • «Accidents as experience #1. ClickHouse, ».







