"The goal of SRE is a reliable system." Overview of basic SRE metrics

Site Reliability Engineering (SRE) is one way of implementing DevOps. The approach originated at Google and became popular among product IT companies after the book of the same name was published in 2016.



In this article, we describe how the SRE approach relates to DevOps, what problems an SRE engineer solves, and which metrics they are responsible for.





From DevOps to SRE



In many IT companies, development and operations are handled by separate teams with different goals. The development team's goal is to roll out new features. The operations team's goal is to keep old and new features working in production. Developers strive to ship as much code as possible; system administrators strive to keep the system reliable.



The teams' goals conflict. The DevOps methodology was created to resolve this conflict. Its principles include reducing silos, accepting failure as normal, and relying on automation.



, , DevOps . Β« DevOps?Β». , , .



2016 , Google Β«Site Reliability EngineeringΒ». DevOps. SRE-, IT-.



DevOps β€” . SRE β€” . DevOps β€” , SRE β€” , DevOps.



SRE-



SRE , DevOps .



, , SRE . , - . , SRE .



SRE β€” . , , β€” .



, SRE , , . - : Β« β€” Β». , . SRE . , , . , .



. , , . , .



SRE . , SRE : Β«OK, , , Β». , , , .



  • β€” , .
  • β€” , . , .


SRE , -, . SRE ( , ).



SRE , - .



, SRE . , -. β€” .



SLA, SLI, SLO



. β€” , .



SRE , . , (, . .) , .



- β€” Service-Level Objective (SLO). , .



SRE , . Β« , . , , SLOΒ», Google. β€” , , .



, β€” Service Level Indicator (SLI). , , , β€” .



SLO SLI β€” , . Service Level Agreement (SLA). .



SLA: 99,95% ; 99 ; 85% 1,5 .



100%



SRE , . , .



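Availability is commonly expressed in "nines":

  • two nines: 99%,
  • three nines: 99.9%,
  • four nines: 99.99%,
  • five nines: 99.999%.

Five nines allows about 5 minutes of downtime per year; two nines allows about 3.5 days.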
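The relationship between an availability percentage and the downtime it permits per year can be checked with a few lines of Python. This is a minimal sketch that assumes a 365-day year; the function name `allowed_downtime_minutes` is illustrative and does not come from the article.

```python
# Allowed downtime per year for a given availability target.
# Assumption: a 365-day year. The helper name is illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that the target permits."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        days = minutes / (24 * 60)
        print(f"{target}% -> {minutes:.1f} min/year ({days:.2f} days)")
```

Under this assumption, five nines works out to roughly 5.3 minutes per year and two nines to roughly 3.65 days. Each additional nine cuts the allowed downtime by a factor of ten.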





, 100%, . - ROI β€” .



, . ! 47 . . .



. 99,99% 99,999%, 99%. , 10 8 . , .



MTBF and MTTR



, SRE : MTBF MTTR.



MTBF (Mean Time Between Failures): the average time between failures.



MTBF . SRE Β«!Β». , SRE - , , .



MTTR (Mean Time To Recovery): the average time to restore service after a failure.
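As a rough illustration, both metrics can be derived from a list of incidents within an observation window. The sketch below uses one common convention (MTBF as total uptime divided by the number of failures, MTTR as total downtime divided by the number of failures) together with the standard steady-state relation availability = MTBF / (MTBF + MTTR). The `Incident` type, the function names, and the sample data are hypothetical, not taken from the article.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Incident:
    start_minute: float  # failure start, in minutes since the window began
    end_minute: float    # moment service was restored


def mtbf_mttr(incidents: List[Incident], window_minutes: float) -> Tuple[float, float]:
    """Return (MTBF, MTTR) in minutes for one observation window."""
    if not incidents:
        raise ValueError("at least one incident is needed")
    downtime = sum(i.end_minute - i.start_minute for i in incidents)
    uptime = window_minutes - downtime
    count = len(incidents)
    return uptime / count, downtime / count


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability implied by MTBF and MTTR."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # Hypothetical 30-day window with two outages of 10 and 20 minutes.
    window = 30 * 24 * 60
    incidents = [Incident(1000, 1010), Incident(20000, 20020)]
    mtbf, mttr = mtbf_mttr(incidents, window)
    print(f"MTBF={mtbf:.0f} min, MTTR={mttr:.0f} min, "
          f"availability={availability(mtbf, mttr):.5f}")
```

The relation shows two ways to raise availability: make failures rarer (raise MTBF) or make recovery faster (lower MTTR).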



MTTR SLO. SRE . , SLO 99,99% , , 13 3 . 13 , «» , SLO .



13 β€” , . 7-8 , β€” . MTTR , .



SRE , MTTR, SLO , , .



, . , , :



, SRE. , SRE , , , , . , , .





, 100% , , , β€” , - «» .



SLO. SLO (Error budget).





SRE.



43 , 40 , : SLO, . , -.
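The figures 43 and 40 above are consistent with a 99.9% SLO measured over a 30-day month: that target allows about 43 minutes of downtime, and 40 of them would already be spent. The sketch below shows that arithmetic under this assumption. The function names and the 30-day and 90-day windows are illustrative choices, not something the article specifies.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Total downtime, in minutes, that the SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def remaining_budget_minutes(slo_percent: float, window_days: int,
                             downtime_spent_minutes: float) -> float:
    """Budget left after subtracting downtime already incurred.

    A negative result means the SLO has been violated for this window.
    """
    return error_budget_minutes(slo_percent, window_days) - downtime_spent_minutes


if __name__ == "__main__":
    # Assumed scenario: 99.9% SLO, 30-day month, 40 minutes already spent.
    budget = error_budget_minutes(99.9, 30)
    left = remaining_budget_minutes(99.9, 30, 40)
    print(f"budget={budget:.1f} min, remaining={left:.1f} min")

    # Assumed scenario: 99.99% SLO over a 90-day quarter.
    print(f"quarterly budget={error_budget_minutes(99.99, 90):.2f} min")
```

With these assumptions the monthly budget is about 43.2 minutes, leaving about 3.2 minutes. The quarterly budget for 99.99% is about 13 minutes, which matches the 13-minute figure mentioned earlier for that SLO.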



, . SRE Error budget :



  • , ,
  • ,
  • ,
  • .


, Error budget . .





«» : SRE, . , , . SRE .



β€” SRE . Netflix Chaos Engineering.



Netflix Chaos Engineering: Chaos Monkey CI/CD ; Chaos Gorilla AWS. , SRE , β€” , . , .



Chaos Engineering :



  1. , , ( ).
  2. , . β€” : , .
  3. , , , CI/CD- .


Post mortem



SRE blameless postmortem, , .



, 13 , 15. ? SRE, ; -, ; , , SLA . , , - . .





, , SLO. SRE β€” . , , .



:



  • β€” (Β« !Β»);
  • β€” (Β« - , , Β»);
  • β€” , (Β«, , , Β»).


SRE , , , , . .



(Observability). , , , .



: , , . : , - Kubernetes, , .



Observability MTTR. Observability , , , MTTR.



SRE



SRE , , , . SRE , . , . , .



SRE , , . . β€” (, ). , , , .



SRE : SLO, SLI, SLA . , SLA SLO. . , , .



, , β€” , . Error budget, , .





SRE. , .



SRE books from Google:

Site Reliability Engineering

The Site Reliability Workbook

Building Secure & Reliable Systems



:

SRE

SLA, SLI, SLO

Chaos Engineering Chaos Community Netflix

200 SRE



SRE ():

Keys to SRE

SRE

SRE

SRE





, β€” . , - SRE . 11–13 2020.



SLO, SLI, SLA, , , .



SLO: , , , DoS-. , Error budget, , .





