Site Reliability Engineering (SRE) is one way of implementing DevOps. The approach originated at Google and became popular among product IT companies after the book of the same name was published in 2016.
This article describes how SRE relates to DevOps, what problems an SRE engineer solves, and which metrics they are responsible for.
From DevOps to SRE
In many IT companies, development and operations are handled by separate teams with different goals. The development team aims to roll out new features; the operations team aims to keep both old and new features working in production. Developers try to ship as much code as possible, while system administrators try to keep the system reliable.
These goals conflict. The DevOps methodology was created to resolve that conflict. Its principles include reducing organizational silos, accepting failure as normal, and relying on automation, among others.
, , DevOps . Β« DevOps?Β». , , .
2016 , Google Β«Site Reliability EngineeringΒ». DevOps. SRE-, IT-.
DevOps β . SRE β . DevOps β , SRE β , DevOps.
SRE-
SRE , DevOps .
, , SRE . , - . , SRE .
SRE β . , , β .
, SRE , , . - : Β« β Β». , . SRE . , , . , .
. , , . , .
SRE . , SRE : Β«OK, , , Β». , , , .
- β , .
- β , . , .
SRE , -, . SRE ( , ).
SRE , - .
, SRE . , -. β .
: SLA, SLI, SLO
. β , .
SRE , . , (, . .) , .
- β Service-Level Objective (SLO). , .
SRE , . Β« , . , , SLOΒ», Google. β , , .
, β Service Level Indicator (SLI). , , , β .
SLO SLI β , . Service Level Agreement (SLA). .
SLA: 99,95% ; 99 ; 85% 1,5 .
100%
SRE , . , .
, «»:
- two nines: 99%,
- three nines: 99.9%,
- four nines: 99.99%,
- five nines: 99.999%.
Five nines means roughly 5 minutes of downtime per year; two nines means roughly 3.5 days.
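The downtime that goes with each availability level follows from simple arithmetic. A minimal sketch, assuming a 365-day year and treating availability as the fraction of time the service is up; the function name `downtime_per_year` is illustrative and does not come from the article:

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the downtime allowed per year at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(minutes=MINUTES_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    # Print the yearly downtime allowance for two to five nines.
    for level in (99.0, 99.9, 99.99, 99.999):
        print(f"{level}% -> {downtime_per_year(level)}")
```

Under these assumptions two nines allow about 3.65 days of downtime a year and five nines about 5.3 minutes, close to the rounded figures quoted above.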
, 100%, . - ROI β .
, . ! 47 . . .
. 99,99% 99,999%, 99%. , 10 8 . , .
β MTBF MTTR
, SRE : MTBF MTTR.
MTBF (Mean Time Between Failures) is the average time between failures.
MTBF . SRE Β«!Β». , SRE - , , .
MTTR (Mean Time To Recovery) is the average time it takes to restore service after a failure.
MTTR SLO. SRE . , SLO 99,99% , , 13 3 . 13 , «» , SLO .
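The 13-minute figure above is consistent with an SLO of 99.99% measured over a three-month window. A minimal sketch of that arithmetic, assuming a 90-day quarter; the helper names `allowed_downtime_minutes` and `remaining_budget_minutes` are hypothetical and used only for illustration:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Total downtime (in minutes) that the SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def remaining_budget_minutes(slo_percent: float, window_days: int,
                             incident_minutes: list[float]) -> float:
    """Downtime still available after the listed incidents.

    A negative result means the SLO for the window has been breached.
    """
    allowed = allowed_downtime_minutes(slo_percent, window_days)
    return allowed - sum(incident_minutes)


if __name__ == "__main__":
    # Assumption: a quarter is 90 days long.
    budget = allowed_downtime_minutes(99.99, 90)
    print(f"Allowed downtime per quarter: {budget:.2f} min")
    left = remaining_budget_minutes(99.99, 90, [5, 10])
    print(f"Left after 5- and 10-minute incidents: {left:.2f} min")
```

Under these assumptions, a single incident that takes longer than about 13 minutes to resolve would use up the whole quarterly allowance, which is why the recovery time matters so much for meeting the SLO.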
13 β , . 7-8 , β . MTTR , .
SRE , MTTR, SLO , , .
, . , , :
, SRE. , SRE , , , , . , , .
, 100% , , , β , - «» .
SLO. SLO (Error budget).
SRE.
43 , 40 , : SLO, . , -.
, . SRE Error budget :
- , ,
- ,
- ,
- .
, Error budget . .
«» : SRE, . , , . SRE .
β SRE . Netflix Chaos Engineering.
Netflix Chaos Engineering: Chaos Monkey CI/CD ; Chaos Gorilla AWS. , SRE , β , . , .
Chaos Engineering :
- , , ( ).
- , . β : , .
- , , , CI/CD- .
Post mortem
SRE blameless postmortem, , .
, 13 , 15. ? SRE, ; -, ; , , SLA . , , - . .
, , SLO. SRE β . , , .
:
- β (Β« !Β»);
- β (Β« - , , Β»);
- β , (Β«, , , Β»).
SRE , , , , . .
(Observability). , , , .
: , , . : , - Kubernetes, , .
Observability MTTR. Observability , , , MTTR.
SRE
SRE , , , . SRE , . , . , .
SRE , , . . β (, ). , , , .
SRE : SLO, SLI, SLA . , SLA SLO. . , , .
, , β , . Error budget, , .
SRE. , .
SRE books from Google:
Site Reliability Engineering
The Site Reliability Workbook
Building Secure & Reliable Systems
:
SRE
SLA, SLI, SLO
Chaos Engineering Chaos Community Netflix
200 SRE
SRE ():
Keys to SRE
SRE
SRE
SRE
, β . , - SRE . 11β13 2020.
SLO, SLI, SLA, , , .
SLO: , , , DoS-. , Error budget, , .