At Dropbox, we see incident management as a central part of our reliability efforts. Although we also use proactive techniques such as chaos engineering, how we respond to incidents has a significant impact on our users' experience. During a potential site outage or product issue, every minute counts.
Key components of our incident management process have been in place for several years, but we still see room for continuous improvement in this area. The changes we have made over time include technological, organizational, and procedural improvements.
In this post, we'll detail a few lessons Dropbox has learned from its incident management experience. You probably won't find all of these points in a standard guide to incident management, and they should not be read as universal fixes for every company. (How useful these lessons are depends on your technology stack, the size of your organization, and other factors.) Instead, we hope this article serves as an example of how you can systematically analyze your company's incident response and improve it to meet the needs of your users.
1. Prerequisites
Dropbox SEV' ( SEVerity — ), , SaaS-. ( , , ).
SEV' , , , . - , Dropbox. — - , SaaS- — , SEV' . .
SEV', -. — , , , . , , , Dropbox . , , , "", . 99.9% , SLA, 43 . — 99.95% ( 21 ).
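The downtime figures that accompany the 99.9% and 99.95% availability targets follow from simple arithmetic. Here is a minimal sketch of that calculation, assuming a 30-day month (an assumption of this example, not a figure from the text). The function name `downtime_budget_minutes` is hypothetical and used only for illustration.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed in a period at a given availability target.

    availability_pct: target availability in percent, e.g. 99.9
    days: length of the period in days (30 is an assumed month length)
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.95):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability: about {budget:.1f} minutes per 30-day month")
```

Under this assumption the budgets come out to roughly 43.2 and 21.6 minutes, in line with the 43 and 21 quoted above once rounded down. A longer month or a different measurement window would shift these numbers slightly.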
, . — -, , . , , SEV' . .
2. SEV
SEV Dropbox , SEV, , .
SEV :
SEV, ; — , , .
SEV 0-3, . 0 — .
IMOC (Incident Manager On Call — ), , , SEV- .
TLOC (Tech Lead On Call — ), .
SEV, , BMOC (Business Manager On Call — ), - , . , , , .
Dropbox : DropSEV. SEV, . Slack , email- , Jira , , . Slack email- SEV. — .

DropSEV , .

: , ( )
: , /
: ,
21 ? - , , . , .
3.
,
Dropbox , . , . — Vortex, . Vortex , , 10 .
— . , .

2018 , . , " 21 ", . , Vortex PagerDuty. Vortex , .
, , : , , ?
Vortex — , , . , , .
, , . RPC- Courier, . , Courier , . Courier , Dropbox (Go, Python, Rust, C++, Java).
, , , - . . — , , - . , , .
PagerDuty , " SEV". , , , , , SEV .
, , . , DropSEV ( , ) , SEV . , "" SEV DropSEV, , SEV.

, , " SEV?" - . , - Magic Pocket, , Get Put . , :
?
?
? ( , , , )
? SEV?
, SEV . , — . SEV , IMOC BMOC, .
SEV, . - , DropSEV , SEV, . , , , , . , SEV .
?
4.
, /
. " " , . PagerDuty .

.
, , 21 : PagerDuty ? Dropbox :
?
? ?
PagerDuty? , SMS, , ?
SEV-, , PagerDuty . , PagerDuty API, , .
, - . , , . , ( ). , .
( 2021), PagerDuty On-Call Readiness Report( ), . , Dropbox, , .
, dropbox.com, , . , , . , .
, , , . :
RPC
(QPS)
(, , )
, , , , , . , , . , . ( , ).

— . ? — , . Grafana, ( , DRT, ) , . .
Dropbox — . Dropbox . . python-.

:
, SEV-, .
, , ETA .
.
, -, .
.
2020 , , Slack- .
, . : SEV- . , , SEV, , — . , SEV. , , .
, , , " ", "" . , , . () SEV, . , , , .
, , , . , " ". , , , Slack.
: , , .
5.
, ,
, " " . , ? , , , ( )
Dropbox 99.9% SLA, . , SEV. , , : " 20 ?"
, . - , , . , , , , . :
- . , , DRT, , .
, . , , .
, , , , , . , , .
, , , , . " 20 " , , .
. , , , . , , DRT . , ?
99.9
SLA , , . ?
, , :
SLA, ,
.
SLA . , , .

. . , Dropbox . , .
( ) . — ; , . — , . Dropbox , , .
" SEV ?" , , . , .
SEV Dropbox. , , , SEV , . , . SEV , ; SEV, dropbox.com . , .
, , - ( , - ). , ~20 -, . , , , SEV-. , — . , , , , .
, . "" , , :
, ?
(, , , ) ?
, , ? — , , — ?
?
6.
Dropbox — . SEV , , , . , ; SEV , , .
. , .
Dropbox "" . , . , , , , . , , SEV .
SEV, , . . — , , , — , .