Pavel Trukhanov, "Monitoring Postgres by USE and RED"
There are two performance-monitoring methodologies: USE (Utilization, Saturation, Errors) by Brendan Gregg and RED (Requests, Errors, Durations) by Tom Wilkie. In this talk, I describe how they have guided us, and continue to guide us, as we implement Postgres monitoring at okmeter.io.
. Okmeter. , , Postgres , . , , , , , USE RED , Postgres.
, . , .
, , , , . - . performance - , , , , .
, , , Postgres β , , . , - , , , . - , . - , , .
USE. , . , , . , , saturation , .
, ? , pg_stat_activity . ? , . , - , . , , . - , - , , .
? «», CPU Usage, , iostat β . , . , , . , , . .
, , , Postgres. , Postgres . . , , . Data Egret. , .
- ?
. . , Postgres , connection connection .
. . , . β - . , , . , , , .
: Β« ?Β». , SpinLock - , , , . CPU usage , .
β . , , , , - , , , , - .
? , capacity. 100 %, , , . .
. , . , . . , . , . . .
. - , capacity. , capacity ? . . saturation, . , . , .
Postgres.
pg_stat_activity. - . , . . : 300 connection . , - . , , - .
, . , , . - , , capacity , . . , Postgres max connections.
, state connection, , , idle, . . connection , . - idle in transaction. , , . active, - .
, , . , . ? . , - , . β pool connections, β , , , , . β , . - : locks - .
, , , .
- , , active 5 % connections. 95 % . . , .
, . , connections .
?
, . ? 100 connections, max connections , settingβ, . , . , 100 %. , β . - . , . - , - .
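As a rough illustration of connection utilization, here is a minimal Python sketch. It assumes you have already fetched the `state` column of `pg_stat_activity` (for example with `SELECT state FROM pg_stat_activity`) and the `max_connections` setting. The function name `connection_utilization` and the sample numbers are hypothetical and are not taken from Okmeter.

```python
from collections import Counter


def connection_utilization(states, max_connections):
    """Return (total_utilization, per_state_utilization).

    states: iterable of pg_stat_activity.state values, one per connection.
            None (SQL NULL) is grouped under "unknown".
    max_connections: the server's max_connections setting.
    Both results are fractions of max_connections, not percentages.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    counts = Counter(s if s is not None else "unknown" for s in states)
    total = sum(counts.values())
    per_state = {state: n / max_connections for state, n in counts.items()}
    return total / max_connections, per_state


if __name__ == "__main__":
    # Hypothetical sample: 100 connections against max_connections = 100.
    sample = ["active"] * 5 + ["idle"] * 90 + ["idle in transaction"] * 5
    total, per_state = connection_utilization(sample, 100)
    print(f"total utilization: {total:.0%}")
    for state, share in sorted(per_state.items()):
        print(f"  {state}: {share:.0%}")
```

Splitting the total by state is the point of the sketch: a pool can be fully occupied while only a small share of its connections is actually executing queries.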
saturation, util ? Saturation , utilization 100 % . , , , utilization 100? , .
, , CPU usage , load avarage . , 100 %, saturation . Load avarage β saturation, - . runnable , . . , , , .
, CPU usage . ? . load avarage. Load avarage , . , - . . response .
. - β idle in transaction.
. . - , . saturation .
idle. max connections, . , . -.
, selectβ pg_stat_activity connections, waiting try. . . active state, - , -. waiting.
, . utilization connection pool 100 %.
, .
waiting ? . , - saturation , . . stack Postgres, , - - . .
β locks. , lock. , locks - , , connections. , locks.
. . . - lock , .
lock β space , β . , , lock . , , connections, locks, β saturation lock.
Postgres , connection . TCP-. TCP-. Post master . , , , Β«resetΒ». time wait .
? , connections .
connections .
, connection pool . , , , , . ? - . ? -, . connections 5 000. Postgres . ? - connections. , , .
TCP . time wait, , - Postgres - , .
, connect? postmaster , connections backlog list . , search, backlog 100. . 100 %. β , - β saturation. β .
, backlog , reset.
, . Postgres , TCP «».
RED, USE? DBA, , , , - . , - . - , . . , Postgres .
RED, , , , :
- ,
- ,
- .
Postgres. , . , - . . - , .
rollbacks, , 6 , , , , , search , . . , - .
, RED . , . ? , . , , . , .
queries . - - . 8 , .
, - . . select , .
. , - , . . - . . - . , . . . : Β« , Β», , .
, . pg_stat_statements , . . , , . . , . β . . , , - , , . .
slow log. Slow log β durations . , . . , , - , .
, . , - , .
. , - . . , , . β , .
. - .
, , - . - , .
, . , . , , . .
, . USE, RED, ad-hoc , ad-hoc tools - , , , , .
.
Postgres, USE, RED ? . . .
Okmeter, . , - , . , , , , . , - , USE, RED. , . , , , saturation . , , , saturation . , . , - . , , , . , , .
! ! , 4- .
4 β USE RED. , USE, durations. errors . RED , requests durations. - , USE RED . . . - . , , .
β instance.
, ? β . β , requests . .
, !
! . β , - , , . , , . .? . . ?
, . , . , . . , , , , , USE . , , , , , selects, , , requests . , requests .
, , , , ?
. , . , , . , . . , . . . , . - , , . . , queries . - . , .
, , Postgres . , . , .
! , instance Postgres - . , ? , BD .
. β . , , , , , . , , - . , . .
The second way we deal with it is optimization: we optimize what we do. In effect, Okmeter queries these views regularly but fairly rarely, once a minute.
That is, it's not real time?
What counts as real time is a difficult question. Let's discuss that separately. But the load is bounded by how many requests you make, and these requests are not heavy at all. There are several dozen of them. Even if you ran them closer to real time than once a minute, the load would still be quite limited. Here is an example of how many queries are sent to the database: several thousand. So even if those several dozen are polled once a second, it is still a small fraction.
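The arithmetic behind that answer can be sketched in Python. The concrete numbers are assumptions chosen only to match "several dozen" and "several thousand": 30 monitoring queries per poll and 3000 application queries per second. The per-second unit for the application load is also an assumption, and `monitoring_share` is a hypothetical helper.

```python
def monitoring_share(monitor_queries_per_poll, poll_interval_s, app_queries_per_s):
    """Fraction of all queries per second that come from monitoring.

    This compares query counts only; it says nothing about how
    expensive each query is.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    monitor_qps = monitor_queries_per_poll / poll_interval_s
    return monitor_qps / (app_queries_per_s + monitor_qps)


if __name__ == "__main__":
    # Assumed figures, not measurements from the talk.
    once_a_minute = monitoring_share(30, 60, 3000)
    once_a_second = monitoring_share(30, 1, 3000)
    print(f"polling once a minute: {once_a_minute:.4%} of queries")
    print(f"polling once a second: {once_a_second:.4%} of queries")
```

Under these assumed figures, monitoring stays below one percent of the query count even when polling every second.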
Got it, thanks!