Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky



Patroni's main goal is to provide high availability for PostgreSQL. But Patroni is a template rather than a ready-made tool, which is, broadly, what its documentation says. In a test lab, Patroni looks like a great tool: it is easy to set up, and it easily handles our attempts to break the cluster. In a production environment, however, things do not always go as smoothly and elegantly as they do in the lab.










. . -. 2014- Data Egret. Postgres. Postgres, Postgres, , .



2018- Patroni. - . - , , best practices. .



Postgres Linux. , . , , , Kubernetes. , . . postgres’ , , . . , . Go. software engineer, Go. .





  • , , Postgres HA (High Availability) . HA, - , , .
  • Patroni – , HA - . , , , - , , Patroni . , .
  • . .
  • , , – .




  • , Patroni, , , , , . , , .
  • . , .
  • , Patroni PostgreSQL. , , , , , .




disclaimer , .



, , 6-7-8 . best practices. . - , .



, . , - , - , , - . .





Patroni?



  • HA. . , . Patroni – , , . . , .
  • , , init- Postgres. Postgres, , , .
  • , , , , - . Patroni state . . Etcd, Consul, ZooKeeper, kubernetes’ Etcd, . . - .
  • Patroni – , , . Repmgr, . Repmgr switchover, , . Patroni .
  • . , , , . . , .




– , Patroni – , .





Patroni, . Postgres, Patroni Patroni, DCS, state. - . , ?



:



  • Postgres. , - .
  • Patroni.
  • DCS, state.
  • .


.





, , . , , … , - .





. , DCS. . , . . .



, , , .





, . . . , .



, . . - , . .



, . . . : , . . . .





, , , , , .



, Patroni. , , . .





, , . . , . . . , DCS, . . - . , . «demoted self» .





, , , .



Patroni, , , -, . . Patroni DCS. Consul agent, 8500.



, Patroni . Consul-. , Consul. .





- , , Patroni . . Pgdb-2 . . . , - , , . . , .





, , Patroni . . . , .



- , Consul- , . , : , Consul.





, , Consul. Patroni c Consul', . . Postgres, Consul. , , .



ttl, loop_wait, retry_timeout, . . . , . . .
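The three timing parameters named above are related to each other. As a minimal sketch: the Patroni documentation asks that `loop_wait + 2 * retry_timeout <= ttl` hold whenever these values are changed, and the helper below checks only that inequality. The function name `timing_is_consistent` is made up for illustration, and the first sample call uses what I understand to be Patroni's documented defaults (ttl 30, loop_wait 10, retry_timeout 10).

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Return True if the values satisfy loop_wait + 2 * retry_timeout <= ttl.

    All three arguments are in seconds. This is a hypothetical helper,
    not part of Patroni itself.
    """
    return loop_wait + 2 * retry_timeout <= ttl


if __name__ == "__main__":
    # Assumed defaults: 10 + 2 * 10 = 30, which fits into ttl = 30.
    print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))
    # Raising retry_timeout without raising ttl breaks the rule.
    print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=15))
```

A check like this could run before a changed configuration is applied, so that an inconsistent combination is caught early.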





, , . DCS , .





. , DCS.





, , . Patroni , DCS, .



, Patroni , . pg_rewind, , . Patroni , .





, , . . , , . Patroni . . , , , , - . , , .



. . , , , - , Patroni . . , .





, , , . DCS Consul, Consul, .



Consul, , Consul- Consul-.





Consul-, , - . Consul- . .



, , , , , deadline, RPC failed, . . - Consul- .





– . , , . , . , . - .





:



  • , , -, , Consul-, . . . Consul- . .
  • – raft_multiplier. Consul-. 5. staging . Consul . , Consul-. production .
  • , , Consul . «nice», . Consul- nice, .. , Consul . .
  • – Consul. , Etcd. , Etcd Consul. , , , Consul , . . . Patroni Consul- . . - , Patroni Consul-. . Etcd . Patroni Etcd- . , Etcd, Etcd , , , Consul. , . Consul .
  • – . , . Patroni , - .
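The `raft_multiplier` item in the list above can be illustrated with a small configuration sketch. Consul exposes this setting as `performance.raft_multiplier`; its default is 5, and Consul's own documentation recommends 1 for production servers. The helper below only renders a JSON fragment. The function name `consul_performance_fragment` is hypothetical, and where the fragment is written is left to the reader.

```python
import json


def consul_performance_fragment(raft_multiplier: int) -> str:
    """Render a Consul JSON config fragment that sets performance.raft_multiplier.

    Hypothetical helper for illustration; Consul accepts values from 1 to 10.
    """
    if not 1 <= raft_multiplier <= 10:
        raise ValueError("raft_multiplier must be between 1 and 10")
    config = {"performance": {"raft_multiplier": raft_multiplier}}
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Lower values make Raft timing tighter, so leader failure is detected sooner.
    print(consul_performance_fragment(1))
```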




, , Patroni, .





. . , , . , , , .





, . . , . , .





. .





, . Patroni , pg_rewind. , .



. , Postgres . .





, . , . , pg_rewind checkpoint. . , , .





timestamps, . 150 , . . 369 checkpoint, WAL-. 517 150 rewind . . . 150 , .





?



. , . . , , WAL-, . . - . , , .



, , , WAL-. , wal_keep_segments. 8 . 1 000 , . 16 wal_keep_segments. . . 16 .
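To make the `wal_keep_segments` figures above easier to reason about, here is a small arithmetic sketch. It assumes the default PostgreSQL WAL segment size of 16 MiB, and the function name `retained_wal_bytes` is made up for illustration. Note that PostgreSQL 13 replaced `wal_keep_segments` with `wal_keep_size`.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # assumed default WAL segment size (16 MiB)


def retained_wal_bytes(wal_keep_segments: int,
                       segment_bytes: int = WAL_SEGMENT_BYTES) -> int:
    """Return the minimum amount of WAL, in bytes, kept for the given setting."""
    return wal_keep_segments * segment_bytes


if __name__ == "__main__":
    for segments in (8, 1000):
        gib = retained_wal_bytes(segments) / 1024 ** 3
        print(f"wal_keep_segments={segments}: {gib:.2f} GiB of WAL retained")
```

With these assumptions, 8 segments come to 128 MiB and 1,000 segments to roughly 16 GB, which is the order of magnitude to budget on disk for every node.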



– maintenance . , . . , , , -. , . wal_keep_segments, , . , , , . . .







production-. .



. – , , . , .



, - , , . , .





, pg_rewind . , , .





, , , , . . , , .



. Patroni. Patroni. Postgres. Postgres' , Patroni pg_rewind. , , . Patroni , . . . . 3 , 3 – .





. , . , rewind. . rewind, - .





, . – . , . . , .



pg_wal_lsn_diff . 17 . . - 17 – , - . .
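For readers unfamiliar with `pg_wal_lsn_diff`: it returns the distance in bytes between two WAL positions. The sketch below reproduces that arithmetic in plain Python, assuming the usual textual LSN form of two hexadecimal numbers separated by a slash. The function names are hypothetical, and the sample positions are invented.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after it the lower 32.
    return (int(high, 16) << 32) + int(low, 16)


def wal_lsn_diff(lsn1: str, lsn2: str) -> int:
    """Return lsn1 - lsn2 in bytes, mirroring what pg_wal_lsn_diff computes."""
    return lsn_to_int(lsn1) - lsn_to_int(lsn2)


if __name__ == "__main__":
    # Invented positions that are exactly one 16 MiB segment apart.
    print(wal_lsn_diff("0/2000000", "0/1000000"))
```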





?



-, – Patroni ? , , , . , , , . – standalone- , .



, , .



, «maximum_lag_on_failover». , , 1 .



? 1 , . , Patroni , . , . , .
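The effect of `maximum_lag_on_failover` can be shown with a simplified check. This is a sketch of the idea, not Patroni's actual code: a replica is considered for promotion only if it is no further behind the last leader position known in the DCS than the configured number of bytes. The function name is hypothetical, the positions are plain byte offsets, and 1048576 (1 MiB) is what I understand to be Patroni's documented default.

```python
DEFAULT_MAXIMUM_LAG_ON_FAILOVER = 1048576  # bytes; assumed Patroni default (1 MiB)


def is_failover_candidate(leader_last_lsn: int,
                          replica_lsn: int,
                          maximum_lag: int = DEFAULT_MAXIMUM_LAG_ON_FAILOVER) -> bool:
    """Return True if the replica's lag, in bytes, does not exceed maximum_lag.

    leader_last_lsn is the last leader position recorded in the DCS, which
    may already be stale by the time a failover is decided.
    """
    lag = max(leader_last_lsn - replica_lsn, 0)
    return lag <= maximum_lag


if __name__ == "__main__":
    print(is_failover_candidate(10_000_000, 9_500_000))  # about 0.5 MB behind
    print(is_failover_candidate(10_000_000, 8_000_000))  # about 2 MB behind
```

Because the leader position is only refreshed in the DCS periodically, the real lag at the moment of failover can be larger than what this comparison sees.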



, Patroni DCS . -, 30 ttl .



, , DCS , , . . realtime. . .



. , . . . , Patroni , , , .



. , - . . . - , , . . .





, maximum_lag_on_failover , . . . , , . .







, Postgres. , SSH. .



. - , , . , .





, , . . , .





postgres’ , , . , -- , . , . . . . .





dmesg ( ). , . software Raid. /proc/mdstat , . . . Raid 8 , . , , sde . , , . , Postgres.





Patroni , Patroni , . . .



– , watchdog? , , Patroni DCS . . . DCS Patroni , , .





, , , , -.





, , . . Patroni , Patroni , , . , . .





? , , . , .





, . . .





.





, , .





– immediate shutdown request. Postgres :



  • graceful, , .
  • fast, , .
  • immediate. immediate , , . RST (TCP-, ).


? Postgres , . . kill -9. , , . . Postgres. , .



«last» , , . , kill -9. kill -9, .. Postgres , kill -9, .





, , Patroni – 54 . timestamp, 54 .





. Patroni . , - . . . pgsql01 .





, . . . . , recovery.conf, Postgres . 10 , , .





immediate-shutdown . . recovery , . . . , .





- , .



, recovery.conf . , - .





Patroni , , . , . . . , .





, , , , , . , , recovery.conf, .





. , recovery.conf, , . - , , , . . - . .





30 , . . Patroni . , , . – Patroni, , - . recovery. , .





. , .





, . Patroni, Postgres, Patroni , .





, , , . . - . .





– , Postgres, checkpoint, - , recovery ? WAL .





Patroni, checkpoints , - , . . , . . .





, , .



? Patroni , . – 100 % , . , - , , .



, , . . , .





. . , .





Patroni, . , , , , , . .



. , , , , , Patroni, DCS.



, Patroni – . , . , .



Patroni – . , Postgres, , Patroni Postgres, . , .





? , ELK , , 6 2 . – Patroni , – Consul, Postgres . .



? -, , . . , , . : .



, , . . , .



, , ( ).



? :



  • Patroni.
  • Postgres, DCS , Patroni.
  • , .




Patroni? Patroni . , , . . Stolon, Repmgr, Pg_auto_failover, PAF. 4 . . Patroni .



: « Patroni?». , , Patroni . , , .



, Patroni, , , issues GitHub. . - , . . .



, . . , . .



Zalando , , . – , Zalando , , .



, Patroni – . , , . , Patroni. , Patroni , . , , , . Patroni, . , .



. , .







! , ?



. . . , , . – .



, , ?



, . , , . . . Patroni REST API, history. history , . . history, . , , . . , , .
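The answer above mentions the Patroni REST API and its history. As a hedged sketch, the code below parses a response of the kind the `/history` endpoint returns. It assumes that each entry is a list starting with timeline number, LSN, reason and timestamp, which is how I recall the documented format; later fields, if any, are ignored. The type name, function name and sample payload are all invented for illustration.

```python
import json
from typing import List, NamedTuple


class TimelineSwitch(NamedTuple):
    timeline: int
    lsn: int
    reason: str
    timestamp: str


def parse_history(payload: str) -> List[TimelineSwitch]:
    """Parse a JSON array of history entries into named tuples.

    Only the first four fields of each entry are used (assumed layout).
    """
    return [TimelineSwitch(*entry[:4]) for entry in json.loads(payload)]


if __name__ == "__main__":
    # Invented sample; a real payload would come from GET /history on a Patroni node.
    sample = '[[1, 25623960, "no recovery target specified", "2019-09-23T16:57:57+02:00"]]'
    for switch in parse_history(sample):
        print(switch.timeline, switch.lsn, switch.timestamp)
```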



!



! DCS - Postgres, ? best practices , - DCS , - . .? ? ?



, , - . . . DCS . , , . DCS , , , , .



. . , Patroni, , , - ?



, DCS-. (), . Patroni . . - , , , – , , Patroni . – patronictl pause, patronictl resume. , . maintenance DCS-, .



!



! , ?



, .



?



. « RPO RTO», . . . , . , , . , . Patroni, : , . , 100%- .



, ! Patroni zero level protection? . . standby? . . . Repmgr, . Patroni . Repmgr?



. , ( , — ). , , , , Patroni Standalone-, . .



, Repmgr . ? Patroni , Repmgr , . Repmgr daemon .



Repmgr – Postgres. Repmgr , .. Repmgr . Repmgr, … . DCS, Stolon, Patroni, .



, , , . DCS . , – , , . , - DCS- ? , : .



, DCS , . . , . , DCS , , . . - ? Patroni read only . Patroni . DCS , read only. DCS .



, DCS , ?



Yes. In many modern companies, service discovery is an integral part of the infrastructure. It is often in place before the infrastructure has any database at all. Roughly speaking, the infrastructure is launched and deployed in the data center, and service discovery is there from the start. If it is Consul, DNS can be built on top of it. If it is Etcd, it may be part of the Kubernetes cluster in which everything else will be deployed. It seems to me that service discovery is already an integral part of modern infrastructure, and people think about it much earlier than they think about databases.



Thanks!



