High-availability monitoring: the SberService experience

SberService is the largest federal-level service company, providing comprehensive maintenance for a wide range of IT and telecommunications equipment: workstations, office equipment, servers and telephony. The company is the only Zabbix Premium Partner in the CIS and employs the largest IT monitoring team in Russia, which develops its own technical solutions for the end-to-end deployment of monitoring systems in organizations with heavily loaded IT infrastructure. This is why SberService uses Zabbix as its main monitoring platform.





What is this article about?





As the title suggests, this article proposes a concept for organizing highly available monitoring. Zabbix Server serves as the test subject, Corosync and Pacemaker are used to build an Active-Active cluster, and everything runs on Linux. All of this software is open source, so the solution is available to everyone.





When Zabbix is used to monitor a heavily loaded IT infrastructure (a growing number of items and hosts, long retention of raw data, constantly growing user demands), many teams run into Zabbix server performance problems at startup or restart. Under high load (more than 60k NVPS, new values per second), an ordinary Zabbix server restart turns into a lengthy, if routine, procedure: it can take 15 to 20 minutes from service start until data appears in monitoring.
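To get a feel for the scale, the figures above can be turned into a rough estimate of how many values are held up during a single restart. The sketch below is only a back-of-the-envelope calculation: it assumes a constant rate of exactly 60,000 NVPS (the text gives this as a lower bound), and the function name is purely illustrative.

```python
def values_delayed(nvps: int, downtime_minutes: int) -> int:
    """Number of new values that arrive while the server is still starting."""
    return nvps * downtime_minutes * 60


# Assumed constant load of 60,000 new values per second (lower bound from the text).
NVPS = 60_000

low = values_delayed(NVPS, 15)   # 15-minute startup
high = values_delayed(NVPS, 20)  # 20-minute startup
print(f"{low:,} to {high:,} values")
```

Under these assumptions, one restart delays between 54 and 72 million values, which is why shortening the startup window matters at this load.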





Faced with this, and after analyzing the situation, the monitoring team concluded that Active-Active clustering would help. An additional goal was disaster recovery, achieved by placing the cluster in different data centers.





The task





Build a fault-tolerant, highly available two-node Zabbix server cluster that operates in Active-Active mode, ensures continuous monitoring and data recording, and rules out duplicate writes to the database.
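The "no duplicate writes" requirement can be illustrated with a toy model. This is a minimal sketch and not the authors' implementation: both nodes stay up, but only one of them holds write access to the database at any moment, and a failure moves that role to the surviving node. The class names, node names and the `fail` method are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    writable: bool = False  # True means this node may write to the database


@dataclass
class Cluster:
    nodes: list[Node]
    database: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._elect_writer()

    def _elect_writer(self) -> None:
        """Give write access to the first live node and revoke it everywhere else."""
        writer_chosen = False
        for node in self.nodes:
            node.writable = node.alive and not writer_chosen
            writer_chosen = writer_chosen or node.writable

    def fail(self, name: str) -> None:
        """Mark a node as failed and move the writer role if needed."""
        for node in self.nodes:
            if node.name == name:
                node.alive = False
        self._elect_writer()

    def submit(self, value: str) -> None:
        """Every live node receives the value, but only the writer stores it."""
        for node in self.nodes:
            if node.alive and node.writable:
                self.database.append(value)


cluster = Cluster([Node("node-a"), Node("node-b")])
cluster.submit("cpu.load=0.42")
cluster.fail("node-a")
cluster.submit("cpu.load=0.57")
print(cluster.database)
```

Each value ends up in the database exactly once, both before and after the simulated failure, because at most one node is writable at any time.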





There are many schemes and methods for protecting a service's network environment from failure. The team chose to adapt Zabbix, a powerful monitoring system built for high load, to the hardware and software clustering tools available on the IT market, and in this way SberService built a high-availability cluster. The choice fell on open-source solutions, specifically Pacemaker and Corosync.





Requirements





There are two important criteria to consider when creating a cluster:





  • ZabbixServer , . ., ;





  • ZabbixServer , .





Active-Passive Pacemaker Corosync        ( Corosync cman, ).





, Zabbix , « », ZabbixServer , , , – . , .





, Active-Active (LoadBalancer), .





:













High Available ZabbixServer Active-Active LoadBalancer :









« » (Cluster resource) . .





2 . stonith quorum — .

quorum 3 . , 2 , «» .





stonith , . , . . , , .

:





ocf::heartbeat:ZBX-IPaddress
ocf::heartbeat:ZBX-Instance
      
      



, . ZBX-IPaddress ip- (IPaddr2). ZBX-Instance — zabbix-server . Zabbix- , , Zabbix- Read/Write, ReadOnly, zabbix-server (, Active-Active).





. ZBX-IP-address IP- , ZBX-Instance zabbix- Read/Write, zabbix- ReadOnly, . . ZabbixProxy. , .





— master slave ZabbixServer- .





High Available ZabbixServer Active-Active LoadBalancer





:









« » . , « », , LoadBalancer - . , , « ».





Pacemaker :





ocf::heartbeat:ZBX-Cluster-Socket
ocf::heartbeat:ZBX-Instance
      
      



ZBX-Cluster-Socket — « » — LoadBalancer-.





ZBX-Instance zabbix-server- .





« », .





ZBX-Cluster-Socket Pacemaker (). « » — , , LoadBalancer. «» ZBX-Cluster-Socket ZBX-Instance (constraint) , «» . Corosync, ZBX-Cluster-Socket, 101 Master-node 201 Slave-node. LoadBalancer / : 101 — 201 — , , , .





Master-node Slave-node :





Master-node, 101 , LoadBalancer 201 Slave-node. Corosync, , Master-node , ZBX-Instance ZBX-Cluster-Socket Slave-node, «resource_movement», Pacemaker ZBX-Instance User_A User_B , .





?





: 2- Master-node ( User_A) Slave-node (User_B), Master-node .





, , . Master-node , . Slave-node . LoadBalancer , Master-node Slave-node ZabbixServer , LoadBalancer .





— ? - Read/Write, ReadOnly, :





  • Slave-node , Slave-node : User_A ReadOnly, User_B Read/Write.





  • Slave-node , Slave-node .





  • «» Master-node, LoadBalancer , Master-node .









( 2- ), . , , «How to».





In conclusion, it should be added that in the modern world of technology, almost nothing is impossible. All you need is desire and resources.







