In the previous parts, we described how we designed and rolled out a new data center monitoring system. As a result, we have a powerful mechanism for tracking and keeping statistics on all data center parameters that affect the availability of its resources and its uptime figures.
The next question in the system's development was tuning: how do we make the new system convenient to work with and as informative as possible?
The problem is that the system lets you enable a great many alarm notifications and signals. With such settings, the staff are forced to respond to them constantly, working through the corresponding scenarios each time.
The opposite extreme is to configure too few notifications, which creates a risk that the duty staff will miss a truly important event.
In this part, we will share our practical experience in setting up our data center monitoring system.
A bit of theory
“Variables collected by the SCADA system are divided into telesignaling and telemetry,” I was once taught at the institute. In fact, nothing has changed: telesignaling is the state of a device, for example "no alarm", "alarm", "open", "closed", and so on.
Telemetry, as you might guess, is the numeric value of some parameter, for example "220 Volts" or "10 Amperes".
The state or value, set by the user, at which a message (alarm) appears on the screen is called the “setpoint”. You can set a delay before the message appears, so that the alarm is shown only after X seconds (provided the alarm condition has not cleared earlier). You can also "freeze" the message on the screen, so that it stays visible for another X seconds after the alarm condition has already cleared.
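The delay and "freeze" behavior can be illustrated with a small sketch. This is not the BMS's actual logic; the class name `DelayedAlarm` and its parameters `on_delay` and `hold` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DelayedAlarm:
    """Hypothetical model of a setpoint message with a delay and a "freeze" time."""

    on_delay: float = 0.0  # seconds the condition must persist before the message appears
    hold: float = 0.0      # seconds the message stays on screen after the condition clears
    _active_since: Optional[float] = field(default=None, init=False, repr=False)
    _cleared_at: Optional[float] = field(default=None, init=False, repr=False)
    _shown: bool = field(default=False, init=False, repr=False)

    def update(self, condition: bool, now: float) -> bool:
        """Feed the current condition and a timestamp; return True if the message is shown."""
        if condition:
            self._cleared_at = None
            if self._active_since is None:
                self._active_since = now
            # The message appears only once the condition has lasted long enough.
            if now - self._active_since >= self.on_delay:
                self._shown = True
        else:
            self._active_since = None
            if self._shown:
                if self._cleared_at is None:
                    self._cleared_at = now
                # The message disappears only after the hold time has passed.
                if now - self._cleared_at >= self.hold:
                    self._shown = False
        return self._shown
```

With `on_delay=10` and `hold=5`, a condition that clears after 9 seconds never produces a message, while one that lasts 10 seconds does, and that message remains visible for 5 seconds after the condition clears.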
Alarms are usually divided by priority into three main types: "Red", "Yellow" and "Blue". "Red" alarms require immediate action by employees, "Yellow" ones warn them about something, and "Blue" ones most often report non-critical events. For example, we removed "Blue" alarms from the summary that the duty staff see and use them to monitor various commercial parameters (such as exceeding the ordered capacity). These alarms are reported only to managers and do not distract the duty staff.
To make it easier to configure equipment of the same type, variables with the same name (for example, "OutputCurrent") share the same settings on all devices in the system. If we change a setting in one place, it changes everywhere.
When a device needs individual settings for a variable, we apply a special mark, "Only for this device". The variable then becomes individual to that specific device: it has its own setting and does not affect other variables of the same name.
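One way to picture the shared setting and the "Only for this device" mark is a lookup that prefers a per-device override and otherwise falls back to the value shared by name. The sketch below is only an illustration: `SetpointRegistry`, the device names and the thresholds are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class SetpointRegistry:
    """Hypothetical store: settings shared by variable name, with per-device overrides."""

    def __init__(self) -> None:
        self._shared: Dict[str, float] = {}                 # variable name -> threshold
        self._overrides: Dict[Tuple[str, str], float] = {}  # (device, variable) -> threshold

    def set_shared(self, variable: str, threshold: float) -> None:
        """Change the setting for every device that has no individual override."""
        self._shared[variable] = threshold

    def set_for_device(self, device: str, variable: str, threshold: float) -> None:
        """Equivalent of ticking "Only for this device"."""
        self._overrides[(device, variable)] = threshold

    def get(self, device: str, variable: str) -> Optional[float]:
        """Return the override if present, else the shared value, else None."""
        return self._overrides.get((device, variable), self._shared.get(variable))


registry = SetpointRegistry()
registry.set_shared("OutputCurrent", 16.0)
registry.set_for_device("PDU-07", "OutputCurrent", 25.0)
```

Changing the shared value later affects every device except `PDU-07`, which keeps its own setting.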
Additionally, the devices themselves have their own factory settings. For example, the PDU is factory-configured to raise an overcurrent alarm at 32 A. If it is triggered, the PDU reports an alarm of the type “Overload Alarm”. This is a completely different variable, unrelated to the "OutputCurrent" variable configured in the BMS.
Example of factory default settings inside a PDU:
So, we have listed the main functions available for setting up the monitoring system.
How do we tune this "piano" correctly? Let's go through the tasks in order.
What we want to achieve
The most important task: any alarm message on the
The second task is to minimize false or uninformative messages. No matter how attentive and responsible your staff are, if something is constantly blinking and ringing in front of their eyes, they will either miss a real alarm, drowning in a sea of alerts, or turn off the sound and, as a result, also miss the alert about an incident.
Stage 1. Determination of the necessary and unnecessary variables for each device
Usually each device comes with a so-called "map of variables", on the basis of which the commissioning engineer creates a "driver". The driver's job is to tell the monitoring system which register of the received data holds the required variable. For example, register 1 of the device polling protocol contains the engine operation mode "System_on_fun", and register 2 contains the compressor operation mode "Compressor_1".
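In its simplest form, the role of such a driver can be sketched as a table from register numbers to variable names. The register numbers and names come from the example above; the function `decode_poll` and the sample values are hypothetical, and a real driver also deals with data types, scaling and protocol details.

```python
from typing import Dict

# Variable map from the example above: register number -> variable name.
VARIABLE_MAP: Dict[int, str] = {
    1: "System_on_fun",
    2: "Compressor_1",
}


def decode_poll(registers: Dict[int, int], variable_map: Dict[int, str]) -> Dict[str, int]:
    """Translate raw register values into named variables.

    Registers that are not in the map are ignored; mapped registers that
    are missing from the reply are simply absent from the result.
    """
    return {
        name: registers[register]
        for register, name in variable_map.items()
        if register in registers
    }


# Hypothetical poll reply: register 3 is not in the map and is dropped.
values = decode_poll({1: 1, 2: 0, 3: 99}, VARIABLE_MAP)
```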
The number of variables for one device is often more than 100. The employee who initially configures the system (usually an IT engineer) cannot decide on their own what is important here and what is not. As a rule, all variables are added to monitoring on the “just in case they come in handy” principle.
At first this is acceptable: the operations staff can look at the real values of all available variables and understand what they actually need. But if the system is left in this state for a long time, the following negative effects appear:
- Superfluous variables load the monitoring system and increase the size of the archive; the system is forced to process and store unnecessary data.
- The more variables are polled, the higher the probability of a polling failure. This is especially true for devices connected in a daisy chain (for example, through a gateway using the MODBUS protocol). The result is "no data (N/A)" or "disconnection" states, which means the device periodically drops out of monitoring.
- Some variables are superfluous "by default". For example, your version of the equipment has no compressor or pressure sensor, but these variables are defined in the universal driver for the whole model range and are still polled, added to the archive, and load the network and the processing.
The screenshots show part of the driver code. The // symbols mark variables hidden from polling. Also visible is the list of variables shown to the user when configuring setpoints in the BMS itself.
In our experience, it is better not to touch the factory settings inside the devices at the initial stage (unless, of course, they are already reporting an alarm). However, at every training session on a specific piece of equipment, employees should be reminded that settings exist both in the device itself and in the BMS. Later, this will help the duty staff understand exactly what caused an alarm message.
Superfluous variables in the driver should be gradually identified and hidden from polling, and the remaining ones should be divided into those that get setpoints and those that are stored without setpoints, only for later analysis and statistics.
This should be done not by the person who configures the monitoring system, but by an employee who understands the operation of the system being monitored, preferably the chief engineer or chief power engineer.
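The three groups of variables described above can be written down as a simple review plan. The sketch below is hypothetical: the `Role` names and the sample variables "Pressure_sensor" and "Supply_temperature" are invented for illustration, and the real decision is made by the engineer, not by code.

```python
from enum import Enum
from typing import Dict, List


class Role(Enum):
    HIDDEN = "hidden"              # not polled at all
    ARCHIVE_ONLY = "archive_only"  # polled and stored, but has no setpoint
    ALARMED = "alarmed"            # polled, stored and has a setpoint


def variables_to_poll(plan: Dict[str, Role]) -> List[str]:
    """Everything except hidden variables is polled."""
    return sorted(name for name, role in plan.items() if role is not Role.HIDDEN)


def variables_with_setpoints(plan: Dict[str, Role]) -> List[str]:
    """Only alarmed variables need setpoints configured."""
    return sorted(name for name, role in plan.items() if role is Role.ALARMED)


# Hypothetical review result for one device.
plan = {
    "Compressor_1": Role.ALARMED,
    "Supply_temperature": Role.ARCHIVE_ONLY,
    "Pressure_sensor": Role.HIDDEN,  # this equipment version has no such sensor
}
```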
Stage 2. Minimization of false and uninformative messages
False alarms often occur because of failures in polling the device. If the device's network card is not independently powered, both a polling failure and a real power outage will be displayed as the same type of failure, "communication break".
In this case, the equipment should be divided into critical (for example, PDUs) and ordinary (for example, "SHCHUV" ventilation panels). For ordinary equipment, you can set a delay for the "disconnection" signal (for example, 300 seconds), so that most false disconnections are ignored.
Clearly, such a delay cannot be set on critical equipment, so if it constantly produces false failures, you should look into the physical network and the number of polled variables. It is quite possible that too many devices are "hung" on one gateway and the network needs to be segmented by adding new gateways.
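The split between critical and ordinary equipment amounts to choosing a different "disconnection" delay per class. In this minimal sketch, the 300-second value is the example from the text, while the function names and the zero delay for critical devices are assumptions made for illustration.

```python
ORDINARY_DISCONNECT_DELAY_S = 300  # example delay for ordinary equipment, from the text
CRITICAL_DISCONNECT_DELAY_S = 0    # assumption: critical equipment alarms immediately


def disconnect_delay(is_critical: bool) -> int:
    """Return the delay in seconds before a "disconnection" becomes an alarm."""
    return CRITICAL_DISCONNECT_DELAY_S if is_critical else ORDINARY_DISCONNECT_DELAY_S


def should_alarm(seconds_without_reply: float, is_critical: bool) -> bool:
    """True once the device has been silent for at least its class delay."""
    return seconds_without_reply >= disconnect_delay(is_critical)
```

A ventilation panel that misses polls for 45 seconds stays quiet, while a PDU in the same situation raises an alarm at once.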
Uninformative alarms most often occur during transient processes. They cannot be called false: they really do occur, but they are "normal" for a specific operating mode of the equipment. The most obvious example is the transition to a diesel generator set (DGS).
In this case, the equipment that is powered without a UPS "normally" loses power and reports a "disconnection" error, while the UPSs themselves issue a whole bunch of messages: "no power at the input", "no power on bypass", "power supply from the battery", and so on. The staff immediately receive dozens of messages.
To optimize the number of messages when switching to the DGS, you should:
- for alarms that "normally" appear during the transition, set time delays longer than the time it takes for power from the generator set to appear. For example, set the delay for the "disconnection" signal of the ventilation panel to 300 seconds when the standard time for switching to the diesel generator set is 200 seconds.
Then power to the panel will return before the setpoint delay expires, and the situation will not be registered as an alarm. At the same time, there are critical devices that are powered by the UPS and must always be connected (for example, PDUs); messages about their "disconnection" should appear without delay.
- analyze the messages from the UPS when switching to a diesel generator set and divide them into "normal" ones, assigned the "yellow" type (for example, the statement of fact "there is no power at the input"), and "abnormal" ones, assigned the "red" type (for example, "disconnection of the battery breaker", which should not occur in any operating mode).
At the same time, we state separately in the instructions for the duty staff that during a transition to a diesel generator set, “yellow” alarms may simply be observed without being acknowledged (they clear by themselves once a normal transition is complete), while “red” alarms must be dealt with immediately (they should not occur at all).
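Both rules can be checked with a few lines of code. This is a sketch under stated assumptions: the 200-second transition time and the two classified messages ("no power at the input" as yellow, "disconnection of the battery breaker" as red) come from the text. The other two yellow entries, and the choice to treat unknown messages as red, are assumptions made for illustration.

```python
DGS_TRANSITION_S = 200  # standard transition time from the example in the text


def delay_covers_transition(delay_s: float, transition_s: float = DGS_TRANSITION_S) -> bool:
    """True if an alarm delay is long enough to ride through a normal DGS transition."""
    return delay_s > transition_s


# UPS messages seen during a transition, with their assigned priority.
UPS_MESSAGE_PRIORITY = {
    "no power at the input": "yellow",
    "no power on bypass": "yellow",             # assumption
    "power supply from the battery": "yellow",  # assumption
    "disconnection of the battery breaker": "red",
}


def must_act_now(message: str) -> bool:
    """Red messages need immediate action; unknown messages are treated as red (assumption)."""
    return UPS_MESSAGE_PRIORITY.get(message, "red") == "red"
```

With these values, a 300-second delay on the ventilation panel covers the transition, while a 200-second delay would not.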
Relying on theory alone, it is very difficult to get the setpoints for this “transient” process right in one go. For successful tuning, you need to observe several transitions to the DGS in real time.
For example, we needed to observe 4-5 transitions to reach an acceptable setup of the new BMS. To analyze an unscheduled transition, we recorded the monitoring system screen, because it is important to observe the alarms not in the event archive but as they appear dynamically in the operational summary.
Stage 3. Additional tips from our experience
1. The duty shift's screens should not show anything unnecessary in alarm colors.
A real-world example. One data center ordered a temperature flow map for its server room. This is a 3D model of air flows built from a large amount of temperature sensor data. The result was a view of the server room with streams of air: in some places the air was highlighted in green, in others in yellow or red (from coldest to hottest). Meanwhile, the air temperatures were within normal limits everywhere, and the colors were used only to show the temperature difference between points clearly.
Then this "motley" view was put on one of the monitors in the duty room. As a result, a tool created for process analytics ended up in front of the duty staff, who are trained to run to the equipment when they see red and to tense up when they see yellow.
Presumably, the duty staff were told that on the left screen "red/yellow" is normal, while on the right screen the same colors are a signal to act. However, it is clear that this practice seriously increases the risk of human error.
It makes sense to remove such systems from the monitors in the duty room. They should be watched by the chief engineer for trend analysis, for example after changes to the air parameters in the server room or the commissioning of new equipment.
2. Use SMS notifications with caution.
Several years ago, we were still wary of poor mobile Internet and used SMS instead of instant messengers. Once I accidentally entered a wrong setting; it was applied to the same-named variable on 100 devices, and my colleagues subscribed to the mailing list received 100 SMS messages each. Since then, we have not used SMS notifications.
3. Set up duplication of alarm messages through a messenger.
This can be done, for example, through Microsoft Teams or Telegram. Both you and the person on duty will receive alarm messages, and the phone will make a sound and vibrate (which does not happen when working with the system through a browser).
And don't be afraid that there will be too many messages. In our experience, on an average day of data center operation only a few dozen messages arrive, and they do not overload employees' phones. In other words, the data center equipment and the BMS can be configured so that you do not receive hundreds of notifications and still do not miss anything important.
To reduce the number of messages, include in the mailing list only notifications about the occurrence of "red" and "yellow" alarms, that is, the necessary minimum to keep your finger on the pulse of events.
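As a minimal illustration, such a filter keeps only alarms of the two priorities and only their appearance, not their clearing. The record layout with `priority` and `event` keys is a hypothetical one, not the BMS's actual format.

```python
from typing import Dict, Iterable, List

NOTIFY_PRIORITIES = {"red", "yellow"}


def for_messenger(alarms: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only notifications about the appearance of red and yellow alarms."""
    return [
        alarm
        for alarm in alarms
        if alarm["priority"] in NOTIFY_PRIORITIES and alarm["event"] == "raised"
    ]


# Hypothetical alarm records.
sample = [
    {"priority": "red", "event": "raised", "text": "PDU-01 disconnection"},
    {"priority": "red", "event": "cleared", "text": "PDU-01 disconnection"},
    {"priority": "blue", "event": "raised", "text": "Ordered capacity exceeded"},
    {"priority": "yellow", "event": "raised", "text": "UPS: no power at the input"},
]
```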
4. Group messages in messengers.
During a transition to a diesel generator set or a complex incident, you will have dozens of active alarms, and the phone will vibrate constantly from incoming messenger messages, preventing you from making an important call or opening the monitoring system window.
You can configure the distribution so that the messenger receives one combined message listing the alarms that occurred in the last minute. This setting does not affect how alarms appear in the BMS summary (they appear there without delay). With a one-minute delay in the message reaching your phone you will not miss anything, but there will be many times fewer messages.
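The grouping itself is simple to sketch: collect the alarms that arrive within one window and send them as a single message. The function below is a hypothetical illustration that works on a list of `(timestamp, text)` pairs; a real distribution would run on a timer and call the messenger's API.

```python
from typing import List, Optional, Tuple


def batch_by_window(events: List[Tuple[float, str]], window_s: float = 60.0) -> List[List[str]]:
    """Group (timestamp_in_seconds, text) events into batches, one per time window.

    A window opens at the first event and stays open for window_s seconds;
    the next event after that opens a new window.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start: Optional[float] = None
    for timestamp, text in sorted(events):
        if window_start is None or timestamp - window_start >= window_s:
            if current:
                batches.append(current)
            current = []
            window_start = timestamp
        current.append(text)
    if current:
        batches.append(current)
    return batches


def format_batch(batch: List[str]) -> str:
    """Render one batch as a single messenger message."""
    return f"{len(batch)} alarm(s) in the last minute:\n" + "\n".join(batch)
```

Five alarms arriving at 0, 10, 59, 61 and 200 seconds would be sent as three messages instead of five.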
5. Highlight the message about the loss of connection with the server in the interface.
For example, the Internet connection is lost in the duty room. The user interface has no connection to the server, so no alarm appears in the summary. The staff may not notice a dim “server is not available” caption, and employees can look at a “green” BMS panel with numerical parameters for a long time, unaware that it is offline.
The screenshot shows an example of the indication of lost communication with the BMS server, while out-of-date equipment parameters are still displayed.
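On the interface side, the check can be as simple as comparing the time of the last successful update from the server with the current time. In this sketch, the 30-second threshold, the function name and the banner text are all assumptions made for illustration.

```python
from typing import Optional

MAX_DATA_AGE_S = 30.0  # assumption: data older than this is treated as stale


def connection_banner(last_update_ts: float, now: float,
                      max_age_s: float = MAX_DATA_AGE_S) -> Optional[str]:
    """Return a prominent warning if the displayed data is stale, else None."""
    if now - last_update_ts > max_age_s:
        return "NO CONNECTION TO BMS SERVER: DISPLAYED VALUES ARE OUT OF DATE"
    return None
```

The point of the design is that the warning is driven by the age of the data on the client, so it still works when the server cannot deliver any alarm at all.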
6. Connect as many systems as possible to monitoring.
For example, a fire alarm system traditionally works autonomously, and its panel hangs at the security post.
Yes, on the "FIRE" signal the systems' automatic algorithms are triggered and the warning system starts, but the appearance of "Fault" or "Attention" signals is reported to the duty staff by the security officer, by voice.
It is very difficult to connect such a system to monitoring in full, but it is easy to configure three relay signals in it, "fault", "attention" and "fire", and then connect them via "dry contacts" to a BMS module.
This reduces the risk of the notorious human factor. An example of a test "FIRE" signal in the data center BMS, connected via "dry contacts".
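On the BMS side, three dry contacts boil down to three discrete inputs, each mapped to an alarm. The sketch below assumes priorities that the text does not specify ("fire" as red, "fault" and "attention" as yellow); the function name and message texts are hypothetical.

```python
from typing import Dict, List, Tuple

# Discrete input name -> (priority, message). Priorities are assumptions.
CONTACT_TO_ALARM: Dict[str, Tuple[str, str]] = {
    "fault": ("yellow", "Fire alarm system: fault"),
    "attention": ("yellow", "Fire alarm system: attention"),
    "fire": ("red", "Fire alarm system: FIRE"),
}


def active_alarms(inputs: Dict[str, bool]) -> List[Tuple[str, str]]:
    """Return alarms for every closed contact; missing inputs count as open."""
    return [
        alarm
        for name, alarm in CONTACT_TO_ALARM.items()
        if inputs.get(name, False)
    ]
```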
Summing up our four-part series
A data center monitoring system is more than just the “eyes and ears” for watching over the data center's engineering systems. When it works correctly, it helps achieve the highest level of reliability through the uninterrupted operation of the site and therefore gives the company an additional competitive advantage.
After a rather long and difficult journey, we ended up with:
- a fast and stable monitoring system that currently monitors more than 2,500 devices and calculates about 10,000 virtual sensors;
- system redundancy based on the Lindatacenter cloud solutions platform in St. Petersburg and Moscow;