Data center employee certification: how and why it is carried out at Linxdatacenter



Earlier, we talked about how we passed the Uptime Institute Management & Operations Stamp of Approval certification in 2018 and confirmed the level of compliance with its requirements in 2020. 



Today we will talk about training data center engineers and testing their knowledge: this is the experience of Linxdatacenter in St. Petersburg and the certification practice the organization has adopted in its work. 

A reminder of what this is about: the Management & Operations standard of the Uptime Institute, an industry expert body, evaluates how well a data center's engineering services are managed and is aimed at reducing the number of failures caused by the human factor. 



It is the result of an analysis of 6,000 data center outage episodes over 20 years of observing the industry and is one of the three parts of the broader industry standard, Operational Sustainability.



In addition to M&O (management and operation), it also includes Building Characteristics and Site Location. Data center management and operations issues in this hierarchy play a major role in the operational sustainability of the site. 



More than 75% of all failures are due to the human factor: it includes both direct operator errors and the adoption of incorrect management decisions when recruiting staff, building service processes, training and the general approach to work. 



Personnel training, instructions for action in various situations and routine maintenance reduce the number of failures by at least a factor of three.



Maturity attestation



One of the basic tenets of the standard, on which we based the performance appraisal program, is: “Having the right number of qualified employees is critical to achieving long-term goals. Without the proper number of qualified employees and the correct organization of their work, the data center will not have the resources to function successfully.” 



The standard recommends securing such employees through high-quality recruitment and the development of an integrated approach to data center maintenance. Such a program consists of preventive maintenance (PM), a cleaning policy, a maintenance management system (MMS) to track work, and a service level agreement (SLA).



The higher the Tier level of a data center, the more ambitious its performance targets and the more stringent the requirements for the organization, as the complexity and granularity of each of these elements increase. 



As a solution, the standard offers a comprehensive, formalized personnel training program based on a separate block of documentation. 



Only this approach ensures consistency in the operation and maintenance of the data center infrastructure. To quote the standard again: "All personnel must understand the policies, procedures and unique requirements of the data center to avoid unplanned downtime and respond to anticipated events." 



Actually, this is where our certification system originates. 



Its second “ideological pillar” is ISO 22301, “Security and resilience - Business continuity management systems”. 



This standard directly regulates the steps of companies (in all areas, not only IT) to ensure the continuity of their activities, regardless of the onset of emergency situations and adverse external conditions. 



One of its clauses states that the organization must determine the necessary competencies of persons performing work that affects its smooth functioning. Furthermore, companies are obliged to ensure that these individuals are competent on the basis of appropriate education, training or practical experience. 



The process should be maintained, improved and evaluated, with appropriate documented information retained as evidence of competence.



Finally, the third "pillar" of our program is our own experience of several years of consistent work to improve the coordination and efficiency of engineering services. This experience is reflected in our Emergency Operations Procedures (EOP) documentation, including personnel qualifications. 



Clearly documented and formalized procedures in the structure of business processes at the site in St. Petersburg make it possible to assess an employee's professional level and to determine whether their qualifications match the position held or the work performed.



Employees are required to pass certification on their knowledge of instructions, response scenarios for emergency and routine situations, the distribution of roles and areas of responsibility among the members of the duty shift, and so on. 



Main types and main tasks



Why do we need this? On the one hand, we somehow managed without certification before, and many (in fact, almost all) of our colleagues in the industry also do without it. 



On the other hand, it should be understood that a data center is a complex engineering facility consisting of many subsystems, the management of which requires the highest qualifications, responsibility and attention. 



We are constantly upgrading engineering subsystems and data center management process groups. Only recently, we introduced processes for preventive maintenance of diesel generator sets and analysis of the quality of the fuel supplied for them, control of the pressure level and air "back pressure" in server rooms, and a set of measures to prevent air pollution. A major modernization of the building management system (BMS) was also carried out, and a wide range of components of the LOTO system was put into operation.



In the course of these works, we were repeatedly convinced that any quality control methods bring good results only if they are formalized and applied on a regular basis - this is another reason for the introduction of mandatory certification.



In addition, such inspections help to stimulate the growth of efficiency and quality of work, determine the need for advanced training and "pull up" the level of knowledge of specific specialists, and also organize the correct placement of personnel, taking into account the level of their professional knowledge and skills.



Before a planned certification, managers carry out preparatory work: two weeks in advance, the personnel to be certified are informed of the certification criteria and the exam questions, and explanatory consultations are held.



All questions are accompanied by detailed answers with links to regulations and instructions. 



Procedure in essence



Certification is carried out by a commission of at least three people. The procedure consists of two stages. 



At the first stage, the employee being certified is tested using questionnaires and tests. The total number of questions is 60-70, depending on the specialization. During the certification, 15 are randomly selected. About 80% of the questions relate directly to the profession; the remaining 20% cover related areas of knowledge and the competencies of colleagues in the data center. 
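
The random selection described above can be sketched in a few lines of Python. This is only an illustration, not the code of the internal portal: the function name `draw_exam`, the pool sizes and the question labels are hypothetical, and the sketch assumes that the 80/20 split is applied to each drawn set of 15 questions (12 on the profession, 3 on related areas).

```python
import random

def draw_exam(core_questions, related_questions, total=15, core_share=0.8, seed=None):
    """Draw a random set of exam questions from two pools (assumed 80/20 split)."""
    rng = random.Random(seed)
    n_core = round(total * core_share)  # 12 of 15 when core_share is 0.8
    n_related = total - n_core          # the remaining 3
    # sample() draws without replacement, so no question is repeated
    ticket = rng.sample(core_questions, n_core) + rng.sample(related_questions, n_related)
    rng.shuffle(ticket)                 # mix core and related questions
    return ticket

# Hypothetical pool of 65 questions: 52 on the profession, 13 on related areas.
core = [f"core-{i}" for i in range(1, 53)]
related = [f"related-{i}" for i in range(1, 14)]

ticket = draw_exam(core, related, seed=2020)
print(len(ticket))  # 15
```

Passing a fixed `seed` makes a draw reproducible, which may be useful if the selection has to be recorded; omitting it gives a fresh random set each time.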



For the certification, a special internal portal was developed, which made it possible to automate the certification process and keep a record of it.







Examples of questions for employees of various departments  



Mechanics



Maintenance Section



  1. When is the next maintenance scheduled for the systems you are responsible for?
  2-5. [Text lost in extraction; the surviving fragments mention SLA and predictive maintenance.]


EOP



  1-2. [Text lost in extraction; both questions concern EOP.]
  3. [Text partly lost; the question refers to «Water loss alarm».]



[The questions of three further sections were lost in extraction. The surviving fragments mention "On Hold Waiting" and a section whose title ends with "(Common Instructions, Orders)".]



As you can see from the above examples, we take into account the current realities in which we work. In this case, these are questions as of December 2020. 



The second stage of the certification procedure consists of a personal interview between the commission and the specialist. The direct supervisor of the employee being certified must take part in the work of the certification commission. 



The main criteria for assessing an employee's professional competence are their level of training, including professional skills, their work results over a certain period of time, and their compliance with the requirements for the position held.



The decision is made by a majority in an open vote.



Verdicts



Based on the results of certification, a conclusion is made: 



  • corresponds to the position held; 
  • conforms, but not completely (re-certification is recommended); or 
  • does not correspond to the position held. 


In the first case, the employee can be included in the reserve for a higher position, and the terms of their employment contract do not change. In the last case, the question is either transfer to another job requiring lower qualifications or termination of the employment contract under clause 3 of part 1 of Art. 81 of the Labor Code of the Russian Federation. 



Incomplete compliance may lead to transfer, with the employee's consent, to another job, as well as referral to refresher courses (additional training).
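
The voting rule and the three possible conclusions can be sketched as follows. This is a minimal illustration under stated assumptions: the names `Verdict` and `commission_decision` are hypothetical, and the article does not say what happens when no conclusion gets more than half of the votes, so the sketch simply returns `None` in that case.

```python
from collections import Counter
from enum import Enum

class Verdict(Enum):
    CORRESPONDS = "corresponds to the position held"
    PARTIAL = "conforms, but not completely (re-certification is recommended)"
    DOES_NOT_CORRESPOND = "does not correspond to the position held"

def commission_decision(votes):
    """Return the verdict backed by more than half of the votes, else None."""
    if len(votes) < 3:
        raise ValueError("the commission must have at least three members")
    verdict, count = Counter(votes).most_common(1)[0]
    # Assumption: "majority" means strictly more than half of the votes cast.
    return verdict if count * 2 > len(votes) else None

votes = [Verdict.CORRESPONDS, Verdict.CORRESPONDS, Verdict.PARTIAL]
decision = commission_decision(votes)
print(decision.value)  # corresponds to the position held
```

With three members and three different votes the function returns `None`; how such a case is resolved in practice is not described in the article.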



Hard in training, easy in battle



An important role in the training process for data center operation personnel is played by the practical aspect - training and exercises. 



As an example, we will cite excerpts from the final protocol of the exercises on practicing the actions of the duty shift and security personnel of the data center in St. Petersburg.  



“Chronology of events



10:50 - A fire (simulated) broke out in room 107. The fire alarm and the voice notification system were triggered. 



10:50 - The head of the facility's security shift contacted the data center duty shift, informed them of the location of the fire, and tasked the security officer with organizing the evacuation of the data center clients.



11:07 - A security officer went to the data center to check the evacuation routes, unblock the gates on the evacuation routes, check that the full-height turnstile was unblocked, and organize the evacuation of people. The security officer was equipped with an electric flashlight, an insulating gas mask and a radio for communication.



11:07 - The data center security officer called the head of the security shift at the SKY-TRADE security service post to report the incident in the data center.



11:08 - Evacuation of people not involved in detecting and localizing (extinguishing) the fire began from the data center premises.



11:09 - Employees of the data center duty shift set out to check the cause of the fire alarm and to organize the evacuation of people from the data center.



11:11 - Employees of the data center duty shift arrived at the site of the suspected fire. The employees were equipped with electric flashlights and insulating gas masks.



11:12 - The security officer reported that all premises were clear and that people had been evacuated from the data center.



11:12 - Evacuation completed.



11:15 - The fire alarm and voice notification system were switched from "Fire" mode to standby mode. End of the fire drill.”



This is the chronological part of the report, which, per the timestamps above, runs from 10:50 to 11:15. Next, those responsible for conducting the exercise indicate any nonconformities identified and list the decisions made for the team. 



In this particular case, the duty shift employee's call to the fire brigade was not simulated, so the score is only "4". 



It is recommended to repeat the procedure for a fire signal in accordance with the instructions and conduct similar exercises for each shift of personnel at least once a quarter. 
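
A chronology like the one in the protocol lends itself to simple interval calculations, for example to compare drills across shifts. The sketch below is only an illustration: the timestamps are copied from the protocol and written as HH:MM, while the shortened event labels and the helper `minutes_between` are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the protocol above; the labels are shortened.
events = [
    ("10:50", "fire alarm triggered"),
    ("11:08", "evacuation started"),
    ("11:12", "evacuation completed"),
    ("11:15", "system returned to standby"),
]

def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

alarm_time = events[0][0]
for time, label in events[1:]:
    print(f"{label}: +{minutes_between(alarm_time, time)} min")
```

Intervals computed this way could be recorded for every quarterly exercise to see whether response times change from drill to drill.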



Conclusions and development plans



Formalizing and documenting the processes helps to ensure traceability over time (tracking dynamics) as well as the objectivity of assessments. 



At this stage, we have managed to implement an integrated approach to training data center personnel and checking their level of knowledge, on which indicators such as the continuity of the site's operation and, ultimately, the SLA for customers depend. 



In general, the system for confirming knowledge and skills that we have implemented reflects the general trend in how this area will develop. All business continuity solutions are built on an architecture of closely aligned specialists, policies, procedures and processes, as well as the organizational structure and resources of the company. 



And people in this list are in the first place.


