The most notable data center accidents of recent years and their causes





Data centers are becoming increasingly critical facilities: the day-to-day operations of many large and small companies, as well as the safety of ordinary users' data, depend on them running normally. Even a minute of downtime at a large data center can cost the operator's customers millions, and outages lasting many hours, let alone days, lead to losses that are sometimes impossible to estimate at all. Below are the most notable accidents of recent years, with a description of what caused them.



Fire in the OVH data center







In March 2021, one of the OVH data centers burned down almost completely. This is the biggest accident of recent years, since OVH is one of the largest hosting providers in Europe. The fire was so severe that it practically destroyed the SBG2 data center. The main suspect is one of the uninterruptible power supply systems, with the internal designation UPS7. On the eve of the fire this system had undergone maintenance, during which a large number of components were replaced. After the work was completed, UPS7 was restarted and seemed to operate normally. Soon afterwards, however, the fire broke out.



By the way, fires in data centers, especially on this scale, are extremely rare. The Uptime Institute keeps track of such cases: according to its representatives, data center fires happen less than once a year on average. In most cases the fire is stopped at the very start, but sometimes it still gets out of control.



In the OVH case, approximately 3.6 million websites faced outages caused by the fire in SBG2.



After studying the situation at OVH, experts concluded that there could have been several causes of the disaster, not just a faulty uninterruptible power supply. The escalation of the incident was facilitated by:



  • The tower design of the building. SBG2 relied on free cooling, and the open airflow path through the "tower" acted like a chimney, feeding the fire with fresh air and helping it spread between floors.
  • The absence of an automatic fire suppression system on the site. By the time firefighters arrived, the fire had already spread too far to save the building.


The latter is all the more strange given the huge number of fire-safety and monitoring solutions now available. There are, for example, sensors that track environmental parameters and can work in tandem with a UPS: the Eaton EMP002 environmental monitoring probe measures temperature and humidity and reports the state of paired devices such as smoke detectors or door contacts. There are also security systems that can register temperature changes of a fraction of a degree and monitor the concentration of carbon monoxide and other substances. When a problem is detected, such devices notify the technical support operator on duty and raise an alarm.
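
To make the idea concrete, here is a minimal sketch in Python of the kind of polling-and-alerting loop such environmental sensors enable. The sensor endpoint, JSON field names, thresholds, and the notify_operator hook are all hypothetical illustrations, not the actual Eaton EMP002 interface.

import json
import time
import urllib.request

SENSOR_URL = "http://10.0.0.15/api/environment"   # hypothetical sensor endpoint
TEMP_LIMIT_C = 35.0        # alert above this temperature
HUMIDITY_LIMIT_PCT = 70.0  # alert above this relative humidity
POLL_INTERVAL_S = 30

def read_sensor() -> dict:
    """Fetch a JSON reading like {"temperature_c": 24.1, "humidity_pct": 41.0}."""
    with urllib.request.urlopen(SENSOR_URL, timeout=5) as resp:
        return json.load(resp)

def notify_operator(message: str) -> None:
    """Stand-in for paging the on-duty engineer (e-mail, SMS, chat webhook, etc.)."""
    print(f"[ALERT] {message}")

def main() -> None:
    while True:
        try:
            reading = read_sensor()
            if reading["temperature_c"] > TEMP_LIMIT_C:
                notify_operator(f"Temperature {reading['temperature_c']} C exceeds {TEMP_LIMIT_C} C")
            if reading["humidity_pct"] > HUMIDITY_LIMIT_PCT:
                notify_operator(f"Humidity {reading['humidity_pct']} % exceeds {HUMIDITY_LIMIT_PCT} %")
        except Exception as exc:
            # A sensor that stops answering is itself a reason to alert.
            notify_operator(f"Sensor unreachable: {exc}")
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    main()

In a real deployment the alert would, of course, go to a paging or ticketing system rather than stdout, and the thresholds would follow the operator's own environmental policy.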



Fire in the WebNX data center







In April 2021, a fire broke out at the Ogden data center of the American company WebNX. A backup generator caught fire, and the flames spread to adjacent rooms. As a result, the facility lost power completely and server equipment was damaged; some of the servers hit hardest by the fire are unlikely to be recoverable.



The situation got out of control after the municipal power feed to the data center went down: several standby generators started up, but one of them malfunctioned, which led to the fire.



The firefighters who arrived put out the fire, but the water they used damaged equipment in the areas affected by the blaze.



Servers belonging to Gorilla Servers were also located in this data center. That company's equipment was not damaged, but the power outage took its services and customer sites offline. The data center was without power for several hours, and restoring all systems took about 20 hours. The operator's losses from the incident exceeded $25 million.



Failure of the TSB bank data center



In September 2018, the British bank TSB decided to carry out a large-scale migration of IT equipment without first testing the new data center. Most frustrating for the company, the IT services provider Sabis, which had been hired to perform the migration, tested all the data centers involved except one, and the fact that this testing had not been done was concealed from management.



The result was deplorable: two million of the bank's clients lost access to their accounts at once. The bank had to spend about $480 million to deal with the consequences of the disruption, including roughly $35 million for the incident investigation.



Fire in Telstra's London data center



In August 2020, the London data center of Telstra, Australia's largest telecommunications company, was damaged. As at OVH, the problem was caused by a faulty UPS. Although, unlike at OVH, the fire was contained, the incident affected most of the data center's 11,000 square meters. The premises where the fire broke out housed about 1,800 server racks.



Four fire engines and 25 firefighters were dispatched to the site. The crews appear to have done their job well: the fire managed to seriously damage only a small part of the facility, and none of the personnel were injured.



Nevertheless, several dozen servers went offline, and their operation was restored only after a few hours. Accordingly, the services and sites of Telstra's clients were down. The company's total losses exceeded $10 million, not counting the reputational damage.



UPS failure in Equinix LD8 data center



In August 2020, a problem arose with the power supply of the Equinix LD8 data center: after a mains outage, a UPS failed. There was no fire, but the electrical problem could not be resolved for several hours, so many customers were affected.



The accident happened at a data center in London's Docklands, and support staff were able to identify the cause of the problem almost immediately. As it turned out, the failed UPS had de-energized the main cluster of Juniper MX and Cisco LNS routers, the very cluster that kept most of the data center's equipment running.



Once the cluster lost power, the services of Equinix's largest clients were cut off, including the international telecom companies Epsilon, SiPalto, EX Networks, Fast2Host, ICUK.net, and Evoke Telecom. The accident also affected the operation of other data centers.



In conclusion, these are far from all the accidents that have happened over the past few years, but they are probably the most revealing, because each of them could have been prevented. Unqualified staff, UPS problems, and power outages are common causes. What challenging data center incidents have you faced? If you have a story to tell, let's discuss it in the comments.



Bonus: power outage due to maintenance



There are also situations that are quite difficult (though possible) to foresee. For example, The Register once retold a story sent to its editorial office by one of its readers. There was a server farm with three 220 kVA UPSs that had worked normally for a long time. Over time, the need for one of the UPSs disappeared, and it was decided to move it to a newly opened data center. Management planned to save money on buying a new UPS, but things turned out differently.



It is worth noting that the data center in question was fairly large, with an area of about 2,500 square meters. It housed a lot of equipment, several hundred servers, so any downtime was simply unacceptable.



Professional electricians were brought in and tasked with disconnecting one of the UPSs from the mains and transporting it to the new data center, where it would be reconnected. In the end, the professionals did something wrong, and the data center lost power completely.



"I was sitting at my desk when the electricians began to unplug the UPS unit from the mains. They put the system on bypass without any problem. They then cut off the output circuit breaker and some more wires to speed up dismantling. And then the 2,500-square-meter data center suddenly fell silent. I ran to the data hall, expecting to find electrocuted electricians. But they were just calmly disconnecting the wires. I shouted that the data center had gone offline, to which the electricians replied that the equipment was powered in bypass mode. I repeated myself. They stopped, thought for ten seconds, and then their eyes opened really wide," the eyewitness recalled.



Restoring the data center took 36 hours, although the electricians had initially promised an hour of downtime.


