Preparing a DRP - Don't Forget the Meteorite



Even during a disaster, there is always time for a cup of tea.

A DRP (disaster recovery plan) is the kind of document you ideally never need. But if beavers migrating during mating season suddenly gnaw through the backbone fiber, or a junior admin drops the production database, you will definitely want a ready-made plan for dealing with the whole mess.



While panicking customers flood the support lines and the junior hunts for cyanide, you calmly open the red envelope and start putting things in order.



In this post, I want to share recommendations on how to write a DRP and what it should contain. We will also cover the following:



  1. How to think like a villain.
  2. The benefits of a cup of tea during the apocalypse.
  3. A convenient structure for a DRP.
  4. How to test it.


Which companies will find this useful



It is hard to draw the line at which an IT department starts to need these things. I would say you are guaranteed to need a DRP if:



  • Stopping a server or an application, or losing a database, would cause significant damage to the business as a whole.
  • You have a full-fledged IT department: a proper unit of the company with its own budget, not just a few tired employees laying cable, cleaning up viruses and refilling printers.
  • You have a realistic budget for at least partial redundancy in the event of an emergency.


If the IT department has been begging for months for a couple of HDDs for an old backup server, you are unlikely to be able to organize a full failover of a dead service to reserve capacity. Still, even then the documentation will not be superfluous.



Documentation is important



Start with the documentation. Suppose your service is built on a Perl script written three generations of admins ago, and no one knows how it works. Accumulated technical debt and missing documentation will inevitably shoot you not just in the knee but in every other limb as well; it is only a matter of time.



Once you have a good description of the service components, pull up the failure statistics. They will almost certainly turn out to be entirely typical. For example, a disk fills up from time to time and takes a node down until someone cleans it manually. Or the client-facing service becomes unavailable because someone once again forgot to renew the certificate, and no one could, or wanted to, set up Let's Encrypt.
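The good news is that such typical failures are easy to check for in advance. A minimal sketch in shell, assuming a Linux host and using example.com as a stand-in for your client-facing domain:

    # Warn if any filesystem is 90% full or more (the classic "disk filled up" failure).
    df -h | awk 'NR > 1 && $5+0 >= 90 {print "WARNING: " $6 " is at " $5}'

    # Show when the TLS certificate of the client-facing domain expires
    # (replace example.com with the real host).
    echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
        | openssl x509 -noout -enddate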



Think like a saboteur



The hardest part is predicting accidents that have never happened before but could potentially kill your service outright. Here we usually play villains with colleagues. Grab plenty of coffee and something tasty and lock yourselves in a meeting room. Just make sure the people you lock in are the engineers who built the target service themselves or work with it regularly. Then, on a whiteboard or on paper, start sketching every possible horror that could happen to your service. There is no need to go down to the level of a specific cleaner yanking out a specific cable; a scenario like "loss of local network integrity" is enough.



Usually, most typical emergency situations fit into the following types:



  • Network failure
  • OS service failure
  • Application failure
  • Hardware failure
  • Virtualization failure


Just go through each type and see which ones apply to your service. For example, the Nginx daemon crashing and failing to come back up is a failure on the OS side. A rare condition that drives your web application into a broken state is an application failure. At this stage it is important to work out how to diagnose the problem: how to tell a hung virtualization interface from a dead Cisco switch or a network outage, for example. You need to be able to quickly find the people responsible and tug at their tails until the accident is fixed.
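To make the diagnosis step concrete, here is a rough triage sketch for a typical web service in shell. The service name nginx and the local health-check URL are assumptions; substitute your own.

    #!/bin/bash
    # Rough triage: is it the OS service, the application, or the network?

    # 1. Is the daemon running at all? (OS service failure)
    systemctl is-active --quiet nginx || echo "OS level: nginx is not running"

    # 2. Does the application answer locally? (application failure, network ruled out)
    curl -fsS --max-time 5 -o /dev/null http://127.0.0.1/ \
        || echo "Application level: no answer on localhost"

    # 3. Is the default gateway reachable? (network or switch failure)
    gw=$(ip route | awk '/^default/ {print $3; exit}')
    ping -c 3 -W 2 "$gw" > /dev/null || echo "Network level: gateway $gw is unreachable"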



Once the typical problems are written down, pour more coffee and start considering the strangest scenarios, where some parameter drifts outside its normal range. For instance:



  • What happens if the clock on the active node moves back a minute relative to the others in the cluster?
  • What if it jumps forward? What if it jumps forward by 10 years?
  • What happens if a cluster node suddenly loses the network in the middle of synchronization?
  • What happens if two nodes cannot agree on leadership because they are temporarily isolated from each other over the network?


The reverse approach helps a lot at this stage. Take the most stubborn team member with the sickest imagination and give him the task of staging, as quickly as possible, an act of sabotage that will take the service down. If it is also hard to diagnose, even better. You would not believe what strange and brilliant ideas engineers come up with once you invite them to break something. And if you promise them a test bench for it, better still.
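If you do hand out a test bench, a few one-liners are already enough to stage the scenarios above. A sketch, assuming eth0 is the cluster interface, and strictly for the bench, never for production:

    # Jump the clock forward (the "what if time moves ahead?" scenario).
    date -s "2035-01-01 12:00:00"

    # Add three seconds of latency and 10% packet loss on the cluster interface
    # (nodes losing each other during synchronization).
    tc qdisc add dev eth0 root netem delay 3000ms loss 10%

    # Or isolate the node from the network entirely for a while.
    ip link set eth0 down

    # Undo the damage when the exercise is over.
    tc qdisc del dev eth0 root
    ip link set eth0 up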



What is this DRP of yours?!



So, you have defined the threat model. You have accounted for the locals who cut fiber-optic cable in search of copper, and for the military radar that knocks out the radio relay link strictly on Fridays at 4:46 pm. Now you need to decide what to do with all of this.



Your task is to write those very red envelopes that get opened in an emergency. Assume right away that when (not if!) everything falls apart, the only person around will be the greenest trainee, whose hands will be shaking violently from the horror of what is happening. Look at how emergency posters are done in medical offices, for example the one for anaphylactic shock. The medical staff know all the protocols by heart, but when the person next to them starts dying, very often everyone just grabs helplessly at everything at once. That is why there is a clear instruction on the wall with items like "open this package" and "inject this many units of the drug intravenously".



It is hard to think in an emergency! The instructions should be simple enough to be executed by the spinal cord alone.


A good DRP consists of a few simple blocks:



  1. The condition that activates the plan: when the service is considered down.
  2. How to diagnose the problem - for example, systemctl status servicename.
  3. Who to notify and within what time frame under the SLA.
  4. Step-by-step actions to restore the service.


Remember that the DRP kicks in when the service has failed completely, and it ends when the service is restored, even at reduced capacity. Merely losing redundancy should not trigger the DRP. You can also add a cup of tea to the DRP. Seriously. Statistically, many accidents go from unpleasant to catastrophic because the staff panic and rush to fix something, killing the only surviving node with data or finishing off the cluster along the way. Five minutes for a cup of tea will usually give you enough time to calm down and analyze what is happening.



Do not confuse a DRP with a system passport! Do not overload it with unnecessary data. Just make sure you can quickly follow hyperlinks into the relevant sections of the full documentation and read about the service architecture there in extended form. The DRP itself should contain only direct instructions on where and how to connect, with specific commands ready to copy-paste.
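For illustration only, a single step of such an envelope might look like the sketch below; the host name, user and service are made up and stand in for your own.

    # Step 3 of the hypothetical "frontend is down" envelope:
    # bring the service up on the reserve node.
    ssh admin@web-reserve.example.com
    sudo systemctl start nginx
    sudo systemctl status nginx   # must show "active (running)" before you move on
    # If it does not start, go straight to step 4: escalate to the on-call senior.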



How to test correctly



Make sure that any person on call is able to complete every step. At the most critical moment it may turn out that the engineer has no access rights to the required system, no passwords for the required account, or no idea what "connect to the service management console through the proxy at the head office" means. Every step should be dead simple.



Wrong - "Go to virtualization and reboot the dead node"

Correct - "Connect via the web interface to virt.example.com, in the nodes section, reboot the node that causes the error."



Avoid ambiguity. Remember the frightened trainee.



Be sure to test the DRP. It is not just a box-ticking exercise; it is what will let you and your customers get out of a critical situation quickly. It is best to test it in several stages:



  • One expert and several trainees work on a test bench that simulates the real service as closely as possible. The expert breaks the service in various ways and lets the trainees restore it following the DRP. Every problem, ambiguity in the documentation and error is written down. After the trainees have been through the exercise, the DRP is amended and simplified wherever it was unclear.
  • A surprise drill on the test bench. The on-duty engineer, without warning and without the plan's author nearby, restores the broken service using only the DRP while the time to recovery is measured, for example against a 10-minute target.
  • A controlled exercise on the production system during a quiet window. A node is deliberately taken down and brought back strictly by the DRP, confirming that the plan works on the real infrastructure and not only on the bench.




  1. Write documentation for the service and its components first.
  2. Collect statistics on past failures; most of them will turn out to be typical.
  3. Play the villain with colleagues and think through the accidents that have never happened yet.
  4. Work out how each type of failure is diagnosed.
  5. Write the red envelopes with the frightened trainee in mind.
  6. Keep the DRP simple. Direct instructions only, with commands ready for copy-paste, and hyperlinks out to the full documentation. Do not turn it into a system passport.
  7. Test the DRP.
  8. Keep the DRP up to date.
  9. Add a cup of tea. And do not forget the meteorite.








