In my experience, companies with a huge server fleet tend to build mass-management tools. These tools go by different names, but the essence is roughly the same: you log in via ssh as root, run a command and, possibly, get back some kind of exit code and/or output.
In certain situations, this is the only way to quickly put out a fire - and in such moments, you are grateful that the tool exists.
But this note is about something else. It is about the other side of the coin, when someone uses one of these tools and creates a problem. Maybe they decided to quickly roll out a change to all servers instead of following best practices (tests, a trial run, a rollout to part of the audience, gradual deployment, etc.). Perhaps they decided to push a new binary to every machine at once, and now the machines all crash at the same time, leaving nothing to serve the actual site.
There is one thing I have asked to be added to such tools to prevent certain kinds of disasters. It addresses a specific situation: someone mistakenly runs a command on too many machines. Maybe they meant to target a rack of test hosts (about 40) but accidentally chose every machine.
Once you have such a tool, you will see mistakes like this yourself.
My request is simple enough: if you are going to ask for confirmation as a sanity check, *do not* use a Y/N prompt. Instead, ask the person to read the number of machines from the screen and type it back.
It will look like this:
<pre>Machines affected -- 123456. Continue?
Enter the number of machines:</pre>
Then, to execute the command, you must type "123456" exactly.
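A minimal Python sketch of such a check might look like the following. The function name `confirm_host_count`, the prompt wording, and the `read_line` parameter (which exists only so the check can be exercised without a terminal) are assumptions made for illustration, not part of any particular tool.

```python
def confirm_host_count(host_count, read_line=input):
    """Return True only if the operator types the host count back exactly.

    Hypothetical helper: `read_line` defaults to the built-in input(),
    but any callable that takes a prompt and returns a string will do.
    """
    print(f"This will run on {host_count} hosts. Continue?")
    answer = read_line("Type the number of hosts to confirm: ")
    # "y", "yes", or an empty line never match; only the exact number does.
    return answer.strip() == str(host_count)
```

A caller would run the command only when `confirm_host_count(len(hosts))` returns `True`, and abort otherwise.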
The idea is that a person takes this number in through their usual input devices (I would say "eyes", but some people use screen readers or something similar), processes it in their brain, and then somehow hands it back to the computer. A few extra steps like this should hopefully activate enough gray matter for the person to stop before they shoot their entire leg off with a giant leg gun.
Of course, if you hit this prompt all the time and really do need to run on that many machines, you can just copy and paste the number. In that case, I would say you are using the tool too often and should think about changing your workflow so that you rely on it less.
But in a real company, it is not easy to simply “stop using” the tool. In that situation, one option is to break the number up with separators so that it cannot be thoughtlessly copied and pasted.
For example, print the number with one of your locale's digit-group separators: 123,456 or 123.456 or 123 456, or whatever else suits you. The trick is not to accept that form as input, but to require the person to drop the separator and enter just the digits.
<pre>Machines affected -- 123 456. Continue?
Enter the number of machines: 123456
Confirmed! Running.</pre>
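The separator variant could be sketched in Python roughly as follows. The names `format_grouped` and `confirm_grouped`, the prompt text, and the `read_line` parameter are hypothetical and chosen only for this illustration.

```python
def format_grouped(host_count, separator=" "):
    """Render the count with digit-group separators, e.g. 123456 as '123 456'."""
    # Python's "," format spec groups digits by thousands; swap in our separator.
    return f"{host_count:,}".replace(",", separator)


def confirm_grouped(host_count, read_line=input):
    """Show the grouped number, but accept only the bare digits as confirmation."""
    print(f"This will run on {format_grouped(host_count)} hosts. Continue?")
    answer = read_line("Type the number of hosts without separators: ")
    # Pasting the displayed text back ('123 456') does not equal '123456',
    # so the operator has to read the number and retype it.
    return answer.strip() == str(host_count)
```

Note that counts below 1000 contain no separator, so this only adds friction for large runs, which is where the mistake is most costly.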
I have seen this technique save people on many occasions, and I am sharing it here in the hope that it will help others. If you are developing something powerful enough, consider securing the system this way.
Just think: numbers jump off the screen, jump into a person, jump back and forth in the head - and return to the computer. Everything is part of one network.
Rachel Kroll, Facebook sysadmin