ossh: parallel execution of commands on many servers

Sometimes you need to run some command on many servers, and you would rather not wait long for the results. For this, I wrote ossh (One SSH to rule them all). Here's an example of how it works:



$ wc -l /tmp/ossh.ips
21418 /tmp/ossh.ips
$ time ossh -n -h /tmp/ossh.ips -c uptime -p 1000 >/tmp/ossh.out

real    3m10.310s
user    0m30.970s
sys     0m19.282s
$ grep 'load average' /tmp/ossh.out | sort -n -k5 | tail -n1
10.23.91.97   [1]  13:37:55 up 828 days,  2:34,  0 users,  load average: 8.29, 4.45, 3.90
$

In this example, the file /tmp/ossh.ips contains the IP addresses of 21418 machines. The -n option shows IP addresses in the output instead of host names, so no reverse DNS lookups are needed. -c uptime sets the command to run, and -p 1000 allows up to 1000 simultaneous connections. As the timing shows, the command ran on all of the machines in a little over three minutes.
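
To make the effect of -p more concrete, here is a minimal sketch of bounded parallelism in plain shell. It is not how ossh is implemented: run_parallel is a hypothetical helper, and the inner sh -c command only simulates one second of work where a real tool would open an ssh connection.

```sh
# run_parallel PARALLELISM HOSTS_FILE
# Runs one job per line of HOSTS_FILE, keeping at most PARALLELISM jobs in flight.
# The "sh -c" body is a stand-in (an assumption for this sketch) for "ssh HOST COMMAND".
run_parallel() {
    xargs -P "$1" -n 1 sh -c 'sleep 1; echo "$1 done"' _ < "$2"
}
```

Called as run_parallel 4 hosts.txt, the helper finishes four hosts in roughly the time one takes; with a parallelism of 1 the same hosts are processed strictly one after another.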



What else can ossh do?



$ ossh -?
Usage: ossh [-?AinPv] [-c COMMAND] [-C COMMAND_FILE] [-H HOST_STRING] [-h HOST_FILE] [-I FILTER] [-k PRIVATE_KEY] [-l USER] [-o PORT] [-p PARALLELISM] [-T TIMEOUT] [-t TIMEOUT] [parameters ...]
 -?, --help        Show help
 -A, --askpass     Prompt for a password for ssh connects
 -c, --command=COMMAND
                   Command to run
 -C, --command-file=COMMAND_FILE
                   file with commands to run
 -H, --host=HOST_STRING
                   Add the given HOST_STRING to the list of hosts
 -h, --hosts=HOST_FILE
                   Read hosts from file
 -i, --ignore-failures
                   Ignore connection failures in the preconnect mode
 -I, --inventory=FILTER
                   Use FILTER expression to select hosts from inventory
 -k, --key=PRIVATE_KEY
                   Use this private key
 -l, --user=USER   Username for connections [$LOGNAME]
 -n, --showip      In the output show ips instead of names
 -o, --port=PORT   Port to connect to [22]
 -p, --par=PARALLELISM
                   How many hosts to run simultaneously [512]
 -P, --preconnect  Connect to all hosts before running command
 -T, --connect-timeout=TIMEOUT
                   Connect timeout in seconds [60]
 -t, --timeout=TIMEOUT
                   Run timeout in seconds
 -v, --verbose     Verbose output
$

The list of hosts can be specified either directly on the command line with the -H option (several hosts must be separated by spaces, and the entire list must be enclosed in quotes, as in the examples below) or loaded from a file with the -h option. Lines starting with # in the file are ignored. An address can include a port: my.host:2222. You can use brace expansion: "host{1,3..5}.com" becomes "host1.com host3.com host4.com host5.com". Both -H and -h can be used multiple times.
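
The host syntax described above can be illustrated with a short shell sketch. This is only an illustration of the rules, not ossh's own parser: expand_hosts, read_hosts and is_number are hypothetical helpers, the sketch assumes one host pattern per line in the file, and zero-padded range bounds such as {08..10} are not handled.

```sh
# is_number STRING: succeeds only for a non-empty string of digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
}

# expand_hosts PATTERN: print one host per line.
# The body runs in a subshell "( ... )" so variables stay local to each recursive call.
expand_hosts() (
    pattern=$1
    ob='{'
    cb='}'
    case "$pattern" in
        *"$ob"*"$cb"*) ;;               # has a brace group: expand it below
        *) echo "$pattern"; exit ;;     # plain host (possibly host:port): print as-is
    esac
    prefix=${pattern%%"$ob"*}           # text before the first "{"
    rest=${pattern#*"$ob"}              # text after the first "{"
    body=${rest%%"$cb"*}                # comma-separated items inside the braces
    suffix=${rest#*"$cb"}               # text after the closing "}"
    set -f                              # no filename globbing while splitting
    IFS=,
    for item in $body; do
        first=${item%%..*}
        last=${item#*..}
        if [ "$first" != "$item" ] && is_number "$first" && is_number "$last"; then
            # Numeric range "first..last": emit every number in it.
            n=$first
            while [ "$n" -le "$last" ]; do
                expand_hosts "$prefix$n$suffix"
                n=$((n + 1))
            done
        else
            # Literal item.
            expand_hosts "$prefix$item$suffix"
        fi
    done
)

# read_hosts FILE: expand every line of FILE, skipping blank lines and # comments.
read_hosts() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        expand_hosts "$line"
    done < "$1"
}
```

With these helpers, expand_hosts 'host{1,3..5}.com' prints host1.com, host3.com, host4.com and host5.com, one per line.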



The following are tried for authentication, in this order:

  • the password that ossh prompts for when the -A option is used
  • the SSH private key specified with the -k option
  • ssh-agent (in this case, the SSH_AUTH_SOCK environment variable must be defined)



Sometimes you need to make sure you can log in to all machines before executing a command. The -P option does this: ossh connects to every host first and only then runs the command. By default, ossh fails if at least one machine is unreachable; to ignore failed connections, use the -i option.



ossh can use your inventory system. To do this, an ossh-inventory command must be available in your PATH; the arguments of the -I option are passed to it. This option can be used multiple times. The ossh-inventory command should print lines to standard output in the following format:

machine_name machine_address

Here machine_address can be either a DNS name or an IP address.
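
As an illustration, here is a minimal sketch of what an ossh-inventory command could look like. Everything in it is an assumption made for the example: the inventory file path and its "name address tag..." layout are hypothetical, the filter is treated as a plain tag passed as a command-line argument, and each output line is taken to be a machine name followed by its address.

```sh
#!/bin/sh
# Hypothetical ossh-inventory sketch: for every filter given as an argument,
# print "name address" for each inventory entry that carries that tag.

# Assumed flat inventory file, one machine per line: NAME ADDRESS TAG [TAG...]
INVENTORY_FILE=${INVENTORY_FILE:-/etc/ossh/inventory.txt}

# inventory_lookup TAG: print name and address of every machine tagged TAG.
inventory_lookup() {
    awk -v filter="$1" '
        $1 ~ /^#/ { next }                      # skip comment lines
        {
            for (i = 3; i <= NF; i++) {         # tags start at the third field
                if ($i == filter) { print $1, $2; break }
            }
        }
    ' "$INVENTORY_FILE"
}

# Run one lookup per filter passed on the command line.
for filter in "$@"; do
    inventory_lookup "$filter"
done
```

A real inventory command would more likely query a CMDB or a cloud API; the flat file only keeps the sketch self-contained.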



The commands to be executed are specified by the -C (read from file) or -c (take from the command line) options. These options can be used multiple times. If both -C and -c are present, commands from files will be executed first, then from the command line.



In addition to simply executing commands, ossh lets you stream logs in real time:



$ ossh -H "web05 web06" -c "tail -f -c 0 /var/log/nginx/access.log|grep --line-buffered Wget"
web05 192.168.1.23 - - [22/Jun/2016:12:24:02 -0700] "GET / HTTP/1.1" 200 1532 "-" "Wget/1.15 (linux-gnu)"
web05 192.168.1.49 - - [22/Jun/2016:12:24:07 -0700] "GET / HTTP/1.1" 200 1532 "-" "Wget/1.15 (linux-gnu)"
web06 192.168.1.117 - - [22/Jun/2016:12:24:23 -0700] "GET / HTTP/1.1" 200 1532 "-" "Wget/1.15 (linux-gnu)"
web05 192.168.1.29 - - [22/Jun/2016:12:24:30 -0700] "GET / HTTP/1.1" 200 1532 "-" "Wget/1.15 (linux-gnu)"
...

Here is a rolling deployment simulation:



$ ossh -p 1 -H "test0{1..3}" -c "sleep 10 && date"
test01 Wed Jun 22 12:38:24 PDT 2016
test02 Wed Jun 22 12:38:34 PDT 2016
test03 Wed Jun 22 12:38:44 PDT 2016
$

As you can see, the commands are executed on the machines sequentially, with only one machine involved at a time. For a real deployment, "sleep 10 && date" should be replaced with, for example, "apt-get install -y your_package".



The very first version of ossh was written precisely for deployment. Why didn't I use a generally accepted configuration management system? This was back in 2013, and we were using Chef. Chef did not suit us, in particular because it was unclear exactly when changes would be applied (chef-client ran every 30 minutes). To roll out changes consistently across many machines, some developers resorted to a dirty hack: chef-client was not kept running all the time but was launched once (via ssh) only when a deployment was needed. Work to replace Chef with Salt was already underway, but the transition was not easy and needed more time to complete. Meanwhile, we were developing a new service that required frequent deployments and was shipped as a single Debian package. At first we used Chef's knife utility, which can connect to the desired servers via ssh and execute commands on them. At some point I realized that Chef was an unnecessary link in this chain, and I wrote ossh.



It is important to note that ossh is a tool for large-scale, non-standard tasks. If you find yourself needing ossh often, that is a reason to ask whether your infrastructure and server management are in good shape. Here are some situations in which ossh helped me personally:

  • Once I tidied up /root/.ssh/authorized_keys on a large number of servers (about 7000 at the time). Developers had added their keys there, in particular for the processes that updated their services. I needed to get a list of all the keys in use on all machines and make sure that deleting them would not be catastrophic.
  • To get through a leap second painlessly.
  • When we were fighting TCP SACK PANIC, the iptables rules were rolled out by the configuration management system. To make sure everything was fine, I checked for the correct rules with ossh. That turned out to be worthwhile: there were machines on which the rules had not been applied.
  • Sometimes I have to create test environments with hundreds (and sometimes thousands) of machines. Often these machines are isolated from the production network and cannot be reached by the standard configuration management system. In situations like this, the machines can be configured with ossh.
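
A check like the iptables one boils down to comparing the list of hosts with the collected output. Below is a minimal sketch of that comparison, assuming (as in the examples above) that every output line starts with the host exactly as it is written in the hosts file; hosts_without_match is a hypothetical helper, not part of ossh.

```sh
# hosts_without_match HOSTS_FILE OUTPUT_FILE PATTERN
# Prints every host from HOSTS_FILE that has no line matching PATTERN in OUTPUT_FILE.
hosts_without_match() {
    hosts_file=$1 output_file=$2 pattern=$3
    expected=$(mktemp)
    matched=$(mktemp)
    # All hosts we expected an answer from (comments and blank lines dropped).
    grep -v -e '^#' -e '^$' "$hosts_file" | sort -u > "$expected"
    # Hosts whose output contains the pattern; the host is the first field of each line.
    grep -e "$pattern" "$output_file" | awk '{ print $1 }' | sort -u > "$matched"
    # Lines present only in the first file: hosts with no matching output.
    comm -23 "$expected" "$matched"
    rm -f "$expected" "$matched"
}
```

Hosts that failed to connect and hosts where the expected rule is missing both end up in the result, which is usually what you want for a follow-up run.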


I anticipate the question of why I did not use a ready-made solution. As I mentioned above, I first needed to run commands on thousands of machines in 2013. At that time, the only tool I could find was parallel ssh, which did not suit me for the following reasons:

  • I could not raise the parallelism above 150; beyond that, errors began to occur when connecting to remote servers
  • parallel ssh accumulated all output and printed it only when the command finished, so streaming logs, for example, was impossible
  • The output of parallel ssh was (for me personally) awkward to parse


ossh was originally written in Ruby; to improve performance I used EventMachine and later fibers. I recently rewrote ossh in Go. I would be grateful if Go experts (I am not one yet) took a look at my code and pointed out possible ways to improve it.


