Just add some Salt

How we migrated 700+ servers to Salt



For a long time we got by with a complex and cumbersome setup: two Git repositories, part of the data stored in MySQL, and the rest managed by Puppet 3.8. But our needs grew, the number of services increased, and configuration performance dropped. So we set ourselves the task of improving the configuration and optimizing the data and tools we already had.



Our team arrived at a suitable configuration in three stages. Here we share our experience of optimizing Salt: how to apply it and customize it without extra effort.



Note: on Habr we found great articles by our colleagues, so we will not dwell on topics already covered. We highly recommend reading:



What is good about SaltStack, and what tasks can be solved with it — an article from ptsecurity (Positive Technologies).



Installation, launch, first commands and getting to know the functions — an article by zerghack007.


Salt is a configuration management and remote execution system: an open-source infrastructure framework written in Python.





Why Salt?



Salt, Ansible, Puppet, and Chef are all decent configuration management tools. We chose Salt because we prioritized the following benefits:



  • Modularity and an API that is available in the free version, unlike Ansible.
  • Python: you can easily understand any component and write the missing functionality yourself.
  • High performance and scalability. The master keeps a persistent connection with the minions over ZeroMQ for maximum performance.
  • Reactors: a kind of trigger that fires when a certain message appears on the message bus.
  • Orchestration: the ability to build complex relationships and perform actions in a specific order, for example, configure the load balancer first and then the web server cluster.
  • Puppet and Chef are written in Ruby. Our team has no specialist in that language, while Python is well known and widely used by us.
  • Teams that have used Ansible before will appreciate the ability to reuse Ansible playbooks, which makes migrating to Salt painless.


Please Note:



We have been using Salt for almost two years now and we advise you to pay attention to the following points:



  • SaltStack lets you solve the same task in several different ways, so it is important to agree on conventions within the team: for example, do not fall back on cmd.run where a declarative state such as file.managed will do.


Our Grafana dashboard.



Given:



So, our initial configuration is:



  • 2 Git repositories (one for engineers and admins; the second for highly critical servers, available only to admins);
  • part of the data in MySQL;
  • the rest in Puppet 3.8 (with heavily overused inheritance and almost no use of Hiera, the key-value store).


Objective: to migrate the configuration management system to Salt, increase its performance, make server management more convenient and understandable.



Solution:



First of all, we started migrating the original configuration to Salt on individual non-critical service servers, getting rid of obsolete code along the way.



Then we prepared the configuration for VDS servers. In our case, these are profiles for service servers, development servers and client servers.



The main problem with the transition from Puppet to Salt was the outdated operating systems (in 2018 we still had Ubuntu 12.04 and 14.04). We had to upgrade the OS before migrating without affecting the operation of the service or server. Otherwise, everything was easy enough: colleagues gradually got involved in the process.



Among the main advantages, the team noted, for example, the more understandable syntax. My colleagues and I agreed to follow the Salt Best Practices, but supplemented them with our own recommendations reflecting our specifics.



The team also evaluated the configuration delivery methods: push (the master "pushes" the configuration to the minions) and pull (the minion "pulls" it). Masterless mode helps out when you need to test something simple without touching the Git repository. Running a minion in masterless mode lets you use Salt configuration management on a single machine without reaching out to a Salt master on another machine: the configuration is entirely local.
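For a quick local experiment, a masterless run looks roughly like this (a minimal sketch; the "nginx" state name is just a placeholder):

    # /etc/salt/minion (or a drop-in in /etc/salt/minion.d/masterless.conf):
    # tell the minion to read states and pillars from the local filesystem
    file_client: local

    # apply a state locally, without contacting a master:
    salt-call --local state.apply nginx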



With this setup we had no major problems up to 300 minions. The master at that time was a VDS with 6 cores and 4 GB of memory.



However, as soon as the number of minions reached 300, the Load Average (the average system load) rose to 3.5-4 and the system slowed down badly. Previously the state.apply command took 30-40 seconds; now it took 18 minutes!



Such a result was, of course, unacceptable to us. Moreover, engineers at other companies have written about success stories with 10,000 minions. We started to figure out what was going on.



Observing the master did not give a clear answer. There was enough memory, the network was not loaded, and the disk was only about 10% utilized. We thought GitLab was to blame, but it turned out to be innocent too.



It seemed there was simply not enough CPU power: when we added cores, the Load Average dropped as expected, and state.apply ran faster, about 5-7 minutes, but still not as fast as we wanted.



Adding workers partially solved the problem, but it significantly increased memory consumption.
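For reference, the number of workers is a single master setting; the value below is only an illustration and has to be weighed against the available memory:

    # /etc/salt/master
    worker_threads: 12    # more MWorker processes handle more minion requests in parallel,
                          # but each one costs additional memory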



Then we decided to optimize the configuration itself.



Stage 1



Since pillars are a protected store, accessing them involves encryption operations, and you pay for that access with CPU time. So we reduced the number of calls to the pillars: the same data was fetched only once; if it was needed elsewhere, it was accessed through an import ({%- from 'defaults/pillar.sls' import profile %}).
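A minimal sketch of this pattern (the file layout, the webserver state and the nginx_port key are illustrative; only defaults/pillar.sls and the profile variable come from our configuration):

    # defaults/pillar.sls -- fetch the pillar data once and export it as a template variable
    {% set profile = salt['pillar.get']('profile', {}) %}

    # webserver/init.sls -- reuse the already-fetched data instead of calling the pillar again
    {%- from 'defaults/pillar.sls' import profile with context %}

    nginx_conf:
      file.managed:
        - name: /etc/nginx/nginx.conf
        - source: salt://webserver/files/nginx.conf.j2
        - template: jinja
        - context:
            listen_port: {{ profile.get('nginx_port', 80) }}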



The configuration is applied once an hour, with the execution time chosen randomly. After analyzing how many tasks run per minute and how evenly they are spread across the hour, we found that most tasks run at the beginning of the hour, from the 1st to the 8th minute, and at the 34th minute none at all! We wrote a runner that goes through all the minions once a week and distributes their tasks evenly. Thanks to this approach the load became uniform, with no spikes.
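The runner itself is out of scope here, but the hourly run with a random offset can be expressed with the stock scheduler roughly like this (a sketch; the job name is arbitrary):

    # /etc/salt/minion.d/schedule.conf
    schedule:
      highstate:
        function: state.apply
        minutes: 60        # run once an hour
        splay: 3600        # add a random delay of up to an hour (in seconds)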



There were suggestions to move to a dedicated physical server, but at the time we didn't have one and ... we solved the problem differently: we added some memory and placed the entire cache in it. Looking at the Grafana dashboard, we first thought the salt-master had stopped working, since the Load Average had dropped to 0.5. We checked the execution time of state.apply and were surprised again: 20-30 seconds. It was a victory!
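For illustration, one way to keep the master cache in RAM is a tmpfs mount (the size below is a placeholder; a cache is safe to lose on reboot, the master simply rebuilds it):

    # /etc/fstab -- keep the master cache directory in memory
    tmpfs   /var/cache/salt   tmpfs   size=2G,mode=0750   0 0

    # apply without rebooting:
    mount /var/cache/salt && systemctl restart salt-master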



Stage 2



Six months later the number of minions had grown to 650, and ... performance degradation returned. The Load Average graph grew along with the number of minions.



The first thing we did was enable the pillar cache and set its lifetime to one hour (pillar_cache_ttl: 3600). We understood that our commits would no longer take effect instantly: we would have to wait for the master to refresh the cache.
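The corresponding master settings look like this (the backend choice is yours; 'memory' trades RAM for even faster access):

    # /etc/salt/master
    pillar_cache: True
    pillar_cache_ttl: 3600        # keep the compiled pillar for an hour
    pillar_cache_backend: disk    # 'memory' is also available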



Since we didn't want to wait at all, we added hooks in GitLab. They let us specify in a commit which minion's cache needed to be refreshed. The pillar cache significantly reduced the load and the time it takes to apply the configuration.
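A hypothetical sketch of such a hook as a GitLab CI job; the commit-message marker, the job name and the assumption that the runner executes on the Salt master are all ours:

    # .gitlab-ci.yml -- clear the cached pillar of the minion named in the commit message,
    # e.g. "update nginx profile [cache=web01.example.com]"
    refresh_pillar_cache:
      stage: deploy
      script:
        - MINION=$(git log -1 --pretty=%B | sed -n 's/.*\[cache=\([^]]*\)\].*/\1/p')
        - if [ -n "$MINION" ]; then salt-run cache.clear_pillar "$MINION"; fi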



Stage 3



We meditated a bit over the debug logs and came up with a hypothesis: what if we increase the update interval of the gitfs file backend and the file list cache (gitfs_update_interval, fileserver_list_cache_time)? They were refreshed once a minute, and the refresh sometimes took up to 15 seconds. By increasing the interval from 1 minute to 10 minutes we won on speed again: LA dropped from 1.5 to 0.5, and the time to apply the configuration fell to the desired 20 seconds. Although LA grew again after a while, the execution speed of state.apply did not change significantly. We also added a forced refresh of these caches to the hooks on git push.
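The relevant master settings and the forced refresh used in the hooks (the 10-minute values are the ones described above):

    # /etc/salt/master
    gitfs_update_interval: 600          # poll the git remotes every 10 minutes instead of every minute
    fileserver_list_cache_time: 600     # cache the file list for 10 minutes

    # forced refresh triggered from the git push hook:
    salt-run fileserver.update
    salt-run fileserver.clear_file_list_cache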





We added analytics to Elasticsearch: we rewrote the built-in elasticsearch_return returner, and now we can monitor the results of state.apply (average execution time, the slowest state, and the number of errors).
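For reference, the stock returner (before our modifications) is wired up roughly like this; the Elasticsearch host is a placeholder:

    # /etc/salt/minion -- connection settings for the built-in elasticsearch returner
    elasticsearch:
      hosts:
        - "es1.example.com:9200"
      index_date: True          # create dated indices, convenient for dashboards

    # send job results there explicitly:
    salt '*' state.apply --return elasticsearch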



Results



Now we are completely satisfied with Salt's performance. We plan to double the number of minions, and it is hard to say yet how our master will cope with that load. Perhaps we will turn to horizontal scaling or find a magic parameter. Time will tell!



If you're using gitfs as your backend, give us a high five! Chances are you're running into the same problems we did, and we'll be happy to discuss this topic in the comments.



Useful Resources





