- how PowerShell DSC works and how it differs from Ansible when managing infrastructure on Windows;
- why we switched to Ansible;
- what problems we faced and how we solved them;
- how expectations and reality compare after switching to Ansible;
- who should choose PowerShell DSC and who should choose Ansible.
Why PowerShell DSC was originally chosen
Mindbox has a mature DevOps/SRE culture despite its predominantly Windows infrastructure: Hyper-V, IIS, MS SQL Server. Although the company is gradually moving to Linux and open source, Windows still prevails.
To manage this infrastructure, we planned to use infrastructure as code: write the code, save it to a repository, and then use some tool to turn the code into real infrastructure. While Ansible is the most popular code-based infrastructure management tool, it has traditionally been associated with the Linux world. We wanted something native and Windows-specific, so we chose PowerShell DSC.
How PowerShell DSC works
PowerShell Desired State Configuration (DSC) is a service that comes with Windows out of the box and helps you manage your infrastructure through configuration files. It takes infrastructure code written in PowerShell as input and internally transforms it into commands that configure the system. Besides trivial operations, such as installing Windows components, modifying registry keys, creating files, or configuring services, it can do many of the things that PowerShell scripts usually do. For example, it can fully configure DNS or set up a highly available MS SQL Server instance.
Useful links for the diagram:
- Example of a simple configuration for DSC
- How to use datafiles
- SQL Server, Windows Server 2019
- DSC pull server, SQL, Windows Server 2019
DSC vs. Ansible
Criterion | DSC | Ansible |
Requirements | Pull server, .NET Framework 4.0, WMF 5.1 | ansible, ansible-playbook, ansible-inventory; Python on Linux hosts |
Pull/push mode | Pull and push | Push |
Ready-made modules | ~1300 in PowerShell Gallery | ~20000 in Ansible Galaxy |
Configuration language | PowerShell | YAML |
Why DSC did not suit us
DSC did not meet all of our expectations. In addition, new needs arose along the way that DSC could not satisfy.
Developers cannot use the tool on their own without the help of an SRE. Although almost every team has an SRE, an IaC tool should be simple enough for a developer to use it and spend no more than half an hour on a task. DSC allows not only declarative code but also any PowerShell constructs. This means there is a high chance of writing code that is hard to maintain or that causes an infrastructure failure, for example, by deploying an application with incorrect parameters to the wrong environment.
There is no way to do a dry run of a configuration before rolling it out, to see exactly which changes will be applied and which will not.
It is difficult to organize syntax and code style checks for DSC. There are few validation tools, and they are not kept up to date. For Ansible we had already set this up.
In DSC push mode, there is no convenient way to track the status of tasks. If a configuration was applied with an error, diagnosing it takes extra steps: running commands to get the configuration application status and looking at the event logs. If the error occurred on multiple servers, this is time-consuming.
Pull mode never became an advantage. In pull mode the configuration is applied asynchronously, so without workarounds it is impossible to find out exactly when the new configuration has finished applying.
Using two different IaC tools to configure servers is redundant. Ansible can do the same things as DSC, and we already use Ansible to configure Linux hosts and network equipment.
How we planned to switch from DSC to Ansible
At first, the task seemed simple: about a month of work. We identified three stages:
- learn how to connect to Windows hosts using Ansible;
- rewrite DSC configurations using Ansible modules;
- remove DSC pull server, its database and other artifacts.
Here's what the workflow was on DSC, and how we planned to organize it in Ansible:
The standard structure of roles in Ansible
In Ansible, we planned to move the complex code that installs and configures things into roles and to split the roles into separate repositories. The main Ansible repository should contain only role calls, overrides of role parameters, and lists of servers by group. This way, not only an SRE but any developer could deploy a role to the required servers or tweak a parameter without delving into the logic of the infrastructure code. A developer can change the role code itself only after an SRE review.
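A minimal sketch of what a playbook in the main repository might look like under this split. The file name, the group `web_servers`, the role `iis_site` and the variable `iis_site_port` are hypothetical and serve only to illustrate the idea; they are not taken from the actual Mindbox repositories.

```yaml
# site.yml in the main repository (all names here are hypothetical)
- name: Configure web servers
  hosts: web_servers            # group from the inventory kept in the main repository
  roles:
    - role: iis_site            # the role logic lives in its own repository
      vars:
        iis_site_port: 8080     # a developer overrides only this parameter
```

With this layout a developer touches only the `hosts` and `vars` lines, while the tasks inside the role stay behind SRE review.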
What difficulties we faced when switching to Ansible and how we solved them
When the work began, we realized that we had been wrong: the task was not easy. The repository was the only part that caused no problems; everything else required a lot of research and rework.
WinRM or SSH
The first surprise was the choice of the connection type. For Windows there are two options: WinRM and SSH. It turned out that Ansible runs slowly over WinRM. At the same time, Ansible does not recommend using the out-of-the-box OpenSSH on Windows Server 2019. So we came up with our own solution:
- We forked and reworked a role from Galaxy.
- We wrote a playbook that does nothing but call this role. It is the only playbook that connects to hosts via WinRM.
- We monitor port 22/tcp and the OpenSSH version with the standard tools of Prometheus Blackbox Exporter:
    - alert: SSHPortDown
      expr: probe_success{job=~".*-servers-ssh",instance!~".*domain..ru"} == 0
      for: 1d
      annotations:
        summary: "Cannot reach {{`{{ $labels.instance }}`}} with SSH"
- An LDAP inventory plugin builds the list of Windows hosts:
    plugin: ldap_inventory
    domain: 'ldaps://domain:636'
    search_ou: "DC=domain,DC=ru"
    ldap_filter: "(&(objectCategory=computer)(operatingSystem=*server*)(!(userAccountControl:1.2.840.113556.1.4.803:=2)))"
    validate_certs: False
    exclude_hosts:
      - db-
    account_age: 15
    fqdn_format: True
- OpenSSH is installed on the hosts, so Windows hosts are reachable over SSH.
- OpenSSH is built into the image. The image is built with Packer, which runs Ansible:
    {
      "type": "shell-local",
      "tempfile_extension": ".ps1",
      "execute_command": ["powershell.exe", "{{.Vars}} {{.Script}}"],
      "env_var_format": "$env:%s=\"%s\"; ",
      "environment_vars": [
        "packer_directory={{ pwd }}",
        "ldap_machine_name={{user `ldap_machine_name`}}",
        "ldap_username={{user `ldap_username`}}",
        "ldap_password={{user `ldap_password`}}",
        "ansible_playbooks={{user `ansible_playbooks`}}",
        "github_token={{user `github_token`}}"
      ],
      "script": "./scripts/run-ansiblewithdocker.ps1"
    }
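The split between "SSH everywhere" and "one bootstrap playbook over WinRM" can be expressed with standard Ansible connection variables. The sketch below is an assumption about how this might be laid out: the file name, the group `windows`, the role name `openssh` and the WinRM port and transport values are hypothetical.

```yaml
# group_vars/windows.yml: default connection for all Windows hosts (hypothetical file)
ansible_connection: ssh
ansible_shell_type: powershell   # the remote shell on Windows hosts is PowerShell

---
# bootstrap-openssh.yml: the only playbook that still goes through WinRM
- name: Install OpenSSH over WinRM
  hosts: windows
  vars:
    ansible_connection: winrm        # play vars take precedence over group_vars
    ansible_port: 5986               # assumed HTTPS listener
    ansible_winrm_transport: kerberos  # assumed authentication transport
  roles:
    - role: openssh                  # hypothetical name of the forked Galaxy role
```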
When we were rewriting the code for Ansible, we periodically ran into code duplication. For example, almost all DSC configurations contained a windows_exporter setup. The only difference was the set of collectors that the exporter had to use.
To get rid of the duplicated code, we moved windows_exporter into a separate Ansible role and the parameters of this setup into host group variables.
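As an illustration, the per-group collector lists might look like the sketch below. The group names and the variable `windows_exporter_collectors` are hypothetical; only the collector names (`cpu`, `logical_disk`, `mssql`, `hyperv`) are real windows_exporter collectors.

```yaml
# group_vars/sql_servers.yml (hypothetical group)
windows_exporter_collectors:
  - cpu
  - logical_disk
  - mssql

---
# group_vars/hyperv_hosts.yml (hypothetical group)
windows_exporter_collectors:
  - cpu
  - logical_disk
  - hyperv

---
# Playbook: the same role call for every group, the difference lives in group_vars
- name: Install windows_exporter
  hosts: windows
  roles:
    - role: windows_exporter
```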
Second hop authentication
Second hop authentication is probably the most common problem faced by those who start using Ansible on Windows:
- name: Custom modules loaded into module directory
  win_copy:
    src: '\\share\dsc\modules'
    dest: 'C:\Program Files\WindowsPowerShell\Modules'
    remote_src: yes
This construct causes an Access Denied error, because by default credentials cannot be delegated for authorization on a remote resource without additional settings. One way to work around the error is, for example, new_credentials. But we preferred to take advantage of the fact that Ansible can call DSC resources through the win_dsc module. We call the File DSC resource, which by default runs under the computer account, so Kerberos delegation is not needed:
- name: Custom modules loaded into module directory
  win_dsc:
    resource_name: File
    SourcePath: '\\share\dsc\modules'
    DestinationPath: 'C:\Program Files\WindowsPowerShell\Modules'
    Type: Directory
    Recurse: true
    Force: true
    MatchSource: true
There is no contradiction in abandoning DSC while still using its resources where they solve a problem better than an Ansible module. The main goal is to stop using DSC configurations, because it was the DSC ecosystem that did not suit us, not the resources themselves. For example, if you need to create a Hyper-V virtual switch, you will have to use the DSC resource: Ansible does not yet have tools for managing the Hyper-V configuration.
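For comparison, the new_credentials workaround mentioned earlier in this section could look roughly like the sketch below. This is not the approach we chose. The account name and the vaulted password variable are hypothetical; the `runas` become method and the `logon_type` and `logon_flags` values are standard Ansible options for Windows.

```yaml
- name: Custom modules loaded into module directory
  win_copy:
    src: '\\share\dsc\modules'
    dest: 'C:\Program Files\WindowsPowerShell\Modules'
    remote_src: yes
  become: yes
  become_method: runas
  # The credentials below are used only for outbound network access (the share)
  become_flags: logon_type=new_credentials logon_flags=netcredentials_only
  vars:
    ansible_become_user: 'DOMAIN\svc_ansible'            # hypothetical account
    ansible_become_pass: "{{ vault_share_password }}"    # hypothetical vaulted variable
```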
Network disconnect
Some tasks cause a network disconnect on the servers being configured. One example is creating the external Hyper-V virtual switch mentioned above. In DSC such a call works, but in Ansible the task fails because the managed host has disconnected: Windows always drops the network connection when it creates an external virtual switch. The solution is to add the async argument to the task:
- name: External switch present
  win_dsc:
    resource_name: xVMSwitch
    Ensure: 'Present'
    Name: 'Virtual Network'
    Type: 'External'
    NetAdapterName: 'TEAM_LAN'
    AllowManagementOS: True
  async: 10
With async, Ansible sends the task to the host, waits for the specified time and only then asks for the state.
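A possible variation, shown here only as a sketch and not as what we run: start the task without polling and then explicitly wait for the host to come back. The `poll`, `delay` and `timeout` values are assumptions. Note that with `poll: 0` Ansible does not check the result of the first task.

```yaml
- name: External switch present
  win_dsc:
    resource_name: xVMSwitch
    Ensure: 'Present'
    Name: 'Virtual Network'
    Type: 'External'
    NetAdapterName: 'TEAM_LAN'
    AllowManagementOS: True
  async: 60
  poll: 0          # fire and forget: the connection is expected to drop

- name: Wait until the host is reachable again
  wait_for_connection:
    delay: 10      # give Windows time to recreate the network stack
    timeout: 300
```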
Infrastructure drift
When we started porting the code, we found configuration drift: differences between what is described in the code and the actual configuration of a server or software. The reason is that in some cases DSC did only part of the work, and the rest was done either by scripts or manually according to instructions.
To make it easier to work with IaC, we collected all the scripts and documents and turned them into uniform, unambiguous instructions. In addition, we organized the process so that no one makes accidental changes to the Ansible code. We store all the infrastructure code in GitHub and assign tasks to engineers through GitHub Projects, so we can link changes to the infrastructure code (pull requests) to tasks and see the changes made for each completed task. If a task has no associated changes, it is not accepted and is returned for revision.
Fact Gathering Bugs
Unlike DSC, Ansible gathers facts about managed hosts at startup so that the developer can define the behavior of tasks depending on the state of the host. When collecting facts from Windows hosts, Ansible may throw an error due to incorrect module code:
[WARNING]: Error when collecting bios facts: New-Object : Exception calling ".ctor" with "0" argument(s): "Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index" At line:2 char:21 + ... $bios = New-Object -TypeName
To fix it, you need to install the ansible.windows collection. Before launching each playbook, our Ansible pipeline checks for requirements.yml files with lists of required roles and collections and installs them. This is where we added the ansible.windows collection.
Collections are a relatively new concept in Ansible. Previously only roles were distributed through Galaxy; now you can also find collections of various plugins and modules, playbooks and roles there.
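A requirements.yml of this kind might look like the sketch below. The role name and the repository URL are hypothetical placeholders; the `collections` and `roles` keys and the `git+https` source format are standard ansible-galaxy syntax.

```yaml
# requirements.yml
collections:
  - name: ansible.windows        # provides the fixed Windows fact-gathering code

roles:
  - name: windows_exporter       # hypothetical role stored in its own repository
    src: git+https://github.com/example-org/ansible-role-windows-exporter.git
    version: main
```

Running `ansible-galaxy install -r requirements.yml` before the playbook installs both the roles and the collections listed in the file.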
Tests
Before handing the IaC toolkit over to the developers, we wanted to make sure that the Ansible code would be reliable and would not break anything. For DSC we had no dedicated tests, although a special framework exists for this purpose. Configurations were usually validated on staging servers, whose failure did not lead to defects.
Ansible is usually tested with Molecule, a wrapper for running tests. It is a handy tool for Linux roles, but with Windows there is a problem. Molecule used to be able to provision the infrastructure itself, but its developers have removed this capability. Now the infrastructure is provisioned either by Molecule in Docker or outside Molecule. Testing Windows roles in Docker is most often impossible: Hyper-V and most other Windows features cannot be installed in a Docker container. We would have to deploy the test infrastructure outside Molecule and use Molecule's delegated driver.
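A minimal sketch of a Molecule scenario with the delegated driver, assuming the Windows test host is provisioned outside Molecule and is reachable over SSH. The host name is hypothetical, and driver naming may differ between Molecule versions.

```yaml
# molecule/default/molecule.yml
driver:
  name: delegated
  options:
    managed: false                 # Molecule neither creates nor destroys the instance
    ansible_connection_options:
      ansible_connection: ssh
      ansible_shell_type: powershell
platforms:
  - name: win-test-01              # hypothetical pre-provisioned Windows host
provisioner:
  name: ansible
verifier:
  name: ansible
```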
We have not yet solved this problem, but we have found tools that will detect the most obvious errors:
Check | What it does | Comment |
Syntax check | Checks the syntax and whether the code can run | We run the syntax check and linting locally and in the repository: for example, as a pre-commit check and as a GitHub Action that is launched on every pull request |
Linting | Checks the code for logical errors | |
Dry run | Shows what the playbook will do before it is launched | We use it for rollouts in the pipeline: launch ansible-playbook with the check and diff flags, evaluate the changes and then confirm the rollout. When we write roles, we take into account that some tasks, for example win_command and win_shell, must state explicitly what exactly they change |
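For command tasks, "stating explicitly what they change" can be done with standard task keywords. The sketch below is illustrative: the installer path and the resulting file are hypothetical.

```yaml
# Read-only command: forced to run even in check mode and never reported as a change
- name: Get installed OpenSSH version
  win_command: ssh.exe -V
  register: ssh_version
  changed_when: false
  check_mode: no

# Changing command: guarded by `creates`, so it is skipped once the result exists
- name: Install example tool
  win_command: 'C:\temp\setup.exe /quiet'              # hypothetical installer
  args:
    creates: 'C:\Program Files\Example\example.exe'    # hypothetical result of the install
```

The dry run itself is then a regular launch with the two flags from the table, for example `ansible-playbook site.yml --check --diff`, where `site.yml` stands for whichever playbook is being rolled out.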
How the process works with Ansible
After we implemented Ansible and overcame all the difficulties, a process took shape that combines engineers' actions and automatic runs:
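- An engineer changes the code and opens a pull request in the GitHub repository.
- On the pull request, GitHub Actions runs the checks.
- The pull request is reviewed.
- Roles and collections listed in requirements.yml are installed from their GitHub repositories.
- After the pull request, GitHub Actions hands the code over to Octopus Deploy.
- Octopus Deploy runs the rollout. The ansible-playbook arguments --tags, --limit and --extra-vars can be set at launch.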
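The automatic checks on a pull request could be wired up roughly as in the sketch below. The workflow name, the playbook name `site.yml` and the installation step are assumptions; this is not the actual Mindbox workflow.

```yaml
# .github/workflows/ansible-checks.yml (hypothetical file)
name: ansible-checks
on: pull_request

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Ansible and ansible-lint
        run: pip install ansible ansible-lint
      - name: Install roles and collections
        run: ansible-galaxy install -r requirements.yml
      - name: Syntax check
        run: ansible-playbook --syntax-check site.yml   # hypothetical playbook name
      - name: Lint
        run: ansible-lint
```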
Ansible
What to choose: DSC or Ansible
Option | When to choose |
Switch from DSC to Ansible | If you already use Ansible for Linux hosts and are ready to run Ansible from Linux, for example in Docker containers in CI/CD. |
DSC | If the infrastructure is Windows only and you don't want to work with Linux. If you are ready to write your own resources for DSC. If you need to store the state of the infrastructure and correct its drift. |
Implement Ansible from scratch | If you are running a mixed Windows/Linux environment and want to convert existing scripts to infrastructure code and deploy it using CI/CD systems. |
Evgeny Berendyaev, SRE engineer