Let's Encrypt migrates database servers to AMD EPYC



Internals of the Dell PowerEdge R7525 2U server. The two silver rectangles in the middle are AMD EPYC 7542 processors, with 64 GB memory modules above and below them. Along the left edge of the photo are 24 NVMe drives, which the large number of PCIe lanes on EPYC makes possible.



Let's Encrypt is the largest certificate authority (CA) on the Internet: more than 235 million websites rely on its free TLS certificates. At the heart of the CA is the database used to manage certificates. Its performance must keep up with demand; otherwise certificate issuance suffers from API errors and timeouts.



At the end of 2020, the non-profit organization upgraded its servers.



Let's Encrypt's core software is Boulder, an open-source CA with ACME support. It uses MySQL-style schemas and queries to manage user accounts and the entire certificate issuance process. Boulder works with a single MySQL, MariaDB, or Percona database. Let's Encrypt currently runs MariaDB with the InnoDB engine.



The CA runs on a single database to minimize complexity. The developers say this is good for security, reliability, and ease of maintenance. Let's Encrypt keeps several database replicas active at any given time, and some reads are directed to the replica servers to reduce the load on the primary database.
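The split between a primary and read replicas can be pictured with a minimal sketch. This is not Boulder's code: the `ReadWriteRouter` class, the host names, and the `allow_replica` flag are all hypothetical, and treating every `SELECT` as a replica candidate is a simplifying assumption.

```python
import itertools


class ReadWriteRouter:
    """Pick a database host per statement (illustrative, not Boulder's logic)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Rotate through replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas or [primary])

    def pick(self, sql, allow_replica=True):
        """Return the host that should run `sql`.

        Writes always go to the primary. Reads go to a replica only when the
        caller can tolerate replication lag (allow_replica=True).
        """
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and allow_replica:
            return next(self._replicas)
        return self.primary


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
    print(router.pick("SELECT id FROM certificates WHERE serial = ?"))
    print(router.pick("INSERT INTO certificates (serial) VALUES (?)"))
```

The `allow_replica` switch reflects the article's wording that only some reads go to replicas: a read that must see the latest write would still be sent to the primary.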



One consequence of this design is that the database servers must be powerful enough to carry the whole load. Had they been unable to cope, Let's Encrypt would eventually have had to split the single database into several; the upgrade made it possible to avoid that.



Specifications



The previous servers were powerful but regularly ran at the limit of their capacity. For the new generation, the goal was to more than double nearly every performance metric in the same 2U form factor. To that end, Let's Encrypt chose AMD EPYC processors and settled on the Dell PowerEdge R7525 as the best option. Its specifications, compared with the old servers:



          Previous generation                  New generation
CPU       2x Intel Xeon E5-2650                2x AMD EPYC 7542
          (24 cores / 48 threads total)        (64 cores / 128 threads total)
Memory    1 TB, 2400 MT/s                      2 TB, 3200 MT/s
Storage   24x 3.8 TB Samsung PM883 SATA SSD    24x 6.4 TB Intel P4610 NVMe SSD
          (560/540 MB/s read/write)            (3200/3200 MB/s read/write)


As you can see, the number of cores has more than doubled, the amount of memory has doubled, and the nominal SSD throughput has increased more than fivefold.





1 - handle, 2 - riser expansion module 1, 3 - first power supply, 4 - riser expansion module 2, 5 - heatsink for the first processor, 6 - DIMM slots for the first processor, 7 - fans, 8 - service tag, 9 - drive backplane, 10 - fan cage, 11 - DIMM slots for the second processor, 12 - heatsink for the second processor, 13 - motherboard, 14 - second power supply, 15 - riser expansion module 3, 16 - riser expansion module 4



Each server has two AMD EPYC processors for a total of 64 physical cores, running at a 2.9 GHz base clock that boosts to 3.4 GHz under load. More importantly, EPYC provides 128 PCIe v4.0 lanes, which is what allows 24 NVMe drives to fit in one machine. Because they connect over PCIe rather than SATA, these drives are extremely fast: 5.7 times the rated throughput of the SATA SSDs in the previous generation. PCIe lanes are usually in short supply: mainstream processors typically have only 16, and Intel Xeon chips have 48.
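The arithmetic behind these figures is easy to check. The sketch below uses the numbers quoted in the article plus one assumption that the article does not state: that each NVMe drive occupies a PCIe x4 link. It also treats the quoted lane counts as the whole budget and ignores lanes consumed by other devices.

```python
# Figures quoted in the article (rated read throughput, same units for both).
SATA_SSD_READ = 560
NVME_SSD_READ = 3200
DRIVES_PER_SERVER = 24
EPYC_PCIE_LANES = 128
XEON_PCIE_LANES = 48

# Assumption, not from the article: one NVMe drive uses a PCIe x4 link.
LANES_PER_NVME_DRIVE = 4


def speedup(new, old):
    """Ratio of the new figure to the old one."""
    return new / old


def lanes_left(cpu_lanes, drives, lanes_per_drive=LANES_PER_NVME_DRIVE):
    """Lanes remaining after attaching `drives`; negative means they do not fit."""
    return cpu_lanes - drives * lanes_per_drive


if __name__ == "__main__":
    print(round(speedup(NVME_SSD_READ, SATA_SSD_READ), 1))  # 5.7
    print(lanes_left(EPYC_PCIE_LANES, DRIVES_PER_SERVER))   # 32
    print(lanes_left(XEON_PCIE_LANES, DRIVES_PER_SERVER))   # -48
```

Under that x4 assumption, 24 drives need 96 lanes: within the 128-lane figure, with 32 lanes to spare for network adapters and other devices, but twice the 48-lane figure.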



Impact on performance



Let's Encrypt reports the median request processing time, because this is the metric users feel most directly. Before the upgrade, the median API request took about 90 ms; after the upgrade, about 9 ms.







The CPU load graph shows that the old processors were running at their limits. In the week before the upgrade, CPU load on the primary database server (from /proc/stat) averaged over 90%.



The new AMD EPYC processors run at about 25% of their maximum capacity. This can be seen in the graph at September 15, when the new server was promoted from replica (read-only) to primary (read/write).
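For readers unfamiliar with how a CPU load percentage is derived from /proc/stat, here is a minimal sketch. It is not Let's Encrypt's monitoring code. The helper names are hypothetical, and counting `iowait` as idle time is a choice made here; monitoring tools differ on that point.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    # guest and guest_nice are already counted in user and nice, so stop at steal.
    total = sum(values[:8])
    return total - idle, total


def cpu_utilization(before, after):
    """Percent of CPU time spent busy between two samples of the 'cpu' line."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * (busy1 - busy0) / elapsed


if __name__ == "__main__":
    # Made-up counter values, chosen only to illustrate the calculation.
    first = "cpu  1000 0 500 8000 500 0 0 0 0 0"
    second = "cpu  1800 0 600 8090 510 0 0 0 0 0"
    print(cpu_utilization(first, second))
```

On a Linux host, the two samples would come from reading the first line of /proc/stat twice, some seconds apart. The counters are cumulative since boot, so only the difference between samples is meaningful.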







The upgrade has also significantly reduced overall database latency. The average query response time (from INFORMATION_SCHEMA) used to be around 0.45 ms.







Queries are now processed about three times faster on average, in roughly 0.15 ms.
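The article does not say which INFORMATION_SCHEMA table the average comes from. As a sketch, assume a histogram shaped like MariaDB's QUERY_RESPONSE_TIME plugin table, where each bucket has a query count and a total execution time. The mean is then total time divided by total queries. The function name and the sample buckets below are made up for illustration.

```python
def mean_latency_ms(buckets):
    """Mean query time in milliseconds.

    `buckets` is an iterable of (query_count, total_seconds) pairs, one per
    histogram bucket (an assumed shape, see the note above).
    """
    rows = list(buckets)
    queries = sum(count for count, _ in rows)
    if queries == 0:
        raise ValueError("no queries recorded")
    seconds = sum(total for _, total in rows)
    return 1000.0 * seconds / queries


if __name__ == "__main__":
    # Made-up buckets, not measurements: 1000 queries taking 0.45 s in total.
    old = [(700, 0.21), (300, 0.24)]
    # Made-up buckets: 1000 queries taking 0.15 s in total.
    new = [(900, 0.09), (100, 0.06)]
    print(mean_latency_ms(old), mean_latency_ms(new))
    print(mean_latency_ms(old) / mean_latency_ms(new))
```

A mean computed this way is pulled upward by the slowest buckets, which is one reason it is reported separately from the median API latency earlier in the article.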







OpenZFS and NVMe



NVMe drives are increasingly popular because of their performance. Until recently, however, it was almost impossible to put many of them in one server: each drive consumes PCIe lanes, and a processor supports only a limited number of them. Intel Xeon supports 48 PCIe v3 lanes, some of which go to the chipset, network adapters, and GPUs, leaving few lanes for NVMe.



The latest generation of AMD EPYC processors supports 128 PCIe lanes, more than double Intel's, and they are PCIe v4. That is enough to fill a 2U server with NVMe drives (24 in the Dell chassis).



With a server full of NVMe drives, the next decision is how to manage them. The previous generation of Let's Encrypt servers used hardware RAID in a RAID-10 configuration, but there is no effective hardware RAID for NVMe, so another solution was needed. Software RAID (mdraid on Linux) was one option, but the engineers were advised to look at OpenZFS. They decided to give it a try and are quite satisfied with the result.



They note that little information is available online about how best to tune OpenZFS for NVMe pools and database workloads, so they documented their experience in detail. It may be useful to others.



Let's Encrypt says the upgrade was necessary and, in a sense, forced: the number of users of the free CA keeps growing, as does the number of TLS certificates issued. The servers were expensive and the migration was a serious technical challenge for the organization's engineers, but everything went well.



Let's Encrypt is a non-profit certificate authority funded by sponsorships and individual donations.


