Post mortem on the Quay.io outage

Translator's note: In early August, Red Hat publicly explained how it resolved the availability problems that had plagued users of its Quay.io service (a container image registry the company acquired along with CoreOS) in the preceding months. Regardless of your interest in the service itself, the path the company's SRE engineers took to diagnose and eliminate the causes of the outage is instructive.







In the early morning of May 19 (EDT), quay.io went down. The outage affected both quay.io customers and the open source projects that use quay.io as a platform for building and distributing their software. Red Hat values the trust of both.



The SRE team immediately stepped in and worked to stabilize the Quay service as quickly as possible. While they did so, however, clients lost the ability to push new images and could only occasionally pull existing ones. For some unknown reason, the quay.io database locked up once the service was scaled up to full capacity.



"What has changed?" is the first question usually asked in such cases. We noticed that shortly before the incident, the OpenShift Dedicated cluster that quay.io runs on had started updating to version 4.3.19. Since quay.io runs on Red Hat OpenShift Dedicated (OSD), regular updates were routine and had never caused problems. Moreover, over the previous six months we had upgraded the Quay clusters several times without any service interruption.



While we were trying to restore the service, other engineers began preparing a new OSD cluster running the previous version of the software, so that everything could be redeployed on it if necessary.



Root Cause Analysis



The main symptom of the outage was an avalanche of tens of thousands of database connections that rendered the MySQL instance effectively inoperable. This made it difficult to diagnose the problem. We set a limit on the maximum number of client connections to help the SRE team assess the issue. We did not notice any unusual traffic to the database: in fact, most of the queries were reads, and only a few were writes.
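To give a concrete idea of the kind of check involved, here is a minimal sketch of how connection pressure on a MySQL-compatible instance can be observed from Python. The host name, credentials and the 80% warning threshold are placeholders of ours, not values from the actual incident.

```python
# Sketch: watch how close the instance is to its configured connection limit.
# Host, credentials and the warning threshold are illustrative placeholders.
import pymysql

conn = pymysql.connect(host="quay-db.example.internal",
                       user="monitor", password="change-me", database="quay")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW STATUS LIKE 'Threads_connected'")
        _, connected = cur.fetchone()
        cur.execute("SHOW VARIABLES LIKE 'max_connections'")
        _, max_conns = cur.fetchone()
        usage = int(connected) / int(max_conns)
        print(f"{connected}/{max_conns} connections in use ({usage:.0%})")
        if usage > 0.8:
            print("WARNING: approaching the configured connection limit")
finally:
    conn.close()
```

The limit itself (`max_connections`) can also be lowered on the database side, which is the kind of cap that helps keep an avalanche of clients from drowning out diagnostic queries.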



We also tried to identify a pattern in the database traffic that could have caused the avalanche, but found none in the logs. While waiting for the new cluster with OSD 4.3.18 to be ready, we kept trying to launch the quay.io pods. Every time the cluster reached full capacity, the database would freeze, which meant restarting the RDS instance in addition to all the quay.io pods.



By the evening we had stabilized the service in read-only mode and disabled as many non-essential functions as possible (for example, garbage collection in namespaces) to reduce the load on the database. The hangs stopped, but the cause was never found. The new OSD cluster was ready, so we migrated the service, switched over the traffic, and continued monitoring.



Quay.io ran stably on the new OSD cluster, so we went back to the database logs but could not find any correlation explaining the locks. OpenShift engineers worked with us to determine whether the changes in Red Hat OpenShift 4.3.19 might have caused the Quay issues. However, nothing was found, and the problem could not be reproduced in the lab.



Second failure



On May 28, shortly before noon EDT, quay.io went down again with the same symptom: the database locked up. Again we threw all our strength into the investigation. First of all, the service had to be restored. This time, however, restarting RDS and restarting the quay.io pods achieved nothing: another avalanche of connections swamped the database. But why?



Quay is written in Python, and each pod runs as a single monolithic container in which many parallel tasks execute concurrently. We use the gevent library under gunicorn to process web requests. When Quay receives a request (via our own API or via the Docker API), it is assigned to a gevent worker. That worker typically needs to talk to the database. After the first failure, we discovered that the gevent workers were connecting to the database using the default settings.
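For readers unfamiliar with this stack, the sketch below shows roughly what such a gunicorn/gevent setup looks like; the worker counts and the file name are illustrative and are not taken from Quay's actual configuration.

```python
# gunicorn_conf.py -- illustrative configuration, not Quay's real settings.
# The gevent worker class lets a single worker process serve many requests
# concurrently on lightweight greenlets.
worker_class = "gevent"
workers = 4                  # worker processes per container
worker_connections = 1000    # concurrent greenlets per worker process

# With default database settings, every greenlet that touches the database
# may open its own connection, so one pod can theoretically account for up to
# workers * worker_connections connections -- multiplied again by the number
# of pods in the cluster.
```

A server started with `gunicorn -c gunicorn_conf.py app:application` would pick these values up; the comment above is the essence of why defaults became dangerous at scale.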



Given the significant number of Quay pods and the thousands of incoming requests per second, the sheer number of database connections could theoretically overwhelm the MySQL instance. From monitoring we knew that Quay handles an average of about 5,000 requests per second, and the number of database connections was roughly the same. Five thousand connections fit comfortably within the capabilities of our RDS instance, which cannot be said of tens of thousands. For some reason there were unexpected spikes in the number of connections, but we saw no correlation with the incoming requests.



This time we were determined to find and fix the source of the problem rather than just reboot. We changed the Quay codebase to limit the number of database connections per gevent worker. The limit became a configuration parameter that could be changed on the fly, without building a new container image. To find out how many connections we could realistically handle, we ran several tests in a staging environment with different values to see how they affected our load-testing scenarios. It turned out that Quay starts returning 502 errors once the number of connections exceeds 10,000.
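One way to express such a cap (a sketch only, since Quay's actual mechanism may differ) is a per-worker connection pool whose size is read from configuration, for example via peewee's pooled database class:

```python
# Sketch: a configurable ceiling on database connections per worker process.
# The parameter name, host and database name are illustrative placeholders.
import os
from playhouse.pool import PooledMySQLDatabase  # peewee's connection pool

# Read the cap from configuration so it can be changed "on the fly"
# (here an environment variable stands in for the real config source).
MAX_DB_CONNECTIONS = int(os.environ.get("DB_CONNECTION_POOLING_MAX", "20"))

db = PooledMySQLDatabase(
    "quay",
    host="quay-db.example.internal",
    user="quay",
    password=os.environ.get("DB_PASSWORD", ""),
    max_connections=MAX_DB_CONNECTIONS,  # hard per-worker ceiling
    stale_timeout=300,                   # recycle idle connections after 5 min
)
```

With a ceiling like this, a surge of requests fails fast or waits at the application layer instead of piling thousands of new connections onto the database.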



We immediately deployed the new version to production and started watching the graph of database connections. In the past, the database had locked up after about 20 minutes. After 30 trouble-free minutes we had hope, and an hour later, confidence. We restored push traffic to the site and began the post-mortem analysis.



Although we had managed to work around the problem that led to the locking, we still did not know its true cause. We confirmed that it was not related to any changes in OpenShift 4.3.19, since the same thing happened on version 4.3.18, which had previously run Quay without any problems.



There was clearly something else lurking in the cluster.



A deeper look



Quay.io had used the default database connection settings for six years without any problems. What changed? Clearly, traffic to quay.io had been growing steadily all that time. It looked as if some threshold had been reached that triggered the avalanche of connections. We continued to examine the database logs after the second crash, but found no patterns or obvious relationships.



In the meantime, the SRE team worked on improving Quay's request observability and overall service health. New metrics and dashboards were deployed showing which parts of Quay are in the highest demand from customers.
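As an illustration of that observability work, below is a minimal sketch of per-endpoint metrics using the Prometheus Python client; the metric and label names are ours, not the ones actually added to Quay.

```python
# Sketch: per-API-group request metrics for dashboards that show which parts
# of the service are most in demand. Names are illustrative.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "quay_http_requests_total",
    "HTTP requests handled, by API group and response status",
    ["api_group", "status"],
)
LATENCY = Histogram(
    "quay_http_request_duration_seconds",
    "Request latency in seconds, by API group",
    ["api_group"],
)

def record_request(api_group: str, status: int, seconds: float) -> None:
    """Call from request-handling middleware after each response."""
    REQUESTS.labels(api_group=api_group, status=str(status)).inc()
    LATENCY.labels(api_group=api_group).observe(seconds)

if __name__ == "__main__":
    start_http_server(9091)             # expose /metrics for Prometheus
    record_request("appr", 200, 0.042)  # e.g. an App Registry request
```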



Quay.io worked fine until June 9. That morning (EDT), we again saw a significant increase in the number of database connections. This time there was no downtime: the new parameter capped their number and kept them within what MySQL could handle. However, for about half an hour many users noticed that quay.io was slow. We quickly collected all the data we could using the newly added monitoring tools, and a pattern suddenly emerged.



Just before the jump in connections, a large number of requests had been sent to the App Registry API. The App Registry is a little-known feature of quay.io that allows you to store things like Helm charts and metadata-rich containers. Most quay.io users do not use it, but Red Hat OpenShift does, actively: the OperatorHub built into OpenShift stores all of its operators in the App Registry. These operators form the foundation of the OpenShift workload ecosystem and the partner-focused operating model (as part of Day 2 operations).



Every OpenShift 4 cluster uses operators from the built-in OperatorHub to publish a catalog of the operators available for installation and to provide updates for those already installed. As OpenShift 4 grew in popularity, so did the number of clusters running it around the world. Each of those clusters downloads operator content to run the built-in OperatorHub, using the App Registry inside quay.io as its backend. While hunting for the source of the problem, we had missed the fact that as OpenShift gradually gained popularity, the load on one of the rarely used quay.io features grew along with it.



We analyzed the App Registry request traffic and looked into the registry code. Flaws were immediately apparent: the database queries it generated were suboptimal. They caused no trouble under light load, but became a source of problems as the load increased. The App Registry turned out to have two problematic endpoints that handled increased load poorly: the first returned a list of all packages in a repository, the second returned all blobs for a package.
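The sketch below illustrates the kind of inefficiency involved, on a hypothetical schema (these are not Quay's actual tables or queries): the first function issues one query per package, the second fetches the same data in a single JOIN.

```python
# Hypothetical example of the N+1 query pattern versus a single JOIN,
# using a DB-API cursor. Table and column names are invented.

def blobs_for_repo_slow(cur, repo_id):
    # One query for the package list, then one more query per package:
    # harmless with a handful of packages, painful with thousands of them
    # multiplied by thousands of clusters polling the endpoint.
    cur.execute("SELECT id FROM packages WHERE repo_id = %s", (repo_id,))
    blobs = []
    for (package_id,) in cur.fetchall():
        cur.execute("SELECT digest, size FROM blobs WHERE package_id = %s",
                    (package_id,))
        blobs.extend(cur.fetchall())
    return blobs

def blobs_for_repo_fast(cur, repo_id):
    # A single round trip, regardless of how many packages the repo holds.
    cur.execute(
        """SELECT b.digest, b.size
             FROM blobs b
             JOIN packages p ON p.id = b.package_id
            WHERE p.repo_id = %s""",
        (repo_id,),
    )
    return cur.fetchall()
```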



Eliminating the causes



Over the next week we optimized the App Registry code and its environment. Obviously inefficient SQL queries were reworked, unnecessary calls to the tar command (it ran every time blobs were fetched) were eliminated, and caching was added wherever possible. We then ran extensive performance tests and compared the App Registry's performance before and after the changes.
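As a sketch of the caching part of this work (the function name and the use of a local archive path are illustrative, not Quay's actual code), the idea is to memoize an expensive per-blob operation so it runs once per blob instead of once per request:

```python
# Sketch: memoize an expensive per-blob operation instead of repeating it
# (previously an external `tar` process ran on every fetch).
import functools
import tarfile

@functools.lru_cache(maxsize=1024)
def blob_member_names(blob_path: str) -> tuple[str, ...]:
    """List the file names inside a blob archive, computed at most once
    per path and served from memory afterwards."""
    with tarfile.open(blob_path) as tf:
        return tuple(tf.getnames())
```

In a real service the cache key would more likely be the blob digest and the cache would live in a shared store rather than process memory, but the effect is the same: the repeated work disappears from the hot path.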



API requests that used to take up to half a minute now completed in milliseconds. We rolled the changes out to production the following week, and quay.io has been stable ever since. During that time there have been several spikes in traffic to the App Registry endpoints, but the improvements have prevented further database outages.



What have we learned?



Obviously, every service tries to avoid downtime. In our case, we believe the recent outages have helped make quay.io better. We took away several key lessons that we want to share:



  1. Data about who uses your service, and how, is never superfluous. Because Quay "just worked," we never had to spend time optimizing traffic and managing load. All of this created a false sense of security that the service could scale indefinitely.
  2. , . Quay , . , — , .
  3. Evaluate the impact of each of the service's features. Customers rarely used the App Registry, so it was not a priority for our team. When certain features of a product are barely used, their bugs rarely surface and developers stop paying attention to the code. It is easy to fall into the trap of thinking that this is how it should be, until one day that feature ends up at the center of a massive incident.


What's next?



The work of keeping the service stable never stops, and we are constantly improving it. Traffic volumes on quay.io continue to grow, and we recognize that we must do everything we can to live up to our customers' trust. We are therefore currently working on the following tasks:



  1. , RDS.
  2. RDS. . , ( ); .
  3. . , .
  4. Deploying a web application firewall (WAF) to sit between clients and quay.io.
  5. Starting with upcoming releases, Red Hat OpenShift will move away from the App Registry in favor of Operator Catalogs based on container images hosted on quay.io.
  6. Support for the Open Container Initiative (OCI) artifact specification could be a long-term replacement for the App Registry. It is currently being implemented as native Quay functionality and will become available to users once the specification itself is finalized.


All of the above is part of Red Hat's ongoing investment in quay.io as we move from a small, startup-like team to a mature, SRE-driven platform. We know that many of our customers, including Red Hat itself, rely on quay.io in their day-to-day work, and we try to be as open as possible about the recent disruptions and our ongoing efforts to improve.


