The usual troubleshooting in such cases is to carefully examine the life cycle of the affected object: see how memory is allocated for it, how it is freed, how reference counters are taken and released, paying special attention to error paths. In our case, however, different objects were corrupted each time, and checking their life cycles turned up no bugs.
The kmalloc-192 cache is quite popular in the kernel: it hosts several dozen different kinds of objects. A bug in the life cycle of one of them is the most likely cause of this kind of corruption. Even just listing all such objects is problematic; checking every one of them is out of the question. Bug reports kept arriving, but we could not find their cause by direct investigation. We needed a hint.
On our side, these bugs were investigated by Andrey Ryabinin, a memory management specialist, widely known in the narrow circles of kernel developers as the author of KASAN, an awesome technology for catching memory access errors. In fact, KASAN was exactly the tool best suited to uncover the cause of our bug. KASAN was not included in the original RHEL7 kernel, but Andrey ported the necessary patches to our OpenVZ kernel. We did not enable KASAN in the production version of our kernel, but it is present in the debug version and actively helps our QA find bugs.
In addition to KASAN, the debug kernel includes many other debug features inherited from Red Hat. As a result, the debug kernel is rather slow: QA says the same tests take 4 times longer on it. For us this is not critical, since we do not measure performance there, we look for bugs. However, such a slowdown was unacceptable for customers, and our requests to put the debug kernel into production were invariably rejected.
As an alternative to KASAN, clients were asked to enable slub_debug on the affected nodes. This technology also allows memory corruption to be detected. Using a red zone and memory poisoning for each object, the allocator checks that everything is in order every time memory is allocated and freed. If something goes wrong, it prints an error message, fixes the detected damage where possible and lets the kernel keep running. It also stores information about who last allocated and freed each object, so that in the case of post-factum detection of corruption it is possible to understand "who" this object was in a "past life". slub_debug can be enabled on a production kernel via the kernel command line, but these checks consume memory and CPU. For development and QA this is fine, but production clients use it without much enthusiasm.
Six months passed, the New Year was approaching. Local tests on the debug kernel with KASAN did not catch the problem, we received no bug reports from the nodes with slub_debug enabled, and we found nothing in the source code: the problem remained unsolved. Andrey was loaded with other tasks; I, on the contrary, got a gap in my schedule and was assigned to analyze the next bug report.
After analyzing the crash dump, I soon found the problematic kmalloc-192 object: its memory was filled with some kind of garbage, information belonging to an object of a different type. It looked very much like the aftermath of a use-after-free, but after carefully examining the life cycle of the damaged object in the source code, I again found nothing suspicious.
I looked through the old bug reports, tried to find some clue there, but also to no avail.
Eventually I went back to my bug and started looking at the previous object. It also turned out to be in use, but from its contents it was completely unclear what it was: no constants, no references to functions or other objects. After tracking down several generations of references to this object, I eventually figured out that it was a shrinker bitmap. This object is part of an optimization of container memory reclaim. The technology was originally developed for our kernels; later its author, Kirill Tkhai, pushed it into the Linux mainline.
"The results show the performance increases at least in 548 times."
Several thousand such patches supplement the original rock-stable RHEL7 kernel, making the Virtuozzo kernel as convenient as possible for hosters. Whenever possible, we try to send our developments to the mainline, as this makes it easier to maintain the code in good condition.
Following the references, I found the structure describing my bitmap. The descriptor claimed that the bitmap should be 240 bytes long, and that could not possibly be true, since the object had in fact been allocated from the kmalloc-192 cache.
Bingo!
It turned out that the functions working with the bitmap accessed memory beyond its upper bound and could change the contents of the next object. In my case, that next object had a refcount at its very beginning; when the bitmap code zeroed it, the subsequent put resulted in a premature release of the object. Later the memory was allocated again for a new object, whose initialization looked like garbage to the code still using the old object, which sooner or later inevitably crashed the node.
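To make the mechanics easier to picture, here is a tiny userspace sketch; it is purely illustrative, all names are invented and it is not the actual kernel code. The bitmap code believes the map is 240 bytes long, while only 192 bytes were really allocated, so touching a bit near the end of the "believed" range lands in the neighbouring slab object and wipes out its refcount:

#include <stdio.h>
#include <stdlib.h>

#define ALLOCATED_BYTES 192	/* what kmalloc-192 actually hands out */
#define BELIEVED_BYTES  240	/* what the bitmap descriptor claims */

struct neighbour {			/* hypothetical next object in the same slab */
	unsigned int refcount;		/* lives in the very first bytes */
	char payload[188];
};

int main(void)
{
	/* one buffer emulates two adjacent slab objects */
	unsigned char *slab = calloc(1, ALLOCATED_BYTES + sizeof(struct neighbour));
	unsigned long *bitmap = (unsigned long *)slab;
	struct neighbour *victim = (struct neighbour *)(slab + ALLOCATED_BYTES);
	unsigned int bit = ALLOCATED_BYTES * 8;	/* first bit past the real allocation */

	victim->refcount = 1;

	/* bit 1536 is still "inside" the believed 240 * 8 = 1920 bits, so the
	 * bitmap code sees nothing wrong -- but byte 1536 / 8 = 192 already
	 * belongs to the neighbour, right where its refcount sits */
	bitmap[bit / (8 * sizeof(long))] &= ~(1UL << (bit % (8 * sizeof(long))));

	/* prints 0 on a little-endian machine: the refcount is gone and the
	 * next put() will free an object that is still in use */
	printf("neighbour refcount: %u\n", victim->refcount);
	free(slab);
	return 0;
}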
It's good when you can consult with the author of the code!
Looking at the code together with Kirill, we soon found the root cause of the discrepancy. As the number of containers grew, the bitmap was supposed to grow as well, but we had overlooked one of the cases and, as a result, sometimes skipped the bitmap resize. Our local tests never hit this situation, and in the version of the patch that Kirill sent to the mainline the code had been reworked, so the bug was not there.
It took Kirill and me four attempts to put a patch together; we ran it in local tests for a month and released an update with the fixed kernel at the end of February. We selectively checked other crash dumps, found the wrong bitmap in the neighborhood there as well, celebrated the victory and quietly closed the old bugs.
However, the nodes kept falling and falling. The trickle of this kind of bug report shrank, but did not dry up completely.
In general, this was expected. Our clients are hosters. They strongly dislike rebooting their nodes, because reboot == downtime == lost money. We also don't like to release kernels frequently: an official update release is a rather laborious procedure that requires running a bunch of different tests. Therefore, new stable kernels come out roughly quarterly.
To deliver bug fixes to client production nodes promptly, we use ReadyKernel live patches. In my opinion, no one else does this quite the way we do: Virtuozzo 7 follows an unusual strategy for using live patches.
Usually, live patches are security-only. In our case, about 3/4 of the fixes are bug fixes: fixes for bugs that our customers have already stumbled upon or may easily stumble upon in the future. Effectively, this can only be done for your own distribution: without feedback from users, you cannot tell what matters to them and what does not.
Live patching is certainly not a panacea. It is impossible to patch everything in a row: the technology simply does not allow it, and new functionality cannot be added this way either. However, a significant share of bugs is fixed with the simplest one-line patches, which are perfect for live patching. In more complex cases, the original patch has to be "creatively reworked with a file", and sometimes the live-patching machinery itself is buggy, but our live patching wizard Zhenya Shatokhin knows his job perfectly. Recently, for example, he unearthed an enchanting bug in kpatch which, with good reason, deserves a separate opera of its own.
As suitable bug fixes accumulate, usually once every week or two, Zhenya releases another batch of ReadyKernel live patches. After the release they instantly fly to the client nodes and prevent stepping on the rakes we already know about. All this without rebooting the client nodes and without having to release new kernels more often than needed. Pure benefit.
However, a live patch often reaches clients too late: the problem it closes has already occurred, even though the node has not crashed yet.
That is why new bug reports about the problem we had already fixed did not surprise us. Parsing them again and again showed the familiar symptoms: an old kernel, garbage in kmalloc-192, the "wrong" bitmap in front of it, and a live patch with the fix that was not loaded or was loaded too late.
One such case was OVZ-7188 from FastVPS, which came to us at the very end of February. "Thank you very much for the bug report. Our condolences. This immediately looks very much like a known issue. It's a pity there are no live patches in OpenVZ: wait for a stable kernel release, switch to Virtuozzo, or use an unstable kernel with the bugfix."
Bug reports are one of the most valuable things OpenVZ gives us. Investigating them gives us a chance to spot serious problems before any of our big customers steps on them. So, despite the known issue, I nevertheless asked them to upload the crash dumps for us.
Parsing the first of them rather puzzled me: there was no "wrong" bitmap in front of the "crooked" kmalloc-192 object.
A little later, the problem reproduced on the new kernel. And then again, and again, and again.
Oops!
How come? Not fixed? I double-checked the source code: everything was fine, the patch was in place, nothing had been lost.
Again corruption? In the same place?
I had to figure it out again.
In each of the new crash dumps, the investigation again ran into a kmalloc-192 object. The object itself looked quite normal, but at its very beginning there was a wrong address every time. Tracking the object's relationships, I found that two internal bytes of that address had been zeroed.
In all cases the corrupted pointer contained zeros in its two middle bytes (mask 0xffffffff0000ffff):
0xffff9e2400003d80
0xffff969b00005b40
0xffff919100007000
0xffff90f30000ccc0
In the first of these cases, instead of the "wrong" address 0xffff9e2400003d80 there should have been the "correct" address 0xffff9e24740a3d80. The same pattern held for the other cases.
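A quick check with the numbers above (nothing more than an illustration) confirms the pattern: only the two middle bytes of the pointer differ, exactly as the mask suggests:

#include <stdio.h>

int main(void)
{
	unsigned long expected = 0xffff9e24740a3d80UL;	/* what the pointer should have been */
	unsigned long found    = 0xffff9e2400003d80UL;	/* what was actually in the dump */

	/* the xor shows which bits changed: 0x740a0000, i.e. bytes 2 and 3 only */
	printf("xor  = %#lx\n", expected ^ found);

	/* and the corrupted value is exactly the expected one with those two bytes zeroed */
	printf("mask holds: %d\n", found == (expected & 0xffffffff0000ffffUL));
	return 0;
}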
So some extraneous code was zeroing two bytes of our object. The most likely scenario was a use-after-free, where an object, after being freed, zeroes some field in its first bytes. I checked the most frequently used objects of that size, but found nothing suspicious. A dead end again.
At our request, FastVPS ran the debug kernel with KASAN for a week, but it did not help: the problem never reproduced. We asked them to enable slub_debug, but that required a reboot, and the process dragged on. In March and April the nodes crashed several more times, but slub_debug was still off, so this gave us no new information.
And then came a lull: the problem stopped reproducing. April ended, May passed, and there were no new crashes.
The wait ended on June 7th, when the problem finally hit a kernel with slub_debug enabled. While checking the red zone on object free, slub_debug found two zero bytes beyond the object's upper bound. In other words, it was not a use-after-free: the previous object was the culprit once again. This time it was a normal-looking struct nf_ct_ext. This structure belongs to connection tracking, the description of a network connection used by the firewall.
However, it was still not clear why this was happening.
I began to look closer at the conntrack: someone had knocked on one of the containers over IPv6 on the open port 1720. By port and protocol I found the corresponding nf_conntrack_helper:
static struct nf_conntrack_helper nf_conntrack_helper_q931[] __read_mostly = {
	{
		.name			= "Q.931",
		.me			= THIS_MODULE,
		.data_len		= sizeof(struct nf_ct_h323_master),
		.tuple.src.l3num	= AF_INET,	/* <<<<<<<< IPv4 */
		.tuple.src.u.tcp.port	= cpu_to_be16(Q931_PORT),
		.tuple.dst.protonum	= IPPROTO_TCP,
		.help			= q931_help,
		.expect_policy		= &q931_exp_policy,
	},
	{
		.name			= "Q.931",
		.me			= THIS_MODULE,
		.tuple.src.l3num	= AF_INET6,	/* <<<<<<<< IPv6 */
		.tuple.src.u.tcp.port	= cpu_to_be16(Q931_PORT),
		.tuple.dst.protonum	= IPPROTO_TCP,
		.help			= q931_help,
		.expect_policy		= &q931_exp_policy,
	},
};
Comparing the two structures, I noticed that the IPv6 helper did not define .data_len. I dug into git to figure out where this field came from and found a patch from 2012:
commit 1afc56794e03229fa53cfa3c5012704d226e1dec
Author: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Thu Jun 7 12:11:50 2012 +0200
netfilter: nf_ct_helper: implement variable length helper private data
This patch uses the new variable length conntrack extensions.
Instead of using union nf_conntrack_help that contain all the
helper private data information, we allocate variable length
area to store the private helper data.
This patch includes the modification of all existing helpers.
It also includes a couple of include header to avoid compilation
warnings.
The patch added a new .data_len field to the helper, indicating how much memory the corresponding connection handler needs. The patch was supposed to define .data_len for all nf_conntrack_helpers that existed at the time, but it missed the structure I had found.
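Roughly, the mechanics look like this (a simplified sketch from memory of the RHEL7-era code, so the exact names and field order may differ): when the helper extension is attached to a conntrack, .data_len extra bytes are reserved at its tail, and that tail is where the helper keeps its private data.

/* simplified sketch, not the exact kernel source */
struct nf_conn_help {
	struct nf_conntrack_helper __rcu *helper;
	struct hlist_head expectations;
	u8 expecting[NF_CT_MAX_EXPECT_CLASSES];
	char data[];	/* .data_len bytes are reserved here */
};

/* inside the h323 helpers the private area is then used as: */
struct nf_ct_h323_master *info = nfct_help_data(ct);	/* effectively help->data */

With .data_len left at zero, nothing is reserved after struct nf_conn_help, so that "private data" pointer ends up pointing right past the end of the allocated extension.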
As a result, a connection over IPv6 to the open port 1720 invoked the q931_help() function, which wrote into a structure for which nobody had allocated memory. A simple port scan zeroed a couple of bytes; the transmission of a proper protocol message filled the structure with more meaningful data, but either way someone else's memory got trampled, and sooner or later this crashed the node.
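Accordingly, the fix boils down to a single line: giving the IPv6 entry the same .data_len as its IPv4 twin (shown here schematically; the exact change is in the commits referenced further below):

	{
		.name			= "Q.931",
		.me			= THIS_MODULE,
		.data_len		= sizeof(struct nf_ct_h323_master),	/* <<< the missing line */
		.tuple.src.l3num	= AF_INET6,
		.tuple.src.u.tcp.port	= cpu_to_be16(Q931_PORT),
		.tuple.dst.protonum	= IPPROTO_TCP,
		.help			= q931_help,
		.expect_policy		= &q931_exp_policy,
	},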
Florian Westphal reworked this code once more in 2017 and removed .data_len, so the problem I discovered went unnoticed in the mainline.
Although the bug is no longer present in the current Linux mainline, the problem was inherited by the kernels of a whole bunch of distributions, including the still-current RHEL7/CentOS7, SLES 11 & 12, Oracle Unbreakable Enterprise Kernel 3 & 4, Debian 8 & 9 and Ubuntu 14.04 & 16.04 LTS.
The bug was trivially reproduced on a test node, both on our kernel and on the original RHEL7 one. It is clearly a security issue: remotely triggered memory corruption. Wherever IPv6 port 1720 is open, it is practically a ping of death.
On June 9th I made a one-line patch with a vague description and sent it to the mainline. I sent a detailed description to Red Hat Bugzilla and separately wrote to Red Hat Security.
Further events developed without my participation.
On June 15, Zhenya Shatokhin released the ReadyKernel live patch for our old kernels.
https://readykernel.com/patch/Virtuozzo-7/readykernel-patch-131.10-108.0-1.vl7/
On June 18th we released a new stable kernel in Virtuozzo and OpenVz.
https://virtuozzosupport.force.com/s/article/VZA-2020-043
On June 24th, Red Hat Security assigned a CVE id to the bug
https://access.redhat.com/security/cve/CVE-2020-14305
The problem received a moderate impact rating with an unusually high CVSS v3 score of 8.1, and over the next few days other distributions responded to the bug once it became public:
SUSE: https://bugzilla.suse.com/show_bug.cgi?id=CVE-2020-14305
Debian: https://security-tracker.debian.org/tracker/CVE-2020-14305
Ubuntu: https://people.canonical.com/~ubuntu-security/cve/2020/CVE-2020-14305.html
On July 6th, KernelCare released a live patch for the affected distributions.
https://blog.kernelcare.com/new-kernel-vulnerability-found-by-virtuozzo-live-patched-by-kernelcare
On July 9th the issue was fixed in stable Linux kernels 4.9.230 and 4.4.230.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.9.y&id=396ba2fc4f27ef6c44bbc0098bfddf4da76dc4c9
Distributions, however, still haven't closed the hole ...
“Look, Kostya,” I say to my colleague Kostya Khorenko, “our shell has hit the same crater twice! I can't even remember the last time I ran into an access-beyond-end-of-object bug, and here it has visited us twice in a row. Tell me, is that a squared probability or not?”
“The probability is squared, yes. But you have to ask: the probability of what event? What is squared is the probability of encountering unusual bugs exactly twice in a row. Specifically, in a row.”
Well, Kostya is smart, he knows better.