The pseudo file /proc/*/mem has a peculiar quirk: its "punch-through" semantics. Writes performed through this file succeed even if the destination virtual memory is marked unwritable. This is intentional, and projects such as the Julia JIT compiler and the rr debugger actively rely on it.
But the question is, does privileged code obey virtual memory permissions? To what extent can hardware affect kernel memory access?
We will try to answer these questions and consider the nuances of the interaction between the operating system and the hardware on which it is executed. Let's explore the processor limits that can affect the kernel and see how the kernel can work around them.
Patching libc with /proc/self/mem
What do these punch-through semantics look like? Consider the code:
#include <cstdio>
#include <fstream>
#include <iostream>
#include <sys/mman.h>

/* Write @len bytes at @ptr to @addr in this address space using
 * /proc/self/mem.
 */
void memwrite(void *addr, const char *ptr, size_t len) {
    std::ofstream ff("/proc/self/mem");
    // The file offset is the target virtual address.
    ff.seekp(reinterpret_cast<size_t>(addr));
    ff.write(ptr, len);
    ff.flush();
}

int main(int argc, char **argv) {
    // Map an unwritable page. (read-only)
    auto mymap =
        (int *)mmap(NULL, 0x9000,
                    PROT_READ, // <<<<<<<<<<<<<<<<<<<<< READ ONLY <<<<<<<<
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mymap == MAP_FAILED) {
        std::cout << "FAILED\n";
        return 1;
    }
    std::cout << "Allocated PROT_READ only memory: " << mymap << "\n";
    getchar();

    // Try to write to the unwritable page.
    memwrite(mymap, "\x40\x41\x41\x41", 4);
    std::cout << "did mymap[0] = 0x41414140 via proc self mem..";
    getchar();
    std::cout << "mymap[0] = 0x" << std::hex << mymap[0] << "\n";
    getchar();

    // Try to write to the text segment (executable code) of libc.
    auto getchar_ptr = (char *)getchar;
    memwrite(getchar_ptr, "\xcc", 1);

    // Run the libc function whose code we modified. If the write worked,
    // we will get a SIGTRAP when the 0xcc executes.
    getchar();
}
This code uses /proc/self/mem to write to two unwritable memory pages. The first is a read-only page that the code itself maps, and the second is a code page belonging to libc (the getchar function). The second write is the more interesting one: the code writes the byte 0xcc (a breakpoint instruction on x86-64), which, if executed, causes the kernel to deliver a SIGTRAP to our process. This literally patches libc's executable code in memory. If the next call to getchar raises SIGTRAP, we know that the write succeeded.
Running the program shows that it works. The lines printed in the middle prove that the value 0x41414140 was successfully written to memory and read back. The final output shows that, after patching, our process received a SIGTRAP as a result of calling getchar.
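The first half of the experiment can also be reduced to a few system calls. The following is a minimal sketch rather than the program above: it assumes Linux, swaps std::ofstream for POSIX open and pwrite so that the result of the write can be checked, and the names poke_self and demo_readonly_write are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: write len bytes from src to addr in our own address
// space through /proc/self/mem. Returns true only if every byte was written.
bool poke_self(void *addr, const void *src, std::size_t len) {
    int fd = open("/proc/self/mem", O_WRONLY);
    if (fd < 0)
        return false;
    // The file offset is the target virtual address.
    ssize_t n = pwrite(fd, src, len,
                       static_cast<off_t>(reinterpret_cast<std::uintptr_t>(addr)));
    close(fd);
    return n == static_cast<ssize_t>(len);
}

// Map one read-only anonymous page, try to overwrite its first word through
// /proc/self/mem, and store what the page holds afterwards in *seen.
// Returns true if the write was accepted.
bool demo_readonly_write(std::uint32_t *seen) {
    assert(seen != nullptr);
    *seen = 0;
    void *page = mmap(nullptr, 4096, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return false;
    const std::uint32_t value = 0x41414140;
    bool ok = poke_self(page, &value, sizeof value);
    // Read back through the ordinary (read-only) user-space mapping.
    *seen = *static_cast<volatile std::uint32_t *>(page);
    munmap(page, 4096);
    return ok;
}
```

On a typical Linux system demo_readonly_write should return true with *seen equal to 0x41414140. A false return means the kernel or a sandbox refused the write, in which case the page still reads as zero.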
We've seen how this feature works from a user space perspective. Let's dig deeper. To fully understand how this works, you need to look at how hardware imposes memory constraints.
The hardware
On the x86-64 platform, there are two processor settings that control the kernel's ability to access memory. They are used by the memory management unit (MMU).
The first setting is the Write Protect bit (CR0.WP). From the Intel manual (Volume 3, Section 2.5) we know:
WP Write Protect (bit 16 of CR0): When set, inhibits supervisor-level procedures from writing into read-only pages; when clear, allows supervisor-level procedures to write into read-only pages (regardless of the U/S bit setting; see Section 4.1.3 and Section 4.6).
When set, this bit prevents the kernel from writing to write-protected pages, something the processor otherwise permits by default.
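The rule can be written down as a one-line predicate. This is only a toy model of the manual text quoted above, not kernel or processor code: the function name is invented, and every other check the MMU performs (present bit, U/S bit, SMAP and so on) is deliberately left out.

```cpp
#include <cstdint>

// CR0.WP is bit 16 of CR0, per the manual text quoted above.
constexpr std::uint64_t kCr0Wp = std::uint64_t{1} << 16;

// Toy model: may a supervisor-mode write proceed without a fault, looking
// only at CR0.WP and whether the page is marked writable?
constexpr bool supervisor_write_allowed(std::uint64_t cr0, bool page_writable) {
    return page_writable || (cr0 & kCr0Wp) == 0;
}
```

With the bit set, a supervisor write to a read-only page is refused; with the bit clear, it goes through.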
The second setting is Supervisor Mode Access Prevention (SMAP) (CR4.SMAP). The full description in Volume 3, Section 4.6, is verbose. In short, SMAP completely removes the kernel's ability to read from or write to user-space memory. This thwarts exploits that populate user space with malicious data for the kernel to read during exploitation.
If kernel code accesses user space only through the approved channels (copy_to_user and so on), SMAP can be safely ignored: these functions automatically toggle SMAP before and after accessing memory. But what about write protection?
If CR0.WP is not set, the kernel's /proc/*/mem implementation could indeed simply write to write-protected user-space memory without ceremony.
However, CR0.WP is set at boot and usually stays that way for the lifetime of the system. In that case, a write attempt triggers a page fault. Because the bit is more of a copy-on-write tool than a security boundary, it does not impose any real restriction on the kernel. It does, however, require inconvenient fault handling that would otherwise be unnecessary.
Let's figure out the implementation now.
How /proc/*/mem works
/proc/*/mem is implemented in fs/proc/base.c. A struct file_operations is populated with handler functions, and the mem_rw() function ultimately backs the write handler. mem_rw() uses access_remote_vm() to perform the actual writes, and access_remote_vm() does the following:
- Calls get_user_pages_remote() to find the physical frame that corresponds to the target virtual address.
- Calls kmap() to map this frame into the kernel's virtual address space as writable.
- Calls copy_to_user_page() to finally perform the write.
This implementation completely bypasses the issue of the kernel's ability to write to non-writable user space! The kernel's control over the virtual memory subsystem allows the MMU to be completely bypassed, allowing the kernel to simply write to its own writeable address space. So the discussion of CR0.WP becomes irrelevant.
Let's look at each of the steps.
get_user_pages_remote()
To bypass the MMU, the kernel has to do manually, in software, what the MMU does in hardware. First, it needs to translate the target virtual address into a physical one. This is done by the get_user_pages() family of functions. They walk the page tables and look for the physical memory frames that correspond to a given range of virtual addresses.
The caller provides context and uses flags to modify the behavior of get_user_pages(). Of particular interest is the FOLL_FORCE flag, which mem_rw() passes. This flag causes check_vma_flags (the access validation logic within get_user_pages()) to ignore writes to unwritable pages and let the lookup continue. The punch-through semantics are entirely attributable to FOLL_FORCE (comments are mine):
static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
{
    [...]
    if (write) {                              // If performing a write..
        if (!(vm_flags & VM_WRITE)) {         // And the page is unwritable..
            if (!(gup_flags & FOLL_FORCE))    // *Unless* FOLL_FORCE..
                return -EFAULT;               // Return an error
    [...]
    return 0;                                 // Otherwise, proceed with lookup
}
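The branch shown in the excerpt can be modeled in a few lines of user-space code. This is a simplified sketch, not the kernel function: the constants use made-up values and names, the sketch assumes that write reflects a write-request flag (called kFollWrite here), and the further checks hidden behind the elided parts of the real function are ignored.

```cpp
#include <cerrno>

// Illustrative flag values for this sketch only; they are not taken from
// any kernel version.
constexpr unsigned long kVmWrite = 1ul << 0;   // stands in for VM_WRITE
constexpr unsigned int kFollWrite = 1u << 0;   // assumed "this is a write" flag
constexpr unsigned int kFollForce = 1u << 1;   // stands in for FOLL_FORCE

// Simplified model of the access check: a write to an unwritable area is
// refused with -EFAULT unless the caller passed the force flag.
constexpr int check_write_access(unsigned long vm_flags, unsigned int gup_flags) {
    return ((gup_flags & kFollWrite) &&     // performing a write..
            !(vm_flags & kVmWrite) &&       // and the area is unwritable..
            !(gup_flags & kFollForce))      // and the force flag is absent..
               ? -EFAULT                    // return an error
               : 0;                         // otherwise, proceed with lookup
}
```

In this model, check_write_access(0, kFollWrite) yields -EFAULT, while adding kFollForce to the flags turns the same request into a 0.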
get_user_pages() also honors copy-on-write (CoW) semantics. If a write to a non-writable page table entry is detected, a page fault is emulated by calling handle_mm_fault, the core page fault handler. This triggers the appropriate copy-on-write handling routine, do_wp_page, which copies the page if necessary. As a result, writes made through /proc/*/mem to a private mapping of shared data, such as libc, are visible only within the process.
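The copy-on-write behavior can be observed from user space. The sketch below is an illustration under assumptions, not part of the original experiment: it assumes Linux with memfd_create available, uses an in-memory file in place of libc, and the names CowResult and demo_private_write are invented. It maps the file privately and read-only, writes one byte through /proc/self/mem, then compares what the mapping and the file itself contain.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct CowResult {
    bool wrote;       // did the /proc/self/mem write succeed?
    char in_mapping;  // first byte seen through the private mapping
    char in_file;     // first byte read back from the file itself
};

CowResult demo_private_write() {
    CowResult r{false, 0, 0};
    int fd = memfd_create("cow-demo", 0);  // anonymous in-memory file
    if (fd < 0)
        return r;
    const char original = 'A';
    if (ftruncate(fd, 4096) != 0 || pwrite(fd, &original, 1, 0) != 1) {
        close(fd);
        return r;
    }
    // Private, read-only view of the file, like a process's view of libc.
    void *map = mmap(nullptr, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return r;
    }
    int mem = open("/proc/self/mem", O_WRONLY);
    if (mem >= 0) {
        const char patched = 'B';
        r.wrote = pwrite(mem, &patched, 1,
                         static_cast<off_t>(reinterpret_cast<std::uintptr_t>(map))) == 1;
        close(mem);
    }
    r.in_mapping = *static_cast<volatile char *>(map);
    if (pread(fd, &r.in_file, 1, 0) != 1)
        r.in_file = 0;
    munmap(map, 4096);
    close(fd);
    return r;
}
```

If the write is accepted, the mapping should show 'B' while the file still holds 'A': the process received its own copy of the page, and the underlying file was left untouched.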
kmap()
After the physical frame has been found, it needs to be mapped into the kernel's virtual address space as writable. This is done with kmap().
On 64-bit x86, all physical memory is mapped through the linear mapping region of the kernel's virtual address space. In this case kmap() is very simple: it only needs to add the start address of the linear mapping to the frame's physical address to compute the virtual address at which that frame is mapped.
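The arithmetic amounts to a single addition. In the sketch below, the base address is an assumption: it is a commonly documented default for x86-64 with 4-level paging, but the real value depends on kernel version and configuration and is randomized when KASLR is enabled. The function name is invented for illustration.

```cpp
#include <cstdint>

// Assumed start of the linear mapping (see the caveats above).
constexpr std::uint64_t kAssumedLinearMapBase = 0xffff888000000000ull;

// Kernel virtual address at which a physical address is visible through the
// linear mapping: simply base + physical address.
constexpr std::uint64_t linear_map_address(
    std::uint64_t phys_addr, std::uint64_t base = kAssumedLinearMapBase) {
    return base + phys_addr;
}
```

Under this assumption, the frame at physical address 0x1000 would be reachable at 0xffff888000001000.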
On 32-bit x86, the linear mapping covers only a subset of physical memory, so kmap() may need to map the frame by allocating highmem and manipulating page tables.
In both cases, the linear mapping and the highmem mappings are set up with the PAGE_KERNEL protection, which allows writes.
copy_to_user_page()
The last step is to perform the write. This is done with copy_to_user_page(), which is essentially a memcpy. It works because the destination is the writable mapping obtained from kmap().
Discussion
So, first, the kernel uses the program's page tables to translate the target user-space virtual address into the corresponding physical frame. The kernel then maps this frame into its own writable virtual address space. Finally, it performs the write with a simple memcpy.
Strikingly, CR0.WP plays no part in this. The implementation elegantly sidesteps it by exploiting the fact that it does not have to access memory through a user-space pointer. Since the kernel has complete control over virtual memory, it can simply remap the physical frame into its own virtual address space with arbitrary permissions and do whatever it wants with it.
It is important to note that the permissions protecting a page of memory are tied to the virtual address used to access that page, not to the physical frame backing it. The notion of memory permissions applies exclusively to virtual memory, not to physical memory.
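The same principle can be seen entirely from user space, without /proc/*/mem. The sketch below is an analogy under assumptions, not the kernel's mechanism: it assumes Linux with memfd_create available, and the function name is invented. It maps the same page of memory twice, once read-only and once writable, and writes through the writable view.

```cpp
#include <cassert>
#include <sys/mman.h>
#include <unistd.h>

// Returns the byte observed through the read-only view after writing 'Z'
// through the writable view of the same memory, or -1 if setup failed.
int demo_two_views() {
    int fd = memfd_create("alias-demo", 0);  // anonymous in-memory file
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) != 0) {
        close(fd);
        return -1;
    }
    // Two virtual addresses, one backing page, different permissions.
    void *ro = mmap(nullptr, 4096, PROT_READ, MAP_SHARED, fd, 0);
    void *rw = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mappings keep the memory alive
    int seen = -1;
    if (ro != MAP_FAILED && rw != MAP_FAILED) {
        assert(ro != rw);
        *static_cast<volatile char *>(rw) = 'Z';    // allowed: writable view
        seen = *static_cast<volatile char *>(ro);   // visible via read-only view
    }
    if (ro != MAP_FAILED)
        munmap(ro, 4096);
    if (rw != MAP_FAILED)
        munmap(rw, 4096);
    return seen;
}
```

The read-only view never becomes writable, yet its contents change, because the permission belongs to the virtual mapping rather than to the memory behind it.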
Conclusion
By examining the details of the punch-through semantics of the /proc/*/mem implementation, we can reflect on the relationship between the kernel and the processor. At first glance, the kernel's ability to write to unwritable memory raises the question: to what extent can the processor influence the kernel's memory access? The manual describes control mechanisms that can limit what the kernel does. But on closer inspection, these limitations are superficial at best. They are simple obstacles to work around.