KVM host in a couple of lines of code

Hello!



Today we are publishing an article on how to write a KVM host. We saw it on Serge Zaitsev's blog, translated it, and supplemented it with our own Python examples for those who do not work with C++.


KVM (Kernel-based Virtual Machine) is a virtualization technology that ships with the Linux kernel. In other words, KVM lets you run multiple virtual machines (VMs) on a single Linux host. The virtual machines in this case are called guests. If you have ever used QEMU or VirtualBox on Linux, you know what KVM is capable of.



But how does it work under the hood?



IOCTL



KVM exposes its API through a special device file, /dev/kvm. When you open the device, you get a handle to the KVM subsystem, and you then make ioctl system calls to allocate resources and launch virtual machines. Some ioctl calls return file descriptors that can in turn be manipulated with ioctl. And so on ad infinitum? Not really. There are only a few API levels in KVM:



  • the /dev/kvm level, used to manage the KVM subsystem as a whole and to create new virtual machines,
  • the VM level, used to manage an individual virtual machine,
  • the VCPU level, used to control the operation of a single virtual processor (one virtual machine can run on several virtual processors).


In addition, there are APIs for I/O devices.



Let's see how it looks in practice.



// KVM layer
int kvm_fd = open("/dev/kvm", O_RDWR);
int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
printf("KVM version: %d\n", version);

// Create VM
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

// Create VM memory (note: the region struct must not shadow the mmap'ed pointer)
#define RAM_SIZE 0x10000
void *mem = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
struct kvm_userspace_memory_region mem_region = {
	.slot = 0,
	.guest_phys_addr = 0,
	.memory_size = RAM_SIZE,
	.userspace_addr = (uintptr_t) mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem_region);

// Create VCPU
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);


Python example:



with open('/dev/kvm', 'wb+') as kvm_fd:
    # KVM layer
    version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0)
    if version != 12:
        print(f'Unsupported version: {version}')
        sys.exit(1)

    # Create VM
    vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0)

    # Create VM Memory
    mem = mmap(-1, RAM_SIZE, MAP_PRIVATE | MAP_ANONYMOUS, PROT_READ | PROT_WRITE)
    pmem = ctypes.c_uint.from_buffer(mem)
    mem_region = UserspaceMemoryRegion(slot=0, flags=0,
                                       guest_phys_addr=0, memory_size=RAM_SIZE,
                                       userspace_addr=ctypes.addressof(pmem))
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, mem_region)

    # Create VCPU
    vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0)


In this step, we have created a new virtual machine, allocated memory for it, and assigned one vCPU. In order for our virtual machine to actually run something, we need to load the virtual machine image and properly configure the processor registers.
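The Python snippets above refer to KVM constants and a UserspaceMemoryRegion structure that are not defined anywhere in the listing. A minimal sketch of these definitions might look like this; the ioctl request numbers are derived with the standard Linux _IO/_IOW encoding, and the structure mirrors struct kvm_userspace_memory_region from the kernel headers:

```python
import ctypes

# ioctl request encoding from asm-generic/ioctl.h:
# _IO carries no payload, _IOW additionally encodes the size of the
# struct that the kernel reads from userspace.
def _IO(type_, nr):
    return (type_ << 8) | nr

def _IOW(type_, nr, size):
    return (1 << 30) | (size << 16) | (type_ << 8) | nr

KVMIO = 0xAE  # magic number of the KVM ioctl family

class UserspaceMemoryRegion(ctypes.Structure):
    """Mirror of struct kvm_userspace_memory_region."""
    _fields_ = [
        ('slot',            ctypes.c_uint32),
        ('flags',           ctypes.c_uint32),
        ('guest_phys_addr', ctypes.c_uint64),
        ('memory_size',     ctypes.c_uint64),
        ('userspace_addr',  ctypes.c_uint64),
    ]

KVM_GET_API_VERSION    = _IO(KVMIO, 0x00)   # 0xAE00
KVM_CREATE_VM          = _IO(KVMIO, 0x01)
KVM_GET_VCPU_MMAP_SIZE = _IO(KVMIO, 0x04)
KVM_CREATE_VCPU        = _IO(KVMIO, 0x41)
KVM_SET_USER_MEMORY_REGION = _IOW(KVMIO, 0x46,
                                  ctypes.sizeof(UserspaceMemoryRegion))
KVM_RUN                = _IO(KVMIO, 0x80)
```

Checking the computed values against your kernel headers (or a small C program that prints them) is a good sanity test before running anything.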



Loading the virtual machine



It's easy enough! Just read the file and copy its contents into the virtual machine memory. Of course, mmap is also a good option.



int bin_fd = open("guest.bin", O_RDONLY);
if (bin_fd < 0) {
	fprintf(stderr, "can not open binary file: %d\n", errno);
	return 1;
}
char *p = (char *)ram_start; // ram_start is the mmap'ed guest memory region
for (;;) {
	int r = read(bin_fd, p, 4096);
	if (r <= 0) {
		break;
	}
	p += r;
}
close(bin_fd);


Python example:



    # Read guest.bin
    guest_bin = load_guestbin('guest.bin')
    mem[:len(guest_bin)] = guest_bin
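The load_guestbin helper used above is not shown in the listing; a minimal version (an assumption on our part, mirroring the C read() loop) could be:

```python
def load_guestbin(path):
    """Read the whole flat guest binary into memory."""
    with open(path, 'rb') as f:
        return f.read()
```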


It is assumed that guest.bin contains valid machine code for the current CPU architecture, because KVM does not interpret CPU instructions one after another the way old-school virtual machines did. KVM hands computation to the real CPU and only intercepts I/O. This is why modern virtual machines run at high performance, close to bare metal, unless you are doing I/O-heavy operations.



Here is the tiny guest virtual machine "kernel" we will try to run first:

#
# Build it:
#
# as -32 guest.S -o guest.o
# ld -m elf_i386 --oformat binary -N -e _start -Ttext 0x10000 -o guest guest.o
#
.globl _start
.code16
_start:
    xorw %ax, %ax
loop:
    out %ax, $0x10
    inc %ax
    jmp loop

If you are not familiar with assembler: this is a tiny 16-bit executable that increments a register in a loop and outputs the value to port 0x10.
We deliberately compiled it as an archaic 16-bit application, because a KVM virtual processor can operate in several modes, like a real x86 processor. The simplest mode is "real" mode, which has been used to run 16-bit code since the last century. Real mode is distinctive in its memory addressing: it is direct, with no descriptor tables, which makes it easy to initialize our registers:



struct kvm_sregs sregs;
ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
// Initialize the selector and base of each segment register with zeros
sregs.cs.selector = sregs.cs.base = 0;
sregs.ss.selector = sregs.ss.base = 0;
sregs.ds.selector = sregs.ds.base = 0;
sregs.es.selector = sregs.es.base = 0;
sregs.fs.selector = sregs.fs.base = 0;
sregs.gs.selector = sregs.gs.base = 0;
// Save special registers
ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);

// Initialize and save normal registers
struct kvm_regs regs = {0};
regs.rflags = 2; // bit 1 must always be set to 1 in EFLAGS and RFLAGS
regs.rip = 0; // our code runs from address 0
ioctl(vcpu_fd, KVM_SET_REGS, &regs);


Python example:



    sregs = Sregs()
    ioctl(vcpu_fd, KVM_GET_SREGS, sregs)
    # Initialize the selector and base of each segment register with zeros
    sregs.cs.selector = sregs.cs.base = 0
    sregs.ss.selector = sregs.ss.base = 0
    sregs.ds.selector = sregs.ds.base = 0
    sregs.es.selector = sregs.es.base = 0
    sregs.fs.selector = sregs.fs.base = 0
    sregs.gs.selector = sregs.gs.base = 0
    # Save special registers
    ioctl(vcpu_fd, KVM_SET_SREGS, sregs)

    # Initialize and save normal registers
    regs = Regs()
    regs.rflags = 2  # bit 1 must always be set to 1 in EFLAGS and RFLAGS
    regs.rip = 0  # our code runs from address 0
    ioctl(vcpu_fd, KVM_SET_REGS, regs)


Running



The code is loaded and the registers are ready. Shall we start? To run a virtual machine, we need to get a pointer to the "run state" of each vCPU and then enter a loop in which the virtual machine runs until it is interrupted by I/O or other operations that transfer control back to the host.



int runsz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = (struct kvm_run *) mmap(NULL, runsz, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_fd, 0);

for (;;) {
	ioctl(vcpu_fd, KVM_RUN, 0);
	switch (run->exit_reason) {
	case KVM_EXIT_IO:
		printf("IO port: %x, data: %x\n", run->io.port, *(int *)((char *)(run) + run->io.data_offset));
		break;
	case KVM_EXIT_SHUTDOWN:
		return;
	}
}


Python example:



    runsz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0)
    run_buf = mmap(vcpu_fd, runsz, MAP_SHARED, PROT_READ | PROT_WRITE)
    run = Run.from_buffer(run_buf)

    try:
        while True:
            ret = ioctl(vcpu_fd, KVM_RUN, 0)
            if ret < 0:
                print('KVM_RUN failed')
                return
            if run.exit_reason == KVM_EXIT_IO:
                print(f'IO port: {run.io.port}, data: {run_buf[run.io.data_offset]}')
            elif run.exit_reason == KVM_EXIT_SHUTDOWN:
                return
            time.sleep(1)
    except KeyboardInterrupt:
        pass


Now if we run the application, we will see:

IO port: 10, data: 0
IO port: 10, data: 1
IO port: 10, data: 2
IO port: 10, data: 3
IO port: 10, data: 4
...

It works! The complete source code is available at the following address (if you notice a mistake, comments are welcome!).
And you call that a kernel?



Admittedly, all this is not very impressive so far. How about running the Linux kernel instead?



The beginning is the same: open /dev/kvm, create a virtual machine, and so on. However, we need a few more ioctl calls at the VM level: add a programmable interval timer, initialize the TSS (required on Intel chips), and add an interrupt controller:



ioctl(vm_fd, KVM_SET_TSS_ADDR, 0xffffd000);
uint64_t map_addr = 0xffffc000;
ioctl(vm_fd, KVM_SET_IDENTITY_MAP_ADDR, &map_addr);
ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0);
struct kvm_pit_config pit = { .flags = 0 };
ioctl(vm_fd, KVM_CREATE_PIT2, &pit);
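For those following along in Python, the same VM-level setup might look like the sketch below. The ioctl numbers are computed with the standard _IO/_IOW encoding and should be verified against your kernel headers; the PitConfig structure mirrors struct kvm_pit_config:

```python
import ctypes
from fcntl import ioctl

# ioctl request encoding helpers (asm-generic/ioctl.h)
def _IO(type_, nr):
    return (type_ << 8) | nr

def _IOW(type_, nr, size):
    return (1 << 30) | (size << 16) | (type_ << 8) | nr

KVMIO = 0xAE
KVM_SET_TSS_ADDR          = _IO(KVMIO, 0x47)
KVM_SET_IDENTITY_MAP_ADDR = _IOW(KVMIO, 0x48, 8)   # takes a __u64
KVM_CREATE_IRQCHIP        = _IO(KVMIO, 0x60)

class PitConfig(ctypes.Structure):
    """Mirror of struct kvm_pit_config (flags plus padding)."""
    _fields_ = [('flags', ctypes.c_uint32),
                ('pad',   ctypes.c_uint32 * 15)]

KVM_CREATE_PIT2 = _IOW(KVMIO, 0x77, ctypes.sizeof(PitConfig))

def setup_vm_for_linux(vm_fd):
    # TSS and identity map must sit in guest addresses unused by the kernel
    ioctl(vm_fd, KVM_SET_TSS_ADDR, 0xffffd000)
    ioctl(vm_fd, KVM_SET_IDENTITY_MAP_ADDR, ctypes.c_uint64(0xffffc000))
    ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0)
    ioctl(vm_fd, KVM_CREATE_PIT2, PitConfig(flags=0))
```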


We will also need to change the way the registers are initialized. The Linux kernel requires protected mode, so we enable it in the register flags and initialize the base, selector, and granularity of each segment register:



sregs.cs.base = 0;
sregs.cs.limit = ~0;
sregs.cs.g = 1;

sregs.ds.base = 0;
sregs.ds.limit = ~0;
sregs.ds.g = 1;

sregs.fs.base = 0;
sregs.fs.limit = ~0;
sregs.fs.g = 1;

sregs.gs.base = 0;
sregs.gs.limit = ~0;
sregs.gs.g = 1;

sregs.es.base = 0;
sregs.es.limit = ~0;
sregs.es.g = 1;

sregs.ss.base = 0;
sregs.ss.limit = ~0;
sregs.ss.g = 1;

sregs.cs.db = 1;
sregs.ss.db = 1;
sregs.cr0 |= 1; // enable protected mode

regs.rflags = 2;
regs.rip = 0x100000; // This is where our kernel code starts
regs.rsi = 0x10000; // This is where our boot parameters start


What are boot parameters, and why can't we just load the kernel at address zero? It's time to learn more about the bzImage format.



The kernel image follows a special "boot protocol": a fixed header with boot parameters, followed by the actual kernel code. The format of the boot header is described in the kernel documentation (Documentation/x86/boot).



Loading a kernel image



In order to properly load the kernel image into the virtual machine, we first need to read the entire bzImage file. We look at offset 0x1f1 and take the number of setup sectors from there, then skip them to find where the kernel code starts. In addition, we copy the boot parameters from the beginning of the bzImage into the virtual machine's memory area for boot parameters (0x10000).
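The splitting step described above can be sketched in Python like this; the offsets and signatures come from the documented boot protocol (setup_sects at 0x1f1, boot sector magic 0xAA55 at 0x1fe, "HdrS" at 0x202, and the protocol's rule that a setup_sects of 0 means 4):

```python
def parse_bzimage(data: bytes):
    """Split a bzImage into (boot_params, kernel_code) per the boot protocol."""
    assert data[0x1fe:0x200] == b'\x55\xaa', 'missing boot sector magic'
    assert data[0x202:0x206] == b'HdrS', 'missing boot protocol signature'
    setup_sects = data[0x1f1]
    if setup_sects == 0:
        setup_sects = 4  # protocol quirk: 0 really means 4
    kernel_offset = (setup_sects + 1) * 512  # +1 for the boot sector itself
    return data[:kernel_offset], data[kernel_offset:]
```

The first chunk is what gets copied to 0x10000 as boot parameters; the second is the kernel code that goes to 0x100000.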



But even that is not enough. We need to patch the boot parameters of our VM to force VGA mode and to initialize the command-line pointer.



We want the kernel to write its logs to ttyS0, so that we can intercept the I/O and have our virtual machine print it to stdout. To do this, we need to add "console=ttyS0" to the kernel command line.
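Patching the copied boot parameters can be sketched as follows. The field offsets (vid_mode at 0x1fa, type_of_loader at 0x210, loadflags at 0x211, cmd_line_ptr at 0x228) come from the boot protocol; the command-line address 0x20000 is a hypothetical free guest address we chose for illustration, and the NUL-terminated "console=ttyS0" string itself must be copied into guest memory at that address separately:

```python
CMDLINE_ADDR = 0x20000  # hypothetical free guest address for the command line

def patch_boot_params(params: bytearray) -> bytearray:
    """Patch the boot header copied from the start of the bzImage."""
    params[0x1fa:0x1fc] = (0xFFFF).to_bytes(2, 'little')  # vid_mode: "normal" VGA
    params[0x210] = 0xFF                                  # type_of_loader: undefined
    params[0x211] |= 0x80                                 # loadflags |= CAN_USE_HEAP
    params[0x228:0x22c] = CMDLINE_ADDR.to_bytes(4, 'little')  # cmd_line_ptr
    return params
```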



But even after that, we get no output. I had to set a fake KVM CPUID for our kernel (https://www.kernel.org/doc/Documentation/virtual/kvm/cpuid.txt). Most likely, the kernel I built relied on this information to determine whether it was running inside a hypervisor or on bare metal.



I used a kernel compiled with the "tiny" configuration, plus a few configuration flags to support the terminal and virtio (the I/O virtualization framework for Linux).



The complete code for the modified KVM host and test kernel image is available here .



If this image does not start, you can use another image available at this link .


If we compile it and run it, we get the following output:



Linux version 5.4.39 (serge@melete) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~16.04~ppa1)) #12 Fri May 8 16:04:00 CEST 2020
Command line: console=ttyS0
Intel Spectre v2 broken microcode detected; disabling Speculation Control
Disabled fast string operations
x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
BIOS-provided physical RAM map:
BIOS-88: [mem 0x0000000000000000-0x000000000009efff] usable
BIOS-88: [mem 0x0000000000100000-0x00000000030fffff] usable
NX (Execute Disable) protection: active
tsc: Fast TSC calibration using PIT
tsc: Detected 2594.055 MHz processor
last_pfn = 0x3100 max_arch_pfn = 0x400000000
x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
Using GB pages for direct mapping
Zone ranges:
  DMA32    [mem 0x0000000000001000-0x00000000030fffff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x00000000030fffff]
Zeroed struct page in unavailable ranges: 20322 pages
Initmem setup node 0 [mem 0x0000000000001000-0x00000000030fffff]
[mem 0x03100000-0xffffffff] available for PCI devices
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
Built 1 zonelists, mobility grouping on.  Total pages: 12253
Kernel command line: console=ttyS0
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
Inode-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 37216K/49784K available (4097K kernel code, 292K rwdata, 244K rodata, 832K init, 916K bss, 12568K reserved, 0K cma-reserved)
Kernel/User page tables isolation: enabled
NR_IRQS: 4352, nr_irqs: 24, preallocated irqs: 16
Console: colour VGA+ 142x228
printk: console [ttyS0] enabled
APIC: ACPI MADT or MP tables are not detected
APIC: Switch to virtual wire mode setup with no configuration
Not enabling interrupt remapping due to skipped IO-APIC setup
clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x25644bd94a2, max_idle_ns: 440795207645 ns
Calibrating delay loop (skipped), value calculated using timer frequency.. 5188.11 BogoMIPS (lpj=10376220)
pid_max: default: 4096 minimum: 301
Mount-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes, linear)
Disabled fast string operations
Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
CPU: Intel 06/3d (family: 0x6, model: 0x3d, stepping: 0x4)
Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
Speculative Store Bypass: Vulnerable
TAA: Mitigation: Clear CPU buffers
MDS: Mitigation: Clear CPU buffers
Performance Events: Broadwell events, 16-deep LBR, Intel PMU driver.
...


Obviously, this is still a rather useless result: there is no initrd or root partition, and no real applications that could run in this kernel, but it still shows that KVM is not as scary as it seems, and is quite a powerful tool.



Conclusion



To run a full-fledged Linux, the virtual machine host needs to be much more advanced: we have to emulate several I/O drivers for disks, keyboard, and graphics. But the general approach stays the same; for example, the command-line parameters for initrd are configured in the same way. For disks, we will need to intercept the I/O and respond appropriately.



However, no one forces you to use KVM directly. There is libvirt, a friendly library for low-level virtualization technologies like KVM or bhyve.



If you are interested in learning more about KVM, I suggest reading the kvmtool sources. They are much easier to read than QEMU, and the whole project is much smaller and simpler.



Hope you enjoyed the article.



You can follow the news on GitHub, Twitter, or subscribe via RSS.



Links to the GitHub Gists with Python examples from a Timeweb expert: (1) and (2).



All Articles