A tiny OS on KVM, both sides in Rust

Years ago I worked through most of nand2tetris, building a computer up from NAND gates. What I never got to was the other direction, running something interesting on real hardware where every layer between the metal and the pixels is yours.

This is that, on Linux KVM. There are two programs. One is a hypervisor that talks to /dev/kvm and sets up a virtual machine. The other is a kernel that runs inside it, entered directly in 64-bit mode with no BIOS and no bootloader, so the first instruction the machine ever runs is ours. By the end it draws a spinning shaded torus, takes mouse and keyboard input, and runs an unmodified static Linux binary.

One full revolution, recorded frame by frame out of guest memory. Software-rasterized in the guest with its own z-buffer and per-face lighting. No GPU, no graphics API, no libraries below it.

Everything here runs on a plain Fedora laptop. On Fedora /dev/kvm is world read-write by default, so the hypervisor is an ordinary user program with no privileges and no setup.

The smallest virtual machine

KVM turns virtualization into a short conversation with the kernel. You open /dev/kvm, ask for a VM, give it some memory, and ask for a virtual CPU. The kvm-ioctls crate wraps each of those in a method.

Kvm::new()         open /dev/kvm
create_vm()        an empty VM
set_user_memory    an mmap becomes guest RAM
create_vcpu        a CPU to run

Four calls stand up a machine. Everything after is setting that CPU's registers and running it.

Guest memory is just an ordinary mmap in the hypervisor. You hand its address to KVM, and from then on the guest’s physical address 0 is the start of that region.

let kvm = Kvm::new().expect("opening /dev/kvm");
let vm = kvm.create_vm().expect("creating VM");

let mem = GuestMemory::new(MEM_SIZE);
let region = kvm_userspace_memory_region {
    slot: 0,
    flags: 0,
    guest_phys_addr: 0,
    memory_size: MEM_SIZE as u64,
    userspace_addr: mem.ptr as u64,
};
// Safety: the mapping is valid, page-aligned, and outlives the VM.
unsafe { vm.set_user_memory_region(region) }.expect("registering guest memory");
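
GuestMemory itself is not shown above. What follows is a minimal sketch of what such a wrapper could look like, with one loud assumption: it takes page-aligned, zeroed memory from std::alloc instead of calling mmap, so it needs no libc crate. The ptr field and write_u64 match how the listings use the type; len, read_u64 and the bounds check are additions for illustration.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Page-aligned, zero-filled host memory that could be registered as guest RAM.
/// ASSUMPTION: the post uses mmap; std::alloc stands in for it here.
pub struct GuestMemory {
    pub ptr: *mut u8,
    layout: Layout,
}

impl GuestMemory {
    pub fn new(size: usize) -> Self {
        assert!(size > 0 && size % 4096 == 0, "size must be a whole number of pages");
        let layout = Layout::from_size_align(size, 4096).expect("bad guest memory layout");
        // Safety: the layout has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocating guest memory");
        Self { ptr, layout }
    }

    pub fn len(&self) -> usize {
        self.layout.size()
    }

    /// Turn a guest-physical address into a checked offset into the region.
    fn offset(&self, addr: u64, len: usize) -> usize {
        let off = usize::try_from(addr).expect("guest address does not fit in usize");
        let end = off.checked_add(len).expect("guest access wraps around");
        assert!(end <= self.len(), "guest access out of range");
        off
    }

    /// Store a little-endian u64 at a guest-physical address (any alignment).
    pub fn write_u64(&self, addr: u64, value: u64) {
        let off = self.offset(addr, 8);
        // Safety: the range was bounds-checked; [u8; 8] has alignment 1.
        unsafe { self.ptr.add(off).cast::<[u8; 8]>().write(value.to_le_bytes()) }
    }

    /// Load a little-endian u64 from a guest-physical address.
    pub fn read_u64(&self, addr: u64) -> u64 {
        let off = self.offset(addr, 8);
        // Safety: as in write_u64.
        u64::from_le_bytes(unsafe { self.ptr.add(off).cast::<[u8; 8]>().read() })
    }
}

impl Drop for GuestMemory {
    fn drop(&mut self) {
        // Safety: ptr came from alloc_zeroed with this exact layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```

Writing through &self mirrors the listings, where the page-table code takes a shared reference. That is only sound as long as host and guest never touch the same bytes at the same time.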

That is the entire machine. What is missing is everything a physical CPU would inherit from firmware. The CPU comes up expecting 16-bit real mode, with no page tables and no idea it should be running our code. On real hardware the BIOS and a bootloader spend thousands of lines climbing from there up to a modern 64-bit environment. Because we own the virtual CPU, we can skip all of it and just set the registers to the state we want.

Long mode, the modern 64-bit state, needs page tables, so we write them into guest memory ourselves. Identity-mapping the first gigabyte, where physical address equals virtual address, takes exactly three pages. One top-level table points to one pointer table, which points to one directory of 512 entries covering two megabytes each.

/// Identity-map the first 1 GiB with 2 MiB huge pages: PML4 -> PDPT -> PD.
fn write_page_tables(mem: &GuestMemory) {
    mem.write_u64(PML4_ADDR, PDPT_ADDR | PTE_PRESENT | PTE_WRITABLE);
    mem.write_u64(PDPT_ADDR, PD_ADDR | PTE_PRESENT | PTE_WRITABLE);
    for i in 0..512u64 {
        mem.write_u64(PD_ADDR + i * 8, (i << 21) | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE);
    }
}
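
One way to check those three pages without booting anything is to walk them the way the MMU would. The sketch below rebuilds the same tables in a plain byte buffer and translates a few addresses. The table addresses and the translate helper are hypothetical, and a &mut [u8] stands in for GuestMemory; the flag bits are the architectural ones.

```rust
const PTE_PRESENT: u64 = 1 << 0;
const PTE_WRITABLE: u64 = 1 << 1;
const PTE_HUGE: u64 = 1 << 7;
// ASSUMPTION: where the tables live is made up for this sketch.
const PML4_ADDR: u64 = 0x1000;
const PDPT_ADDR: u64 = 0x2000;
const PD_ADDR: u64 = 0x3000;
/// Bits 12..51 of an entry hold the physical address it points at.
const ADDR_MASK: u64 = 0x000f_ffff_ffff_f000;

fn read_u64(mem: &[u8], addr: u64) -> u64 {
    let a = addr as usize;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&mem[a..a + 8]);
    u64::from_le_bytes(bytes)
}

fn write_u64(mem: &mut [u8], addr: u64, value: u64) {
    let a = addr as usize;
    mem[a..a + 8].copy_from_slice(&value.to_le_bytes());
}

/// Same layout as the listing: PML4 -> PDPT -> PD with 512 huge pages.
fn write_page_tables(mem: &mut [u8]) {
    write_u64(mem, PML4_ADDR, PDPT_ADDR | PTE_PRESENT | PTE_WRITABLE);
    write_u64(mem, PDPT_ADDR, PD_ADDR | PTE_PRESENT | PTE_WRITABLE);
    for i in 0..512u64 {
        write_u64(mem, PD_ADDR + i * 8, (i << 21) | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE);
    }
}

/// Software page walk for 2 MiB mappings; 4 KiB page tables are not modeled.
fn translate(mem: &[u8], cr3: u64, vaddr: u64) -> Option<u64> {
    let pml4e = read_u64(mem, cr3 + ((vaddr >> 39) & 0x1ff) * 8);
    if pml4e & PTE_PRESENT == 0 {
        return None;
    }
    let pdpte = read_u64(mem, (pml4e & ADDR_MASK) + ((vaddr >> 30) & 0x1ff) * 8);
    if pdpte & PTE_PRESENT == 0 {
        return None;
    }
    let pde = read_u64(mem, (pdpte & ADDR_MASK) + ((vaddr >> 21) & 0x1ff) * 8);
    if pde & PTE_PRESENT == 0 || pde & PTE_HUGE == 0 {
        return None;
    }
    // A huge-page entry supplies bits 21 and up; the low 21 bits pass through.
    Some((pde & ADDR_MASK & !0x1f_ffff) + (vaddr & 0x1f_ffff))
}
```

The arithmetic behind "first gigabyte" is 512 entries times 2 MiB. Anything at or above 0x4000_0000 indexes the second PDPT entry, which is zero, so the walk reports it as unmapped.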

Then we set the control registers to switch on paging and long mode, point them at those tables, and fill in the segment registers by hand. This is the reset sequence a bootloader normally performs, compressed into one struct we write before the CPU runs a single instruction.

// The whole "boot process": start the vCPU already in 64-bit long mode
// with paging and SSE enabled. There is no firmware and no bootloader;
// this struct is everything the guest inherits.
let mut sregs = vcpu.get_sregs().expect("get_sregs");
sregs.cr3 = PML4_ADDR;
sregs.cr4 = CR4_PAE | CR4_OSFXSR | CR4_OSXMMEXCPT;
sregs.cr0 = CR0_PE | CR0_MP | CR0_ET | CR0_NE | CR0_PG;
sregs.efer = EFER_LME | EFER_LMA;
sregs.cs = code_segment();
let data = data_segment();
sregs.ds = data;
sregs.es = data;
sregs.fs = data;
sregs.gs = data;
sregs.ss = data;
sregs.tr = task_segment();
vcpu.set_sregs(&sregs).expect("set_sregs");

let mut regs = vcpu.get_regs().expect("get_regs");
regs.rip = entry;
// The SysV ABI puts rsp at 8 mod 16 on function entry (the call pushed a
// return address). _start is compiled as a normal function, so entering
// with a 16-aligned rsp makes every movaps spill misaligned: #GP, and
// with no IDT, a triple fault.
regs.rsp = STACK_TOP - 8;
regs.rflags = 0x2; // bit 1 is reserved-must-be-one; interrupts stay off
vcpu.set_regs(&regs).expect("set_regs");

We also set the SSE-enable bits here, so the guest can use floating point and vector instructions from its very first line without touching a control register. The instruction pointer goes to the kernel’s entry, the stack pointer to a region we reserved, and the CPU is ready.
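
The constants in that listing are not spelled out. Here is a sketch of them using the architectural bit positions from the x86-64 manuals. The helper functions (boot_control_state, sse_usable, entry_rsp) are hypothetical names added to make the values checkable.

```rust
// CR0
pub const CR0_PE: u64 = 1 << 0; // protected mode
pub const CR0_MP: u64 = 1 << 1; // monitor coprocessor
pub const CR0_EM: u64 = 1 << 2; // x87 emulation; must stay clear for SSE
pub const CR0_ET: u64 = 1 << 4; // extension type
pub const CR0_NE: u64 = 1 << 5; // native x87 error reporting
pub const CR0_PG: u64 = 1 << 31; // paging
// CR4
pub const CR4_PAE: u64 = 1 << 5; // required before entering long mode
pub const CR4_OSFXSR: u64 = 1 << 9; // OS supports FXSAVE: SSE instructions allowed
pub const CR4_OSXMMEXCPT: u64 = 1 << 10; // OS handles SIMD floating-point exceptions
// EFER
pub const EFER_LME: u64 = 1 << 8; // long mode enable
pub const EFER_LMA: u64 = 1 << 10; // long mode active

/// The (cr0, cr4, efer) triple the listing writes into sregs.
pub fn boot_control_state() -> (u64, u64, u64) {
    (
        CR0_PE | CR0_MP | CR0_ET | CR0_NE | CR0_PG,
        CR4_PAE | CR4_OSFXSR | CR4_OSXMMEXCPT,
        EFER_LME | EFER_LMA,
    )
}

/// SSE instructions raise #UD if CR0.EM is set or CR4.OSFXSR is clear.
pub fn sse_usable(cr0: u64, cr4: u64) -> bool {
    cr0 & CR0_EM == 0 && cr4 & CR4_OSFXSR != 0
}

/// Entry stack pointer that looks like a call just pushed a return address.
pub fn entry_rsp(stack_top: u64) -> u64 {
    (stack_top & !0xf) - 8
}
```

Printed in hex, the three registers come out as cr0 = 0x80000033, cr4 = 0x620 and efer = 0x500, which is handy to recognize in a register dump.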

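Now we run it. vcpu.run() enters the guest and returns when the guest does something that needs us. A first kernel only does three such things: it writes bytes to a port as its serial console, asks to exit, or halts. The loop below is the finished one, so it also carries the frame-port arm that the display section relies on later.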

fn run_vcpu(mut vcpu: VcpuFd, gate: Arc<Gate>, proxy: winit::event_loop::EventLoopProxy<()>) {
    let mut stdout = std::io::stdout();
    loop {
        match vcpu.run().expect("KVM_RUN") {
            VcpuExit::IoOut(SERIAL_PORT, data) => {
                stdout.write_all(data).unwrap();
                stdout.flush().unwrap();
            }
            VcpuExit::IoOut(FRAME_PORT, _) => {
                let mut parked = gate.parked.lock().unwrap();
                *parked = true;
                if proxy.send_event(()).is_err() {
                    return; // event loop is gone, we are shutting down
                }
                while *parked {
                    parked = gate.released.wait(parked).unwrap();
                }
            }
            VcpuExit::IoOut(EXIT_PORT, data) => {
                let code = data[0];
                println!("[vmm] guest requested exit with code {code}");
                std::process::exit(code as i32);
            }
            VcpuExit::Hlt => {
                println!("[vmm] guest halted");
                std::process::exit(0);
            }
            VcpuExit::Shutdown => {
                // With no IDT in the guest, any exception escalates to a
                // triple fault and lands here. The triple fault resets the
                // vCPU before we see the exit, so registers show reset state
                // (rip=0xfff0), not the crash site; recovering the faulting
                // state needs KVM_CAP_X86_TRIPLE_FAULT_EVENT.
                let regs = vcpu.get_regs().expect("get_regs");
                eprintln!("[vmm] guest crashed (triple fault); post-reset rip={:#x}", regs.rip);
                std::process::exit(1);
            }
            exit => {
                eprintln!("[vmm] unexpected exit: {exit:?}");
                std::process::exit(1);
            }
        }
    }
}

If the guest ever faults with no handler installed, the fault escalates to a triple fault, and KVM reports it as a shutdown. For a kernel with no exception handling yet, that is a free crash channel. Any bad memory access lands in that arm of the loop.

The kernel

The guest is a no_std Rust program. No standard library, no operating system beneath it, panic set to abort. It talks to the outside world through the one device we agreed on, the serial port, by writing bytes to it with the out instruction.

pub fn outb(port: u16, value: u8) {
    unsafe {
        asm!("out dx, al", in("dx") port, in("al") value, options(nomem, nostack, preserves_flags));
    }
}

pub struct Serial;

impl Write for Serial {
    fn write_str(&mut self, s: &str) -> core::fmt::Result {
        for byte in s.bytes() {
            outb(SERIAL_PORT, byte);
        }
        Ok(())
    }
}

Implementing core::fmt::Write on top of that one byte-at-a-time port gives us writeln! and everything that formats through it, which is the whole console. A first kernel is then just a _start that prints a line and exits:

//! The smallest guest: prove we're alive over the serial port, do one
//! hardware-float computation, and exit.

#![no_std]
#![no_main]

use core::fmt::Write;

use kernel::{Serial, exit};

#[unsafe(no_mangle)]
extern "C" fn _start() -> ! {
    let _ = writeln!(Serial, "hello from ring 0");
    let hypotenuse = libm::sqrtf(3.0f32 * 3.0 + 4.0 * 4.0);
    let _ = writeln!(Serial, "sqrt(3*3 + 4*4) = {hypotenuse}");
    exit(0)
}

Running it prints:

hello from ring 0
sqrt(3*3 + 4*4) = 5
[vmm] guest requested exit with code 0

The sqrt line is there to prove a point. It compiles to a real sqrtss instruction running on the CPU’s floating-point unit, not a software emulation. That takes one piece of setup. Rust’s bare-metal x86-64 target is soft-float by default, meaning it avoids the vector registers entirely, which is the safe choice for a kernel but slow for anything that does math. We want the opposite, so the kernel builds against a small custom target that turns SSE back on.

{
  "arch": "x86_64",
  "code-model": "kernel",
  "cpu": "x86-64",
  "crt-objects-fallback": "false",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
  "disable-redzone": true,
  "features": "-mmx,+sse,+sse2",
  "linker": "rust-lld",
  "linker-flavor": "gnu-lld",
  "llvm-target": "x86_64-unknown-none-elf",
  "max-atomic-width": 64,
  "panic-strategy": "abort",
  "plt-by-default": false,
  "position-independent-executables": false,
  "relocation-model": "static",
  "relro-level": "full",
  "stack-probes": {
    "kind": "inline"
  },
  "static-position-independent-executables": false,
  "target-pointer-width": 64
}

A linker script places the kernel at a fixed address that the hypervisor’s ELF loader reads, and a two-line Cargo config builds the core library from source against this target. With that, the guest is ordinary Rust. It has floats, arrays, iterators, and the borrow checker, just nothing underneath it.

A screen and a mouse

A framebuffer is only a block of memory the display reads, so we make one. We pick a region of guest RAM and agree that it holds 640 by 480 pixels in the format the window expects, and the guest draws by writing pixels there. The hypervisor, which can see all of guest memory, copies that region into a desktop window every frame.
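
To make "the format the window expects" concrete, here is a small sketch of pixel packing and a bounds-checked store. It assumes one u32 per pixel laid out as 0x00RRGGBB, which is what softbuffer documents. The rgb helper mirrors the call that appears in the paint loop later, but its body and put_pixel are illustrations, not the project's code.

```rust
pub const W: usize = 640;
pub const H: usize = 480;

/// Pack three 0.0..=1.0 channels into one 0x00RRGGBB pixel.
/// ASSUMPTION: this byte order is what the host window wants.
pub fn rgb(r: f32, g: f32, b: f32) -> u32 {
    // Clamp, scale to 0..=255, and round to nearest.
    let channel = |c: f32| (c.clamp(0.0, 1.0) * 255.0 + 0.5) as u32;
    (channel(r) << 16) | (channel(g) << 8) | channel(b)
}

/// Write one pixel, silently dropping anything off screen.
pub fn put_pixel(fb: &mut [u32], x: i32, y: i32, color: u32) {
    if x >= 0 && y >= 0 && (x as usize) < W && (y as usize) < H {
        fb[y as usize * W + x as usize] = color;
    }
}
```

At this size the framebuffer is 640 × 480 × 4 = 1,228,800 bytes, so the per-frame copy on the host side is a little over a megabyte.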

The window and its event loop live on the hypervisor’s main thread using winit and softbuffer, while the virtual CPU runs on its own thread. The two take turns. The guest draws a frame, the host displays it and collects input, and only then does the guest draw the next one. One port write is the whole handshake.

guest vCPU thread                 host event loop
draw frame into framebuffer
out to frame port, vCPU exits
                                  blit to window
                                  write input + time to mailbox
resume, next frame
read mailbox, apply input

The guest is stopped the whole time the host touches shared memory, so the mailbox needs no lock. One port write per frame is the entire protocol.

When the guest finishes a frame it writes to a port, which exits to the hypervisor. The host blits the framebuffer, fills a small mailbox in guest memory with the latest input and the current time, and lets the guest run again. Because the guest is stopped the entire time the host touches that shared memory, there is no lock and no race, just a struct the host writes while the guest is parked.

/// If the guest is parked at the doorbell, hand it fresh input and time,
/// then let it run the next frame.
fn release_guest(&mut self) {
    let mut parked = self.gate.parked.lock().unwrap();
    if !*parked {
        return;
    }
    self.frame += 1;
    let mut state = SharedState {
        frame: self.frame,
        time_ns: self.start.elapsed().as_nanos() as u64,
        event_count: self.pending.len() as u32,
        _pad: 0,
        events: [InputEvent { kind: 0, code: 0, value: 0 }; MAX_EVENTS],
    };
    state.events[..self.pending.len()].copy_from_slice(&self.pending);
    self.pending.clear();
    self.mem.write_state(&state);
    *parked = false;
    self.gate.released.notify_all();
}
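
The Gate that both sides share is not shown, but the way it is used (a parked flag behind a lock, a released condition to wait on) suggests a Mutex<bool> paired with a Condvar. The sketch below assumes exactly that and runs the handshake between two ordinary threads. park, release and run_frames are hypothetical names, and the polling loop stands in for the winit event that wakes the real host.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// ASSUMPTION: inferred from gate.parked.lock() and gate.released.wait(..).
#[derive(Default)]
pub struct Gate {
    pub parked: Mutex<bool>,
    pub released: Condvar,
}

/// vCPU side: announce that a frame is ready, then sleep until released.
pub fn park(gate: &Gate) {
    let mut parked = gate.parked.lock().unwrap();
    *parked = true;
    while *parked {
        // wait() drops the lock while asleep and retakes it on wakeup;
        // the loop guards against spurious wakeups.
        parked = gate.released.wait(parked).unwrap();
    }
}

/// Host side: if the guest is parked, let it go. Returns whether it was.
pub fn release(gate: &Gate) -> bool {
    let mut parked = gate.parked.lock().unwrap();
    if !*parked {
        return false;
    }
    // The real host blits and fills the mailbox here, while the guest sleeps.
    *parked = false;
    gate.released.notify_all();
    true
}

/// Drive `frames` handshakes between a fake guest thread and this thread.
pub fn run_frames(frames: u32) -> u32 {
    let gate = Arc::new(Gate::default());
    let guest = {
        let gate = Arc::clone(&gate);
        thread::spawn(move || {
            for _ in 0..frames {
                park(&gate);
            }
        })
    };
    let mut released = 0;
    while released < frames {
        if release(&gate) {
            released += 1;
        } else {
            thread::yield_now();
        }
    }
    guest.join().unwrap();
    released
}
```

Because the flag only flips to false under the same lock the guest waits on, a release can never slip in between the guest setting parked and going to sleep.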

The input events themselves borrow their shape from Linux. Each is eight bytes of type, code, and value. That is the triple evdev uses, minus the timestamp that evdev's input_event puts in front of it, which makes it exactly the layout of a virtio-input event. winit hands us real Linux key codes on Wayland, so the guest reads them with no translation.

// Input events use the evdev/virtio-input shape: {type, code, value}.
// Codes are real Linux evdev codes (winit hands them to the VMM as-is).
pub const EV_KEY: u16 = 0x01;
pub const EV_REL: u16 = 0x02;
pub const EV_ABS: u16 = 0x03;
pub const ABS_X: u16 = 0x00;
pub const ABS_Y: u16 = 0x01;
/// Wheel steps; the value is an i32 stored in the u32.
pub const REL_WHEEL: u16 = 0x08;
pub const BTN_LEFT: u16 = 0x110;
pub const KEY_SPACE: u16 = 57;
pub const KEY_C: u16 = 46;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct InputEvent {
    pub kind: u16,
    pub code: u16,
    pub value: u32,
}

/// Per-frame mailbox. The host fills it between the guest's FRAME_PORT
/// doorbell and the next KVM_RUN, so there is never concurrent access:
/// this is a mailbox, not a lock-free ring.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SharedState {
    pub frame: u64,
    /// Monotonic nanoseconds since the VMM started.
    pub time_ns: u64,
    pub event_count: u32,
    pub _pad: u32,
    pub events: [InputEvent; MAX_EVENTS],
}
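
Since host and guest are separate binaries that only agree on bytes, it can be worth pinning the layout down at compile time. This sketch repeats the two structs with a made-up MAX_EVENTS of 64 (the real value is not given in the post) and adds size assertions plus a round trip for the signed wheel value. wheel_event and wheel_steps are hypothetical helpers.

```rust
use std::mem::{align_of, size_of};

/// ASSUMPTION: the real capacity is not stated; 64 is a placeholder.
pub const MAX_EVENTS: usize = 64;
pub const EV_REL: u16 = 0x02;
pub const REL_WHEEL: u16 = 0x08;

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct InputEvent {
    pub kind: u16,
    pub code: u16,
    pub value: u32,
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct SharedState {
    pub frame: u64,
    pub time_ns: u64,
    pub event_count: u32,
    pub _pad: u32,
    pub events: [InputEvent; MAX_EVENTS],
}

// Both builds fail to compile if the layout ever drifts.
const _: () = assert!(size_of::<InputEvent>() == 8);
const _: () = assert!(size_of::<SharedState>() == 24 + 8 * MAX_EVENTS);
const _: () = assert!(align_of::<SharedState>() == 8);

/// Host side: a signed wheel delta stored bit-for-bit in the u32 field.
pub fn wheel_event(steps: i32) -> InputEvent {
    InputEvent { kind: EV_REL, code: REL_WHEEL, value: steps as u32 }
}

/// Guest side: reinterpret the same bits as signed again.
pub fn wheel_steps(event: &InputEvent) -> i32 {
    event.value as i32
}
```

The explicit _pad field is what keeps the events array at offset 24 without relying on the compiler to insert padding.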

The guest side is a loop that reads the mailbox, applies the events, draws, and signals done. Here it is as a paint program, where the mouse leaves a trail and keys change the color.

loop {
    let state = read_state();
    for event in &state.events[..state.event_count.min(shared::MAX_EVENTS as u32) as usize] {
        match (event.kind, event.code) {
            (EV_ABS, ABS_X) => mouse.0 = (event.value as i32).clamp(0, W - 1),
            (EV_ABS, ABS_Y) => mouse.1 = (event.value as i32).clamp(0, H - 1),
            (EV_KEY, BTN_LEFT) => {
                down = event.value == 1;
                if down {
                    prev = mouse; // don't connect separate strokes
                }
            }
            (EV_KEY, KEY_C) if event.value == 1 => hue += 1.3,
            (EV_KEY, KEY_SPACE) if event.value == 1 => paint_background(canvas()),
            _ => {}
        }
    }

    let brush = hue_to_rgb(hue);
    if down {
        stroke(canvas(), prev, mouse, brush);
    }
    prev = mouse;

    fb().copy_from_slice(canvas());
    let t = state.time_ns as f32 / 1e9;
    let pulse = 0.7 + 0.3 * libm::sinf(t * 5.0);
    crosshair(fb(), mouse.0, mouse.1, rgb(pulse, pulse, pulse));

    frame_done();
}
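
The stroke helper is not shown. One plausible shape for it, as a sketch only, is an integer Bresenham line from the previous mouse position to the current one, so fast drags leave a connected trail instead of dots. This version draws a one-pixel line and takes the canvas dimensions as parameters; the real function takes neither and may well draw a thicker brush.

```rust
/// Draw a 1-pixel line between two points, clipping per pixel.
/// ASSUMPTION: a stand-in for the post's stroke(), not its actual body.
pub fn stroke(
    canvas: &mut [u32],
    width: i32,
    height: i32,
    from: (i32, i32),
    to: (i32, i32),
    color: u32,
) {
    let (mut x, mut y) = from;
    let dx = (to.0 - x).abs();
    let dy = -(to.1 - y).abs();
    let step_x = if x < to.0 { 1 } else { -1 };
    let step_y = if y < to.1 { 1 } else { -1 };
    // Bresenham error term, valid for every octant.
    let mut err = dx + dy;
    loop {
        if x >= 0 && y >= 0 && x < width && y < height {
            canvas[(y * width + x) as usize] = color;
        }
        if (x, y) == to {
            break;
        }
        let doubled = 2 * err;
        if doubled >= dy {
            err += dy;
            x += step_x;
        }
        if doubled <= dx {
            err += dx;
            y += step_y;
        }
    }
}
```

Resetting prev to the mouse position on button-down, as the loop above does, is what keeps this from drawing a line between the end of one stroke and the start of the next.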

A rasterizer

With floats, a framebuffer, and input, the guest has everything a software renderer needs. The torus at the top is a mesh of triangles, projected to the screen and filled one pixel at a time.

The fill step is the classic edge-function rasterizer. For each pixel it checks whether the point lies inside the triangle by testing three signed areas, and if so, interpolates depth to decide whether this triangle is nearer than whatever is already there. The depth buffer is a second array the size of the screen.

fn fill_triangle(fb: &mut [u32], zb: &mut [f32], v: [Screen; 3], color: u32) {
    if v[0].behind || v[1].behind || v[2].behind {
        return; // no near-plane clipping; the camera zoom is clamped instead
    }
    let area2 = edge(v[0], v[1], v[2].x, v[2].y);
    if area2 <= 0.0 {
        return; // backface (or degenerate)
    }
    let min_x = v.iter().fold(f32::MAX, |m, p| m.min(p.x)).max(0.0) as usize;
    let min_y = v.iter().fold(f32::MAX, |m, p| m.min(p.y)).max(0.0) as usize;
    let max_x = (v.iter().fold(0.0f32, |m, p| m.max(p.x)) as usize).min(W - 1);
    let max_y = (v.iter().fold(0.0f32, |m, p| m.max(p.y)) as usize).min(H - 1);
    let inv_area = 1.0 / area2;
    for y in min_y..=max_y {
        let py = y as f32 + 0.5;
        for x in min_x..=max_x {
            let px = x as f32 + 0.5;
            let w0 = edge(v[1], v[2], px, py);
            let w1 = edge(v[2], v[0], px, py);
            let w2 = edge(v[0], v[1], px, py);
            if w0 < 0.0 || w1 < 0.0 || w2 < 0.0 {
                continue;
            }
            let zinv = (w0 * v[0].zinv + w1 * v[1].zinv + w2 * v[2].zinv) * inv_area;
            let idx = y * W + x;
            if zinv > zb[idx] {
                zb[idx] = zinv;
                fb[idx] = color;
            }
        }
    }
}
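
The listing leans on a Screen type and an edge function that are not shown. A sketch of both follows, under the assumption that edge returns twice the signed area of the triangle (a, b, p) with the sign convention that makes the listing's area2 > 0 test pass for front faces. The barycentric helper is hypothetical and only repackages the inside test so it can be tried on its own.

```rust
/// A projected vertex: pixel position, 1/depth, and a near-plane flag.
#[derive(Clone, Copy)]
pub struct Screen {
    pub x: f32,
    pub y: f32,
    pub zinv: f32,
    pub behind: bool,
}

/// Twice the signed area of triangle (a, b, p).
/// ASSUMPTION: the sign convention is a guess consistent with the listing.
pub fn edge(a: Screen, b: Screen, px: f32, py: f32) -> f32 {
    (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x)
}

/// Normalized weights of a point inside a front-facing triangle, else None.
pub fn barycentric(v: [Screen; 3], px: f32, py: f32) -> Option<[f32; 3]> {
    let area2 = edge(v[0], v[1], v[2].x, v[2].y);
    if area2 <= 0.0 {
        return None; // backface or degenerate
    }
    // Each weight is the sub-triangle opposite the vertex it belongs to.
    let w = [
        edge(v[1], v[2], px, py),
        edge(v[2], v[0], px, py),
        edge(v[0], v[1], px, py),
    ];
    if w.iter().any(|&wi| wi < 0.0) {
        return None; // outside at least one edge
    }
    Some([w[0] / area2, w[1] / area2, w[2] / area2])
}
```

The three weights always sum to area2, which is why one multiply by inv_area normalizes the interpolated depth. Storing 1/z and keeping the larger value also implies the depth buffer is cleared to 0.0 each frame, standing for infinitely far away.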

Everything around it is just as plain. The torus is built from two angles of sine and cosine, rotated by the camera, projected by dividing by depth, and each face is shaded by how much it points at the light. About two thousand triangles a frame, all in the guest, all on the CPU we set up by hand. Drag orbits the camera, the wheel zooms, space toggles the spin.
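
As a sketch of the "two angles of sine and cosine", here is one standard way to generate and shade such a mesh. The 40 by 25 grid is an assumption chosen because it lands on two thousand triangles; the radii, the winding order and every function name are illustrative. On the host this uses std's float methods, where the no_std guest would call libm instead.

```rust
pub type Vec3 = [f32; 3];

/// Point on a torus: `u` travels around the ring, `v` around the tube.
pub fn torus_point(major: f32, minor: f32, u: f32, v: f32) -> Vec3 {
    let ring = major + minor * v.cos();
    [ring * u.cos(), minor * v.sin(), ring * u.sin()]
}

/// Split each cell of a segments × sides grid into two triangles.
pub fn torus_mesh(major: f32, minor: f32, segments: usize, sides: usize) -> Vec<[Vec3; 3]> {
    let tau = std::f32::consts::TAU;
    let at = |i: usize, j: usize| {
        torus_point(major, minor, tau * i as f32 / segments as f32, tau * j as f32 / sides as f32)
    };
    let mut triangles = Vec::with_capacity(segments * sides * 2);
    for i in 0..segments {
        for j in 0..sides {
            // Index segments/sides wraps by itself: the angle is a full turn.
            let (a, b, c, d) = (at(i, j), at(i + 1, j), at(i + 1, j + 1), at(i, j + 1));
            triangles.push([a, b, c]);
            triangles.push([a, c, d]);
        }
    }
    triangles
}

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn cross(a: Vec3, b: Vec3) -> Vec3 {
    [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]
}

/// Lambert term for a flat face: cosine between its normal and the light,
/// clamped at zero so faces turned away go dark.
pub fn face_brightness(tri: [Vec3; 3], to_light: Vec3) -> f32 {
    let normal = cross(sub(tri[1], tri[0]), sub(tri[2], tri[0]));
    let scale = (dot(normal, normal) * dot(to_light, to_light)).sqrt();
    if scale == 0.0 { 0.0 } else { (dot(normal, to_light) / scale).max(0.0) }
}
```

One brightness value per triangle, multiplied into the face color, is all that per-face lighting needs; there is no interpolation of normals across the surface.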

Orbiting with the mouse. The window is the hypervisor, the pixels and the camera are the kernel’s.

Running a real Linux binary

The last piece is the one that sounds hardest and turns out to be mechanical. A static Linux executable is self-contained, with no shared libraries and no dynamic linker. To run one, the kernel has to do what the Linux kernel does at exec time, and then answer the handful of system calls the program makes.

The hypervisor drops the executable’s bytes into guest memory. The kernel parses its ELF headers and copies each loadable segment to the address it wants, zeroing the gap that becomes uninitialized data.

/// Load the ET_EXEC image the VMM placed at PROGRAM_ADDR: copy PT_LOAD
/// segments to their link addresses, zero their .bss tails.
fn load_program() -> LoadedElf {
    let len: u64 = read(PROGRAM_ADDR);
    assert!(len != 0, "no program image; run: vmm <kernel> <static-linux-elf>");
    let image = PROGRAM_ADDR + 16;

    assert!(read::<u32>(image) == 0x464c_457f, "not an ELF");
    let e_type: u16 = read(image + 16);
    assert!(e_type == 2, "not ET_EXEC; build with -static -no-pie");
    let entry: u64 = read(image + 24);
    let phoff: u64 = read(image + 32);
    let phent: u16 = read(image + 54);
    let phnum: u16 = read(image + 56);

    let mut phdr_vaddr = 0u64;
    let mut brk_start = 0u64;
    for i in 0..phnum as u64 {
        let ph = image + phoff + i * phent as u64;
        let p_type: u32 = read(ph);
        let p_offset: u64 = read(ph + 8);
        let p_vaddr: u64 = read(ph + 16);
        let p_filesz: u64 = read(ph + 32);
        let p_memsz: u64 = read(ph + 40);
        if p_type == 6 {
            phdr_vaddr = p_vaddr; // PT_PHDR
        }
        if p_type != 1 {
            continue; // not PT_LOAD
        }
        assert!(p_vaddr + p_memsz <= BRK_MAX, "segment outside program region");
        unsafe {
            core::ptr::copy_nonoverlapping(
                (image + p_offset) as *const u8,
                p_vaddr as *mut u8,
                p_filesz as usize,
            );
            core::ptr::write_bytes((p_vaddr + p_filesz) as *mut u8, 0, (p_memsz - p_filesz) as usize);
        }
        if phdr_vaddr == 0 && phoff >= p_offset && phoff < p_offset + p_filesz {
            phdr_vaddr = p_vaddr + (phoff - p_offset);
        }
        brk_start = brk_start.max((p_vaddr + p_memsz + 0xfff) & !0xfff);
    }
    LoadedElf { entry, phdr_vaddr, phent: phent as u64, phnum: phnum as u64, brk_start }
}
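
The same header offsets can be exercised on the host with safe slice reads, which is a convenient way to test a loader before it ever runs in ring 0. The sketch below is not the kernel's loader: it only lists the PT_LOAD segments, and it adds two checks the listing leaves out (p_memsz at least p_filesz, and file ranges inside the image). Segment, bytes and load_segments are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub struct Segment {
    pub offset: u64,
    pub vaddr: u64,
    pub filesz: u64,
    pub memsz: u64,
}

/// Copy N bytes out of the image, or None if the range is out of bounds.
fn bytes<const N: usize>(image: &[u8], at: usize) -> Option<[u8; N]> {
    let mut out = [0u8; N];
    out.copy_from_slice(image.get(at..at.checked_add(N)?)?);
    Some(out)
}

fn u64_at(image: &[u8], at: usize) -> Option<u64> {
    Some(u64::from_le_bytes(bytes(image, at)?))
}

/// Parse an ELF64 ET_EXEC header; return the entry point and PT_LOAD segments.
pub fn load_segments(image: &[u8]) -> Option<(u64, Vec<Segment>)> {
    if bytes::<4>(image, 0)? != *b"\x7fELF" {
        return None;
    }
    if u16::from_le_bytes(bytes(image, 16)?) != 2 {
        return None; // e_type is not ET_EXEC
    }
    let entry = u64_at(image, 24)?;
    let phoff = usize::try_from(u64_at(image, 32)?).ok()?;
    let phent = u16::from_le_bytes(bytes(image, 54)?) as usize;
    let phnum = u16::from_le_bytes(bytes(image, 56)?) as usize;

    let mut segments = Vec::new();
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phent)?)?;
        if u32::from_le_bytes(bytes(image, ph)?) != 1 {
            continue; // not PT_LOAD
        }
        let segment = Segment {
            offset: u64_at(image, ph + 8)?,
            vaddr: u64_at(image, ph + 16)?,
            filesz: u64_at(image, ph + 32)?,
            memsz: u64_at(image, ph + 40)?,
        };
        // Extra validation: the .bss tail cannot be negative, and the file
        // bytes must actually be present in the image.
        let file_end = segment.offset.checked_add(segment.filesz)?;
        if segment.memsz < segment.filesz || file_end > image.len() as u64 {
            return None;
        }
        segments.push(segment);
    }
    Some((entry, segments))
}
```

A malformed header makes this return None instead of reading past the buffer, which matters less for a binary you built yourself and more the moment the image comes from anywhere else.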

Then it builds the stack the program expects to wake up on. Linux hands a new process its argument count, its arguments, its environment, and an auxiliary vector of key-value pairs describing the machine. A C library reads all of it before main runs, so we lay it out in the exact order.

/// Build the SysV process-entry stack: rsp points at argc, then argv,
/// envp and auxv follow, all NULL-terminated. rsp must be 16-aligned.
fn build_stack(elf: &LoadedElf) -> u64 {
    let argv0 = PROGRAM_STACK_TOP - 32;
    for (i, byte) in b"hello\0".iter().enumerate() {
        write_val(argv0 + i as u64, *byte);
    }
    let random = PROGRAM_STACK_TOP - 16; // 16 "random" bytes for AT_RANDOM
    write_val(random, 0x243f_6a88_85a3_08d3u64);
    write_val(random + 8, 0x1319_8a2e_0370_7344u64);

    let vector: [u64; 20] = [
        1,        // argc
        argv0, 0, // argv, NULL
        0,        // envp NULL
        3, elf.phdr_vaddr, // AT_PHDR
        4, elf.phent,      // AT_PHENT
        5, elf.phnum,      // AT_PHNUM
        6, 4096,           // AT_PAGESZ
        25, random,        // AT_RANDOM
        9, elf.entry,      // AT_ENTRY
        23, 0,             // AT_SECURE
        0, 0,              // AT_NULL
    ];
    let rsp = (PROGRAM_STACK_TOP - 64 - size_of_val(&vector) as u64) & !0xf;
    for (i, value) in vector.iter().enumerate() {
        write_val(rsp + i as u64 * 8, *value);
    }
    rsp
}
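
To see why the order matters, it helps to read the stack back the way a C library's startup code does: argc first, then argv up to its NULL, then envp up to its NULL, then key-value pairs until AT_NULL. This is a host-side sketch over a plain u64 slice. auxv_lookup is a hypothetical helper; the AT_ numbers are the standard Linux ones.

```rust
pub const AT_NULL: u64 = 0;
pub const AT_PHDR: u64 = 3;
pub const AT_PAGESZ: u64 = 6;
pub const AT_ENTRY: u64 = 9;
pub const AT_RANDOM: u64 = 25;

/// Find one auxiliary-vector value on a process-entry stack image.
/// `stack[0]` is the word rsp points at, i.e. argc.
pub fn auxv_lookup(stack: &[u64], key: u64) -> Option<u64> {
    let argc = usize::try_from(*stack.first()?).ok()?;
    // Skip argc itself and the argc argv pointers.
    let mut i = argc.checked_add(1)?;
    if *stack.get(i)? != 0 {
        return None; // argv must be NULL-terminated
    }
    i += 1;
    // Skip envp up to and including its NULL.
    while *stack.get(i)? != 0 {
        i += 1;
    }
    i += 1;
    // What remains is (key, value) pairs ending in AT_NULL.
    loop {
        let (k, v) = (*stack.get(i)?, *stack.get(i + 1)?);
        if k == AT_NULL {
            return None;
        }
        if k == key {
            return Some(v);
        }
        i += 2;
    }
}
```

Nothing in the layout records where one section ends except the NULL terminators, so a missing one sends the reader off into whatever words happen to follow.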

When the program runs a syscall instruction, the CPU jumps to an address we register in a model-specific register. That handler is the only hand-written assembly routine in the whole project, about twenty lines; everything else is single-instruction inline asm like outb. It saves the program’s registers, switches to our own stack, calls a Rust function, and returns. It cannot use the usual sysret, because that would drop to user mode and the program is already running alongside the kernel in ring 0, so it returns with a plain jump.

/// `syscall` lands here (LSTAR). No stack switch happens in ring 0, so we
/// spill state to a fixed frame, hop onto our own stack, and return with a
/// plain jmp: sysret would force us to ring 3.
#[unsafe(naked)]
extern "C" fn syscall_entry() {
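    naked_asm!(
        "mov [{f} + 0], rax",  // syscall number
        "mov [{f} + 8], rdi",
        "mov [{f} + 16], rsi",
        "mov [{f} + 24], rdx",
        "mov [{f} + 32], r10", // arg 4 lives in r10, not rcx
        "mov [{f} + 40], r8",
        "mov [{f} + 48], r9",
        "mov [{f} + 56], rcx", // return rip
        "mov [{f} + 64], r11", // return rflags
        "mov [{f} + 72], rsp", // program stack
        "mov rsp, {kstack}",
        "call {handler}",
        // Linux preserves every register except rax, rcx and r11 across a
        // syscall, so put back the argument registers the handler may clobber.
        "mov rdi, [{f} + 8]",
        "mov rsi, [{f} + 16]",
        "mov rdx, [{f} + 24]",
        "mov r10, [{f} + 32]",
        "mov r8, [{f} + 40]",
        "mov r9, [{f} + 48]",
        // Restore rflags while still on our own stack: a push onto the
        // program's stack would overwrite its red zone. The movs that follow
        // leave the flags untouched.
        "push qword ptr [{f} + 64]",
        "popfq",
        "mov rsp, [{f} + 72]",
        "mov rcx, [{f} + 56]",
        "jmp rcx",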
        f = const FRAME,
        kstack = const SYSCALL_STACK_TOP,
        handler = sym syscall_handler,
    )
}
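
Registering that entry point is a couple of MSR writes. The sketch below uses the architectural MSR numbers, but the sequence itself is an assumption: the sregs listing earlier sets only EFER_LME and EFER_LMA, so something has to turn on EFER.SCE before a syscall instruction can work, and here the kernel is assumed to do it. The functions are unsafe and hypothetical; the project's own wrmsr is evidently a safe wrapper.

```rust
pub const MSR_EFER: u32 = 0xC000_0080;
pub const MSR_LSTAR: u32 = 0xC000_0082; // 64-bit syscall entry point
pub const MSR_FS_BASE: u32 = 0xC000_0100; // TLS pointer used by arch_prctl
pub const EFER_SCE: u64 = 1 << 0; // syscall enable

/// wrmsr and rdmsr carry the 64-bit value split across edx:eax.
pub fn split_msr(value: u64) -> (u32, u32) {
    (value as u32, (value >> 32) as u32)
}

pub fn join_msr(lo: u32, hi: u32) -> u64 {
    ((hi as u64) << 32) | lo as u64
}

/// Safety: ring 0 only; writing the wrong MSR can take the machine down.
#[cfg(target_arch = "x86_64")]
pub unsafe fn wrmsr(msr: u32, value: u64) {
    let (lo, hi) = split_msr(value);
    unsafe {
        core::arch::asm!("wrmsr", in("ecx") msr, in("eax") lo, in("edx") hi,
            options(nostack, preserves_flags));
    }
}

/// Safety: ring 0 only.
#[cfg(target_arch = "x86_64")]
pub unsafe fn rdmsr(msr: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        core::arch::asm!("rdmsr", in("ecx") msr, out("eax") lo, out("edx") hi,
            options(nomem, nostack, preserves_flags));
    }
    join_msr(lo, hi)
}

/// Point `syscall` at `entry`. ASSUMPTION: the kernel enables SCE itself.
/// Safety: ring 0 only, and `entry` must be a valid handler address.
#[cfg(target_arch = "x86_64")]
pub unsafe fn enable_syscall(entry: u64) {
    unsafe {
        wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_SCE);
        wrmsr(MSR_LSTAR, entry);
    }
}
```

In the kernel the call would presumably be something like enable_syscall(syscall_entry as u64), made once before jumping to the loaded program. The same wrmsr is what the arch_prctl arm of the handler uses to set FS base.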

The Rust side is a match on the syscall number, and it is short because a small program asks for little. write sends its bytes to our serial console. brk and mmap hand back slices of a bump allocator. arch_prctl sets the thread-pointer register the C library needs. exit_group writes our exit port.

extern "C" fn syscall_handler() -> u64 {
    let nr: u64 = read(FRAME);
    let (a1, a2, a3): (u64, u64, u64) = (read(FRAME + 8), read(FRAME + 16), read(FRAME + 24));
    let a4: u64 = read(FRAME + 32);
    match nr {
        1 | 20 if a1 != 1 && a1 != 2 => EBADF,
        1 => {
            // write(fd, buf, count)
            write_out(a2, a3);
            a3
        }
        20 => {
            // writev(fd, iov, iovcnt)
            let mut total = 0u64;
            for i in 0..a3 {
                let base: u64 = read(a2 + i * 16);
                let len: u64 = read(a2 + i * 16 + 8);
                write_out(base, len);
                total += len;
            }
            total
        }
        16 => ENOTTY, // ioctl: stdout is not a tty (musl probes TIOCGWINSZ)
        12 => {
            // brk(addr)
            if a1 >= BRK.load(Relaxed) && a1 < BRK_MAX {
                BRK.store(a1, Relaxed);
            }
            BRK.load(Relaxed)
        }
        9 => {
            // mmap(addr, len, prot, flags, fd, off): anonymous only
            if a4 & 0x20 == 0 {
                return ENOSYS;
            }
            let len = (a2 + 0xfff) & !0xfff;
            let addr = MMAP.fetch_add(len, Relaxed);
            if addr + len > MMAP_MAX { ENOMEM } else { addr }
        }
        11 => 0, // munmap: the arena never shrinks
        158 => match a1 {
            0x1002 => {
                wrmsr(MSR_FS_BASE, a2); // arch_prctl(ARCH_SET_FS): TLS pointer
                0
            }
            0x1003 => {
                write_val(a2, rdmsr(MSR_FS_BASE));
                0
            }
            _ => EINVAL,
        },
        218 => 1,            // set_tid_address
        13 | 14 => 0,        // rt_sigaction / rt_sigprocmask: pretend
        39 | 186 => 1,       // getpid / gettid
        102 | 104 | 107 | 108 => 0, // getuid / getgid / geteuid / getegid
        60 | 231 => {
            // exit / exit_group
            let _ = writeln!(Serial, "[kernel] program exited with {}", a1 as i64);
            exit(a1 as u8);
        }
        _ => {
            let _ = writeln!(Serial, "[kernel] unhandled syscall {nr}");
            ENOSYS
        }
    }
}
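
Two details the handler relies on are worth spelling out: error returns are negative errno values carried in an unsigned register, and the allocator behind brk and mmap only ever moves forward. The sketch below shows both under stated assumptions. The errno numbers are the Linux ones; is_error and Bump are hypothetical, and unlike the fetch_add in the listing this variant leaves the pointer alone when a request does not fit.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// A Linux syscall reports failure as -errno in rax.
pub const fn errno(code: i64) -> u64 {
    (-code) as u64
}

pub const EBADF: u64 = errno(9);
pub const ENOMEM: u64 = errno(12);
pub const EINVAL: u64 = errno(22);
pub const ENOTTY: u64 = errno(25);
pub const ENOSYS: u64 = errno(38);

/// How a libc tells failure from success: -4095..=-1 are errors.
pub fn is_error(ret: u64) -> bool {
    ret > (-4096i64) as u64
}

/// Page-granular bump allocator over [start, end).
pub struct Bump {
    next: AtomicU64,
    end: u64,
}

impl Bump {
    pub const fn new(start: u64, end: u64) -> Self {
        Self { next: AtomicU64::new(start), end }
    }

    /// Hand out `len` bytes rounded up to whole pages, or ENOMEM.
    /// Load-then-store is enough here only because there is a single vCPU.
    pub fn alloc(&self, len: u64) -> u64 {
        let Some(len) = len.checked_add(0xfff).map(|l| l & !0xfff) else {
            return ENOMEM;
        };
        let addr = self.next.load(Relaxed);
        match addr.checked_add(len) {
            Some(end) if end <= self.end => {
                self.next.store(end, Relaxed);
                addr
            }
            _ => ENOMEM,
        }
    }
}
```

Returning 0 from munmap fits this design: memory is never reused, so freeing has nothing to do, and a short-lived program exits long before the arena runs out.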

The test program is plain C, compiled against musl as a static binary. Nothing about it knows it is not running on Linux.

#include <stdio.h>

int main(void) {
    printf("hello from Linux userspace\n");
    for (int i = 1; i <= 3; i++) {
        printf("  %d squared is %d\n", i, i * i);
    }
    return 42;
}

Running it inside the guest prints:

hello from Linux userspace
  1 squared is 1
  2 squared is 4
  3 squared is 9
[kernel] program exited with 42

The printf calls reach write, which reaches our serial port. The return value from main becomes an exit_group that becomes our exit port, and the code propagates all the way back to the shell that launched the hypervisor.

Where this stops

This is about as far as the easy road goes. The program runs in ring 0 with no memory protection, one process, no threads, no files. Each of those is a real project on its own. That gap is also why a hobby kernel that wants to run something like Mesa for real OpenGL is a multi-year effort rather than a weekend one. It would mean implementing enough of the Linux ABI to satisfy a full C library and a pile of shared objects, not the two dozen syscalls a static hello world touches.

But the shape of the thing is all here, and it is smaller than it sounds. A hypervisor that fits on a screen, a kernel with floats and a framebuffer, and a syscall layer thin enough to read in one sitting. Every layer between the metal and the pixels, and none of it a black box.

The full code is at github.com/filipkunc/rust-kvm-os.