hxp | 0CTF Quals 2019: "flropyd" writeup

Wed, 27 March 2019 ~ hlt

0CTF Quals 2019: "flropyd" writeup

People always say that ROP is Turing complete, show me by implementing the Floyd-Warshall algorithm.

The description already says exactly what is going to happen here: We are provided with a program that

prints the address of malloc to give us access to the libc (libc6 2.27-3, Ubuntu x64),
reads a ROP chain from the input,
generates random fully-connected graphs,
and verifies that the ROP chain actually performs the Floyd-Warshall algorithm.

Floyd-Warshall

The Floyd-Warshall algorithm is one of the simplest graph algorithms around. It finds the shortest paths from any source to any destination in $\mathcal{O}(|V|^3)$ time, although in this version it only keeps track of the distances rather than the actual paths. Because our graph is fully connected, we can initialize the dist array that keeps the current shortest distances with the adjacency matrix of the graph. A Python implementation takes all of five lines, our solution ROP chain is a bit bigger.

# Assume that dist is already initialized, and
# node_count is the number of nodes in the graph
for k in range(node_count):
    for i in range(node_count):
        for j in range(node_count):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                # Going i → k → j is shorter than taking the direct route
                dist[i][j] = dist[i][k] + dist[k][j]

Verifying ROP chains

I will briefly go over how exactly the verification takes place, because it does constrain us during ROP chain development.

In order to obtain the flag, we need to “win” ten rounds. For each round, the program generates a random fully-connected graph with 32 to 63 nodes and random weights (up to 0xFFF) on each edge. It then forks off a child process to which it connects using ptrace.

In the child process, execution time is limited to one second via alarm. Most useful bits of code in the binary itself are zeroed (including the reference implementation of Floyd-Warshall), and syscalls (except exit) are disabled via seccomp.

Then, the ROP chain is copied on the stack, without any protection mechanisms:

void child_fn(void)
{
    char stack[16];
    /* ... */
    memcpy(stack, &rop_chain, 0x10000);
    return;
}

We start executing at byte 16 of our submitted ROP chain as soon as this function returns.

In the meantime, the parent process waits for the child to receive a signal — this has to be the SIGALRM received when the one-second timer is over. If we actually exit from the child process, verification will fail (even though we are technically allowed to).

Then, the parent computes its own reference solution, and starts comparing them to the solution in the child by inspecting the child’s memory using ptrace, entry by entry. If all values are correct, we pass the round.

A straightforward trick that makes debugging a lot easier is to

replace /dev/urandom in the binary with a fixed file to always generate the same graph,
use GDB to dump the reference result for this graph once,
remove the fork() and ptrace() so that the process only executes your ROP chain, and
remove the limitation on system calls to allow us to use write to debug (this may not show up on your terminal, but strace can capture it once the calls to ptrace() are gone).

Also very helpful is to intentionally jump to 0x0 to cause segmentation faults and inspect the core files in GDB later on.

Building the ROP chain

The number of nodes in the graph is stored at 0x602060, immediately followed by the edge weights as a two-dimensional array of 64-bit integers (with size 64 in both dimensions).

We pull our gadgets from the libc using ropper. After the fact, there are a few places where it would have been possible to improve on this ROP chain, but once everything else is set, changes at the last minute (like forgetting that the weights actually take up eight bytes per entry) are best done minimally, without rewriting everything else.

Because reading through 400 lines of ROP chain generator code is incredibly boring and not exactly a fun activity, I will just highlight the general principles of how we constructed the ROP chain for this challenge.

Notes on gadgets

Many of the gadgets are really straightforward to find in the libc, but there are a couple that I used a number of times that have some side effects that need to be compensated. Most of them just clobber a register (for example, our gadget for mov rsi, [rsi + 0x70] at 0x052419 clears eax), but some are a bit weirder.

Two of our gadgets end in a ret X (pop the return address from the stack, and then remove another X bytes), which requires us to add padding bytes after the next return address. We abuse this behavior of ret 0x5a0 (at 0x15d0c4) as its own gadget to implement conditionals (more on this later), but need to compensate for it in imul cl (at 0x0b5878) which we use all over the place; it is followed by a ret 8.

Conditionals

In order to implement a simple if-else conditional, we use an indirect jump (jmp rax at 0x021ed1) to decide between a ret gadget (continue with the next gadget) or the ret 0x5a0 described earlier (continue with the next gadget, but then skip 0x5a0 bytes of our ROP chain).

Because ret is at the very front of the libc, and ret 0x5a0 at the very back, we want to compute addressof(ret) + (addressof(ret 0x5a0) - addressof(ret)) * comparison_result. This direct approach, of course, is limited by the fact that our imul cl gadget can only handle single-byte inputs, and the difference of addresses takes at least two bytes. Instead, we extend the single-bit comparison result to a full 32 bits, and use a and [rdi], eax gadget (0x03f938) to build the offset.

Here is how that works:

First, we use a comparison gadget (cmp rdx, rdi, setae al at 0x1300e9) to set al to 1 if rdx $\small\geq$ rdi
Then, we move that value into rcx (mov ecx, rax at 0x0a80a4, clobbering a bit of memory in the process), set rax to 0xFF, and use the imul cl gadget from earlier in order to set ax to all ones if the condition is met, or all zeroes if it is not.
From here, we use cwde (0x003c41) to sign-extend this into eax and store the result in memory.
With and [rdi], eax, we modify that memory to contain the offset from ret.
Finally, we retrieve that value, add the address of ret, and jump there.
An extra ret (= no-op in ROP chains) is added to deal with ret 0x5a0 executing the next instruction before jumping.

Index updates

As you can see, implementing conditionals like this is quite a hassle. After every loop iteration, we need to increment the indices. Translating the nested for-loops from the example into a single do-while-loop gives us the following update step:

def next(index):
    if index + 1 < size:
        return index + 1
    else:
        return 0

j = next(j)
if j == 0:
    i = next(i)
if j == 0 and i == 0:
    k = next(k)

To do this, we use a similar method to that used for the conditionals, but without the jumping. Instead, we compute the index update for each index directly:

We use our comparison gadget to check if index + 1 $\small\geq$ size
The same multiplication trick sets rax to 0xFF if the comparison matches, otherwise to 0.
We and this value on top of the index + 1 in rdx we computed for the comparison, giving us the result of next(index), which we store in memory.
The one-bit results of additional checks (e.g. j == 0 for the update to i) are collected using and [rdi], eax.
We compute the update for the index (next(index) - index) and use the multiplication gadget again (with the results from any additional checks) to set the final update to zero if these checks fail.
Finally, we apply this update and store the result back in memory.

Jumps and Loops

How do you write a loop in a ROP chain? In theory, this is simple — just store rsp somewhere and restore it when you want to loop. However, the gadgets we have to actually access rsp are not exactly nice to use.

Saving `rsp`

To save rsp, we use a rather long gadget at 0x03ecd1:

mov rsi, rsp;
ror rax, 0x11;
xor rax, qword ptr fs:[0x30];
jmp rax;

Because of the jmp rax at the end, we need to first invert the pointer (de)mangling code on our jump target, which requires us to use another bunch of gadgets.

First, we set 8 bytes of temporary storage to -1 (0xFFFF…)
We obtain the pointer mangling key from fs:[0x30] using a mov rax, fs:[rax] at 0x145c97
To XOR it with our jump target (the address of a ret gadget), there is a gadget at 0x0438fa (that also subtracts the jump target from the result, so we need to add it back)
Then, we can use the rol rax, cl at 0x03f934 (which clobbers even more memory and registers) to finish “encrypting” the pointer, and finally copy rsp into rsi.

We adjust the stored stack pointer by the length of this part of the ROP chain (so that restoring it points us back to the start), and keep it in memory until the end of the loop body.

There, we use a conditional construction like the one described above to skip the restoration of rsp when the loop is done.

Restoring `rsp`

Restoring rsp is a little bit simpler, although the only useful gadget here first requires us to store the value in r8 (then, mov rsp, r8, mov rbp, r9, jmp rdx at 0x03eca9 gives us a straightforward way to jump to the start of the loop).

Accessing r8 is a bit hairy, but we can write to it from rdx using a gadget at 0x044350.

Termination

Finally, the question remains of what to do at the end of the program. Calling C’s exit() is out of the question (it uses the exit_group syscall, which we are not allowed to use by the seccomp settings), and using the older exit syscall directly means that verification fails because the process is gone before the parent can read its memory using ptrace(). Ultimately, our method of choice is to load the address of a jmp rax gadget (0x021ed1) into rax, which will immediately lead to an infinite loop until the SIGALRM hits.

Getting the flag

And indeed, after much debugging:

[*] Establishing connection
[*] Detecting memory layout
    malloc() is at 0x007fb867e20070
    libc is at     0x007fb867d89000
[*] Building ROP chain
[*] ROP chain has size 0x1a60
[*] Sending ROP chain
[*] Round 1
[*] Round 2
[*] Round 3
[*] Round 4
[*] Round 5
[*] Round 6
[*] Round 7
[*] Round 8
[*] Round 9
[*] Round 10
[*] Got flag: flag{for_k_in_N_for_i_in_N_for_j_in_N}