People always say that ROP is Turing complete, show me by implementing the Floyd-Warshall algorithm.
The description already says exactly what is going to happen here: We are provided with a program that
malloc
to give us access to the libc (libc6 2.27-3, Ubuntu x64),The Floyd-Warshall algorithm is one of the simplest graph algorithms around. It finds the shortest paths from any source to any destination in $\mathcal{O}(|V|^3)$ time, although in this version it only keeps track of the distances rather than the actual paths. Because our graph is fully connected, we can initialize the dist
array that keeps the current shortest distances with the adjacency matrix of the graph. A Python implementation takes all of five lines, our solution ROP chain is a bit bigger.
# Assume that dist is already initialized, and
# node_count is the number of nodes in the graph
for k in range(node_count):
for i in range(node_count):
for j in range(node_count):
if dist[i][k] + dist[k][j] < dist[i][j]:
# Going i → k → j is shorter than taking the direct route
dist[i][j] = dist[i][k] + dist[k][j]
I will briefly go over how exactly the verification takes place, because it does constrain us during ROP chain development.
In order to obtain the flag, we need to “win” ten rounds. For each round, the program generates a random fully-connected graph with 32 to 63 nodes and random weights (up to 0xFFF) on each edge. It then forks off a child process to which it connects using ptrace
.
In the child process, execution time is limited to one second via alarm
. Most useful bits of code in the binary itself are zeroed (including the reference implementation of Floyd-Warshall), and syscalls (except exit
) are disabled via seccomp.
Then, the ROP chain is copied on the stack, without any protection mechanisms:
void child_fn(void)
{
char stack[16];
/* ... */
memcpy(stack, &rop_chain, 0x10000);
return;
}
We start executing at byte 16 of our submitted ROP chain as soon as this function returns.
In the meantime, the parent process waits for the child to receive a signal — this has to be the SIGALRM received when the one-second timer is over. If we actually exit
from the child process, verification will fail (even though we are technically allowed to).
Then, the parent computes its own reference solution, and starts comparing them to the solution in the child by inspecting the child’s memory using ptrace
, entry by entry. If all values are correct, we pass the round.
A straightforward trick that makes debugging a lot easier is to
/dev/urandom
in the binary with a fixed file to always generate the same graph,fork()
and ptrace()
so that the process only executes your ROP chain, andwrite
to debug (this may not show up on your terminal, but strace
can capture it once the calls to ptrace()
are gone).Also very helpful is to intentionally jump to 0x0 to cause segmentation faults and inspect the core files in GDB later on.
The number of nodes in the graph is stored at 0x602060, immediately followed by the edge weights as a two-dimensional array of 64-bit integers (with size 64 in both dimensions).
We pull our gadgets from the libc using ropper. After the fact, there are a few places where it would have been possible to improve on this ROP chain, but once everything else is set, changes at the last minute (like forgetting that the weights actually take up eight bytes per entry) are best done minimally, without rewriting everything else.
Because reading through 400 lines of ROP chain generator code is incredibly boring and not exactly a fun activity, I will just highlight the general principles of how we constructed the ROP chain for this challenge.
Many of the gadgets are really straightforward to find in the libc, but there are a couple that I used a number of times that have some side effects that need to be compensated. Most of them just clobber a register (for example, our gadget for mov rsi, [rsi + 0x70]
at 0x052419 clears eax
), but some are a bit weirder.
Two of our gadgets end in a ret X
(pop the return address from the stack, and then remove another X bytes), which requires us to add padding bytes after the next return address. We abuse this behavior of ret 0x5a0
(at 0x15d0c4) as its own gadget to implement conditionals (more on this later), but need to compensate for it in imul cl
(at 0x0b5878) which we use all over the place; it is followed by a ret 8
.
In order to implement a simple if-else conditional, we use an indirect jump (jmp rax
at 0x021ed1) to decide between a ret
gadget (continue with the next gadget) or the ret 0x5a0
described earlier (continue with the next gadget, but then skip 0x5a0 bytes of our ROP chain).
Because ret
is at the very front of the libc, and ret 0x5a0
at the very back, we want to compute addressof(ret) + (addressof(ret 0x5a0) - addressof(ret)) * comparison_result
. This direct approach, of course, is limited by the fact that our imul cl
gadget can only handle single-byte inputs, and the difference of addresses takes at least two bytes. Instead, we extend the single-bit comparison result to a full 32 bits, and use a and [rdi], eax
gadget (0x03f938) to build the offset.
Here is how that works:
cmp rdx, rdi
, setae al
at 0x1300e9) to set al
to 1 if rdx
$\small\geq$ rdi
rcx
(mov ecx, rax
at 0x0a80a4, clobbering a bit of memory in the process), set rax
to 0xFF, and use the imul cl
gadget from earlier in order to set ax
to all ones if the condition is met, or all zeroes if it is not.cwde
(0x003c41) to sign-extend this into eax and store the result in memory.and [rdi], eax
, we modify that memory to contain the offset from ret
.ret
, and jump there.ret
(= no-op in ROP chains) is added to deal with ret 0x5a0
executing the next instruction before jumping.As you can see, implementing conditionals like this is quite a hassle. After every loop iteration, we need to increment the indices. Translating the nested for-loops from the example into a single do-while-loop gives us the following update step:
def next(index):
if index + 1 < size:
return index + 1
else:
return 0
j = next(j)
if j == 0:
i = next(i)
if j == 0 and i == 0:
k = next(k)
To do this, we use a similar method to that used for the conditionals, but without the jumping. Instead, we compute the index update for each index directly:
index + 1
$\small\geq$ size
rax
to 0xFF if the comparison matches, otherwise to 0.and
this value on top of the index + 1
in rdx
we computed for the comparison, giving us the result of next(index)
, which we store in memory.j == 0
for the update to i
) are collected using and [rdi], eax
.next(index) - index
) and use the multiplication gadget again (with the results from any additional checks) to set the final update to zero if these checks fail.How do you write a loop in a ROP chain? In theory, this is simple — just store rsp
somewhere and restore it when you want to loop. However, the gadgets we have to actually access rsp
are not exactly nice to use.
rsp
To save rsp
, we use a rather long gadget at 0x03ecd1:
mov rsi, rsp;
ror rax, 0x11;
xor rax, qword ptr fs:[0x30];
jmp rax;
Because of the jmp rax
at the end, we need to first invert the pointer (de)mangling code on our jump target, which requires us to use another bunch of gadgets.
fs:[0x30]
using a mov rax, fs:[rax]
at 0x145c97ret
gadget), there is a gadget at 0x0438fa (that also subtracts the jump target from the result, so we need to add it back)rol rax, cl
at 0x03f934 (which clobbers even more memory and registers) to finish “encrypting” the pointer, and finally copy rsp
into rsi
.We adjust the stored stack pointer by the length of this part of the ROP chain (so that restoring it points us back to the start), and keep it in memory until the end of the loop body.
There, we use a conditional construction like the one described above to skip the restoration of rsp
when the loop is done.
rsp
Restoring rsp
is a little bit simpler, although the only useful gadget here first requires us to store the value in r8
(then, mov rsp, r8
, mov rbp, r9
, jmp rdx
at 0x03eca9 gives us a straightforward way to jump to the start of the loop).
Accessing r8
is a bit hairy, but we can write to it from rdx
using a gadget at 0x044350.
Finally, the question remains of what to do at the end of the program. Calling C’s exit()
is out of the question (it uses the exit_group
syscall, which we are not allowed to use by the seccomp settings), and using the older exit
syscall directly means that verification fails because the process is gone before the parent can read its memory using ptrace()
. Ultimately, our method of choice is to load the address of a jmp rax
gadget (0x021ed1) into rax
, which will immediately lead to an infinite loop until the SIGALRM hits.
And indeed, after much debugging:
[*] Establishing connection
[*] Detecting memory layout
malloc() is at 0x007fb867e20070
libc is at 0x007fb867d89000
[*] Building ROP chain
[*] ROP chain has size 0x1a60
[*] Sending ROP chain
[*] Round 1
[*] Round 2
[*] Round 3
[*] Round 4
[*] Round 5
[*] Round 6
[*] Round 7
[*] Round 8
[*] Round 9
[*] Round 10
[*] Got flag: flag{for_k_in_N_for_i_in_N_for_j_in_N}