This writeup explains how we solved a reversing task called `moon`

during the
Google CTF last weekend. In summary, the task implements a password check
*using OpenGL shaders*.

We’re given a Portable Executable (Download) for
`x86-64`

that was obviously compiled using MinGW (no
Rich header but the ```
This program
cannot be run in DOS mode
```

stub used typically used by MSVC – Borland prints a
different message):

```
00000000: 4d 5a 90 00 03 00 00 00 04 00 00 00 ff ff 00 00 |MZ..............|
00000010: b8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00 |........@.......|
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 80 00 00 00 |................|
00000040: 0e 1f ba 0e 00 b4 09 cd 21 b8 01 4c cd 21 54 68 |........!..L.!Th|
00000050: 69 73 20 70 72 6f 67 72 61 6d 20 63 61 6e 6e 6f |is program canno|
00000060: 74 20 62 65 20 72 75 6e 20 69 6e 20 44 4f 53 20 |t be run in DOS |
00000070: 6d 6f 64 65 2e 0d 0d 0a 24 00 00 00 00 00 00 00 |mode....$.......|
00000080: 50 45 00 00 64 86 09 00 00 00 00 00 00 00 00 00 |PE..d...........|
```

In general, Portable Executables tend to come together with a lot of statically
linked-in library code, which we usually are not interested in during reversing.
As such, the most convenient way of determining the `main`

function in Windows
binaries that occur during CTFs is usually just quickly searching for suspicious
strings such as “Wrong”, “Congratz”, “flag”, or even the program name. Moreover,
as compilers are quite deterministic, all non-library strings will be placed
close to each other. And indeed, using the hacker tool `strings`

once more we
find:

```
➜ moon strings moon.exe| grep -iC8 moon
%.8x
glGetShaderiv:
glCompileShader:
GL_TIMEOUT_EXPIRED or GL_WAIT_FAILED
vert_xyz
vert_uv
NopeGood
SDL_Init: %s
Moon
gl3wInit()
OpenGL 4.3 not supported
basic_string::append
timeout error
Bye bye...
30c7ead97107775969be4ba00cf5578f1048ab1375113631dbb6871dbe35162b1c62e982eb6a7512f3274743fb2e55c818912779ef7a34169a838666ff3994bb4d3c6e14ba2d732f14414f2c1cb5d3844935aebbbe3fb206343a004e18a092daba02e3c0969871548ed2c372eb68d1af41152cb3b61f300e3c1a8246108010d282e16df8ae7bff6cb6314d4ad38b5f9779ef23208efe3e1b699700429eae1fa93c036e5dcbe87d32be1ecfac2452ddfdc704a00ea24fbc2161b7824a968e9da1db756712be3e7b3d3420c8f33c37dba42072a941d799ba2eebbf86191cb59aa49a80ebe0b61a79741888cb62341259f62848aad44df2b809383e09437928980f
?UUU?
glActiveProgramEXT
```

Hmm … We can spot “Nope” and “Good”, which probably indicate whether we were
successful, a “Bye bye” message that apparently is printed when the program is
closed and a high entropy hexadecimal string that already should catch our
attention. Additionally, we can see several instances of the letters `gl`

, which
hint towards OpenGL being used in the program. Let’s fire up IDA and see what is
going on inside the binary!

Searching backwards from the long hexstring in IDA, we find a function that performs all kinds of interesting static initializations. After doing some cleanup und some renaming, the Hex-Rays output looks as follows:

```
__int64 init_global_vars()
{
some_byte_foo_ptr = (__int64)&some_byte_foo_dat;
new_basic_string((_BYTE **)&some_byte_foo_ptr, byte_49A010, aVersion430InVec3Ve);
sub_411C40();
shader_one_ptr = (__int64)&shader_one_dat;
new_basic_string((_BYTE **)&shader_one_ptr, &aVersion430InVec3Ve[315], &aVersion430InVec3Ve[491]);
sub_411C40();
shader_two_ptr = (__int64)&shader_two_dat;
new_basic_string((_BYTE **)&shader_two_ptr, aVersion430InVec3Ve, &aVersion430InVec3Ve[315]);
sub_411C40();
letters_png_ptr = (__int64)&letters_png_dat;
new_basic_string((_BYTE **)&letters_png_ptr, &unk_49C23A, &off_49CBD0);
sub_411C40();
qword_4CA088 = 0i64;
qword_4CA080 = &byte_4CA090;
byte_4CA090 = 0;
sub_411C40();
long_hexstring_ptr = &long_hexstring_dat;
new_basic_string(
(_BYTE **)&long_hexstring_ptr,
"30c7ead97107775969be4ba00cf5578f1048ab1375113631dbb6871dbe35162b1c62e982eb6a7512f3274743fb2e55c818912779ef7a34169a83"
"8666ff3994bb4d3c6e14ba2d732f14414f2c1cb5d3844935aebbbe3fb206343a004e18a092daba02e3c0969871548ed2c372eb68d1af41152cb3"
"b61f300e3c1a8246108010d282e16df8ae7bff6cb6314d4ad38b5f9779ef23208efe3e1b699700429eae1fa93c036e5dcbe87d32be1ecfac2452"
"ddfdc704a00ea24fbc2161b7824a968e9da1db756712be3e7b3d3420c8f33c37dba42072a941d799ba2eebbf86191cb59aa49a80ebe0b61a7974"
"1888cb62341259f62848aad44df2b809383e09437928980f",
"");
return sub_411C40();
}
```

At the bottom, we see the contents of `long_hexstring`

being assigned to a
dynamically allocated `std::string`

. In total `new_basic_string`

(which we named
according to the error messages found in this binary function) is called four
times with four different arguments:

- Some non-printable gibberish at address
`0x49A010`

A string representing a program written in a C-like language:

`#version 430 in vec3 vert_xyz; in vec2 vert_uv; out vec2 frag_uv; void main() { frag_uv = vert_uv; // Rescale the scene so it's in pixels. gl_Position = vec4( (vert_xyz.x / 1280.0) * 2.0 - 1.0, -((vert_xyz.y / 720.0) * 2.0 - 1.0), vert_xyz.z / 1024.0, 1.0 );`

Another program:

`#version 430 in vec2 frag_uv; out vec4 frag_colour; uniform sampler2D tex; void main() { frag_colour = vec4(1.0, 1.0, 1.0, 1.0) * texture(tex, frag_uv);`

Something that starts with a PNG header at address

`0x49C23A`

which, dumped using`binwalk`

, looks like the charset used by the program:

Okay, we can clearly understand why the program needs the PNG, but what are those mysterious C-like snippets about? A brief query to our search engine of choice tells us that we are dealing with OpenGL Shading Language (abbreviated GLSL). In summary, GLSL allows for embedding shaders into a program that get compiled to ARB Assembly Language which then in turn can be uploaded to the graphics hardware for further processing. Nevertheless, the two snippets we found above are totally boring and don’t do anything that is of interest for us. Let’s try to find out what that unreadable gibberish (no. 1 from the list above) was.

Looking at the locations from where `some_byte_foo_ptr`

is accessed, we can spot
a simple string scrambling function (at `0x401a30`

):

```
if ( some_byte_foo_ptr != qword_4CA128 + some_byte_foo_ptr )
{
v2 = 1i64;
v3 = (_BYTE *)(some_byte_foo_ptr + 1);
v4 = 0i64;
v5 = -561040962;
v6 = *(_BYTE *)some_byte_foo_ptr ^ 0xBE;
while ( 1 )
{
*((_BYTE *)v0 + v4) = v6;
v17 = v2;
*((_BYTE *)Memory + v2) = 0;
if ( v3 == (_BYTE *)v1 )
break;
v7 = v5;
v8 = v5;
v9 = __ROL4__(v5, 3);
v4 = v17;
v10 = (v8 >> 21) ^ (v7 << 8);
v11 = 15i64;
v5 = v10 + v9 - 0x6837E5D9;
v0 = (__int64 *)Memory;
v2 = v17 + 1;
v6 = *v3 ^ v5;
if ( Memory != &v18 )
v11 = v18;
++v3;
if ( v2 > v11 )
{
sub_4870A0(&Memory, v17, 0i64, 0i64);
v0 = (__int64 *)Memory;
}
}
}
```

At this point, we anticipated that unscrambling the string was not the main
point of the challenge, and therefore we just decided to use
x64dbg to dump the unscrambled data. This can easily
be done by setting a breakpoint to `0x401B30`

(which is directly after the loop)
and patching up `0x401aa0`

(containing a `mov r14d, 0x0F`

) to actually
unscramble more than just the first 15 bytes of the data. Afterwards, the
data can be dumped conveniently from memory referenced by location `rsp+0x60`

.

After fixing indentation and line breaks we get something that arouses interest immediately:

```
#version 430
layout(local_size_x=8,local_size_y=8)in;
layout(std430,binding=0) buffer shaderExchangeProtocol{
uint state[64];
uint hash[64];
uint password[32];
};
vec3 calc(uint p) {
float r = radians(p);
float c = cos(r);
float s = sin(r);
mat3 m = mat3(c, -s, 0.0, s, c, 0.0, 0.0, 0.0, 1.0);
vec3 pt = vec3(1024.0, 0.0, 0.0);
vec3 res = m*pt;
res += vec3(2048.0, 2048.0, 0.0);
return res;
}
uint extend(uint e) {
uint i;
uint r = e ^ 0x5f208c26;
for (i = 15; i < 31; i += 3){
uint f = e << i;
r ^= f;
}
return r;
}
uint hash_alpha(uint p) {
vec3 res=calc(p);
return extend(uint(res[0]));
}
uint hash_beta(uint p) {
vec3 res=calc(p);
return extend(uint(res[1]));
}
void main(){
uint idx = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 8;
uint final;
if (state[idx] != 1){
return;
}
if ((idx & 1) == 0){
final = hash_alpha(password[idx/2]);
}else{
final = hash_beta(password[idx/2]);
}
uint i;
for (i=0; i < 32; i += 6){
final ^= idx << i;
}
uint h = 0x5a;
for (i = 0; i < 32; i++){
uint p = password[i];
uint r = (i * 3) & 7;
p = (p << r) | (p >> (8 - r));
p &= 0xff;
h ^= p;
}
final ^= (h | (h << 8) | (h << 16) | (h << 24));
hash[idx] = final;
state[idx] = 2;
memoryBarrierShared();
}
```

What we can see above is a simple transformation of data risiding in an array
`password`

. Note that despite the term `hash`

being used by the challenge author
(we did not rename anything in the code above), the steps performed by the
algorithm are actually reversible. Let’s discuss the above shader a bit more
in-depth. The first thing to understand is probably IO behaviour:

```
layout(std430,binding=0) buffer shaderExchangeProtocol{
uint state[64];
uint hash[64];
uint password[32];
};
```

This defines three arrays `state`

, `hash`

, and `password`

. As we can infer from
the `main`

function, `password`

is used for reading in data, `hash`

is used for
storing the result of the computation, and `state`

is (presumably) used to
signal whether a certain worker thread on the GPU already has finished the
calculation …

… which brings us to how processing on the GPU is done in general. The line

```
uint idx = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 8;
```

and the fact that the `hash`

array stores 64 results tells us that our GPU cores
are organized in a 8 x 8 grid. The runtime assigns a (x, y) position in that
grid to each of the cores. From this grid position, the code can then calculate a
unique id, resulting in the transformation being invoked in parallel for
different input characters.

Let’s look at the actual computation steps that are eventually performed. The
first one is `calc`

, which involves some linear algebra magic.

```
vec3 calc(uint p) {
float r = radians(p);
float c = cos(r);
float s = sin(r);
mat3 m = mat3(c, -s, 0.0, s, c, 0.0, 0.0, 0.0, 1.0);
vec3 pt = vec3(1024.0, 0.0, 0.0);
vec3 res = m*pt;
res += vec3(2048.0, 2048.0, 0.0);
return res;
}
```

Even though this looks intimidating, the operation is quite simple: What we can
see here is that `p`

is used to construct a Rotation
Matrix by `p`

degrees in the
$(x, y)$ plane which gets applied to the vector $(1024, 0, 0)$ (the third
coordinate is left untouched). Afterwards, the result is translated (shifted) in
the $(x, y)$ plane by $(2048, 2048)$. With this knowledge, it becomes clear that
(assuming user input within ASCII range $(0, 255)$) this operation can easily be
inverted. In fact, if the python version of `calc`

looks like follows:

```
def calc(p):
r = math.radians(ord(p))
c = math.cos(r)
s = math.sin(r)
m = [[ c, -s, 0.0],
[ s, c, 0.0],
[0.0, 0.0, 1.0]]
pt = [ 1024.0, 0.0, 0.0 ]
res = np.dot(m, pt)
res += [2048.0, 2048.0, 0.0]
return res
```

then the reverse operation `rev_calc`

can be implemented by simply un-doing the
translation and then calculating the angle of the resulting vector and the
vector pointing into direction $(1024, 0, 0)$:

```
import math
import numpy as np
def unit_vector(vector):
return vector / np.linalg.norm(vector)
def angle_between(v1, v2):
v1_u = unit_vector(v1)
v2_u = unit_vector(v2)
return np.arccos(np.clip(np.dot(v1_u, v2_u), -1.0, 1.0))
def rev_calc(res):
res = np.subtract(res, [2048.0, 2048.0, 0.0])
p = angle_between(res, [1024.0, 0.0, 0.0])
return chr(int(math.degrees(p)))
```

The second computation step uses more traditional fixed length integer arithmetic

```
uint extend(uint e) {
uint i;
uint r = e ^ 0x5f208c26;
for (i = 15; i < 31; i += 3){
uint f = e << i;
r ^= f;
}
return r;
}
```

Looking carefully at this function, we can see that every third of the bits 0 to
14 is xored into the upper half of `r`

. Since `r`

is the input xored by a fixed
constant, the lowest 15 bits remain untouched and the operation becomes
trivially invertible:

```
def extend(e):
r = e ^ 0x5f208c26
for i in range(15, 31, 3):
r ^= (e << i) & 0xffffffff
return r & 0xffffffff
def rev_extend(r):
l = ((r ^ 0x5f208c26) & ((1 << 15) - 1))
for i in range(15, 31, 3):
r ^= (l << i) & 0xffffffff
return (r ^ 0x5f208c26) & 0xffffffff
```

Let’s look at the `main`

function of the shader again:

```
void main(){
uint idx = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 8;
uint final;
if (state[idx] != 1){
return;
}
if ((idx & 1) == 0){
final = hash_alpha(password[idx/2]);
}else{
final = hash_beta(password[idx/2]);
}
uint i;
for (i=0; i < 32; i += 6){
final ^= idx << i;
}
uint h = 0x5a;
for (i = 0; i < 32; i++){
uint p = password[i];
uint r = (i * 3) & 7;
p = (p << r) | (p >> (8 - r));
p &= 0xff;
h ^= p;
}
final ^= (h | (h << 8) | (h << 16) | (h << 24));
hash[idx] = final;
state[idx] = 2;
memoryBarrierShared();
}
```

As `extend`

and `calc`

are used by `hash_alpha`

and `hash_beta`

, we are now
fully able to invert the first part of the algorithm. Nice!

What about the remaining steps?

So far, the transformation was performed for each character independently from each other (observe how password is just accessed by means of the global id). To stop simple character wise brute force, this obviously has to change; which is what the second part of the transformation deals with:

```
for (i=0; i < 32; i += 6){
final ^= idx << i;
}
uint h = 0x5a;
for (i = 0; i < 32; i++){
uint p = password[i];
uint r = (i * 3) & 7;
p = (p << r) | (p >> (8 - r));
p &= 0xff;
h ^= p;
}
final ^= (h | (h << 8) | (h << 16) | (h << 24));
```

The first `for`

loop simply xors the global id `idx`

to the result at several
postions. The more devastating part is the second loop: All characters of the
password are accumulated into one byte value by combining a bitwise rotate with
a multiplication by $3$. The result is then promoted to a 32 bit value by simply
duplicating the lowest byte and then xoring it to the final hash value. We don’t
know the value of `h`

and therefore cannot reconstruct `final`

from before the
final xor. Damn. Or can we?

Thinking about this a second we came to the conclusion that this it can be
broken by performing a *known plaintext attack on the target value the thread
with idx 0*. But, wait, what are our target values? Remember the long hexstring?
Let’s get back to the binary a second time!

This is the `main`

function of the binary after cleaning it up and teaching IDA
about SDL headers:

```
SDL_StartTextInput();
do
{
while ( SDL_PollEvent(&event) )
{
evt_type = event.type;
if ( event.type == SDL_QUIT )
goto LABEL_34;
if ( event.type - 768 > 1 )
goto LABEL_15;
if ( event.window.event == SDL_PEEKEVENT )
{
if ( event.window.data2 == 27 )
{
LABEL_34:
v21 = v30;
goto LABEL_35;
}
if ( (event.window.data2 == 0x400000BB || event.window.data2 == 8) && qword_4CA088 )
{
sub_486E60((__int64)&qword_4CA080, qword_4CA088 - 1, 1i64);
evt_type = event.type;
cmp_flag = 0;
LABEL_15:
if ( evt_type == SDL_TEXTINPUT && strlen(event.edit.text) == 1 )
{
if ( (unsigned __int64)qword_4CA088 <= 0x1F )
{
string_append((__int64)&qword_4CA080, &event.jhat.hat, 1ui64);
if ( qword_4CA088 == 32 )
goto LABEL_27;
LABEL_19:
cmp_flag = 0;
}
else
{
if ( qword_4CA088 != 32 )
goto LABEL_19;
LABEL_27:
v17 = (__int128 *)v42;
Size = 0i64;
v35 = 0;
Memory = &v35;
if ( (unsigned __int8)do_check(qword_4CA080, (unsigned int *)v42) )
{
do
{
v18 = *(_DWORD *)v17;
*(_QWORD *)Dest = 0i64;
v32 = 0;
sprintf_0(Dest, "%.8x", v18);
v19 = &Dest[strlen(Dest)];
if ( v19 - Dest > 0x7FFFFFFFFFFFFFFFi64 - Size )
sub_4921C0("basic_string::append");
string_append((__int64)&Memory, Dest, v19 - Dest);
v17 = (__int128 *)((char *)v17 + 4);
}
while ( &v43 != v17 );
v20 = (char *)Memory;
if ( Size != long_hexstring_data || Size && memcmp(Memory, long_hexstring_ptr, Size) )
cmp_flag = 1;
else
cmp_flag = 2;
}
else
{
v25 = __iob_func();
sub_41E440(&v25[2], "timeout error\n", v26);
v20 = (char *)Memory;
}
if ( v20 != &v35 )
j_free(v20);
}
}
}
}
}
v21 = 0;
LABEL_35:
v22 = SDL_GetTicks();
*(float *)&v23 = (float)((float)v22 / 1000.0) - *(float *)&dword_4CA044;
*(float *)&dword_4CA044 = (float)v22 / 1000.0;
dword_4CA040 = v23;
SDL_GL_MakeCurrent(v27, v28);
glUseProgram((unsigned int)dword_4CA0A4);
sub_402660((float *)&v36);
glUseProgram(0i64);
SDL_GL_SwapWindow(v27);
}
while ( !v21 );
puts("Bye bye...");
```

The `long_hexstring_ptr`

is used here in a `memcmp`

. The rest of the code is
boilerplate to read in user input, store it at `0x4ca080`

and then throw it into
`do_check`

. This function, unsurprisingly, invokes the secreet shader with the
input the user entered (renaming the global function names based on string
analysis where the pointer values are actually obtained from `GetProcAddress`

):

```
glUseProgram((unsigned int)flagCheckProgram);
if ( !(unsigned char)glIsBuffer((unsigned int)dword_4CA0A0) )
{
memset(Dst, 0, 0x280ui64);
glGenBUffers(1i64, &dword_4CA0A0);
glBindBufferBase(37074i64, 0i64, (unsigned int)dword_4CA0A0);
glBindBuffer(37074i64, (unsigned int)dword_4CA0A0);
glBufferData(37074i64, 640i64, Dst, 35048i64);
}
// ...
glDispatchCompute(8i64, 8i64, 1i64);
LODWORD(v21) = glFenceSync(37143i64, 0i64);
v22 = v21;
if ( (glClientWaitSync(v21, 1i64, 1000000000i64) - 37147) & 0xFFFFFFFD )
// ...
```

Alright, no surprises here, the `long_hexstring_ptr`

is the result of the
computation of our 8 x 8 gpu thread grid. Let’s finish this …

Remember, we perform a known-plaintext attack on the value of `h`

in the
shader’s main function that is xored to the `final`

variable in the last step.
We assume that the password starts with a “C” (for CTF{}, yay flag format) and
do the following in our python script (remember, idx 0 skips the second last xor
loop):

```
import math, struct, binascii
import numpy as np
dat = binascii.unhexlify(open('magic', 'r').read().strip())
u = lambda x: struct.unpack(">I", x)[0]
grid = [ [ u(dat[j * 32 + i * 4:][:4]) for i in range(8) ] for j in range(8) ]
print(hex(extend(int(calc('C')[0] + 0)) ^ grid[0][0]))
➜ moon ./pwn.py
0x6f6f6f6f
```

Success! The fact that we can see four duplicated byte values tells us that for the correct password

```
uint h = 0x5a;
for (i = 0; i < 32; i++){
uint p = password[i];
uint r = (i * 3) & 7;
p = (p << r) | (p >> (8 - r));
p &= 0xff;
h ^= p;
}
```

ends up with `h = 0x6f`

.

This is the final solution script (with `magic`

being just the
long hex string that we obtained running `strings`

):

```
#!/usr/bin/env python3
import math, struct, binascii
import numpy as np
dat = binascii.unhexlify(open('magic', 'r').read().strip())
u = lambda x: struct.unpack(">I", x)[0]
grid = [ [ u(dat[j * 32 + i * 4:][:4]) for i in range(8) ] for j in range(8) ]
tgt = grid[0][0]
def unit_vector(vector):
return vector / np.linalg.norm(vector)
def angle_between(v1, v2):
v1_u = unit_vector(v1)
v2_u = unit_vector(v2)
return np.arccos(np.clip(np.dot(v1_u, v2_u), -1.0, 1.0))
def extend(e):
r = e ^ 0x5f208c26
for i in range(15, 31, 3):
r ^= (e << i) & 0xffffffff
return r & 0xffffffff
def rev_extend(r):
l = ((r ^ 0x5f208c26) & ((1 << 15) - 1))
for i in range(15, 31, 3):
r ^= (l << i) & 0xffffffff
return (r ^ 0x5f208c26) & 0xffffffff
def calc(p):
r = math.radians(ord(p))
c = math.cos(r)
s = math.sin(r)
m = [[ c, -s, 0.0],
[ s, c, 0.0],
[0.0, 0.0, 1.0]]
pt = [ 1024.0, 0.0, 0.0 ]
res = np.dot(m, pt)
res += [2048.0, 2048.0, 0.0]
return res
def rev_calc(res):
res = np.subtract(res, [2048.0, 2048.0, 0.0])
p = angle_between(res, [1024.0, 0.0, 0.0])
return chr(int(math.degrees(p)))
def second(x, y):
final = extend(int(calc('C')[0] + 0)) ^ grid[0][0]
idx = x + y * 8
tgt = grid[y][x] ^ final
for i in range(0, 32, 6):
tgt ^= (idx << i) & 0xffffffff
return tgt
print(hex(extend(int(calc('C')[0] + 0)) ^ grid[0][0]))
y = 0
x = 0
for y in range(8):
for x in range(0, 8, 2):
tmp = [ rev_extend(second(x, y)), rev_extend(second(x + 1, y)), 0.0 ]
z = rev_calc([ rev_extend(second(x, y)), rev_extend(second(x + 1, y)), 0.0 ])
print(z, end = '')
print('')
```

Executing the script gives as (almost) the flag:

```
➜ moon ./pwn.py
0x6f6f6f6f
CTF{OpenGLMoonMoonG0erT0TheMoon}
```

Which, after some guessing gave us the correct flag: