Peek and poke in the age of Linux

Some time ago I ran into a production issue where the init process (upstart) stopped behaving properly. Specifically, instead of spawning new processes, it deadlocked in a transitional state. To be precise, the init process itself was responsive, but the critical services were stuck in one of the pre- or post- states, never actually restarting. What’s worse, upstart doesn’t allow forcing a state transition and trying to manually create and send DBus events didn’t help either. That meant the sane options we were left with were:

restart the host (not desirable at all in that scenario)
start the process manually and hope auto-respawn will not be needed.

Of course there are also some insane options. Why not cheat like in the old times and just PEEK and POKE the process in the right places? The solution used at the time involved a very ugly script driving gdb which probably summoned satan in some edge cases. But edge cases were not hit and majority of hosts recovered without issues. (if you overwrite memory of your init process, you should expect at least a small percent of segfaults) After some time however I wanted to recreate the experiment in a cleaner way and see what interfaces are available if I had a legitimate use for doing something similar again.

The goal is the same - given an upstart job name, change its goal and status fields to arbitrary values, without killing the init process. First some context however:

Why is peek/poke harder these days?

In good old times when software size was measured in kilobytes and each byte was quite expensive dynamic allocation was very rare. Whatever could be static, was static. Whatever couldn’t be, was most likely pooled in a specific region and there was a preset number of “things” the program could handle. That means your lives counter or some other important value was most likely always at the exact same address every time. That’s not the case anymore unfortunately. Almost everything needs to handle an arbitrary number of “things” these days and that means dynamic allocation.

It’s also trivial to allocate new memory regions and OS takes care of things like making sure the memory looks like a one continuous space to your app, while it reality it can be all over the place. The practical implication is that anything we’ll need to search for in the upstart process will be malloc’d somewhere in the heap area. We also need to know where the heap happens to be at the specific time.

Ways of direct access to a process.

On Linux there are a couple of ways to access memory of a foreign process. The easiest two are reading directly from /proc/(pid)/mem and using the ptrace library. The ptrace request ids are actually called PTRACE_PEEKDATA and PTRACE_POKEDATA which should make their purpose quite clear. There’s a lot of information about them in man pages if you want more details, but let’s move on to some real action.

Where to read from is another interesting question. Apart from dynamic allocation we’ve got virtual memory these days and additional memory-shifting concepts like ASLR. The up-to-date, valid information about where to look for data will exist under /proc/(pid)/maps for each running application. For the init process (PID 1), it looks something like this:

......
7fae2b2b7000-7fae2b2b9000 rw-p 00023000 fd:01 2860       /lib/x86_64-linux-gnu/ld-2.15.so
7fae2b2b9000-7fae2b2df000 r-xp 00000000 fd:01 4259       /sbin/init (deleted)
7fae2b4de000-7fae2b4e0000 r--p 00025000 fd:01 4259       /sbin/init (deleted)
7fae2b4e0000-7fae2b4e1000 rw-p 00027000 fd:01 4259       /sbin/init (deleted)
7fae2cf09000-7fae2cfd0000 rw-p 00000000 00:00 0          [heap]
7fffc146b000-7fffc148c000 rw-p 00000000 00:00 0          [stack]
7fffc1599000-7fffc159a000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

As previously noted, all of the interesting / long-lived data will be found in the heap which is annotated with a fake path “[heap]”. All of the ranges listed in the maps file are available. Others will give an error on access. Process memory acts like a stricter version of a sparse file in this case.

Nice ways of direct access

Both ptrace and memory-file interfaces are quite low-level, so instead of writing lots of C code, I’m going to use some Python instead. Fortunately there’s an existing ptrace wrapper on pypi and even though it looks abandoned, it still works very well. The interface allows easy “stop and attach” operation as well as exposes some interesting functions for address range reading and writing. Allow me to do some blog-literate programming here. The ptrace interface allows for easy attaching to a chosen PID (1 in this case):

def get_init_process():
    d = ptrace.debugger.PtraceDebugger()
    proc = d.addProcess(1, False)
    return proc

Now down to the details… After a quick glance at init/job.h from upstart source code, it looks like we’re interested in two values from struct Job - goal and state. Both have a range of values described at the top of the file. Counting from the beginning of the struct, they’re at offset 5*(native pointer length), because NihList consists of two pointers only.

PTR_SIZE=ptrace.cpu_info.CPU_WORD_SIZE
JOB_CLASS_NAME_OFFSET = PTR_SIZE*2
JOB_CLASS_PATH_OFFSET = PTR_SIZE*3
JOB_NAME_OFFSET = PTR_SIZE*2
JOB_JOB_CLASS_OFFSET = PTR_SIZE*3
JOB_PATH_OFFSET = PTR_SIZE*4
JOB_GOAL_OFFSET = PTR_SIZE*5

But struct Job is not something we can find easily. Let’s say the upstart job to fix is called “rsyslog”. This string is in the heap, but not pointed to from the Job structure. That part initially consisted of some guesswork and upstream code browsing which I’m not going to reproduce here, but the result is that the bytes “rsyslog” (or “rsyslog\0” to be precise) exists in structure JobClass in init/job_class.h. Actually… there and in 18 other places. That means on the current system I can find 19 places which contain that name terminated by a zero byte and the next steps are going to be figuring out how to figure out which of those occurrences can be traced back to the job itself.

def get_heap(proc):
    return [m for m in proc.readMappings() if m.pathname == '[heap]'][0]

def find_refs_to(mem, bytestr)
    return [addr for addr in heap.search(bytestr)]

With such a low number of hits we can just check each of them and see how viable each one is.

Tracking references

So how to find out if each of the guesses is correct? By checking if the surrounding values and pointers makes sense. In this case the JobClass has a path field which according to comments is a string containing the DBus path for the job. As noted previously, those fields have a known offset from the start of the structure. Let’s write something generic then that will browse through given addresses and check if the memory referencing them looks like it could be a known object:

def flatten(stream):
    result = []
    for collection in stream:
        result.extend(collection)
    return result

def places_referring_to(mem, search_value):
    needle = ptrace.ctypes_tools.word2bytes(search_value)
    return find_refs_to(mem, needle)

def find_object_references(proc, heap, values, offset, verifier):
    refs = flatten(places_referring_to(heap, value) for value in values)
    return [ref-offset for ref in refs if verifier(proc, ref-offset)]

Now some functions that can actually judge whether some location looks like a Job or a JobClass by extracting expected strings:

def deref_string(proc, addr):
    s_addr = proc.readWord(addr)
    try:
        return proc.readCString(s_addr, 100)[0]
    except ptrace.debugger.process_error.ProcessError:
        return None

def looks_like_job_class(proc, addr):
    s = deref_string(proc, addr+JOB_CLASS_PATH_OFFSET)
    return s is not None and s.startswith('/com/ubuntu/Upstart/jobs/')

def looks_like_job(proc, addr):
    s = deref_string(proc, addr+JOB_PATH_OFFSET)
    return s is not None and s.startswith('/com/ubuntu/Upstart/jobs/')

And that’s it. There could be a lot more sanity checking going on, but after a quick check it appears to be unnecessary. A quick run results in only one pointer which actually does show a valid Job structure.

The reference chain we’re looking for is: string (name of the process) -> that is used in a JobClass -> that is used in a Job. To wrap it all up into an actual script:

proc = get_init_process()
heap = get_heap(proc)
process_strings = find_refs_to(heap, process_to_fix)
job_classes = find_object_references(proc, heap, process_strings,
    JOB_CLASS_NAME_OFFSET, looks_like_job_class)
jobs = find_object_references(proc, heap, job_classes,
    JOB_JOB_CLASS_OFFSET, looks_like_job)

for job in jobs:
    print "job found at 0x%016x" % job
    goal, state = proc.readStruct(job+JOB_GOAL_OFFSET,
        ctypes.c_int32*2)[:]
    print "goal", job_goals[goal]
    print "state", job_states[state]

Does it all work?

Yes, of course it does! And pretty reliably actually:

sudo ./search_init.py rsyslog
job found at 0x00007fae2cf95ca0
goal JOB_START
state JOB_RUNNING

After finding the right address it’s only a matter of proc.writeBytes() to force the change of the goal and state.

Unfortunately there’s nothing stopping the system from being in a state where this change really shouldn’t happen. For example right before the value is read, or while it’s being copied and some code path still holds the old reference, or… Basically changing memory which you don’t have complete control over is not safe. Ever. Around 1% of machines had problems with init going crazy afterwards, but those could be just rebooted then. But as a hack that allows you to fix a critical issue, it’s worth remembering that it’s not rocket science.

And finally: thanks to Victor Stinner for writing some really useful Python tools.

blogroll

social