Some time ago I ran into a production issue where the init process
(upstart) stopped behaving properly. Specifically, instead of spawning
new processes, it deadlocked in a transitional state. To be precise, the
init process itself was responsive, but the critical services were stuck
in one of the pre- or post- states, never actually restarting. What’s
worse, upstart doesn’t allow forcing a state transition, and manually
creating and sending DBus events didn’t help either. That left us with
two sane options:

- restart the host (not desirable at all in that scenario)
- start the process manually and hope auto-respawn will not be needed

Of course there are also some insane options. Why not cheat like in the
old times and just PEEK and POKE the process in the right places? The
solution used at the time involved a very ugly script driving gdb, which
probably summoned satan in some edge cases. But the edge cases were not
hit and the majority of hosts recovered without issues. (If you
overwrite the memory of your init process, you should expect at least a
small percentage of segfaults.) After some time, however, I wanted to
recreate the experiment in a cleaner way and see what interfaces are
available in case I ever have a legitimate use for something similar
again.

The goal is the same: given an upstart job name, change its goal and
state fields to arbitrary values, without killing the init process.
First some context, however.
Why is peek/poke harder these days?
In the good old times, when software size was measured in kilobytes and
each byte was quite expensive, dynamic allocation was very rare.
Whatever could be static, was static. Whatever couldn’t be was most
likely pooled in a specific region, and there was a preset number of
“things” the program could handle. That means your lives counter or some
other important value was most likely at the exact same address every
time. That’s not the case anymore, unfortunately. Almost everything
needs to handle an arbitrary number of “things” these days, and that
means dynamic allocation.

It’s also trivial to allocate new memory regions, and the OS takes care
of things like making sure the memory looks like one contiguous space to
your app, while in reality it can be all over the place. The practical
implication is that anything we’ll need to search for in the upstart
process will be malloc’d somewhere in the heap area. We also need to
know where the heap happens to be at that specific time.
Ways of direct access to a process
On Linux there are a couple of ways to access the memory of a foreign
process. The easiest two are reading directly from /proc/(pid)/mem and
using the ptrace system call. The relevant ptrace requests are actually
called PTRACE_PEEKDATA and PTRACE_POKEDATA, which should make their
purpose quite clear. There’s a lot of information about them in the man
pages if you want more details, but let’s move on to some real action.
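To make the memory-file route a bit more concrete, here is a minimal sketch of reading a few bytes through /proc/(pid)/mem. The function name and the proc_root parameter are my own additions for illustration (the latter only exists so the function can be pointed at a fake directory tree); on a real system the read needs ptrace-level permissions over the target and fails for addresses that are not mapped.

```python
import os

def read_memory(pid, address, length, proc_root="/proc"):
    """Read `length` bytes at virtual address `address` of process `pid`.

    proc_root is a hypothetical knob for experimenting without a real
    process; normally it stays at /proc.
    """
    mem_path = os.path.join(proc_root, str(pid), "mem")
    # In the mem file, the file offset is the virtual address in the target.
    with open(mem_path, "rb", buffering=0) as mem:
        mem.seek(address)
        return mem.read(length)
```

As root, something like read_memory(1, heap_start, 16) would then return the first 16 bytes of init's heap, given a heap_start taken from the maps file.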
Where to read from is another interesting question. Apart from dynamic
allocation we’ve got virtual memory these days and additional
memory-shifting concepts like ASLR. The up-to-date, valid
information about where to look for data will exist under
/proc/(pid)/maps for each running application. For the init process (PID
1), it looks something like this:
......
7fae2b2b7000-7fae2b2b9000 rw-p 00023000 fd:01 2860 /lib/x86_64-linux-gnu/ld-2.15.so
7fae2b2b9000-7fae2b2df000 r-xp 00000000 fd:01 4259 /sbin/init (deleted)
7fae2b4de000-7fae2b4e0000 r--p 00025000 fd:01 4259 /sbin/init (deleted)
7fae2b4e0000-7fae2b4e1000 rw-p 00027000 fd:01 4259 /sbin/init (deleted)
7fae2cf09000-7fae2cfd0000 rw-p 00000000 00:00 0 [heap]
7fffc146b000-7fffc148c000 rw-p 00000000 00:00 0 [stack]
7fffc1599000-7fffc159a000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
As previously noted, all of the interesting / long-lived data will be
found in the heap, which is annotated with a fake path “[heap]”. The
ranges listed in the maps file are the only ones that are mapped;
accessing any other address gives an error. Process memory acts like a
stricter version of a sparse file in this case.
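Finding the heap range from that listing is a simple parsing exercise. The sketch below is plain Python and independent of any ptrace library; the names Mapping, parse_maps and find_heap are made up for this example.

```python
import collections

Mapping = collections.namedtuple("Mapping", "start end perms pathname")

def parse_maps(text):
    """Parse the contents of a /proc/<pid>/maps file into Mapping tuples."""
    mappings = []
    for line in text.splitlines():
        # address range, perms, offset, device, inode, optional pathname
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(part, 16) for part in fields[0].split("-"))
        pathname = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(start, end, fields[1], pathname))
    return mappings

def find_heap(mappings):
    """Return the mapping labelled [heap], or None if there is none."""
    for mapping in mappings:
        if mapping.pathname == "[heap]":
            return mapping
    return None
```

Feeding it the contents of /proc/1/maps (read as root) should give back the start and end addresses of the heap line shown above.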
Nice ways of direct access
Both the ptrace and the memory-file interfaces are quite low-level, so
instead of writing lots of C code, I’m going to use some Python.
Fortunately there’s an existing ptrace wrapper on PyPI (python-ptrace),
and even though it looks abandoned, it still works very well. It allows
an easy “stop and attach” operation and exposes some useful functions
for reading and writing address ranges. Allow me to do some
blog-literate programming here. Attaching to a chosen PID (1 in this
case) is simple:
import ptrace.debugger

def get_init_process():
    # False = we are not tracing PID 1 yet, so the debugger attaches to it
    d = ptrace.debugger.PtraceDebugger()
    proc = d.addProcess(1, False)
    return proc
Now down to the details… After a quick glance at init/job.h from the
upstart source code, it looks like we’re interested in two values from
struct Job: goal and state. Both have a range of values described at the
top of the file. Counting from the beginning of the struct, goal is at
offset 5*(native pointer length): the leading NihList consists of only
two pointers and is followed by three more pointer-sized fields (name,
job class and path). The state field comes right after goal.
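import ptrace.cpu_info

PTR_SIZE = ptrace.cpu_info.CPU_WORD_SIZE

# Field offsets within struct JobClass (init/job_class.h)
JOB_CLASS_NAME_OFFSET = PTR_SIZE*2
JOB_CLASS_PATH_OFFSET = PTR_SIZE*3

# Field offsets within struct Job (init/job.h)
JOB_NAME_OFFSET = PTR_SIZE*2
JOB_JOB_CLASS_OFFSET = PTR_SIZE*3
JOB_PATH_OFFSET = PTR_SIZE*4
JOB_GOAL_OFFSET = PTR_SIZE*5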
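One way to double-check offsets like these without counting by hand is to mirror the start of the structure with ctypes and ask it for the field offsets. This is only a sketch: the field list below is my reading of the layout described above (a two-pointer list header, three pointers, then two enums), the JobHead name is made up, and it assumes enums are stored as 32-bit ints, which is what the readStruct call later in this post assumes as well.

```python
import ctypes

class NihList(ctypes.Structure):
    # Doubly linked list header: just two pointers
    _fields_ = [("prev", ctypes.c_void_p),
                ("next", ctypes.c_void_p)]

class JobHead(ctypes.Structure):
    # Only the leading fields of struct Job, in the assumed order
    _fields_ = [("entry", NihList),
                ("name", ctypes.c_void_p),       # char *
                ("job_class", ctypes.c_void_p),  # JobClass *
                ("path", ctypes.c_void_p),       # char *
                ("goal", ctypes.c_int32),        # enum, assumed 32-bit
                ("state", ctypes.c_int32)]       # enum, assumed 32-bit

PTR = ctypes.sizeof(ctypes.c_void_p)
for field in ("name", "job_class", "path", "goal", "state"):
    offset = getattr(JobHead, field).offset
    print("%-9s at offset %2d (%d pointers + %d bytes)"
          % (field, offset, offset // PTR, offset % PTR))
```

If the layout assumption holds, goal lands exactly five pointers in and state four bytes after it.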
But struct Job is not something we can find easily. Let’s say the
upstart job to fix is called “rsyslog”. This string is in the heap, but
it is not pointed to from the Job structure. That part initially
consisted of some guesswork and upstream code browsing which I’m not
going to reproduce here, but the result is that the bytes “rsyslog” (or
“rsyslog\0” to be precise) exist in the JobClass structure from
init/job_class.h.
Actually… there and in 18 other places. That means on the current system
I can find 19 places which contain that name terminated by a zero byte,
and the next step is working out which of those occurrences can be
traced back to the job itself.
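def get_heap(proc):
    # The mapping labelled [heap] in /proc/1/maps
    return [m for m in proc.readMappings() if m.pathname == '[heap]'][0]

def find_refs_to(mem, bytestr):
    # Addresses of every occurrence of bytestr within the mapping
    return [addr for addr in mem.search(bytestr)]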
With such a low number of hits we can just check each of them and see
how viable each one is.
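To see why there are so many hits, it helps to look at what such a search boils down to. The sketch below runs on an ordinary bytes object instead of a live heap; find_all and the base parameter are illustrative names, not part of the ptrace wrapper.

```python
def find_all(haystack, needle, base=0):
    """Yield the address of every occurrence of needle in haystack.

    base is the address that haystack[0] corresponds to.
    """
    pos = haystack.find(needle)
    while pos != -1:
        yield base + pos
        pos = haystack.find(needle, pos + 1)

fake_heap = b"rsyslog\0....rsyslogd\0..rsyslog\0"
for addr in find_all(fake_heap, b"rsyslog\0", base=0x1000):
    print("hit at 0x%x" % addr)
```

The trailing zero byte keeps longer names such as “rsyslogd” out of the results, but any string that merely ends with the job name (a path, for example) still matches, which is one reason every hit needs to be verified.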
Tracking references
So how to find out if each of the guesses is correct? By checking if the
surrounding values and pointers make sense. In this case the JobClass
has a path field which, according to the comments, is a string
containing the DBus path for the job. As noted previously, those fields
have a known offset from the start of the structure. Let’s write
something generic then, which will go through the given addresses and
check if the memory referencing them looks like it could be a known
object:
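import ptrace.ctypes_tools

def flatten(stream):
    # Concatenate a sequence of lists into a single list
    result = []
    for collection in stream:
        result.extend(collection)
    return result

def places_referring_to(mem, search_value):
    # Search for the address encoded as a native word, i.e. for pointers to it
    needle = ptrace.ctypes_tools.word2bytes(search_value)
    return find_refs_to(mem, needle)

def find_object_references(proc, heap, values, offset, verifier):
    # A pointer found at `ref` sits `offset` bytes into the candidate object
    refs = flatten(places_referring_to(heap, value) for value in values)
    return [ref-offset for ref in refs if verifier(proc, ref-offset)]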
Now some functions that can actually judge whether some location looks
like a Job or a JobClass by extracting expected strings:
def deref_string(proc, addr):
    # Follow the pointer stored at addr; None if the string can't be read
    s_addr = proc.readWord(addr)
    try:
        return proc.readCString(s_addr, 100)[0]
    except ptrace.debugger.process_error.ProcessError:
        return None

def looks_like_job_class(proc, addr):
    s = deref_string(proc, addr+JOB_CLASS_PATH_OFFSET)
    return s is not None and s.startswith(b'/com/ubuntu/Upstart/jobs/')

def looks_like_job(proc, addr):
    s = deref_string(proc, addr+JOB_PATH_OFFSET)
    return s is not None and s.startswith(b'/com/ubuntu/Upstart/jobs/')
And that’s it. There could be a lot more sanity checking going on, but
after a quick check it appears to be unnecessary. A quick run results in
only one pointer which actually does show a valid Job structure.
The reference chain we’re looking for is: string (name of the
process) -> that is used in a JobClass -> that is used in a Job. To
wrap it all up into an actual script:
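import ctypes
import sys

# Assumption: the job name comes from the command line; the zero byte is
# appended so that only NUL-terminated occurrences match
process_to_fix = sys.argv[1].encode() + b'\0'

proc = get_init_process()
heap = get_heap(proc)
process_strings = find_refs_to(heap, process_to_fix)
job_classes = find_object_references(proc, heap, process_strings,
    JOB_CLASS_NAME_OFFSET, looks_like_job_class)
jobs = find_object_references(proc, heap, job_classes,
    JOB_JOB_CLASS_OFFSET, looks_like_job)

# job_goals and job_states (not shown) map the enum values from init/job.h
# to their names
for job in jobs:
    print("job found at 0x%016x" % job)
    goal, state = proc.readStruct(job+JOB_GOAL_OFFSET,
        ctypes.c_int32*2)[:]
    print("goal %s" % job_goals[goal])
    print("state %s" % job_states[state])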
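The same string -> JobClass -> Job chain can be tried out safely on a made-up heap before pointing anything at PID 1. The sketch below builds a tiny fake heap with struct.pack and walks the chain using the same idea as above. Everything in it (BASE, put, words, find_objects, the goal/state numbers) is invented for the illustration, and the layout is the assumed one: two list pointers, then name and path for a JobClass, and name, class, path, goal, state for a Job.

```python
import struct

PTR_FMT = "P"                         # native pointer-sized word
PTR_SIZE = struct.calcsize(PTR_FMT)
BASE = 0x10000                        # made-up start address of the fake heap
PREFIX = b"/com/ubuntu/Upstart/jobs/"

heap = bytearray()

def put(blob):
    """Append blob to the fake heap (word-aligned) and return its address."""
    heap.extend(b"\0" * (-len(heap) % PTR_SIZE))
    heap.extend(blob)
    return BASE + len(heap) - len(blob)

def words(*values):
    """Pack integers as consecutive native words."""
    return struct.pack("%d%s" % (len(values), PTR_FMT), *values)

def search(needle):
    """Return the addresses of all occurrences of needle in the fake heap."""
    hits, pos = [], heap.find(needle)
    while pos != -1:
        hits.append(BASE + pos)
        pos = heap.find(needle, pos + 1)
    return hits

def deref_string(addr):
    """Follow the pointer stored at addr and return the C string behind it."""
    off = addr - BASE
    if not 0 <= off <= len(heap) - PTR_SIZE:
        return None
    target = struct.unpack_from(PTR_FMT, heap, off)[0] - BASE
    if not 0 <= target < len(heap):
        return None
    end = heap.find(b"\0", target)
    return bytes(heap[target:end]) if end != -1 else None

def find_objects(values, field_offset, path_offset):
    """Find structures that point at any of values and have a plausible path."""
    found = []
    for value in values:
        for ref in search(words(value)):
            candidate = ref - field_offset
            path = deref_string(candidate + path_offset)
            if path is not None and path.startswith(PREFIX):
                found.append(candidate)
    return found

# Build the fake heap: strings first, then a JobClass and a Job.
name = put(b"rsyslog\0")
put(b"rsyslog\0")                                  # unreferenced copy
class_path = put(PREFIX + b"rsyslog\0")
job_path = put(PREFIX + b"rsyslog/_\0")
empty = put(b"\0")
job_class = put(words(0, 0, name, class_path))     # entry, name, path
job = put(words(0, 0, empty, job_class, job_path)  # entry, name, class, path
          + struct.pack("ii", 1, 6))               # goal, state (made up)

strings = search(b"rsyslog\0")
classes = find_objects(strings, 2 * PTR_SIZE, 3 * PTR_SIZE)
jobs = find_objects(classes, 3 * PTR_SIZE, 4 * PTR_SIZE)
for addr in jobs:
    goal, state = struct.unpack_from("ii", heap, addr - BASE + 5 * PTR_SIZE)
    print("job at 0x%x: goal=%d state=%d" % (addr, goal, state))
```

The name is found three times here (two standalone copies plus the tail of the class path), but only one of those hits is referenced by something that verifies as a JobClass, and that in turn leads to the single Job.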
Does it all work?
Yes, of course it does! And pretty reliably actually:
sudo ./search_init.py rsyslog
job found at 0x00007fae2cf95ca0
goal JOB_START
state JOB_RUNNING
After finding the right address it’s only a matter of calling
proc.writeBytes() to force the change of the goal and state.
Unfortunately there’s nothing stopping the system from being in a state
where this change really shouldn’t happen. For example right before the
value is read, or while it’s being copied and some code path still holds
the old reference, or… Basically, changing memory which you don’t have
complete control over is not safe. Ever. Around 1% of machines had
problems with init going crazy afterwards, but those could simply be
rebooted at that point. Still, as a hack that allows you to fix a
critical issue, it’s worth remembering that it’s not rocket science.
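For completeness, the write itself could look roughly like the sketch below. It assumes the goal and state are two consecutive native 32-bit ints (as in the readStruct call earlier) and only relies on the process object offering readBytes() and writeBytes(); the function names and the goal_offset parameter are made up for this example. It carries all the risks described above.

```python
import struct

def pack_goal_state(goal, state):
    """Encode goal and state as two consecutive native 32-bit ints."""
    return struct.pack("=ii", goal, state)

def force_goal_state(proc, job_addr, goal, state, goal_offset):
    """Overwrite the goal and state of the Job at job_addr.

    proc needs readBytes(addr, size) and writeBytes(addr, data);
    goal_offset is the JOB_GOAL_OFFSET value computed earlier.
    Returns the previous (goal, state) pair, handy for a rollback.
    """
    addr = job_addr + goal_offset
    old = proc.readBytes(addr, 8)
    proc.writeBytes(addr, pack_goal_state(goal, state))
    return struct.unpack("=ii", old)
```

The numeric values to pass in are the enum values from init/job.h for the upstart version in question. Remember to detach from the process afterwards so that init can continue running.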
And finally: thanks to Victor Stinner for writing some really useful
Python tools.