Published: Wed 25 February 2015
By viraptor
tags: memcache seccomp linux
As explained in the previous post, seccomp can be
used for protecting the system and users from misbehaving and exploited
applications. But there has to be some work done to actually enable the
protection it offers. That’s where the programming part starts and possibly
some exploration of the application you want to constrain.
The application I chose to play with is memcached.
It’s a good example in this case, because the interaction it’s supposed to have
with the system is actually pretty minimal. The data isn’t persisted and
there’s even a good separation between the thread accepting the connections and
the threads which do the set/get work.
Where to start
To actually implement the seccomp protection in a daemon, you can take one of two routes:
either you know the application inside out and know all the syscalls
it can make (unlikely, unless you started it), or
you run it in as many ways as you can and collect the “normal behaviour”
profile, which you can then enforce.
These points may be familiar to anyone who has configured an AppArmor / RBAC /
SELinux profile before.
Of course I didn’t write memcached, so the first option is not available to
me. There are likely a lot of things it does under the covers that I’m not aware
of. This leaves me with the second option only, hoping that other users or
developers will point out any special cases that were missed. The experimental
approach is actually made much easier due to a reasonable test suite included
in the repository. It covers a lot of the possible scenarios and can be used to
find a “normal” profile.
Unfortunately, as described later, having a test suite tightly integrated into
the app makes the process a bit more difficult in other ways.
Useful tools
There are a few tools which can help in this case and pretty much all of them
rely in some way on the ptrace mechanism. The obvious choices here would be
strace and gdb. The first one is exactly what we need to create a profile:
it prints out only the system calls made by the application. It prints
everything from the start of the program however, so not all of the calls
will be interesting for us. We do not need to take into account the program
initialisation, because the first prctl() call will happen after it. Also,
most likely we don’t need to care about multiple shared libraries being loaded
at the beginning.
Gdb, on the other hand, may be useful for more in-depth investigation.
Specifically, if some syscall needs to be restricted based on its arguments, it
may be useful to set a breakpoint where it’s called and look at the live
process. Also, gdb can break at a signal that would otherwise exit the application.
Finding relevant syscalls
Let’s start with the main tool, strace, and a simple-to-understand app:
cat. And for a moment let’s pretend it doesn’t open more than a single file
from the arguments list. To get the basic profile for an app, we can do the following:
    strace cat /etc/hosts > /dev/null
The output should look like this:
    execve("/usr/bin/cat", ["cat", "/etc/hosts"], [/* 53 vars */]) = 0
    brk(0) = 0x220c000
    # .... a lot of initialisation
    open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
    # .... loading shared libs
    open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
    # .... libc starts up
    open("/etc/hosts", O_RDONLY) = 3
    # !!! finally - the first syscall we'd be interested in
    fstat(3, {st_mode=S_IFREG|0644, st_size=347, ...}) = 0
    fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
    mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f72c1f43000
    read(3, "#\n# /etc/hosts: static lookup ta"..., 131072) = 347
    write(1, "#\n# /etc/hosts: static lookup ta"..., 347) = 347
    read(3, "", 131072) = 0
    munmap(0x7f72c1f43000, 139264) = 0
    close(3) = 0
    close(1) = 0
    close(2) = 0
    exit_group(0) = ?
    +++ exited with 0 +++
So if we did want to secure cat, we’d be interested in anything that happens
after opening the file(s). We wouldn’t expect anything apart from reading from
those files and writing to stdout to happen. Additional things like fstat()
and mmap() / fadvise64() may be treated here either as part of the setup or
as part of the execution. You’ll have to find the right answer, depending on
what you want to protect yourself from.
Trying to protect this execution, I’d set up a profile before fstat() allowing:
    fstat(3, ...)
    fadvise64(3, ...)
    mmap(...)
    read(3, ...)
    write([1, 2], ...)
    munmap(...)
    close(...)
    exit_group(...)
What does that protect us from? For example from writing to any other
files, executing processes, spawning new threads, opening other files. What
could be easily missed? Although this doesn’t happen in this trace, an error
could cause some writes to stderr. I’m sure there are some other cases too that
need to be tested.
In the case of a bigger program, you can get an easy overview of what happened
during the execution by dumping the strace output into a file (-o with an
output file name), deleting the initialisation part and grep/sed/awk-ing out
just the syscall names that follow. This works just as well for multithreaded
apps (run strace with -f so it follows the threads), but you have to remember
each thread can carry its own profile, so it’s useful to think about each
pid’s syscalls separately.
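The grep/sed/awk step can also be sketched in C. The helper below is only an illustration: the name parse_strace_line() is made up for this post, and it assumes the line layout that strace produces with -f and -o, where each line starts with the pid followed by the call. It pulls out the pid and the syscall name and skips signal, exit and "resumed" lines.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse one line of `strace -f -o FILE` output, e.g.
 *   1234  read(3, "", 131072) = 0
 * Stores the pid (or -1 when the line has no pid prefix) and the syscall
 * name. Returns 1 for a line that starts a syscall, 0 for anything else
 * (signals, "+++ exited" notices, "<... resumed>" continuations). */
int parse_strace_line(const char *line, long *pid, char *name, size_t name_len)
{
    const char *p = line;
    char *end;
    size_t n = 0;

    assert(line != NULL && pid != NULL && name != NULL);

    *pid = -1;
    if (isdigit((unsigned char)*p)) {
        *pid = strtol(p, &end, 10);
        p = end;
        while (*p == ' ' || *p == '\t')
            p++;
    }

    /* A syscall name is an identifier immediately followed by '('. */
    while (isalnum((unsigned char)p[n]) || p[n] == '_')
        n++;
    if (n == 0 || p[n] != '(' || n >= name_len)
        return 0;

    memcpy(name, p, n);
    name[n] = '\0';
    return 1;
}
```

Feeding every line of the strace output file through this with fgets() and collecting the names per pid gives one candidate syscall list per thread.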
Writing a simple test
Let’s try adding simple rules to a contrived program now. It will only print
out two lines of text - one before the seccomp initialisation and one after.
The original “program” looks like this:
    #include <stdio.h>

    int main() {
        puts("pre-load");
        puts("post-load");
        return 0;
    }
First, let’s add rules which should block the second line of text. Allow just
enough syscalls to exit successfully (exit_group()) and to not break system
behaviour (sigreturn). Anything else will fail and set errno to EACCES.
    #include <stdio.h>
    #include <seccomp.h>
    #include <errno.h>

    int main() {
        /* Default action: fail any syscall not listed below with EACCES. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EACCES));
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sigreturn), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
        puts("pre-load");
        seccomp_load(ctx);   /* the filter is enforced from here on */
        puts("post-load");   /* write() is not allowed any more */
        return 0;
    }
Now the second line does not get printed and the process exits. What does it
look like in strace then? Actually strace will not show the second attempted
syscall at all. This is very unfortunate for debugging the results. To make
unintended failures easier to spot it’s better to set the failure mode to
SCMP_ACT_TRAP, which does cause a visible failure: the offending thread
receives SIGSYS, and unless that signal is handled it terminates the process.
To print out the second line, we need to allow writes too. This can be done by
allowing SCMP_SYS(write). But let’s make it a bit more interesting and allow
only writing to stdout. To do that we need extra arguments for
seccomp_rule_add(). Specifically, the count of extra args becomes 1, and the
extra argument specifies that the first argument of write() must be 1. The
whole rule is:
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
                     SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, 1));
What’s missing
The above and reading the seccomp man pages is pretty much all you need to know
to apply seccomp rules to an application. Taking the memcached app as an
example, I used strace to get a list of needed syscalls. Due to the nice
design of the application, the separation of concerns is pretty good. The main
thread only accepts the connections and sends them to the workers. That means
the main process doesn’t even need the read() call after initialisation.
The workers do quite a bit of work, but even that is fairly restricted and
workers never need to open() or accept(). The only means of communication
they have are sockets handed to them by the parent. Some extensions, like being
able to shut down, are turned on/off by the command-line configuration, so
tgkill() for example is only allowed if the option was enabled during startup.
The main problem was the test suite.
What about the test suite
Memcached includes a test suite which actually spawns the tested process
itself. This means that tests need unrestricted open() and write() to
report their results. But that’s an issue, because you don’t want those calls
available in production. There are (at least) three ways to solve this issue:
you can create a new runtime parameter which turns on extended possibilities
for the tests, or
ignore the tests and only restrict privileges in production, or
rewrite test cases to not require extra privileges.
The second idea is the worst - tests exist for a reason. If you can’t test with
the restrictions enabled, then how do you know the app works as intended? The
last idea is pretty hard to implement though - moving a stable, existing
project to a new test architecture just to enable a new feature is not a great idea.
For those reasons I chose the first option in the memcached patch. The tests
will grab extra privileges, but will keep most restrictions. It’s also not
possible to change the mode (test/production) after initialisation is complete,
so this solution doesn’t lower the security.
Final implementation
The final PR is available and waiting to be tested / merged. I created it for
two reasons. First, to
learn how to use seccomp with a project that already exists. Second, to improve
an existing project. Hopefully the two posts about it will cause some more
people to look at seccomp and apply it to new projects. It’s not hard (unless
the app’s design is very complicated), but it’s also not very popular yet. I
really hope this changes in the future. Many exploits cannot be prevented this
way, but the results of break-ins can be constrained a lot.
So if you’re writing a new daemon for Linux, give it a go (and check other
platforms for their own sandboxing alternatives).