Published: Wed 04 February 2015
memcache seccomp linux
Many programs running for a long time on some server do not do random,
unpredictable things. They actually have a pretty well defined set of behaviours
and anything that is outside of that set could be automatically treated as a
bug, or a hack. For example, databases do a lot of ad hoc query processing, but
mostly within some known context: read/write files in a known location, accept
incoming connections, etc. But you’d be surprised if your database started
downloading files over HTTP, or writing to your .bashrc.
This was of course noticed, and there are many external solutions for
controlling / whitelisting application behaviours on Linux. That’s the job of
the Linux Security Modules. SELinux, AppArmor, Tomoyo, Smack and others can make
sure that your database has no business writing in your home directory or
opening outgoing connections (unless they’re to a known slave). But this is not
always convenient, partially because the security profile becomes external to
the app. This may be a bad thing: the app provides its rules, the distro
modifies them, you want a different layout and are left with fixing a third
layer of those rules. But it may be a good thing: the app doesn’t need to
provide any policy, distros do, and you’re more secure by default.
But what about applications which want to provide their own restrictions,
independent of the environment? Even better - can they provide restrictions in a
better way than external LSM frameworks? Actually, to some extent they can…
Seccomp
There’s a pretty interesting interface called seccomp, which started its life a
few years ago. You can find some descriptions of the first implementation on
LWN. At the time it only allowed simple sandboxing of an app without any
detailed configuration. The use case was to set up all needed connections and
just switch to a very restricted environment where only accessible memory
matters.
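To make that original mode a bit more concrete, here is a minimal sketch of it. In strict mode the thread may only call read(), write(), exit() and sigreturn(), so everything else has to be set up beforehand. The helper name `run_strict_child` is made up for this illustration, and the child is forked only so the experiment doesn't restrict the calling process.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child which enters strict mode and then
 * writes "ok" to a pipe that was opened before the switch.
 * Returns the child's exit status, or -1 on error or if it was killed. */
int run_strict_child(char *buf, size_t len)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        /* After this call only read, write, exit and sigreturn are allowed. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            syscall(SYS_exit, 2);
        if (write(fds[1], "ok", 2) != 2)
            syscall(SYS_exit, 3);
        /* glibc's _exit() uses exit_group(), which strict mode does not
         * permit, so the plain exit syscall is invoked directly. */
        syscall(SYS_exit, 0);
    }

    close(fds[1]);
    ssize_t n = read(fds[0], buf, len - 1);
    buf[n > 0 ? n : 0] = '\0';
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Anything the child tried beyond those four calls, even opening another file, would get it killed, which is exactly why this mode is so limiting.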
This was useful for some small number of use cases. Unfortunately it’s not
useful for more interesting applications like databases, main loops of servers,
etc. If you can’t accept a new network connection, how can you do any
interesting work? Sure, you can clone()/fork() after every accept, but
that’s going to be really slow. Fortunately there were some ideas to extend the
scheme, finally ending with a BPF-based filter. That filter allows you in
practice to do any kind of stateless check on the syscall (or a packet, as it
was initially designed to do). Now, with a very small and pretty well designed
bytecode, you can validate not only the syscall itself, but also its arguments.
This allows you to create conditions like “allow write(), but only to
stdout/err”.
More interesting scenarios
Of course at this point we can do more creative things - for example you don’t
have to specify your policy at the start of an app. You can skip the
initialisation, which is usually local and trusted and only enable restrictions
after binding all sockets. You can have a restricted initialisation profile
which leaves the prctl() call available and applies more restrictions later
on. You can also have per-thread profiles which limit your main process to
only accept the connections and hand them off to workers. The possibilities
here are really great.
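As a rough sketch of the "initialise first, restrict later" idea, the filter can simply be installed between the two phases. The names `install_filter` and `run_staged` are invented for this example, and the two callbacks stand in for whatever initialisation and main loop a real application has; the filter itself is assumed to be built elsewhere.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Hypothetical helper: install an already built BPF filter for the
 * calling thread. Returns 0 on success, -1 on failure. */
int install_filter(const struct sock_fprog *prog)
{
    /* Unprivileged processes must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
}

/* Hypothetical start-up order: init() and serve() are placeholders
 * provided by the application. */
int run_staged(int (*init)(void), int (*serve)(void),
               const struct sock_fprog *runtime_filter)
{
    /* Trusted, local set-up (config files, binding sockets): unrestricted. */
    if (init() != 0)
        return -1;
    /* From here on only the syscalls allowed by the filter will work. */
    if (install_filter(runtime_filter) != 0)
        return -1;
    /* Untrusted input is handled only after the restrictions apply. */
    return serve();
}
```

Filters can only be stacked, never removed, so a later and stricter profile is added the same way, as long as the earlier one still lets prctl() through.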
To be fair to the external frameworks, you can also provide after-init profiles
in AppArmor and SELinux, but this has the downside of relying on a specific
implementation. There is no common API for this operation, so AppArmor has
change_hat(), while SELinux has setcon(). Doing this in-process, using seccomp,
doesn’t rely on a distro-specific framework.
So what’s the best way to learn about things like seccomp filters? Implement it
for some existing useful application of course! Like memcached, which already
accepted a less restrictive drop_privileges() implementation for Solaris (it
prevents forks, execs, messing with session, but not any reads/writes as far as
I understand).
The implementation
As I mentioned, the bytecode is really simple and can be written by hand, but
it’s not a great experience. For the syscall scenario there’s not much there,
apart from loading a value at a given offset and comparing it against a known
value. Additionally, as the seccomp documentation warns, the architecture needs
to be checked before other operations. Otherwise it could be possible to bypass
the restrictions by making calls through a different syscall ABI, where the same
numbers refer to different syscalls.
Assuming the bytecode is the way to go, there are three ways to write it:
either compile some code using bpfc, construct the bytecode inline using
macros, or use libseccomp for convenience.
Using a compiler has some benefits over using just macros - it can check your
code for simple mistakes and handles all the jumps correctly. This is important
because all jump offsets are relative (and always positive to prevent loops) -
it makes maintaining and editing rules pretty hard. The compiler itself doesn’t
do much conversion however; it works with assembler-like source and converts
each line into bytecode.
The result of either the compilation or hand coding is very likely to be just a
list of compares and jumps. Although straightforward, if you use macros it will
still look something like this:
BPF_STMT(BPF_LD + BPF_W + BPF_ABS, arch_nr),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ARCH_NR, 1, 0),
BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
This compares the expected architecture to the one currently in use and skips
over the kill instruction if they match. Not so great to read.
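For context, a complete hand-written filter might look roughly like the sketch below. It assumes an x86-64 target and defines `arch_nr`, `syscall_nr` and `ARCH_NR` itself; the names `example_filter` and `example_prog` are only for this illustration. After the architecture check it loads the syscall number and allows just write() and exit_group().

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Assumption: the filter is written for x86-64 only. */
#define ARCH_NR AUDIT_ARCH_X86_64
#define arch_nr (offsetof(struct seccomp_data, arch))
#define syscall_nr (offsetof(struct seccomp_data, nr))

static struct sock_filter example_filter[] = {
    /* Load the architecture and kill the caller if it is unexpected. */
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, arch_nr),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ARCH_NR, 1, 0),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
    /* Load the syscall number. */
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, syscall_nr),
    /* write(): fall through to ALLOW, otherwise skip one instruction. */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, SYS_write, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    /* exit_group(): same pattern. */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, SYS_exit_group, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    /* Everything else is killed. */
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
};

/* The program descriptor that would be passed to
 * prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &example_prog). */
static const struct sock_fprog example_prog = {
    .len = sizeof(example_filter) / sizeof(example_filter[0]),
    .filter = example_filter,
};
```

Every new syscall adds another compare/return pair, and inserting anything in the middle means rechecking the jump offsets around it, which is the maintenance problem mentioned above.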
The easier way - libseccomp
As with any assembly language, syscall filtering can be abstracted a bit to
provide nicer and simpler mechanisms in a high-level description. In the case of
seccomp that is done by libseccomp.
It’s a C library which takes descriptions of the rules and makes writing your
own filter much easier. While it doesn’t provide full
control over the resulting bpf bytecode, it doesn’t have to at the moment,
since seccomp itself doesn’t enable us to do anything else apart from analysing
the call and choosing an answer. The result may even be better than the
bytecode written by hand if any optimisation techniques are applied in the future.
The interface is pretty simple. You initialise a filter, add some rules to it,
then install the filter. Syscall arguments can be checked in simple ways when
adding a rule. For example this will allow only write(), close() and
program termination (error checking omitted):
#include <seccomp.h>

/* Default action: kill on any syscall that has no matching rule. */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
/* Install the filter, then free the userspace description. */
seccomp_load(ctx);
seccomp_release(ctx);
Short, simple and effective.
In the next part I’ll describe how to apply such rules to an existing
application, like memcached, and what tools to use to verify everything still
works as expected.