Seccomp sandboxes and memcached example, part 2

As explained in the previous post, seccomp can be used for protecting the system and users from misbehaving and exploited applications. But there has to be some work done to actually enable the protection it offers. That’s where the programming part starts and possibly some exploration of the application you want to constrain.

The application I chose to play with is memcached. It’s a good example in this case, because the interaction it’s supposed to have with the system is pretty minimal. The data isn’t persisted, and there’s a good separation between the thread accepting the connections and the threads which do the set/get work.

Where to start

To actually implement the seccomp protection in a daemon, you can go one of two ways:

  • either you have to know the application inside out and know all the syscalls it can make (unlikely, unless you started it), or
  • you can run it in as many ways as you can and collect the “normal behaviour” profile, which you can then enforce

These points may be familiar to anyone who configured an apparmor / rbac / selinux profile before.

Of course I didn’t write memcached, so the first option is not available to me. There are likely a lot of things it does under the covers that I’m not aware of. This leaves me with the second option only, hoping that other users or developers will point out any special cases that were missed. The experimental approach is made much easier by the reasonable test suite included in the repository. It covers a lot of the possible scenarios and can be used to find a “normal” profile.

Unfortunately, as described later, having a test suite tightly integrated into the app makes the process a bit more difficult in other ways.

Useful tools

There are a few tools which can help in this case, and pretty much all of them rely in some way on the ptrace mechanism. The obvious choices here would be strace and gdb. The first one is exactly what we need to create a profile - it prints out just the system calls made by the application. It prints everything from the start of the program however, so not all of the calls will be interesting for us. We do not need to take the program initialisation into account, because the first prctl() call, which installs the filter, will happen after it. Most likely we also don’t need to care about the shared libraries being loaded at the beginning.

Gdb on the other hand may be useful for more in-depth investigation. Specifically, if some syscall needs to be restricted based on its arguments, it may be useful to set a breakpoint where it’s called and look at the live process. Gdb can also break at a signal that would otherwise exit the application.

Finding relevant syscalls

Let’s start with the main tool, strace, and a simple-to-understand app - cat. For a moment let’s pretend it doesn’t open more than a single file from the arguments list. To get the basic profile for an app, we can do the following:

strace cat /etc/hosts > /dev/null

The output should look like this:

execve("/usr/bin/cat", ["cat", "/etc/hosts"], [/* 53 vars */]) = 0
brk(0)                                  = 0x220c000

   # .... a lot of initialisation

open("/usr/lib/", O_RDONLY|O_CLOEXEC) = 3

   # .... loading shared libs

open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3

   # .... libc starts up

open("/etc/hosts", O_RDONLY)            = 3

   # !!! finally - the first syscall we'd be interested in

fstat(3, {st_mode=S_IFREG|0644, st_size=347, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f72c1f43000
read(3, "#\n# /etc/hosts: static lookup ta"..., 131072) = 347
write(1, "#\n# /etc/hosts: static lookup ta"..., 347) = 347
read(3, "", 131072)                     = 0
munmap(0x7f72c1f43000, 139264)          = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

So if we did want to secure cat, we’d be interested in anything that happens after opening the file(s). We wouldn’t expect anything apart from reading from those files and writing to stdout to happen. Additional things like fstat() and mmap() / fadvise64() may be treated here either as part of the setup, or as part of the execution. You’ll have to find the right answer, depending on what you want to protect yourself from.

Trying to protect this execution, I’d set up a profile before fstat() allowing:

  • fstat(3, ...)
  • fadvise64(3, ...)
  • mmap(...)
  • read(3, ...)
  • write([1, 2], ...)
  • munmap(...)
  • close(...)
  • exit_group(...).

What does that protect us from? For example from writing to any other files, executing processes, spawning new threads, or opening other files. What could be easily missed? Although this doesn’t happen in this trace, an error could cause some writes to stderr. I’m sure there are some other cases too that need to be tested.

In case of a bigger program, you can get an easy overview of what happened during the execution by dumping the strace output into a file (the -o option followed by a file name), deleting the initialisation part and grep/sed/awk-ing out just the syscall names that follow. This works just as well for multithreaded apps, but you have to remember each thread can carry its own profile, so it’s useful to think about each pid’s syscalls separately.

Writing a simple test

Let’s try adding simple rules to a contrived program now. It will only print out two lines of text - one before the seccomp initialisation and one after. The original “program” looks like this:

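#include <stdio.h>
int main(void) {
    printf("before seccomp\n");  /* message text is illustrative */
    printf("after seccomp\n");   /* seccomp setup will go between the two */
    return 0;
}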

First, let’s add rules which should block the second line of text. Allow just enough syscalls to exit successfully - exit_group() - and to not break system behaviour - sigreturn. Anything else will fail and set errno to EACCES.

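#include <stdio.h>
#include <seccomp.h>
#include <errno.h>
int main(void) {
    printf("before seccomp\n");  /* message text is illustrative */
    /* Default action: refuse the syscall and set errno to EACCES. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EACCES));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sigreturn), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_load(ctx);           /* the filter only takes effect once loaded */
    printf("after seccomp\n");   /* write() is now refused */
    return 0;
}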

Now the second line does not get printed and the process exits. What does it look like in strace then? Actually, strace will not show the second attempted syscall at all, which is very unfortunate for debugging the results. To make unintended failures easier to spot, it’s better to set the failure mode to SCMP_ACT_TRAP. It delivers a SIGSYS signal which, unless it’s handled, terminates the process, so the failure is clearly visible even though the program doesn’t get a chance to carry on.

To add a syscall necessary to print out the second line, we need to allow writes too. This can be done by allowing SCMP_SYS(write). But let’s make it a bit more interesting and allow only writing to stdout. To do that we need extra arguments for seccomp_rule_add(). Specifically, the count of argument comparisons becomes 1, and one more argument specifies that the first argument of write() must be equal to 1. The whole rule is:

seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
                 SCMP_SYS(write), 1,
                 SCMP_A0(SCMP_CMP_EQ, 1));

What’s missing

The above, plus reading the seccomp manpages, is pretty much all you need to know to apply seccomp rules to an application. Taking the memcached app as an example, I used strace to get a list of needed syscalls. Due to the nice design of the application, the separation of concerns is pretty good. The main thread only accepts the connections and sends them to the workers. That means the main thread doesn’t even need the read() call after initialisation.

The workers do quite a bit of work, but even that is fairly restricted and workers never need to open() or accept(). The only means of communication they have are sockets handed to them by the parent. Some extensions, like being able to shut down, are turned on/off by the command line configuration, so tgkill() for example is only allowed if the option was enabled during the startup.

The main problem was the test suite.

What about the test suite

Memcached includes a test suite which actually spawns the tested process itself. This means that tests need unrestricted open() and write() to report their results. But that’s an issue, because you don’t want those calls available in production. There are three ways (at least) to solve this issue:

  • you can create a new runtime parameter which turns on extended possibilities for the tests, or
  • ignore the tests and only restrict privileges in production, or
  • rewrite test cases to not require extra privileges

The second idea is the worst - tests exist for a reason. If you can’t test with the restrictions enabled, then how do you know the app works as intended? The last idea is pretty hard to implement though - moving a stable, existing project to a new test architecture just to enable a new feature is not a great idea.

For those reasons I chose the first option in the memcached patch. The tests will grab extra privileges, but will keep most restrictions. It’s also not possible to change the mode (test/production) after initialisation is complete, so this solution doesn’t lower the security.

Final implementation

The final PR is available and waiting to be tested / merged. I created it for two reasons. First, to learn how to use seccomp with a project that already exists. Second, to improve an existing project. Hopefully the two posts about it will cause some more people to look at seccomp and apply it to new projects. It’s not hard (unless the app’s design is very complicated), but it’s also not very popular yet. I really hope this changes in the future. Many exploits cannot be prevented this way, but the results of break-ins can be constrained a lot.

So if you’re writing a new daemon for Linux, give it a go (and check other platforms for their own sandboxing alternatives).
