Long-running server programs rarely do random, unpredictable things. They have a fairly well-defined set of behaviours, and anything outside that set could automatically be treated as a bug or a hack. For example, databases do a lot of ad hoc query processing, but mostly within a known context: read/write files in a known location, accept incoming connections, etc. You’d be surprised if your database started downloading files over HTTP or writing to your .bashrc.
This has of course been noticed, and there are many external solutions for controlling or whitelisting application behaviour on Linux. That’s the job of the Linux Security Modules: SELinux, AppArmor, Tomoyo, Smack and others can make sure that your database cannot write to your home directory or open outgoing connections (unless they’re to a known slave). But this is not always convenient, partly because the security profile becomes external to the app. This may be a bad thing: the app provides its rules, the distro modifies them, you want a different layout, and you are left fixing a third layer of those rules. But it may also be a good thing: the app doesn’t have to provide any policy, distros do, and you’re more secure by default.
But what about applications which want to provide their own restrictions, independent of the environment? Even better: can they provide restrictions in a better way than external LSM frameworks? Actually, to some extent they can…
There’s a pretty interesting interface called seccomp, which started its life a few years ago. You can find some descriptions of the first implementation on LWN. At the time it only allowed simple sandboxing of an app without any detailed configuration: once enabled, the process could only read(), write(), exit and return from signal handlers. The use case was to set up all needed connections first and then switch to a very restricted environment where only the already-open descriptors and accessible memory matter.
This was useful for a small number of use cases. Unfortunately it’s not useful for more interesting applications like databases, main loops of servers, etc. If you can’t accept a new network connection, how can you do any interesting work? Sure, you can clone()/fork() after every accept, but that’s going to be really slow. Fortunately there were some ideas to extend the scheme, finally ending with a BPF-based filter. In practice that filter allows you to do any kind of stateless check on the syscall (or on a packet, which is what BPF was originally designed for). Now, with a very small and pretty well designed bytecode, you can validate not only the syscall itself, but also its arguments. This allows you to create conditions like “allow write(), but only to stdout/stderr”.
More interesting scenarios
Of course at this point we can do more creative things. For example, you don’t have to specify your policy at the start of an app. You can skip the initialisation, which is usually local and trusted, and only enable restrictions after binding all sockets. You can have a restricted initialisation profile which leaves the prctl() call available and applies more restrictions later on. You can also have per-thread profiles which limit your main process to only accepting connections and handing them off to workers. The possibilities here are really great.
To be fair to the external frameworks, you can also provide after-init profiles in AppArmor and SELinux, but this has the downside of relying on a specific implementation. There is no common API for this operation, so AppArmor has change_hat(), while SELinux has setcon(). Doing this in-process with seccomp doesn’t rely on a distro-specific framework.
So what’s the best way to learn about things like seccomp filters? Implement them for some existing, useful application of course! Like memcached, which already accepted a less restrictive drop_privileges() implementation for Solaris (as far as I understand, it prevents forks, execs and messing with the session, but not any reads or writes).
As I mentioned, the bytecode is really simple and can be written by hand, but it’s not a great experience. For the syscall scenario there’s not much to it apart from loading a value at a given offset and comparing it against a known value. Additionally, as the seccomp documentation warns, the architecture needs to be checked before anything else. Otherwise it could be possible to bypass the restrictions by making calls through a different syscall convention, where the same numbers mean different syscalls.
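To make "loading a value at a given offset" concrete: the only input a filter can read is the kernel's struct seccomp_data from <linux/seccomp.h>. The following is a minimal sketch of the offsets a filter would load from. The macro names SYSCALL_NR_OFFSET and ARCH_OFFSET and the helper arg_lo_offset() are made up for illustration, and the helper assumes a little-endian machine.

```c
#include <stddef.h>
#include <linux/seccomp.h>

/* Offsets into struct seccomp_data, the only data a seccomp filter can load. */
#define SYSCALL_NR_OFFSET offsetof(struct seccomp_data, nr)   /* syscall number */
#define ARCH_OFFSET       offsetof(struct seccomp_data, arch) /* AUDIT_ARCH_* value */

/*
 * Classic BPF loads 32-bit words, while syscall arguments are stored as
 * 64-bit values. Hypothetical helper: offset of the low 32 bits of
 * argument n, assuming a little-endian machine (big-endian would need +4).
 */
static inline size_t arg_lo_offset(unsigned int n)
{
    return offsetof(struct seccomp_data, args) + n * sizeof(__u64);
}
```

A filter that wants to inspect the file descriptor passed to write() would therefore load the word at arg_lo_offset(0), after it has already checked the words at ARCH_OFFSET and SYSCALL_NR_OFFSET.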
Assuming the bytecode is the way to go, there are three ways to write it. Either compile some code using bpfc, construct the bytecode inline using macros, or use libseccomp for convenience.
Using a compiler has some benefits over using just macros - it can check your code for simple mistakes and handles all the jumps correctly. This is important because all jump offsets are relative (and always positive to prevent loops) - it makes maintaining and editing rules pretty hard. The compiler itself doesn’t do much conversion however; it works with assembler-like source and converts each line into bytecode.
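The relative-offset bookkeeping is easy to get wrong by hand, so one option is to let a small helper compute it. This is only a sketch: build_allowlist() is a hypothetical function, not part of any library, and it leaves out the architecture check purely to keep the offset arithmetic visible. It uses the BPF_STMT and BPF_JUMP macros from <linux/filter.h>.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fill `out` with an allow-list filter for `n` syscall
 * numbers and return the instruction count, or 0 if it does not fit.
 * Layout: load nr, n compares, RET KILL, RET ALLOW.
 * A real filter would check the architecture first.
 */
static inline size_t build_allowlist(const int *nrs, size_t n,
                                     struct sock_filter *out, size_t cap)
{
    /* jt/jf are 8-bit fields: one jump can skip at most 255 instructions. */
    if (n == 0 || n > 255 || cap < n + 3)
        return 0;

    size_t pc = 0;
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                                             offsetof(struct seccomp_data, nr));
    for (size_t i = 0; i < n; i++) {
        /* On a match, skip the remaining compares plus the KILL: n - i. */
        out[pc++] = (struct sock_filter)BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K,
                                                 (unsigned int)nrs[i],
                                                 (unsigned char)(n - i), 0);
    }
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL);
    out[pc++] = (struct sock_filter)BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW);
    return pc;
}
```

Adding or removing a syscall changes every jump target before it, which is exactly the maintenance problem when the numbers are typed in by hand.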
The result of either the compilation or hand coding is very likely to be just a list of compares and jumps. Although straightforward, if you use macros it will still look something like this:
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
This loads the architecture of the current call (arch_nr here is taken to be the offset of the architecture field in the syscall data), compares it against the expected ARCH_NR, and skips over the kill instruction if they match. Not so great to read.
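For a fuller picture, here is a sketch of a complete hand-written filter that allows exit_group() and write() to stdout/stderr only. Everything outside the kernel headers is an assumption for illustration: the ARCH_NR selection, install_filter(), and especially run_filter(), which is a toy userspace interpreter for the three instruction kinds used here (not the kernel's implementation), so the logic can be tried without installing anything. The argument check looks only at the low 32 bits and assumes little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define ARCH_NR AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define ARCH_NR AUDIT_ARCH_AARCH64
#elif defined(__i386__)
#define ARCH_NR AUDIT_ARCH_I386
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

static struct sock_filter filter[] = {
    /*  0 */ BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
    /*  1 */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ARCH_NR, 1, 0),         /* ok: goto 3 */
    /*  2 */ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
    /*  3 */ BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
    /*  4 */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_exit_group, 5, 0), /* goto 10 */
    /*  5 */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 3),      /* else goto 9 */
    /*  6 */ BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /*  7 */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 1, 2, 0),               /* stdout: goto 10 */
    /*  8 */ BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 2, 1, 0),               /* stderr: goto 10 */
    /*  9 */ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL),
    /* 10 */ BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
};

/* Toy interpreter for the instruction kinds above; for illustration only. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
                           const struct seccomp_data *data)
{
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &prog[pc];
        switch (in->code) {
        case BPF_LD + BPF_W + BPF_ABS:   /* load a 32-bit word at offset k */
            assert(in->k + sizeof(acc) <= sizeof(*data));
            memcpy(&acc, (const char *)data + in->k, sizeof(acc));
            break;
        case BPF_JMP + BPF_JEQ + BPF_K:  /* relative, forward-only jump */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case BPF_RET + BPF_K:            /* final verdict */
            return in->k;
        default:
            assert(!"instruction not handled by this sketch");
        }
    }
    return SECCOMP_RET_KILL;
}

/* Hypothetical helper: verdict for syscall `nr` with first argument `arg0`. */
uint32_t check(int nr, uint64_t arg0)
{
    struct seccomp_data d = { .nr = nr, .arch = ARCH_NR, .args = { arg0 } };
    return run_filter(filter, sizeof(filter) / sizeof(filter[0]), &d);
}

/* Install the filter for real; afterwards every other syscall is fatal. */
int install_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    /* Needed so an unprivileged process is allowed to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

With this sketch, check(__NR_write, 1) yields SECCOMP_RET_ALLOW while check(__NR_write, 5) yields SECCOMP_RET_KILL. Note how every jump count in the table had to be worked out against the instruction indices in the comments.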
The easier way - libseccomp
As with any assembly language, syscall filtering can be abstracted a bit to provide nicer and simpler mechanisms in a high-level description. In the case of seccomp that is done by libseccomp. It’s a C library which takes a description of the rules and generates the filter for you, which is much easier than writing the bytecode yourself. While it doesn’t provide full control over the resulting BPF bytecode, it doesn’t have to at the moment, since seccomp itself doesn’t let us do anything apart from analysing the call and choosing an answer. The result may even end up better than hand-written bytecode if optimisation techniques are applied in the future.
The interface is pretty simple. You initialise a filter, add some rules to it, then install the filter. Syscall arguments can be checked in simple ways when adding a rule. For example, this will allow only write(), close(), returning from signal handlers and program termination (error checking omitted):
    #include <seccomp.h>

    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);  /* default action: kill */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_load(ctx);     /* install the filter in the kernel */
    seccomp_release(ctx);  /* free the userspace context */
Short, simple and effective.
In the next part I’ll describe how to apply such rules to an existing application like memcached, and what tools to use to verify that everything still works as expected.