Learning About Syscall Filtering With Seccomp
I’d heard about being able to run Docker containers with a custom security profile, but wasn’t really sure what that meant or what was happening behind the scenes, so I decided to do some experimentation to find out.
It turns out that the Linux kernel includes a feature called “secure computing mode,” or seccomp for short. Using seccomp lets you tell the kernel that you only expect your program to use a specific set of system calls, and if your program makes any system calls that aren’t in your approved list, the kernel should kill your program.
But why would you want to do this? I think if you had a pretty simple program, using seccomp might be overkill. But if your program makes different system calls depending on possibly-untrustworthy user input, it might make sense to try to limit what the program is allowed to do. Looking at a list of software using seccomp on Wikipedia backs this up: the software listed is mostly hypervisors/container runners (like Docker), web browsers, etc.
By reading the manual page for the seccomp(2) system call, we can learn how to write a program to try this out. The simplest action is to enter “strict mode,” which prevents all system calls except for read(2), write(2), _exit(2), and sigreturn(2). In other words, what I think should be just enough to write hello world! Let’s give it a shot:
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <stdio.h>

int
main()
{
    /* Enter strict mode: from here on only read, write, _exit, and
     * sigreturn are permitted */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
        perror("prctl");
        return 1;
    }
    printf("hello, world!\n");
    return 0;
}
When I compile and run my program, I just see Killed being printed, not hello, world! Well, this is pretty good evidence that seccomp is doing something: it’s at least killing my program! Let’s try to find out why it’s being killed using strace, a program that shows you all of the system calls being made:
$ strace ./hello
execve("./hello", ["./hello"], 0x7fff77b754b0 /* 20 vars */) = 0
brk(NULL) = 0x559e08463000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=25762, ...}) = 0
mmap(NULL, 25762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe65b9f0000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe65b9ee000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fe65b3df000
mprotect(0x7fe65b5c6000, 2097152, PROT_NONE) = 0
mmap(0x7fe65b7c6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fe65b7c6000
mmap(0x7fe65b7cc000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe65b7cc000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7fe65b9ef4c0) = 0
mprotect(0x7fe65b7c6000, 16384, PROT_READ) = 0
mprotect(0x559e077b9000, 4096, PROT_READ) = 0
mprotect(0x7fe65b9f7000, 4096, PROT_READ) = 0
munmap(0x7fe65b9f0000, 25762) = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
fstat(1, <unfinished ...>) = ?
+++ killed by SIGKILL +++
Killed
There’s a lot at the beginning about loading dynamically linked libraries, reading the program binary, and mapping it into memory that I don’t fully understand. But the last few syscalls provide some clues: right after prctl is called, we see fstat being called! fstat is a system call for getting the status of a file, and 1 happens to be the file descriptor for standard output.
It makes sense that calling printf might involve checking the status of standard output, so I tried commenting out the call to printf in hello.c. When I compiled and ran the new version, it still just printed Killed, so I used strace again. Just looking at the last few lines:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
exit_group(0) = ?
+++ killed by SIGKILL +++
Killed
Now my program is making the exit_group system call. Thinking back to the manual page for seccomp, it said:
The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2).
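To see what it would take to stay inside that list, here is a minimal sketch of a greeting that survives strict mode. It is my own illustration, not something from the manual page: strict_hello is a made-up helper name, and the sketch assumes Linux with glibc. It forks a child so the caller is not locked down itself, prints with a bare write(2), and leaves through the raw exit syscall, since glibc's normal exit path ends in exit_group.

```c
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* strict_hello (hypothetical helper) forks a child that enters strict mode,
 * prints a greeting using only write(2), and leaves through the raw exit
 * syscall. It returns the child's wait status, or -1 if fork or waitpid
 * failed. */
int
strict_hello(void)
{
    static const char msg[] = "hello, world!\n";
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Strict mode can be refused, for example when a seccomp filter
         * is already active; exit normally with a distinct code then */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        /* write(2) is on the strict-mode list; printf's fstat is not */
        if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0)
            syscall(SYS_exit, 1);
        /* glibc's exit() and _exit() use exit_group, so call exit directly */
        syscall(SYS_exit, 0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Calling strict_hello() from main should print the greeting, and WIFEXITED on the returned status should be true rather than reporting the SIGKILL we saw above. It also shows why the printf version has no chance in strict mode: nearly everything libc does on our behalf falls outside those four calls.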
It looks like I’ll need to actually do some real filtering if I want to run my hello world program and not just use strict mode. To do this, we need to use SECCOMP_MODE_FILTER and pass a pointer to a struct sock_fprog, which according to the manpage is “a Berkeley Packet Filter program designed to filter arbitrary system calls and system call arguments.”
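To get a feel for what such a BPF program looks like, here is a rough sketch of a hand-written filter that allows only fstat, write, and exit_group and kills on anything else. The names allow_filter and install_filter are made up for illustration, and the architecture check assumes x86-64; treat it as a sketch of the general shape rather than a hardened filter.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* allow_filter (hypothetical name): each jump's "true" offset counts how
 * many instructions to skip to reach the final ALLOW */
static struct sock_filter allow_filter[] = {
    /* Load the architecture; kill unless it is x86-64 (an assumption) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* Load the syscall number and compare it against the allow list */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_fstat, 3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
    /* No match: kill. Match: allow. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* install_filter (hypothetical name) loads the filter for the calling
 * thread and returns 0 on success or -1 with errno set */
int
install_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(allow_filter) / sizeof(allow_filter[0])),
        .filter = allow_filter,
    };

    /* Unprivileged processes must set no_new_privs before loading */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Every extra syscall means another jump and recounting the offsets of the jumps above it, which gets error-prone quickly.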
While we could construct a BPF program using an array of struct sock_filters, looking at the chain of instructions we’d need made me think it would be much easier to enlist the services of libseccomp, a library designed for just this purpose. Let’s try rewriting hello.c to use libseccomp and allow the three syscalls our program needs (fstat, write, and exit_group):
#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>

scmp_filter_ctx ctx;

/* graceful_exit cleans up our seccomp context before exiting */
void
graceful_exit(int rc)
{
    seccomp_release(ctx);
    exit(rc);
}

/* setup_seccomp initializes seccomp and loads our BPF program that filters
 * syscalls into the kernel */
void
setup_seccomp()
{
    int rc;

    /* Initialize the seccomp filter state */
    if ((ctx = seccomp_init(SCMP_ACT_KILL)) == NULL) {
        graceful_exit(1);
    }
    if ((rc = seccomp_reset(ctx, SCMP_ACT_KILL)) != 0) {
        graceful_exit(1);
    }

    /* Add allowed system calls to the BPF program */
    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0)) != 0) {
        graceful_exit(1);
    }
    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0)) != 0) {
        graceful_exit(1);
    }
    if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0)) != 0) {
        graceful_exit(1);
    }

    /* Load the BPF program for the current context into the kernel */
    if ((rc = seccomp_load(ctx)) != 0) {
        graceful_exit(1);
    }
}

int
main()
{
    setup_seccomp();
    printf("hello, world!\n");
    graceful_exit(0);
}
Since we’re now using libseccomp, we need to tell our C compiler to link the library:
$ cc -o hello hello.c -lseccomp
$ ./hello
hello, world!
Success! Our program compiles and runs, and all of the necessary syscalls have been allowed. Now let’s try modifying the main() function of our program to do something bad, like trying to read the password file /etc/shadow:
int
main()
{
    FILE *fd;

    setup_seccomp();
    printf("hello, world!\n");
    if ((fd = fopen("/etc/shadow", "r")) == NULL) {
        perror("fopen");
        graceful_exit(1);
    }
    fclose(fd);
    graceful_exit(0);
}
Now when we compile and run our program, we get:
$ ./hello
hello, world!
Bad system call (core dumped)
Nice! The kernel killed our program when we tried to use a system call (openat) that we didn’t plan on!
Now let’s go back to how this all fits into Docker. Looking at Docker’s default seccomp profile, a lot of it starts to make more sense. In fact, it looks like they’re using the exact same names from libseccomp that we used in our program! If we search the moby source code for libseccomp, we can see that it is indeed being used (via Go bindings).
Let’s try to use a custom seccomp profile to prohibit programs in our Docker container from listening for network connections. To start, I want to make sure I can accept network connections, then modify my profile and watch it break. I downloaded the default seccomp profile to use as a starting point for tweaking, started a container with port 4000 open, then used nc to try communicating from my host machine to a listener in the Docker container:
$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
When I run echo hi | nc 127.0.0.1 4000 in a separate terminal, my greeting is printed by the netcat listener in the Docker container. Success! Now that I know my basic TCP server works, let’s try blocking it with seccomp! To start listening on a TCP port, I know that nc has to use the socket, bind, and listen system calls (which we can verify using strace). I’ll try removing them from the list of allowed system calls in the default profile, and run the Docker container again with the modified profile:
$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
nc: socket(AF_INET,1,0): Operation not permitted
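Awesome! We just used seccomp to control what our Docker container is allowed to do! Notice that nc wasn’t killed this time. The default action in Docker’s profile is SCMP_ACT_ERRNO, so a syscall that isn’t on the allow list fails with an error (here “Operation not permitted”) instead of ending the process.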
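If nc isn’t handy, the same check can be approximated with a few lines of C. This is only a sketch of the three calls mentioned above, not what nc actually does internally: first_listen_failure and report_listen are names I made up, and it binds to all interfaces for simplicity.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* first_listen_failure (hypothetical helper) makes the socket, bind, and
 * listen calls in order. It returns NULL if all three succeed, or the name
 * of the first call that failed, with errno holding the reason. */
const char *
first_listen_failure(unsigned short port)
{
    struct sockaddr_in addr;
    const char *failed = NULL;
    int saved_errno;
    int fd;

    if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        return "socket";

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        failed = "bind";
    else if (listen(fd, 1) != 0)
        failed = "listen";

    saved_errno = errno; /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return failed;
}

/* report_listen prints which call was refused, if any */
void
report_listen(unsigned short port)
{
    const char *failed = first_listen_failure(port);

    if (failed != NULL)
        printf("%s: %s\n", failed, strerror(errno));
    else
        printf("listening on port %u is allowed\n", (unsigned)port);
}
```

Calling report_listen(4000) from main inside the container should name socket as the refused call under the modified profile, and report that listening is allowed under the default one.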
I can imagine this might be helpful if you had an environment where security was extremely important and you wanted to really lock down your containers, but it’s hard to imagine that writing custom seccomp profiles for every container in your production environment is the best use of time without having some specific situation you’re trying to address.