[RFC,1/3] seccomp: add a return code to trap to userspace

Submitted by Tycho Andersen on Feb. 4, 2018, 10:49 a.m.

Details

Message ID 20180204104946.25559-2-tycho@tycho.ws
State New
Series "seccomp trap to userspace"
Headers show

Commit Message

Tycho Andersen Feb. 4, 2018, 10:49 a.m.
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex. Also worth noting that there
is one race still present:

  1. a task does a SECCOMP_RET_USER_NOTIF
  2. the userspace handler reads this notification
  3. the task dies
  4. a new task with the same pid starts
  5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
     that the previous one did
  6. the userspace handler writes a response

There's no way to distinguish this case right now. Maybe we care, maybe we
don't, but it's worth noting.

Right now the interface is a simple structure copy across a file
descriptor. We could potentially invent something fancier.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/seccomp.c                              | 366 +++++++++++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++-
 5 files changed, 502 insertions(+), 6 deletions(-)

Patch hide | download patch | download mbox

diff --git a/arch/Kconfig b/arch/Kconfig
index 400b9e1b2f27..2946cb6fd704 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -387,6 +387,13 @@  config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 10f25f7e4304..ce07da2ffd53 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@ 
 #include <uapi/linux/seccomp.h>
 
 #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG)
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_GET_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..4a342aa2e524 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@ 
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	1
-#define SECCOMP_FILTER_FLAG_LOG		2
+#define SECCOMP_FILTER_FLAG_TSYNC		1
+#define SECCOMP_FILTER_FLAG_LOG			2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@ 
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@  struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5f0dfb2abb8d..9541eb379e74 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -38,6 +38,52 @@ 
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_READ,
+	SECCOMP_NOTIFY_WRITE,
+};
+
+struct seccomp_knotif {
+	/* The pid whose filter triggered the notification */
+	pid_t pid;
+
+	/*
+	 * The "cookie" for this request; this is unique for this filter.
+	 */
+	u32 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not
+	 * 	yet been sent to userspace
+	 * SECCOMP_NOTIFY_READ: sent to userspace but no response yet
+	 * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has
+	 * 	not yet been written back to the application
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_WRITE */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_WRITE */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -64,6 +110,30 @@  struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/*
+	 * A lock for all notification-related accesses.
+	 */
+	struct mutex notify_lock;
+
+	/*
+	 * Is there currently an attached listener?
+	 */
+	bool has_listener;
+
+	/*
+	 * A list of struct seccomp_knotif elements.
+	 */
+	struct list_head notifications;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +453,12 @@  static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+#endif
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -547,13 +623,15 @@  static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -572,6 +650,9 @@  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -645,6 +726,89 @@  void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+/*
+ * Finds the next unique notification id.
+ */
+static u32 seccomp_next_notify_id(struct list_head *list)
+{
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	u32 id = get_random_u32();
+
+again:
+	list_for_each(cur, list) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == id) {
+			id = get_random_u32();
+			goto again;
+		}
+	}
+
+	return id;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	if (!match->has_listener) {
+		err = -ENOSYS;
+		goto out;
+	}
+
+	n.pid = current->pid;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(&match->notifications);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	/*
+	 * This syscall is getting interrupted. We no longer need to
+	 * tell userspace about it, and any userspace responses should
+	 * be ignored.
+	 */
+	mutex_lock(&match->notify_lock);
+	if (err < 0)
+		goto remove_list;
+
+	ret = n.val;
+	err = n.error;
+
+	WARN(n.state != SECCOMP_NOTIFY_WRITE,
+	     "notified about write complete when state is not write");
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 u32 action,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	WARN(1, "user notification received, but disabled");
+	seccomp_log(this_syscall, SIGSYS, action, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -722,6 +886,9 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -828,6 +995,10 @@  static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static struct file *init_listener(struct seccomp_filter *filter);
+#endif
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -847,6 +1018,8 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1030,28 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		listener = get_unused_fd_flags(O_RDWR);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -881,6 +1069,16 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -909,6 +1107,9 @@  static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1057,6 +1258,7 @@  long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1066,6 +1268,7 @@  static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1083,6 +1286,7 @@  static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1231,3 +1435,161 @@  static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct list_head *cur;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each(cur, &filter->notifications) {
+		struct seccomp_knotif *knotif;
+
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		knotif->state = SECCOMP_NOTIFY_WRITE;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+		complete(&knotif->ready);
+	}
+
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = f->private_data;
+	struct seccomp_knotif *knotif = NULL;
+	struct seccomp_notif unotif;
+	struct list_head *cur;
+	ssize_t ret;
+
+	/* No offset reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+		if (knotif->state == SECCOMP_NOTIFY_INIT)
+			break;
+	}
+
+	/*
+	 * We didn't find anything which is odd, because at least one
+	 * thing should have been queued.
+	 */
+	if (knotif->state != SECCOMP_NOTIFY_INIT) {
+		ret = -ENOENT;
+		WARN(1, "no seccomp notification found");
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_READ;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	ssize_t ret = -EINVAL;
+
+	/* No partial writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_WRITE;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	/* TODO: poll */
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct seccomp_filter *filter)
+{
+	struct file *ret;
+
+	mutex_lock(&filter->notify_lock);
+	if (filter->has_listener) {
+		mutex_unlock(&filter->notify_lock);
+		return ERR_PTR(-EBUSY);
+	}
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret)) {
+		__put_seccomp_filter(filter);
+	} else {
+		/*
+		 * Intentionally don't put_seccomp_filter(). The file
+		 * has a reference to it now.
+		 */
+		filter->has_listener = true;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 24dbf634e2dd..b43e2a70b08c 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,6 +40,7 @@ 
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
 
 #define _GNU_SOURCE
 #include <unistd.h>
@@ -141,6 +142,24 @@  struct seccomp_data {
 #define SECCOMP_FILTER_FLAG_LOG 2
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2063,7 +2082,8 @@  TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2845,6 +2865,98 @@  TEST(get_action_avail)
 	EXPECT_EQ(errno, EOPNOTSUPP);
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	ASSERT_EQ(ret, sizeof(req));
+
+	ASSERT_EQ(kill(pid, SIGKILL), 0);
+	ASSERT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks

Comments

Andy Lutomirski Feb. 4, 2018, 5:36 p.m.
On Sun, Feb 4, 2018 at 10:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.

Neat!

>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
>   1. a task does a SECCOMP_RET_USER_NOTIF
>   2. the userspace handler reads this notification
>   3. the task dies
>   4. a new task with the same pid starts
>   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>      that the previous one did
>   6. the userspace handler writes a response

I'm slightly confused.  I thought the id was never reused for a given
struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)

On very quick reading, I have a question.  What happens if a process
has two seccomp_filters attached, one of them returns
SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?
Tycho Andersen Feb. 4, 2018, 8:01 p.m.
Hi Andy,

On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote:
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex. Also worth noting that there
> > is one race still present:
> >
> >   1. a task does a SECCOMP_RET_USER_NOTIF
> >   2. the userspace handler reads this notification
> >   3. the task dies
> >   4. a new task with the same pid starts
> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
> >      that the previous one did
> >   6. the userspace handler writes a response
> 
> I'm slightly confused.  I thought the id was never reused for a given
> struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)

Well, what happens when u32/64 overflows? Eventually it will wrap.

> On very quick reading, I have a question.  What happens if a process
> has two seccomp_filters attached, one of them returns
> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?

Good question, in seccomp_run_filters(), the first (lowest, last
applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that
gets the notification and the other receives nothing.

I don't really have any reason to prefer this behavior, it's just what
happened without much thought.

Cheers,

Tycho
Andy Lutomirski Feb. 4, 2018, 8:33 p.m.
On Sun, Feb 4, 2018 at 8:01 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> Hi Andy,
>
> On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote:
>> > The actual implementation of this is fairly small, although getting the
>> > synchronization right was/is slightly complex. Also worth noting that there
>> > is one race still present:
>> >
>> >   1. a task does a SECCOMP_RET_USER_NOTIF
>> >   2. the userspace handler reads this notification
>> >   3. the task dies
>> >   4. a new task with the same pid starts
>> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>> >      that the previous one did
>> >   6. the userspace handler writes a response
>>
>> I'm slightly confused.  I thought the id was never reused for a given
>> struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)
>
> Well, what happens when u32/64 overflows? Eventually it will wrap.

I think we can safely assume that u64 won't overflow.  Even if we
processed one user return notification on a given seccomp_filter every
nanosecond (which would be insanely fast), that's 584 years.

>
>> On very quick reading, I have a question.  What happens if a process
>> has two seccomp_filters attached, one of them returns
>> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?
>
> Good question, in seccomp_run_filters(), the first (lowest, last
> applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that
> gets the notification and the other receives nothing.
>
> I don't really have any reason to prefer this behavior, it's just what
> happened without much thought.

Hmm.  This won't nest right.  Maybe we should just disallow a
user-notification-using filter from being applied if there is already
one in the stack.  Then, if anyone cares about making these things
nest right, they can fix it.
Tycho Andersen Feb. 5, 2018, 8:47 a.m.
On Sun, Feb 04, 2018 at 08:33:25PM +0000, Andy Lutomirski wrote:
> On Sun, Feb 4, 2018 at 8:01 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > Hi Andy,
> >
> > On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote:
> >> > The actual implementation of this is fairly small, although getting the
> >> > synchronization right was/is slightly complex. Also worth noting that there
> >> > is one race still present:
> >> >
> >> >   1. a task does a SECCOMP_RET_USER_NOTIF
> >> >   2. the userspace handler reads this notification
> >> >   3. the task dies
> >> >   4. a new task with the same pid starts
> >> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
> >> >      that the previous one did
> >> >   6. the userspace handler writes a response
> >>
> >> I'm slightly confused.  I thought the id was never reused for a given
> >> struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)
> >
> > Well, what happens when u32/64 overflows? Eventually it will wrap.
> 
> I think we can safely assume that u64 won't overflow.  Even if we
> processed one user return notification on a given seccomp_filter every
> nanosecond (which would be insanely fast), that's 584 years.

Yes, fair point r.e. u64. I'll make the change.

> >
> >> On very quick reading, I have a question.  What happens if a process
> >> has two seccomp_filters attached, one of them returns
> >> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?
> >
> > Good question, in seccomp_run_filters(), the first (lowest, last
> > applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that
> > gets the notification and the other receives nothing.
> >
> > I don't really have any reason to prefer this behavior, it's just what
> > happened without much thought.
> 
> Hmm.  This won't nest right.  Maybe we should just disallow a
> user-notification-using filter from being applied if there is already
> one in the stack.  Then, if anyone cares about making these things
> nest right, they can fix it.

Sounds fine to me, I'll add a check.

Cheers,

Tycho
Kees Cook Feb. 13, 2018, 9:09 p.m.
On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Related to the eBPF seccomp thread, can the logic for these things be
handled entirely by eBPF? My assumption is that you still need to stop
the process to do something (i.e. do a mknod, or a mount) before
letting it continue. Is there some "wait for notification" system in
eBPF?

> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.

Agreed: notification is extremely painful right now. The container
case is compelling, since it will always want a way to trick out these
kinds of filesystem calls.

> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
>   1. a task does a SECCOMP_RET_USER_NOTIF
>   2. the userspace handler reads this notification
>   3. the task dies
>   4. a new task with the same pid starts
>   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>      that the previous one did
>   6. the userspace handler writes a response
>
> There's no way to distinguish this case right now. Maybe we care, maybe we
> don't, but it's worth noting.

So, I'd like to avoid the cookie if possible (surprise). Why isn't it
possible to close the kernel-side of the fd to indicate that it lost
the pid it was attached to? Is this just that the reader has no idea
who is sending messages? So the risk is a fork/die loop within the
same process tree (i.e. attached to the same filter)? Hrmpf. I can't
think of a better way to handle the
one(fd)-to-many(task-with-that-filter-attached) situation...

> Right now the interface is a simple structure copy across a file
> descriptor. We could potentially invent something fancier.

I wonder if this communication should be netlink, which gives a more
well-structured way to describe what's on the wire? The reason I ask
is because if we ever change the seccomp_data structure, we'll now
have two places where we need to deal with it (the first being within
the BPF itself). My initial idea was to prefix the communication with
a size field, then send the structure, and then I had nightmares, and
realized this was basically netlink reinvented.

> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.

Is this really true? Couldn't a multi-threaded process muck with
memory out from under both the manager and the stopped process?

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 366 +++++++++++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++-
>  5 files changed, 502 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 400b9e1b2f27..2946cb6fd704 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -387,6 +387,13 @@ config SECCOMP_FILTER
>
>           See Documentation/prctl/seccomp_filter.txt for details.
>
> +config SECCOMP_USER_NOTIFICATION
> +       bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +       depends on SECCOMP_FILTER
> +       help
> +         Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +         programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>         bool
>         help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 10f25f7e4304..ce07da2ffd53 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>
>  #define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> -                                        SECCOMP_FILTER_FLAG_LOG)
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_GET_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..4a342aa2e524 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      1
> -#define SECCOMP_FILTER_FLAG_LOG                2
> +#define SECCOMP_FILTER_FLAG_TSYNC              1
> +#define SECCOMP_FILTER_FLAG_LOG                        2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER       4
>
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */

/me tries to come up with an ordering rationale here and fails.

An ERRNO filter would block a USER_NOTIF because it's unconditional.
TRACE could be either, USER_NOTIF could be either.

This means TRACE rules would be bumped by a USER_NOTIF... hmm.

>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u32 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u32 id;
> +       int error;
> +       long val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 5f0dfb2abb8d..9541eb379e74 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -38,6 +38,52 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION

I wonder if it's time to split up seccomp.c ... probably not, but I've
always been unhappy with the #ifdefs even for just regular _FILTER. ;)

> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_READ,
> +       SECCOMP_NOTIFY_WRITE,
> +};
> +
> +struct seccomp_knotif {
> +       /* The pid whose filter triggered the notification */
> +       pid_t pid;
> +
> +       /*
> +        * The "cookie" for this request; this is unique for this filter.
> +        */
> +       u32 id;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not
> +        *      yet been sent to userspace
> +        * SECCOMP_NOTIFY_READ: sent to userspace but no response yet
> +        * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has
> +        *      not yet been written back to the application
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_WRITE */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_WRITE */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +110,30 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       /*
> +        * A semaphore that users of this notification can wait on for
> +        * changes. Actual reads and writes are still controlled with
> +        * filter->notify_lock.
> +        */
> +       struct semaphore request;
> +
> +       /*
> +        * A lock for all notification-related accesses.
> +        */
> +       struct mutex notify_lock;
> +
> +       /*
> +        * Is there currently an attached listener?
> +        */
> +       bool has_listener;
> +
> +       /*
> +        * A list of struct seccomp_knotif elements.
> +        */

Nit: these 3 above can be one-line comments.

> +       struct list_head notifications;
> +#endif
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       mutex_init(&sfilter->notify_lock);
> +       sema_init(&sfilter->request, 0);
> +       INIT_LIST_HEAD(&sfilter->notifications);
> +#endif
> +
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
>                                     SECCOMP_LOG_TRACE |
> -                                   SECCOMP_LOG_LOG;
> +                                   SECCOMP_LOG_LOG |
> +                                   SECCOMP_LOG_USER_NOTIF;
>
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>                                bool requested)
> @@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +/*
> + * Finds the next unique notification id.
> + */
> +static u32 seccomp_next_notify_id(struct list_head *list)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       struct list_head *cur;
> +       u32 id = get_random_u32();
> +
> +again:
> +       list_for_each(cur, list) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               if (knotif->id == id) {
> +                       id = get_random_u32();
> +                       goto again;
> +               }
> +       }
> +
> +       return id;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       if (!match->has_listener) {
> +               err = -ENOSYS;
> +               goto out;
> +       }
> +
> +       n.pid = current->pid;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(&match->notifications);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notifications);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       /*
> +        * This syscall is getting interrupted. We no longer need to
> +        * tell userspace about it, and any userspace responses should
> +        * be ignored.
> +        */
> +       mutex_lock(&match->notify_lock);
> +       if (err < 0)
> +               goto remove_list;
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +       WARN(n.state != SECCOMP_NOTIFY_WRITE,
> +            "notified about write complete when state is not write");
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        u32 action,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       WARN(1, "user notification received, but disabled");
> +       seccomp_log(this_syscall, SIGSYS, action, true);
> +       do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
> @@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;
>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static struct file *init_listener(struct seccomp_filter *filter);
> +#endif
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               listener = get_unused_fd_flags(O_RDWR);
> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +                       break;
>         default:
>                 return -EOPNOTSUPP;
>         }
> @@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { }
>  };
>
> @@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct list_head *cur;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each(cur, &filter->notifications) {
> +               struct seccomp_knotif *knotif;
> +
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               knotif->state = SECCOMP_NOTIFY_WRITE;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +               complete(&knotif->ready);
> +       }
> +
> +       filter->has_listener = false;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +                                  size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = f->private_data;
> +       struct seccomp_knotif *knotif = NULL;
> +       struct seccomp_notif unotif;
> +       struct list_head *cur;
> +       ssize_t ret;
> +
> +       /* No offset reads. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       ret = down_interruptible(&filter->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each(cur, &filter->notifications) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +               if (knotif->state == SECCOMP_NOTIFY_INIT)
> +                       break;
> +       }
> +
> +       /*
> +        * We didn't find anything which is odd, because at least one
> +        * thing should have been queued.
> +        */
> +       if (knotif->state != SECCOMP_NOTIFY_INIT) {
> +               ret = -ENOENT;
> +               WARN(1, "no seccomp notification found");

I tend to prefer WARN_ONCE, just in case this ever finds itself
exposed to being triggered trivially from userspace.

> +               goto out;
> +       }
> +
> +       unotif.id = knotif->id;
> +       unotif.pid = knotif->pid;
> +       unotif.data = *(knotif->data);
> +
> +       size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = sizeof(unotif);
> +       knotif->state = SECCOMP_NOTIFY_READ;
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> +                                   size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       struct list_head *cur;
> +       ssize_t ret = -EINVAL;
> +
> +       /* No partial writes. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       size = min_t(size_t, size, sizeof(resp));

In this case, we can't use min_t, size _must_ be == sizeof(resp),
otherwise we're operating on what's in the stack (which is zeroed, but
still).

> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each(cur, &filter->notifications) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               if (knotif->id == resp.id)
> +                       break;

So we're finding the matching id here. Now, I'm trying to think about
how this will look in real-world use: the pid will be _blocked_ while
this happening. And all the other pids that trip this filter will
_also_ be blocked, since they're all waiting for the reader to read
and respond. The risk is pid death while waiting, and having another
appear with the same pid, trigger the same filter, get blocked, and
then the reader replies for the old pid, and the new pid gets the
results?

Since this notification queue is already linear, can't we use ordering
to enforce this? i.e. only the pid at the head of the filter
notification queue is going to have anything happening to it. Or is
the idea to have multiple readers/writers of the fd?

> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_WRITE;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .read = seccomp_notify_read,
> +       .write = seccomp_notify_write,
> +       /* TODO: poll */

What's needed for poll? I think you've got all the pieces you need
already, i.e. wait queue, notifications, etc.

> +       .release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct seccomp_filter *filter)
> +{
> +       struct file *ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       if (filter->has_listener) {
> +               mutex_unlock(&filter->notify_lock);
> +               return ERR_PTR(-EBUSY);
> +       }
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret)) {
> +               __put_seccomp_filter(filter);
> +       } else {
> +               /*
> +                * Intentionally don't put_seccomp_filter(). The file
> +                * has a reference to it now.
> +                */
> +               filter->has_listener = true;
> +       }

I spent some time staring at this, and I don't see it: where is the
get_() for this? The caller of init_listener() already does a put() on
the failure path. It seems like there is a get() missing near the
start of init_listener(), or I've entirely missed something.
(Regardless, I think the usage counting need a comment somewhere,
maybe near the top of seccomp.c with the field?)

> +
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 24dbf634e2dd..b43e2a70b08c 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,6 +40,7 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>
>  #define _GNU_SOURCE
>  #include <unistd.h>
> @@ -141,6 +142,24 @@ struct seccomp_data {
>  #define SECCOMP_FILTER_FLAG_LOG 2
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +       __u32 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u32 id;
> +       int error;
> +       long val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -                                SECCOMP_FILTER_FLAG_LOG };
> +                                SECCOMP_FILTER_FLAG_LOG,
> +                                SECCOMP_FILTER_FLAG_GET_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2845,6 +2865,98 @@ TEST(get_action_avail)
>         EXPECT_EQ(errno, EOPNOTSUPP);
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L

Is this just you mashing the numpad? :)

> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req;
> +       struct seccomp_notif_resp resp;
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       ASSERT_EQ(waitpid(pid, &status, 0), pid);
> +       ASSERT_EQ(true, WIFEXITED(status));
> +       ASSERT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_GET_LISTENER);
> +       ASSERT_GE(listener, 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +       ASSERT_EQ(waitpid(pid, &status, 0), pid);
> +       ASSERT_EQ(true, WIFEXITED(status));
> +       ASSERT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read(listener, &req, sizeof(req));
> +       ASSERT_EQ(ret, sizeof(req));
> +
> +       ASSERT_EQ(kill(pid, SIGKILL), 0);
> +       ASSERT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, EINVAL);
> +
> +       close(listener);
> +}

Yay selftests! :)

-Kees

> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.14.1
>
Tycho Andersen Feb. 14, 2018, 3:29 p.m.
Hey Kees,

Thanks for taking a look!

On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> 
> Related to the eBPF seccomp thread, can the logic for these things be
> handled entirely by eBPF? My assumption is that you still need to stop
> the process to do something (i.e. do a mknod, or a mount) before
> letting it continue. Is there some "wait for notification" system in
> eBPF?

I replied in the other thread
(https://patchwork.ozlabs.org/cover/872938/#1856642 for those
following along at home), but no, at least not that I know of.

> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex. Also worth noting that there
> > is one race still present:
> >
> >   1. a task does a SECCOMP_RET_USER_NOTIF
> >   2. the userspace handler reads this notification
> >   3. the task dies
> >   4. a new task with the same pid starts
> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
> >      that the previous one did
> >   6. the userspace handler writes a response
> >
> > There's no way to distinguish this case right now. Maybe we care, maybe we
> > don't, but it's worth noting.
> 
> So, I'd like to avoid the cookie if possible (surprise). Why isn't it
> possible to close the kernel-side of the fd to indicate that it lost
> the pid it was attached to?

Because the fd is for a filter, not a task.

> Is this just that the reader has no idea
> who is sending messages? So the risk is a fork/die loop within the
> same process tree (i.e. attached to the same filter)? Hrmpf. I can't
> think of a better way to handle the
> one(fd)-to-many(task-with-that-filter-attached) situation...

Yes, exactly. The cookie just adds uniqueness, and as Andy pointed out
if we switch to u64, the race above basically ("u64 should be enough
for anybody") goes away.

> > Right now the interface is a simple structure copy across a file
> > descriptor. We could potentially invent something fancier.
> 
> I wonder if this communication should be netlink, which gives a more
> well-structured way to describe what's on the wire? The reason I ask
> is because if we ever change the seccomp_data structure, we'll now
> have two places where we need to deal with it (the first being within
> the BPF itself). My initial idea was to prefix the communication with
> a size field, then send the structure, and then I had nightmares, and
> realized this was basically netlink reinvented.

I suggested netlink in LA, and everyone (especially Andy) groaned very
loudly :). I'm happy to switch it to netlink if you like, although i
think memcpy() of structs should be safe here, since the return value
from read or write can indicate the size of things.

> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> 
> Is this really true? Couldn't a multi-threaded process muck with
> memory out from under both the manager and the stopped process?

Sure, but as long as the manager copies the relevant arguments out of
the tracee's memory *before* evaluating whether it's safe to do the
thing the tracee wants to do, it's ok. The assumption here is that the
tracee can't corrupt the manager's memory (because if it could, lots
of other things would already be broken).

> >  /*
> >   * All BPF programs must return a 32-bit value.
> > @@ -34,6 +35,7 @@
> >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> 
> /me tries to come up with an ordering rationale here and fails.
> 
> An ERRNO filter would block a USER_NOTIF because it's unconditional.
> TRACE could be either, USER_NOTIF could be either.
> 
> This means TRACE rules would be bumped by a USER_NOTIF... hmm.

Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
seemed more important than USER_NOTIF, but TRACE didn't. I don't have
a strong opinion about what to do here, because users can adjust their
filters accordingly. Let me know what you prefer.

> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> 
> I wonder if it's time to split up seccomp.c ... probably not, but I've
> always been unhappy with the #ifdefs even for just regular _FILTER. ;)

A reasonable question. I'm happy to do that as a separate series
before this one goes in if you want.

> > +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> > +                                   size_t size, loff_t *ppos)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       struct list_head *cur;
> > +       ssize_t ret = -EINVAL;
> > +
> > +       /* No partial writes. */
> > +       if (*ppos != 0)
> > +               return -EINVAL;
> > +
> > +       size = min_t(size_t, size, sizeof(resp));
> 
> In this case, we can't use min_t, size _must_ be == sizeof(resp),
> otherwise we're operating on what's in the stack (which is zeroed, but
> still).

I'm not sure I follow. If the user passes in an old (smaller) struct
seccomp_notif_resp, we don't want to copy more than they specified. If
they pass in a bigger one, this will be sizeof(resp).

> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each(cur, &filter->notifications) {
> > +               knotif = list_entry(cur, struct seccomp_knotif, list);
> > +
> > +               if (knotif->id == resp.id)
> > +                       break;
> 
> So we're finding the matching id here. Now, I'm trying to think about
> how this will look in real-world use: the pid will be _blocked_ while
> this happening. And all the other pids that trip this filter will
> _also_ be blocked, since they're all waiting for the reader to read
> and respond. The risk is pid death while waiting, and having another
> appear with the same pid, trigger the same filter, get blocked, and
> then the reader replies for the old pid, and the new pid gets the
> results?

Yep, exactly.

> Since this notification queue is already linear, can't we use ordering
> to enforce this? i.e. only the pid at the head of the filter
> notification queue is going to have anything happening to it. Or is
> the idea to have multiple readers/writers of the fd?

I'm not really sure how we prevent multiple readers/writers of the fd.
But even with a single writer, the case you described "could" happen
(although again, with u64 cookies it shouldn't be a problem).

I'm not sure how ordering helps us though; the problem is really that
one entry for a pid was deleted, and a whole new one was created. So
ordering will look ok, but the response will go to the wrong pid.

> > +       }
> > +
> > +       if (!knotif || knotif->id != resp.id) {
> > +               ret = -EINVAL;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_WRITE;
> > +       knotif->error = resp.error;
> > +       knotif->val = resp.val;
> > +       complete(&knotif->ready);
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > +       .read = seccomp_notify_read,
> > +       .write = seccomp_notify_write,
> > +       /* TODO: poll */
> 
> What's needed for poll? I think you've got all the pieces you need
> already, i.e. wait queue, notifications, etc.

Nothing, I just didn't implement it. I will do so for v2.

> > +       .release = seccomp_notify_release,
> > +};
> > +
> > +static struct file *init_listener(struct seccomp_filter *filter)
> > +{
> > +       struct file *ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       if (filter->has_listener) {
> > +               mutex_unlock(&filter->notify_lock);
> > +               return ERR_PTR(-EBUSY);
> > +       }
> > +
> > +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > +                                filter, O_RDWR);
> > +       if (IS_ERR(ret)) {
> > +               __put_seccomp_filter(filter);
> > +       } else {
> > +               /*
> > +                * Intentionally don't put_seccomp_filter(). The file
> > +                * has a reference to it now.
> > +                */
> > +               filter->has_listener = true;
> > +       }
> 
> I spent some time staring at this, and I don't see it: where is the
> get_() for this? The caller of init_listener() already does a put() on
> the failure path. It seems like there is a get() missing near the
> start of init_listener(), or I've entirely missed something.

Ugh, yes. For the SECCOMP_FILTER_FLAG_GET_LISTENER case, you're right.
Originally I only had the ptrace-based one, and that has a get() in
get_nth_filter(), so the comment makes sense in that case.

I'll straighten this out for v2 and

> (Regardless, I think the usage counting need a comment somewhere,
> maybe near the top of seccomp.c with the field?)

...add a comment.

> > +#define USER_NOTIF_MAGIC 116983961184613L
> 
> Is this just you mashing the numpad? :)

Is there some better way to generate magic numbers? :)

Cheers,

Tycho
Andy Lutomirski Feb. 14, 2018, 5:19 p.m.
On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> Hey Kees,
>
> Thanks for taking a look!
>
> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> > This patch introduces a means for syscalls matched in seccomp to notify
>> > some other task that a particular filter has been triggered.
>> >
>> > The motivation for this is primarily for use with containers. For example,
>> > if a container does an init_module(), we obviously don't want to load this
>> > untrusted code, which may be compiled for the wrong version of the kernel
>> > anyway. Instead, we could parse the module image, figure out which module
>> > the container is trying to load and load it on the host.
>> >
>> > As another example, containers cannot mknod(), since this checks
>> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
>> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
>> > coding some whitelist in the kernel. Another example is mount(), which has
>> > many security restrictions for good reason, but configuration or runtime
>> > knowledge could potentially be used to relax these restrictions.
>>
>> Related to the eBPF seccomp thread, can the logic for these things be
>> handled entirely by eBPF? My assumption is that you still need to stop
>> the process to do something (i.e. do a mknod, or a mount) before
>> letting it continue. Is there some "wait for notification" system in
>> eBPF?
>
> I replied in the other thread
> (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> following along at home), but no, at least not that I know of.

eBPF can call functions.  One of those functions could put the caller
to sleep.  In fact, I think I once proposed doing this for the seccomp
logging action as well.

>> I wonder if this communication should be netlink, which gives a more
>> well-structured way to describe what's on the wire? The reason I ask
>> is because if we ever change the seccomp_data structure, we'll now
>> have two places where we need to deal with it (the first being within
>> the BPF itself). My initial idea was to prefix the communication with
>> a size field, then send the structure, and then I had nightmares, and
>> realized this was basically netlink reinvented.
>
> I suggested netlink in LA, and everyone (especially Andy) groaned very
> loudly :). I'm happy to switch it to netlink if you like, although i
> think memcpy() of structs should be safe here, since the return value
> from read or write can indicate the size of things.

I could easily get on board with "netlink" (i.e. NLA) messages sent
over an fd.  I will object strongly to the use of netlink *sockets*.

>
>> An ERRNO filter would block a USER_NOTIF because it's unconditional.
>> TRACE could be either, USER_NOTIF could be either.
>>
>> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
>
> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
> a strong opinion about what to do here, because users can adjust their
> filters accordingly. Let me know what you prefer.

If we switched to eBPF functions, this whole issue goes away.
Tycho Andersen Feb. 14, 2018, 5:23 p.m.
On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > Hey Kees,
> >
> > Thanks for taking a look!
> >
> > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > This patch introduces a means for syscalls matched in seccomp to notify
> >> > some other task that a particular filter has been triggered.
> >> >
> >> > The motivation for this is primarily for use with containers. For example,
> >> > if a container does an init_module(), we obviously don't want to load this
> >> > untrusted code, which may be compiled for the wrong version of the kernel
> >> > anyway. Instead, we could parse the module image, figure out which module
> >> > the container is trying to load and load it on the host.
> >> >
> >> > As another example, containers cannot mknod(), since this checks
> >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> >> > coding some whitelist in the kernel. Another example is mount(), which has
> >> > many security restrictions for good reason, but configuration or runtime
> >> > knowledge could potentially be used to relax these restrictions.
> >>
> >> Related to the eBPF seccomp thread, can the logic for these things be
> >> handled entirely by eBPF? My assumption is that you still need to stop
> >> the process to do something (i.e. do a mknod, or a mount) before
> >> letting it continue. Is there some "wait for notification" system in
> >> eBPF?
> >
> > I replied in the other thread
> > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> > following along at home), but no, at least not that I know of.
> 
> eBPF can call functions.  One of those functions could put the caller
> to sleep.  In fact, I think I once proposed doing this for the seccomp
> logging action as well.

Yes, true. We could always add a bpf_func_map_lookup_wait or
something. I can look into that if it's preferable.
Christian Brauner Feb. 15, 2018, 2:48 p.m.
On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > Hey Kees,
> >
> > Thanks for taking a look!
> >
> > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > This patch introduces a means for syscalls matched in seccomp to notify
> >> > some other task that a particular filter has been triggered.
> >> >
> >> > The motivation for this is primarily for use with containers. For example,
> >> > if a container does an init_module(), we obviously don't want to load this
> >> > untrusted code, which may be compiled for the wrong version of the kernel
> >> > anyway. Instead, we could parse the module image, figure out which module
> >> > the container is trying to load and load it on the host.
> >> >
> >> > As another example, containers cannot mknod(), since this checks
> >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> >> > coding some whitelist in the kernel. Another example is mount(), which has
> >> > many security restrictions for good reason, but configuration or runtime
> >> > knowledge could potentially be used to relax these restrictions.
> >>
> >> Related to the eBPF seccomp thread, can the logic for these things be
> >> handled entirely by eBPF? My assumption is that you still need to stop
> >> the process to do something (i.e. do a mknod, or a mount) before
> >> letting it continue. Is there some "wait for notification" system in
> >> eBPF?
> >
> > I replied in the other thread
> > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> > following along at home), but no, at least not that I know of.
> 
> eBPF can call functions.  One of those functions could put the caller
> to sleep.  In fact, I think I once proposed doing this for the seccomp
> logging action as well.
> 
> >> I wonder if this communication should be netlink, which gives a more
> >> well-structured way to describe what's on the wire? The reason I ask
> >> is because if we ever change the seccomp_data structure, we'll now
> >> have two places where we need to deal with it (the first being within
> >> the BPF itself). My initial idea was to prefix the communication with
> >> a size field, then send the structure, and then I had nightmares, and
> >> realized this was basically netlink reinvented.
> >
> > I suggested netlink in LA, and everyone (especially Andy) groaned very
> > loudly :). I'm happy to switch it to netlink if you like, although i
> > think memcpy() of structs should be safe here, since the return value
> > from read or write can indicate the size of things.
> 
> I could easily get on board with "netlink" (i.e. NLA) messages sent
> over an fd.  I will object strongly to the use of netlink *sockets*.

I think sending netlink messages makes perfect sense here although we
burden userspace with all those nice macros to parse these messages.
Are there already other cases where userspace gets netlink messages on
fds without having opened a netlink socket.

> 
> >
> >> An ERRNO filter would block a USER_NOTIF because it's unconditional.
> >> TRACE could be either, USER_NOTIF could be either.
> >>
> >> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
> >
> > Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
> > seemed more important than USER_NOTIF, but TRACE didn't. I don't have
> > a strong opinion about what to do here, because users can adjust their
> > filters accordingly. Let me know what you prefer.
> 
> If we switched to eBPF functions, this whole issue goes away.
Kees Cook Feb. 27, 2018, 12:49 a.m.
On Wed, Feb 14, 2018 at 9:19 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
>> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
>>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>>> I wonder if this communication should be netlink, which gives a more
>>> well-structured way to describe what's on the wire? The reason I ask
>>> is because if we ever change the seccomp_data structure, we'll now
>>> have two places where we need to deal with it (the first being within
>>> the BPF itself). My initial idea was to prefix the communication with
>>> a size field, then send the structure, and then I had nightmares, and
>>> realized this was basically netlink reinvented.
>>
>> I suggested netlink in LA, and everyone (especially Andy) groaned very
>> loudly :). I'm happy to switch it to netlink if you like, although i
>> think memcpy() of structs should be safe here, since the return value
>> from read or write can indicate the size of things.
>
> I could easily get on board with "netlink" (i.e. NLA) messages sent
> over an fd.  I will object strongly to the use of netlink *sockets*.

Yeah, I was thinking NLA over the fd; not a netlink socket.

>>> An ERRNO filter would block a USER_NOTIF because it's unconditional.
>>> TRACE could be either, USER_NOTIF could be either.
>>>
>>> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
>>
>> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
>> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
>> a strong opinion about what to do here, because users can adjust their
>> filters accordingly. Let me know what you prefer.
>
> If we switched to eBPF functions, this whole issue goes away.

Yeah, though we'd still need some kind of "wait for answer" eBPF
function. It feels wrong to re-use maps for that...

-Kees
Andy Lutomirski Feb. 27, 2018, 3:27 a.m.
> On Feb 26, 2018, at 4:49 PM, Kees Cook <keescook@chromium.org> wrote:
> 
>> On Wed, Feb 14, 2018 at 9:19 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
>>>> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
>>>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>>>> I wonder if this communication should be netlink, which gives a more
>>>> well-structured way to describe what's on the wire? The reason I ask
>>>> is because if we ever change the seccomp_data structure, we'll now
>>>> have two places where we need to deal with it (the first being within
>>>> the BPF itself). My initial idea was to prefix the communication with
>>>> a size field, then send the structure, and then I had nightmares, and
>>>> realized this was basically netlink reinvented.
>>> 
>>> I suggested netlink in LA, and everyone (especially Andy) groaned very
>>> loudly :). I'm happy to switch it to netlink if you like, although i
>>> think memcpy() of structs should be safe here, since the return value
>>> from read or write can indicate the size of things.
>> 
>> I could easily get on board with "netlink" (i.e. NLA) messages sent
>> over an fd.  I will object strongly to the use of netlink *sockets*.
> 
> Yeah, I was thinking NLA over the fd; not a netlink socket.
> 
>>>> An ERRNO filter would block a USER_NOTIF because it's unconditional.
>>>> TRACE could be either, USER_NOTIF could be either.
>>>> 
>>>> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
>>> 
>>> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
>>> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
>>> a strong opinion about what to do here, because users can adjust their
>>> filters accordingly. Let me know what you prefer.
>> 
>> If we switched to eBPF functions, this whole issue goes away.
> 
> Yeah, though we'd still need some kind of "wait for answer" eBPF
> function. It feels wrong to re-use maps for that...
> 

BPF_CALL.

Alexei, can we make it so that each bpf program type can easily limit which BPF_CALL helpers can be use and allow bpf program types to add their own helpers?c