[v2,1/4] seccomp: add a return code to trap to userspace

Submitted by Tycho Andersen on May 17, 2018, 3:12 p.m.

Details

Message ID 20180517151218.12850-2-tycho@tycho.ws
State New
Series "seccomp trap to userspace"
Headers show

Commit Message

Tycho Andersen May 17, 2018, 3:12 p.m.
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

v2: * make id a u64; the idea here being that it will never overflow,
      because 64 is huge (one syscall every nanosecond => wrap every 584
      years)
    * prevent nesting of user notifications: if someone is already attached
      the tree in one place, nobody else can attach to the tree
    * notify the listener of signals the tracee receives as well
    * implement poll

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/seccomp.c                              | 402 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 181 +++++++-
 5 files changed, 605 insertions(+), 6 deletions(-)

Patch hide | download patch | download mbox

diff --git a/arch/Kconfig b/arch/Kconfig
index 8e0d665c8d53..dd99eef3e049 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -401,6 +401,13 @@  config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c723a5c4e3ff..0fd3e0676a1c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@ 
 #include <uapi/linux/seccomp.h>
 
 #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG)
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_GET_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..8160e6cad528 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@ 
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	1
-#define SECCOMP_FILTER_FLAG_LOG		2
+#define SECCOMP_FILTER_FLAG_TSYNC		1
+#define SECCOMP_FILTER_FLAG_LOG			2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@ 
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@  struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u64 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dc77548167ef..a169a62cb78a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -38,6 +38,53 @@ 
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The pid whose filter triggered the notification */
+	pid_t pid;
+
+	/*
+	 * The "cookie" for this request; this is unique for this filter.
+	 */
+	u32 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to READ.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -64,6 +111,27 @@  struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/* A lock for all notification-related accesses. */
+	struct mutex notify_lock;
+
+	/* Is there currently an attached listener? */
+	bool has_listener;
+
+	/* The id of the next request. */
+	u64 next_id;
+
+	/* A list of struct seccomp_knotif elements. */
+	struct list_head notifications;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +451,13 @@  static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+	sfilter->next_id = get_random_u64();
+#endif
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -547,13 +622,15 @@  static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -572,6 +649,9 @@  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -645,6 +725,91 @@  void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	u64 ret = filter->next_id;
+
+	/* Note: overflow is ok here, the id just needs to be unique */
+	filter->next_id++;
+
+	return ret;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	if (!match->has_listener) {
+		err = -ENOSYS;
+		goto out;
+	}
+
+	n.pid = current->pid;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
+	     "notified about write complete when state is not write");
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 u32 action,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	WARN(1, "user notification received, but disabled");
+	seccomp_log(this_syscall, SIGSYS, action, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -722,6 +887,9 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -828,6 +996,11 @@  static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+#endif
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -847,6 +1020,8 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1032,28 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		listener = get_unused_fd_flags(O_RDWR);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -881,6 +1071,16 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -909,6 +1109,9 @@  static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1105,6 +1308,7 @@  long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1114,6 +1318,7 @@  static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1131,6 +1336,7 @@  static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1279,3 +1485,195 @@  static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = f->private_data;
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif;
+	ssize_t ret;
+
+	/* No offset reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
+	 * for now.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_SENT;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	ssize_t ret = -EINVAL;
+
+	/* No partial writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur;
+	bool have_listener = false;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock(&cur->notify_lock);
+		if (cur->has_listener)
+			have_listener = true;
+	}
+
+	if (have_listener)
+		goto out;
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+	filter->has_listener = true;
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev)
+		mutex_unlock(&cur->notify_lock);
+
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 168c66d74fc5..bb96df66222f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,10 +40,12 @@ 
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
 
 #define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -150,6 +152,24 @@  struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u64 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2072,7 +2092,8 @@  TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2917,6 +2938,164 @@  TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		ret = read(listener, req, sizeof(*req));
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_GET_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	EXPECT_EQ(ret, sizeof(req));
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks

Comments

Oleg Nesterov May 17, 2018, 3:33 p.m.
I didn't read this series yet, and I don't even understand what are you
trying to do, just one question...

On 05/17, Tycho Andersen wrote:
>
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur;
> +	bool have_listener = false;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock(&cur->notify_lock);

Did you test this patch with CONFIG_LOCKDEP ?

From lockdep pov this loop tries to take the same lock twice or more, it shoul
complain.

Oleg.
Tycho Andersen May 17, 2018, 3:39 p.m.
Hi Oleg,

Thanks for taking a look!

On Thu, May 17, 2018 at 05:33:24PM +0200, Oleg Nesterov wrote:
> I didn't read this series yet, and I don't even understand what are you
> trying to do, just one question...
> 
> On 05/17, Tycho Andersen wrote:
> >
> > +static struct file *init_listener(struct task_struct *task,
> > +				  struct seccomp_filter *filter)
> > +{
> > +	struct file *ret = ERR_PTR(-EBUSY);
> > +	struct seccomp_filter *cur;
> > +	bool have_listener = false;
> > +
> > +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +		mutex_lock(&cur->notify_lock);
> 
> Did you test this patch with CONFIG_LOCKDEP ?

Yes, with,

CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

> From lockdep pov this loop tries to take the same lock twice or more, it shoul
> complain.

I didn't, but I guess that's because it's not trying to take the same lock
twice -- the pointer cur is changing in the loop. Unless I'm misunderstanding
what you're saying.

The idea behind this code is to lock the entire chain of filters up to the
parent so that we can ensure none of them have a listener installed. This is
based on a suggestion from Andy last review cycle to not allow two listeners,
since nesting would be confusing.

Tycho
Oleg Nesterov May 17, 2018, 3:46 p.m.
On 05/17, Tycho Andersen wrote:
>
> > From lockdep pov this loop tries to take the same lock twice or more, it shoul
> > complain.
>
> I didn't, but I guess that's because it's not trying to take the same lock
> twice -- the pointer cur is changing in the loop.

Yes, I see. But this is the same lock for lockdep, it has the same class.

Oleg.
Christian Brauner May 18, 2018, 2:04 p.m.
On Thu, May 17, 2018 at 09:12:15AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
> 
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
> 
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
> 
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
> 
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years)
>     * prevent nesting of user notifications: if someone is already attached
>       the tree in one place, nobody else can attach to the tree
>     * notify the listener of signals the tracee receives as well
>     * implement poll
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 402 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 181 +++++++-
>  5 files changed, 605 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8e0d665c8d53..dd99eef3e049 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/prctl/seccomp_filter.txt for details.
>  
> +config SECCOMP_USER_NOTIFICATION
> +	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +	depends on SECCOMP_FILTER
> +	help
> +	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +	  programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>  	bool
>  	help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>  
>  #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
> -					 SECCOMP_FILTER_FLAG_LOG)
> +					 SECCOMP_FILTER_FLAG_LOG | \
> +					 SECCOMP_FILTER_FLAG_GET_LISTENER)
>  
>  #ifdef CONFIG_SECCOMP
>  
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC	1
> -#define SECCOMP_FILTER_FLAG_LOG		2
> +#define SECCOMP_FILTER_FLAG_TSYNC		1
> +#define SECCOMP_FILTER_FLAG_LOG			2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>  	__u64 args[6];
>  };
>  
> +struct seccomp_notif {
> +	__u64 id;
> +	pid_t pid;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..a169a62cb78a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -38,6 +38,53 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +	SECCOMP_NOTIFY_INIT,
> +	SECCOMP_NOTIFY_SENT,
> +	SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +	/* The pid whose filter triggered the notification */
> +	pid_t pid;
> +
> +	/*
> +	 * The "cookie" for this request; this is unique for this filter.
> +	 */
> +	u32 id;
> +
> +	/*
> +	 * The seccomp data. This pointer is valid the entire time this
> +	 * notification is active, since it comes from __seccomp_filter which
> +	 * eclipses the entire lifecycle here.
> +	 */
> +	const struct seccomp_data *data;
> +
> +	/*
> +	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +	 * struct seccomp_knotif is created and starts out in INIT. Once the
> +	 * handler reads the notification off of an FD, it transitions to READ.
> +	 * If a signal is received the state transitions back to INIT and
> +	 * another message is sent. When the userspace handler replies, state
> +	 * transitions to REPLIED.
> +	 */
> +	enum notify_state state;
> +
> +	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +	int error;
> +	long val;
> +
> +	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +	struct completion ready;
> +
> +	struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +111,27 @@ struct seccomp_filter {
>  	bool log;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	/*
> +	 * A semaphore that users of this notification can wait on for
> +	 * changes. Actual reads and writes are still controlled with
> +	 * filter->notify_lock.
> +	 */
> +	struct semaphore request;
> +
> +	/* A lock for all notification-related accesses. */
> +	struct mutex notify_lock;
> +
> +	/* Is there currently an attached listener? */
> +	bool has_listener;
> +
> +	/* The id of the next request. */
> +	u64 next_id;
> +
> +	/* A list of struct seccomp_knotif elements. */
> +	struct list_head notifications;
> +#endif
>  };
>  
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  	if (!sfilter)
>  		return ERR_PTR(-ENOMEM);
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	mutex_init(&sfilter->notify_lock);
> +	sema_init(&sfilter->request, 0);
> +	INIT_LIST_HEAD(&sfilter->notifications);
> +	sfilter->next_id = get_random_u64();
> +#endif
> +
>  	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>  					seccomp_check_filter, save_orig);
>  	if (ret < 0) {
> @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE		(1 << 4)
>  #define SECCOMP_LOG_LOG			(1 << 5)
>  #define SECCOMP_LOG_ALLOW		(1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
>  
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>  				    SECCOMP_LOG_KILL_THREAD  |
>  				    SECCOMP_LOG_TRAP  |
>  				    SECCOMP_LOG_ERRNO |
>  				    SECCOMP_LOG_TRACE |
> -				    SECCOMP_LOG_LOG;
> +				    SECCOMP_LOG_LOG |
> +				    SECCOMP_LOG_USER_NOTIF;
>  
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  			       bool requested)
> @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  	case SECCOMP_RET_TRACE:
>  		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +		break;
>  	case SECCOMP_RET_LOG:
>  		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>  		break;
> @@ -645,6 +725,91 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +	u64 ret = filter->next_id;
> +
> +	/* Note: overflow is ok here, the id just needs to be unique */
> +	filter->next_id++;
> +
> +	return ret;
> +}

Nit: Depending on how averse people are to relying on side-effects this
could be simplified to:

static inline u64 seccomp_next_notify_id(struct seccomp_filter *filter)
{
        /* Note: Overflow is ok. The id just needs to be unique. */
        return filter->next_id++;
}

> +
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	int err;
> +	long ret = 0;
> +	struct seccomp_knotif n = {};
> +
> +	mutex_lock(&match->notify_lock);
> +	if (!match->has_listener) {
> +		err = -ENOSYS;
> +		goto out;
> +	}

Nit:

err = -ENOSYS;
mutex_lock(&match->notify_lock);
if (!match->has_listener)
        goto out;

looks cleaner to me or you do the err initalization at the top of the
function. :)

> +
> +	n.pid = current->pid;
> +	n.state = SECCOMP_NOTIFY_INIT;
> +	n.data = sd;
> +	n.id = seccomp_next_notify_id(match);
> +	init_completion(&n.ready);
> +
> +	list_add(&n.list, &match->notifications);
> +
> +	mutex_unlock(&match->notify_lock);
> +	up(&match->request);
> +
> +	err = wait_for_completion_interruptible(&n.ready);
> +	mutex_lock(&match->notify_lock);
> +
> +	/*
> +	 * Here it's possible we got a signal and then had to wait on the mutex
> +	 * while the reply was sent, so let's be sure there wasn't a response
> +	 * in the meantime.
> +	 */
> +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +		/*
> +		 * We got a signal. Let's tell userspace about it (potentially
> +		 * again, if we had already notified them about the first one).
> +		 */
> +		if (n.state == SECCOMP_NOTIFY_SENT) {
> +			n.state = SECCOMP_NOTIFY_INIT;
> +			up(&match->request);
> +		}
> +		mutex_unlock(&match->notify_lock);
> +		err = wait_for_completion_killable(&n.ready);
> +		mutex_lock(&match->notify_lock);
> +		if (err < 0)
> +			goto remove_list;
> +	}
> +
> +	ret = n.val;
> +	err = n.error;
> +
> +	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
> +	     "notified about write complete when state is not write");

Nit: That message seems a little cryptic.

> +
> +remove_list:
> +	list_del(&n.list);
> +out:
> +	mutex_unlock(&match->notify_lock);
> +	syscall_set_return_value(current, task_pt_regs(current),
> +				 err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +					 u32 action,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	WARN(1, "user notification received, but disabled");

Nit: "received unexpected user notification" might be clearer

> +	seccomp_log(this_syscall, SIGSYS, action, true);
> +	do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			    const bool recheck_after_trace)
> @@ -722,6 +887,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  
>  		return 0;
>  
> +	case SECCOMP_RET_USER_NOTIF:
> +		seccomp_do_user_notification(this_syscall, match, sd);
> +		goto skip;
>  	case SECCOMP_RET_LOG:
>  		seccomp_log(this_syscall, 0, action, true);
>  		return 0;
> @@ -828,6 +996,11 @@ static long seccomp_set_mode_strict(void)
>  }
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static struct file *init_listener(struct task_struct *,
> +				  struct seccomp_filter *);
> +#endif
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1020,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>  	struct seccomp_filter *prepared = NULL;
>  	long ret = -EINVAL;
> +	int listener = 0;
> +	struct file *listener_f = NULL;
>  
>  	/* Validate flags. */
>  	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1032,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	if (IS_ERR(prepared))
>  		return PTR_ERR(prepared);
>  
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		listener = get_unused_fd_flags(O_RDWR);
> +		if (listener < 0) {
> +			ret = listener;
> +			goto out_free;
> +		}
> +
> +		listener_f = init_listener(current, prepared);
> +		if (IS_ERR(listener_f)) {
> +			put_unused_fd(listener);
> +			ret = PTR_ERR(listener_f);
> +			goto out_free;
> +		}
> +	}
> +
>  	/*
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_free;
> +		goto out_put_fd;
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> @@ -881,6 +1071,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	spin_unlock_irq(&current->sighand->siglock);
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>  		mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		if (ret < 0) {
> +			fput(listener_f);
> +			put_unused_fd(listener);
> +		} else {
> +			fd_install(listener, listener_f);
> +			ret = listener;
> +		}
> +	}
>  out_free:
>  	seccomp_filter_free(prepared);
>  	return ret;
> @@ -909,6 +1109,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>  	case SECCOMP_RET_LOG:
>  	case SECCOMP_RET_ALLOW:
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +			break;
>  	default:
>  		return -EOPNOTSUPP;
>  	}
> @@ -1105,6 +1308,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
>  #define SECCOMP_RET_TRAP_NAME		"trap"
>  #define SECCOMP_RET_ERRNO_NAME		"errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
>  #define SECCOMP_RET_TRACE_NAME		"trace"
>  #define SECCOMP_RET_LOG_NAME		"log"
>  #define SECCOMP_RET_ALLOW_NAME		"allow"
> @@ -1114,6 +1318,7 @@ static const char seccomp_actions_avail[] =
>  				SECCOMP_RET_KILL_THREAD_NAME	" "
>  				SECCOMP_RET_TRAP_NAME		" "
>  				SECCOMP_RET_ERRNO_NAME		" "
> +				SECCOMP_RET_USER_NOTIF_NAME     " "
>  				SECCOMP_RET_TRACE_NAME		" "
>  				SECCOMP_RET_LOG_NAME		" "
>  				SECCOMP_RET_ALLOW_NAME;
> @@ -1131,6 +1336,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>  	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>  	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>  	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>  	{ }
>  };
>  
> @@ -1279,3 +1485,195 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>  
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_knotif *knotif;
> +
> +	mutex_lock(&filter->notify_lock);
> +
> +	/*
> +	 * If this file is being closed because e.g. the task who owned it
> +	 * died, let's wake everyone up who was waiting on us.
> +	 */
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +			continue;
> +
> +		knotif->state = SECCOMP_NOTIFY_REPLIED;
> +		knotif->error = -ENOSYS;
> +		knotif->val = 0;
> +
> +		complete(&knotif->ready);
> +	}
> +
> +	filter->has_listener = false;
> +	mutex_unlock(&filter->notify_lock);
> +	__put_seccomp_filter(filter);
> +	return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +				   size_t size, loff_t *ppos)
> +{
> +	struct seccomp_filter *filter = f->private_data;
> +	struct seccomp_knotif *knotif = NULL, *cur;
> +	struct seccomp_notif unotif;
> +	ssize_t ret;
> +
> +	/* No offset reads. */
> +	if (*ppos != 0)
> +		return -EINVAL;
> +
> +	ret = down_interruptible(&filter->request);
> +	if (ret < 0)
> +		return ret;
> +
> +	mutex_lock(&filter->notify_lock);
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT) {
> +			knotif = cur;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * If we didn't find a notification, it could be that the task was
> +	 * interrupted between the time we were woken and when we were able to
> +	 * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
> +	 * for now.
> +	 */
> +	if (!knotif) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	unotif.id = knotif->id;
> +	unotif.pid = knotif->pid;
> +	unotif.data = *(knotif->data);
> +
> +	size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +	if (copy_to_user(buf, &unotif, size)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = sizeof(unotif);
> +	knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> +				    size_t size, loff_t *ppos)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_notif_resp resp = {};
> +	struct seccomp_knotif *knotif = NULL;
> +	ssize_t ret = -EINVAL;
> +
> +	/* No partial writes. */
> +	if (*ppos != 0)
> +		return -EINVAL;
> +
> +	size = min_t(size_t, size, sizeof(resp));
> +	if (copy_from_user(&resp, buf, size))
> +		return -EFAULT;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == resp.id)
> +			break;
> +	}
> +
> +	if (!knotif || knotif->id != resp.id) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Allow exactly one reply. */
> +	if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = size;
> +	knotif->state = SECCOMP_NOTIFY_REPLIED;
> +	knotif->error = resp.error;
> +	knotif->val = resp.val;
> +	complete(&knotif->ready);
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +				    struct poll_table_struct *poll_tab)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	__poll_t ret = 0;
> +	struct seccomp_knotif *cur;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT)
> +			ret |= EPOLLIN | EPOLLRDNORM;
> +		if (cur->state == SECCOMP_NOTIFY_SENT)
> +			ret |= EPOLLOUT | EPOLLWRNORM;
> +	}
> +
> +	mutex_unlock(&filter->notify_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +	.read = seccomp_notify_read,
> +	.write = seccomp_notify_write,
> +	.poll = seccomp_notify_poll,
> +	.release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur;
> +	bool have_listener = false;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock(&cur->notify_lock);
> +		if (cur->has_listener)
> +			have_listener = true;
> +	}
> +
> +	if (have_listener)
> +		goto out;
> +
> +	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +				 filter, O_RDWR);
> +	if (IS_ERR(ret))
> +		goto out;
> +
> +
> +	/* The file has a reference to it now */
> +	__get_seccomp_filter(filter);
> +	filter->has_listener = true;
> +
> +out:
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev)
> +		mutex_unlock(&cur->notify_lock);
> +
> +	return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 168c66d74fc5..bb96df66222f 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,10 +40,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>  
>  #define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>  
>  #include "../kselftest_harness.h"
>  
> @@ -150,6 +152,24 @@ struct seccomp_metadata {
>  };
>  #endif
>  
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +	__u64 id;
> +	pid_t pid;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>  	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -				 SECCOMP_FILTER_FLAG_LOG };
> +				 SECCOMP_FILTER_FLAG_LOG,
> +				 SECCOMP_FILTER_FLAG_GET_LISTENER };
>  	unsigned int flag, all_flags;
>  	int i;
>  	long ret;
> @@ -2917,6 +2938,164 @@ TEST(get_metadata)
>  	ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>  
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +	struct sock_filter filter[] = {
> +		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +			offsetof(struct seccomp_data, nr)),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +	};
> +
> +	struct sock_fprog prog = {
> +		.len = (unsigned short)ARRAY_SIZE(filter),
> +		.filter = filter,
> +	};
> +
> +	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +	int ret;
> +
> +	do {
> +		errno = 0;
> +		ret = read(listener, req, sizeof(*req));
> +	} while (ret == -1 && errno == ENOENT);
> +	return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +	pid_t pid;
> +	long ret;
> +	int status, listener;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +	struct pollfd pollfd;
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	/* Check that we get -ENOSYS with no listener attached */
> +	if (pid == 0) {
> +		if (user_trap_syscall(__NR_getpid, 0) < 0)
> +			exit(1);
> +		ret = syscall(__NR_getpid);
> +		exit(ret >= 0 || errno != ENOSYS);
> +	}
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/* Check that the basic notification machinery works */
> +	listener = user_trap_syscall(__NR_getpid,
> +				     SECCOMP_FILTER_FLAG_GET_LISTENER);
> +	EXPECT_GE(listener, 0);
> +
> +	/* Installing a second listener in the chain should EBUSY */
> +	EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +				    SECCOMP_FILTER_FLAG_GET_LISTENER),
> +		  -1);
> +	EXPECT_EQ(errno, EBUSY);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +	pollfd.fd = listener;
> +	pollfd.events = POLLIN | POLLOUT;
> +
> +	EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +	EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +	EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +	resp.id = req.id;
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +
> +	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/*
> +	 * Check that nothing bad happens when we kill the task in the middle
> +	 * of a syscall.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	ret = read(listener, &req, sizeof(req));
> +	EXPECT_EQ(ret, sizeof(req));
> +
> +	EXPECT_EQ(kill(pid, SIGKILL), 0);
> +	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +	resp.id = req.id;
> +	ret = write(listener, &resp, sizeof(resp));
> +	EXPECT_EQ(ret, -1);
> +	EXPECT_EQ(errno, EINVAL);
> +
> +	/*
> +	 * Check that we get another notification about a signal in the middle
> +	 * of a syscall.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +			perror("signal");
> +			exit(1);
> +		}
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	ret = read_notif(listener, &req);
> +	EXPECT_EQ(ret, sizeof(req));
> +	EXPECT_EQ(errno, 0);
> +
> +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +	ret = read_notif(listener, &req);
> +	EXPECT_EQ(ret, sizeof(req));
> +	EXPECT_EQ(errno, 0);
> +
> +	resp.id = req.id;
> +	ret = write(listener, &resp, sizeof(resp));
> +	EXPECT_EQ(ret, sizeof(resp));
> +	EXPECT_EQ(errno, 0);
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	close(listener);
> +}
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> -- 
> 2.17.0
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Tycho Andersen May 18, 2018, 3:21 p.m.
On Fri, May 18, 2018 at 04:04:16PM +0200, Christian Brauner wrote:
> On Thu, May 17, 2018 at 09:12:15AM -0600, Tycho Andersen wrote:
> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > +{
> > +	u64 ret = filter->next_id;
> > +
> > +	/* Note: overflow is ok here, the id just needs to be unique */
> > +	filter->next_id++;
> > +
> > +	return ret;
> > +}
> 
> Nit: Depending on how averse people are to relying on side-effects this
> could be simplified to:
> 
> static inline u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> {
>         /* Note: Overflow is ok. The id just needs to be unique. */
>         return filter->next_id++;
> }

Oh, yes, definitely. I think this is leftover from when this function
worked a different way.

> > +
> > +static void seccomp_do_user_notification(int this_syscall,
> > +					 struct seccomp_filter *match,
> > +					 const struct seccomp_data *sd)
> > +{
> > +	int err;
> > +	long ret = 0;
> > +	struct seccomp_knotif n = {};
> > +
> > +	mutex_lock(&match->notify_lock);
> > +	if (!match->has_listener) {
> > +		err = -ENOSYS;
> > +		goto out;
> > +	}
> 
> Nit:
> 
> err = -ENOSYS;
> mutex_lock(&match->notify_lock);
> if (!match->has_listener)
>         goto out;
> 
> looks cleaner to me or you do the err initalization at the top of the
> function. :)

Ok :)

> > +
> > +	n.pid = current->pid;
> > +	n.state = SECCOMP_NOTIFY_INIT;
> > +	n.data = sd;
> > +	n.id = seccomp_next_notify_id(match);
> > +	init_completion(&n.ready);
> > +
> > +	list_add(&n.list, &match->notifications);
> > +
> > +	mutex_unlock(&match->notify_lock);
> > +	up(&match->request);
> > +
> > +	err = wait_for_completion_interruptible(&n.ready);
> > +	mutex_lock(&match->notify_lock);
> > +
> > +	/*
> > +	 * Here it's possible we got a signal and then had to wait on the mutex
> > +	 * while the reply was sent, so let's be sure there wasn't a response
> > +	 * in the meantime.
> > +	 */
> > +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +		/*
> > +		 * We got a signal. Let's tell userspace about it (potentially
> > +		 * again, if we had already notified them about the first one).
> > +		 */
> > +		if (n.state == SECCOMP_NOTIFY_SENT) {
> > +			n.state = SECCOMP_NOTIFY_INIT;
> > +			up(&match->request);
> > +		}
> > +		mutex_unlock(&match->notify_lock);
> > +		err = wait_for_completion_killable(&n.ready);
> > +		mutex_lock(&match->notify_lock);
> > +		if (err < 0)
> > +			goto remove_list;
> > +	}
> > +
> > +	ret = n.val;
> > +	err = n.error;
> > +
> > +	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
> > +	     "notified about write complete when state is not write");
> 
> Nit: That message seems a little cryptic.

Perhaps we can just drop it. It's just a sanity check, but given the
tests above, it doesn't seem likely.

> > +
> > +remove_list:
> > +	list_del(&n.list);
> > +out:
> > +	mutex_unlock(&match->notify_lock);
> > +	syscall_set_return_value(current, task_pt_regs(current),
> > +				 err, ret);
> > +}
> > +#else
> > +static void seccomp_do_user_notification(int this_syscall,
> > +					 u32 action,
> > +					 struct seccomp_filter *match,
> > +					 const struct seccomp_data *sd)
> > +{
> > +	WARN(1, "user notification received, but disabled");
> 
> Nit: "received unexpected user notification" might be clearer

Yes, I wonder if we shouldn't just drop this too -- it's not a kernel
bug, but a userspace bug that they're using features that aren't
enabled.

We could enhance the verifier with a static check for
BPF_RET | BPF_K == SECCOMPO_RET_USER_NOTIF and reject such programs if
user notification isn't enabled. Of course, it wouldn't handle the
dynamic case, but it might be useful.

Tycho
kbuild test robot May 19, 2018, 12:14 a.m.
Hi Tycho,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc5]
[cannot apply to next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180519-071527
config: x86_64-randconfig-x010-201819 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   kernel/seccomp.c: In function '__seccomp_filter':
>> kernel/seccomp.c:891:46: warning: passing argument 2 of 'seccomp_do_user_notification' makes integer from pointer without a cast [-Wint-conversion]
      seccomp_do_user_notification(this_syscall, match, sd);
                                                 ^~~~~
   kernel/seccomp.c:802:13: note: expected 'u32 {aka unsigned int}' but argument is of type 'struct seccomp_filter *'
    static void seccomp_do_user_notification(int this_syscall,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/seccomp.c:891:53: error: passing argument 3 of 'seccomp_do_user_notification' from incompatible pointer type [-Werror=incompatible-pointer-types]
      seccomp_do_user_notification(this_syscall, match, sd);
                                                        ^~
   kernel/seccomp.c:802:13: note: expected 'struct seccomp_filter *' but argument is of type 'const struct seccomp_data *'
    static void seccomp_do_user_notification(int this_syscall,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/seccomp.c:891:3: error: too few arguments to function 'seccomp_do_user_notification'
      seccomp_do_user_notification(this_syscall, match, sd);
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/seccomp.c:802:13: note: declared here
    static void seccomp_do_user_notification(int this_syscall,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/seccomp.c: In function 'seccomp_set_mode_filter':
>> kernel/seccomp.c:1036:14: error: implicit declaration of function 'get_unused_fd_flags'; did you mean 'getname_flags'? [-Werror=implicit-function-declaration]
      listener = get_unused_fd_flags(O_RDWR);
                 ^~~~~~~~~~~~~~~~~~~
                 getname_flags
>> kernel/seccomp.c:1042:16: error: implicit declaration of function 'init_listener'; did you mean 'init_llist_head'? [-Werror=implicit-function-declaration]
      listener_f = init_listener(current, prepared);
                   ^~~~~~~~~~~~~
                   init_llist_head
>> kernel/seccomp.c:1042:14: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      listener_f = init_listener(current, prepared);
                 ^
>> kernel/seccomp.c:1044:4: error: implicit declaration of function 'put_unused_fd'; did you mean 'put_user_ns'? [-Werror=implicit-function-declaration]
       put_unused_fd(listener);
       ^~~~~~~~~~~~~
       put_user_ns
>> kernel/seccomp.c:1077:4: error: implicit declaration of function 'fput'; did you mean 'iput'? [-Werror=implicit-function-declaration]
       fput(listener_f);
       ^~~~
       iput
>> kernel/seccomp.c:1080:4: error: implicit declaration of function 'fd_install'; did you mean 'fs_initcall'? [-Werror=implicit-function-declaration]
       fd_install(listener, listener_f);
       ^~~~~~~~~~
       fs_initcall
   cc1: some warnings being treated as errors

vim +/seccomp_do_user_notification +891 kernel/seccomp.c

   738	
   739	static void seccomp_do_user_notification(int this_syscall,
   740						 struct seccomp_filter *match,
   741						 const struct seccomp_data *sd)
   742	{
   743		int err;
   744		long ret = 0;
   745		struct seccomp_knotif n = {};
   746	
   747		mutex_lock(&match->notify_lock);
   748		if (!match->has_listener) {
   749			err = -ENOSYS;
   750			goto out;
   751		}
   752	
   753		n.pid = current->pid;
   754		n.state = SECCOMP_NOTIFY_INIT;
   755		n.data = sd;
   756		n.id = seccomp_next_notify_id(match);
   757		init_completion(&n.ready);
   758	
   759		list_add(&n.list, &match->notifications);
   760	
   761		mutex_unlock(&match->notify_lock);
   762		up(&match->request);
   763	
   764		err = wait_for_completion_interruptible(&n.ready);
   765		mutex_lock(&match->notify_lock);
   766	
   767		/*
   768		 * Here it's possible we got a signal and then had to wait on the mutex
   769		 * while the reply was sent, so let's be sure there wasn't a response
   770		 * in the meantime.
   771		 */
   772		if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
   773			/*
   774			 * We got a signal. Let's tell userspace about it (potentially
   775			 * again, if we had already notified them about the first one).
   776			 */
   777			if (n.state == SECCOMP_NOTIFY_SENT) {
   778				n.state = SECCOMP_NOTIFY_INIT;
   779				up(&match->request);
   780			}
   781			mutex_unlock(&match->notify_lock);
   782			err = wait_for_completion_killable(&n.ready);
   783			mutex_lock(&match->notify_lock);
   784			if (err < 0)
   785				goto remove_list;
   786		}
   787	
   788		ret = n.val;
   789		err = n.error;
   790	
   791		WARN(n.state != SECCOMP_NOTIFY_REPLIED,
   792		     "notified about write complete when state is not write");
   793	
   794	remove_list:
   795		list_del(&n.list);
   796	out:
   797		mutex_unlock(&match->notify_lock);
   798		syscall_set_return_value(current, task_pt_regs(current),
   799					 err, ret);
   800	}
   801	#else
 > 802	static void seccomp_do_user_notification(int this_syscall,
   803						 u32 action,
   804						 struct seccomp_filter *match,
   805						 const struct seccomp_data *sd)
   806	{
   807		WARN(1, "user notification received, but disabled");
   808		seccomp_log(this_syscall, SIGSYS, action, true);
   809		do_exit(SIGSYS);
   810	}
   811	#endif
   812	
   813	#ifdef CONFIG_SECCOMP_FILTER
   814	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   815				    const bool recheck_after_trace)
   816	{
   817		u32 filter_ret, action;
   818		struct seccomp_filter *match = NULL;
   819		int data;
   820	
   821		/*
   822		 * Make sure that any changes to mode from another thread have
   823		 * been seen after TIF_SECCOMP was seen.
   824		 */
   825		rmb();
   826	
   827		filter_ret = seccomp_run_filters(sd, &match);
   828		data = filter_ret & SECCOMP_RET_DATA;
   829		action = filter_ret & SECCOMP_RET_ACTION_FULL;
   830	
   831		switch (action) {
   832		case SECCOMP_RET_ERRNO:
   833			/* Set low-order bits as an errno, capped at MAX_ERRNO. */
   834			if (data > MAX_ERRNO)
   835				data = MAX_ERRNO;
   836			syscall_set_return_value(current, task_pt_regs(current),
   837						 -data, 0);
   838			goto skip;
   839	
   840		case SECCOMP_RET_TRAP:
   841			/* Show the handler the original registers. */
   842			syscall_rollback(current, task_pt_regs(current));
   843			/* Let the filter pass back 16 bits of data. */
   844			seccomp_send_sigsys(this_syscall, data);
   845			goto skip;
   846	
   847		case SECCOMP_RET_TRACE:
   848			/* We've been put in this state by the ptracer already. */
   849			if (recheck_after_trace)
   850				return 0;
   851	
   852			/* ENOSYS these calls if there is no tracer attached. */
   853			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
   854				syscall_set_return_value(current,
   855							 task_pt_regs(current),
   856							 -ENOSYS, 0);
   857				goto skip;
   858			}
   859	
   860			/* Allow the BPF to provide the event message */
   861			ptrace_event(PTRACE_EVENT_SECCOMP, data);
   862			/*
   863			 * The delivery of a fatal signal during event
   864			 * notification may silently skip tracer notification,
   865			 * which could leave us with a potentially unmodified
   866			 * syscall that the tracer would have liked to have
   867			 * changed. Since the process is about to die, we just
   868			 * force the syscall to be skipped and let the signal
   869			 * kill the process and correctly handle any tracer exit
   870			 * notifications.
   871			 */
   872			if (fatal_signal_pending(current))
   873				goto skip;
   874			/* Check if the tracer forced the syscall to be skipped. */
   875			this_syscall = syscall_get_nr(current, task_pt_regs(current));
   876			if (this_syscall < 0)
   877				goto skip;
   878	
   879			/*
   880			 * Recheck the syscall, since it may have changed. This
   881			 * intentionally uses a NULL struct seccomp_data to force
   882			 * a reload of all registers. This does not goto skip since
   883			 * a skip would have already been reported.
   884			 */
   885			if (__seccomp_filter(this_syscall, NULL, true))
   886				return -1;
   887	
   888			return 0;
   889	
   890		case SECCOMP_RET_USER_NOTIF:
 > 891			seccomp_do_user_notification(this_syscall, match, sd);
   892			goto skip;
   893		case SECCOMP_RET_LOG:
   894			seccomp_log(this_syscall, 0, action, true);
   895			return 0;
   896	
   897		case SECCOMP_RET_ALLOW:
   898			/*
   899			 * Note that the "match" filter will always be NULL for
   900			 * this action since SECCOMP_RET_ALLOW is the starting
   901			 * state in seccomp_run_filters().
   902			 */
   903			return 0;
   904	
   905		case SECCOMP_RET_KILL_THREAD:
   906		case SECCOMP_RET_KILL_PROCESS:
   907		default:
   908			seccomp_log(this_syscall, SIGSYS, action, true);
   909			/* Dump core only if this is the last remaining thread. */
   910			if (action == SECCOMP_RET_KILL_PROCESS ||
   911			    get_nr_threads(current) == 1) {
   912				siginfo_t info;
   913	
   914				/* Show the original registers in the dump. */
   915				syscall_rollback(current, task_pt_regs(current));
   916				/* Trigger a manual coredump since do_exit skips it. */
   917				seccomp_init_siginfo(&info, this_syscall, data);
   918				do_coredump(&info);
   919			}
   920			if (action == SECCOMP_RET_KILL_PROCESS)
   921				do_group_exit(SIGSYS);
   922			else
   923				do_exit(SIGSYS);
   924		}
   925	
   926		unreachable();
   927	
   928	skip:
   929		seccomp_log(this_syscall, 0, action, match ? match->log : false);
   930		return -1;
   931	}
   932	#else
   933	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   934				    const bool recheck_after_trace)
   935	{
   936		BUG();
   937	}
   938	#endif
   939	
   940	int __secure_computing(const struct seccomp_data *sd)
   941	{
   942		int mode = current->seccomp.mode;
   943		int this_syscall;
   944	
   945		if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
   946		    unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
   947			return 0;
   948	
   949		this_syscall = sd ? sd->nr :
   950			syscall_get_nr(current, task_pt_regs(current));
   951	
   952		switch (mode) {
   953		case SECCOMP_MODE_STRICT:
   954			__secure_computing_strict(this_syscall);  /* may call do_exit */
   955			return 0;
   956		case SECCOMP_MODE_FILTER:
   957			return __seccomp_filter(this_syscall, sd, false);
   958		default:
   959			BUG();
   960		}
   961	}
   962	#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
   963	
   964	long prctl_get_seccomp(void)
   965	{
   966		return current->seccomp.mode;
   967	}
   968	
   969	/**
   970	 * seccomp_set_mode_strict: internal function for setting strict seccomp
   971	 *
   972	 * Once current->seccomp.mode is non-zero, it may not be changed.
   973	 *
   974	 * Returns 0 on success or -EINVAL on failure.
   975	 */
   976	static long seccomp_set_mode_strict(void)
   977	{
   978		const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
   979		long ret = -EINVAL;
   980	
   981		spin_lock_irq(&current->sighand->siglock);
   982	
   983		if (!seccomp_may_assign_mode(seccomp_mode))
   984			goto out;
   985	
   986	#ifdef TIF_NOTSC
   987		disable_TSC();
   988	#endif
   989		seccomp_assign_mode(current, seccomp_mode);
   990		ret = 0;
   991	
   992	out:
   993		spin_unlock_irq(&current->sighand->siglock);
   994	
   995		return ret;
   996	}
   997	
   998	#ifdef CONFIG_SECCOMP_FILTER
   999	#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
  1000	static struct file *init_listener(struct task_struct *,
  1001					  struct seccomp_filter *);
  1002	#endif
  1003	
  1004	/**
  1005	 * seccomp_set_mode_filter: internal function for setting seccomp filter
  1006	 * @flags:  flags to change filter behavior
  1007	 * @filter: struct sock_fprog containing filter
  1008	 *
  1009	 * This function may be called repeatedly to install additional filters.
  1010	 * Every filter successfully installed will be evaluated (in reverse order)
  1011	 * for each system call the task makes.
  1012	 *
  1013	 * Once current->seccomp.mode is non-zero, it may not be changed.
  1014	 *
  1015	 * Returns 0 on success or -EINVAL on failure.
  1016	 */
  1017	static long seccomp_set_mode_filter(unsigned int flags,
  1018					    const char __user *filter)
  1019	{
  1020		const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
  1021		struct seccomp_filter *prepared = NULL;
  1022		long ret = -EINVAL;
  1023		int listener = 0;
  1024		struct file *listener_f = NULL;
  1025	
  1026		/* Validate flags. */
  1027		if (flags & ~SECCOMP_FILTER_FLAG_MASK)
  1028			return -EINVAL;
  1029	
  1030		/* Prepare the new filter before holding any locks. */
  1031		prepared = seccomp_prepare_user_filter(filter);
  1032		if (IS_ERR(prepared))
  1033			return PTR_ERR(prepared);
  1034	
  1035		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> 1036			listener = get_unused_fd_flags(O_RDWR);
  1037			if (listener < 0) {
  1038				ret = listener;
  1039				goto out_free;
  1040			}
  1041	
> 1042			listener_f = init_listener(current, prepared);
  1043			if (IS_ERR(listener_f)) {
> 1044				put_unused_fd(listener);
  1045				ret = PTR_ERR(listener_f);
  1046				goto out_free;
  1047			}
  1048		}
  1049	
  1050		/*
  1051		 * Make sure we cannot change seccomp or nnp state via TSYNC
  1052		 * while another thread is in the middle of calling exec.
  1053		 */
  1054		if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
  1055		    mutex_lock_killable(&current->signal->cred_guard_mutex))
  1056			goto out_put_fd;
  1057	
  1058		spin_lock_irq(&current->sighand->siglock);
  1059	
  1060		if (!seccomp_may_assign_mode(seccomp_mode))
  1061			goto out;
  1062	
  1063		ret = seccomp_attach_filter(flags, prepared);
  1064		if (ret)
  1065			goto out;
  1066		/* Do not free the successfully attached filter. */
  1067		prepared = NULL;
  1068	
  1069		seccomp_assign_mode(current, seccomp_mode);
  1070	out:
  1071		spin_unlock_irq(&current->sighand->siglock);
  1072		if (flags & SECCOMP_FILTER_FLAG_TSYNC)
  1073			mutex_unlock(&current->signal->cred_guard_mutex);
  1074	out_put_fd:
  1075		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
  1076			if (ret < 0) {
> 1077				fput(listener_f);
  1078				put_unused_fd(listener);
  1079			} else {
> 1080				fd_install(listener, listener_f);
  1081				ret = listener;
  1082			}
  1083		}
  1084	out_free:
  1085		seccomp_filter_free(prepared);
  1086		return ret;
  1087	}
  1088	#else
  1089	static inline long seccomp_set_mode_filter(unsigned int flags,
  1090						   const char __user *filter)
  1091	{
  1092		return -EINVAL;
  1093	}
  1094	#endif
  1095	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
kbuild test robot May 19, 2018, 5:01 a.m.
Hi Tycho,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc5]
[cannot apply to next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180519-071527
config: i386-randconfig-a1-05181545 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   kernel/seccomp.c: In function '__seccomp_filter':
   kernel/seccomp.c:891:46: warning: passing argument 2 of 'seccomp_do_user_notification' makes integer from pointer without a cast
      seccomp_do_user_notification(this_syscall, match, sd);
                                                 ^
   kernel/seccomp.c:802:13: note: expected 'u32' but argument is of type 'struct seccomp_filter *'
    static void seccomp_do_user_notification(int this_syscall,
                ^
>> kernel/seccomp.c:891:53: warning: passing argument 3 of 'seccomp_do_user_notification' from incompatible pointer type
      seccomp_do_user_notification(this_syscall, match, sd);
                                                        ^
   kernel/seccomp.c:802:13: note: expected 'struct seccomp_filter *' but argument is of type 'const struct seccomp_data *'
    static void seccomp_do_user_notification(int this_syscall,
                ^
   kernel/seccomp.c:891:3: error: too few arguments to function 'seccomp_do_user_notification'
      seccomp_do_user_notification(this_syscall, match, sd);
      ^
   kernel/seccomp.c:802:13: note: declared here
    static void seccomp_do_user_notification(int this_syscall,
                ^
   kernel/seccomp.c: In function 'seccomp_set_mode_filter':
>> kernel/seccomp.c:1036:3: error: implicit declaration of function 'get_unused_fd_flags' [-Werror=implicit-function-declaration]
      listener = get_unused_fd_flags(O_RDWR);
      ^
>> kernel/seccomp.c:1042:3: error: implicit declaration of function 'init_listener' [-Werror=implicit-function-declaration]
      listener_f = init_listener(current, prepared);
      ^
   kernel/seccomp.c:1042:14: warning: assignment makes pointer from integer without a cast
      listener_f = init_listener(current, prepared);
                 ^
>> kernel/seccomp.c:1044:4: error: implicit declaration of function 'put_unused_fd' [-Werror=implicit-function-declaration]
       put_unused_fd(listener);
       ^
>> kernel/seccomp.c:1077:4: error: implicit declaration of function 'fput' [-Werror=implicit-function-declaration]
       fput(listener_f);
       ^
>> kernel/seccomp.c:1080:4: error: implicit declaration of function 'fd_install' [-Werror=implicit-function-declaration]
       fd_install(listener, listener_f);
       ^
   cc1: some warnings being treated as errors

vim +/get_unused_fd_flags +1036 kernel/seccomp.c

   812	
   813	#ifdef CONFIG_SECCOMP_FILTER
   814	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   815				    const bool recheck_after_trace)
   816	{
   817		u32 filter_ret, action;
   818		struct seccomp_filter *match = NULL;
   819		int data;
   820	
   821		/*
   822		 * Make sure that any changes to mode from another thread have
   823		 * been seen after TIF_SECCOMP was seen.
   824		 */
   825		rmb();
   826	
   827		filter_ret = seccomp_run_filters(sd, &match);
   828		data = filter_ret & SECCOMP_RET_DATA;
   829		action = filter_ret & SECCOMP_RET_ACTION_FULL;
   830	
   831		switch (action) {
   832		case SECCOMP_RET_ERRNO:
   833			/* Set low-order bits as an errno, capped at MAX_ERRNO. */
   834			if (data > MAX_ERRNO)
   835				data = MAX_ERRNO;
   836			syscall_set_return_value(current, task_pt_regs(current),
   837						 -data, 0);
   838			goto skip;
   839	
   840		case SECCOMP_RET_TRAP:
   841			/* Show the handler the original registers. */
   842			syscall_rollback(current, task_pt_regs(current));
   843			/* Let the filter pass back 16 bits of data. */
   844			seccomp_send_sigsys(this_syscall, data);
   845			goto skip;
   846	
   847		case SECCOMP_RET_TRACE:
   848			/* We've been put in this state by the ptracer already. */
   849			if (recheck_after_trace)
   850				return 0;
   851	
   852			/* ENOSYS these calls if there is no tracer attached. */
   853			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
   854				syscall_set_return_value(current,
   855							 task_pt_regs(current),
   856							 -ENOSYS, 0);
   857				goto skip;
   858			}
   859	
   860			/* Allow the BPF to provide the event message */
   861			ptrace_event(PTRACE_EVENT_SECCOMP, data);
   862			/*
   863			 * The delivery of a fatal signal during event
   864			 * notification may silently skip tracer notification,
   865			 * which could leave us with a potentially unmodified
   866			 * syscall that the tracer would have liked to have
   867			 * changed. Since the process is about to die, we just
   868			 * force the syscall to be skipped and let the signal
   869			 * kill the process and correctly handle any tracer exit
   870			 * notifications.
   871			 */
   872			if (fatal_signal_pending(current))
   873				goto skip;
   874			/* Check if the tracer forced the syscall to be skipped. */
   875			this_syscall = syscall_get_nr(current, task_pt_regs(current));
   876			if (this_syscall < 0)
   877				goto skip;
   878	
   879			/*
   880			 * Recheck the syscall, since it may have changed. This
   881			 * intentionally uses a NULL struct seccomp_data to force
   882			 * a reload of all registers. This does not goto skip since
   883			 * a skip would have already been reported.
   884			 */
   885			if (__seccomp_filter(this_syscall, NULL, true))
   886				return -1;
   887	
   888			return 0;
   889	
   890		case SECCOMP_RET_USER_NOTIF:
 > 891			seccomp_do_user_notification(this_syscall, match, sd);
   892			goto skip;
   893		case SECCOMP_RET_LOG:
   894			seccomp_log(this_syscall, 0, action, true);
   895			return 0;
   896	
   897		case SECCOMP_RET_ALLOW:
   898			/*
   899			 * Note that the "match" filter will always be NULL for
   900			 * this action since SECCOMP_RET_ALLOW is the starting
   901			 * state in seccomp_run_filters().
   902			 */
   903			return 0;
   904	
   905		case SECCOMP_RET_KILL_THREAD:
   906		case SECCOMP_RET_KILL_PROCESS:
   907		default:
   908			seccomp_log(this_syscall, SIGSYS, action, true);
   909			/* Dump core only if this is the last remaining thread. */
   910			if (action == SECCOMP_RET_KILL_PROCESS ||
   911			    get_nr_threads(current) == 1) {
   912				siginfo_t info;
   913	
   914				/* Show the original registers in the dump. */
   915				syscall_rollback(current, task_pt_regs(current));
   916				/* Trigger a manual coredump since do_exit skips it. */
   917				seccomp_init_siginfo(&info, this_syscall, data);
   918				do_coredump(&info);
   919			}
   920			if (action == SECCOMP_RET_KILL_PROCESS)
   921				do_group_exit(SIGSYS);
   922			else
   923				do_exit(SIGSYS);
   924		}
   925	
   926		unreachable();
   927	
   928	skip:
   929		seccomp_log(this_syscall, 0, action, match ? match->log : false);
   930		return -1;
   931	}
   932	#else
   933	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   934				    const bool recheck_after_trace)
   935	{
   936		BUG();
   937	}
   938	#endif
   939	
   940	int __secure_computing(const struct seccomp_data *sd)
   941	{
   942		int mode = current->seccomp.mode;
   943		int this_syscall;
   944	
   945		if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
   946		    unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
   947			return 0;
   948	
   949		this_syscall = sd ? sd->nr :
   950			syscall_get_nr(current, task_pt_regs(current));
   951	
   952		switch (mode) {
   953		case SECCOMP_MODE_STRICT:
   954			__secure_computing_strict(this_syscall);  /* may call do_exit */
   955			return 0;
   956		case SECCOMP_MODE_FILTER:
   957			return __seccomp_filter(this_syscall, sd, false);
   958		default:
   959			BUG();
   960		}
   961	}
   962	#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
   963	
   964	long prctl_get_seccomp(void)
   965	{
   966		return current->seccomp.mode;
   967	}
   968	
   969	/**
   970	 * seccomp_set_mode_strict: internal function for setting strict seccomp
   971	 *
   972	 * Once current->seccomp.mode is non-zero, it may not be changed.
   973	 *
   974	 * Returns 0 on success or -EINVAL on failure.
   975	 */
   976	static long seccomp_set_mode_strict(void)
   977	{
   978		const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
   979		long ret = -EINVAL;
   980	
   981		spin_lock_irq(&current->sighand->siglock);
   982	
   983		if (!seccomp_may_assign_mode(seccomp_mode))
   984			goto out;
   985	
   986	#ifdef TIF_NOTSC
   987		disable_TSC();
   988	#endif
   989		seccomp_assign_mode(current, seccomp_mode);
   990		ret = 0;
   991	
   992	out:
   993		spin_unlock_irq(&current->sighand->siglock);
   994	
   995		return ret;
   996	}
   997	
   998	#ifdef CONFIG_SECCOMP_FILTER
   999	#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
  1000	static struct file *init_listener(struct task_struct *,
  1001					  struct seccomp_filter *);
  1002	#endif
  1003	
  1004	/**
  1005	 * seccomp_set_mode_filter: internal function for setting seccomp filter
  1006	 * @flags:  flags to change filter behavior
  1007	 * @filter: struct sock_fprog containing filter
  1008	 *
  1009	 * This function may be called repeatedly to install additional filters.
  1010	 * Every filter successfully installed will be evaluated (in reverse order)
  1011	 * for each system call the task makes.
  1012	 *
  1013	 * Once current->seccomp.mode is non-zero, it may not be changed.
  1014	 *
  1015	 * Returns 0 on success or -EINVAL on failure.
  1016	 */
  1017	static long seccomp_set_mode_filter(unsigned int flags,
  1018					    const char __user *filter)
  1019	{
  1020		const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
  1021		struct seccomp_filter *prepared = NULL;
  1022		long ret = -EINVAL;
  1023		int listener = 0;
  1024		struct file *listener_f = NULL;
  1025	
  1026		/* Validate flags. */
  1027		if (flags & ~SECCOMP_FILTER_FLAG_MASK)
  1028			return -EINVAL;
  1029	
  1030		/* Prepare the new filter before holding any locks. */
  1031		prepared = seccomp_prepare_user_filter(filter);
  1032		if (IS_ERR(prepared))
  1033			return PTR_ERR(prepared);
  1034	
  1035		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> 1036			listener = get_unused_fd_flags(O_RDWR);
  1037			if (listener < 0) {
  1038				ret = listener;
  1039				goto out_free;
  1040			}
  1041	
> 1042			listener_f = init_listener(current, prepared);
  1043			if (IS_ERR(listener_f)) {
> 1044				put_unused_fd(listener);
  1045				ret = PTR_ERR(listener_f);
  1046				goto out_free;
  1047			}
  1048		}
  1049	
  1050		/*
  1051		 * Make sure we cannot change seccomp or nnp state via TSYNC
  1052		 * while another thread is in the middle of calling exec.
  1053		 */
  1054		if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
  1055		    mutex_lock_killable(&current->signal->cred_guard_mutex))
  1056			goto out_put_fd;
  1057	
  1058		spin_lock_irq(&current->sighand->siglock);
  1059	
  1060		if (!seccomp_may_assign_mode(seccomp_mode))
  1061			goto out;
  1062	
  1063		ret = seccomp_attach_filter(flags, prepared);
  1064		if (ret)
  1065			goto out;
  1066		/* Do not free the successfully attached filter. */
  1067		prepared = NULL;
  1068	
  1069		seccomp_assign_mode(current, seccomp_mode);
  1070	out:
  1071		spin_unlock_irq(&current->sighand->siglock);
  1072		if (flags & SECCOMP_FILTER_FLAG_TSYNC)
  1073			mutex_unlock(&current->signal->cred_guard_mutex);
  1074	out_put_fd:
  1075		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
  1076			if (ret < 0) {
> 1077				fput(listener_f);
  1078				put_unused_fd(listener);
  1079			} else {
> 1080				fd_install(listener, listener_f);
  1081				ret = listener;
  1082			}
  1083		}
  1084	out_free:
  1085		seccomp_filter_free(prepared);
  1086		return ret;
  1087	}
  1088	#else
  1089	static inline long seccomp_set_mode_filter(unsigned int flags,
  1090						   const char __user *filter)
  1091	{
  1092		return -EINVAL;
  1093	}
  1094	#endif
  1095	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Tycho Andersen May 21, 2018, 10:55 p.m.
On Sat, May 19, 2018 at 01:01:15PM +0800, kbuild test robot wrote:
> Hi Tycho,
> 
> I love your patch! Yet something to improve:

Whoops, seems I forgot to compile the
!CONFIG_SECCOMP_USER_NOTIFICATION case. Anyways, I've fixed this for
v3.

Tycho
Tycho Andersen May 24, 2018, 3:28 p.m.
Hi Oleg,

On Thu, May 17, 2018 at 05:46:37PM +0200, Oleg Nesterov wrote:
> On 05/17, Tycho Andersen wrote:
> >
> > > From lockdep pov this loop tries to take the same lock twice or more, it shoul
> > > complain.
> >
> > I didn't, but I guess that's because it's not trying to take the same lock
> > twice -- the pointer cur is changing in the loop.
> 
> Yes, I see. But this is the same lock for lockdep, it has the same class.

I finally figured this out, I needed CONFIG_PROVE_LOCKING=y too,
anyway, I've added the nesting annotations for v3. Thanks!

Tycho