[v3,1/4] seccomp: add a return code to trap to userspace

Submitted by Tycho Andersen on May 31, 2018, 2:49 p.m.

Details

Message ID 20180531144949.24995-2-tycho@tycho.ws
State New
Series "seccomp trap to userspace"
Headers show

Commit Message

Tycho Andersen May 31, 2018, 2:49 p.m.
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

v2: * make id a u64; the idea here being that it will never overflow,
      because 64 is huge (one syscall every nanosecond => wrap every 584
      years) (Andy)
    * prevent nesting of user notifications: if someone is already attached
      the tree in one place, nobody else can attach to the tree (Andy)
    * notify the listener of signals the tracee receives as well (Andy)
    * implement poll
v3: * lockdep fix (Oleg)
    * drop unnecessary WARN()s (Christian)
    * rearrange error returns to be more rpetty (Christian)
    * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/seccomp.c                              | 398 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
 5 files changed, 615 insertions(+), 6 deletions(-)

Patch hide | download patch | download mbox

diff --git a/arch/Kconfig b/arch/Kconfig
index 75dd23acf133..1c1ae8d8c8b9 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -401,6 +401,13 @@  config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c723a5c4e3ff..0fd3e0676a1c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@ 
 #include <uapi/linux/seccomp.h>
 
 #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG)
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_GET_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..8160e6cad528 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@ 
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	1
-#define SECCOMP_FILTER_FLAG_LOG		2
+#define SECCOMP_FILTER_FLAG_TSYNC		1
+#define SECCOMP_FILTER_FLAG_LOG			2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@ 
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@  struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u64 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dc77548167ef..f69327d5f7c7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -31,6 +31,7 @@ 
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
@@ -38,6 +39,52 @@ 
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The pid whose filter triggered the notification */
+	pid_t pid;
+
+	/*
+	 * The "cookie" for this request; this is unique for this filter.
+	 */
+	u32 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to READ.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -64,6 +111,27 @@  struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/* A lock for all notification-related accesses. */
+	struct mutex notify_lock;
+
+	/* Is there currently an attached listener? */
+	bool has_listener;
+
+	/* The id of the next request. */
+	u64 next_id;
+
+	/* A list of struct seccomp_knotif elements. */
+	struct list_head notifications;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +451,13 @@  static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+	sfilter->next_id = get_random_u64();
+#endif
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -547,13 +622,15 @@  static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -572,6 +649,9 @@  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -645,6 +725,81 @@  void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/* Note: overflow is ok here, the id just needs to be unique */
+	return filter->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->has_listener)
+		goto out;
+
+	n.pid = current->pid;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -722,6 +877,9 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -828,6 +986,9 @@  static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -847,6 +1008,8 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1020,28 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		listener = get_unused_fd_flags(O_RDWR);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -881,6 +1059,16 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -909,6 +1097,9 @@  static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1105,6 +1296,7 @@  long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1114,6 +1306,7 @@  static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1131,6 +1324,7 @@  static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1279,3 +1473,203 @@  static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = f->private_data;
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif;
+	ssize_t ret;
+
+	/* No offset reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
+	 * for now.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_SENT;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	ssize_t ret = -EINVAL;
+
+	/* No partial writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur;
+	bool have_listener = false;
+	int filter_nesting = 0;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock_nested(&cur->notify_lock, filter_nesting);
+		filter_nesting++;
+		if (cur->has_listener)
+			have_listener = true;
+	}
+
+	if (have_listener)
+		goto out;
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+	filter->has_listener = true;
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev)
+		mutex_unlock(&cur->notify_lock);
+
+	return ret;
+}
+#else
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 168c66d74fc5..f439bd3fb15f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,10 +40,12 @@ 
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
 
 #define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -150,6 +152,24 @@  struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u64 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2072,7 +2092,8 @@  TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2917,6 +2938,178 @@  TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		ret = read(listener, req, sizeof(*req));
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters so that we (don't) trigger lockdep. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_GET_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	EXPECT_EQ(ret, sizeof(req));
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks

Comments

Jann Horn via Containers June 3, 2018, 6:41 p.m.
On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
[...]
> @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               listener = get_unused_fd_flags(O_RDWR);

I think you want either 0 or O_CLOEXEC here?

> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
[...]
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(cur, &filter->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}

I don't think f_op->poll handlers work like this. AFAIK you're
supposed to use something like poll_wait() to connect the caller to
something like a waitqueue head, so that as soon as the file becomes
ready for reading/writing, any waiting task is notified. See
eventfd_poll() in fs/eventfd.c for a simple example. AFAICS in the
current code, seccomp_notify_poll() only works if an event is pending
at the time seccomp_notify_poll() is called.
Tycho Andersen June 4, 2018, 12:18 a.m.
Hi Jann,

On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> >
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> [...]
> > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > +               listener = get_unused_fd_flags(O_RDWR);
> 
> I think you want either 0 or O_CLOEXEC here?

Do we? I suppose it makes sense to be able to set CLOEXEC, but I could
imagine a case where a handler wanted to fork+exec to handle
something. I'm happy to make the change, but it's not obvious to me
that it's what we want by default.

> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> >  out_free:
> >         seccomp_filter_free(prepared);
> >         return ret;
> [...]
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(cur, &filter->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +       }
> > +
> > +       mutex_unlock(&filter->notify_lock);
> > +
> > +       return ret;
> > +}
> 
> I don't think f_op->poll handlers work like this. AFAIK you're
> supposed to use something like poll_wait() to connect the caller to
> something like a waitqueue head, so that as soon as the file becomes
> ready for reading/writing, any waiting task is notified. See
> eventfd_poll() in fs/eventfd.c for a simple example. AFAICS in the
> current code, seccomp_notify_poll() only works if an event is pending
> at the time seccomp_notify_poll() is called.

Arg. I was trying to avoid adding yet another piece of
synchronization, but perhaps it's not possible. Thanks for pointing
this out.

Tycho
Matthew Helsley June 12, 2018, 9:39 p.m.
On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:

<snip>


> +struct seccomp_notif {
> +       __u64 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
>

Since it's part of the UAPI I think it would be good to add documentation
to this patch series. Part of that documentation should talk about which
pid namespaces this pid value is relevant in. This is especially important
since the feature is designed for use by things like container/sandbox
managers.


> +
> +struct seccomp_notif_resp {
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
>

<snip>


> +struct seccomp_knotif {
> +       /* The pid whose filter triggered the notification */
> +       pid_t pid;

+
> +       /*
> +        * The "cookie" for this request; this is unique for this filter.
> +        */
> +       u32 id;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter
> which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once
> the
> +        * handler reads the notification off of an FD, it transitions to
> READ.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies,
> state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +#endif
>

<snip>


> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->has_listener)
> +               goto out;
> +
> +       n.pid = current->pid;
>

How have you tested this code for correctness? I don't see many
combinations being tested below nor here:
https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c

What about processes in different pid namespaces? Make tests that vary key
parameters like these between the task generating the notifications and the
task interested in processing the notifications. Make tests that try to
kill them at interesting times too. etc. Make tests that pass the
notification fd around and see how they work (or not).

I ask about testing because you're effectively returning a pid value to
userspace here but not using the proper macros to access the task's struct
pid for that purpose. That will work so long as both processes are in the
same pid namespace but breaks otherwise.

Furthermore, a pid value is not the best solution for the queueing of these
notifications because a single pid value is not meaningful/correct outside
current's pid namespace and the seccomp notification file descriptor could
be passed on to processes in another pid namespaces; this pid value will
almost always not be relevant or correct there hence taking a reference to
the struct pid is useful. Then, during read(), the code would use the
proper macro to turn the struct pid reference into a pid value relevant in
the *reader's* pid namespace *or* return something obviously bogus if the
reader is in a pid namespace that can't see that pid. This could prevent
management processes from being tricked into clobbering the wrong process,
feeding the wrong process sensitive information via syscall results, etc.

Alternately, you could choose to specify that the seccomp manager is
expected to be in the pid namespace of the process it's managing at all
times. That's not necessarily trivial either because the process(es) it
manages could potentially create new child pid namespaces. It also means
that the processes being managed can "see" the manager process at all times.

Regardless, you still need to use the proper macros to access current's pid
for export to userspace.

+       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notifications);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the
> mutex
> +        * while the reply was sent, so let's be sure there wasn't a
> response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it
> (potentially
> +                * again, if we had already notified them about the first
> one).
> +                */
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->request);
> +               }
> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;
> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
>

<snip>


> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +                                  size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = f->private_data;
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif;
> +       ssize_t ret;
> +
> +       /* No offset reads. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       ret = down_interruptible(&filter->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were
> able to
> +        * acquire the rw lock. Should we retry here or just -ENOENT?
> -ENOENT
> +        * for now.
> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       unotif.id = knotif->id;
> +       unotif.pid = knotif->pid;
> +       unotif.data = *(knotif->data);
> +
> +       size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = sizeof(unotif);
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
>

<snip>


> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 168c66d74fc5..f439bd3fb15f 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,10 +40,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>
>  #define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>
>  #include "../kselftest_harness.h"
>
> @@ -150,6 +152,24 @@ struct seccomp_metadata {
>  };
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +       __u64 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -                                SECCOMP_FILTER_FLAG_LOG };
> +                                SECCOMP_FILTER_FLAG_LOG,
> +                                SECCOMP_FILTER_FLAG_GET_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2917,6 +2938,178 @@ TEST(get_metadata)
>         ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +       int ret;
> +
> +       do {
> +               errno = 0;
> +               ret = read(listener, req, sizeof(*req));
> +       } while (ret == -1 && errno == ENOENT);
> +       return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct pollfd pollfd;
> +
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               if (user_trap_syscall(__NR_getpid, 0) < 0)
> +                       exit(1);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Add some no-op filters so that we (don't) trigger lockdep. */
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_GET_LISTENER);
> +       EXPECT_GE(listener, 0);
> +
> +       /* Installing a second listener in the chain should EBUSY */
> +       EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +                                   SECCOMP_FILTER_FLAG_GET_LISTENER),
> +                 -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +       EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the
> middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read(listener, &req, sizeof(req));
> +       EXPECT_EQ(ret, sizeof(req));
> +
> +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, EINVAL);
> +
> +       /*
> +        * Check that we get another notification about a signal in the
> middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +                       perror("signal");
> +                       exit(1);
> +               }
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, sizeof(resp));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       close(listener);
> +}
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
>

More combinations of tests would be good too.

Cheers,
     -Matt Helsley
Tycho Andersen June 12, 2018, 11:16 p.m.
Hi Matthew,

On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
> On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> 
> <snip>
> 
> 
> > +struct seccomp_notif {
> > +       __u64 id;
> > +       pid_t pid;
> > +       struct seccomp_data data;
> > +};
> >
> 
> Since it's part of the UAPI I think it would be good to add documentation
> to this patch series. Part of that documentation should talk about which
> pid namespaces this pid value is relevant in. This is especially important
> since the feature is designed for use by things like container/sandbox
> managers.

Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
updated. I'll add that for the next series.

> > +static void seccomp_do_user_notification(int this_syscall,
> > +                                        struct seccomp_filter *match,
> > +                                        const struct seccomp_data *sd)
> > +{
> > +       int err;
> > +       long ret = 0;
> > +       struct seccomp_knotif n = {};
> > +
> > +       mutex_lock(&match->notify_lock);
> > +       err = -ENOSYS;
> > +       if (!match->has_listener)
> > +               goto out;
> > +
> > +       n.pid = current->pid;
> >
> 
> How have you tested this code for correctness? I don't see many
> combinations being tested below nor here:
> https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c
> 
> What about processes in different pid namespaces? Make tests that vary key
> parameters like these between the task generating the notifications and the
> task interested in processing the notifications. Make tests that try to
> kill them at interesting times too. etc. Make tests that pass the
> notification fd around and see how they work (or not).
> 
> I ask about testing because you're effectively returning a pid value to
> userspace here but not using the proper macros to access the task's struct
> pid for that purpose. That will work so long as both processes are in the
> same pid namespace but breaks otherwise.
> 
> Furthermore, a pid value is not the best solution for the queueing of these
> notifications because a single pid value is not meaningful/correct outside
> current's pid namespace and the seccomp notification file descriptor could
> be passed on to processes in another pid namespaces; this pid value will
> almost always not be relevant or correct there hence taking a reference to

Well, it *has* to be relevant in some cases: if you want to access
some of the task's memory to e.g. read the args to the syscall, you
need to ptrace or map_files to access the memory. So we do need to
pass it, but,

> the struct pid is useful. Then, during read(), the code would use the
> proper macro to turn the struct pid reference into a pid value relevant in
> the *reader's* pid namespace *or* return something obviously bogus if the
> reader is in a pid namespace that can't see that pid. This could prevent
> management processes from being tricked into clobbering the wrong process,
> feeding the wrong process sensitive information via syscall results, etc.

Yes, this makes sense. Seems like we need to do some magic about
passing the tracee's task struct to the listener, so it can do
task_pid_vnr(). We could perhaps require the listener to be in the
init_pid_ns case, but I think things like socket() might still be
useful even if you can't read the tracee's memory.

> Alternately, you could choose to specify that the seccomp manager is
> expected to be in the pid namespace of the process it's managing at all
> times. That's not necessarily trivial either because the process(es) it
> manages could potentially create new child pid namespaces. It also means
> that the processes being managed can "see" the manager process at all times.

Right, I think we don't want to require this.

> Regardless, you still need to use the proper macros to access current's pid
> for export to userspace.

Yes, thanks.

Tycho
Jann Horn via Containers June 13, 2018, 3:32 p.m.
On Mon, Jun 4, 2018 at 2:18 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> Hi Jann,
>
> On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> > On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > This patch introduces a means for syscalls matched in seccomp to notify
> > > some other task that a particular filter has been triggered.
> > >
> > > The motivation for this is primarily for use with containers. For example,
> > > if a container does an init_module(), we obviously don't want to load this
> > > untrusted code, which may be compiled for the wrong version of the kernel
> > > anyway. Instead, we could parse the module image, figure out which module
> > > the container is trying to load and load it on the host.
> > >
> > > As another example, containers cannot mknod(), since this checks
> > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > > coding some whitelist in the kernel. Another example is mount(), which has
> > > many security restrictions for good reason, but configuration or runtime
> > > knowledge could potentially be used to relax these restrictions.
> > >
> > > This patch adds functionality that is already possible via at least two
> > > other means that I know about, both of which involve ptrace(): first, one
> > > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > > Unfortunately this is slow, so a faster version would be to install a
> > > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > > Since ptrace allows only one tracer, if the container runtime is that
> > > tracer, users inside the container (or outside) trying to debug it will not
> > > be able to use ptrace, which is annoying. It also means that older
> > > distributions based on Upstart cannot boot inside containers using ptrace,
> > > since upstart itself uses ptrace to start services.
> > >
> > > The actual implementation of this is fairly small, although getting the
> > > synchronization right was/is slightly complex.
> > >
> > > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > > memory data from the task still applies here, but can be avoided with
> > > careful design of the userspace handler: if the userspace handler reads all
> > > of the task memory that is necessary before applying its security policy,
> > > the tracee's subsequent memory edits will not be read by the tracer.
> > [...]
> > > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         if (IS_ERR(prepared))
> > >                 return PTR_ERR(prepared);
> > >
> > > +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > > +               listener = get_unused_fd_flags(O_RDWR);
> >
> > I think you want either 0 or O_CLOEXEC here?
>
> Do we? I suppose it makes sense to be able to set CLOEXEC, but I could
> imagine a case where a handler wanted to fork+exec to handle
> something. I'm happy to make the change, but it's not obvious to me
> that it's what we want by default.

I said "either 0 or O_CLOEXEC" - I just meant that O_RDWR doesn't make
much sense to me here, given that that's not a property of the fd and
will be ignored by the function you're calling.

On whether 0 or O_CLOEXEC is better: If you look at
get_unused_fd_flags() calls in e.g. various ioctl handlers, it's a mix
of places that hardcode 0, places that hardcode O_CLOEXEC, and places
that allow the caller to specify the flag. Either should work - but
personally, I believe that if the caller can't pass a flag,
get_unused_fd_flags(O_CLOEXEC) is the better choice because you can
still clear the O_CLOEXEC flag using fcntl() if necessary, while
setting the flag using fcntl() is potentially racy in a multi-threaded
context.
Jann Horn via Containers June 13, 2018, 3:43 p.m.
On Wed, Jun 13, 2018 at 5:32 PM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Jun 4, 2018 at 2:18 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > Hi Jann,
> >
> > On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> > > On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > >
> > > > This patch introduces a means for syscalls matched in seccomp to notify
> > > > some other task that a particular filter has been triggered.
[...]

By the way: You should probably CC linux-api@ on the next version of
this patchset.
Documentation/process/submitting-patches.rst says:

| If changes affect userland-kernel interfaces, please send the MAN-PAGES
| maintainer (as listed in the MAINTAINERS file) a man-pages patch, or at
| least a notification of the change, so that some information makes its way
| into the manual pages.  User-space API changes should also be copied to
| linux-api@vger.kernel.org.
Matthew Helsley June 14, 2018, 7:44 p.m.
On Tue, Jun 12, 2018 at 4:16 PM, Tycho Andersen <tycho@tycho.ws> wrote:

> Hi Matthew,
>
> On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
> > On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > <snip>
> >
> >
> > > +struct seccomp_notif {
> > > +       __u64 id;
> > > +       pid_t pid;
> > > +       struct seccomp_data data;
> > > +};
> > >
> >
> > Since it's part of the UAPI I think it would be good to add documentation
> > to this patch series. Part of that documentation should talk about which
> > pid namespaces this pid value is relevant in. This is especially
> important
> > since the feature is designed for use by things like container/sandbox
> > managers.
>
> Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
> updated. I'll add that for the next series.
>

Are there some relevant man pages too that should be updated too?


>
> > > +static void seccomp_do_user_notification(int this_syscall,
> > > +                                        struct seccomp_filter *match,
> > > +                                        const struct seccomp_data *sd)
> > > +{
> > > +       int err;
> > > +       long ret = 0;
> > > +       struct seccomp_knotif n = {};
> > > +
> > > +       mutex_lock(&match->notify_lock);
> > > +       err = -ENOSYS;
> > > +       if (!match->has_listener)
> > > +               goto out;
> > > +
> > > +       n.pid = current->pid;
> > >
> >
> > How have you tested this code for correctness? I don't see many
> > combinations being tested below nor here:
> > https://github.com/tych0/kernel-utils/blob/master/
> seccomp/notify_stress.c
> >
> > What about processes in different pid namespaces? Make tests that vary
> key
> > parameters like these between the task generating the notifications and
> the
> > task interested in processing the notifications. Make tests that try to
> > kill them at interesting times too. etc. Make tests that pass the
> > notification fd around and see how they work (or not).
> >
> > I ask about testing because you're effectively returning a pid value to
> > userspace here but not using the proper macros to access the task's
> struct
> > pid for that purpose. That will work so long as both processes are in the
> > same pid namespace but breaks otherwise.
> >
> > Furthermore, a pid value is not the best solution for the queueing of
> these
> > notifications because a single pid value is not meaningful/correct
> outside
> > current's pid namespace and the seccomp notification file descriptor
> could
> > be passed on to processes in another pid namespaces; this pid value will
> > almost always not be relevant or correct there hence taking a reference
> to
>
> Well, it *has* to be relevant in some cases: if you want to access
> some of the task's memory to e.g. read the args to the syscall, you
> need to ptrace or map_files to access the memory. So we do need to
> pass it, but,
>
> > the struct pid is useful. Then, during read(), the code would use the
> > proper macro to turn the struct pid reference into a pid value relevant
> in
> > the *reader's* pid namespace *or* return something obviously bogus if the
> > reader is in a pid namespace that can't see that pid. This could prevent
> > management processes from being tricked into clobbering the wrong
> process,
> > feeding the wrong process sensitive information via syscall results, etc.
>
> Yes, this makes sense. Seems like we need to do some magic about
> passing the tracee's task struct to the listener, so it can do
> task_pid_vnr(). We could perhaps require the listener to be in the
> init_pid_ns case, but I think things like socket() might still be
> useful even if you can't read the tracee's memory.


You could take a reference to the pid rather than the task
then use pid_vnr(). In that case the changes should result in something
like:


struct seccomp_knotif {
       /* The pid whose filter triggered the notification */
       struct pid *pid;
...

static void seccomp_do_user_notification(...)
{
    ...
    n.pid = get_task_pid(current, PIDTYPE_PID);
    ...
remove_list:
    list_del(&n.list);
    put_pid(n.pid);
    ...
}
...
static ssize_t seccomp_notify_read(...)
{
    ...
    unotif.pid = pid_vnr(knotif->pid);
    ...
}

I like holding the pid reference because it's what we do elsewhere when pid
namespaces
are a concern and it more precisely specifies what the knotif content needs
to convey.
Otherwise I don't think it makes a difference.

Cheers,
     -Matt
Tycho Andersen June 14, 2018, 9:03 p.m.
On Thu, Jun 14, 2018 at 12:44:21PM -0700, Matthew Helsley wrote:
> On Tue, Jun 12, 2018 at 4:16 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> 
> > Hi Matthew,
> >
> > On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
> > > On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > <snip>
> > >
> > >
> > > > +struct seccomp_notif {
> > > > +       __u64 id;
> > > > +       pid_t pid;
> > > > +       struct seccomp_data data;
> > > > +};
> > > >
> > >
> > > Since it's part of the UAPI I think it would be good to add documentation
> > > to this patch series. Part of that documentation should talk about which
> > > pid namespaces this pid value is relevant in. This is especially
> > important
> > > since the feature is designed for use by things like container/sandbox
> > > managers.
> >
> > Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
> > updated. I'll add that for the next series.
> >
> 
> Are there some relevant man pages too that should be updated too?

Yes, but since those are in a separate tree, I usually send the sets
separately.

> > > > +static void seccomp_do_user_notification(int this_syscall,
> > > > +                                        struct seccomp_filter *match,
> > > > +                                        const struct seccomp_data *sd)
> > > > +{
> > > > +       int err;
> > > > +       long ret = 0;
> > > > +       struct seccomp_knotif n = {};
> > > > +
> > > > +       mutex_lock(&match->notify_lock);
> > > > +       err = -ENOSYS;
> > > > +       if (!match->has_listener)
> > > > +               goto out;
> > > > +
> > > > +       n.pid = current->pid;
> > > >
> > >
> > > How have you tested this code for correctness? I don't see many
> > > combinations being tested below nor here:
> > > https://github.com/tych0/kernel-utils/blob/master/
> > seccomp/notify_stress.c
> > >
> > > What about processes in different pid namespaces? Make tests that vary
> > key
> > > parameters like these between the task generating the notifications and
> > the
> > > task interested in processing the notifications. Make tests that try to
> > > kill them at interesting times too. etc. Make tests that pass the
> > > notification fd around and see how they work (or not).
> > >
> > > I ask about testing because you're effectively returning a pid value to
> > > userspace here but not using the proper macros to access the task's
> > struct
> > > pid for that purpose. That will work so long as both processes are in the
> > > same pid namespace but breaks otherwise.
> > >
> > > Furthermore, a pid value is not the best solution for the queueing of
> > these
> > > notifications because a single pid value is not meaningful/correct
> > outside
> > > current's pid namespace and the seccomp notification file descriptor
> > could
> > > be passed on to processes in another pid namespaces; this pid value will
> > > almost always not be relevant or correct there hence taking a reference
> > to
> >
> > Well, it *has* to be relevant in some cases: if you want to access
> > some of the task's memory to e.g. read the args to the syscall, you
> > need to ptrace or map_files to access the memory. So we do need to
> > pass it, but,
> >
> > > the struct pid is useful. Then, during read(), the code would use the
> > > proper macro to turn the struct pid reference into a pid value relevant
> > in
> > > the *reader's* pid namespace *or* return something obviously bogus if the
> > > reader is in a pid namespace that can't see that pid. This could prevent
> > > management processes from being tricked into clobbering the wrong
> > process,
> > > feeding the wrong process sensitive information via syscall results, etc.
> >
> > Yes, this makes sense. Seems like we need to do some magic about
> > passing the tracee's task struct to the listener, so it can do
> > task_pid_vnr(). We could perhaps require the listener to be in the
> > init_pid_ns case, but I think things like socket() might still be
> > useful even if you can't read the tracee's memory.
> 
> 
> You could take a reference to the pid rather than the task
> then use pid_vnr(). In that case the changes should result in something
> like:
> 
> 
> struct seccomp_knotif {
>        /* The pid whose filter triggered the notification */
>        struct pid *pid;
> ...
> 
> static void seccomp_do_user_notification(...)
> {
>     ...
>     n.pid = get_task_pid(current, PIDTYPE_PID);
>     ...
> remove_list:
>     list_del(&n.list);
>     put_pid(n.pid);
>     ...
> }
> ...
> static ssize_t seccomp_notify_read(...)
> {
>     ...
>     unotif.pid = pid_vnr(knotif->pid);
>     ...
> }
> 
> I like holding the pid reference because it's what we do elsewhere when pid
> namespaces
> are a concern and it more precisely specifies what the knotif content needs
> to convey.
> Otherwise I don't think it makes a difference.

Great, thanks, I'll do this. I guess we need a put_pid() here too.

Cheers,

Tycho
Eric W. Biederman June 14, 2018, 9:53 p.m.
Tycho Andersen <tycho@tycho.ws> writes:

> On Thu, Jun 14, 2018 at 12:44:21PM -0700, Matthew Helsley wrote:
>> On Tue, Jun 12, 2018 at 4:16 PM, Tycho Andersen <tycho@tycho.ws> wrote:
>> 
>> > Hi Matthew,
>> >
>> > On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
>> > > On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> > >
>> > > <snip>
>> > >
>> > >
>> > > > +struct seccomp_notif {
>> > > > +       __u64 id;
>> > > > +       pid_t pid;
>> > > > +       struct seccomp_data data;
>> > > > +};
>> > > >
>> > >
>> > > Since it's part of the UAPI I think it would be good to add documentation
>> > > to this patch series. Part of that documentation should talk about which
>> > > pid namespaces this pid value is relevant in. This is especially
>> > important
>> > > since the feature is designed for use by things like container/sandbox
>> > > managers.
>> >
>> > Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
>> > updated. I'll add that for the next series.
>> >
>> 
>> Are there some relevant man pages too that should be updated too?
>
> Yes, but since those are in a separate tree, I usually send the sets
> separately.
>
>> > > > +static void seccomp_do_user_notification(int this_syscall,
>> > > > +                                        struct seccomp_filter *match,
>> > > > +                                        const struct seccomp_data *sd)
>> > > > +{
>> > > > +       int err;
>> > > > +       long ret = 0;
>> > > > +       struct seccomp_knotif n = {};
>> > > > +
>> > > > +       mutex_lock(&match->notify_lock);
>> > > > +       err = -ENOSYS;
>> > > > +       if (!match->has_listener)
>> > > > +               goto out;
>> > > > +
>> > > > +       n.pid = current->pid;
>> > > >
>> > >
>> > > How have you tested this code for correctness? I don't see many
>> > > combinations being tested below nor here:
>> > > https://github.com/tych0/kernel-utils/blob/master/
>> > seccomp/notify_stress.c
>> > >
>> > > What about processes in different pid namespaces? Make tests that vary
>> > key
>> > > parameters like these between the task generating the notifications and
>> > the
>> > > task interested in processing the notifications. Make tests that try to
>> > > kill them at interesting times too. etc. Make tests that pass the
>> > > notification fd around and see how they work (or not).
>> > >
>> > > I ask about testing because you're effectively returning a pid value to
>> > > userspace here but not using the proper macros to access the task's
>> > struct
>> > > pid for that purpose. That will work so long as both processes are in the
>> > > same pid namespace but breaks otherwise.
>> > >
>> > > Furthermore, a pid value is not the best solution for the queueing of
>> > these
>> > > notifications because a single pid value is not meaningful/correct
>> > outside
>> > > current's pid namespace and the seccomp notification file descriptor
>> > could
>> > > be passed on to processes in another pid namespaces; this pid value will
>> > > almost always not be relevant or correct there hence taking a reference
>> > to
>> >
>> > Well, it *has* to be relevant in some cases: if you want to access
>> > some of the task's memory to e.g. read the args to the syscall, you
>> > need to ptrace or map_files to access the memory. So we do need to
>> > pass it, but,
>> >
>> > > the struct pid is useful. Then, during read(), the code would use the
>> > > proper macro to turn the struct pid reference into a pid value relevant
>> > in
>> > > the *reader's* pid namespace *or* return something obviously bogus if the
>> > > reader is in a pid namespace that can't see that pid. This could prevent
>> > > management processes from being tricked into clobbering the wrong
>> > process,
>> > > feeding the wrong process sensitive information via syscall results, etc.
>> >
>> > Yes, this makes sense. Seems like we need to do some magic about
>> > passing the tracee's task struct to the listener, so it can do
>> > task_pid_vnr(). We could perhaps require the listener to be in the
>> > init_pid_ns case, but I think things like socket() might still be
>> > useful even if you can't read the tracee's memory.
>> 
>> 
>> You could take a reference to the pid rather than the task
>> then use pid_vnr(). In that case the changes should result in something
>> like:
>> 
>> 
>> struct seccomp_knotif {
>>        /* The pid whose filter triggered the notification */
>>        struct pid *pid;
>> ...
>> 
>> static void seccomp_do_user_notification(...)
>> {
>>     ...
>>     n.pid = get_task_pid(current, PIDTYPE_PID);
>>     ...
>> remove_list:
>>     list_del(&n.list);
>>     put_pid(n.pid);
>>     ...
>> }
>> ...
>> static ssize_t seccomp_notify_read(...)
>> {
>>     ...
>>     unotif.pid = pid_vnr(knotif->pid);
>>     ...
>> }
>> 
>> I like holding the pid reference because it's what we do elsewhere when pid
>> namespaces
>> are a concern and it more precisely specifies what the knotif content needs
>> to convey.
>> Otherwise I don't think it makes a difference.
>
> Great, thanks, I'll do this. I guess we need a put_pid() here too.

A)  We know that the task is stopped.  Unless there is something
    like SIGKILL that can make the task move you don't need to
    take a reference to anything.

B)  pid_vnr is the wrong answer.  When you create the struct file
    and intialize the filter you need to capture the calling processes
    pid namespace.  The you can use "pid_nr_ns(knotif->pid, filter->pid_ns);".
    That will work consistently even if the file descriptor is passed
    between processes.

Eric
Matthew Helsley June 14, 2018, 9:55 p.m.
(reviewing this some more...)

On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:

> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
>
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more rpetty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 398 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
>  5 files changed, 615 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 75dd23acf133..1c1ae8d8c8b9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>
>           See Documentation/prctl/seccomp_filter.txt for details.
>
> +config SECCOMP_USER_NOTIFICATION
> +       bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +       depends on SECCOMP_FILTER
> +       help
> +         Enable SECCOMP_RET_USER_NOTIF, a return code which can be used
> by seccomp
> +         programs to notify a userspace listener that a particular event
> happened.
> +
>  config HAVE_GCC_PLUGINS
>         bool
>         help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>
>  #define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> -                                        SECCOMP_FILTER_FLAG_LOG)
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_GET_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      1
> -#define SECCOMP_FILTER_FLAG_LOG                2
> +#define SECCOMP_FILTER_FLAG_TSYNC              1
> +#define SECCOMP_FILTER_FLAG_LOG                        2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER       4
>
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a
> SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or
> disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after
> logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u64 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..f69327d5f7c7 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -31,6 +31,7 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
> @@ -38,6 +39,52 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The pid whose filter triggered the notification */
> +       pid_t pid;
> +
> +       /*
> +        * The "cookie" for this request; this is unique for this filter.
> +        */
> +       u32 id;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter
> which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once
> the
> +        * handler reads the notification off of an FD, it transitions to
> READ.
>

Comment above needs to be updated: READ state no longer exists.


> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies,
> state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +111,27 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       /*
> +        * A semaphore that users of this notification can wait on for
> +        * changes. Actual reads and writes are still controlled with
> +        * filter->notify_lock.
> +        */
> +       struct semaphore request;
> +
> +       /* A lock for all notification-related accesses. */
> +       struct mutex notify_lock;
> +
> +       /* Is there currently an attached listener? */
> +       bool has_listener;
>

Assumes only one listener.

Is there any chance userspace could try to attach multiple listeners and
get confused? Perhaps by sharing the fd with multiple processes (via exec,
SCM_RIGHTS..)?

+
> +       /* The id of the next request. */
> +       u64 next_id;
> +
> +       /* A list of struct seccomp_knotif elements. */
> +       struct list_head notifications;
> +#endif
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +451,13 @@ static struct seccomp_filter
> *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       mutex_init(&sfilter->notify_lock);
> +       sema_init(&sfilter->request, 0);
> +       INIT_LIST_HEAD(&sfilter->notifications);
> +       sfilter->next_id = get_random_u64();
> +#endif
> +
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int
> reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
>                                     SECCOMP_LOG_TRACE |
> -                                   SECCOMP_LOG_LOG;
> +                                   SECCOMP_LOG_LOG |
> +                                   SECCOMP_LOG_USER_NOTIF;
>
>  static inline void seccomp_log(unsigned long syscall, long signr, u32
> action,
>                                bool requested)
> @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall,
> long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged &
> SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged &
> SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -645,6 +725,81 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */
> +       return filter->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->has_listener)
> +               goto out;
> +
> +       n.pid = current->pid;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notifications);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the
> mutex
> +        * while the reply was sent, so let's be sure there wasn't a
> response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it
> (potentially
> +                * again, if we had already notified them about the first
> one).
> +                */
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->request);
> +               }
> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;
> +       }
>

This section looks a little odd.

I notice you don't loop here yet you reset n.state back to INIT and don't
wait for
it to return to the SENT (much less REPLIED) state. This effectively looks
like a subtle
hard-coded number of "send" retries.

I'm low on time so I couldn't figure out answers to some important
questions:

Is one "retry" always enough? Is it possible the notification might get
lost due to the lack
of a loop here? Is it possible the syscall might return without a "reply"?
That's not
consistent with your comments about the states. Is there any possibility of
missing a
completion because of the interaction between this and other parts that
signal completion
conditionally based on this state?

+
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> +       do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data
> *sd,
>                             const bool recheck_after_trace)
> @@ -722,6 +877,9 @@ static int __seccomp_filter(int this_syscall, const
> struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;
>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -828,6 +986,9 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1008,8 @@ static long seccomp_set_mode_filter(unsigned int
> flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int
> flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {

+               listener = get_unused_fd_flags(O_RDWR);
>

So I checked and a man page does need to be update. Among other updates
this needs a mention in the RETURN section of
man 2 seccomp along the lines of:

"If SECCOMP_FILTER_FLAG_GET_LISTENER is specified then the return value is
a file descriptor upon success or -1 otherwise."

"If seccomp() is called multiple times
with SECCOMP_FILTER_FLAG_GET_LISTENER then, after the first successful
call, it will not
create any new file descriptors and may return with either the existing
file descriptor or -1 and errno set to EBUSY" (I smell a testcase!)

Now, that said, this interface is somewhat awkward. The principle of least
surprise suggests that subsequent calls with GET_LISTENER
should return any file descriptor created previously rather than -1 and
errno==EBUSY. Perhaps rather than GET_LISTENER you could rename it
NEW_LISTENER_EXCL. Or you could add the ability to return any pre-existing
fd.


> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -881,6 +1059,16 @@ static long seccomp_set_mode_filter(unsigned int
> flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -909,6 +1097,9 @@ static long seccomp_get_action_avail(const char
> __user *uaction)
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +                       break;
>         default:
>                 return -EOPNOTSUPP;
>         }
> @@ -1105,6 +1296,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1114,6 +1306,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1131,6 +1324,7 @@ static const struct seccomp_log_name
> seccomp_log_names[] = {
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { }
>  };
>
> @@ -1279,3 +1473,203 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       filter->has_listener = false;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +                                  size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = f->private_data;
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif;
> +       ssize_t ret;
> +
> +       /* No offset reads. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
>

I think you should use memset to clear unotif -- that prevents any
accidental information disclosure
via the kernel stack if this structure ever has padding.

Will the size of unotif ever *possibly* need to change (i.e. grow) in the
future? You might consider
how that could be enabled while maintaining backwards compatibility.

I'm thinking you should check size here:

if (size < sizeof(unotif))
        return -EFAULT;

Yes, copy_to_user() will stop you from walking past a page boundary but it
won't
stop you from quietly clobbering legitimate userspace memory. Seeing:

size != sizeof(unotif)

is also an indicator that there may be an ABI mismatch -- so again,
consider the
current/possible-future behavior carefully.

+       ret = down_interruptible(&filter->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were
> able to
> +        * acquire the rw lock. Should we retry here or just -ENOENT?
> -ENOENT
> +        * for now.
> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       unotif.id = knotif->id;
> +       unotif.pid = knotif->pid;
> +       unotif.data = *(knotif->data);
> +
> +       size = min_t(size_t, size, sizeof(struct seccomp_notif));


nit: sizeof(unotif) here since that's shorter, ultimately what we're
copying to userspace,
changes if the type ever changes, and you're also using that to set ret
later.


>

+       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = sizeof(unotif);
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user
> *buf,
> +                                   size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       ssize_t ret = -EINVAL;
> +
> +       /* No partial writes. */
> +       if (*ppos != 0)
> +               return -EINVAL;
>

What happens if size < sizeof(resp) ? Is there a chance we could give some
kernel bits to
the process we're filtering with seccomp()?

Again: Using memset() to clear resp seems like a good step since it
contains field(s) that
could appear in the filtered program. That or simply:

if (size < sizeof(resp))
        return -EINVAL;


> +
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINVAL;
>

nit: is there a better errno than EINVAL for this which would distinguish
this error from simple invalid parameters?

EALREADY (connection already in progress)?      <-- not mentioned in man 2
write ("unused" below)
EINPROGRESS (operation already in progress)? <-- unused (see man 2 connect)
ENOTSUP ?                                                           <--
unused
ENOTUNIQ (name not unique on network)?          <-- unused
ENOSPC (device containing file has no room for the data)?  <-- definitely
used
EIO (low level IO error) ?
       <-- definitely used


> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;

+
> +       list_for_each_entry(cur, &filter->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
>

Hmm, it's been a while since I read poll file ops but if you do wind up
walking this list then you may not have to walk the entire list here:

if (ret == EPOLLIN | EPOLLRDNORM | EPOLLOUT | EPOLLWRNORM)
        break;

Then poll() is not always O(N) (where N is the number of queued
notifications...)


> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .read = seccomp_notify_read,
> +       .write = seccomp_notify_write,
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{
> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur;
> +       bool have_listener = false;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               if (cur->has_listener)
> +                       have_listener = true;

+       }
> +
> +       if (have_listener)
> +               goto out;
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);
> +       filter->has_listener = true;
> +
> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev)
> +               mutex_unlock(&cur->notify_lock);
> +
> +       return ret;
> +}
> +#else
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{
> +       return ERR_PTR(-EINVAL);
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 168c66d74fc5..f439bd3fb15f 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,10 +40,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>
>  #define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>
>  #include "../kselftest_harness.h"
>
> @@ -150,6 +152,24 @@ struct seccomp_metadata {
>  };
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +       __u64 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -                                SECCOMP_FILTER_FLAG_LOG };
> +                                SECCOMP_FILTER_FLAG_LOG,
> +                                SECCOMP_FILTER_FLAG_GET_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2917,6 +2938,178 @@ TEST(get_metadata)
>         ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +       int ret;
> +
> +       do {
> +               errno = 0;
> +               ret = read(listener, req, sizeof(*req));
> +       } while (ret == -1 && errno == ENOENT);
> +       return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct pollfd pollfd;
> +
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               if (user_trap_syscall(__NR_getpid, 0) < 0)
> +                       exit(1);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Add some no-op filters so that we (don't) trigger lockdep. */
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_GET_LISTENER);
> +       EXPECT_GE(listener, 0);
> +
> +       /* Installing a second listener in the chain should EBUSY */
> +       EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +                                   SECCOMP_FILTER_FLAG_GET_LISTENER),
> +                 -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +       EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the
> middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read(listener, &req, sizeof(req));
> +       EXPECT_EQ(ret, sizeof(req));
> +
> +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, EINVAL);
> +
> +       /*
> +        * Check that we get another notification about a signal in the
> middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +                       perror("signal");
> +                       exit(1);
> +               }
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, sizeof(resp));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       close(listener);
> +}
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.17.0
>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>
Tobin C . Harding June 20, 2018, 5:05 a.m.
On Thu, May 31, 2018 at 08:49:46AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
> 
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
> 
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
> 
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
> 
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more rpetty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 398 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
>  5 files changed, 615 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 75dd23acf133..1c1ae8d8c8b9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/prctl/seccomp_filter.txt for details.
>  
> +config SECCOMP_USER_NOTIFICATION
> +	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +	depends on SECCOMP_FILTER
> +	help
> +	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +	  programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>  	bool
>  	help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>  
>  #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
> -					 SECCOMP_FILTER_FLAG_LOG)
> +					 SECCOMP_FILTER_FLAG_LOG | \
> +					 SECCOMP_FILTER_FLAG_GET_LISTENER)
>  
>  #ifdef CONFIG_SECCOMP
>  
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC	1
> -#define SECCOMP_FILTER_FLAG_LOG		2
> +#define SECCOMP_FILTER_FLAG_TSYNC		1
> +#define SECCOMP_FILTER_FLAG_LOG			2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>  	__u64 args[6];
>  };
>  
> +struct seccomp_notif {
> +	__u64 id;
> +	pid_t pid;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..f69327d5f7c7 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -31,6 +31,7 @@
>  #endif
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
> @@ -38,6 +39,52 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +	SECCOMP_NOTIFY_INIT,
> +	SECCOMP_NOTIFY_SENT,
> +	SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +	/* The pid whose filter triggered the notification */
> +	pid_t pid;
> +
> +	/*
> +	 * The "cookie" for this request; this is unique for this filter.
> +	 */
> +	u32 id;

Perhaps
	 /* The "cookie" for this request; this is unique for this filter. */


Epic patch review :)

thanks,
Tobin.
Tobin C . Harding June 20, 2018, 5:53 a.m.
A few other piddly suggestions.

On Thu, May 31, 2018 at 08:49:46AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
> 
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
> 
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
> 
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
> 
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more rpetty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 398 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
>  5 files changed, 615 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 75dd23acf133..1c1ae8d8c8b9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/prctl/seccomp_filter.txt for details.
>  
> +config SECCOMP_USER_NOTIFICATION
> +	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +	depends on SECCOMP_FILTER
> +	help
> +	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +	  programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>  	bool
>  	help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>  
>  #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
> -					 SECCOMP_FILTER_FLAG_LOG)
> +					 SECCOMP_FILTER_FLAG_LOG | \
> +					 SECCOMP_FILTER_FLAG_GET_LISTENER)
>  
>  #ifdef CONFIG_SECCOMP
>  
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC	1
> -#define SECCOMP_FILTER_FLAG_LOG		2
> +#define SECCOMP_FILTER_FLAG_TSYNC		1
> +#define SECCOMP_FILTER_FLAG_LOG			2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>  	__u64 args[6];
>  };
>  
> +struct seccomp_notif {
> +	__u64 id;
> +	pid_t pid;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..f69327d5f7c7 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -31,6 +31,7 @@
>  #endif
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
> @@ -38,6 +39,52 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +	SECCOMP_NOTIFY_INIT,
> +	SECCOMP_NOTIFY_SENT,
> +	SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +	/* The pid whose filter triggered the notification */
> +	pid_t pid;
> +
> +	/*
> +	 * The "cookie" for this request; this is unique for this filter.
> +	 */
> +	u32 id;
> +
> +	/*
> +	 * The seccomp data. This pointer is valid the entire time this
> +	 * notification is active, since it comes from __seccomp_filter which
> +	 * eclipses the entire lifecycle here.
> +	 */
> +	const struct seccomp_data *data;
> +
> +	/*
> +	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +	 * struct seccomp_knotif is created and starts out in INIT. Once the
> +	 * handler reads the notification off of an FD, it transitions to READ.
> +	 * If a signal is received the state transitions back to INIT and
> +	 * another message is sent. When the userspace handler replies, state
> +	 * transitions to REPLIED.
> +	 */
> +	enum notify_state state;
> +
> +	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +	int error;
> +	long val;
> +
> +	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +	struct completion ready;
> +
> +	struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +111,27 @@ struct seccomp_filter {
>  	bool log;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	/*
> +	 * A semaphore that users of this notification can wait on for
> +	 * changes. Actual reads and writes are still controlled with
> +	 * filter->notify_lock.
> +	 */
> +	struct semaphore request;
> +
> +	/* A lock for all notification-related accesses. */
> +	struct mutex notify_lock;
> +
> +	/* Is there currently an attached listener? */
> +	bool has_listener;
> +
> +	/* The id of the next request. */
> +	u64 next_id;
> +
> +	/* A list of struct seccomp_knotif elements. */
> +	struct list_head notifications;
> +#endif
>  };
>  
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  	if (!sfilter)
>  		return ERR_PTR(-ENOMEM);
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	mutex_init(&sfilter->notify_lock);
> +	sema_init(&sfilter->request, 0);
> +	INIT_LIST_HEAD(&sfilter->notifications);
> +	sfilter->next_id = get_random_u64();
> +#endif
> +
>  	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>  					seccomp_check_filter, save_orig);
>  	if (ret < 0) {
> @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE		(1 << 4)
>  #define SECCOMP_LOG_LOG			(1 << 5)
>  #define SECCOMP_LOG_ALLOW		(1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
>  
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>  				    SECCOMP_LOG_KILL_THREAD  |
>  				    SECCOMP_LOG_TRAP  |
>  				    SECCOMP_LOG_ERRNO |
>  				    SECCOMP_LOG_TRACE |
> -				    SECCOMP_LOG_LOG;
> +				    SECCOMP_LOG_LOG |
> +				    SECCOMP_LOG_USER_NOTIF;
>  
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  			       bool requested)
> @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  	case SECCOMP_RET_TRACE:
>  		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +		break;
>  	case SECCOMP_RET_LOG:
>  		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>  		break;
> @@ -645,6 +725,81 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +	/* Note: overflow is ok here, the id just needs to be unique */
> +	return filter->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	int err;
> +	long ret = 0;
> +	struct seccomp_knotif n = {};
> +
> +	mutex_lock(&match->notify_lock);
> +	err = -ENOSYS;
> +	if (!match->has_listener)
> +		goto out;
> +
> +	n.pid = current->pid;
> +	n.state = SECCOMP_NOTIFY_INIT;
> +	n.data = sd;
> +	n.id = seccomp_next_notify_id(match);
> +	init_completion(&n.ready);
> +
> +	list_add(&n.list, &match->notifications);
> +
> +	mutex_unlock(&match->notify_lock);
> +	up(&match->request);
> +
> +	err = wait_for_completion_interruptible(&n.ready);
> +	mutex_lock(&match->notify_lock);
> +
> +	/*
> +	 * Here it's possible we got a signal and then had to wait on the mutex
> +	 * while the reply was sent, so let's be sure there wasn't a response
> +	 * in the meantime.
> +	 */
> +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +		/*
> +		 * We got a signal. Let's tell userspace about it (potentially
> +		 * again, if we had already notified them about the first one).
> +		 */
> +		if (n.state == SECCOMP_NOTIFY_SENT) {
> +			n.state = SECCOMP_NOTIFY_INIT;
> +			up(&match->request);
> +		}
> +		mutex_unlock(&match->notify_lock);
> +		err = wait_for_completion_killable(&n.ready);
> +		mutex_lock(&match->notify_lock);
> +		if (err < 0)
> +			goto remove_list;
> +	}
> +
> +	ret = n.val;
> +	err = n.error;
> +
> +remove_list:
> +	list_del(&n.list);
> +out:
> +	mutex_unlock(&match->notify_lock);
> +	syscall_set_return_value(current, task_pt_regs(current),
> +				 err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> +	do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			    const bool recheck_after_trace)
> @@ -722,6 +877,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  
>  		return 0;
>  
> +	case SECCOMP_RET_USER_NOTIF:
> +		seccomp_do_user_notification(this_syscall, match, sd);
> +		goto skip;
>  	case SECCOMP_RET_LOG:

Perhaps add newline after 'got skip;' (inline with rest of this function).

>  		seccomp_log(this_syscall, 0, action, true);
>  		return 0;
> @@ -828,6 +986,9 @@ static long seccomp_set_mode_strict(void)
>  }
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +				  struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1008,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>  	struct seccomp_filter *prepared = NULL;
>  	long ret = -EINVAL;
> +	int listener = 0;
> +	struct file *listener_f = NULL;
>  
>  	/* Validate flags. */
>  	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	if (IS_ERR(prepared))
>  		return PTR_ERR(prepared);
>  
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		listener = get_unused_fd_flags(O_RDWR);
> +		if (listener < 0) {
> +			ret = listener;
> +			goto out_free;
> +		}
> +
> +		listener_f = init_listener(current, prepared);
> +		if (IS_ERR(listener_f)) {
> +			put_unused_fd(listener);
> +			ret = PTR_ERR(listener_f);
> +			goto out_free;
> +		}
> +	}
> +
>  	/*
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_free;
> +		goto out_put_fd;
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> @@ -881,6 +1059,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	spin_unlock_irq(&current->sighand->siglock);
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>  		mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		if (ret < 0) {
> +			fput(listener_f);
> +			put_unused_fd(listener);
> +		} else {
> +			fd_install(listener, listener_f);
> +			ret = listener;
> +		}
> +	}
>  out_free:
>  	seccomp_filter_free(prepared);
>  	return ret;
> @@ -909,6 +1097,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>  	case SECCOMP_RET_LOG:
>  	case SECCOMP_RET_ALLOW:
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +			break;
>  	default:
>  		return -EOPNOTSUPP;
>  	}
> @@ -1105,6 +1296,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
>  #define SECCOMP_RET_TRAP_NAME		"trap"
>  #define SECCOMP_RET_ERRNO_NAME		"errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
>  #define SECCOMP_RET_TRACE_NAME		"trace"
>  #define SECCOMP_RET_LOG_NAME		"log"
>  #define SECCOMP_RET_ALLOW_NAME		"allow"
> @@ -1114,6 +1306,7 @@ static const char seccomp_actions_avail[] =
>  				SECCOMP_RET_KILL_THREAD_NAME	" "
>  				SECCOMP_RET_TRAP_NAME		" "
>  				SECCOMP_RET_ERRNO_NAME		" "
> +				SECCOMP_RET_USER_NOTIF_NAME     " "
>  				SECCOMP_RET_TRACE_NAME		" "
>  				SECCOMP_RET_LOG_NAME		" "
>  				SECCOMP_RET_ALLOW_NAME;
> @@ -1131,6 +1324,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>  	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>  	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>  	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>  	{ }
>  };
>  
> @@ -1279,3 +1473,203 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>  
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_knotif *knotif;
> +
> +	mutex_lock(&filter->notify_lock);
> +
> +	/*
> +	 * If this file is being closed because e.g. the task who owned it
> +	 * died, let's wake everyone up who was waiting on us.
> +	 */
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +			continue;
> +
> +		knotif->state = SECCOMP_NOTIFY_REPLIED;
> +		knotif->error = -ENOSYS;
> +		knotif->val = 0;
> +
> +		complete(&knotif->ready);
> +	}
> +
> +	filter->has_listener = false;
> +	mutex_unlock(&filter->notify_lock);
> +	__put_seccomp_filter(filter);
> +	return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +				   size_t size, loff_t *ppos)
> +{
> +	struct seccomp_filter *filter = f->private_data;
> +	struct seccomp_knotif *knotif = NULL, *cur;
> +	struct seccomp_notif unotif;
> +	ssize_t ret;
> +
> +	/* No offset reads. */
> +	if (*ppos != 0)
> +		return -EINVAL;
> +
> +	ret = down_interruptible(&filter->request);
> +	if (ret < 0)
> +		return ret;
> +
> +	mutex_lock(&filter->notify_lock);
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT) {
> +			knotif = cur;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * If we didn't find a notification, it could be that the task was
> +	 * interrupted between the time we were woken and when we were able to
> +	 * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
> +	 * for now.
> +	 */
> +	if (!knotif) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	unotif.id = knotif->id;
> +	unotif.pid = knotif->pid;
> +	unotif.data = *(knotif->data);
> +
> +	size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +	if (copy_to_user(buf, &unotif, size)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = sizeof(unotif);
> +	knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> +				    size_t size, loff_t *ppos)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_notif_resp resp = {};
> +	struct seccomp_knotif *knotif = NULL;

Perhaps the other way around (inverse Christmas tree)

	struct seccomp_knotif *knotif = NULL;
	struct seccomp_notif_resp resp = {};

> +	ssize_t ret = -EINVAL;
> +
> +	/* No partial writes. */
> +	if (*ppos != 0)
> +		return -EINVAL;
> +
> +	size = min_t(size_t, size, sizeof(resp));
> +	if (copy_from_user(&resp, buf, size))
> +		return -EFAULT;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == resp.id)
> +			break;
> +	}
> +
> +	if (!knotif || knotif->id != resp.id) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Allow exactly one reply. */
> +	if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = size;
> +	knotif->state = SECCOMP_NOTIFY_REPLIED;
> +	knotif->error = resp.error;
> +	knotif->val = resp.val;
> +	complete(&knotif->ready);
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +				    struct poll_table_struct *poll_tab)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	__poll_t ret = 0;
> +	struct seccomp_knotif *cur;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT)
> +			ret |= EPOLLIN | EPOLLRDNORM;
> +		if (cur->state == SECCOMP_NOTIFY_SENT)
> +			ret |= EPOLLOUT | EPOLLWRNORM;
> +	}
> +
> +	mutex_unlock(&filter->notify_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +	.read = seccomp_notify_read,
> +	.write = seccomp_notify_write,
> +	.poll = seccomp_notify_poll,
> +	.release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur;
> +	bool have_listener = false;
> +	int filter_nesting = 0;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +		filter_nesting++;
> +		if (cur->has_listener)
> +			have_listener = true;
> +	}
> +
> +	if (have_listener)
> +		goto out;

Perhaps just goto out directly

		if (cur->has_listener)
			goto out;


Hope this helps,
Tobin
Tycho Andersen June 20, 2018, 2:41 p.m.
Hi Eric,

On Thu, Jun 14, 2018 at 04:53:51PM -0500, Eric W. Biederman wrote:
> >> static void seccomp_do_user_notification(...)
> >> {
> >>     ...
> >>     n.pid = get_task_pid(current, PIDTYPE_PID);
> >>     ...
> >> remove_list:
> >>     list_del(&n.list);
> >>     put_pid(n.pid);
> >>     ...
> >> }
> >> ...
> >> static ssize_t seccomp_notify_read(...)
> >> {
> >>     ...
> >>     unotif.pid = pid_vnr(knotif->pid);
> >>     ...
> >> }
> >> 
> >> I like holding the pid reference because it's what we do elsewhere when pid
> >> namespaces
> >> are a concern and it more precisely specifies what the knotif content needs
> >> to convey.
> >> Otherwise I don't think it makes a difference.
> >
> > Great, thanks, I'll do this. I guess we need a put_pid() here too.
> 
> A)  We know that the task is stopped.  Unless there is something
>     like SIGKILL that can make the task move you don't need to
>     take a reference to anything.

Yes, agreed. (I think the task can't die, because even if it gets an
interrupt, we hold the ->notify_lock here, so it'll block waiting for
that to remove itself from the notification queue.)

> B)  pid_vnr is the wrong answer.  When you create the struct file
>     and intialize the filter you need to capture the calling processes
>     pid namespace.  The you can use "pid_nr_ns(knotif->pid, filter->pid_ns);".
>     That will work consistently even if the file descriptor is passed
>     between processes.

We want the pid of the tracee in the tracer's namespace, so I'm not so
sure. Doesn't your code above give us the pid in the namespace of the
task that happened to create the struct file (which may be unrelated
to the namespace of the tracer)?

Tycho
Tycho Andersen June 20, 2018, 2:55 p.m.
On Thu, Jun 14, 2018 at 02:55:03PM -0700, Matthew Helsley wrote:
> > +       /*
> > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > +        * struct seccomp_knotif is created and starts out in INIT. Once
> > the
> > +        * handler reads the notification off of an FD, it transitions to
> > READ.
> >
> 
> Comment above needs to be updated: READ state no longer exists.

Thanks, fixed for v4.

> 
> > +        * If a signal is received the state transitions back to INIT and
> > +        * another message is sent. When the userspace handler replies,
> > state
> > +        * transitions to REPLIED.
> > +        */
> > +       enum notify_state state;
> > +
> > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > +       int error;
> > +       long val;
> > +
> > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > +       struct completion ready;
> > +
> > +       struct list_head list;
> > +};
> > +#endif
> > +
> >  /**
> >   * struct seccomp_filter - container for seccomp BPF programs
> >   *
> > @@ -64,6 +111,27 @@ struct seccomp_filter {
> >         bool log;
> >         struct seccomp_filter *prev;
> >         struct bpf_prog *prog;
> > +
> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> > +       /*
> > +        * A semaphore that users of this notification can wait on for
> > +        * changes. Actual reads and writes are still controlled with
> > +        * filter->notify_lock.
> > +        */
> > +       struct semaphore request;
> > +
> > +       /* A lock for all notification-related accesses. */
> > +       struct mutex notify_lock;
> > +
> > +       /* Is there currently an attached listener? */
> > +       bool has_listener;
> >
> 
> Assumes only one listener.
>
> Is there any chance userspace could try to attach multiple listeners and
> get confused? Perhaps by sharing the fd with multiple processes (via exec,
> SCM_RIGHTS..)?

Yes, only one listener fd is allowed, we return -EBUSY if you try to
attach multiple times. Once the fd is created, listening to it with
multiple tasks is fine, we synchronize that (indeed, that's what the
stress test tests).

> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the
> > mutex
> > +        * while the reply was sent, so let's be sure there wasn't a
> > response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it
> > (potentially
> > +                * again, if we had already notified them about the first
> > one).
> > +                */
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->request);
> > +               }
> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> > +       }
> >
> 
> This section looks a little odd.
> 
> I notice you don't loop here yet you reset n.state back to INIT and don't
> wait for
> it to return to the SENT (much less REPLIED) state. This effectively looks
> like a subtle
> hard-coded number of "send" retries.

Yes, it is.

> I'm low on time so I couldn't figure out answers to some important
> questions:
> 
> Is one "retry" always enough? Is it possible the notification might get
> lost due to the lack
> of a loop here? Is it possible the syscall might return without a "reply"?
> That's not
> consistent with your comments about the states. Is there any possibility of
> missing a
> completion because of the interaction between this and other parts that
> signal completion
> conditionally based on this state?

This is in response to this subthread on the first version:
https://lkml.org/lkml/2018/3/15/1122

Basically, we want exactly one notification if the thing gets a
signal, and then we wait_killable instead of wait_interruptible
instead. It should not be possible for a syscall to return without a
reply (indeed, for a syscall to return, it needs a wake on the
completion, so it would just hang forever if there's some bug here).

> +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +#else
> > +static void seccomp_do_user_notification(int this_syscall,
> > +                                        struct seccomp_filter *match,
> > +                                        const struct seccomp_data *sd)
> > +{
> > +       seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> > +       do_exit(SIGSYS);
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_SECCOMP_FILTER
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data
> > *sd,
> >                             const bool recheck_after_trace)
> > @@ -722,6 +877,9 @@ static int __seccomp_filter(int this_syscall, const
> > struct seccomp_data *sd,
> >
> >                 return 0;
> >
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               seccomp_do_user_notification(this_syscall, match, sd);
> > +               goto skip;
> >         case SECCOMP_RET_LOG:
> >                 seccomp_log(this_syscall, 0, action, true);
> >                 return 0;
> > @@ -828,6 +986,9 @@ static long seccomp_set_mode_strict(void)
> >  }
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -847,6 +1008,8 @@ static long seccomp_set_mode_filter(unsigned int
> > flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int
> > flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> 
> +               listener = get_unused_fd_flags(O_RDWR);
> >
> 
> So I checked and a man page does need to be update. Among other updates
> this needs a mention in the RETURN section of
> man 2 seccomp along the lines of:
> 
> "If SECCOMP_FILTER_FLAG_GET_LISTENER is specified then the return value is
> a file descriptor upon success or -1 otherwise."
> 
> "If seccomp() is called multiple times
> with SECCOMP_FILTER_FLAG_GET_LISTENER then, after the first successful
> call, it will not
> create any new file descriptors and may return with either the existing
> file descriptor or -1 and errno set to EBUSY" (I smell a testcase!)

Yes, I'll update the man pages when these get merged. And there
already is a test for this.

> Now, that said, this interface is somewhat awkward. The principle of least
> surprise suggests that subsequent calls with GET_LISTENER
> should return any file descriptor created previously rather than -1 and
> errno==EBUSY. Perhaps rather than GET_LISTENER you could rename it
> NEW_LISTENER_EXCL. Or you could add the ability to return any pre-existing
> fd.

How about NEW_LISTENER? We can add a GET_LISTENER command later if
people find it useful, but frankly I think the ptrace interface is
what most people will use, and we could possibly drop this all
together.

> > +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> > +                                  size_t size, loff_t *ppos)
> > +{
> > +       struct seccomp_filter *filter = f->private_data;
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif;
> > +       ssize_t ret;
> > +
> > +       /* No offset reads. */
> > +       if (*ppos != 0)
> > +               return -EINVAL;
> > +
> >
> 
> I think you should use memset to clear unotif -- that prevents any
> accidental information disclosure
> via the kernel stack if this structure ever has padding.

Good catch, thanks.

> Will the size of unotif ever *possibly* need to change (i.e. grow) in the
> future? You might consider
> how that could be enabled while maintaining backwards compatibility.
> 
> I'm thinking you should check size here:
> 
> if (size < sizeof(unotif))
>         return -EFAULT;

Well, if it ever needs to grow, we can't do that :). And yes, the
intent is for it to grow -- it does later in this series, for example.

> Yes, copy_to_user() will stop you from walking past a page boundary but it
> won't
> stop you from quietly clobbering legitimate userspace memory. Seeing:
> 
> size != sizeof(unotif)
> 
> is also an indicator that there may be an ABI mismatch -- so again,
> consider the
> current/possible-future behavior carefully.
> 
> +       ret = down_interruptible(&filter->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were
> > able to
> > +        * acquire the rw lock. Should we retry here or just -ENOENT?
> > -ENOENT
> > +        * for now.
> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       unotif.id = knotif->id;
> > +       unotif.pid = knotif->pid;
> > +       unotif.data = *(knotif->data);
> > +
> > +       size = min_t(size_t, size, sizeof(struct seccomp_notif));
> 
> 
> nit: sizeof(unotif) here since that's shorter, ultimately what we're
> copying to userspace,
> changes if the type ever changes, and you're also using that to set ret
> later.

Fixed, thanks.

> 
> >
> 
> +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = sizeof(unotif);
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static ssize_t seccomp_notify_write(struct file *file, const char __user
> > *buf,
> > +                                   size_t size, loff_t *ppos)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       ssize_t ret = -EINVAL;
> > +
> > +       /* No partial writes. */
> > +       if (*ppos != 0)
> > +               return -EINVAL;
> >
> 
> What happens if size < sizeof(resp) ? Is there a chance we could give some
> kernel bits to
> the process we're filtering with seccomp()?
> 
> Again: Using memset() to clear resp seems like a good step since it
> contains field(s) that
> could appear in the filtered program. That or simply:
> 
> if (size < sizeof(resp))
>         return -EINVAL;

We're initializing with = {}, so I think it's not necessary here.

> 
> > +
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(knotif, &filter->notifications, list) {
> > +               if (knotif->id == resp.id)
> > +                       break;
> > +       }
> > +
> > +       if (!knotif || knotif->id != resp.id) {
> > +               ret = -EINVAL;
> > +               goto out;
> > +       }
> > +
> > +       /* Allow exactly one reply. */
> > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > +               ret = -EINVAL;
> >
> 
> nit: is there a better errno than EINVAL for this which would distinguish
> this error from simple invalid parameters?
> 
> EALREADY (connection already in progress)?      <-- not mentioned in man 2
> write ("unused" below)
> EINPROGRESS (operation already in progress)? <-- unused (see man 2 connect)

Sure, I'll switch to one of these.

> ENOTSUP ?                                                           <--
> unused
> ENOTUNIQ (name not unique on network)?          <-- unused
> ENOSPC (device containing file has no room for the data)?  <-- definitely
> used
> EIO (low level IO error) ?
>        <-- definitely used
> 
> 
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +       knotif->error = resp.error;
> > +       knotif->val = resp.val;
> > +       complete(&knotif->ready);
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> 
> +
> > +       list_for_each_entry(cur, &filter->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> >
> 
> Hmm, it's been a while since I read poll file ops but if you do wind up
> walking this list then you may not have to walk the entire list here:
> 
> if (ret == EPOLLIN | EPOLLRDNORM | EPOLLOUT | EPOLLWRNORM)
>         break;
> 
> Then poll() is not always O(N) (where N is the number of queued
> notifications...)

Yep, poll is totally reworked for v4 based on other comments.

Tycho