[v6,4/5] seccomp: add support for passing fds via USER_NOTIF

Submitted by Tycho Andersen on Sept. 6, 2018, 3:28 p.m.

Details

Message ID 20180906152859.7810-5-tycho@tycho.ws
State New
Series "seccomp trap to userspace"
Headers show

Commit Message

Tycho Andersen Sept. 6, 2018, 3:28 p.m.
The idea here is that the userspace handler should be able to pass an fd
back to the trapped task, for example so it can be returned from socket().

I've proposed one API here, but I'm open to other options. In particular,
this only lets you return an fd from a syscall, which may not be enough in
all cases. For example, if an fd is written to an output parameter instead
of returned, the current API can't handle this. Another case is that
netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
ever decides to install an fd and output it, we wouldn't be able to handle
this either.

Still, the vast majority of interesting cases are covered by this API, so
perhaps it is Enough.

I've left it as a separate commit for two reasons:
  * It illustrates the way in which we would grow struct seccomp_notif and
    struct seccomp_notif_resp without using netlink
  * It shows just how little code is needed to accomplish this :)

v2: new in v2
v3: no changes
v4: * pass fd flags back from userspace as well (Jann)
    * update same cgroup data on fd pass as SCM_RIGHTS (Alban)
    * only set the REPLIED state /after/ successful fdget (Alban)
    * reflect GET_LISTENER -> NEW_LISTENER changes
    * add to the new Documentation/ on user notifications about fd replies
v5: * fix documentation typo (O_EXCL -> O_CLOEXEC)

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 .../userspace-api/seccomp_filter.rst          |  11 ++
 include/uapi/linux/seccomp.h                  |   3 +
 kernel/seccomp.c                              |  51 +++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 116 ++++++++++++++++++
 4 files changed, 179 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index d1498885c1c7..1c0aab306426 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -235,6 +235,9 @@  The interface for a seccomp notification fd consists of two structures:
         __u64 id;
         __s32 error;
         __s64 val;
+        __u8 return_fd;
+        __u32 fd;
+        __u32 fd_flags;
     };
 
 Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
@@ -256,6 +259,14 @@  mentioned above in this document: all arguments being read from the tracee's
 memory should be read into the tracer's memory before any policy decisions are
 made. This allows for an atomic decision on syscall arguments.
 
+Userspace can also return file descriptors. For example, one may decide to
+intercept ``socket()`` syscalls, and return some file descriptor from those
+based on some policy. To return a file descriptor, the ``return_fd`` member
+should be non-zero, the ``fd`` argument should be the fd in the listener's
+table to send to the tracee (similar to how ``SCM_RIGHTS`` works), and
+``fd_flags`` should be the flags that the fd in the tracee's table is opened
+with (e.g. ``O_CLOEXEC`` or similar).
+
 Sysctls
 =======
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index aa5878972128..93f1bd5c7cf0 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -75,6 +75,9 @@  struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
+	__u32 fd_flags;
 };
 
 #define SECCOMP_IOC_MAGIC		0xF7
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 580888785324..4a6db4076ec5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,7 @@ 
 
 #ifdef CONFIG_SECCOMP_USER_NOTIFICATION
 #include <linux/anon_inodes.h>
+#include <net/cls_cgroup.h>
 
 enum notify_state {
 	SECCOMP_NOTIFY_INIT,
@@ -80,6 +81,8 @@  struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	struct file *file;
+	unsigned int flags;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -800,10 +803,44 @@  static void seccomp_do_user_notification(int this_syscall,
 			goto remove_list;
 	}
 
-	ret = n.val;
-	err = n.error;
+	if (n.file) {
+		int fd;
+		struct socket *sock;
+
+		fd = get_unused_fd_flags(n.flags);
+		if (fd < 0) {
+			err = fd;
+			ret = -1;
+			goto remove_list;
+		}
+
+		/*
+		 * Similar to what SCM_RIGHTS does, let's re-set the cgroup
+		 * data to point ot the tracee's cgroups instead of the
+		 * listener's.
+		 */
+		sock = sock_from_file(n.file, &err);
+		if (sock) {
+			sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+			sock_update_classid(&sock->sk->sk_cgrp_data);
+		}
+
+		ret = fd;
+		err = 0;
+
+		fd_install(fd, n.file);
+		/* Don't fput, since fd has a reference now */
+		n.file = NULL;
+	} else {
+		ret = n.val;
+		err = n.error;
+	}
+
 
 remove_list:
+	if (n.file)
+		fput(n.file);
+
 	list_del(&n.list);
 out:
 	mutex_unlock(&match->notify_lock);
@@ -1675,10 +1712,20 @@  static long seccomp_notify_send(struct seccomp_filter *filter,
 		goto out;
 	}
 
+	if (resp.return_fd) {
+		knotif->flags = resp.fd_flags;
+		knotif->file = fget(resp.fd);
+		if (!knotif->file) {
+			ret = -EBADF;
+			goto out;
+		}
+	}
+
 	ret = size;
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+
 	complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61b8e3c5c06b..c756722faa88 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -182,6 +182,9 @@  struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
+	__u32 fd_flags;
 };
 #endif
 
@@ -3233,6 +3236,119 @@  TEST(get_user_notification_ptrace)
 	close(listener);
 }
 
+TEST(user_notification_pass_fd)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	long ret;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf[16];
+
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		close(sk_pair[1]);
+
+		/* An fd from getpid(). Let the games begin. */
+		fd = syscall(__NR_getpid);
+		EXPECT_GT(fd, 0);
+		EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
+		close(fd);
+
+		exit(strcmp("hello world", buf));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done installing so it can do a getpid */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+	close(sk_pair[0]);
+
+	/* Make a new socket pair so we can send half across */
+	EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.return_fd = 1;
+	resp.fd = sk_pair[1];
+	resp.fd_flags = 0;
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+	close(sk_pair[1]);
+
+	EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
+	close(sk_pair[0]);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+TEST(user_notification_struct_size_mismatch)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, len;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	/*
+	 * Only write a partial structure: this is what was available before we
+	 * had fd support.
+	 */
+	len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val);
+	resp.len = len;
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), len);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */

Comments

Jann Horn via Containers Sept. 6, 2018, 4:15 p.m.
On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
[...]
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index d1498885c1c7..1c0aab306426 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
>          __u64 id;
>          __s32 error;
>          __s64 val;
> +        __u8 return_fd;
> +        __u32 fd;
> +        __u32 fd_flags;

Normally,  syscalls that take an optional file descriptor accept a
signed 32-bit number, with -1 standing for "no file descriptor". Is
there a reason why this uses a separate variable to signal whether an
fd was provided?

Apart from that, this patch looks good to me.
Tycho Andersen Sept. 6, 2018, 4:22 p.m.
On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> [...]
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > index d1498885c1c7..1c0aab306426 100644
> > --- a/Documentation/userspace-api/seccomp_filter.rst
> > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> >          __u64 id;
> >          __s32 error;
> >          __s64 val;
> > +        __u8 return_fd;
> > +        __u32 fd;
> > +        __u32 fd_flags;
> 
> Normally,  syscalls that take an optional file descriptor accept a
> signed 32-bit number, with -1 standing for "no file descriptor". Is
> there a reason why this uses a separate variable to signal whether an
> fd was provided?

No real reason other than I looked at the bpf code and they were using
__u32 for bpf (but I think in their case the fd args are not
optional). I'll switch it to __s32/-1 for the next version.

> Apart from that, this patch looks good to me.

Thanks,

Tycho
Tycho Andersen Sept. 6, 2018, 6:30 p.m.
On Thu, Sep 06, 2018 at 10:22:46AM -0600, Tycho Andersen wrote:
> On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> > On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > The idea here is that the userspace handler should be able to pass an fd
> > > back to the trapped task, for example so it can be returned from socket().
> > [...]
> > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > > index d1498885c1c7..1c0aab306426 100644
> > > --- a/Documentation/userspace-api/seccomp_filter.rst
> > > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> > >          __u64 id;
> > >          __s32 error;
> > >          __s64 val;
> > > +        __u8 return_fd;
> > > +        __u32 fd;
> > > +        __u32 fd_flags;
> > 
> > Normally,  syscalls that take an optional file descriptor accept a
> > signed 32-bit number, with -1 standing for "no file descriptor". Is
> > there a reason why this uses a separate variable to signal whether an
> > fd was provided?
> 
> No real reason other than I looked at the bpf code and they were using
> __u32 for bpf (but I think in their case the fd args are not
> optional). I'll switch it to __s32/-1 for the next version.

Oh, I think there is a reason actually: since this is an API addition,
the "0" value needs to be the previously default behavior if userspace
doesn't specify it. Since the previously default behavior was not to
return an fd, and we want to allow fd == 0, we need the extra flag to
make this work.

This is really only a problem because we're introducing this stuff in
a second patch (mostly to illustrate how extending the response
structure would work). I could fold this into the first patch if we
want, or we could keep the return_fd bits if the illustration is
useful.

Tycho
Jann Horn via Containers Sept. 10, 2018, 5 p.m.
On Thu, Sep 6, 2018 at 8:30 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 06, 2018 at 10:22:46AM -0600, Tycho Andersen wrote:
> > On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> > > On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > The idea here is that the userspace handler should be able to pass an fd
> > > > back to the trapped task, for example so it can be returned from socket().
> > > [...]
> > > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > > > index d1498885c1c7..1c0aab306426 100644
> > > > --- a/Documentation/userspace-api/seccomp_filter.rst
> > > > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > > > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> > > >          __u64 id;
> > > >          __s32 error;
> > > >          __s64 val;
> > > > +        __u8 return_fd;
> > > > +        __u32 fd;
> > > > +        __u32 fd_flags;
> > >
> > > Normally,  syscalls that take an optional file descriptor accept a
> > > signed 32-bit number, with -1 standing for "no file descriptor". Is
> > > there a reason why this uses a separate variable to signal whether an
> > > fd was provided?
> >
> > No real reason other than I looked at the bpf code and they were using
> > __u32 for bpf (but I think in their case the fd args are not
> > optional). I'll switch it to __s32/-1 for the next version.
>
> Oh, I think there is a reason actually: since this is an API addition,
> the "0" value needs to be the previously default behavior if userspace
> doesn't specify it. Since the previously default behavior was not to
> return an fd, and we want to allow fd == 0, we need the extra flag to
> make this work.
>
> This is really only a problem because we're introducing this stuff in
> a second patch (mostly to illustrate how extending the response
> structure would work). I could fold this into the first patch if we
> want, or we could keep the return_fd bits if the illustration is
> useful.

I feel like adding extra struct fields just so that it is possible to
write programs against the intermediate new API between two kernel
commits is taking things a bit far.
Tycho Andersen Sept. 11, 2018, 8:29 p.m.
On Mon, Sep 10, 2018 at 07:00:43PM +0200, Jann Horn wrote:
> On Thu, Sep 6, 2018 at 8:30 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 06, 2018 at 10:22:46AM -0600, Tycho Andersen wrote:
> > > On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > The idea here is that the userspace handler should be able to pass an fd
> > > > > back to the trapped task, for example so it can be returned from socket().
> > > > [...]
> > > > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > > > > index d1498885c1c7..1c0aab306426 100644
> > > > > --- a/Documentation/userspace-api/seccomp_filter.rst
> > > > > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > > > > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> > > > >          __u64 id;
> > > > >          __s32 error;
> > > > >          __s64 val;
> > > > > +        __u8 return_fd;
> > > > > +        __u32 fd;
> > > > > +        __u32 fd_flags;
> > > >
> > > > Normally,  syscalls that take an optional file descriptor accept a
> > > > signed 32-bit number, with -1 standing for "no file descriptor". Is
> > > > there a reason why this uses a separate variable to signal whether an
> > > > fd was provided?
> > >
> > > No real reason other than I looked at the bpf code and they were using
> > > __u32 for bpf (but I think in their case the fd args are not
> > > optional). I'll switch it to __s32/-1 for the next version.
> >
> > Oh, I think there is a reason actually: since this is an API addition,
> > the "0" value needs to be the previously default behavior if userspace
> > doesn't specify it. Since the previously default behavior was not to
> > return an fd, and we want to allow fd == 0, we need the extra flag to
> > make this work.
> >
> > This is really only a problem because we're introducing this stuff in
> > a second patch (mostly to illustrate how extending the response
> > structure would work). I could fold this into the first patch if we
> > want, or we could keep the return_fd bits if the illustration is
> > useful.
> 
> I feel like adding extra struct fields just so that it is possible to
> write programs against the intermediate new API between two kernel
> commits is taking things a bit far.

Yep, I tend to agree with you. I'll fold the whole thing into the
first patch for the next version.

Tycho
Andy Lutomirski Sept. 12, 2018, 11:52 p.m.
On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
>
> I've proposed one API here, but I'm open to other options. In particular,
> this only lets you return an fd from a syscall, which may not be enough in
> all cases. For example, if an fd is written to an output parameter instead
> of returned, the current API can't handle this. Another case is that
> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> ever decides to install an fd and output it, we wouldn't be able to handle
> this either.

An alternative could be to have an API (an ioctl on the listener,
perhaps) that just copies an fd into the tracee.  There would be the
obvious set of options: do we replace an existing fd or allocate a new
one, and is it CLOEXEC.  Then the tracer could add an fd and then
return it just like it's a regular number.

I feel like this would be more flexible and conceptually simpler, but
maybe a little slower for the common cases.  What do you think?
Tycho Andersen Sept. 13, 2018, 9:25 a.m.
On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> 
> An alternative could be to have an API (an ioctl on the listener,
> perhaps) that just copies an fd into the tracee.  There would be the
> obvious set of options: do we replace an existing fd or allocate a new
> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> return it just like it's a regular number.
> 
> I feel like this would be more flexible and conceptually simpler, but
> maybe a little slower for the common cases.  What do you think?

Yes, I like this. It also future (or current-) proofs the API against
instances where we return an FD in a structure and not via the return
code of the syscall.

Tycho
Aleksa Sarai Sept. 13, 2018, 9:42 a.m.
On 2018-09-12, Andy Lutomirski <luto@amacapital.net> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> 
> An alternative could be to have an API (an ioctl on the listener,
> perhaps) that just copies an fd into the tracee.  There would be the
> obvious set of options: do we replace an existing fd or allocate a new
> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> return it just like it's a regular number.
> 
> I feel like this would be more flexible and conceptually simpler, but
> maybe a little slower for the common cases.  What do you think?

When we first discussed this I sent a (probably somewhat broken) patch
for "dup4" which would let you inject a file descriptor into a different
process -- I still think that having a method for injecting a file
descriptor without needing ptrace (and SCM_RIGHTS) shenanigans would be
generally useful. (With "dup4" you have a more obvious API for flags and
whether you allocate a new fd or use a specific one.)
Tycho Andersen Sept. 19, 2018, 9:55 a.m.
On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> 
> An alternative could be to have an API (an ioctl on the listener,
> perhaps) that just copies an fd into the tracee.  There would be the
> obvious set of options: do we replace an existing fd or allocate a new
> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> return it just like it's a regular number.
> 
> I feel like this would be more flexible and conceptually simpler, but
> maybe a little slower for the common cases.  What do you think?

I'm just implementing this now, and there's one question: when do we
actually do the fd install? Should we do it when the user calls
SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
like we should do it when the response is sent, instead of doing it
right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
subsequent signal and the tracer decides to discard the response,
we'll have to implement some delete mechanism to delete the fd, but it
would have already been visible to the process, etc. So I'll go
forward with this unless there are strong objections, but I thought
I'd point it out just to avoid another round trip.

Tycho
Andy Lutomirski Sept. 19, 2018, 2:19 p.m.
> On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> 
>> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
>>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>>> The idea here is that the userspace handler should be able to pass an fd
>>> back to the trapped task, for example so it can be returned from socket().
>>> 
>>> I've proposed one API here, but I'm open to other options. In particular,
>>> this only lets you return an fd from a syscall, which may not be enough in
>>> all cases. For example, if an fd is written to an output parameter instead
>>> of returned, the current API can't handle this. Another case is that
>>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
>>> ever decides to install an fd and output it, we wouldn't be able to handle
>>> this either.
>> 
>> An alternative could be to have an API (an ioctl on the listener,
>> perhaps) that just copies an fd into the tracee.  There would be the
>> obvious set of options: do we replace an existing fd or allocate a new
>> one, and is it CLOEXEC.  Then the tracer could add an fd and then
>> return it just like it's a regular number.
>> 
>> I feel like this would be more flexible and conceptually simpler, but
>> maybe a little slower for the common cases.  What do you think?
> 
> I'm just implementing this now, and there's one question: when do we
> actually do the fd install? Should we do it when the user calls
> SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> like we should do it when the response is sent, instead of doing it
> right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> subsequent signal and the tracer decides to discard the response,
> we'll have to implement some delete mechanism to delete the fd, but it
> would have already been visible to the process, etc. So I'll go
> forward with this unless there are strong objections, but I thought
> I'd point it out just to avoid another round trip.
> 
> 

Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?

Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
Tycho Andersen Sept. 19, 2018, 2:38 p.m.
On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > 
> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >>> The idea here is that the userspace handler should be able to pass an fd
> >>> back to the trapped task, for example so it can be returned from socket().
> >>> 
> >>> I've proposed one API here, but I'm open to other options. In particular,
> >>> this only lets you return an fd from a syscall, which may not be enough in
> >>> all cases. For example, if an fd is written to an output parameter instead
> >>> of returned, the current API can't handle this. Another case is that
> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> >>> ever decides to install an fd and output it, we wouldn't be able to handle
> >>> this either.
> >> 
> >> An alternative could be to have an API (an ioctl on the listener,
> >> perhaps) that just copies an fd into the tracee.  There would be the
> >> obvious set of options: do we replace an existing fd or allocate a new
> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> >> return it just like it's a regular number.
> >> 
> >> I feel like this would be more flexible and conceptually simpler, but
> >> maybe a little slower for the common cases.  What do you think?
> > 
> > I'm just implementing this now, and there's one question: when do we
> > actually do the fd install? Should we do it when the user calls
> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> > like we should do it when the response is sent, instead of doing it
> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> > subsequent signal and the tracer decides to discard the response,
> > we'll have to implement some delete mechanism to delete the fd, but it
> > would have already been visible to the process, etc. So I'll go
> > forward with this unless there are strong objections, but I thought
> > I'd point it out just to avoid another round trip.
> > 
> > 
> 
> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?

I was thinking we could just do an __alloc_fd() and then do the
fd_install() when the response is sent or clean up the case that the
listener or task dies. I haven't actually tried to run the code yet,
so it's possible the locking won't work :)

> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.

Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122

I could change that to just be a killable wait, though; I don't have
strong opinions about it and several people have commented that the
code is kind of weird.

Tycho
Andy Lutomirski Sept. 19, 2018, 7:58 p.m.
On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
>>
>>
>> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> >
>> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
>> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> >>> The idea here is that the userspace handler should be able to pass an fd
>> >>> back to the trapped task, for example so it can be returned from socket().
>> >>>
>> >>> I've proposed one API here, but I'm open to other options. In particular,
>> >>> this only lets you return an fd from a syscall, which may not be enough in
>> >>> all cases. For example, if an fd is written to an output parameter instead
>> >>> of returned, the current API can't handle this. Another case is that
>> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
>> >>> ever decides to install an fd and output it, we wouldn't be able to handle
>> >>> this either.
>> >>
>> >> An alternative could be to have an API (an ioctl on the listener,
>> >> perhaps) that just copies an fd into the tracee.  There would be the
>> >> obvious set of options: do we replace an existing fd or allocate a new
>> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
>> >> return it just like it's a regular number.
>> >>
>> >> I feel like this would be more flexible and conceptually simpler, but
>> >> maybe a little slower for the common cases.  What do you think?
>> >
>> > I'm just implementing this now, and there's one question: when do we
>> > actually do the fd install? Should we do it when the user calls
>> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
>> > like we should do it when the response is sent, instead of doing it
>> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
>> > subsequent signal and the tracer decides to discard the response,
>> > we'll have to implement some delete mechanism to delete the fd, but it
>> > would have already been visible to the process, etc. So I'll go
>> > forward with this unless there are strong objections, but I thought
>> > I'd point it out just to avoid another round trip.
>> >
>> >
>>
>> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?
>
> I was thinking we could just do an __alloc_fd() and then do the
> fd_install() when the response is sent or clean up the case that the
> listener or task dies. I haven't actually tried to run the code yet,
> so it's possible the locking won't work :)

I would be very surprised if the locking works.  How can you run a
thread in a process when another thread has allocated but not
installed an fd and is blocked for an arbitrarily long time?

>
>> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
>
> Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
>

I'm still not sure I see the problem.  Suppose I'm implementing a user
notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
the time I decide to install an fd, I've committed to returning
something other than -EINTR, even if a non-fatal signal is sent before
I finish.  No rollback should be necessary.

In the (unlikely?) event that some tracer needs to be able to rollback
an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
operation should be good enough, I think.  Or maybe PUT_FD can put -1
to delete an fd.

--Andy
Andy Lutomirski Sept. 21, 2018, 6:27 p.m.
On Fri, Sep 21, 2018 at 6:39 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> >
> > I think we just want the operation to cover all the cases.  Let PUT_FD
> > take a source fd and a dest fd.  If the source fd is -1, the dest is
> > closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> > the dest is -1, allocate an fd.  If the dest is >= 0, work like
> > dup2().  (The latter could be necessary to emulate things like, say,
> > dup2 :))
>
> ...then if we're going to allow overwriting fds, we'd need to lift out
> the logic from do_dup2 somewhere? Is this getting too complicated? :)
>

fds are complicated :-p

More seriously, though, I think it's okay if we don't support
everything out of the box.  getting the general semantics I suggested
is kind of nice because the resulting API is conceptually simple, even
if it encapsulates three cases.  But I'd be okay with only supporting
add-an-fd-at-an-unused-position and delete-an-fd out of the box --
more can be added if there's demand.

But I think that exposing an operation that allocates and reserves an
fd without putting anything in the slot is awkward, and it opens us up
to weird corner cases becoming visible that are currently there but
mostly hidden.  For example, what happens if someone overwrites a
reserved fd with dup2()?  (The answer is apparently -EBUSY -- see the
big comment in do_dup2() in fs/file.c.)  But there's a more
significant nastiness: what happens if someone abuses your new
mechanism to overwrite a reserved fd that belongs to a different
thread?  It looks like you'll hit the BUG_ON(fdt->fd[fd] != NULL); in
__fd_install().  So unless you actually track which unused fds you own
and enforce that the final installation installs in the right slot,
you have a problem.

BTW, socketpair() isn't the only thing that can add two fds.
recvmsg() can, too, as can pipe() and pipe2().  Some of the DRM ioctls
may as well for all I know.  But socketpair(), pipe(), and recvmsg()
can be credibly emulated by adding each fd in sequence and then
deleting them all of one fails.  Sure, this could race against dup2(),
but I'm not sure we care.

--Andy
Jann Horn via Containers Sept. 21, 2018, 8:46 p.m.
On Fri, Sep 21, 2018 at 3:39 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> > On Thu, Sep 20, 2018 at 4:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Wed, Sep 19, 2018 at 12:58:20PM -0700, Andy Lutomirski wrote:
> > > > On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> > > > >> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
> > > > >
> > > > > Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
> > > > >
> > > >
> > > > I'm still not sure I see the problem.  Suppose I'm implementing a user
> > > > notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
> > > > the time I decide to install an fd, I've committed to returning
> > > > something other than -EINTR, even if a non-fatal signal is sent before
> > > > I finish.  No rollback should be necessary.
> > >
> > > I don't understand why this is true. Surely you could stop a handler
> > > on receipt of a new signal, and have it do something else entirely?
> >
> > I think you *could*, but I'm not sure why you would.  In general,
> > syscalls never execute signal handlers mid-syscall.  There is a very
> > small number of syscalls that use sys_restart_syscall(), but I don't
> > think any of them allocate fds, and I'm not sure we need or want to
> > support them with user notifiers.  The rest of the syscalls will, if
> > they're behaving correct, either do *something* (like reading some or
> > all of a buffer) and return success or they'll do nothing and return
> > -EINTR.  Or they return an -ERESTARTSYS variant.  And then, only
> > *after* the syscall logically returns (i.e. completely finishes
> > processing and puts its return code into the relevant register) will a
> > signal be delivered.  In other words, the case where something like
> > recv() gets interrupted but still returns a success code does not mean
> > that a signal handler was called and then recv() resumed.  It means
> > that recv() noticed the signal, stopped receiving, returned the number
> > of bytes read, and then allowed the signal to be delivered.
> >
> > In the -ERESTARTSYS case, the syscall returns -ERESTARTSYS (or a
> > variant) and returns without doing anything.  But it returns in a
> > special case where, after the signal returns, the syscall will happen
> > again.
> >
> > So, for user notifiers, I think that any sane handler that notices a
> > non-fatal signal will do one of these things:
> >
> >  - Return -EINTR without changing any tracee state.
> >
> >  - Return success, possibly without blocking as long as it would have
> > without the signal.
> >
> >  - Return -ERESTARTSYS without changing any tracee state.
> >
> >  - Kill the tracee.
> >
> > None of these would involve backing out an fd that was already
> > installed.  I suppose another way of looking at this is that.
> >
> > Although... now that I think about it, there are some special cases,
> > like socketpair().  Look for put_unused_fd().  So maybe we need to
> > expose get_unused_fd_flags() and put_unused_fd(), but I think that
> > these are exceptions and will be very uncommon in the context of
> > seccomp user notifiers.  (For example, socketpair() can be implemented
> > almost correctly without put_unused_fd().)
>
> socketpair() is a good point. In particular, if we use this queuing
> thing I've done above, then you can only ever send one fd, and you'll
> need to send two here. So perhaps we really do need to do this as soon
> as the tracer calls ioctl(), vs queuing and waiting.
>
> > Hmm.  This does mean that we need a test case for a user notifier
> > returning -ERESTARTSYS.  It should Just Work (tm), but those are
> > famous last words.
> >
> > -ERESTARTSYS_RESTARTBLOCK is the case that I don't think we need to worry about.
> >
> > >
> > > > In the (unlikely?) event that some tracer needs to be able to rollback
> > > > an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
> > > > operation should be good enough, I think.  Or maybe PUT_FD can put -1
> > > > to delete an fd.
> > >
> > > Yes, I think even with something like what I did below we'd need some
> > > sort of REMOVE_FD option, because otherwise there's no way to change
> > > your mind and send -EINTR without the fd you just PUT_FD'd.
> > >
> >
> > I think we just want the operation to cover all the cases.  Let PUT_FD
> > take a source fd and a dest fd.  If the source fd is -1, the dest is
> > closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> > the dest is -1, allocate an fd.  If the dest is >= 0, work like
> > dup2().  (The latter could be necessary to emulate things like, say,
> > dup2 :))
>
> ...then if we're going to allow overwriting fds, we'd need to lift out
> the logic from do_dup2 somewhere? Is this getting too complicated? :)

In particular if you end up allowing overwriting fds of a remote task,
please add a scary warning to the code that does that, informing the
reader that that's only safe because you know that the target task is
stopped outside syscall context, and that it would be a very bad idea
to just copypaste that code to somewhere else. If someone tried doing
that to a single-threaded task that's in the middle of a syscall, the
results would be interesting - and by "interesting", I mean
"use-after-free on a struct file".
Tycho Andersen Sept. 21, 2018, 10:03 p.m.
On Fri, Sep 21, 2018 at 11:27:59AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 21, 2018 at 6:39 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> > >
> > > I think we just want the operation to cover all the cases.  Let PUT_FD
> > > take a source fd and a dest fd.  If the source fd is -1, the dest is
> > > closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> > > the dest is -1, allocate an fd.  If the dest is >= 0, work like
> > > dup2().  (The latter could be necessary to emulate things like, say,
> > > dup2 :))
> >
> > ...then if we're going to allow overwriting fds, we'd need to lift out
> > the logic from do_dup2 somewhere? Is this getting too complicated? :)
> >
> 
> fds are complicated :-p

:D

> More seriously, though, I think it's okay if we don't support
> everything out of the box.  getting the general semantics I suggested
> is kind of nice because the resulting API is conceptually simple, even
> if it encapsulates three cases.  But I'd be okay with only supporting
> add-an-fd-at-an-unused-position and delete-an-fd out of the box --
> more can be added if there's demand.

It's the delete/replace-an-fd one that has me worried. Anyway, I'll
take a look and see what I can figure out.

> But I think that exposing an operation that allocates and reserves an
> fd without putting anything in the slot is awkward, and it opens us up
> to weird corner cases becoming visible that are currently there but
> mostly hidden.  For example, what happens if someone overwrites a
> reserved fd with dup2()?  (The answer is apparently -EBUSY -- see the
> big comment in do_dup2() in fs/file.c.)  But there's a more
> significant nastiness: what happens if someone abuses your new
> mechanism to overwrite a reserved fd that belongs to a different
> thread?  It looks like you'll hit the BUG_ON(fdt->fd[fd] != NULL); in
> __fd_install().  So unless you actually track which unused fds you own
> and enforce that the final installation installs in the right slot,
> you have a problem.
> 
> BTW, socketpair() isn't the only thing that can add two fds.
> recvmsg() can, too, as can pipe() and pipe2().  Some of the DRM ioctls
> may as well for all I know.  But socketpair(), pipe(), and recvmsg()
> can be credibly emulated by adding each fd in sequence and then
> deleting them all of one fails.  Sure, this could race against dup2(),
> but I'm not sure we care.

Yup agreed. We need to do the install when the ioctl() is called.

Tycho
Tycho Andersen Sept. 25, 2018, 12:53 p.m.
On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> Hmm.  This does mean that we need a test case for a user notifier
> returning -ERESTARTSYS.  It should Just Work (tm), but those are
> famous last words.

Just to confirm, I've got a test case that works like this:

1. fork and install a SIGUSR1 handler
2. tracee does a syscall that gets trapped
3. send SIGUSR1
4. respond from the listener with -ERESTARTSYS
5. see another of the same syscall, even though the tracee still thinks
   its in the first one
6. respond with something reasonable, the tracee sees this response

I think that's the intended behavior. Note that when the listener
responds with -ERESTARTSYS and there is no signal pending, the task
just dies. That might be reasonable, I'm not sure.

Tycho