[v2,1/2] seccomp: notify user trap about unused filter

Submitted by Christian Brauner on May 28, 2020, 3:14 p.m.

Details

Message ID 20200528151412.265444-1-christian.brauner@ubuntu.com
State New
Series "Series without cover letter"
Headers show

Commit Message

Christian Brauner May 28, 2020, 3:14 p.m.
We've been making heavy use of the seccomp notifier to intercept and
handle certain syscalls for containers. This patch allows a syscall
supervisor listening on a given notifier to be notified when a seccomp
filter has become unused.

A container is often managed by a singleton supervisor process the
so-called "monitor". This monitor process has an event loop which has
various event handlers registered. If the user specified a seccomp
profile that included a notifier for various syscalls then we also
register a seccomp notify even handler. For any container using a
separate pid namespace the lifecycle of the seccomp notifier is bound to
the init process of the pid namespace, i.e. when the init process exits
the filter must be unused.
If a new process attaches to a container we force it to assume a seccomp
profile. This can either be the same seccomp profile as the container
was started with or a modified one. If the attaching process makes use
of the seccomp notifier we will register a new seccomp notifier handler
in the monitor's event loop. However, when the attaching process exits
we can't simply delete the handler since other child processes could've
been created (daemons spawned etc.) that have inherited the seccomp
filter and so we need to keep the seccomp notifier fd alive in the event
loop. But this is problematic since we don't get a notification when the
seccomp filter has become unused and so we currently never remove the
seccomp notifier fd from the event loop and just keep accumulating fds
in the event loop. We've had this issue for a while but it has recently
become more pressing as more and larger users make use of this.

To fix this, we introduce a new "live" reference counter that tracks the
live tasks making use of a given filter and when a notifier is
registered waiting tasks will be notified that the filter is now empty
by receiving a (E)POLLHUP event.
The concept in this patch introduces is the same as for signal_struct,
i.e. reference counting for life-cycle management is decoupled from
reference counting live taks using the object.

There's probably some trickery possible but the second counter is just
the correct way of doing this imho and has precedence. The patch also
lifts the waitqeue from struct notification into into sruct
seccomp_filter. This is cleaner overall and let's us avoid having to
take the notifier mutex since we neither need to read nor modify the
notifier specific aspects of the seccomp filter. In the exit path I'd
very much like to avoid having to take the notifier mutex for each
filter in the task's filter hierarchy.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Jann Horn <jannh@google.com>:
  - Use more descriptive instead of seccomp_filter_notify().
    (I went with seccomp_filter_release().)
---
 include/linux/seccomp.h |  5 +++++
 kernel/exit.c           |  2 ++
 kernel/seccomp.c        | 33 +++++++++++++++++++++++++++------
 3 files changed, 34 insertions(+), 6 deletions(-)


base-commit: b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce

Patch hide | download patch | download mbox

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4192369b8418..d72b5b43f8ea 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -84,6 +84,7 @@  static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern void seccomp_filter_release(const struct task_struct *tsk);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
@@ -93,6 +94,10 @@  static inline void get_seccomp_filter(struct task_struct *tsk)
 {
 	return;
 }
+static inline void seccomp_filter_release(const struct task_struct *tsk)
+{
+	return;
+}
 #endif /* CONFIG_SECCOMP_FILTER */
 
 #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
diff --git a/kernel/exit.c b/kernel/exit.c
index ce2a75bc0ade..b332e3635eb5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -193,6 +193,8 @@  void release_task(struct task_struct *p)
 
 	cgroup_release(p);
 
+	seccomp_filter_release(p);
+
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
 	thread_pid = get_pid(p->thread_pid);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..9fa642d6d549 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -94,13 +94,11 @@  struct seccomp_knotif {
  *           filter->notify_lock.
  * @next_id: The id of the next request.
  * @notifications: A list of struct seccomp_knotif elements.
- * @wqh: A wait queue for poll.
  */
 struct notification {
 	struct semaphore request;
 	u64 next_id;
 	struct list_head notifications;
-	wait_queue_head_t wqh;
 };
 
 /**
@@ -115,6 +113,10 @@  struct notification {
  * @prog: the BPF program to evaluate
  * @notif: the struct that holds all notification related information
  * @notify_lock: A lock for all notification-related accesses.
+ * @wqh: A wait queue for poll if a notifier is in use.
+ * @live: Number of tasks that use this filter directly and number
+ *	  of dependent filters that have a non-zero @live counter.
+ *	  Altered during fork(), exit(), and filter installation
  *
  * seccomp_filter objects are organized in a tree linked via the @prev
  * pointer.  For any task, it appears to be a singly-linked list starting
@@ -133,6 +135,8 @@  struct seccomp_filter {
 	struct bpf_prog *prog;
 	struct notification *notif;
 	struct mutex notify_lock;
+	refcount_t live;
+	wait_queue_head_t wqh;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -396,6 +400,7 @@  static inline void seccomp_sync_threads(unsigned long flags)
 		 * allows a put before the assignment.)
 		 */
 		put_seccomp_filter(thread);
+		seccomp_filter_release(thread);
 		smp_store_release(&thread->seccomp.filter,
 				  caller->seccomp.filter);
 
@@ -462,6 +467,8 @@  static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	}
 
 	refcount_set(&sfilter->usage, 1);
+	refcount_set(&sfilter->live, 1);
+	init_waitqueue_head(&sfilter->wqh);
 
 	return sfilter;
 }
@@ -564,6 +571,7 @@  void get_seccomp_filter(struct task_struct *tsk)
 	if (!orig)
 		return;
 	__get_seccomp_filter(orig);
+	refcount_inc(&orig->live);
 }
 
 static inline void seccomp_filter_free(struct seccomp_filter *filter)
@@ -590,6 +598,17 @@  void put_seccomp_filter(struct task_struct *tsk)
 	__put_seccomp_filter(tsk->seccomp.filter);
 }
 
+void seccomp_filter_release(const struct task_struct *tsk)
+{
+	struct seccomp_filter *orig = tsk->seccomp.filter;
+
+	while (orig && refcount_dec_and_test(&orig->live)) {
+		if (waitqueue_active(&orig->wqh))
+			wake_up_poll(&orig->wqh, EPOLLHUP);
+		orig = orig->prev;
+	}
+}
+
 static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason)
 {
 	clear_siginfo(info);
@@ -757,7 +776,7 @@  static int seccomp_do_user_notification(int this_syscall,
 	list_add(&n.list, &match->notif->notifications);
 
 	up(&match->notif->request);
-	wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
+	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
 	mutex_unlock(&match->notify_lock);
 
 	/*
@@ -1064,7 +1083,7 @@  static long seccomp_notify_recv(struct seccomp_filter *filter,
 	unotif.data = *(knotif->data);
 
 	knotif->state = SECCOMP_NOTIFY_SENT;
-	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
+	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
 	ret = 0;
 out:
 	mutex_unlock(&filter->notify_lock);
@@ -1200,7 +1219,7 @@  static __poll_t seccomp_notify_poll(struct file *file,
 	__poll_t ret = 0;
 	struct seccomp_knotif *cur;
 
-	poll_wait(file, &filter->notif->wqh, poll_tab);
+	poll_wait(file, &filter->wqh, poll_tab);
 
 	if (mutex_lock_interruptible(&filter->notify_lock) < 0)
 		return EPOLLERR;
@@ -1216,6 +1235,9 @@  static __poll_t seccomp_notify_poll(struct file *file,
 
 	mutex_unlock(&filter->notify_lock);
 
+	if (refcount_read(&filter->live) == 0)
+		ret |= EPOLLHUP;
+
 	return ret;
 }
 
@@ -1244,7 +1266,6 @@  static struct file *init_listener(struct seccomp_filter *filter)
 	sema_init(&filter->notif->request, 0);
 	filter->notif->next_id = get_random_u64();
 	INIT_LIST_HEAD(&filter->notif->notifications);
-	init_waitqueue_head(&filter->notif->wqh);
 
 	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
 				 filter, O_RDWR);

Comments

Kees Cook May 28, 2020, 11:11 p.m.
On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
>   * @usage: reference count to manage the object lifetime.
>   *         get/put helpers should be used when accessing an instance
>   *         outside of a lifetime-guarded section.  In general, this
>   *         is only needed for handling filters shared across tasks.
> [...]
> + * @live: Number of tasks that use this filter directly and number
> + *	  of dependent filters that have a non-zero @live counter.
> + *	  Altered during fork(), exit(), and filter installation
> [...]
>  	refcount_set(&sfilter->usage, 1);
> +	refcount_set(&sfilter->live, 1);

I'd like these reference counters to have more descriptive names. "usage"
by what? "live" from what perspective? At the least, I think we need
to be explicit in the comment, and at best we should do that and rename
them to be a bit more clear.

A filter's "usage" is incremented for each directly-attached task
(task::seccomp_data.filter, via fork() or thread_sync), once for the
dependent filter (filter::prev), and once for an open user_notif file
(file::private_data). When it reaches zero, there are (should be) no more
active memory references back to the struct filter and it can be freed.

A filter's "live" is incremented for each directly-attached task
(task::seccomp_data.filter, via fork() or thread_sync), and once for
the dependent filter (filter::prev). When it reaches zero there is no
way for new tasks to get associated with the filter, but there may still
be user_notif file::private_data references pointing at the filter.

But we're tracking "validity lifetime" (live) and "memory reference
safety" (usage).

signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
unhelpful name too. (And why isn't it refcount_t?)

So, perhaps leave "live", but rename "usage" -> "references".

After looking at these other lifetime management examples in the kernel,
I'm convinced that tracking these states separately is correct, but I
remain uncomfortable about task management needing to explicitly make
two calls to let go of the filter.

I wonder if release_task() should also detach the filter from the task
and do a put_seccomp_filter() instead of waiting for task_free(). This
is supported by the other place where seccomp_filter_release() is
called:

> @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  		 * allows a put before the assignment.)
>  		*/
>   		put_seccomp_filter(thread);
> +		seccomp_filter_release(thread);

This would also remove the only put_seccomp_filter() call outside of
seccomp.c, since the free_task() call will be removed now in favor of
the task_release() call.

So, is it safe to detach the filter in release_task()? Has dethreading
happened yet? i.e. can we race TSYNC? -- is there a possible
inc-from-zero? (Actually, all our refcount_inc()s should be
refcount_inc_not_zero() just for robustness.) I *think* we can do it
before the release_thread() call (instead of after cgroup_release()).

With that, then seccomp_filter_release() could assign the filter to NULL
and do the clean up:

void seccomp_filter_release(const struct task_struct *tsk)
{
	struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);

	smp_store_release(&tsk->seccomp.filter, NULL);
	__seccomp_filter_release(orig);
}

All other refcounting is then internal to seccomp.c. Which brings me
back to TSYNC, since we don't want to write NULL to task->seccomp.filter
during TSYNC. TSYNC can use:

void __seccomp_filter_release(struct seccomp_filter *filter)
{
	while (filter && refcount_dec_and_test(&filter->live)) {
		if (waitqueue_active(&filter->wqh))
			wake_up_poll(&filter->wqh, EPOLLHUP);
		filter = filter->prev;
	}
	__put_seccomp_filter(filter);
}

Thoughts?
Jann Horn via Containers May 28, 2020, 11:32 p.m.
On Fri, May 29, 2020 at 1:11 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> >   * @usage: reference count to manage the object lifetime.
> >   *         get/put helpers should be used when accessing an instance
> >   *         outside of a lifetime-guarded section.  In general, this
> >   *         is only needed for handling filters shared across tasks.
> > [...]
> > + * @live: Number of tasks that use this filter directly and number
> > + *     of dependent filters that have a non-zero @live counter.
> > + *     Altered during fork(), exit(), and filter installation
> > [...]
> >       refcount_set(&sfilter->usage, 1);
> > +     refcount_set(&sfilter->live, 1);
[...]
> After looking at these other lifetime management examples in the kernel,
> I'm convinced that tracking these states separately is correct, but I
> remain uncomfortable about task management needing to explicitly make
> two calls to let go of the filter.
>
> I wonder if release_task() should also detach the filter from the task
> and do a put_seccomp_filter() instead of waiting for task_free(). This
> is supported by the other place where seccomp_filter_release() is
> called:
>
> > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
> >                * allows a put before the assignment.)
> >               */
> >               put_seccomp_filter(thread);
> > +             seccomp_filter_release(thread);
>
> This would also remove the only put_seccomp_filter() call outside of
> seccomp.c, since the free_task() call will be removed now in favor of
> the task_release() call.
>
> So, is it safe to detach the filter in release_task()? Has dethreading
> happened yet? i.e. can we race TSYNC? -- is there a possible
> inc-from-zero?

release_task -> __exit_signal -> __unhash_process ->
list_del_rcu(&p->thread_node) drops us from the thread list under
siglock, which is the same lock TSYNC uses.

One other interesting thing that can look at seccomp state is
task_seccomp() in procfs - that can still happen at this point. At the
moment, procfs only lets you see the numeric filter state, not the
actual filter contents, so that's not a problem; but if we ever add a
procfs interface for dumping seccomp filters (in addition to the
ptrace interface that already exists), that's something to keep in
mind.

> (Actually, all our refcount_inc()s should be
> refcount_inc_not_zero() just for robustness.)

Eeeh... wouldn't that just make the code more complicated for no good reason?

> I *think* we can do it
> before the release_thread() call (instead of after cgroup_release()).
Kees Cook May 29, 2020, 5:36 a.m.
On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote:
> On Fri, May 29, 2020 at 1:11 AM Kees Cook <keescook@chromium.org> wrote:
> > So, is it safe to detach the filter in release_task()? Has dethreading
> > happened yet? i.e. can we race TSYNC? -- is there a possible
> > inc-from-zero?
> 
> release_task -> __exit_signal -> __unhash_process ->
> list_del_rcu(&p->thread_node) drops us from the thread list under
> siglock, which is the same lock TSYNC uses.

Ah, there it is. I missed the __unhash_process() in __exit_signal, but
once I saw the call to release_task(), I figured it was safe at that
point. So this seems correct:

> > I *think* we can do it
> > before the release_thread() call (instead of after cgroup_release()).

> One other interesting thing that can look at seccomp state is
> task_seccomp() in procfs - that can still happen at this point. At the
> moment, procfs only lets you see the numeric filter state, not the
> actual filter contents, so that's not a problem; but if we ever add a
> procfs interface for dumping seccomp filters (in addition to the
> ptrace interface that already exists), that's something to keep in
> mind.

Right -- but we can just reuse the get/put to pin the filter while
dumping it from proc (there IS someone working on this feature...)

> > (Actually, all our refcount_inc()s should be
> > refcount_inc_not_zero() just for robustness.)
> 
> Eeeh... wouldn't that just make the code more complicated for no good reason?

Sorry, ignore that. I got myself briefly confused -- we're fine;
refcount_inc() already does inc-from-zero checking.
Christian Brauner May 29, 2020, 7:47 a.m.
On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> >   * @usage: reference count to manage the object lifetime.
> >   *         get/put helpers should be used when accessing an instance
> >   *         outside of a lifetime-guarded section.  In general, this
> >   *         is only needed for handling filters shared across tasks.
> > [...]
> > + * @live: Number of tasks that use this filter directly and number
> > + *	  of dependent filters that have a non-zero @live counter.
> > + *	  Altered during fork(), exit(), and filter installation
> > [...]
> >  	refcount_set(&sfilter->usage, 1);
> > +	refcount_set(&sfilter->live, 1);
> 
> I'd like these reference counters to have more descriptive names. "usage"
> by what? "live" from what perspective? At the least, I think we need
> to be explicit in the comment, and at best we should do that and rename
> them to be a bit more clear.

Well the correct way would probably be:
"usage" -> "refs"
"live"  -> "users"

So we'd need a first patch to convert "usage" to "refs" and then
introduce "users".

> 
> A filter's "usage" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), once for the
> dependent filter (filter::prev), and once for an open user_notif file
> (file::private_data). When it reaches zero, there are (should be) no more
> active memory references back to the struct filter and it can be freed.
> 
> A filter's "live" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), and once for
> the dependent filter (filter::prev). When it reaches zero there is no
> way for new tasks to get associated with the filter, but there may still
> be user_notif file::private_data references pointing at the filter.

or - at least briefyl - ptrace or whatever, yes.

> 
> But we're tracking "validity lifetime" (live) and "memory reference
> safety" (usage).
> 
> signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
> unhelpful name too. (And why isn't it refcount_t?)

I think I once looked that up and there was some sort of "not needed, no
gain" style rationale.

> 
> So, perhaps leave "live", but rename "usage" -> "references".

usage -> refs
live  -> users/active

> 
> After looking at these other lifetime management examples in the kernel,
> I'm convinced that tracking these states separately is correct, but I
> remain uncomfortable about task management needing to explicitly make
> two calls to let go of the filter.
> 
> I wonder if release_task() should also detach the filter from the task
> and do a put_seccomp_filter() instead of waiting for task_free(). This
> is supported by the other place where seccomp_filter_release() is
> called:
> 
> > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
> >  		 * allows a put before the assignment.)
> >  		*/
> >   		put_seccomp_filter(thread);
> > +		seccomp_filter_release(thread);
> 
> This would also remove the only put_seccomp_filter() call outside of
> seccomp.c, since the free_task() call will be removed now in favor of
> the task_release() call.
> 
> So, is it safe to detach the filter in release_task()? Has dethreading
> happened yet? i.e. can we race TSYNC? -- is there a possible
> inc-from-zero? (Actually, all our refcount_inc()s should be
> refcount_inc_not_zero() just for robustness.) I *think* we can do it
> before the release_thread() call (instead of after cgroup_release()).
> 
> With that, then seccomp_filter_release() could assign the filter to NULL
> and do the clean up:
> 
> void seccomp_filter_release(const struct task_struct *tsk)
> {
> 	struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> 
> 	smp_store_release(&tsk->seccomp.filter, NULL);
> 	__seccomp_filter_release(orig);
> }
> 
> All other refcounting is then internal to seccomp.c. Which brings me
> back to TSYNC, since we don't want to write NULL to task->seccomp.filter
> during TSYNC. TSYNC can use:
> 
> void __seccomp_filter_release(struct seccomp_filter *filter)
> {
> 	while (filter && refcount_dec_and_test(&filter->live)) {
> 		if (waitqueue_active(&filter->wqh))
> 			wake_up_poll(&filter->wqh, EPOLLHUP);
> 		filter = filter->prev;
> 	}
> 	__put_seccomp_filter(filter);
> }
> 
> Thoughts?
> 
> -- 
> Kees Cook
Christian Brauner May 29, 2020, 7:51 a.m.
On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote:
> On Fri, May 29, 2020 at 1:11 AM Kees Cook <keescook@chromium.org> wrote:
> > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> > >   * @usage: reference count to manage the object lifetime.
> > >   *         get/put helpers should be used when accessing an instance
> > >   *         outside of a lifetime-guarded section.  In general, this
> > >   *         is only needed for handling filters shared across tasks.
> > > [...]
> > > + * @live: Number of tasks that use this filter directly and number
> > > + *     of dependent filters that have a non-zero @live counter.
> > > + *     Altered during fork(), exit(), and filter installation
> > > [...]
> > >       refcount_set(&sfilter->usage, 1);
> > > +     refcount_set(&sfilter->live, 1);
> [...]
> > After looking at these other lifetime management examples in the kernel,
> > I'm convinced that tracking these states separately is correct, but I
> > remain uncomfortable about task management needing to explicitly make
> > two calls to let go of the filter.
> >
> > I wonder if release_task() should also detach the filter from the task
> > and do a put_seccomp_filter() instead of waiting for task_free(). This
> > is supported by the other place where seccomp_filter_release() is
> > called:
> >
> > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
> > >                * allows a put before the assignment.)
> > >               */
> > >               put_seccomp_filter(thread);
> > > +             seccomp_filter_release(thread);
> >
> > This would also remove the only put_seccomp_filter() call outside of
> > seccomp.c, since the free_task() call will be removed now in favor of
> > the task_release() call.
> >
> > So, is it safe to detach the filter in release_task()? Has dethreading
> > happened yet? i.e. can we race TSYNC? -- is there a possible
> > inc-from-zero?
> 
> release_task -> __exit_signal -> __unhash_process ->
> list_del_rcu(&p->thread_node) drops us from the thread list under
> siglock, which is the same lock TSYNC uses.
> 
> One other interesting thing that can look at seccomp state is
> task_seccomp() in procfs - that can still happen at this point. At the
> moment, procfs only lets you see the numeric filter state, not the
> actual filter contents, so that's not a problem; but if we ever add a
> procfs interface for dumping seccomp filters (in addition to the
> ptrace interface that already exists), that's something to keep in
> mind.

Aside from this being not an issue now, can we please not dump seccomp
filter contents in proc. That sounds terrible and what's the rationale,
libseccomp already let's you dump filter contents while loading and you
could ptrace it. But maybe I'm missing a giant need for this...

Christian
Christian Brauner May 29, 2020, 7:56 a.m.
On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> >   * @usage: reference count to manage the object lifetime.
> >   *         get/put helpers should be used when accessing an instance
> >   *         outside of a lifetime-guarded section.  In general, this
> >   *         is only needed for handling filters shared across tasks.
> > [...]
> > + * @live: Number of tasks that use this filter directly and number
> > + *	  of dependent filters that have a non-zero @live counter.
> > + *	  Altered during fork(), exit(), and filter installation
> > [...]
> >  	refcount_set(&sfilter->usage, 1);
> > +	refcount_set(&sfilter->live, 1);
> 
> I'd like these reference counters to have more descriptive names. "usage"
> by what? "live" from what perspective? At the least, I think we need
> to be explicit in the comment, and at best we should do that and rename
> them to be a bit more clear.
> 
> A filter's "usage" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), once for the
> dependent filter (filter::prev), and once for an open user_notif file
> (file::private_data). When it reaches zero, there are (should be) no more
> active memory references back to the struct filter and it can be freed.
> 
> A filter's "live" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), and once for
> the dependent filter (filter::prev). When it reaches zero there is no
> way for new tasks to get associated with the filter, but there may still
> be user_notif file::private_data references pointing at the filter.
> 
> But we're tracking "validity lifetime" (live) and "memory reference
> safety" (usage).
> 
> signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
> unhelpful name too. (And why isn't it refcount_t?)
> 
> So, perhaps leave "live", but rename "usage" -> "references".
> 
> After looking at these other lifetime management examples in the kernel,
> I'm convinced that tracking these states separately is correct, but I
> remain uncomfortable about task management needing to explicitly make
> two calls to let go of the filter.
> 
> I wonder if release_task() should also detach the filter from the task
> and do a put_seccomp_filter() instead of waiting for task_free(). This
> is supported by the other place where seccomp_filter_release() is
> called:
> 
> > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
> >  		 * allows a put before the assignment.)
> >  		*/
> >   		put_seccomp_filter(thread);
> > +		seccomp_filter_release(thread);
> 
> This would also remove the only put_seccomp_filter() call outside of
> seccomp.c, since the free_task() call will be removed now in favor of
> the task_release() call.
> 
> So, is it safe to detach the filter in release_task()? Has dethreading
> happened yet? i.e. can we race TSYNC? -- is there a possible

So I've just gone through this and this is safe. I suspect that it was
moved to free_task() because it was not considered high-priority work,
i.e. nobody depended on the filter being released at a certain point.
That's changed with the notifier so yes, we could move this.

> inc-from-zero? (Actually, all our refcount_inc()s should be
> refcount_inc_not_zero() just for robustness.) I *think* we can do it

I think this was mentioned in another mail refcount_inc() does that
check. Believe me, I know that from other debugging experience. ;)

> before the release_thread() call (instead of after cgroup_release()).
> 
> With that, then seccomp_filter_release() could assign the filter to NULL
> and do the clean up:
> 
> void seccomp_filter_release(const struct task_struct *tsk)
> {
> 	struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> 
> 	smp_store_release(&tsk->seccomp.filter, NULL);

I need to go through the memory ordering requirements before I can say
yay or nay confidently to this. :)

> 	__seccomp_filter_release(orig);
> }
> 
> All other refcounting is then internal to seccomp.c. Which brings me
> back to TSYNC, since we don't want to write NULL to task->seccomp.filter
> during TSYNC. TSYNC can use:
> 
> void __seccomp_filter_release(struct seccomp_filter *filter)
> {
> 	while (filter && refcount_dec_and_test(&filter->live)) {
> 		if (waitqueue_active(&filter->wqh))
> 			wake_up_poll(&filter->wqh, EPOLLHUP);
> 		filter = filter->prev;
> 	}
> 	__put_seccomp_filter(filter);
> }
> 
> Thoughts?
> 
> -- 
> Kees Cook
Kees Cook May 29, 2020, 7:56 a.m.
On Fri, May 29, 2020 at 09:51:37AM +0200, Christian Brauner wrote:
> Aside from this being not an issue now, can we please not dump seccomp
> filter contents in proc. That sounds terrible and what's the rationale,
> libseccomp already let's you dump filter contents while loading and you
> could ptrace it. But maybe I'm missing a giant need for this...

The use-case comes from Android wanting to audit seccomp filters at
runtime. I think this is stalled until there is a good answer to "what
are you going to audit for, and how, given raw BPF?"
Christian Brauner May 29, 2020, 8 a.m.
On Fri, May 29, 2020 at 12:56:50AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 09:51:37AM +0200, Christian Brauner wrote:
> > Aside from this being not an issue now, can we please not dump seccomp
> > filter contents in proc. That sounds terrible and what's the rationale,
> > libseccomp already let's you dump filter contents while loading and you
> > could ptrace it. But maybe I'm missing a giant need for this...
> 
> The use-case comes from Android wanting to audit seccomp filters at
> runtime. I think this is stalled until there is a good answer to "what
> are you going to audit for, and how, given raw BPF?"

Doing this in proc seems very suboptimal why isn't this simply an
extension to the seccomp syscall (passing in a struct with the target's
pid or pidfd for example) to identify the target? But yeah, if there's
no real audit strategy all of that seems weird.

Christian
Kees Cook May 29, 2020, 8:02 a.m.
On Fri, May 29, 2020 at 09:47:44AM +0200, Christian Brauner wrote:
> Well the correct way would probably be:
> "usage" -> "refs"
> "live"  -> "users"

Yeah, I like it! :)

> So we'd need a first patch to convert "usage" to "refs" and then
> introduce "users".

Yup, sounds right.

> > signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
> > unhelpful name too. (And why isn't it refcount_t?)
> 
> I think I once looked that up and there was some sort of "not needed, no
> gain" style rationale.

hrm. it uses _inc and _dec_and_test... imo, that should make it be a
refcount_t. Even if we're not protecting some clear UAF issue, it's
still good to notification of potential bugs.
Kees Cook May 29, 2020, 8:06 a.m.
On Fri, May 29, 2020 at 09:56:41AM +0200, Christian Brauner wrote:
> On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> > void seccomp_filter_release(const struct task_struct *tsk)
> > {
> > 	struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> > 
> > 	smp_store_release(&tsk->seccomp.filter, NULL);
> 
> I need to go through the memory ordering requirements before I can say
> yay or nay confidently to this. :)
> 
> > 	__seccomp_filter_release(orig);
> > }

The only caller will be release_task() after dethread, so honestly this
was just me being paranoid. I don't think it actually needs the
READ_ONCE() nor the store_release. I think I wrote all that before I'd
convinced myself it was safe to remove a filter then. But I'm still
suspicious given the various ways release_task() gets called... I just
know that if mode 2 is set and filter == NULL, seccomp will fail closed,
so I went the paranoid route. :)
Christian Brauner May 29, 2020, 8:37 a.m.
On Fri, May 29, 2020 at 01:06:59AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 09:56:41AM +0200, Christian Brauner wrote:
> > On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> > > void seccomp_filter_release(const struct task_struct *tsk)
> > > {
> > > 	struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> > > 
> > > 	smp_store_release(&tsk->seccomp.filter, NULL);
> > 
> > I need to go through the memory ordering requirements before I can say
> > yay or nay confidently to this. :)
> > 
> > > 	__seccomp_filter_release(orig);
> > > }
> 
> The only caller will be release_task() after dethread, so honestly this
> was just me being paranoid. I don't think it actually needs the
> READ_ONCE() nor the store_release. I think I wrote all that before I'd
> convinced myself it was safe to remove a filter then. But I'm still
> suspicious given the various ways release_task() gets called... I just
> know that if mode 2 is set and filter == NULL, seccomp will fail closed,
> so I went the paranoid route. :)

release_task() should only be called once per thread otherwise we'd have
big problems. And every time we call release_task() we're already
EXIT_DEAD iirc. So there should be no way someone else can find us (in a
relevant way and especially not from userspace).
exit_notify() -> if we're autoreaping we're EXIT_DEAD anyway, if we're
not autoreaping we'll wait_task_zombie() eventually -> we're EXIT_DEAD
(parent has reaped us)

find_child_reaper() -> we couldn't find a child reaper for us, i.e. we
were (namespace) init -> unlink all tasks we were ptracing
(exit_ptrace()) and if they're EXIT_DEAD move them to the dead list ->
release_task()'s that are EXIT_DEAD that we ptraced.

and de_thread() relies on EXIT_DEAD too:
/*
 * We are going to release_task()->ptrace_unlink() silently,
 * the tracer can sleep in do_wait(). EXIT_DEAD guarantees
 * the tracer wont't block again waiting for this thread.
*/

(This is a very _rough_ sketch.)
So I think that's safe.

Christian