[4/8] seccomp: Implement constant action bitmaps

Submitted by Kees Cook on June 16, 2020, 7:49 a.m.

Details

Message ID 20200616074934.1600036-5-keescook@chromium.org
State New
Series "seccomp: Implement constant action bitmaps"

Commit Message

Kees Cook June 16, 2020, 7:49 a.m.
One of the most common pain points with seccomp filters has been dealing
with the overhead of processing the filters, especially for "always allow"
or "always reject" cases. While BPF is extremely fast[1], it will always
have overhead associated with it. Additionally, due to seccomp's design,
filters are layered, which means processing time goes up as the number
of filters attached goes up.

In the past, efforts have been focused on making filter execution complete
in a shorter amount of time. For example, filters were rewritten from
using linear if/then/else syscall search to using balanced binary trees,
or moving tests for syscalls common to the process's workload to the
front of the filter. However, there are limits to this, especially when
some processes are dealing with tens of filters[2], or when some
architectures have a less efficient BPF engine[3].

The most common use of seccomp, constructing syscall block/allow-lists
in which syscalls are always allowed or always rejected (without regard
to any arguments), also tends to produce the most pathological runtime
behavior, in that a large number of syscall checks in the filter must
be performed to come to a determination.

In order to optimize these cases from O(n) to O(1), seccomp can
use bitmaps to immediately determine the desired action. A critical
observation in the prior paragraph bears repeating: the common case for
syscall tests does not check arguments. For any given filter, there is a
constant mapping from the combination of architecture and syscall to the
seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
there is a single architecture.) As such, it is possible to construct
a mapping of arch/syscall to action, which can be updated as new filters
are attached to a process.
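The lookup this enables can be sketched in plain C (a userspace illustration of the idea only; `classify`, `map_set`, and the syscall count are made up for the sketch and are not the patch's code):

```c
#include <limits.h>
#include <stdbool.h>

#define NR_SYSCALLS 440 /* hypothetical count; the real value is per-arch */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS ((NR_SYSCALLS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* One constant-action bitmap set, in the spirit of struct seccomp_bitmaps. */
struct action_bitmaps {
	unsigned long allow[BITMAP_LONGS];
	unsigned long kill_process[BITMAP_LONGS];
	unsigned long kill_thread[BITMAP_LONGS];
};

static bool map_test(const unsigned long *map, unsigned int nr)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

static void map_set(unsigned long *map, unsigned int nr)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/*
 * O(1) classification: 1 = always allow, 2 = always kill process,
 * 3 = always kill thread, 0 = no constant mapping known, so the
 * full layered filters must be run.
 */
static int classify(const struct action_bitmaps *b, unsigned int nr)
{
	if (nr >= NR_SYSCALLS)
		return 0;
	if (map_test(b->allow, nr))
		return 1;
	if (map_test(b->kill_process, nr))
		return 2;
	if (map_test(b->kill_thread, nr))
		return 3;
	return 0;
}
```

Note that the patch initializes the "allow" bitmap to all-set and only ever clears bits as filters constrain syscalls, so the bitmaps can only ever be more conservative than the filters themselves.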

In order to build this mapping at filter attach time, each filter is
executed for every syscall (under each possible architecture), and
checked for accesses to any struct seccomp_data members other than
"arch" and "nr" (syscall). If only "arch" and "nr" are examined, then
there is a constant mapping for that syscall, and bitmaps can be updated
accordingly. If any accesses happen outside of those struct members,
seccomp must not bypass filter execution for that syscall, since program
state will be used to determine filter action result.

During syscall action probing, in order to determine whether other members
of struct seccomp_data are being accessed during a filter execution,
the struct is placed across a page boundary with the "arch" and "nr"
members in the first page, and everything else in the second page. The
"page accessed" flag is cleared in the second page's PTE, and the filter
is run. If the "page accessed" flag appears as set after running the
filter, we can determine that the filter looked beyond the "arch" and
"nr" members, and exclude that syscall from the constant action bitmaps.

For architectures to support this optimization, they must declare
their architectures for seccomp to see (via SECCOMP_ARCH and
SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Areas needing more attention:

On x86, this currently adds 168 bytes (or 336 bytes under CONFIG_COMPAT)
to the size of task_struct. Allocating these on demand may be a better
use of memory, but may not result in good cache locality.
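The 168-byte figure is just bitmap arithmetic: assuming 64-bit longs and roughly 440 syscalls (the exact NR_syscalls varies by kernel version), each DECLARE_BITMAP rounds up to 7 longs (56 bytes), and struct seccomp_bitmaps holds three of them:

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes added to task_struct by one struct seccomp_bitmaps. */
static unsigned long bitmaps_bytes(unsigned long nr_syscalls)
{
	unsigned long longs = (nr_syscalls + BITS_PER_LONG - 1) / BITS_PER_LONG;

	/* Three bitmaps: allow, kill_thread, kill_process. */
	return 3 * longs * sizeof(unsigned long);
}
```

Doubling this for the compat bitmaps gives the 336 bytes quoted under CONFIG_COMPAT.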

For "synthetic" architectures, such as x86_x32, additional work is
needed. It should be possible to define a simple
mechanism based on the masking done in the x86 syscall entry path to
create another set of bitmaps for seccomp to key off of. I am, however,
considering just making HAVE_ARCH_SECCOMP_BITMAP depend on !X86_X32.

[1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
[2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
[3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig            |   7 ++
 include/linux/seccomp.h |  15 +++
 kernel/seccomp.c        | 227 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 246 insertions(+), 3 deletions(-)


diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..4e692b7a4435 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,13 @@  config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_SECCOMP_BITMAP
+	bool
+	help
+	  An arch should select this symbol if it provides all of these things:
+	  - SECCOMP_ARCH (and SECCOMP_ARCH_COMPAT if appropriate)
+	  - local_flush_tlb_kernel_range()
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6525ddec177a..31ee2d6f4ec0 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -16,6 +16,17 @@ 
 #include <linux/atomic.h>
 #include <asm/seccomp.h>
 
+/* When no bits are set for a syscall, filters are run. */
+struct seccomp_bitmaps {
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+	/* "allow" are initialized to set and only ever get cleared. */
+	DECLARE_BITMAP(allow, NR_syscalls);
+	/* These are initialized to clear and only ever get set. */
+	DECLARE_BITMAP(kill_thread, NR_syscalls);
+	DECLARE_BITMAP(kill_process, NR_syscalls);
+#endif
+};
+
 struct seccomp_filter;
 /**
  * struct seccomp - the state of a seccomp'ed process
@@ -35,6 +46,10 @@  struct seccomp {
 #endif
 	atomic_t filter_count;
 	struct seccomp_filter *filter;
+	struct seccomp_bitmaps native;
+#ifdef CONFIG_COMPAT
+	struct seccomp_bitmaps compat;
+#endif
 };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 43edf53c2d84..2fbe7d2260f7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -44,6 +44,11 @@ 
 #include <linux/anon_inodes.h>
 #include <linux/lockdep.h>
 
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+#include <linux/pgtable.h>
+#include <asm/tlbflush.h>
+#endif
+
 enum notify_state {
 	SECCOMP_NOTIFY_INIT,
 	SECCOMP_NOTIFY_SENT,
@@ -476,6 +481,16 @@  static inline void seccomp_sync_threads(unsigned long flags)
 		atomic_set(&thread->seccomp.filter_count,
 			   atomic_read(&thread->seccomp.filter_count));
 
+		/* Copy syscall filter bitmaps. */
+		memcpy(&thread->seccomp.native,
+		       &caller->seccomp.native,
+		       sizeof(caller->seccomp.native));
+#ifdef CONFIG_COMPAT
+		memcpy(&thread->seccomp.compat,
+		       &caller->seccomp.compat,
+		       sizeof(caller->seccomp.compat));
+#endif
+
 		/*
 		 * Don't let an unprivileged task work around
 		 * the no_new_privs restriction by creating
@@ -578,6 +593,144 @@  seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+static inline bool sd_touched(pte_t *ptep)
+{
+	return !!pte_young(*(READ_ONCE(ptep)));
+}
+
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+/*
+ * We can build bitmaps only when an arch/nr combination reads nothing more
+ * than sd->nr and sd->arch, since those have a constant mapping to the
+ * syscall. To do this, we can run the filters for each syscall number, and
+ * examine the page table entry that is aligned to everything past sd->arch,
+ * checking for the ACCESSED flag.
+ *
+ * This approach could also be used to test for access to sd->arch too,
+ * if we wanted to warn about compat-unsafe filters.
+ */
+static void seccomp_update_bitmap(struct seccomp_filter *filter,
+				  void *pagepair, u32 arch,
+				  struct seccomp_bitmaps *bitmaps)
+{
+	struct seccomp_data *sd;
+	unsigned long vaddr;
+	u32 nr, ret;
+	pte_t *ptep;
+	u64 check;
+
+	/* Initialize bitmaps for first filter. */
+	if (!filter->prev)
+		bitmap_fill(bitmaps->allow, NR_syscalls);
+	/*
+	 * Prepare to detect memory accesses: find the PTE for the second page
+	 * in the page pair.
+	 */
+	vaddr = (unsigned long)(pagepair + PAGE_SIZE);
+	ptep = virt_to_kpte(vaddr);
+	/*
+	 * Split struct seccomp_data across two pages, with everything after
+	 * sd->arch (i.e. starting with sd->instruction_pointer), in the second
+	 * page of the page pair.
+	 */
+	sd = pagepair + PAGE_SIZE - offsetof(struct seccomp_data, instruction_pointer);
+
+	/* Mark the second page as untouched (i.e. "old") */
+	preempt_disable();
+	set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
+	local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+	preempt_enable();
+	/* Make sure the PTE agrees that it is untouched. */
+	if (WARN_ON_ONCE(sd_touched(ptep)))
+		return;
+	/* Read a portion of struct seccomp_data from the second page. */
+	check = sd->instruction_pointer;
+	/* First, verify the contents are zero from vzalloc(). */
+	if (WARN_ON_ONCE(check))
+		return;
+	/* Now make sure the ACCESSED bit has been set after the read. */
+	if (!sd_touched(ptep)) {
+		/*
+		 * If autodetection fails, fall back to standard behavior by
+		 * clearing the entire "allow" bitmap.
+		 */
+		pr_warn_once("seccomp: cannot build automatic syscall filters\n");
+		bitmap_zero(bitmaps->allow, NR_syscalls);
+		return;
+	}
+
+	/*
+	 * For every syscall, if we don't already know we need to run
+	 * the full filter, simulate the filter with our static values.
+	 */
+	for (nr = 0; nr < NR_syscalls; nr++) {
+		/* Are we already at the maximal rejection state? */
+		if (test_bit(nr, bitmaps->kill_process))
+			continue;
+
+		sd->nr = nr;
+		sd->arch = arch;
+
+		/* Do we need to reset the ACCESSED bit? */
+		if (sd_touched(ptep)) {
+			preempt_disable();
+			set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
+			local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+			preempt_enable();
+		}
+
+		/* Evaluate filter for this syscall. */
+		ret = bpf_prog_run_pin_on_cpu(filter->prog, sd);
+		/*
+		 * If this run through the filter didn't access
+		 * beyond "arch", we know the result is a constant
+		 * mapping for arch/nr -> ret.
+		 */
+		if (!sd_touched(ptep)) {
+			/* Constant evaluation. Mark appropriate bitmaps. */
+			switch (ret) {
+			case SECCOMP_RET_KILL_PROCESS:
+				set_bit(nr, bitmaps->kill_process);
+				break;
+			case SECCOMP_RET_KILL_THREAD:
+				set_bit(nr, bitmaps->kill_thread);
+				break;
+			default:
+				break;
+			case SECCOMP_RET_ALLOW:
+				/*
+				 * If we always map to allow, there are
+				 * no changes needed to the bitmaps.
+				 */
+				continue;
+			}
+		}
+
+		/*
+		 * Dynamic evaluation of syscall, or non-allow constant
+		 * mapping to something other than SECCOMP_RET_ALLOW: we
+		 * must not short-circuit-allow it anymore.
+		 */
+		clear_bit(nr, bitmaps->allow);
+	}
+}
+
+static void seccomp_update_bitmaps(struct seccomp_filter *filter,
+				   void *pagepair)
+{
+	seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH,
+			      &current->seccomp.native);
+#ifdef CONFIG_COMPAT
+	seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH_COMPAT,
+			      &current->seccomp.compat);
+#endif
+}
+#else
+static void seccomp_update_bitmaps(struct seccomp_filter *filter,
+				   void *pagepair)
+{ }
+#endif
+
 /**
  * seccomp_attach_filter: validate and attach filter
  * @flags:  flags to change filter behavior
@@ -591,7 +744,8 @@  seccomp_prepare_user_filter(const char __user *user_filter)
  *   - in NEW_LISTENER mode: the fd of the new listener
  */
 static long seccomp_attach_filter(unsigned int flags,
-				  struct seccomp_filter *filter)
+				  struct seccomp_filter *filter,
+				  void *pagepair)
 {
 	unsigned long total_insns;
 	struct seccomp_filter *walker;
@@ -630,6 +784,9 @@  static long seccomp_attach_filter(unsigned int flags,
 	current->seccomp.filter = filter;
 	atomic_inc(&current->seccomp.filter_count);
 
+	/* Evaluate filter for new known-outcome syscalls */
+	seccomp_update_bitmaps(filter, pagepair);
+
 	/* Now that the new filter is in place, synchronize to all threads. */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		seccomp_sync_threads(flags);
@@ -857,6 +1014,56 @@  static int seccomp_do_user_notification(int this_syscall,
 	return -1;
 }
 
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
+				   u32 nr, u32 *filter_ret)
+{
+	if (nr < NR_syscalls) {
+		if (test_bit(nr, bitmaps->allow)) {
+			*filter_ret = SECCOMP_RET_ALLOW;
+			return true;
+		}
+		if (test_bit(nr, bitmaps->kill_process)) {
+			*filter_ret = SECCOMP_RET_KILL_PROCESS;
+			return true;
+		}
+		if (test_bit(nr, bitmaps->kill_thread)) {
+			*filter_ret = SECCOMP_RET_KILL_THREAD;
+			return true;
+		}
+	}
+	return false;
+}
+
+static inline u32 check_syscall(const struct seccomp_data *sd,
+				struct seccomp_filter **match)
+{
+	u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
+
+#ifdef CONFIG_COMPAT
+	if (sd->arch == SECCOMP_ARCH) {
+#endif
+		if (__bypass_filter(&current->seccomp.native, sd->nr, &filter_ret))
+			return filter_ret;
+#ifdef CONFIG_COMPAT
+	} else if (sd->arch == SECCOMP_ARCH_COMPAT) {
+		if (__bypass_filter(&current->seccomp.compat, sd->nr, &filter_ret))
+			return filter_ret;
+	} else {
+		WARN_ON_ONCE(1);
+		return filter_ret;
+	}
+#endif
+	return seccomp_run_filters(sd, match);
+}
+#else
+static inline u32 check_syscall(const struct seccomp_data *sd,
+				struct seccomp_filter **match)
+{
+	return seccomp_run_filters(sd, match);
+}
+#endif
+
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
 {
@@ -876,7 +1083,7 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		sd = &sd_local;
 	}
 
-	filter_ret = seccomp_run_filters(sd, &match);
+	filter_ret = check_syscall(sd, &match);
 	data = filter_ret & SECCOMP_RET_DATA;
 	action = filter_ret & SECCOMP_RET_ACTION_FULL;
 
@@ -1346,6 +1553,7 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	long ret = -EINVAL;
 	int listener = -1;
 	struct file *listener_f = NULL;
+	void *pagepair;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -1391,12 +1599,24 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
 		goto out_put_fd;
 
+	/*
+	 * This memory will be needed for bitmap testing, but we'll
+	 * be holding a spinlock at that point. Do the allocation
+	 * (and free) outside of the lock.
+	 *
+	 * Alternative: we could do the bitmap update before attach
+	 * to avoid spending too much time under lock.
+	 */
+	pagepair = vzalloc(PAGE_SIZE * 2);
+	if (!pagepair)
+		goto out_put_fd;
+
 	spin_lock_irq(&current->sighand->siglock);
 
 	if (!seccomp_may_assign_mode(seccomp_mode))
 		goto out;
 
-	ret = seccomp_attach_filter(flags, prepared);
+	ret = seccomp_attach_filter(flags, prepared, pagepair);
 	if (ret)
 		goto out;
 	/* Do not free the successfully attached filter. */
@@ -1405,6 +1625,7 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	seccomp_assign_mode(current, seccomp_mode, flags);
 out:
 	spin_unlock_irq(&current->sighand->siglock);
+	vfree(pagepair);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
 out_put_fd:

Comments

Jann Horn via Containers June 16, 2020, 12:14 p.m.
On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <keescook@chromium.org> wrote:
> [...]
>
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K

Something like (completely untested):

/*
 * Try to statically determine whether @filter will always return a fixed result
 * when run for syscall @nr under architecture @arch.
 * Returns true if the result could be determined; if so, the result will be
 * stored in @action.
 */
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
                                  unsigned int nr, unsigned int *action)
{
  int pc;
  unsigned int reg_value = 0;

  for (pc = 0; 1; pc++) {
    struct sock_filter *insn = &filter[pc];
    u16 code = insn->code;
    u32 k = insn->k;

    switch (code) {
    case BPF_LD | BPF_W | BPF_ABS:
      if (k == offsetof(struct seccomp_data, nr)) {
        reg_value = nr;
      } else if (k == offsetof(struct seccomp_data, arch)) {
        reg_value = arch;
      } else {
        return false; /* can't optimize (non-constant value load) */
      }
      break;
    case BPF_RET | BPF_K:
      *action = insn->k;
      return true; /* success: reached return with constant values only */
    case BPF_JMP | BPF_JA:
      pc += insn->k;
      break;
    case BPF_JMP | BPF_JEQ | BPF_K:
    case BPF_JMP | BPF_JGE | BPF_K:
    case BPF_JMP | BPF_JGT | BPF_K:
    default:
      if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
        u16 op = BPF_OP(code);
        bool op_res;

        switch (op) {
        case BPF_JEQ:
          op_res = reg_value == k;
          break;
        case BPF_JGE:
          op_res = reg_value >= k;
          break;
        case BPF_JGT:
          op_res = reg_value > k;
          break;
        default:
          return false; /* can't optimize (unknown insn) */
        }

        pc += op_res ? insn->jt : insn->jf;
        break;
      }
      return false; /* can't optimize (unknown insn) */
    }
  }
}

That way, you won't need any of this complicated architecture-specific stuff.
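To make the idea concrete, here is a cut-down, userspace version of such an evaluator run against a tiny allow-list filter (a hypothetical sketch: only the JEQ subset is handled, and the BPF and seccomp constants are inlined rather than taken from <linux/filter.h> and <linux/seccomp.h>):

```c
#include <stdbool.h>
#include <stdint.h>

/* Inlined cBPF opcode constants (values as in <linux/bpf_common.h>). */
#define BPF_LD   0x00
#define BPF_W    0x00
#define BPF_ABS  0x20
#define BPF_JMP  0x05
#define BPF_JEQ  0x10
#define BPF_RET  0x06
#define BPF_K    0x00

#define SECCOMP_RET_KILL_PROCESS 0x80000000u
#define SECCOMP_RET_ALLOW        0x7fff0000u

/* struct seccomp_data offsets: nr at byte 0, arch at byte 4. */
#define SD_OFF_NR   0
#define SD_OFF_ARCH 4

struct sock_filter { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Cut-down evaluator: bails out on anything beyond the JEQ subset. */
static bool check_syscall_const(const struct sock_filter *filter,
				uint32_t arch, uint32_t nr, uint32_t *action)
{
	uint32_t reg = 0;

	for (int pc = 0; ; pc++) {
		const struct sock_filter *insn = &filter[pc];

		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (insn->k == SD_OFF_NR)
				reg = nr;
			else if (insn->k == SD_OFF_ARCH)
				reg = arch;
			else
				return false; /* reads beyond nr/arch */
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (reg == insn->k) ? insn->jt : insn->jf;
			break;
		case BPF_RET | BPF_K:
			*action = insn->k; /* constant result found */
			return true;
		default:
			return false; /* unhandled: run the real filter */
		}
	}
}

/* "kill unless arch matches; allow only syscall 0" (0xc000003e is x86-64). */
static const struct sock_filter demo_filter[] = {
	{ BPF_LD | BPF_W | BPF_ABS, 0, 0, SD_OFF_ARCH },
	{ BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 0xc000003eu },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_KILL_PROCESS },
	{ BPF_LD | BPF_W | BPF_ABS, 0, 0, SD_OFF_NR },
	{ BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 0 },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_KILL_PROCESS },
	{ BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW },
};
```

Running this per arch/nr at attach time yields exactly the constant-mapping information the bitmaps need, without touching page tables.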
Dave Hansen June 16, 2020, 2:40 p.m.
On 6/16/20 12:49 AM, Kees Cook wrote:
> +	/* Mark the second page as untouched (i.e. "old") */
> +	preempt_disable();
> +	set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
> +	local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> +	preempt_enable();

If you can, I'd wrap that nugget up in a helper.  I'd also suggest being
very explicit in a comment about what it is trying to do: ensure no TLB
entries exist so that a future access will always set the Accessed bit.

> +	/* Make sure the PTE agrees that it is untouched. */
> +	if (WARN_ON_ONCE(sd_touched(ptep)))
> +		return;
> +	/* Read a portion of struct seccomp_data from the second page. */
> +	check = sd->instruction_pointer;
> +	/* First, verify the contents are zero from vzalloc(). */
> +	if (WARN_ON_ONCE(check))
> +		return;
> +	/* Now make sure the ACCESSED bit has been set after the read. */
> +	if (!sd_touched(ptep)) {
> +		/*
> +		 * If autodetection fails, fall back to standard behavior by
> +		 * clearing the entire "allow" bitmap.
> +		 */
> +		pr_warn_once("seccomp: cannot build automatic syscall filters\n");
> +		bitmap_zero(bitmaps->allow, NR_syscalls);
> +		return;
> +	}

I can't find any big holes with this.  It's the kind of code that makes
me nervous, but mostly because it's pretty different that anything else
we have in the kernel.

It's also clear to me here that you probably have a slightly different
expectation of what the PTE accessed flag means versus the hardware
guys.  What you are looking for it to mean is roughly: "a retired
instruction touched this page".

The hardware guys would probably say it's closer to "a TLB entry was
established for this page."  Remember that TLB entries can be
established speculatively or from things like prefetchers.  While I
don't know of anything microarchitectural today which would trip this
mechanism, it's entirely possible that something in the future might.
Accessing close to the page boundary is the exact kind of place folks
might want to optimize.

*But*, at least it would err in the direction of being conservative.  It
would say "somebody touched the page!" more often than it should, but
never _less_ often than it should.

One thing about the implementation (which is roughly):

	// Touch the data:
	check = sd->instruction_pointer;
	// Examine the PTE mapping that data:
	if (!sd_touched(ptep)) {
		// something
	}

There aren't any barriers in there, which could lead to the sd_touched()
check being ordered before the data touch.  I think a rmb() will
suffice.  You could even do it inside sd_touched().

Was there a reason you chose to export a ranged TLB flush?  I probably
would have just used the single-page flush_tlb_one_kernel() for this
purpose if I were working in arch-specific code.
Kees Cook June 16, 2020, 3:48 p.m.
On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> Wouldn't it be simpler to use a function that can run a subset of
> seccomp cBPF and bails out on anything that indicates that a syscall's
> handling is complex or on instructions it doesn't understand? For
> syscalls that have a fixed policy, a typical seccomp filter doesn't
> even use any of the BPF_ALU ops, the scratch space, or the X register;
> it just uses something like the following set of operations, which is
> easy to emulate without much code:
> 
> BPF_LD | BPF_W | BPF_ABS
> BPF_JMP | BPF_JEQ | BPF_K
> BPF_JMP | BPF_JGE | BPF_K
> BPF_JMP | BPF_JGT | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K

Initially, I started down this path. It needed a bit of plumbing into
BPF to better control the lifetime of the cBPF "saved original filter"
(normally used by CHECKPOINT_RESTORE users), and then I needed to keep
making exceptions (same list you have: ALU, X register, scratch, etc)
in the name of avoiding too much complexity in the emulator. I decided
I'd rather reuse the existing infrastructure to actually execute the
filter (no cBPF copy needed to be saved, no separate code, and full
instruction coverage).

> 
> Something like (completely untested):
> 
> [...]

I didn't actually finish going down the emulator path (I stopped right
around the time I verified that libseccomp does use BPF_ALU -- though
only BPF_AND), so I didn't actually evaluate the filter contents for other
filter builders (i.e. Chrome).

But, if BPF_ALU | BPF_AND were added to your code above, it would cover
everything libseccomp generates (which covers a lot of the seccomp
filters, e.g. systemd, docker). I just felt funny about an "incomplete"
emulator.

Though now you've got me looking. It seems this is the core
of Chrome's BPF instruction generation:
https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
It also uses ALU|AND, but adds JMP|JSET.

So... that's only 2 more instructions to cover what I think are likely
the two largest seccomp instruction generators.
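For illustration, those two cases might slot into such an emulator like this (a hedged sketch, not code from the thread; the opcode values are the standard cBPF encodings, and `emulate_extra` is a hypothetical helper name):

```c
#include <stdint.h>

/* Opcode values as in <linux/bpf_common.h>. */
#define BPF_ALU_AND_K  0x54 /* BPF_ALU | BPF_AND | BPF_K */
#define BPF_JMP_JSET_K 0x45 /* BPF_JMP | BPF_JSET | BPF_K */

struct sock_filter { uint16_t code; uint8_t jt, jf; uint32_t k; };

/*
 * Hypothetical extra cases for the emulator's switch: applies one
 * instruction to the accumulator *reg and returns the relative jump
 * taken (0 for the ALU op), or -1 for anything still unhandled.
 */
static int emulate_extra(const struct sock_filter *insn, uint32_t *reg)
{
	switch (insn->code) {
	case BPF_ALU_AND_K:     /* A &= k (emitted by libseccomp) */
		*reg &= insn->k;
		return 0;
	case BPF_JMP_JSET_K:    /* if (A & k) goto jt else goto jf (Chrome) */
		return (*reg & insn->k) ? insn->jt : insn->jf;
	default:
		return -1;      /* bail out: cannot optimize */
	}
}
```

As with the rest of the emulator, an unhandled opcode simply means "no constant mapping", which fails safely toward running the real filter.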

> That way, you won't need any of this complicated architecture-specific stuff.

There are two arch-specific needs, and using a cBPF-subset emulator
just gets rid of the local TLB flush. The other part is distinguishing
the archs. Neither requirement is onerous (TLB flush usually just
needs little more than an extern, arch is already documented in the
per-arch syscall_get_arch()). The awkward part I ran into for arm64
was a header include loop for compat due to how unistd is handled for
getting NR_syscalls for the bitmap sizing (which I'm sure is solvable,
but I just wanted to get the x86 RFC posted first).
Kees Cook June 16, 2020, 4:01 p.m.
On Tue, Jun 16, 2020 at 07:40:17AM -0700, Dave Hansen wrote:
> On 6/16/20 12:49 AM, Kees Cook wrote:
> > +	/* Mark the second page as untouched (i.e. "old") */
> > +	preempt_disable();
> > +	set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
> > +	local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> > +	preempt_enable();
> 
> If you can, I'd wrap that nugget up in a helper.  I'd also suggest being
> very explicit in a comment about what it is trying to do: ensure no TLB
> entries exist so that a future access will always set the Accessed bit.

Yeah, good idea!

> 
> > +	/* Make sure the PTE agrees that it is untouched. */
> > +	if (WARN_ON_ONCE(sd_touched(ptep)))
> > +		return;
> > +	/* Read a portion of struct seccomp_data from the second page. */
> > +	check = sd->instruction_pointer;
> > +	/* First, verify the contents are zero from vzalloc(). */
> > +	if (WARN_ON_ONCE(check))
> > +		return;
> > +	/* Now make sure the ACCESSED bit has been set after the read. */
> > +	if (!sd_touched(ptep)) {
> > +		/*
> > +		 * If autodetection fails, fall back to standard behavior by
> > +		 * clearing the entire "allow" bitmap.
> > +		 */
> > +		pr_warn_once("seccomp: cannot build automatic syscall filters\n");
> > +		bitmap_zero(bitmaps->allow, NR_syscalls);
> > +		return;
> > +	}
> 
> I can't find any big holes with this.  It's the kind of code that makes
> me nervous, but mostly because it's pretty different than anything else
> we have in the kernel.
> 
> It's also clear to me here that you probably have a slightly different
> expectation of what the PTE accessed flag means versus the hardware
> guys.  What you are looking for it to mean is roughly: "a retired
> instruction touched this page".
> 
> The hardware guys would probably say it's closer to "a TLB entry was
> established for this page."  Remember that TLB entries can be
> established speculatively or from things like prefetchers.  While I
> don't know of anything microarchitectural today which would trip this
> mechanism, it's entirely possible that something in the future might.
> Accessing close to the page boundary is the exact kind of place folks
> might want to optimize.

Yeah, and to that end, going the cBPF emulator route removes this kind
of "weird" behavior.

> 
> *But*, at least it would err in the direction of being conservative.  It
> would say "somebody touched the page!" more often than it should, but
> never _less_ often than it should.

Right -- I made sure to design the bitmaps and the direction of the
checking to fail towards running the filter instead of bypassing it.

> One thing about the implementation (which is roughly):
> 
> 	// Touch the data:
> 	check = sd->instruction_pointer;
> 	// Examine the PTE mapping that data:
> 	if (!sd_touched(ptep)) {
> 		// something
> 	}
> 
> There aren't any barriers in there, which could lead to the sd_touched()
> check being ordered before the data touch.  I think a rmb() will
> suffice.  You could even do it inside sd_touched().

Ah yeah, I had convinced myself that READ_ONCE() gained me that
coverage, but I guess that's not actually true here.

> Was there a reason you chose to export a ranged TLB flush?  I probably
> would have just used the single-page flush_tlb_one_kernel() for this
> purpose if I were working in arch-specific code.

No particular reason -- it just seemed easiest to make available given
the interfaces. I could do the single-page version instead, if this way
of doing things survives review. ;)

Thanks for looking at it!
Jann Horn via Containers June 16, 2020, 6:36 p.m.
On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <keescook@chromium.org> wrote:
> On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> > Wouldn't it be simpler to use a function that can run a subset of
> > seccomp cBPF and bails out on anything that indicates that a syscall's
> > handling is complex or on instructions it doesn't understand? For
> > syscalls that have a fixed policy, a typical seccomp filter doesn't
> > even use any of the BPF_ALU ops, the scratch space, or the X register;
> > it just uses something like the following set of operations, which is
> > easy to emulate without much code:
> >
> > BPF_LD | BPF_W | BPF_ABS
> > BPF_JMP | BPF_JEQ | BPF_K
> > BPF_JMP | BPF_JGE | BPF_K
> > BPF_JMP | BPF_JGT | BPF_K
> > BPF_JMP | BPF_JA
> > BPF_RET | BPF_K
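For concreteness, an interpreter for just the subset listed above fits in a few dozen lines. The sketch below is purely illustrative (userspace stand-ins for `struct sock_filter` and `struct seccomp_data`; opcode values follow the classic BPF encoding from `<linux/filter.h>`; it is not the code under discussion):

```c
#include <stdint.h>
#include <stddef.h>

struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct sd   { uint32_t nr; uint32_t arch; };	/* constant fields only */

/* Classic BPF opcode encodings for the supported subset */
#define LD_W_ABS  0x20	/* BPF_LD  | BPF_W   | BPF_ABS */
#define JMP_JA    0x05	/* BPF_JMP | BPF_JA */
#define JMP_JEQ_K 0x15	/* BPF_JMP | BPF_JEQ | BPF_K */
#define JMP_JGT_K 0x25	/* BPF_JMP | BPF_JGT | BPF_K */
#define JMP_JGE_K 0x35	/* BPF_JMP | BPF_JGE | BPF_K */
#define RET_K     0x06	/* BPF_RET | BPF_K */

/* Returns the filter's constant result for (nr, arch), or -1 to signal
 * "too complex, fall back to real BPF execution". */
static int64_t run_subset(const struct insn *f, size_t len,
			  uint32_t nr, uint32_t arch)
{
	uint32_t acc = 0;

	for (size_t pc = 0; pc < len; pc++) {
		const struct insn *i = &f[pc];

		switch (i->code) {
		case LD_W_ABS:
			if (i->k == offsetof(struct sd, nr))
				acc = nr;
			else if (i->k == offsetof(struct sd, arch))
				acc = arch;
			else
				return -1;	/* argument load: not constant */
			break;
		case JMP_JA:    pc += i->k; break;
		case JMP_JEQ_K: pc += (acc == i->k) ? i->jt : i->jf; break;
		case JMP_JGT_K: pc += (acc >  i->k) ? i->jt : i->jf; break;
		case JMP_JGE_K: pc += (acc >= i->k) ? i->jt : i->jf; break;
		case RET_K:     return i->k;
		default:        return -1;	/* unsupported op: bail out */
		}
	}
	return -1;	/* fell off the end: treat as non-constant */
}

/* Example filter: allow syscall 42, kill everything else. */
static const struct insn example[] = {
	{ LD_W_ABS,  0, 0, 0 },		/* A = seccomp_data.nr */
	{ JMP_JEQ_K, 0, 1, 42 },	/* nr == 42 ? fall through : skip 1 */
	{ RET_K,     0, 0, 0x7fff0000 },	/* SECCOMP_RET_ALLOW */
	{ RET_K,     0, 0, 0 },		/* SECCOMP_RET_KILL */
};
```

Bailing out with -1 on anything unrecognized preserves the same conservative direction as the bitmaps: an incomplete emulator can only decline to optimize, never mis-evaluate.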
>
> Initially, I started down this path. It needed a bit of plumbing into
> BPF to better control the lifetime of the cBPF "saved original filter"
(normally used for CHECKPOINT_RESTORE)

I don't think you need that? When a filter is added, you can compute
the results of the added individual filter, and then merge the state.
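The merge step Jann describes can be sketched as a simple AND across layers (names and sizes invented for illustration): each newly attached filter contributes its own "always allowed" bitmap, and the task-wide bitmap only keeps bits every layer agrees on.

```c
#include <stddef.h>

#define NR_WORDS 8	/* enough words for a hypothetical syscall table */

/* A syscall stays constant-allowed only if the new filter also
 * always allows it; any cleared bit clears it for the whole task. */
static void merge_allow(unsigned long *task_map,
			const unsigned long *filter_map)
{
	for (size_t i = 0; i < NR_WORDS; i++)
		task_map[i] &= filter_map[i];
}
```

Because the merge is monotonic (bits can only be cleared, never set), adding a filter can never widen the fast path.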

> and then I needed to keep
> making exceptions (same list you have: ALU, X register, scratch, etc)
> in the name of avoiding too much complexity in the emulator. I decided
> I'd rather reuse the existing infrastructure to actually execute the
> filter (no cBPF copy needed to be saved, no separate code, and full
> instruction coverage).

If you really think that this bit of emulation is so bad, you could
also make a copy of the BPF filter in which you replace all load
instructions from syscall arguments with "return NON_CONSTANT_RESULT",
and then run that through the normal BPF infrastructure.
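Jann's alternative can be sketched as a one-pass rewrite of a filter copy (illustrative only -- the `NON_CONSTANT_RESULT` marker and helper name are invented; offsets follow `struct seccomp_data`, where only `nr` and `arch` in the first 8 bytes are constant per syscall):

```c
#include <stdint.h>
#include <stddef.h>

struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

#define BPF_LD_W_ABS 0x20	/* BPF_LD | BPF_W | BPF_ABS */
#define BPF_RET_K    0x06	/* BPF_RET | BPF_K */
#define SD_CONST_END 8		/* nr (4) + arch (4); everything past
				 * this (ip, args) is not constant */
#define NON_CONSTANT_RESULT 0xffffffffu	/* hypothetical marker value */

/* Replace every load of a non-constant seccomp_data field with an
 * immediate "return NON_CONSTANT", so the unmodified BPF engine can
 * evaluate the copy and report whether the result is constant. */
static void mark_nonconstant_loads(struct insn *prog, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (prog[i].code == BPF_LD_W_ABS && prog[i].k >= SD_CONST_END)
			prog[i] = (struct insn){ BPF_RET_K, 0, 0,
						 NON_CONSTANT_RESULT };
	}
}
```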

> > Something like (completely untested):
[...]
> I didn't actually finish going down the emulator path (I stopped right
> around the time I verified that libseccomp does use BPF_ALU -- though
> only BPF_AND), so I didn't actually evaluate the filter contents for other
> filter builders (i.e. Chrome).
>
> But, if BPF_ALU | BPF_AND were added to your code above, it would cover
> everything libseccomp generates (which covers a lot of the seccomp
> filters, e.g. systemd, docker). I just felt funny about an "incomplete"
> emulator.
>
> Though now you've got me looking. It seems this is the core
> of Chrome's BPF instruction generation:
> https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
> It also uses ALU|AND, but adds JMP|JSET.
>
> So... that's only 2 more instructions to cover what I think are likely
> the two largest seccomp instruction generators.
>
> > That way, you won't need any of this complicated architecture-specific stuff.
>
> There are two arch-specific needs, and using a cBPF-subset emulator
> just gets rid of the local TLB flush. The other part is distinguishing
> the archs. Neither requirement is onerous (TLB flush usually just
> needs little more than an extern, arch is already documented in the
> per-arch syscall_get_arch()).

But it's also somewhat layer-breaking and reliant on very specific
assumptions. Normal kernel code doesn't mess around with page table
magic, outside of very specific low-level things. And your method
would break if the fixed-value members were not all packed together at
the start of the structure.
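The layout assumption Jann points at can at least be pinned down at compile time. Below is a local mirror of the UAPI `struct seccomp_data` (shown for illustration; the real definition lives in `<linux/seccomp.h>`) with checks that the constant-per-syscall members sit ahead of the variable ones:

```c
#include <stddef.h>
#include <stdint.h>

struct seccomp_data_mirror {
	int32_t  nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* If anyone ever reorders the struct, the trick's premise breaks,
 * so fail the build rather than silently misclassify pages. */
_Static_assert(offsetof(struct seccomp_data_mirror, nr) == 0,
	       "nr must come first");
_Static_assert(offsetof(struct seccomp_data_mirror, arch) <
	       offsetof(struct seccomp_data_mirror, instruction_pointer),
	       "constant members must precede variable ones");
```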


And from a hardening perspective: The more code we add that fiddles
around with PTEs directly, rather than going through higher-level
abstractions, the higher the chance that something gets horribly
screwed up. For example, this bit from your patch looks *really*
suspect:

+                       preempt_disable();
+                       set_pte_at(&init_mm, vaddr, ptep,
+                                  pte_mkold(*(READ_ONCE(ptep))));
+                       local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+                       preempt_enable();

First off, that set_pte_at() is just a memory write; I don't see why
you put it inside a preempt_disable() region.
But more importantly, sticking a local TLB flush inside a
preempt_disable() region with nothing else in there looks really
shady. How is that supposed to work? If we migrate from CPU0 to CPU1
directly before this region, and then from CPU1 back to CPU0 directly
afterwards, the local TLB flush will have no effect.
Kees Cook June 16, 2020, 6:49 p.m.
On Tue, Jun 16, 2020 at 08:36:28PM +0200, Jann Horn wrote:
> On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <keescook@chromium.org> wrote:
> > On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> > > Wouldn't it be simpler to use a function that can run a subset of
> > > seccomp cBPF and bails out on anything that indicates that a syscall's
> > > handling is complex or on instructions it doesn't understand? For
> > > syscalls that have a fixed policy, a typical seccomp filter doesn't
> > > even use any of the BPF_ALU ops, the scratch space, or the X register;
> > > it just uses something like the following set of operations, which is
> > > easy to emulate without much code:
> > >
> > > BPF_LD | BPF_W | BPF_ABS
> > > BPF_JMP | BPF_JEQ | BPF_K
> > > BPF_JMP | BPF_JGE | BPF_K
> > > BPF_JMP | BPF_JGT | BPF_K
> > > BPF_JMP | BPF_JA
> > > BPF_RET | BPF_K
> >
> > Initially, I started down this path. It needed a bit of plumbing into
> > BPF to better control the lifetime of the cBPF "saved original filter"
> > (normally used for CHECKPOINT_RESTORE)
> 
> I don't think you need that? When a filter is added, you can compute
> the results of the added individual filter, and then merge the state.

That's what I thought too, but unfortunately not (unless I missed
something) -- the seccomp verifier is run as a callback from the BPF
internals, so seccomp only sees what the user sends (which is unverified)
and the final eBPF filter. There isn't state I can attach during the
callback, so I opted to just do the same thing as CHECKPOINT_RESTORE,
but to then explicitly free the cBPF after bitmap generation.

> > and then I needed to keep
> > making exceptions (same list you have: ALU, X register, scratch, etc)
> > in the name of avoiding too much complexity in the emulator. I decided
> > I'd rather reuse the existing infrastructure to actually execute the
> > filter (no cBPF copy needed to be saved, no separate code, and full
> > instruction coverage).
> 
> If you really think that this bit of emulation is so bad, you could
> also make a copy of the BPF filter in which you replace all load
> instructions from syscall arguments with "return NON_CONSTANT_RESULT",
> and then run that through the normal BPF infrastructure.
> 
> > > Something like (completely untested):
> [...]
> > I didn't actually finish going down the emulator path (I stopped right
> > around the time I verified that libseccomp does use BPF_ALU -- though
> > only BPF_AND), so I didn't actually evaluate the filter contents for other
> > filter builders (i.e. Chrome).
> >
> > But, if BPF_ALU | BPF_AND were added to your code above, it would cover
> > everything libseccomp generates (which covers a lot of the seccomp
> > filters, e.g. systemd, docker). I just felt funny about an "incomplete"
> > emulator.
> >
> > Though now you've got me looking. It seems this is the core
> > of Chrome's BPF instruction generation:
> > https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
> > It also uses ALU|AND, but adds JMP|JSET.
> >
> > So... that's only 2 more instructions to cover what I think are likely
> > the two largest seccomp instruction generators.
> >
> > > That way, you won't need any of this complicated architecture-specific stuff.
> >
> > There are two arch-specific needs, and using a cBPF-subset emulator
> > just gets rid of the local TLB flush. The other part is distinguishing
> > the archs. Neither requirement is onerous (TLB flush usually just
> > needs little more than an extern, arch is already documented in the
> > per-arch syscall_get_arch()).
> 
> But it's also somewhat layer-breaking and reliant on very specific
> assumptions. Normal kernel code doesn't mess around with page table
> magic, outside of very specific low-level things. And your method
> would break if the fixed-value members were not all packed together at
> the start of the structure.

Right -- that was lucky. I suspect the emulation route will win out
here.

> And from a hardening perspective: The more code we add that fiddles
> around with PTEs directly, rather than going through higher-level
> abstractions, the higher the chance that something gets horribly
> screwed up. For example, this bit from your patch looks *really*
> suspect:
> 
> +                       preempt_disable();
> +                       set_pte_at(&init_mm, vaddr, ptep,
> +                                  pte_mkold(*(READ_ONCE(ptep))));
> +                       local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> +                       preempt_enable();
> 
> First off, that set_pte_at() is just a memory write; I don't see why
> you put it inside a preempt_disable() region.
> But more importantly, sticking a local TLB flush inside a
> preempt_disable() region with nothing else in there looks really
> shady. How is that supposed to work? If we migrate from CPU0 to CPU1
> directly before this region, and then from CPU1 back to CPU0 directly
> afterwards, the local TLB flush will have no effect.

Yeah, true, that's another good reason not to do this.
Andy Lutomirski June 16, 2020, 9:13 p.m.
> On Jun 16, 2020, at 11:36 AM, Jann Horn <jannh@google.com> wrote:
> 
> On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <keescook@chromium.org> wrote:
>>> On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
>>> Wouldn't it be simpler to use a function that can run a subset of
>>> seccomp cBPF and bails out on anything that indicates that a syscall's
>>> handling is complex or on instructions it doesn't understand? For
>>> syscalls that have a fixed policy, a typical seccomp filter doesn't
>>> even use any of the BPF_ALU ops, the scratch space, or the X register;
>>> it just uses something like the following set of operations, which is
>>> easy to emulate without much code:
>>> 
>>> BPF_LD | BPF_W | BPF_ABS
>>> BPF_JMP | BPF_JEQ | BPF_K
>>> BPF_JMP | BPF_JGE | BPF_K
>>> BPF_JMP | BPF_JGT | BPF_K
>>> BPF_JMP | BPF_JA
>>> BPF_RET | BPF_K
>> 
>> Initially, I started down this path. It needed a bit of plumbing into
>> BPF to better control the lifetime of the cBPF "saved original filter"
>> (normally used for CHECKPOINT_RESTORE)
> 
> I don't think you need that? When a filter is added, you can compute
> the results of the added individual filter, and then merge the state.
> 
>> and then I needed to keep
>> making exceptions (same list you have: ALU, X register, scratch, etc)
>> in the name of avoiding too much complexity in the emulator. I decided
>> I'd rather reuse the existing infrastructure to actually execute the
>> filter (no cBPF copy needed to be saved, no separate code, and full
>> instruction coverage).
> 
> If you really think that this bit of emulation is so bad, you could
> also make a copy of the BPF filter in which you replace all load
> instructions from syscall arguments with "return NON_CONSTANT_RESULT",
> and then run that through the normal BPF infrastructure.
> 
>>> Something like (completely untested):
> [...]
>> I didn't actually finish going down the emulator path (I stopped right
>> around the time I verified that libseccomp does use BPF_ALU -- though
>> only BPF_AND), so I didn't actually evaluate the filter contents for other
>> filter builders (i.e. Chrome).
>> 
>> But, if BPF_ALU | BPF_AND were added to your code above, it would cover
>> everything libseccomp generates (which covers a lot of the seccomp
>> filters, e.g. systemd, docker). I just felt funny about an "incomplete"
>> emulator.
>> 
>> Though now you've got me looking. It seems this is the core
>> of Chrome's BPF instruction generation:
>> https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
>> It also uses ALU|AND, but adds JMP|JSET.
>> 
>> So... that's only 2 more instructions to cover what I think are likely
>> the two largest seccomp instruction generators.
>> 
>>> That way, you won't need any of this complicated architecture-specific stuff.
>> 
>> There are two arch-specific needs, and using a cBPF-subset emulator
>> just gets rid of the local TLB flush. The other part is distinguishing
>> the archs. Neither requirement is onerous (TLB flush usually just
>> needs little more than an extern, arch is already documented in the
>> per-arch syscall_get_arch()).
> 
> But it's also somewhat layer-breaking and reliant on very specific
> assumptions. Normal kernel code doesn't mess around with page table
> magic, outside of very specific low-level things. And your method
> would break if the fixed-value members were not all packed together at
> the start of the structure.
> 
> 
> And from a hardening perspective: The more code we add that fiddles
> around with PTEs directly, rather than going through higher-level
> abstractions, the higher the chance that something gets horribly
> screwed up. For example, this bit from your patch looks *really*
> suspect:
> 
> +                       preempt_disable();
> +                       set_pte_at(&init_mm, vaddr, ptep,
> +                                  pte_mkold(*(READ_ONCE(ptep))));
> +                       local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> +                       preempt_enable();
> 
> First off, that set_pte_at() is just a memory write; I don't see why
> you put it inside a preempt_disable() region.
> But more importantly, sticking a local TLB flush inside a
> preempt_disable() region with nothing else in there looks really
> shady. How is that supposed to work? If we migrate from CPU0 to CPU1
> directly before this region, and then from CPU1 back to CPU0 directly
> afterwards, the local TLB flush will have no effect.


Indeed.

With my x86/mm maintainer hat on, this is highly questionable. Either the real API should be used, or there should be a sane API. The former will have really atrocious performance, and the latter would need some thought. Basically, if you pin the entire process to one CPU, you can clear the dirty bit, flush, do some magic, and read it back. This is only valid if the operation is short enough that running with preemption off is reasonable.  Otherwise you need to arrange to flush when you schedule in, which could be done with a voluntary-preemption style or with scheduler hooks.
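For the "pin the process to one CPU" precondition Andy mentions, a userspace analogue looks like the following (a hedged sketch only -- in-kernel code would instead disable migration around the PTE manipulation; `pin_to_current_cpu()` is an invented helper using the Linux-specific `sched_setaffinity()`/`sched_getcpu()`):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Restrict the calling thread to whichever CPU it is currently on,
 * so a subsequent local TLB flush cannot be defeated by migration.
 * Returns the pinned CPU number, or -1 on failure. */
static int pin_to_current_cpu(void)
{
	int cpu = sched_getcpu();
	cpu_set_t set;

	if (cpu < 0)
		return -1;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return cpu;
}
```

Even with pinning, the operation between the accessed-bit clear and the read-back has to be short enough that holding the thread on one CPU is acceptable, which is exactly the constraint Andy raises.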

I’m not convinced this is worthwhile.