[REVIEW] mm: Add a user_ns owner to mm_struct and fix ptrace_may_access

Submitted by Eric W. Biederman on Oct. 17, 2016, 4:39 p.m.

Details

Message ID 87twcbq696.fsf@x220.int.ebiederm.org
State New
Series "mm: Add a user_ns owner to mm_struct and fix ptrace_may_access"
Headers show

Commit Message

Eric W. Biederman Oct. 17, 2016, 4:39 p.m.
During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec,
so it is clear in which user namespace CAP_SYS_PTRACE must be present
in to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

Cc: stable@vger.kernel.org
Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

It turns out that dumpable needs to be fixed to be user namespace
aware to fix this issue.  When this patch is ready I plan to place it in
my userns tree and send it to Linus, hopefully for -rc2.

 include/linux/mm_types.h |  1 +
 kernel/fork.c            |  9 ++++++---
 kernel/ptrace.c          | 17 ++++++-----------
 mm/init-mm.c             |  2 ++
 4 files changed, 15 insertions(+), 14 deletions(-)

Patch hide | download patch | download mbox

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8acedf4b7d..08d947fc4c59 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -473,6 +473,7 @@  struct mm_struct {
 	 */
 	struct task_struct __rcu *owner;
 #endif
+	struct user_namespace *user_ns;
 
 	/* store ref to file /proc/<pid>/exe symlink points to */
 	struct file __rcu *exe_file;
diff --git a/kernel/fork.c b/kernel/fork.c
index 623259fc794d..fd85c68c2791 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -742,7 +742,8 @@  static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
+static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
+	struct user_namespace *user_ns)
 {
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
@@ -782,6 +783,7 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	mm->user_ns = get_user_ns(user_ns);
 	return mm;
 
 fail_nocontext:
@@ -827,7 +829,7 @@  struct mm_struct *mm_alloc(void)
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));
-	return mm_init(mm, current);
+	return mm_init(mm, current, current_user_ns());
 }
 
 /*
@@ -842,6 +844,7 @@  void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	put_user_ns(mm->user_ns);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1123,7 +1126,7 @@  static struct mm_struct *dup_mm(struct task_struct *tsk)
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
-	if (!mm_init(mm, tsk))
+	if (!mm_init(mm, tsk, mm->user_ns))
 		goto fail_nomem;
 
 	err = dup_mmap(mm, oldmm);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 2a99027312a6..f2d1b9afb3f8 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -220,7 +220,7 @@  static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
 static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 {
 	const struct cred *cred = current_cred(), *tcred;
-	int dumpable = 0;
+	struct mm_struct *mm;
 	kuid_t caller_uid;
 	kgid_t caller_gid;
 
@@ -271,16 +271,11 @@  static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	return -EPERM;
 ok:
 	rcu_read_unlock();
-	smp_rmb();
-	if (task->mm)
-		dumpable = get_dumpable(task->mm);
-	rcu_read_lock();
-	if (dumpable != SUID_DUMP_USER &&
-	    !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
-		rcu_read_unlock();
-		return -EPERM;
-	}
-	rcu_read_unlock();
+	mm = task->mm;
+	if (!mm ||
+	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
+	     !ptrace_has_cap(mm->user_ns, mode)))
+	    return -EPERM;
 
 	return security_ptrace_access_check(task, mode);
 }
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a56a851908d2..975e49f00f34 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -6,6 +6,7 @@ 
 #include <linux/cpumask.h>
 
 #include <linux/atomic.h>
+#include <linux/user_namespace.h>
 #include <asm/pgtable.h>
 #include <asm/mmu.h>
 
@@ -21,5 +22,6 @@  struct mm_struct init_mm = {
 	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
+	.user_ns	= &init_user_ns,
 	INIT_MM_CONTEXT(init_mm)
 };

Comments

Jann Horn Oct. 17, 2016, 5:25 p.m.
On Mon, Oct 17, 2016 at 11:39:49AM -0500, Eric W. Biederman wrote:
> 
> During exec dumpable is cleared if the file that is being executed is
> not readable by the user executing the file.  A bug in
> ptrace_may_access allows reading the file if the executable happens to
> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
> 
> This problem is fixed with only necessary userspace breakage by adding
> a user namespace owner to mm_struct, captured at the time of exec,
> so it is clear in which user namespace CAP_SYS_PTRACE must be present
> in to be able to safely give read permission to the executable.
> 
> The function ptrace_may_access is modified to verify that the ptracer
> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
> This ensures that if the task changes it's cred into a subordinate
> user namespace it does not become ptraceable.

This looks good! Basically applies the same rules that already apply to
EUID/... changes to namespace changes, and anyone entering a user
namespace can now safely drop UIDs and GIDs to namespace root.

This integrates better in the existing security concept than my old
patch "ptrace: being capable wrt a process requires mapped uids/gids",
and it has less issues in cases where e.g. the extra privileges of an
entering process are the filesystem root or so.

FWIW, if you want, you can add "Reviewed-by: Jann Horn <jann@thejh.net>".

> Cc: stable@vger.kernel.org
> Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> It turns out that dumpable needs to be fixed to be user namespace
> aware to fix this issue.  When this patch is ready I plan to place it in
> my userns tree and send it to Linus, hopefully for -rc2.
> 
>  include/linux/mm_types.h |  1 +
>  kernel/fork.c            |  9 ++++++---
>  kernel/ptrace.c          | 17 ++++++-----------
>  mm/init-mm.c             |  2 ++
>  4 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4a8acedf4b7d..08d947fc4c59 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -473,6 +473,7 @@ struct mm_struct {
>  	 */
>  	struct task_struct __rcu *owner;
>  #endif
> +	struct user_namespace *user_ns;
>  
>  	/* store ref to file /proc/<pid>/exe symlink points to */
>  	struct file __rcu *exe_file;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 623259fc794d..fd85c68c2791 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -742,7 +742,8 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>  #endif
>  }
>  
> -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> +static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> +	struct user_namespace *user_ns)
>  {
>  	mm->mmap = NULL;
>  	mm->mm_rb = RB_ROOT;
> @@ -782,6 +783,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  	if (init_new_context(p, mm))
>  		goto fail_nocontext;
>  
> +	mm->user_ns = get_user_ns(user_ns);
>  	return mm;
>  
>  fail_nocontext:
> @@ -827,7 +829,7 @@ struct mm_struct *mm_alloc(void)
>  		return NULL;
>  
>  	memset(mm, 0, sizeof(*mm));
> -	return mm_init(mm, current);
> +	return mm_init(mm, current, current_user_ns());
>  }
>  
>  /*
> @@ -842,6 +844,7 @@ void __mmdrop(struct mm_struct *mm)
>  	destroy_context(mm);
>  	mmu_notifier_mm_destroy(mm);
>  	check_mm(mm);
> +	put_user_ns(mm->user_ns);
>  	free_mm(mm);
>  }
>  EXPORT_SYMBOL_GPL(__mmdrop);
> @@ -1123,7 +1126,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
>  
>  	memcpy(mm, oldmm, sizeof(*mm));
>  
> -	if (!mm_init(mm, tsk))
> +	if (!mm_init(mm, tsk, mm->user_ns))
>  		goto fail_nomem;
>  
>  	err = dup_mmap(mm, oldmm);
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 2a99027312a6..f2d1b9afb3f8 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -220,7 +220,7 @@ static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
>  static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
>  {
>  	const struct cred *cred = current_cred(), *tcred;
> -	int dumpable = 0;
> +	struct mm_struct *mm;
>  	kuid_t caller_uid;
>  	kgid_t caller_gid;
>  
> @@ -271,16 +271,11 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
>  	return -EPERM;
>  ok:
>  	rcu_read_unlock();
> -	smp_rmb();
> -	if (task->mm)
> -		dumpable = get_dumpable(task->mm);
> -	rcu_read_lock();
> -	if (dumpable != SUID_DUMP_USER &&
> -	    !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
> -		rcu_read_unlock();
> -		return -EPERM;
> -	}
> -	rcu_read_unlock();
> +	mm = task->mm;
> +	if (!mm ||
> +	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
> +	     !ptrace_has_cap(mm->user_ns, mode)))
> +	    return -EPERM;
>  
>  	return security_ptrace_access_check(task, mode);
>  }
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index a56a851908d2..975e49f00f34 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -6,6 +6,7 @@
>  #include <linux/cpumask.h>
>  
>  #include <linux/atomic.h>
> +#include <linux/user_namespace.h>
>  #include <asm/pgtable.h>
>  #include <asm/mmu.h>
>  
> @@ -21,5 +22,6 @@ struct mm_struct init_mm = {
>  	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
> +	.user_ns	= &init_user_ns,
>  	INIT_MM_CONTEXT(init_mm)
>  };
> -- 
> 2.8.3
Eric W. Biederman Oct. 17, 2016, 5:33 p.m.
Jann Horn <jann@thejh.net> writes:

> On Mon, Oct 17, 2016 at 11:39:49AM -0500, Eric W. Biederman wrote:
>> 
>> During exec dumpable is cleared if the file that is being executed is
>> not readable by the user executing the file.  A bug in
>> ptrace_may_access allows reading the file if the executable happens to
>> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
>> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
>> 
>> This problem is fixed with only necessary userspace breakage by adding
>> a user namespace owner to mm_struct, captured at the time of exec,
>> so it is clear in which user namespace CAP_SYS_PTRACE must be present
>> in to be able to safely give read permission to the executable.
>> 
>> The function ptrace_may_access is modified to verify that the ptracer
>> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
>> This ensures that if the task changes it's cred into a subordinate
>> user namespace it does not become ptraceable.
>
> This looks good! Basically applies the same rules that already apply to
> EUID/... changes to namespace changes, and anyone entering a user
> namespace can now safely drop UIDs and GIDs to namespace root.

Yes.  It just required the right perspective and it turned out to be
straight forward to solve.  Especially since it is buggy today for
unreadable executables.

> This integrates better in the existing security concept than my old
> patch "ptrace: being capable wrt a process requires mapped uids/gids",
> and it has less issues in cases where e.g. the extra privileges of an
> entering process are the filesystem root or so.
>
> FWIW, if you want, you can add "Reviewed-by: Jann Horn
> <jann@thejh.net>".

Will do. Thank you.

Eric
Michal Hocko Oct. 18, 2016, 1:50 p.m.
On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
> 
> During exec dumpable is cleared if the file that is being executed is
> not readable by the user executing the file.  A bug in
> ptrace_may_access allows reading the file if the executable happens to
> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
> 
> This problem is fixed with only necessary userspace breakage by adding
> a user namespace owner to mm_struct, captured at the time of exec,
> so it is clear in which user namespace CAP_SYS_PTRACE must be present
> in to be able to safely give read permission to the executable.
> 
> The function ptrace_may_access is modified to verify that the ptracer
> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
> This ensures that if the task changes it's cred into a subordinate
> user namespace it does not become ptraceable.

I haven't studied your patch too deeply but one thing that immediately 
raised a red flag was that mm might be shared between processes (aka
thread groups). What prevents those two to sit in different user
namespaces?

I am primarily asking because this generated a lot of headache for the
memcg handling as those processes might sit in different cgroups while
there is only one correct memcg for them which can disagree with the
cgroup associated with one of the processes.

> Cc: stable@vger.kernel.org
> Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> It turns out that dumpable needs to be fixed to be user namespace
> aware to fix this issue.  When this patch is ready I plan to place it in
> my userns tree and send it to Linus, hopefully for -rc2.
> 
>  include/linux/mm_types.h |  1 +
>  kernel/fork.c            |  9 ++++++---
>  kernel/ptrace.c          | 17 ++++++-----------
>  mm/init-mm.c             |  2 ++
>  4 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4a8acedf4b7d..08d947fc4c59 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -473,6 +473,7 @@ struct mm_struct {
>  	 */
>  	struct task_struct __rcu *owner;
>  #endif
> +	struct user_namespace *user_ns;
>  
>  	/* store ref to file /proc/<pid>/exe symlink points to */
>  	struct file __rcu *exe_file;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 623259fc794d..fd85c68c2791 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -742,7 +742,8 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>  #endif
>  }
>  
> -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> +static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> +	struct user_namespace *user_ns)
>  {
>  	mm->mmap = NULL;
>  	mm->mm_rb = RB_ROOT;
> @@ -782,6 +783,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  	if (init_new_context(p, mm))
>  		goto fail_nocontext;
>  
> +	mm->user_ns = get_user_ns(user_ns);
>  	return mm;
>  
>  fail_nocontext:
> @@ -827,7 +829,7 @@ struct mm_struct *mm_alloc(void)
>  		return NULL;
>  
>  	memset(mm, 0, sizeof(*mm));
> -	return mm_init(mm, current);
> +	return mm_init(mm, current, current_user_ns());
>  }
>  
>  /*
> @@ -842,6 +844,7 @@ void __mmdrop(struct mm_struct *mm)
>  	destroy_context(mm);
>  	mmu_notifier_mm_destroy(mm);
>  	check_mm(mm);
> +	put_user_ns(mm->user_ns);
>  	free_mm(mm);
>  }
>  EXPORT_SYMBOL_GPL(__mmdrop);
> @@ -1123,7 +1126,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
>  
>  	memcpy(mm, oldmm, sizeof(*mm));
>  
> -	if (!mm_init(mm, tsk))
> +	if (!mm_init(mm, tsk, mm->user_ns))
>  		goto fail_nomem;
>  
>  	err = dup_mmap(mm, oldmm);
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 2a99027312a6..f2d1b9afb3f8 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -220,7 +220,7 @@ static int ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
>  static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
>  {
>  	const struct cred *cred = current_cred(), *tcred;
> -	int dumpable = 0;
> +	struct mm_struct *mm;
>  	kuid_t caller_uid;
>  	kgid_t caller_gid;
>  
> @@ -271,16 +271,11 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
>  	return -EPERM;
>  ok:
>  	rcu_read_unlock();
> -	smp_rmb();
> -	if (task->mm)
> -		dumpable = get_dumpable(task->mm);
> -	rcu_read_lock();
> -	if (dumpable != SUID_DUMP_USER &&
> -	    !ptrace_has_cap(__task_cred(task)->user_ns, mode)) {
> -		rcu_read_unlock();
> -		return -EPERM;
> -	}
> -	rcu_read_unlock();
> +	mm = task->mm;
> +	if (!mm ||
> +	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
> +	     !ptrace_has_cap(mm->user_ns, mode)))
> +	    return -EPERM;
>  
>  	return security_ptrace_access_check(task, mode);
>  }
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index a56a851908d2..975e49f00f34 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -6,6 +6,7 @@
>  #include <linux/cpumask.h>
>  
>  #include <linux/atomic.h>
> +#include <linux/user_namespace.h>
>  #include <asm/pgtable.h>
>  #include <asm/mmu.h>
>  
> @@ -21,5 +22,6 @@ struct mm_struct init_mm = {
>  	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
> +	.user_ns	= &init_user_ns,
>  	INIT_MM_CONTEXT(init_mm)
>  };
> -- 
> 2.8.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Jann Horn Oct. 18, 2016, 1:57 p.m.
On Tue, Oct 18, 2016 at 03:50:32PM +0200, Michal Hocko wrote:
> On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
> > 
> > During exec dumpable is cleared if the file that is being executed is
> > not readable by the user executing the file.  A bug in
> > ptrace_may_access allows reading the file if the executable happens to
> > enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
> > unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
> > 
> > This problem is fixed with only necessary userspace breakage by adding
> > a user namespace owner to mm_struct, captured at the time of exec,
> > so it is clear in which user namespace CAP_SYS_PTRACE must be present
> > in to be able to safely give read permission to the executable.
> > 
> > The function ptrace_may_access is modified to verify that the ptracer
> > has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
> > This ensures that if the task changes it's cred into a subordinate
> > user namespace it does not become ptraceable.
> 
> I haven't studied your patch too deeply but one thing that immediately 
> raised a red flag was that mm might be shared between processes (aka
> thread groups).

You're conflating things. Threads always share memory, but sharing memory
doesn't imply being part of the same thread group.

> What prevents those two to sit in different user
> namespaces?

For thread groups: You can't change user namespace in a thread group
with more than one task.

For shared mm: Yeah, I think that could happen - but it doesn't matter.
The patch just needs the mm to determine the namespace in which the mm
was created, and that's always the same for tasks that share mm.
Eric W. Biederman Oct. 18, 2016, 2:56 p.m.
Michal Hocko <mhocko@kernel.org> writes:

> On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
>> 
>> During exec dumpable is cleared if the file that is being executed is
>> not readable by the user executing the file.  A bug in
>> ptrace_may_access allows reading the file if the executable happens to
>> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
>> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
>> 
>> This problem is fixed with only necessary userspace breakage by adding
>> a user namespace owner to mm_struct, captured at the time of exec,
>> so it is clear in which user namespace CAP_SYS_PTRACE must be present
>> in to be able to safely give read permission to the executable.
>> 
>> The function ptrace_may_access is modified to verify that the ptracer
>> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
>> This ensures that if the task changes it's cred into a subordinate
>> user namespace it does not become ptraceable.
>
> I haven't studied your patch too deeply but one thing that immediately 
> raised a red flag was that mm might be shared between processes (aka
> thread groups). What prevents those two to sit in different user
> namespaces?
>
> I am primarily asking because this generated a lot of headache for the
> memcg handling as those processes might sit in different cgroups while
> there is only one correct memcg for them which can disagree with the
> cgroup associated with one of the processes.

That is a legitimate concern, but I do not see any of those kinds of
issues here.

Part of the memcg pain comes from the fact that control groups are
process centric, and part of the pain comes from the fact that it is
possible to change control groups.  What I am doing is making the mm
owned by a user namespace (at creation time), and I am not allowing
changes to that ownership. The credentials of the tasks that use that mm
may be in the same user namespace or descendent user namespaces.

The core goal is to enforce the unreadability of an mm when an
non-readable file is executed.  This is a time of mm creation property.
The enforcement of which fits very well with the security/permission
checking role of the user namespace.

Could this use of mm->user_ns be extended for some kind of
accounting/limiting in the future?  Possibly.  I can imagine a limit on
the total number of page table entries a group of processes are allowed
to have as being a sane kind of limit in this setting much like
RLIMIT_AS is sane on a single mm level.  Pages don't belong to mm's so I
can't imagine anything like the memcg being built on this kind of
infrastructure.

Eric
Jann Horn Oct. 18, 2016, 3:05 p.m.
On Tue, Oct 18, 2016 at 09:56:53AM -0500, Eric W. Biederman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
> >> 
> >> During exec dumpable is cleared if the file that is being executed is
> >> not readable by the user executing the file.  A bug in
> >> ptrace_may_access allows reading the file if the executable happens to
> >> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
> >> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
> >> 
> >> This problem is fixed with only necessary userspace breakage by adding
> >> a user namespace owner to mm_struct, captured at the time of exec,
> >> so it is clear in which user namespace CAP_SYS_PTRACE must be present
> >> in to be able to safely give read permission to the executable.
> >> 
> >> The function ptrace_may_access is modified to verify that the ptracer
> >> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
> >> This ensures that if the task changes it's cred into a subordinate
> >> user namespace it does not become ptraceable.
> >
> > I haven't studied your patch too deeply but one thing that immediately 
> > raised a red flag was that mm might be shared between processes (aka
> > thread groups). What prevents those two to sit in different user
> > namespaces?
> >
> > I am primarily asking because this generated a lot of headache for the
> > memcg handling as those processes might sit in different cgroups while
> > there is only one correct memcg for them which can disagree with the
> > cgroup associated with one of the processes.
> 
> That is a legitimate concern, but I do not see any of those kinds of
> issues here.
> 
> Part of the memcg pain comes from the fact that control groups are
> process centric, and part of the pain comes from the fact that it is
> possible to change control groups.  What I am doing is making the mm
> owned by a user namespace (at creation time), and I am not allowing
> changes to that ownership. The credentials of the tasks that use that mm
> may be in the same user namespace or descendent user namespaces.
> 
> The core goal is to enforce the unreadability of an mm when an
> non-readable file is executed.  This is a time of mm creation property.
> The enforcement of which fits very well with the security/permission
> checking role of the user namespace.

How is that going to work? I thought the core goal was better security for
entering containers.

If I want to dump a non-readable file, afaik, I can just make a new user
namespace, then run the file in there and dump its memory.
I guess you could fix that by entirely prohibiting the execution of a
non-readable file whose owner UID is not mapped. (Adding more dumping
restrictions wouldn't help much because you could still e.g. supply a
malicious dynamic linker if you control the mount namespace.)
Eric W. Biederman Oct. 18, 2016, 3:35 p.m.
Jann Horn <jann@thejh.net> writes:

> On Tue, Oct 18, 2016 at 09:56:53AM -0500, Eric W. Biederman wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
>> >> 
>> >> During exec dumpable is cleared if the file that is being executed is
>> >> not readable by the user executing the file.  A bug in
>> >> ptrace_may_access allows reading the file if the executable happens to
>> >> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
>> >> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
>> >> 
>> >> This problem is fixed with only necessary userspace breakage by adding
>> >> a user namespace owner to mm_struct, captured at the time of exec,
>> >> so it is clear in which user namespace CAP_SYS_PTRACE must be present
>> >> in to be able to safely give read permission to the executable.
>> >> 
>> >> The function ptrace_may_access is modified to verify that the ptracer
>> >> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
>> >> This ensures that if the task changes it's cred into a subordinate
>> >> user namespace it does not become ptraceable.
>> >
>> > I haven't studied your patch too deeply but one thing that immediately 
>> > raised a red flag was that mm might be shared between processes (aka
>> > thread groups). What prevents those two to sit in different user
>> > namespaces?
>> >
>> > I am primarily asking because this generated a lot of headache for the
>> > memcg handling as those processes might sit in different cgroups while
>> > there is only one correct memcg for them which can disagree with the
>> > cgroup associated with one of the processes.
>> 
>> That is a legitimate concern, but I do not see any of those kinds of
>> issues here.
>> 
>> Part of the memcg pain comes from the fact that control groups are
>> process centric, and part of the pain comes from the fact that it is
>> possible to change control groups.  What I am doing is making the mm
>> owned by a user namespace (at creation time), and I am not allowing
>> changes to that ownership. The credentials of the tasks that use that mm
>> may be in the same user namespace or descendent user namespaces.
>> 
>> The core goal is to enforce the unreadability of an mm when an
>> non-readable file is executed.  This is a time of mm creation property.
>> The enforcement of which fits very well with the security/permission
>> checking role of the user namespace.
>
> How is that going to work? I thought the core goal was better security for
> entering containers.

The better security when entering containers came from fixing the the
check for unreadable files.  Because that is fundamentally what
the mm dumpable settings are for.

> If I want to dump a non-readable file, afaik, I can just make a new user
> namespace, then run the file in there and dump its memory.
> I guess you could fix that by entirely prohibiting the execution of a
> non-readable file whose owner UID is not mapped. (Adding more dumping
> restrictions wouldn't help much because you could still e.g. supply a
> malicious dynamic linker if you control the mount namespace.)

That seems to be a part of this puzzle I have incompletely addressed,
thank you.

It looks like I need to change either the owning user namespace or
fail the exec.  Malicious dynamic linkers are doubly interesting.

As mount name spaces are also owned if I have privileges I can address
the possibility of a malicious dynamic linker that way.  AKA who cares
about the link if the owner of the mount namespace has permissions to
read the file.

I am going to look at failing the exec if the owning user namespace
of the mm would not have permissions to read the file.  That should just
be a couple of lines of code and easy to maintain.  Plus it does not
appear that non-readable executables are particularly common.

Eric
Michal Hocko Oct. 18, 2016, 6:06 p.m.
On Tue 18-10-16 09:56:53, Eric W. Biederman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
> >> 
> >> During exec dumpable is cleared if the file that is being executed is
> >> not readable by the user executing the file.  A bug in
> >> ptrace_may_access allows reading the file if the executable happens to
> >> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
> >> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
> >> 
> >> This problem is fixed with only necessary userspace breakage by adding
> >> a user namespace owner to mm_struct, captured at the time of exec,
> >> so it is clear in which user namespace CAP_SYS_PTRACE must be present
> >> in to be able to safely give read permission to the executable.
> >> 
> >> The function ptrace_may_access is modified to verify that the ptracer
> >> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
> >> This ensures that if the task changes it's cred into a subordinate
> >> user namespace it does not become ptraceable.
> >
> > I haven't studied your patch too deeply but one thing that immediately 
> > raised a red flag was that mm might be shared between processes (aka
> > thread groups). What prevents those two to sit in different user
> > namespaces?
> >
> > I am primarily asking because this generated a lot of headache for the
> > memcg handling as those processes might sit in different cgroups while
> > there is only one correct memcg for them which can disagree with the
> > cgroup associated with one of the processes.
> 
> That is a legitimate concern, but I do not see any of those kinds of
> issues here.
> 
> Part of the memcg pain comes from the fact that control groups are
> process centric, and part of the pain comes from the fact that it is
> possible to change control groups.  What I am doing is making the mm
> owned by a user namespace (at creation time), and I am not allowing
> changes to that ownership. The credentials of the tasks that use that mm
> may be in the same user namespace or descendent user namespaces.

OK, then my worries about this weird "threading" model is void.

Thanks for the clarification.
Jann Horn Oct. 18, 2016, 7:12 p.m.
On Tue, Oct 18, 2016 at 10:35:23AM -0500, Eric W. Biederman wrote:
> Jann Horn <jann@thejh.net> writes:
> 
> > On Tue, Oct 18, 2016 at 09:56:53AM -0500, Eric W. Biederman wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
> >> >> 
> >> >> During exec dumpable is cleared if the file that is being executed is
> >> >> not readable by the user executing the file.  A bug in
> >> >> ptrace_may_access allows reading the file if the executable happens to
> >> >> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
> >> >> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
> >> >> 
> >> >> This problem is fixed with only necessary userspace breakage by adding
> >> >> a user namespace owner to mm_struct, captured at the time of exec,
> >> >> so it is clear in which user namespace CAP_SYS_PTRACE must be present
> >> >> in to be able to safely give read permission to the executable.
> >> >> 
> >> >> The function ptrace_may_access is modified to verify that the ptracer
> >> >> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
> >> >> This ensures that if the task changes it's cred into a subordinate
> >> >> user namespace it does not become ptraceable.
> >> >
> >> > I haven't studied your patch too deeply but one thing that immediately 
> >> > raised a red flag was that mm might be shared between processes (aka
> >> > thread groups). What prevents those two to sit in different user
> >> > namespaces?
> >> >
> >> > I am primarily asking because this generated a lot of headache for the
> >> > memcg handling as those processes might sit in different cgroups while
> >> > there is only one correct memcg for them which can disagree with the
> >> > cgroup associated with one of the processes.
> >> 
> >> That is a legitimate concern, but I do not see any of those kinds of
> >> issues here.
> >> 
> >> Part of the memcg pain comes from the fact that control groups are
> >> process centric, and part of the pain comes from the fact that it is
> >> possible to change control groups.  What I am doing is making the mm
> >> owned by a user namespace (at creation time), and I am not allowing
> >> changes to that ownership. The credentials of the tasks that use that mm
> >> may be in the same user namespace or descendent user namespaces.
> >> 
> >> The core goal is to enforce the unreadability of an mm when an
> >> non-readable file is executed.  This is a time of mm creation property.
> >> The enforcement of which fits very well with the security/permission
> >> checking role of the user namespace.
> >
> > How is that going to work? I thought the core goal was better security for
> > entering containers.
> 
> The better security when entering containers came from fixing the the
> check for unreadable files.  Because that is fundamentally what
> the mm dumpable settings are for.

Oh, interesting.


> > If I want to dump a non-readable file, afaik, I can just make a new user
> > namespace, then run the file in there and dump its memory.
> > I guess you could fix that by entirely prohibiting the execution of a
> > non-readable file whose owner UID is not mapped. (Adding more dumping
> > restrictions wouldn't help much because you could still e.g. supply a
> > malicious dynamic linker if you control the mount namespace.)
> 
> That seems to be a part of this puzzle I have incompletely addressed,
> thank you.
> 
> It looks like I need to change either the owning user namespace or
> fail the exec.  Malicious dynamic linkers are doubly interesting.
> 
> As mount name spaces are also owned if I have privileges I can address
> the possibility of a malicious dynamic linker that way.  AKA who cares
> about the link if the owner of the mount namespace has permissions to
> read the file.

If you just check the owner of the mount namespace, someone could still
use a user namespace to chroot() the process. That should also be
sufficient to get the evil linker in. I think it really needs to be the
user namespace of the executing process that's checked, not the user
namespace associated with some mount namespace.


> I am going to look at failing the exec if the owning user namespace
> of the mm would not have permissions to read the file.  That should just
> be a couple of lines of code and easy to maintain.  Plus it does not
> appear that non-readable executables are particularly common.

Hm. Yeah, I guess mode 04111 probably isn't sooo common.
From a security perspective, I think that should work.
Eric W. Biederman Oct. 18, 2016, 9:07 p.m.
Jann Horn <jann@thejh.net> writes:

> On Tue, Oct 18, 2016 at 10:35:23AM -0500, Eric W. Biederman wrote:
>> Jann Horn <jann@thejh.net> writes:
>> 
>> > On Tue, Oct 18, 2016 at 09:56:53AM -0500, Eric W. Biederman wrote:
>> >> Michal Hocko <mhocko@kernel.org> writes:
>> >> 
>> >> > On Mon 17-10-16 11:39:49, Eric W. Biederman wrote:
>> >> >> 
>> >> >> During exec dumpable is cleared if the file that is being executed is
>> >> >> not readable by the user executing the file.  A bug in
>> >> >> ptrace_may_access allows reading the file if the executable happens to
>> >> >> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
>> >> >> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
>> >> >> 
>> >> >> This problem is fixed with only necessary userspace breakage by adding
>> >> >> a user namespace owner to mm_struct, captured at the time of exec,
>> >> >> so it is clear in which user namespace CAP_SYS_PTRACE must be present
>> >> >> in to be able to safely give read permission to the executable.
>> >> >> 
>> >> >> The function ptrace_may_access is modified to verify that the ptracer
>> >> >> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
>> >> >> This ensures that if the task changes it's cred into a subordinate
>> >> >> user namespace it does not become ptraceable.
>> >> >
>> >> > I haven't studied your patch too deeply but one thing that immediately 
>> >> > raised a red flag was that mm might be shared between processes (aka
>> >> > thread groups). What prevents those two to sit in different user
>> >> > namespaces?
>> >> >
>> >> > I am primarily asking because this generated a lot of headache for the
>> >> > memcg handling as those processes might sit in different cgroups while
>> >> > there is only one correct memcg for them which can disagree with the
>> >> > cgroup associated with one of the processes.
>> >> 
>> >> That is a legitimate concern, but I do not see any of those kinds of
>> >> issues here.
>> >> 
>> >> Part of the memcg pain comes from the fact that control groups are
>> >> process centric, and part of the pain comes from the fact that it is
>> >> possible to change control groups.  What I am doing is making the mm
>> >> owned by a user namespace (at creation time), and I am not allowing
>> >> changes to that ownership. The credentials of the tasks that use that mm
>> >> may be in the same user namespace or descendent user namespaces.
>> >> 
>> >> The core goal is to enforce the unreadability of an mm when an
>> >> non-readable file is executed.  This is a time of mm creation property.
>> >> The enforcement of which fits very well with the security/permission
>> >> checking role of the user namespace.
>> >
>> > How is that going to work? I thought the core goal was better security for
>> > entering containers.
>> 
>> The better security when entering containers came from fixing the the
>> check for unreadable files.  Because that is fundamentally what
>> the mm dumpable settings are for.
>
> Oh, interesting.
>
>
>> > If I want to dump a non-readable file, afaik, I can just make a new user
>> > namespace, then run the file in there and dump its memory.
>> > I guess you could fix that by entirely prohibiting the execution of a
>> > non-readable file whose owner UID is not mapped. (Adding more dumping
>> > restrictions wouldn't help much because you could still e.g. supply a
>> > malicious dynamic linker if you control the mount namespace.)
>> 
>> That seems to be a part of this puzzle I have incompletely addressed,
>> thank you.
>> 
>> It looks like I need to change either the owning user namespace or
>> fail the exec.  Malicious dynamic linkers are doubly interesting.
>> 
>> As mount name spaces are also owned if I have privileges I can address
>> the possibility of a malicious dynamic linker that way.  AKA who cares
>> about the link if the owner of the mount namespace has permissions to
>> read the file.
>
> If you just check the owner of the mount namespace, someone could still
> use a user namespace to chroot() the process. That should also be
> sufficient to get the evil linker in. I think it really needs to be the
> user namespace of the executing process that's checked, not the user
> namespace associated with some mount namespace.

Something.  I will just note that this is hard to analyze and
theoretically possible for now, since I don't intend to pursue
that solution.

>> I am going to look at failing the exec if the owning user namespace
>> of the mm would not have permissions to read the file.  That should just
>> be a couple of lines of code and easy to maintain.  Plus it does not
>> appear that non-readable executables are particularly common.
>
> Hm. Yeah, I guess mode 04111 probably isn't sooo common.
> From a security perspective, I think that should work.

Well there is at least one common distro that installs sudo
that way so I would not say uncommon.   But we already ignore
the suid and sgid bit when executing such executables as without
having the uid or gid mapping into a user namespace suid and sgid
can not be supported.

So the only case that could cause a real regression/loss of
functionality is if there are unreadable executables without the suid or
sgid bit set.  I can't find any of those.

Patch for this second bug in a moment.

Eric