[v2,1/4] capability: add ns_capable_cred()

Submitted by Christian Brauner on April 30, 2020, 4:57 p.m.

Details

Message ID 20200430165717.1001605-1-christian.brauner@ubuntu.com
State New
Series "Series without cover letter"
Headers show

Commit Message

Christian Brauner April 30, 2020, 4:57 p.m.
Add a simple capability helper which makes it possible to determine
whether a set of creds is ns capable wrt to the passed in credentials.
This is not something exciting it's just a more pleasant wrapper around
security_capable() by allowing ns_capable_common() to ake a const struct
cred argument. In ptrace_has_cap() for example, we're using
security_capable() directly. ns_capable_cred() will be used in the next
patch to check against the target credentials the caller is going to
switch to.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch introduced
---
 include/linux/capability.h |  3 +++
 kernel/capability.c        | 17 +++++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)


base-commit: ae83d0b416db002fe95601e7f97f64b59514d936

Patch hide | download patch | download mbox

diff --git a/include/linux/capability.h b/include/linux/capability.h
index ecce0f43c73a..743a08d936fb 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -40,6 +40,7 @@  struct cpu_vfs_cap_data {
 struct file;
 struct inode;
 struct dentry;
+struct cred;
 struct task_struct;
 struct user_namespace;
 
@@ -209,6 +210,8 @@  extern bool has_ns_capability_noaudit(struct task_struct *t,
 				      struct user_namespace *ns, int cap);
 extern bool capable(int cap);
 extern bool ns_capable(struct user_namespace *ns, int cap);
+extern bool ns_capable_cred(const struct cred *cred,
+			    struct user_namespace *ns, int cap);
 extern bool ns_capable_noaudit(struct user_namespace *ns, int cap);
 extern bool ns_capable_setid(struct user_namespace *ns, int cap);
 #else
diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..84425781917e 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -361,8 +361,8 @@  bool has_capability_noaudit(struct task_struct *t, int cap)
 	return has_ns_capability_noaudit(t, &init_user_ns, cap);
 }
 
-static bool ns_capable_common(struct user_namespace *ns,
-			      int cap,
+static bool ns_capable_common(const struct cred *cred,
+			      struct user_namespace *ns, int cap,
 			      unsigned int opts)
 {
 	int capable;
@@ -372,7 +372,7 @@  static bool ns_capable_common(struct user_namespace *ns,
 		BUG();
 	}
 
-	capable = security_capable(current_cred(), ns, cap, opts);
+	capable = security_capable(cred, ns, cap, opts);
 	if (capable == 0) {
 		current->flags |= PF_SUPERPRIV;
 		return true;
@@ -393,10 +393,15 @@  static bool ns_capable_common(struct user_namespace *ns,
  */
 bool ns_capable(struct user_namespace *ns, int cap)
 {
-	return ns_capable_common(ns, cap, CAP_OPT_NONE);
+	return ns_capable_common(current_cred(), ns, cap, CAP_OPT_NONE);
 }
 EXPORT_SYMBOL(ns_capable);
 
+bool ns_capable_cred(const struct cred *cred, struct user_namespace *ns, int cap)
+{
+	return ns_capable_common(cred, ns, cap, CAP_OPT_NONE);
+}
+
 /**
  * ns_capable_noaudit - Determine if the current task has a superior capability
  * (unaudited) in effect
@@ -411,7 +416,7 @@  EXPORT_SYMBOL(ns_capable);
  */
 bool ns_capable_noaudit(struct user_namespace *ns, int cap)
 {
-	return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT);
+	return ns_capable_common(current_cred(), ns, cap, CAP_OPT_NOAUDIT);
 }
 EXPORT_SYMBOL(ns_capable_noaudit);
 
@@ -430,7 +435,7 @@  EXPORT_SYMBOL(ns_capable_noaudit);
  */
 bool ns_capable_setid(struct user_namespace *ns, int cap)
 {
-	return ns_capable_common(ns, cap, CAP_OPT_INSETID);
+	return ns_capable_common(current_cred(), ns, cap, CAP_OPT_INSETID);
 }
 EXPORT_SYMBOL(ns_capable_setid);
 

Comments

Eric W. Biederman April 30, 2020, 6:09 p.m.
Christian Brauner <christian.brauner@ubuntu.com> writes:

> Add a simple capability helper which makes it possible to determine
> whether a set of creds is ns capable wrt to the passed in credentials.
> This is not something exciting it's just a more pleasant wrapper around
> security_capable() by allowing ns_capable_common() to ake a const struct
> cred argument. In ptrace_has_cap() for example, we're using
> security_capable() directly. ns_capable_cred() will be used in the next
> patch to check against the target credentials the caller is going to
> switch to.

Given that this is to suppot setns.  I don't understand the
justification for this.

Is it your intention to use the reduced permissions that you get
when you install a user namespace?

Why do you want to use the reduced permissions when installing multiple
namespaces at once?

Eric


> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> /* v2 */
> patch introduced
> ---
>  include/linux/capability.h |  3 +++
>  kernel/capability.c        | 17 +++++++++++------
>  2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index ecce0f43c73a..743a08d936fb 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -40,6 +40,7 @@ struct cpu_vfs_cap_data {
>  struct file;
>  struct inode;
>  struct dentry;
> +struct cred;
>  struct task_struct;
>  struct user_namespace;
>  
> @@ -209,6 +210,8 @@ extern bool has_ns_capability_noaudit(struct task_struct *t,
>  				      struct user_namespace *ns, int cap);
>  extern bool capable(int cap);
>  extern bool ns_capable(struct user_namespace *ns, int cap);
> +extern bool ns_capable_cred(const struct cred *cred,
> +			    struct user_namespace *ns, int cap);
>  extern bool ns_capable_noaudit(struct user_namespace *ns, int cap);
>  extern bool ns_capable_setid(struct user_namespace *ns, int cap);
>  #else
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 1444f3954d75..84425781917e 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -361,8 +361,8 @@ bool has_capability_noaudit(struct task_struct *t, int cap)
>  	return has_ns_capability_noaudit(t, &init_user_ns, cap);
>  }
>  
> -static bool ns_capable_common(struct user_namespace *ns,
> -			      int cap,
> +static bool ns_capable_common(const struct cred *cred,
> +			      struct user_namespace *ns, int cap,
>  			      unsigned int opts)
>  {
>  	int capable;
> @@ -372,7 +372,7 @@ static bool ns_capable_common(struct user_namespace *ns,
>  		BUG();
>  	}
>  
> -	capable = security_capable(current_cred(), ns, cap, opts);
> +	capable = security_capable(cred, ns, cap, opts);
>  	if (capable == 0) {
>  		current->flags |= PF_SUPERPRIV;
>  		return true;
> @@ -393,10 +393,15 @@ static bool ns_capable_common(struct user_namespace *ns,
>   */
>  bool ns_capable(struct user_namespace *ns, int cap)
>  {
> -	return ns_capable_common(ns, cap, CAP_OPT_NONE);
> +	return ns_capable_common(current_cred(), ns, cap, CAP_OPT_NONE);
>  }
>  EXPORT_SYMBOL(ns_capable);
>  
> +bool ns_capable_cred(const struct cred *cred, struct user_namespace *ns, int cap)
> +{
> +	return ns_capable_common(cred, ns, cap, CAP_OPT_NONE);
> +}
> +
>  /**
>   * ns_capable_noaudit - Determine if the current task has a superior capability
>   * (unaudited) in effect
> @@ -411,7 +416,7 @@ EXPORT_SYMBOL(ns_capable);
>   */
>  bool ns_capable_noaudit(struct user_namespace *ns, int cap)
>  {
> -	return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT);
> +	return ns_capable_common(current_cred(), ns, cap, CAP_OPT_NOAUDIT);
>  }
>  EXPORT_SYMBOL(ns_capable_noaudit);
>  
> @@ -430,7 +435,7 @@ EXPORT_SYMBOL(ns_capable_noaudit);
>   */
>  bool ns_capable_setid(struct user_namespace *ns, int cap)
>  {
> -	return ns_capable_common(ns, cap, CAP_OPT_INSETID);
> +	return ns_capable_common(current_cred(), ns, cap, CAP_OPT_INSETID);
>  }
>  EXPORT_SYMBOL(ns_capable_setid);
>  
>
> base-commit: ae83d0b416db002fe95601e7f97f64b59514d936
Christian Brauner May 1, 2020, 3:42 p.m.
On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > Add a simple capability helper which makes it possible to determine
> > whether a set of creds is ns capable wrt to the passed in credentials.
> > This is not something exciting it's just a more pleasant wrapper around
> > security_capable() by allowing ns_capable_common() to ake a const struct
> > cred argument. In ptrace_has_cap() for example, we're using
> > security_capable() directly. ns_capable_cred() will be used in the next
> > patch to check against the target credentials the caller is going to
> > switch to.
> 
> Given that this is to suppot setns.  I don't understand the
> justification for this.
> 
> Is it your intention to use the reduced permissions that you get
> when you install a user namespace?

Indeed.

> 
> Why do you want to use the reduced permissions when installing multiple
> namespaces at once?

The intention is to use the target credentials we are going to install
when setns() hits the point of no return. The target permissions are
either the permissions of the caller or the reduced permissions if
CLONE_NEWUSER has been requested. This has multiple reasons.

The most obvious reason imho is that all other namespaces have an owning
user namespace. Attaching to any other namespace requires the attacher
to be privileged over the owning user namespace of that namespace.
Consequently, all our current install handlers for every namespace we
have check whether we are privileged in the owning user namespace of
that user namespace. So in order to attach to any of those namespaces -
especially when attaching as an unprivileged user - requires that we are
attached to the user namespace first. (That's especially useful given
that most users especially container runtimes will unshare all
namespaces. Doing it this way we can not just have attach privileged
users attach to their containers but also unprivileged users to their
containers in one shot.)

A few other points about this. If one looks at clone(CLONE_NEW*) or
unshare(CLONE_NEW*) then the ordering and permissions checking is the
same way. All permissions checks are performed against the reduced
permissions, i.e. if CLONE_NEWUSER is specified you check privilege
against the reduced permissions too otherwise you wouldn't be able to
spawn into a complete set of new namespaces as an unprivileged user.

This logic is also expressed in how setns() is already used in
userspace. Any container runtime will attach to the user namespace first,
so all subsequent calls to attach to other namespaces perform the checks
against the reduced permissions. It also has to be that way because of
fully unprivileged containers.

To put it another way, if we were to always perform the permission checks
against the current permissions (i.e. no matter if CLONE_NEWUSER is
specified or not) then we'd never be able to attach to a set of
namespaces at once as an unprivileged user.
We also really want to be able to express both semantics:
1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
   current permission level
2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
   permission level
It feels weird if both 1 and 2 would mean the exact same thing given
that the user namespace has an owernship relation with all the other
namespaces.

Christian
Eric W. Biederman May 2, 2020, 12:35 p.m.
Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > Add a simple capability helper which makes it possible to determine
>> > whether a set of creds is ns capable wrt to the passed in credentials.
>> > This is not something exciting it's just a more pleasant wrapper around
>> > security_capable() by allowing ns_capable_common() to ake a const struct
>> > cred argument. In ptrace_has_cap() for example, we're using
>> > security_capable() directly. ns_capable_cred() will be used in the next
>> > patch to check against the target credentials the caller is going to
>> > switch to.
>> 
>> Given that this is to suppot setns.  I don't understand the
>> justification for this.
>> 
>> Is it your intention to use the reduced permissions that you get
>> when you install a user namespace?
>
> Indeed.
>
>> 
>> Why do you want to use the reduced permissions when installing multiple
>> namespaces at once?
>
> The intention is to use the target credentials we are going to install
> when setns() hits the point of no return. The target permissions are
> either the permissions of the caller or the reduced permissions if
> CLONE_NEWUSER has been requested. This has multiple reasons.
>
> The most obvious reason imho is that all other namespaces have an owning
> user namespace. Attaching to any other namespace requires the attacher
> to be privileged over the owning user namespace of that namespace.
> Consequently, all our current install handlers for every namespace we
> have check whether we are privileged in the owning user namespace of
> that user namespace. So in order to attach to any of those namespaces -
> especially when attaching as an unprivileged user - requires that we are
> attached to the user namespace first.

No actually it doesn't.  As if you have privileges to attach to the user
namespace you have the privileges to attach to anything it owns.  Or you
should I think I am missing some subtle detail at the moment.


> (That's especially useful given
> that most users especially container runtimes will unshare all
> namespaces. Doing it this way we can not just have attach privileged
> users attach to their containers but also unprivileged users to their
> containers in one shot.)

That is a wonderful reason for doing things, and it is the reason
why I am asking about it because I think you have it backwards.

Especially in the context of some container runtimes like Kubernetes
that I have been told will do things like share a network namespace
across all containers in a POD.

> A few other points about this. If one looks at clone(CLONE_NEW*) or
> unshare(CLONE_NEW*) then the ordering and permissions checking is the
> same way. All permissions checks are performed against the reduced
> permissions, i.e. if CLONE_NEWUSER is specified you check privilege
> against the reduced permissions too otherwise you wouldn't be able to
> spawn into a complete set of new namespaces as an unprivileged user.

That is a good catch and definitely a reason for looking at doing
things in this order.

For unshare and clone putting things in a user namespace means you can
create namespaces you could not create otherwise.


> This logic is also expressed in how setns() is already used in
> userspace. Any container runtime will attach to the user namespace first,
> so all subsequent calls to attach to other namespaces perform the checks
> against the reduced permissions. It also has to be that way because of
> fully unprivileged containers.

So I sat and looked.  For nsetner it winds up trying to enter
the namespaces in either order.

        /*
         * Now that we know which namespaces we want to enter, enter
         * them.  Do this in two passes, not entering the user
         * namespace on the first pass.  So if we're deprivileging the
         * container we'll enter the user namespace last and if we're
         * privileging it then we enter the user namespace first
         * (because the initial setns will fail).
         */
        for (pass = 0; pass < 2; pass ++) {
                for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) {
                        if (nsfile->fd < 0)
                                continue;
                        if (nsfile->nstype == CLONE_NEWPID && do_fork == -1)
                                do_fork = 1;
                        if (setns(nsfile->fd, nsfile->nstype)) {
                                if (pass != 0)
                                        err(EXIT_FAILURE,
                                            _("reassociate to namespace '%s' failed"),
                                            nsfile->name);
                                else
                                        continue;
                        }

                        close(nsfile->fd);
                        nsfile->fd = -1;
                }
        }


Looking a little close we have at least for entering the mntns the
following checks:

	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace.

So there is defintiely an issue there.

I suspect the clean approach is to simply require CAP_SYS_CHROOT in
the new user namespace, when we are setting several of them at once.

> To put it another way, if we were to always perform the permission checks
> against the current permissions (i.e. no matter if CLONE_NEWUSER is
> specified or not) then we'd never be able to attach to a set of
> namespaces at once as an unprivileged user.
> We also really want to be able to express both semantics:
> 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
>    current permission level
> 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
>    permission level
> It feels weird if both 1 and 2 would mean the exact same thing given
> that the user namespace has an owernship relation with all the other
> namespaces.

It feels weird to me to disallow anything that we have permissions for.

Can you dig up the actual install permissions checks and see if there is
anything other than the mount namespace that needs permissions in the
current user namespace?

Please let's walk this through.  I think there should be a way to
carefully phrase the permission checks that we don't need
ns_capable_cred that will allow goofy cases like setns into Kuberneties
PODs that share network namespaces.

I believe that will be a way to phrase the permission checks so that
with or without CLONE_NEWUSER they make sense, and give very similar
results.

Certainly attaching to a mount namespace is going to need either being
root or attaching to a user namespace at the same time.  Because
attaching to a mount namespace needs functionality that the user
namespace provides.

Eric
Christian Brauner May 2, 2020, 2:32 p.m.
On Sat, May 02, 2020 at 07:35:53AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >> 
> >> > Add a simple capability helper which makes it possible to determine
> >> > whether a set of creds is ns capable wrt to the passed in credentials.
> >> > This is not something exciting it's just a more pleasant wrapper around
> >> > security_capable() by allowing ns_capable_common() to ake a const struct
> >> > cred argument. In ptrace_has_cap() for example, we're using
> >> > security_capable() directly. ns_capable_cred() will be used in the next
> >> > patch to check against the target credentials the caller is going to
> >> > switch to.
> >> 
> >> Given that this is to suppot setns.  I don't understand the
> >> justification for this.
> >> 
> >> Is it your intention to use the reduced permissions that you get
> >> when you install a user namespace?
> >
> > Indeed.
> >
> >> 
> >> Why do you want to use the reduced permissions when installing multiple
> >> namespaces at once?
> >
> > The intention is to use the target credentials we are going to install
> > when setns() hits the point of no return. The target permissions are
> > either the permissions of the caller or the reduced permissions if
> > CLONE_NEWUSER has been requested. This has multiple reasons.
> >
> > The most obvious reason imho is that all other namespaces have an owning
> > user namespace. Attaching to any other namespace requires the attacher
> > to be privileged over the owning user namespace of that namespace.
> > Consequently, all our current install handlers for every namespace we
> > have check whether we are privileged in the owning user namespace of
> > that user namespace. So in order to attach to any of those namespaces -
> > especially when attaching as an unprivileged user - requires that we are
> > attached to the user namespace first.
> 
> No actually it doesn't.  As if you have privileges to attach to the user
> namespace you have the privileges to attach to anything it owns.  Or you
> should I think I am missing some subtle detail at the moment.

Sorry, I phrased that very misleadingly. What I'm talking about should
be obvious in the second patch:

-static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
+static int utsns_install(struct nsset *nsset, struct ns_common *new)
 {
+       struct nsproxy *nsproxy = nsset->nsproxy;
        struct uts_namespace *ns = to_uts_ns(new);

-       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
-           !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+       if (!ns_capable_cred(nsset->cred, ns->user_ns, CAP_SYS_ADMIN) ||
+           !ns_capable_cred(nsset->cred, nsset->cred->user_ns, CAP_SYS_ADMIN))
                return -EPERM;

All our current install handlers check that we're privileged wrt to our
_current_ user namespace, i.e. they all have a
ns_capable(current_user_ns(), CAP_SYS_ADMIN)
line. So if we specify
setns(CLONE_NEWUSER | CLONE_NEW*)
as an unprivileged user we aren't able to attach to any other namespace
unless we check against the target credentials.

> 
> 
> > (That's especially useful given
> > that most users especially container runtimes will unshare all
> > namespaces. Doing it this way we can not just have attach privileged
> > users attach to their containers but also unprivileged users to their
> > containers in one shot.)
> 
> That is a wonderful reason for doing things, and it is the reason
> why I am asking about it because I think you have it backwards.
> 
> Especially in the context of some container runtimes like Kubernetes
> that I have been told will do things like share a network namespace
> across all containers in a POD.

Kubernetes doesn't use user namespace at all so they need to run as root
anyway so that's not a problem. And if we're talking about scenarios
where some namespaces are shared and other's aren't then there'll be
batch-attaching any, i.e. not all at once but some at once. So I don't
think this is a good argument.

> 
> > A few other points about this. If one looks at clone(CLONE_NEW*) or
> > unshare(CLONE_NEW*) then the ordering and permissions checking is the
> > same way. All permissions checks are performed against the reduced
> > permissions, i.e. if CLONE_NEWUSER is specified you check privilege
> > against the reduced permissions too otherwise you wouldn't be able to
> > spawn into a complete set of new namespaces as an unprivileged user.
> 
> That is a good catch and definitely a reason for looking at doing
> things in this order.
> 
> For unshare and clone putting things in a user namespace means you can
> create namespaces you could not create otherwise.
> 
> 
> > This logic is also expressed in how setns() is already used in
> > userspace. Any container runtime will attach to the user namespace first,
> > so all subsequent calls to attach to other namespaces perform the checks
> > against the reduced permissions. It also has to be that way because of
> > fully unprivileged containers.
> 
> So I sat and looked.  For nsetner it winds up trying to enter
> the namespaces in either order.
> 
>         /*
>          * Now that we know which namespaces we want to enter, enter
>          * them.  Do this in two passes, not entering the user
>          * namespace on the first pass.  So if we're deprivileging the
>          * container we'll enter the user namespace last and if we're
>          * privileging it then we enter the user namespace first
>          * (because the initial setns will fail).
>          */
>         for (pass = 0; pass < 2; pass ++) {
>                 for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) {
>                         if (nsfile->fd < 0)
>                                 continue;
>                         if (nsfile->nstype == CLONE_NEWPID && do_fork == -1)
>                                 do_fork = 1;
>                         if (setns(nsfile->fd, nsfile->nstype)) {
>                                 if (pass != 0)
>                                         err(EXIT_FAILURE,
>                                             _("reassociate to namespace '%s' failed"),
>                                             nsfile->name);
>                                 else
>                                         continue;
>                         }
> 
>                         close(nsfile->fd);
>                         nsfile->fd = -1;
>                 }
>         }
> 
> 
> Looking a little close we have at least for entering the mntns the
> following checks:
> 
> 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> 		return -EPERM;
> 
> Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace.
> 
> So there is defintiely an issue there.
> 
> I suspect the clean approach is to simply require CAP_SYS_CHROOT in
> the new user namespace, when we are setting several of them at once.

See my example above. All install handlers check for
ns_capable(current_user_ns(), CAP_SYS_ADMIN).


static int ipcns_install(struct nsproxy *nsproxy, struct ns_common *new)
{
	struct ipc_namespace *ns = to_ipc_ns(new);
	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;


static int netns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
	struct net *net = to_net_ns(ns);

	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
	struct fs_struct *fs = current->fs;
	struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
	struct path root;
	int err;

	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
	struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);

	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
		return -EPERM;

static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
{
	struct uts_namespace *ns = to_uts_ns(new);

	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
	struct pid_namespace *active = task_active_pid_ns(current);
	struct pid_namespace *ancestor, *new = to_pid_ns(ns);

	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
		return -EPERM;

> 
> > To put it another way, if we were to always perform the permission checks
> > against the current permissions (i.e. no matter if CLONE_NEWUSER is
> > specified or not) then we'd never be able to attach to a set of
> > namespaces at once as an unprivileged user.
> > We also really want to be able to express both semantics:
> > 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
> >    current permission level
> > 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
> >    permission level
> > It feels weird if both 1 and 2 would mean the exact same thing given
> > that the user namespace has an owernship relation with all the other
> > namespaces.
> 
> It feels weird to me to disallow anything that we have permissions for.
> 
> Can you dig up the actual install permissions checks and see if there is
> anything other than the mount namespace that needs permissions in the
> current user namespace?

Yep did, all of them check ns_capable(current_user_ns(), CAP_SYS_ADMIN),
see above please.

> 
> Please let's walk this through.  I think there should be a way to
> carefully phrase the permission checks that we don't need
> ns_capable_cred that will allow goofy cases like setns into Kuberneties
> PODs that share network namespaces.

Hm, Kubernetes doesn't use user namespace. I think I misunderstand your
concern maybe. But see below for a suggestion.

> 
> I believe that will be a way to phrase the permission checks so that
> with or without CLONE_NEWUSER they make sense, and give very similar
> results.
> 
> Certainly attaching to a mount namespace is going to need either being
> root or attaching to a user namespace at the same time.  Because
> attaching to a mount namespace needs functionality that the user
> namespace provides.

So how about we add a new flag to setns()
SETNS_NEWUSER_CRED that is only valid in
conjunction with CLONE_NEWUSER and which allows callers to tell the
kernel "assume the target credentials first". This way we can support
both cases and users can specify what they want and we can clearly
document it on the manpages.

Christian
Eric W. Biederman May 2, 2020, 2:52 p.m.
Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Sat, May 02, 2020 at 07:35:53AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> 
>> > On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> >> 
>> >> > Add a simple capability helper which makes it possible to determine
>> >> > whether a set of creds is ns capable wrt to the passed in credentials.
>> >> > This is not something exciting it's just a more pleasant wrapper around
>> >> > security_capable() by allowing ns_capable_common() to ake a const struct
>> >> > cred argument. In ptrace_has_cap() for example, we're using
>> >> > security_capable() directly. ns_capable_cred() will be used in the next
>> >> > patch to check against the target credentials the caller is going to
>> >> > switch to.
>> >> 
>> >> Given that this is to suppot setns.  I don't understand the
>> >> justification for this.
>> >> 
>> >> Is it your intention to use the reduced permissions that you get
>> >> when you install a user namespace?
>> >
>> > Indeed.
>> >
>> >> 
>> >> Why do you want to use the reduced permissions when installing multiple
>> >> namespaces at once?
>> >
>> > The intention is to use the target credentials we are going to install
>> > when setns() hits the point of no return. The target permissions are
>> > either the permissions of the caller or the reduced permissions if
>> > CLONE_NEWUSER has been requested. This has multiple reasons.
>> >
>> > The most obvious reason imho is that all other namespaces have an owning
>> > user namespace. Attaching to any other namespace requires the attacher
>> > to be privileged over the owning user namespace of that namespace.
>> > Consequently, all our current install handlers for every namespace we
>> > have check whether we are privileged in the owning user namespace of
>> > that user namespace. So in order to attach to any of those namespaces -
>> > especially when attaching as an unprivileged user - requires that we are
>> > attached to the user namespace first.
>> 
>> No actually it doesn't.  As if you have privileges to attach to the user
>> namespace you have the privileges to attach to anything it owns.  Or you
>> should I think I am missing some subtle detail at the moment.
>
> Sorry, I phrased that very misleadingly. What I'm talking about should
> be obvious in the second patch:
>
> -static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
> +static int utsns_install(struct nsset *nsset, struct ns_common *new)
>  {
> +       struct nsproxy *nsproxy = nsset->nsproxy;
>         struct uts_namespace *ns = to_uts_ns(new);
>
> -       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> -           !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> +       if (!ns_capable_cred(nsset->cred, ns->user_ns, CAP_SYS_ADMIN) ||
> +           !ns_capable_cred(nsset->cred, nsset->cred->user_ns, CAP_SYS_ADMIN))
>                 return -EPERM;
>
> All our current install handlers check that we're privileged wrt to our
> _current_ user namespace, i.e. they all have a
> ns_capable(current_user_ns(), CAP_SYS_ADMIN)
> line. So if we specify
> setns(CLONE_NEWUSER | CLONE_NEW*)
> as an unprivileged user we aren't able to attach to any other namespace
> unless we check against the target credentials.
>
>> 
>> 
>> > (That's especially useful given
>> > that most users especially container runtimes will unshare all
>> > namespaces. Doing it this way we can not just have attach privileged
>> > users attach to their containers but also unprivileged users to their
>> > containers in one shot.)
>> 
>> That is a wonderful reason for doing things, and it is the reason
>> why I am asking about it because I think you have it backwards.
>> 
>> Especially in the context of some container runtimes like Kubernetes
>> that I have been told will do things like share a network namespace
>> across all containers in a POD.
>
> Kubernetes doesn't use user namespace at all so they need to run as root
> anyway so that's not a problem. And if we're talking about scenarios
> where some namespaces are shared and other's aren't then there'll be
> batch-attaching any, i.e. not all at once but some at once. So I don't
> think this is a good argument.
>
>> 
>> > A few other points about this. If one looks at clone(CLONE_NEW*) or
>> > unshare(CLONE_NEW*) then the ordering and permissions checking is the
>> > same way. All permissions checks are performed against the reduced
>> > permissions, i.e. if CLONE_NEWUSER is specified you check privilege
>> > against the reduced permissions too otherwise you wouldn't be able to
>> > spawn into a complete set of new namespaces as an unprivileged user.
>> 
>> That is a good catch and definitely a reason for looking at doing
>> things in this order.
>> 
>> For unshare and clone putting things in a user namespace means you can
>> create namespaces you could not create otherwise.
>> 
>> 
>> > This logic is also expressed in how setns() is already used in
>> > userspace. Any container runtime will attach to the user namespace first,
>> > so all subsequent calls to attach to other namespaces perform the checks
>> > against the reduced permissions. It also has to be that way because of
>> > fully unprivileged containers.
>> 
>> So I sat and looked.  For nsetner it winds up trying to enter
>> the namespaces in either order.
>> 
>>         /*
>>          * Now that we know which namespaces we want to enter, enter
>>          * them.  Do this in two passes, not entering the user
>>          * namespace on the first pass.  So if we're deprivileging the
>>          * container we'll enter the user namespace last and if we're
>>          * privileging it then we enter the user namespace first
>>          * (because the initial setns will fail).
>>          */
>>         for (pass = 0; pass < 2; pass ++) {
>>                 for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) {
>>                         if (nsfile->fd < 0)
>>                                 continue;
>>                         if (nsfile->nstype == CLONE_NEWPID && do_fork == -1)
>>                                 do_fork = 1;
>>                         if (setns(nsfile->fd, nsfile->nstype)) {
>>                                 if (pass != 0)
>>                                         err(EXIT_FAILURE,
>>                                             _("reassociate to namespace '%s' failed"),
>>                                             nsfile->name);
>>                                 else
>>                                         continue;
>>                         }
>> 
>>                         close(nsfile->fd);
>>                         nsfile->fd = -1;
>>                 }
>>         }
>> 
>> 
>> Looking a little close we have at least for entering the mntns the
>> following checks:
>> 
>> 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
>> 	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
>> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
>> 		return -EPERM;
>> 
>> Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace.
>> 
>> So there is defintiely an issue there.
>> 
>> I suspect the clean approach is to simply require CAP_SYS_CHROOT in
>> the new user namespace, when we are setting several of them at once.
>
> See my example above. All install handlers check for
> ns_capable(current_user_ns(), CAP_SYS_ADMIN).
>
>
> static int ipcns_install(struct nsproxy *nsproxy, struct ns_common *new)
> {
> 	struct ipc_namespace *ns = to_ipc_ns(new);
> 	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> 		return -EPERM;
>
>
> static int netns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> {
> 	struct net *net = to_net_ns(ns);
>
> 	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> 		return -EPERM;
>
> static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> {
> 	struct fs_struct *fs = current->fs;
> 	struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
> 	struct path root;
> 	int err;
>
> 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> 		return -EPERM;
>
> static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> {
> 	struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
>
> 	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> 	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> 		return -EPERM;
>
> static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
> {
> 	struct uts_namespace *ns = to_uts_ns(new);
>
> 	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> 		return -EPERM;
>
> static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> {
> 	struct pid_namespace *active = task_active_pid_ns(current);
> 	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
>
> 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> 		return -EPERM;
>
>> 
>> > To put it another way, if we were to always perform the permission checks
>> > against the current permissions (i.e. no matter if CLONE_NEWUSER is
>> > specified or not) then we'd never be able to attach to a set of
>> > namespaces at once as an unprivileged user.
>> > We also really want to be able to express both semantics:
>> > 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
>> >    current permission level
>> > 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
>> >    permission level
>> > It feels weird if both 1 and 2 would mean the exact same thing given
>> > that the user namespace has an owernship relation with all the other
>> > namespaces.
>> 
>> It feels weird to me to disallow anything that we have permissions for.
>> 
>> Can you dig up the actual install permissions checks and see if there is
>> anything other than the mount namespace that needs permissions in the
>> current user namespace?
>
> Yep did, all of them check ns_capable(current_user_ns(), CAP_SYS_ADMIN),
> see above please.
>
>> 
>> Please let's walk this through.  I think there should be a way to
>> carefully phrase the permission checks that we don't need
>> ns_capable_cred that will allow goofy cases like setns into Kuberneties
>> PODs that share network namespaces.
>
> Hm, Kubernetes doesn't use user namespace. I think I misunderstand your
> concern maybe. But see below for a suggestion.
>
>> 
>> I believe that will be a way to phrase the permission checks so that
>> with or without CLONE_NEWUSER they make sense, and give very similar
>> results.
>> 
>> Certainly attaching to a mount namespace is going to need either being
>> root or attaching to a user namespace at the same time.  Because
>> attaching to a mount namespace needs functionality that the user
>> namespace provides.
>
> So how about we add a new flag to setns()
> SETNS_NEWUSER_CRED that is only valid in
> conjunction with CLONE_NEWUSER and which allows callers to tell the
> kernel "assume the target credentials first". This way we can support
> both cases and users can specify what they want and we can clearly
> document it on the manpages.

Let's not get complicated.  Let's make this very simple.

Change "ns_capable(current_user_user_ns(), CAP_SYS_ADMIN)"
to     "ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN)".

We still use the current credentials, but we require the caller of setns
to have CAP_SYS_ADMIN in the user namespace you are going to wind up in.
That is exactly what userns_install does today.

That won't be any kind of security hole because we still require
     	"ns_capable(ns->user_ns, CAP_SYS_ADMIN)"
on all of the namespaces as well.

That way if you are sufficiently privileged you can still handle joining
namespaces that are not owned by the owner of the processes user
namespace.  So we can join weird processes.

Does that make sense?

Eric
Christian Brauner May 2, 2020, 3:13 p.m.
On Sat, May 02, 2020 at 09:52:03AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
> 
> > On Sat, May 02, 2020 at 07:35:53AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >> 
> >> > On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@ubuntu.com> writes:
> >> >> 
> >> >> > Add a simple capability helper which makes it possible to determine
> >> >> > whether a set of creds is ns capable wrt to the passed in credentials.
> >> >> > This is not something exciting it's just a more pleasant wrapper around
> >> >> > security_capable() by allowing ns_capable_common() to ake a const struct
> >> >> > cred argument. In ptrace_has_cap() for example, we're using
> >> >> > security_capable() directly. ns_capable_cred() will be used in the next
> >> >> > patch to check against the target credentials the caller is going to
> >> >> > switch to.
> >> >> 
> >> >> Given that this is to suppot setns.  I don't understand the
> >> >> justification for this.
> >> >> 
> >> >> Is it your intention to use the reduced permissions that you get
> >> >> when you install a user namespace?
> >> >
> >> > Indeed.
> >> >
> >> >> 
> >> >> Why do you want to use the reduced permissions when installing multiple
> >> >> namespaces at once?
> >> >
> >> > The intention is to use the target credentials we are going to install
> >> > when setns() hits the point of no return. The target permissions are
> >> > either the permissions of the caller or the reduced permissions if
> >> > CLONE_NEWUSER has been requested. This has multiple reasons.
> >> >
> >> > The most obvious reason imho is that all other namespaces have an owning
> >> > user namespace. Attaching to any other namespace requires the attacher
> >> > to be privileged over the owning user namespace of that namespace.
> >> > Consequently, all our current install handlers for every namespace we
> >> > have check whether we are privileged in the owning user namespace of
> >> > that user namespace. So in order to attach to any of those namespaces -
> >> > especially when attaching as an unprivileged user - requires that we are
> >> > attached to the user namespace first.
> >> 
> >> No actually it doesn't.  As if you have privileges to attach to the user
> >> namespace you have the privileges to attach to anything it owns.  Or you
> >> should I think I am missing some subtle detail at the moment.
> >
> > Sorry, I phrased that very misleadingly. What I'm talking about should
> > be obvious in the second patch:
> >
> > -static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
> > +static int utsns_install(struct nsset *nsset, struct ns_common *new)
> >  {
> > +       struct nsproxy *nsproxy = nsset->nsproxy;
> >         struct uts_namespace *ns = to_uts_ns(new);
> >
> > -       if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > -           !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > +       if (!ns_capable_cred(nsset->cred, ns->user_ns, CAP_SYS_ADMIN) ||
> > +           !ns_capable_cred(nsset->cred, nsset->cred->user_ns, CAP_SYS_ADMIN))
> >                 return -EPERM;
> >
> > All our current install handlers check that we're privileged wrt to our
> > _current_ user namespace, i.e. they all have a
> > ns_capable(current_user_ns(), CAP_SYS_ADMIN)
> > line. So if we specify
> > setns(CLONE_NEWUSER | CLONE_NEW*)
> > as an unprivileged user we aren't able to attach to any other namespace
> > unless we check against the target credentials.
> >
> >> 
> >> 
> >> > (That's especially useful given
> >> > that most users especially container runtimes will unshare all
> >> > namespaces. Doing it this way we can not just have attach privileged
> >> > users attach to their containers but also unprivileged users to their
> >> > containers in one shot.)
> >> 
> >> That is a wonderful reason for doing things, and it is the reason
> >> why I am asking about it because I think you have it backwards.
> >> 
> >> Especially in the context of some container runtimes like Kubernetes
> >> that I have been told will do things like share a network namespace
> >> across all containers in a POD.
> >
> > Kubernetes doesn't use user namespace at all so they need to run as root
> > anyway so that's not a problem. And if we're talking about scenarios
> > where some namespaces are shared and other's aren't then there'll be
> > batch-attaching any, i.e. not all at once but some at once. So I don't
> > think this is a good argument.
> >
> >> 
> >> > A few other points about this. If one looks at clone(CLONE_NEW*) or
> >> > unshare(CLONE_NEW*) then the ordering and permissions checking is the
> >> > same way. All permissions checks are performed against the reduced
> >> > permissions, i.e. if CLONE_NEWUSER is specified you check privilege
> >> > against the reduced permissions too otherwise you wouldn't be able to
> >> > spawn into a complete set of new namespaces as an unprivileged user.
> >> 
> >> That is a good catch and definitely a reason for looking at doing
> >> things in this order.
> >> 
> >> For unshare and clone putting things in a user namespace means you can
> >> create namespaces you could not create otherwise.
> >> 
> >> 
> >> > This logic is also expressed in how setns() is already used in
> >> > userspace. Any container runtime will attach to the user namespace first,
> >> > so all subsequent calls to attach to other namespaces perform the checks
> >> > against the reduced permissions. It also has to be that way because of
> >> > fully unprivileged containers.
> >> 
> >> So I sat and looked.  For nsetner it winds up trying to enter
> >> the namespaces in either order.
> >> 
> >>         /*
> >>          * Now that we know which namespaces we want to enter, enter
> >>          * them.  Do this in two passes, not entering the user
> >>          * namespace on the first pass.  So if we're deprivileging the
> >>          * container we'll enter the user namespace last and if we're
> >>          * privileging it then we enter the user namespace first
> >>          * (because the initial setns will fail).
> >>          */
> >>         for (pass = 0; pass < 2; pass ++) {
> >>                 for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) {
> >>                         if (nsfile->fd < 0)
> >>                                 continue;
> >>                         if (nsfile->nstype == CLONE_NEWPID && do_fork == -1)
> >>                                 do_fork = 1;
> >>                         if (setns(nsfile->fd, nsfile->nstype)) {
> >>                                 if (pass != 0)
> >>                                         err(EXIT_FAILURE,
> >>                                             _("reassociate to namespace '%s' failed"),
> >>                                             nsfile->name);
> >>                                 else
> >>                                         continue;
> >>                         }
> >> 
> >>                         close(nsfile->fd);
> >>                         nsfile->fd = -1;
> >>                 }
> >>         }
> >> 
> >> 
> >> Looking a little close we have at least for entering the mntns the
> >> following checks:
> >> 
> >> 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> >> 	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> >> 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> >> 		return -EPERM;
> >> 
> >> Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace.
> >> 
> >> So there is defintiely an issue there.
> >> 
> >> I suspect the clean approach is to simply require CAP_SYS_CHROOT in
> >> the new user namespace, when we are setting several of them at once.
> >
> > See my example above. All install handlers check for
> > ns_capable(current_user_ns(), CAP_SYS_ADMIN).
> >
> >
> > static int ipcns_install(struct nsproxy *nsproxy, struct ns_common *new)
> > {
> > 	struct ipc_namespace *ns = to_ipc_ns(new);
> > 	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > 		return -EPERM;
> >
> >
> > static int netns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > 	struct net *net = to_net_ns(ns);
> >
> > 	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
> > 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > 		return -EPERM;
> >
> > static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > 	struct fs_struct *fs = current->fs;
> > 	struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
> > 	struct path root;
> > 	int err;
> >
> > 	if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> > 	    !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> > 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > 		return -EPERM;
> >
> > static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > 	struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
> >
> > 	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> > 	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> > 		return -EPERM;
> >
> > static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
> > {
> > 	struct uts_namespace *ns = to_uts_ns(new);
> >
> > 	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > 		return -EPERM;
> >
> > static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > 	struct pid_namespace *active = task_active_pid_ns(current);
> > 	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
> >
> > 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
> > 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > 		return -EPERM;
> >
> >> 
> >> > To put it another way, if we were to always perform the permission checks
> >> > against the current permissions (i.e. no matter if CLONE_NEWUSER is
> >> > specified or not) then we'd never be able to attach to a set of
> >> > namespaces at once as an unprivileged user.
> >> > We also really want to be able to express both semantics:
> >> > 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
> >> >    current permission level
> >> > 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
> >> >    permission level
> >> > It feels weird if both 1 and 2 would mean the exact same thing given
> >> > that the user namespace has an owernship relation with all the other
> >> > namespaces.
> >> 
> >> It feels weird to me to disallow anything that we have permissions for.
> >> 
> >> Can you dig up the actual install permissions checks and see if there is
> >> anything other than the mount namespace that needs permissions in the
> >> current user namespace?
> >
> > Yep did, all of them check ns_capable(current_user_ns(), CAP_SYS_ADMIN),
> > see above please.
> >
> >> 
> >> Please let's walk this through.  I think there should be a way to
> >> carefully phrase the permission checks that we don't need
> >> ns_capable_cred that will allow goofy cases like setns into Kuberneties
> >> PODs that share network namespaces.
> >
> > Hm, Kubernetes doesn't use user namespace. I think I misunderstand your
> > concern maybe. But see below for a suggestion.
> >
> >> 
> >> I believe that will be a way to phrase the permission checks so that
> >> with or without CLONE_NEWUSER they make sense, and give very similar
> >> results.
> >> 
> >> Certainly attaching to a mount namespace is going to need either being
> >> root or attaching to a user namespace at the same time.  Because
> >> attaching to a mount namespace needs functionality that the user
> >> namespace provides.
> >
> > So how about we add a new flag to setns()
> > SETNS_NEWUSER_CRED that is only valid in
> > conjunction with CLONE_NEWUSER and which allows callers to tell the
> > kernel "assume the target credentials first". This way we can support
> > both cases and users can specify what they want and we can clearly
> > document it on the manpages.
> 
> Let's not get complicated.  Let's make this very simple.

I'd call it "flexible". :)

> 
> Change "ns_capable(current_user_user_ns(), CAP_SYS_ADMIN)"
> to     "ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN)".
> 
> We still use the current credentials, but we require the caller of setns
> to have CAP_SYS_ADMIN in the user namespace you are going to wind up in.
> That is exactly what userns_install does today.
> 
> That won't be any kind of security hole because we still require
>      	"ns_capable(ns->user_ns, CAP_SYS_ADMIN)"
> on all of the namespaces as well.
> 
> That way if you are sufficiently privileged you can still handle joining
> namespaces that are not owned by the owner of the processes user
> namespace.  So we can join weird processes.
> 
> Does that make sense?

Well, it's funny that you say that since I had a version of that
patchset and I still have it at setns_pidfd_v2_wip_v5. I justed pushed
it to
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=setns_pidfd_v2_wip_v5
The reason I refrained from doing this was fear of introducing a
security hole but since we agree that this would be fine let's give it a
go.
In any case, I think I never ran the test-suite I added on that version.
Let me plug this in and test this.

Christian