Revert "vfs: Allow userns root to call mknod on owned filesystems."

Submitted by Christian Brauner on July 5, 2018, 3:51 p.m.

Details

Message ID 20180705155120.22102-1-christian@brauner.io
State New
Series "Revert "vfs: Allow userns root to call mknod on owned filesystems.""
Headers show

Commit Message

Christian Brauner July 5, 2018, 3:51 p.m.
This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.

commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:

bool may_open_dev(const struct path *path)
{
        return !(path->mnt->mnt_flags & MNT_NODEV) &&
                !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.

[1]: https://github.com/systemd/systemd/pull/9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
---
 fs/namei.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/namei.c b/fs/namei.c
index 734cef54fdf8..389e48e93542 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3711,8 +3711,7 @@  int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
-	    !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
 		return -EPERM;
 
 	if (!dir->i_op->mknod)

Comments

Eric W. Biederman July 5, 2018, 4:48 p.m.
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Your description is usesless.

It needs to detail exactly what breaks, what regressions and why.
All I see below is hand waving.

We need to know why this does not work so someone does not come in and try
this again.  Or so that someone can fix this and then try again.

You do not include that kind of information in your commit log.

Calling mknod to create device nodes can not be widespread.  There are
not that many privileged processes and calling mknod outside of being
a specialed process like udev is broken.

Therefore I refute your assertion that this is a widespread issue.


I expect somewhere there is a reasonable argument for reverting this
change on the basis that it causes a regression. You have not made it.

Until that time I am going to oppose this revert because your
justfication for the revert is lacking.


It has never been the case that mknod on a device node will guarantee
that you even can open the device node.  The applications that regress
are broken.  It doesn't mean we shouldn't be bug compatible, but we darn
well should document very clearly the bugs we are being bug compatible
with.

Eric
Christian Brauner July 5, 2018, 5:34 p.m.
On Thu, Jul 05, 2018 at 11:48:11AM -0500, Eric W. Biederman wrote:
> 
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Your description is usesless.
> 
> It needs to detail exactly what breaks, what regressions and why.
> All I see below is hand waving.
> 
> We need to know why this does not work so someone does not come in and try
> this again.  Or so that someone can fix this and then try again.
> 
> You do not include that kind of information in your commit log.

My commit log explicitly states that if you run systemd services in a
user namespace with PrivateDevices=true it will fail as soon as anything
tries to open such a device node. Before that change this worked just
fine. The commit log also leads to the related thread.
I can come up with a list of services that fail to start if that helps.

> 
> Calling mknod to create device nodes can not be widespread.  There are

Well, there are a few. There's the container runtimes (aka
systemd-nspawn, rkt, runC, LXC, LXD), udev, systemd, openrc to name just
a few.
Recently we even worked on udev being useable in user namespaces.

Please also note, that a lot of applications were also switched to
fallback to bind-mounts on mknod() permission failures since this was
the easiest and least costly way to deal with all of the LSMs, user
namespaces, capability dropping, and seccomp. They all would need way
more complex logic to decide whether to fallback to a bind-mount or not.

> not that many privileged processes and calling mknod outside of being
> a specialed process like udev is broken.
> 
> Therefore I refute your assertion that this is a widespread issue.
> 
> 
> I expect somewhere there is a reasonable argument for reverting this
> change on the basis that it causes a regression. You have not made it.

Fair enough, I can rewrite the commit message and focs on the container
workload and container runtime regressions as a clear example if that
seems a sufficient argument to you.

> 
> Until that time I am going to oppose this revert because your
> justfication for the revert is lacking.

I sympathize with you wanting a proper and thorough justification and
I'm sorry if I apparently did not provided exhaustive details.
However, I think (see above) that I've provided at least a sufficient
argument in my commit log to start a reasonable discussion about this
that doesn't end with "You're saying the kernel is broken. I'm saying
userspace is broken.".

> 
> 
> It has never been the case that mknod on a device node will guarantee
> that you even can open the device node.  The applications that regress
> are broken.  It doesn't mean we shouldn't be bug compatible, but we darn

It seems a fair assumption to me that an object you created (with the
_right permissions_) you can also interact with.
Also if I may retort, I see no good argument why the applications are
broken.

> well should document very clearly the bugs we are being bug compatible
> with.
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Eric W. Biederman July 5, 2018, 5:36 p.m.
ebiederm@xmission.com (Eric W. Biederman) writes:

> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Your description is usesless.
>
> It needs to detail exactly what breaks, what regressions and why.
> All I see below is hand waving.
>
> We need to know why this does not work so someone does not come in and try
> this again.  Or so that someone can fix this and then try again.
>
> You do not include that kind of information in your commit log.
>
> Calling mknod to create device nodes can not be widespread.  There are
> not that many privileged processes and calling mknod outside of being
> a specialed process like udev is broken.
>
> Therefore I refute your assertion that this is a widespread issue.
>
>
> I expect somewhere there is a reasonable argument for reverting this
> change on the basis that it causes a regression. You have not made it.
>
> Until that time I am going to oppose this revert because your
> justfication for the revert is lacking.
>
>
> It has never been the case that mknod on a device node will guarantee
> that you even can open the device node.  The applications that regress
> are broken.  It doesn't mean we shouldn't be bug compatible, but we darn
> well should document very clearly the bugs we are being bug compatible
> with.
>

Further from what I have seen of this issue, there is a compelling case
that what the applications that are broken what what is enabled by
allowing mknod to succeed.  So we absolutely need a good description of
what is going on, because at best a revert to fix today's breaking is
temporary until userspace gets their bugs fixed.

Eric