[RHEL8,COMMIT] ms/Revert "vfs: Allow userns root to call mknod on owned filesystems."

Submitted by Konstantin Khorenko on Jan. 22, 2020, 2:52 p.m.


Message ID 202001221452.00MEqTWG023478@finist_co8.work.ct
State New
Series "ve/fs: allow to mknod inside a Container"
Headers show

Commit Message

Konstantin Khorenko Jan. 22, 2020, 2:52 p.m.
The commit is pushed to "branch-rh8-4.18.0-80.1.2.vz8.3.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-80.1.2.vz8.3.1
commit 6fa346a6cacc2ecd78be636a972b079ef1f27999
Author: Christian Brauner <christian@brauner.io>
Date:   Thu Jul 5 17:51:20 2018 +0200

    ms/Revert "vfs: Allow userns root to call mknod on owned filesystems."
    This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd.
    commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
    enabled mknod() in user namespaces for userns root if CAP_MKNOD is
    available. However, these device nodes are useless since any filesystem
    mounted from a non-initial user namespace will set the SB_I_NODEV flag on
    the filesystem. Now, when a device node s created in a non-initial user
    namespace a call to open() on said device node will fail due to:
    bool may_open_dev(const struct path *path)
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                    !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    The problem with this is that as of the aforementioned commit mknod()
    creates partially functional device nodes in non-initial user namespaces.
    In particular, it has the consequence that as of the aforementioned commit
    open() will be more privileged with respect to device nodes than mknod().
    Before it was the other way around. Specifically, if mknod() succeeded
    then it was transparent for any userspace application that a fatal error
    must have occured when open() failed.
    All of this breaks multiple userspace workloads and a widespread assumption
    about how to handle mknod(). Basically, all container runtimes and systemd
    live by the slogan "ask for forgiveness not permission" when running user
    namespace workloads. For mknod() the assumption is that if the syscall
    succeeds the device nodes are useable irrespective of whether it succeeds
    in a non-initial user namespace or not. This logic was chosen explicitly
    to allow for the glorious day when mknod() will actually be able to create
    fully functional device nodes in user namespaces.
    A specific problem people are already running into when running 4.18 rc
    kernels are failing systemd services. For any distro that is run in a
    container systemd services started with the PrivateDevices= property set
    will fail to start since the device nodes in question cannot be
    opened (cf. the arguments in [1]).
    Full disclosure, Seth made the very sound argument that it is already
    possible to end up with partially functional device nodes. Any filesystem
    mounted with MS_NODEV set will allow mknod() to succeed but will not allow
    open() to succeed. The difference to the case here is that the MS_NODEV
    case is transparent to userspace since it is an explicitly set mount option
    while the SB_I_NODEV case is an implicit property enforced by the kernel
    and hence opaque to userspace.
    [1]: https://github.com/systemd/systemd/pull/9483
    Signed-off-by: Christian Brauner <christian@brauner.io>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Seth Forshee <seth.forshee@canonical.com>
    Cc: Serge Hallyn <serge@hallyn.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Warning: next patch will allow mknod in CT init userns.
    The explanation above is good, but we always had such a behavior (mknod
    succeeds but later open() can fail) and never had a problem because of
    that, so let it be the same until we face the problem.
    Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
 fs/namei.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/namei.c b/fs/namei.c
index fac616427913..07ce37de342a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3700,8 +3700,7 @@  int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 	if (error)
 		return error;
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
-	    !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
 		return -EPERM;
 	if (!dir->i_op->mknod)