[RHEL8,COMMIT] mnt: allow to add a mount into an existing group

Message ID 202005081443.048Eheb1016022@finist_co8.work.ct
Commit Message

Konstantin Khorenko May 8, 2020, 2:43 p.m.
The commit is pushed to "branch-rh8-4.18.0-80.1.2.vz8.3.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-80.1.2.vz8.3.9
commit 6c0cfbbab61642b5ce6d4cefad810cdd84f9234d
Author: Andrei Vagin <avagin@openvz.org>
Date:   Fri May 8 17:43:40 2020 +0300

    mnt: allow to add a mount into an existing group
    Currently a shared group can only be inherited from a source mount.
    This patch adds the ability to add a mount into an existing shared
    group.

    mount(source, target, NULL, MS_SET_GROUP, NULL)

    mount() with the MS_SET_GROUP flag adds the "target" mount into a group
    of the "source" mount. The calling process has to have the CAP_SYS_ADMIN
    capability in namespaces of these mounts. The source and the target
    mounts have to have the same super block.
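
    A minimal user-space sketch of the call described above. MS_SET_GROUP
    is not defined by the standard libc headers; the value (1 << 26) is
    taken from the include/uapi/linux/fs.h hunk of this patch, and the
    helper name set_mount_group() is hypothetical. The call can only
    succeed on a kernel that carries this patch and for a caller that
    passes its permission checks.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Assumption: not in libc headers, value copied from this patch's uapi change. */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1 << 26)
#endif

/*
 * Hypothetical helper: ask the kernel to add the mount at "target" to the
 * propagation group of the mount at "source".
 * Returns 0 on success, or -1 with errno set, like mount(2).
 */
static int set_mount_group(const char *source, const char *target)
{
	if (mount(source, target, NULL, MS_SET_GROUP, NULL) < 0) {
		int saved_errno = errno;	/* fprintf() may clobber errno */

		fprintf(stderr, "MS_SET_GROUP %s -> %s: %s\n",
			source, target, strerror(saved_errno));
		errno = saved_errno;
		return -1;
	}
	return 0;
}
```

    Both arguments are expected to name mount points, since the kernel
    side rejects paths that are not the root of their mount.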
    This new functionality together with "mnt: Tuck mounts under others
    instead of creating shadow/side mounts." allows CRIU to dump and restore
    any set of mount namespaces.
    Currently we have a lot of issues with dumping and restoring mount
    namespaces. The biggest problem is that we can't construct mount trees
    directly, for several reasons:
    * groups can't be set, they can only be inherited
    * file systems have to be mounted from the specified user namespaces
    * the mount() syscall doesn't just create one mount -- the mount is
      also propagated to all members of a parent group
    * umount() doesn't detach mounts from all members of a group
      (mounts with children are not umounted)
    * mounts are propagated underneath existing mounts
    * mount() doesn't allow bind-mounts between two namespaces
    * processes can have open file descriptors to overmounted files
    All these operations are non-trivial, making the task of restoring
    a mount namespace practically unsolvable in reasonable time. The
    proposed change makes it possible to restore a mount namespace in a
    direct manner, without any overly complex logic.
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrei Vagin <avagin@openvz.org>
    The patch has been pending on LKML for a long time without much review:
    But with it we can implement correct mounts restore in vzcriu much
    Add some restrictions: a) prohibit setting a group on a dentry that is
    not the mount root; b) prohibit a destination mount that is not in the
    current mntns; c) only the super or pseudosuper ve can set a group.
    Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
    > Is it OK to have the flag's semantics overloaded?
    1) do_mount is only called from the syscall:
    2) previously MS_SUBMOUNT was explicitly ignored in vz7 in do_mount
    because it is a kernel-internal flag:
    in ms and vz8 it is a bit more complex but still ignored, because it is
    a kernel-internal flag and userspace can't set it.
    If we add MS_SET_GROUP with the same number as MS_SUBMOUNT but only check
    it in do_mount, where it was previously ignored, it looks OK to me.
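
    The dispatch order this argument relies on can be sketched in user
    space as below. This is a simplified model, not the kernel code:
    classify_mount_flags() and enum mount_op are hypothetical names, the
    MS_REMOUNT and MS_BIND branches are assumed from the usual do_mount()
    layout and are not visible in this diff, and MS_SET_GROUP is defined
    locally with the value from this patch.

```c
#include <sys/mount.h>

/* Assumption: value from this patch, the same bit as kernel-internal MS_SUBMOUNT. */
#ifndef MS_SET_GROUP
#define MS_SET_GROUP (1 << 26)
#endif

enum mount_op {
	OP_REMOUNT,
	OP_BIND,
	OP_CHANGE_TYPE,
	OP_MOVE,
	OP_SET_GROUP,
	OP_NEW_MOUNT,
};

/*
 * Model of the if/else chain in do_mount(): the first matching command
 * wins, so bit 26 is only interpreted as MS_SET_GROUP when no earlier
 * command flag is present.
 */
static enum mount_op classify_mount_flags(unsigned long flags)
{
	if (flags & MS_REMOUNT)
		return OP_REMOUNT;
	if (flags & MS_BIND)
		return OP_BIND;
	if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
		return OP_CHANGE_TYPE;
	if (flags & MS_MOVE)
		return OP_MOVE;
	if (flags & MS_SET_GROUP)
		return OP_SET_GROUP;
	return OP_NEW_MOUNT;
}
```

    In this model a request carrying both MS_MOVE and MS_SET_GROUP is
    treated as a move, and a request with none of the command bits falls
    through to a new mount.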
 fs/namespace.c          | 65 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  6 +++++
 2 files changed, 71 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index b06fdd118629..2bc53000c026 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2324,6 +2324,69 @@  static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
+static int do_set_group(struct path *path, const char *sibling_name)
+{
+	struct ve_struct *ve = get_exec_env();
+	struct mount *sibling, *mnt;
+	struct path sibling_path;
+	int err;
+
+	if (!ve_is_super(ve) && !ve->is_pseudosuper)
+		return -EPERM;
+
+	if (!sibling_name || !*sibling_name)
+		return -EINVAL;
+
+	if (path->dentry != path->mnt->mnt_root)
+		return -EINVAL;
+
+	err = kern_path(sibling_name, LOOKUP_FOLLOW, &sibling_path);
+	if (err)
+		return err;
+
+	err = -EINVAL;
+	if (sibling_path.dentry != sibling_path.mnt->mnt_root)
+		goto out_put;
+
+	sibling = real_mount(sibling_path.mnt);
+	mnt = real_mount(path->mnt);
+
+	if (!check_mnt(mnt))
+		goto out_put;
+
+	namespace_lock();
+
+	err = -EPERM;
+	if (!sibling->mnt_ns ||
+	    !ns_capable(sibling->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out_unlock;
+
+	err = -EINVAL;
+	if (sibling->mnt.mnt_sb != mnt->mnt.mnt_sb)
+		goto out_unlock;
+
+	if (IS_MNT_SHARED(mnt) || IS_MNT_SLAVE(mnt))
+		goto out_unlock;
+
+	if (IS_MNT_SLAVE(sibling)) {
+		list_add(&mnt->mnt_slave, &sibling->mnt_slave);
+		mnt->mnt_master = sibling->mnt_master;
+	}
+
+	if (IS_MNT_SHARED(sibling)) {
+		mnt->mnt_group_id = sibling->mnt_group_id;
+		list_add(&mnt->mnt_share, &sibling->mnt_share);
+		set_mnt_shared(mnt);
+	}
+
+	err = 0;
+out_unlock:
+	namespace_unlock();
+out_put:
+	path_put(&sibling_path);
+	return err;
+}
+
 static int do_move_mount(struct path *path, const char *old_name)
 {
 	struct path old_path, parent_path;
@@ -2810,6 +2873,8 @@  long do_mount(const char *dev_name, const char __user *dir_name,
 		retval = do_change_type(&path, flags);
 	else if (flags & MS_MOVE)
 		retval = do_move_mount(&path, dev_name);
+	else if (flags & MS_SET_GROUP)
+		retval = do_set_group(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
 				      dev_name, data_page);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 876c308d57c0..699ad890ac76 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -132,6 +132,12 @@  struct inodes_stat_t {
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
+/*
+ * Here are commands and flags. Commands are handled in do_mount()
+ * and can intersect with kernel internal flags.
+ */
+#define MS_SET_GROUP	(1<<26) /* Add a mount into a shared group */
+
 /* These sb flags are internal to the kernel */
 #define MS_SUBMOUNT     (1<<26)
 #define MS_NOREMOTELOCK	(1<<27)
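
The propagation bookkeeping that do_set_group() performs in the fs/namespace.c hunk can be modelled in user space as below. This is only a sketch under stated assumptions: struct model_mount and model_set_group() are hypothetical names, integer fields stand in for the kernel's super block pointer and peer-group state, and the peer and slave list manipulation, locking, VE, namespace and capability checks are left out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the few struct mount fields the patch touches. */
struct model_mount {
	int sb_id;			/* models mnt.mnt_sb */
	int group_id;			/* models mnt_group_id, 0 means none */
	int shared;			/* models IS_MNT_SHARED() */
	struct model_mount *master;	/* models mnt_master, NULL means not a slave */
};

/*
 * Model of the state changes in do_set_group(): the target must share the
 * sibling's super block and must not already be shared or a slave; it then
 * copies the sibling's master and peer group.
 */
static int model_set_group(struct model_mount *mnt,
			   const struct model_mount *sibling)
{
	if (sibling->sb_id != mnt->sb_id)
		return -EINVAL;
	if (mnt->shared || mnt->master)
		return -EINVAL;

	if (sibling->master)
		mnt->master = sibling->master;

	if (sibling->shared) {
		mnt->group_id = sibling->group_id;
		mnt->shared = 1;
	}
	return 0;
}
```

In this model a second call on the same target fails with -EINVAL, mirroring the IS_MNT_SHARED()/IS_MNT_SLAVE() check on the target in the patch.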