[RFC,6/6] shiftfs: support nested shiftfs mounts

Submitted by Seth Forshee on Nov. 1, 2018, 9:48 p.m.

Details

Message ID 20181101214856.4563-7-seth.forshee@canonical.com
State New
Series "shiftfs fixes and enhancements"
Headers show

Commit Message

Seth Forshee Nov. 1, 2018, 9:48 p.m.
shiftfs mounts cannot be nested for two reasons -- global
CAP_SYS_ADMIN is required to set up a mark mount, and a single
functional shiftfs mount meets the filesystem stacking depth
limit.

The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
ids in a mount must be within that mount's s_user_ns, so all that
is needed is CAP_SYS_ADMIN within that s_user_ns.

The stack depth issue can be worked around with a couple of
adjustments. First, a mark mount doesn't really need to count
against the stacking depth as it doesn't contribute to the call
stack depth during filesystem operations. Therefore the mount
over the mark mount only needs to count as one more than the
lower filesystems stack depth.

Second, when the lower mount is shiftfs we can just skip down to
that mount's lower filesystem and shift ids relative to that.
There is no reason to shift ids twice, and the lower path has
already been marked safe for id shifting by a user privileged
towards all ids in that mount's user ns.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
---
 fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 46 insertions(+), 22 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index b19af7b2fe75..008ace2842b9 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -930,7 +930,7 @@  static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
 	struct shiftfs_data *data = raw_data;
 	char *name = kstrdup(data->path, GFP_KERNEL);
 	int err = -ENOMEM;
-	struct shiftfs_super_info *ssi = NULL;
+	struct shiftfs_super_info *ssi = NULL, *mp_ssi;
 	struct path path;
 	struct dentry *dentry;
 
@@ -946,11 +946,7 @@  static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
 	if (err)
 		goto out;
 
-	/* to mark a mount point, must be real root */
-	if (ssi->mark && !capable(CAP_SYS_ADMIN))
-		goto out;
-
-	/* else to mount a mark, must be userns admin */
+	/* to mount a mark, must be userns admin */
 	if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
 		goto out;
 
@@ -962,41 +958,66 @@  static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
 
 	if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
 		err = -ENOTDIR;
-		goto out_put;
-	}
-
-	sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
-	if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
-		printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
-		err = -EINVAL;
-		goto out_put;
+		goto out_put_path;
 	}
 
 	if (ssi->mark) {
+		struct super_block *lower_sb = path.mnt->mnt_sb;
+
+		/* to mark a mount point, must root wrt lower s_user_ns */
+		if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
+			goto out_put_path;
+
+
 		/*
 		 * this part is visible unshifted, so make sure no
 		 * executables that could be used to give suid
 		 * privileges
 		 */
 		sb->s_iflags = SB_I_NOEXEC;
-		ssi->mnt = path.mnt;
-		dentry = path.dentry;
-	} else {
-		struct shiftfs_super_info *mp_ssi;
 
+		/*
+		 * Handle nesting of shiftfs mounts by referring this mark
+		 * mount back to the original mark mount. This is more
+		 * efficient and alleviates concerns about stack depth.
+		 */
+		if (lower_sb->s_magic == SHIFTFS_MAGIC) {
+			mp_ssi = lower_sb->s_fs_info;
+
+			/* Doesn't make sense to mark a mark mount */
+			if (mp_ssi->mark) {
+				err = -EINVAL;
+				goto out_put_path;
+			}
+
+			ssi->mnt = mntget(mp_ssi->mnt);
+			dentry = dget(path.dentry->d_fsdata);
+		} else {
+			ssi->mnt = mntget(path.mnt);
+			dentry = dget(path.dentry);
+		}
+	} else {
 		/*
 		 * this leg executes if we're admin capable in
 		 * the namespace, so be very careful
 		 */
 		if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC)
-			goto out_put;
+			goto out_put_path;
 		mp_ssi = path.dentry->d_sb->s_fs_info;
 		if (!mp_ssi->mark)
-			goto out_put;
+			goto out_put_path;
 		ssi->mnt = mntget(mp_ssi->mnt);
 		dentry = dget(path.dentry->d_fsdata);
-		path_put(&path);
 	}
+
+	sb->s_stack_depth = dentry->d_sb->s_stack_depth + 1;
+	if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+		printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
+		err = -EINVAL;
+		goto out_put_mnt;
+	}
+
+	path_put(&path);
 	ssi->userns = get_user_ns(dentry->d_sb->s_user_ns);
 	sb->s_fs_info = ssi;
 	sb->s_magic = SHIFTFS_MAGIC;
@@ -1009,7 +1030,10 @@  static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
 
 	return 0;
 
- out_put:
+ out_put_mnt:
+	mntput(ssi->mnt);
+	dput(dentry);
+ out_put_path:
 	path_put(&path);
  out:
 	kfree(name);

Comments

Amir Goldstein Nov. 2, 2018, 10:02 a.m.
On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
>
> shiftfs mounts cannot be nested for two reasons -- global
> CAP_SYS_ADMIN is required to set up a mark mount, and a single
> functional shiftfs mount meets the filesystem stacking depth
> limit.
>
> The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> ids in a mount must be within that mount's s_user_ns, so all that
> is needed is CAP_SYS_ADMIN within that s_user_ns.
>
> The stack depth issue can be worked around with a couple of
> adjustments. First, a mark mount doesn't really need to count
> against the stacking depth as it doesn't contribute to the call
> stack depth during filesystem operations. Therefore the mount
> over the mark mount only needs to count as one more than the
> lower filesystems stack depth.

That's true, but it also highlights the point that the "mark" sb is
completely unneeded and it really is just a nice little hack.
All the information it really stores is a lower mount reference,
a lower root dentry and a declarative statement "I am shiftable!".

Come to think about it, "container shiftable" really isn't that different from
NFS export with "no_root_squash" and auto mounted USB drive.
I mean the shifting itself is different of course, but the
declaration, not so much.
If I am allowing sudoers on another machine to mess with root owned
files visible
on my machine, I am pretty much have the same issues as container admins
accessing root owned files on my init_user_ns filesystem. In all those cases,
I'd better not be executing suid binaries from the untrusted "external" source.

Instead of mounting a dummy filesystem to make the declaration, you could
get the same thing with:
   mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
or whatnot)  constant to uapi and manage to come up good man page description.

Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
avoid the extra bind mount step (for a full filesystem tree export).
Declaring a mounted image MS_EXTERN has merits on its own even without
containers and shitfs, for example with pluggable storage. Other LSMs could make
good use of that declaration.

>
> Second, when the lower mount is shiftfs we can just skip down to
> that mount's lower filesystem and shift ids relative to that.
> There is no reason to shift ids twice, and the lower path has
> already been marked safe for id shifting by a user privileged
> towards all ids in that mount's user ns.
>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> ---
>  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> index b19af7b2fe75..008ace2842b9 100644
> --- a/fs/shiftfs.c
> +++ b/fs/shiftfs.c
> @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         struct shiftfs_data *data = raw_data;
>         char *name = kstrdup(data->path, GFP_KERNEL);
>         int err = -ENOMEM;
> -       struct shiftfs_super_info *ssi = NULL;
> +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
>         struct path path;
>         struct dentry *dentry;
>
> @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>         if (err)
>                 goto out;
>
> -       /* to mark a mount point, must be real root */
> -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> -               goto out;
> -
> -       /* else to mount a mark, must be userns admin */
> +       /* to mount a mark, must be userns admin */
>         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
>                 goto out;

Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

>
> @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
>
>         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
>                 err = -ENOTDIR;
> -               goto out_put;
> -       }
> -
> -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> -               err = -EINVAL;
> -               goto out_put;
> +               goto out_put_path;
>         }
>
>         if (ssi->mark) {
> +               struct super_block *lower_sb = path.mnt->mnt_sb;
> +
> +               /* to mark a mount point, must root wrt lower s_user_ns */
> +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> +                       goto out_put_path;
> +
> +
>                 /*
>                  * this part is visible unshifted, so make sure no
>                  * executables that could be used to give suid
>                  * privileges
>                  */
>                 sb->s_iflags = SB_I_NOEXEC;

As commented on cover letter, why allow access to any files besides root at all?
In fact, the only justification for a dummy sb (instead of bind mount with
MS_EXTERN flag) would be in order to override inode operations with noop ops
to prevent access to unshifted files from within container.

Thanks,
Amir.
Seth Forshee Nov. 2, 2018, 12:44 p.m.
On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> >
> > shiftfs mounts cannot be nested for two reasons -- global
> > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > functional shiftfs mount meets the filesystem stacking depth
> > limit.
> >
> > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > ids in a mount must be within that mount's s_user_ns, so all that
> > is needed is CAP_SYS_ADMIN within that s_user_ns.
> >
> > The stack depth issue can be worked around with a couple of
> > adjustments. First, a mark mount doesn't really need to count
> > against the stacking depth as it doesn't contribute to the call
> > stack depth during filesystem operations. Therefore the mount
> > over the mark mount only needs to count as one more than the
> > lower filesystems stack depth.
> 
> That's true, but it also highlights the point that the "mark" sb is
> completely unneeded and it really is just a nice little hack.
> All the information it really stores is a lower mount reference,
> a lower root dentry and a declarative statement "I am shiftable!".

Seems I should have saved some of the things I said in my previous
response for this one. As you no doubt gleaned from that email, I agree
with this.

> Come to think about it, "container shiftable" really isn't that different from
> NFS export with "no_root_squash" and auto mounted USB drive.
> I mean the shifting itself is different of course, but the
> declaration, not so much.
> If I am allowing sudoers on another machine to mess with root owned
> files visible
> on my machine, I am pretty much have the same issues as container admins
> accessing root owned files on my init_user_ns filesystem. In all those cases,
> I'd better not be executing suid binaries from the untrusted "external" source.
> 
> Instead of mounting a dummy filesystem to make the declaration, you could
> get the same thing with:
>    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> or whatnot)  constant to uapi and manage to come up good man page description.
> 
> Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> avoid the extra bind mount step (for a full filesystem tree export).
> Declaring a mounted image MS_EXTERN has merits on its own even without
> containers and shitfs, for example with pluggable storage. Other LSMs could make
> good use of that declaration.

I'm missing how we figure out the target user ns in this scheme. We need
a context with privileges towards the source path's s_user_ns to say
it's okay to shift ids for the files under the source path, and then we
need a target user ns for the id shifts. Currently the target is
current_user_ns when the final shiftfs mount is created.

So, how do we determine the target s_user_ns in your scheme?

> 
> >
> > Second, when the lower mount is shiftfs we can just skip down to
> > that mount's lower filesystem and shift ids relative to that.
> > There is no reason to shift ids twice, and the lower path has
> > already been marked safe for id shifting by a user privileged
> > towards all ids in that mount's user ns.
> >
> > Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> > ---
> >  fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 46 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> > index b19af7b2fe75..008ace2842b9 100644
> > --- a/fs/shiftfs.c
> > +++ b/fs/shiftfs.c
> > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >         struct shiftfs_data *data = raw_data;
> >         char *name = kstrdup(data->path, GFP_KERNEL);
> >         int err = -ENOMEM;
> > -       struct shiftfs_super_info *ssi = NULL;
> > +       struct shiftfs_super_info *ssi = NULL, *mp_ssi;
> >         struct path path;
> >         struct dentry *dentry;
> >
> > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >         if (err)
> >                 goto out;
> >
> > -       /* to mark a mount point, must be real root */
> > -       if (ssi->mark && !capable(CAP_SYS_ADMIN))
> > -               goto out;
> > -
> > -       /* else to mount a mark, must be userns admin */
> > +       /* to mount a mark, must be userns admin */
> >         if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> >                 goto out;
> 
> Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget()

Yeah, I noticed that too. I left it in for the moment to emphasize the
change I was making, but it can be removed.

> 
> >
> > @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data,
> >
> >         if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
> >                 err = -ENOTDIR;
> > -               goto out_put;
> > -       }
> > -
> > -       sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
> > -       if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > -               printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
> > -               err = -EINVAL;
> > -               goto out_put;
> > +               goto out_put_path;
> >         }
> >
> >         if (ssi->mark) {
> > +               struct super_block *lower_sb = path.mnt->mnt_sb;
> > +
> > +               /* to mark a mount point, must root wrt lower s_user_ns */
> > +               if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
> > +                       goto out_put_path;
> > +
> > +
> >                 /*
> >                  * this part is visible unshifted, so make sure no
> >                  * executables that could be used to give suid
> >                  * privileges
> >                  */
> >                 sb->s_iflags = SB_I_NOEXEC;
> 
> As commented on cover letter, why allow access to any files besides root at all?
> In fact, the only justification for a dummy sb (instead of bind mount with
> MS_EXTERN flag) would be in order to override inode operations with noop ops
> to prevent access to unshifted files from within container.

Summarizing my response to the other message, if the mark mount is kept
(and I would prefer that it isn't kept) that seems reasonable to me.

Thanks,
Seth
Amir Goldstein Nov. 2, 2018, 1:16 p.m.
On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <seth.forshee@canonical.com> wrote:
>
> On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> > >
> > > shiftfs mounts cannot be nested for two reasons -- global
> > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > functional shiftfs mount meets the filesystem stacking depth
> > > limit.
> > >
> > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > ids in a mount must be within that mount's s_user_ns, so all that
> > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > >
> > > The stack depth issue can be worked around with a couple of
> > > adjustments. First, a mark mount doesn't really need to count
> > > against the stacking depth as it doesn't contribute to the call
> > > stack depth during filesystem operations. Therefore the mount
> > > over the mark mount only needs to count as one more than the
> > > lower filesystems stack depth.
> >
> > That's true, but it also highlights the point that the "mark" sb is
> > completely unneeded and it really is just a nice little hack.
> > All the information it really stores is a lower mount reference,
> > a lower root dentry and a declarative statement "I am shiftable!".
>
> Seems I should have saved some of the things I said in my previous
> response for this one. As you no doubt gleaned from that email, I agree
> with this.
>
> > Come to think about it, "container shiftable" really isn't that different from
> > NFS export with "no_root_squash" and auto mounted USB drive.
> > I mean the shifting itself is different of course, but the
> > declaration, not so much.
> > If I am allowing sudoers on another machine to mess with root owned
> > files visible
> > on my machine, I am pretty much have the same issues as container admins
> > accessing root owned files on my init_user_ns filesystem. In all those cases,
> > I'd better not be executing suid binaries from the untrusted "external" source.
> >
> > Instead of mounting a dummy filesystem to make the declaration, you could
> > get the same thing with:
> >    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > or whatnot)  constant to uapi and manage to come up good man page description.
> >
> > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > avoid the extra bind mount step (for a full filesystem tree export).
> > Declaring a mounted image MS_EXTERN has merits on its own even without
> > containers and shitfs, for example with pluggable storage. Other LSMs could make
> > good use of that declaration.
>
> I'm missing how we figure out the target user ns in this scheme. We need
> a context with privileges towards the source path's s_user_ns to say
> it's okay to shift ids for the files under the source path, and then we
> need a target user ns for the id shifts. Currently the target is
> current_user_ns when the final shiftfs mount is created.
>
> So, how do we determine the target s_user_ns in your scheme?
>

Unless I am completely misunderstanding shiftfs, I think we are saying the
same thing. You said you wish to get rid of the "mark" fs and that you had
a POC of implementing the "mark" using xattr.
I'm just saying another option to implement the mark is using a super block
flag and you get the target s_user_ns from mnt_sb.
I did miss the fact that a mount flag is not enough, so that makes the bind
mount concept fail. Unless, maybe, the mount in the container is a slave
mount and the "shiftable" mark is set on the master.
I know too little about mount ns vs. user ns to tell if any of this makes sense.

Feel free to ignore MS_EXTERN idea.

Hopefully, mount API v2 will provide the proper facility to implement mark.

Thanks,
Amir.
Seth Forshee Nov. 2, 2018, 1:47 p.m.
On Fri, Nov 02, 2018 at 03:16:05PM +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> >
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canonical.com> wrote:
> > > >
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > >
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > >
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > >
> > > That's true, but it also highlights the point that the "mark" sb is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am shiftable!".
> >
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I agree
> > with this.
> >
> > > Come to think about it, "container shiftable" really isn't that different from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container admins
> > > accessing root owned files on my init_user_ns filesystem. In all those cases,
> > > I'd better not be executing suid binaries from the untrusted "external" source.
> > >
> > > Instead of mounting a dummy filesystem to make the declaration, you could
> > > get the same thing with:
> > >    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > > or whatnot)  constant to uapi and manage to come up good man page description.
> > >
> > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even without
> > > containers and shitfs, for example with pluggable storage. Other LSMs could make
> > > good use of that declaration.
> >
> > I'm missing how we figure out the target user ns in this scheme. We need
> > a context with privileges towards the source path's s_user_ns to say
> > it's okay to shift ids for the files under the source path, and then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> >
> > So, how do we determine the target s_user_ns in your scheme?
> >
> 
> Unless I am completely misunderstanding shiftfs, I think we are saying the
> same thing. You said you wish to get rid of the "mark" fs and that you had
> a POC of implementing the "mark" using xattr.

The PoC was using filesystem contexts and (an earlier version of) the
new mount API, but yes I do think we're in agreement about the mark fs
being awkward.

> I'm just saying another option to implement the mark is using a super block
> flag and you get the target s_user_ns from mnt_sb.
> I did miss the fact that a mount flag is not enough, so that makes the bind
> mount concept fail. Unless, maybe, the mount in the container is a slave
> mount and the "shiftable" mark is set on the master.
> I know too little about mount ns vs. user ns to tell if any of this makes sense.

We need a source and target user ns for the shifting, currently they are
the lower filesystem's s_user_ns and the s_user_ns of the shiftfs fs
(which is current_user_ns at the time the real shiftfs fs is mounted). I
think doing it this way makes sense.

The problem with the slave mount idea is that s_user_ns for the shiftfs
sb is still not the target ns for the id shifting, which is going to
cause all kinds of problems. Unless all of this is done in the vfs, then
it could really be a bind mount. The shifting could be done based on the
mount namespace's user_ns and all permission checks in the vfs could
take this into account. This approach certainly has benefits and is one
of the things I want to discuss.

> Feel free to ignore MS_EXTERN idea.
> 
> Hopefully, mount API v2 will provide the proper facility to implement mark.

The approach I took was to have the source user_ns context set up an fd
that was passed to the target user_ns context. Then that fd was used to
attach the mount to the mount tree, and when that happened the target
s_user_ns was set. It worked, but there were some awkware bits. I
haven't kept up with the changes since then to know if things are any
better or worse with the current patches.

Thanks,
Seth
James Bottomley Nov. 2, 2018, 4:57 p.m.
On Fri, 2018-11-02 at 15:16 +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee <seth.forshee@canonical.c
> om> wrote:
> > 
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee <seth.forshee@canoni
> > > cal.com> wrote:
> > > > 
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > > 
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all
> > > > that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > > 
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > > 
> > > That's true, but it also highlights the point that the "mark" sb
> > > is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am
> > > shiftable!".
> > 
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I
> > agree
> > with this.
> > 
> > > Come to think about it, "container shiftable" really isn't that
> > > different from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root
> > > owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container
> > > admins
> > > accessing root owned files on my init_user_ns filesystem. In all
> > > those cases,
> > > I'd better not be executing suid binaries from the untrusted
> > > "external" source.
> > > 
> > > Instead of mounting a dummy filesystem to make the declaration,
> > > you could
> > > get the same thing with:
> > >    mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE
> > > MS_UNTRUSTED
> > > or whatnot)  constant to uapi and manage to come up good man page
> > > description.
> > > 
> > > Then users could actually mount a filesystem in init_user_ns
> > > MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree
> > > export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even
> > > without
> > > containers and shitfs, for example with pluggable storage. Other
> > > LSMs could make
> > > good use of that declaration.
> > 
> > I'm missing how we figure out the target user ns in this scheme. We
> > need
> > a context with privileges towards the source path's s_user_ns to
> > say
> > it's okay to shift ids for the files under the source path, and
> > then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> > 
> > So, how do we determine the target s_user_ns in your scheme?
> > 
> 
> Unless I am completely misunderstanding shiftfs, I think we are
> saying the same thing. You said you wish to get rid of the "mark" fs
> and that you had a POC of implementing the "mark" using xattr.

I've got one of these too ... it works nicely.

> I'm just saying another option to implement the mark is using a super
> block flag and you get the target s_user_ns from mnt_sb.

This works a lot less well because the entire mount becomes shiftable,
not just a subtree, which is suboptimal for the unprivileged use case. 
The idea for getting around this was the one Ted mentioned of attaching
properties to the vfsmount structure (he'd like to use this for case
insensitive name comparisons in subtrees), but that requires
rethreading quite a few vfs calls to take a struct path instead of a
struct dentry.

James