[RFC,0/1] mount: universally disallow mounting over symlinks

Submitted by Al Viro on Jan. 1, 2020, 3:08 a.m.

Details

Message ID 20200101030815.GA17593@ZenIV.linux.org.uk
State New
Series "mount: universally disallow mounting over symlinks"
Headers show

Commit Message

Al Viro Jan. 1, 2020, 3:08 a.m.
On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> Note, BTW, that lookup_last() (aka walk_component()) does just
> that - we only hit step_into() on LAST_NORM.  The same goes
> for do_last().  mountpoint_last() not doing the same is _not_
> intentional - it's definitely a bug.
> 
> Consider your testcase; link points to . here.  So the only
> thing you could expect from trying to follow it would be
> the directory 'link' lives in.  And you don't have it
> when you reach the fscker via /proc/self/fd/3; what happens
> instead is nd->path set to ./link (by nd_jump_link()) *AND*
> step_into() called, pushing the same ./link onto stack.
> It violates all kinds of assumptions made by fs/namei.c -
> when pushing a symlink onto stack nd->path is expected to
> contain the base directory for resolving it.
> 
> I'm fairly sure that this is the cause of at least some
> of the insanity you've caught; there always could be
> something else, of course, but this hole needs to be
> closed in any case.

... and with removal of now unused local variable, that's

mountpoint_last(): fix the treatment of LAST_BIND

step_into() should be attempted only in LAST_NORM
case, when we have the parent directory (in nd->path).
We get away with that for LAST_DOT and LOST_DOTDOT,
since those can't be symlinks, making step_init() and
equivalent of path_to_nameidata() - we do a bit of
useless work, but that's it.  For LAST_BIND (i.e.
the case when we'd just followed a procfs-style
symlink) we really can't go there - result might
be a symlink and we really can't attempt following
it.

lookup_last() and do_last() do handle that properly;
mountpoint_last() should do the same.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Patch hide | download patch | download mbox

diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..13f9f973722b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2643,7 +2643,6 @@  EXPORT_SYMBOL(user_path_at_empty);
 static int
 mountpoint_last(struct nameidata *nd)
 {
-	int error = 0;
 	struct dentry *dir = nd->path.dentry;
 	struct path path;
 
@@ -2656,10 +2655,7 @@  mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;
 
 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

Comments

Aleksa Sarai Jan. 1, 2020, 2:44 p.m.
On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> > Note, BTW, that lookup_last() (aka walk_component()) does just
> > that - we only hit step_into() on LAST_NORM.  The same goes
> > for do_last().  mountpoint_last() not doing the same is _not_
> > intentional - it's definitely a bug.
> > 
> > Consider your testcase; link points to . here.  So the only
> > thing you could expect from trying to follow it would be
> > the directory 'link' lives in.  And you don't have it
> > when you reach the fscker via /proc/self/fd/3; what happens
> > instead is nd->path set to ./link (by nd_jump_link()) *AND*
> > step_into() called, pushing the same ./link onto stack.
> > It violates all kinds of assumptions made by fs/namei.c -
> > when pushing a symlink onto stack nd->path is expected to
> > contain the base directory for resolving it.
> > 
> > I'm fairly sure that this is the cause of at least some
> > of the insanity you've caught; there always could be
> > something else, of course, but this hole needs to be
> > closed in any case.
> 
> ... and with removal of now unused local variable, that's
> 
> mountpoint_last(): fix the treatment of LAST_BIND
> 
> step_into() should be attempted only in LAST_NORM
> case, when we have the parent directory (in nd->path).
> We get away with that for LAST_DOT and LOST_DOTDOT,
> since those can't be symlinks, making step_init() and
> equivalent of path_to_nameidata() - we do a bit of
> useless work, but that's it.  For LAST_BIND (i.e.
> the case when we'd just followed a procfs-style
> symlink) we really can't go there - result might
> be a symlink and we really can't attempt following
> it.
> 
> lookup_last() and do_last() do handle that properly;
> mountpoint_last() should do the same.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Thanks, this fixes the issue for me (and also fixes another reproducer I
found -- mounting a symlink on top of itself then trying to umount it).

Reported-by: Aleksa Sarai <cyphar@cyphar.com>
Tested-by: Aleksa Sarai <cyphar@cyphar.com>

As for the original topic of bind-mounting symlinks -- given this is a
supported feature, would you be okay with me sending an updated
O_EMPTYPATH series?

> ---
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..13f9f973722b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
>  static int
>  mountpoint_last(struct nameidata *nd)
>  {
> -	int error = 0;
>  	struct dentry *dir = nd->path.dentry;
>  	struct path path;
>  
> @@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
>  	nd->flags &= ~LOOKUP_PARENT;
>  
>  	if (unlikely(nd->last_type != LAST_NORM)) {
> -		error = handle_dots(nd, nd->last_type);
> -		if (error)
> -			return error;
> -		path.dentry = dget(nd->path.dentry);
> +		return handle_dots(nd, nd->last_type);
>  	} else {
>  		path.dentry = d_lookup(dir, &nd->last);
>  		if (!path.dentry) {
Al Viro Jan. 1, 2020, 11:40 p.m.
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:

> Thanks, this fixes the issue for me (and also fixes another reproducer I
> found -- mounting a symlink on top of itself then trying to umount it).
> 
> Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> Tested-by: Aleksa Sarai <cyphar@cyphar.com>

Pushed into #fixes.

> As for the original topic of bind-mounting symlinks -- given this is a
> supported feature, would you be okay with me sending an updated
> O_EMPTYPATH series?

Post it on fsdevel; I'll need to reread it anyway to say anything useful...
Aleksa Sarai Jan. 2, 2020, 3:59 a.m.
On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> 
> > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > found -- mounting a symlink on top of itself then trying to umount it).
> > 
> > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Pushed into #fixes.

Thanks. One other thing I noticed is that umount applies to the
underlying symlink rather than the mountpoint on top. So, for example
(using the same scripts I posted in the thread):

  # ln -s /tmp/foo link
  # ./mount_to_symlink /etc/passwd link
  # umount -l link # will attempt to unmount "/tmp/foo"

Is that intentional?
Al Viro Jan. 3, 2020, 1:49 a.m.
On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > 
> > > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > > found -- mounting a symlink on top of itself then trying to umount it).
> > > 
> > > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> > 
> > Pushed into #fixes.
> 
> Thanks. One other thing I noticed is that umount applies to the
> underlying symlink rather than the mountpoint on top. So, for example
> (using the same scripts I posted in the thread):
> 
>   # ln -s /tmp/foo link
>   # ./mount_to_symlink /etc/passwd link
>   # umount -l link # will attempt to unmount "/tmp/foo"
> 
> Is that intentional?

It's a mess, again in mountpoint_last().  FWIW, at some point I proposed
to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
Linus asked for reasons deeper than my dislike of the semantics, I looked
around and hadn't spotted anything.  And there hadn't been at the time,
but when four months later umount_lookup_last() went in I failed to look
for that source of potential problems in it ;-/

I've looked at that area again now.  Aside of usual cursing at do_last()
horrors (yes, its control flow is a horror; yes, it needs serious massage;
no, it's not a good idea to get sidetracked into that right now), there
are several fun questions:
	* d_manage() and d_automount().  We almost certainly don't
want those for autofs on the final component of pathname in umount,
including the trailing symlinks.  But do we want those on usual access
via /proc/*/fd/*?  I.e. suppose somebody does open() (O_PATH or not)
in autofs; do we want ->d_manage()/->d_automount() called when
resolving /proc/self/fd/<whatever>/foo/bar?  We do not; is that
correct from autofs point of view?  I suspect that refusing to
do ->d_automount() is correct, but I don't understand ->d_manage()
purpose well enough to tell.
	* I really hope that the weird "trailing / forces automount
even in cases when we normally wouldn't trigger it" (stat /mnt/foo
vs. stat /mnt/foo/) is not meant to extend to umount.  I'd like
Ian's confirmation, though.
	* do we want ->d_manage() on following .. into overmounted
directory?  Again, autofs question...

	The minimal fix to mountpoint_last() would be to have
follow_mount() done in LAST_NORM case.  However, I'd like to understand
(and hopefully regularize) the rules for follow_mount()/follow_managed().
Additional scary question is nfsd iterplay with automount.  For nfs4
exports it's potentially interesting...

	Ian, could you comment on the autofs questions above?
I'd rather avoid doing changes in that area without your input -
it's subtle and breakage in automount-related behaviour can be
mysterious as hell.
Ian Kent Jan. 4, 2020, 4:46 a.m.
It may be a bit off-topic here but, in autofs symlinks can be used
in place of mounts. That mechanism can be used (mostly nowadays) with
amd map format maps.

If I'm using symlinks instead of mounts (where I can) I definitely
don't want these to be over mounted by a mount.

I haven't seen problems like that happening but if it did happen
that would be a bug in automount or user mis-use of some sort.

On Fri, 2020-01-03 at 01:49 +0000, Al Viro wrote:
> On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > > 
> > > > Thanks, this fixes the issue for me (and also fixes another
> > > > reproducer I
> > > > found -- mounting a symlink on top of itself then trying to
> > > > umount it).
> > > > 
> > > > Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> > > 
> > > Pushed into #fixes.
> > 
> > Thanks. One other thing I noticed is that umount applies to the
> > underlying symlink rather than the mountpoint on top. So, for
> > example
> > (using the same scripts I posted in the thread):
> > 
> >   # ln -s /tmp/foo link
> >   # ./mount_to_symlink /etc/passwd link
> >   # umount -l link # will attempt to unmount "/tmp/foo"
> > 
> > Is that intentional?
> 
> It's a mess, again in mountpoint_last().  FWIW, at some point I
> proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a
> symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I
> looked
> around and hadn't spotted anything.  And there hadn't been at the
> time,
> but when four months later umount_lookup_last() went in I failed to
> look
> for that source of potential problems in it ;-/
> 
> I've looked at that area again now.  Aside of usual cursing at
> do_last()
> horrors (yes, its control flow is a horror; yes, it needs serious
> massage;
> no, it's not a good idea to get sidetracked into that right now),
> there
> are several fun questions:
> 	* d_manage() and d_automount().  We almost certainly don't
> want those for autofs on the final component of pathname in umount,
> including the trailing symlinks.  But do we want those on usual
> access
> via /proc/*/fd/*?  I.e. suppose somebody does open() (O_PATH or not)
> in autofs; do we want ->d_manage()/->d_automount() called when
> resolving /proc/self/fd/<whatever>/foo/bar?  We do not; is that
> correct from autofs point of view?  I suspect that refusing to
> do ->d_automount() is correct, but I don't understand ->d_manage()
> purpose well enough to tell.

Yes, we don't want those on the final component of the path in
umount. The following of a symlink will give use a new path of
some sort so the rules would change to the usual ones for the
new path.

The semantics of following a symlink, be the source a proc entry
or not (I think) should always be the same. If the follow takes
us to an autofs file system (be it a trigger mount or an indirect
mount in an autofs file system) the behaviour should be that of
the autofs file system when we arrive there, from an auto-mount
POV.

The original intent of ->d_manage() was to prevent walks into an
under construction mount and that might not be as simple as mounting
a source on a mount point.

For example take the case of an automount indirect mount map entry
like this:

test    /some/path/one  server:/source/path1 \
        /some/path/two  server2:/source/path2 \
        /some/other/path server:/source/path3 \
        /some/other/path/three server:/source/path4

This entry has no mount at the root of the tree (so called root-less
multi-mount) but walks need to block when it's under construction as
the topology isn't known until the directory tree and any associated
mounts (usually trigger mounts) have been completed.

In this case it's needed to go to ref-walk mode and block until it's
done.

The other (perhaps not so obvious) use of ->d_manage() is to detect
expire to mount races. When an automount is expiring at the same time
a process (that would cause an automount) is traversing the path. The
base (I'll not say root, since the root of the expire might not be the
root of the tree) needs to block the walk until the expire is done.

These multi-mounts are meant to provide a "mount as you go" mechanism
so that only portions of the tree of mounts are mounted or expired at
any one time.

For example, the offsets in the above entry are /some/path/one,
/some/path/two, /some/other/path and /some/other/path/three.

On access to <autofs mount>/test automount is meant to mount trigger
mounts for offsets /some/path/one, /some/path/two and /some/other/path
and mount an offset trigger for /some/other/path/three into the mount
for /some/other/path when it's accessed and that might not happen
during the initial mount of the tree. The reverse being done on umount
in sub-trees of mounts when a nesting point like /some/other/path is
encountered.

But that's something of an aside because in all cases below the root
there will be an actual mount preventing walks into the tree under
nesting point mounts being constructed or expired.

Anyway, returning to the topic at hand, the answer to whether we want
->d_manage()/->d_automount() after a symlink has been followed is
yes, I think, because at that point we could be within a file system
that has automounts of some sort.

But perhaps I'm missing something about the description of the case
above ...

> 	* I really hope that the weird "trailing / forces automount
> even in cases when we normally wouldn't trigger it" (stat /mnt/foo
> vs. stat /mnt/foo/) is not meant to extend to umount.  I'd like
> Ian's confirmation, though.

I can't see any way that the trailing "/" can realte to umount.

It has always been meant to be used to trigger a mount on something
that would otherwise not be mounted and that's the only case I'm
aware of.

> 	* do we want ->d_manage() on following .. into overmounted
> directory?  Again, autofs question...

I think that amounts to asking "can the target of the ../ be in the
process of being constructed or expired at this time" and that's
probably yes. A root-less multi-mount would be one case where this
could happen (although it's not strictly an over-mounted directory).

> 
> 	The minimal fix to mountpoint_last() would be to have
> follow_mount() done in LAST_NORM case.  However, I'd like to
> understand
> (and hopefully regularize) the rules for
> follow_mount()/follow_managed().
> Additional scary question is nfsd iterplay with automount.  For nfs4
> exports it's potentially interesting...

I'm not sure about nfs (and other cross mounting file systems). The
automounting in file systems other than autofs always have a real
mount as the target (AFAIK) so there's an implied blocking that occurs
on crossing the mount point. That's always made the nfs automounting
case simpler to my thinking anyway.

The real problem with nfs automount trees is when the topology of
the exports tree changes while parts of it are in use. People that
have any idea of how nfs cross mounting (and mount dependencies in
general) work shouldn't do that but they do it and then wonder why
things go wrong ...

> 
> 	Ian, could you comment on the autofs questions above?
> I'd rather avoid doing changes in that area without your input -
> it's subtle and breakage in automount-related behaviour can be
> mysterious as hell.

Thanks for the heads up.

As always I can run tests on changes you want to do.
Fortunately that's generally worked out ok for us in the past.

Ian
Andy Lutomirski Jan. 4, 2020, 5:52 a.m.
> On Jan 1, 2020, at 11:44 PM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
> On 2020-01-01, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
>>> Note, BTW, that lookup_last() (aka walk_component()) does just
>>> that - we only hit step_into() on LAST_NORM.  The same goes
>>> for do_last().  mountpoint_last() not doing the same is _not_
>>> intentional - it's definitely a bug.
>>> 
>>> Consider your testcase; link points to . here.  So the only
>>> thing you could expect from trying to follow it would be
>>> the directory 'link' lives in.  And you don't have it
>>> when you reach the fscker via /proc/self/fd/3; what happens
>>> instead is nd->path set to ./link (by nd_jump_link()) *AND*
>>> step_into() called, pushing the same ./link onto stack.
>>> It violates all kinds of assumptions made by fs/namei.c -
>>> when pushing a symlink onto stack nd->path is expected to
>>> contain the base directory for resolving it.
>>> 
>>> I'm fairly sure that this is the cause of at least some
>>> of the insanity you've caught; there always could be
>>> something else, of course, but this hole needs to be
>>> closed in any case.
>> 
>> ... and with removal of now unused local variable, that's
>> 
>> mountpoint_last(): fix the treatment of LAST_BIND
>> 
>> step_into() should be attempted only in LAST_NORM
>> case, when we have the parent directory (in nd->path).
>> We get away with that for LAST_DOT and LOST_DOTDOT,
>> since those can't be symlinks, making step_init() and
>> equivalent of path_to_nameidata() - we do a bit of
>> useless work, but that's it.  For LAST_BIND (i.e.
>> the case when we'd just followed a procfs-style
>> symlink) we really can't go there - result might
>> be a symlink and we really can't attempt following
>> it.
>> 
>> lookup_last() and do_last() do handle that properly;
>> mountpoint_last() should do the same.
>> 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> Thanks, this fixes the issue for me (and also fixes another reproducer I
> found -- mounting a symlink on top of itself then trying to umount it).
> 
> Reported-by: Aleksa Sarai <cyphar@cyphar.com>
> Tested-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> As for the original topic of bind-mounting symlinks -- given this is a
> supported feature, would you be okay with me sending an updated
> O_EMPTYPATH series?

FWIW, I have an actual use case for mounting over a symlink: replacing /etc/resolv.conf.  My virtme tool is presented with somewhat arbitrary crud in /etc, where /etc/resolv.conf might be a plain file or a symlink, but, regardless, has inappropriate contents. If it’s a file, I can mount a new file over it. If it’s a symlink and the kernel properly supported it, I could also mount over it.

Yes, I could also use overlayfs.  Maybe I should regardless.
Al Viro Jan. 8, 2020, 3:13 a.m.
On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:

> It's a mess, again in mountpoint_last().  FWIW, at some point I proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I looked
> around and hadn't spotted anything.  And there hadn't been at the time,
> but when four months later umount_lookup_last() went in I failed to look
> for that source of potential problems in it ;-/
> 
> I've looked at that area again now.  Aside of usual cursing at do_last()
> horrors (yes, its control flow is a horror; yes, it needs serious massage;
> no, it's not a good idea to get sidetracked into that right now), there
> are several fun questions:
> 	* d_manage() and d_automount().  We almost certainly don't
> want those for autofs on the final component of pathname in umount,
> including the trailing symlinks.  But do we want those on usual access
> via /proc/*/fd/*?  I.e. suppose somebody does open() (O_PATH or not)
> in autofs; do we want ->d_manage()/->d_automount() called when
> resolving /proc/self/fd/<whatever>/foo/bar?  We do not; is that
> correct from autofs point of view?  I suspect that refusing to
> do ->d_automount() is correct, but I don't understand ->d_manage()
> purpose well enough to tell.
> 	* I really hope that the weird "trailing / forces automount
> even in cases when we normally wouldn't trigger it" (stat /mnt/foo
> vs. stat /mnt/foo/) is not meant to extend to umount.  I'd like
> Ian's confirmation, though.
> 	* do we want ->d_manage() on following .. into overmounted
> directory?  Again, autofs question...

FWIW, I suspect that we want to do something along the following lines:

1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been
O_NOFOLLOW in the mix.  Reason: if there is a trailing symlink, we want
to fail with EEXIST anyway.  Benefit: this fragment in do_last()
        error = follow_managed(&path, nd);
        if (unlikely(error < 0))
                return error;

        /*
         * create/update audit record if it already exists.
         */
        audit_inode(nd->name, path.dentry, 0);

        if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
                path_to_nameidata(&path, nd);
                return -EEXIST;
        }

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
finish_lookup:  
        error = step_into(nd, &path, 0, inode, seq);
        if (unlikely(error))
                return error;
can become
        error = follow_managed(&path, nd);
        if (unlikely(error < 0))
                return error;

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
finish_lookup:  
        error = step_into(nd, &path, 0, inode, seq);
        if (unlikely(error))
                return error;

        if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
		audit_inode(nd->name, nd->path.dentry, 0);
                return -EEXIST;
        }
Equivalent transformation, since the the only goto finish_lookup; is under
if (!(open_flag & O_CREAT)).  What it buys us is more regular structure
of follow_managed() callers.

2) make follow_managed() take &inode and &seq.  Look: follow_managed() never
returns 0 (we have
        if (ret == -EISDIR || !ret)
                ret = 1;
on the way to the only return in it) and the callers are
        err = follow_managed(path, nd);
        if (likely(err > 0))
                *inode = d_backing_inode(path->dentry);
        return err;
in lookup_fast(),
                err = follow_managed(&path, nd);
                if (unlikely(err < 0))
                        return err;

                seq = 0;        /* we are already out of RCU mode */
                inode = d_backing_inode(path.dentry);
in walk_component(),
                err = follow_managed(&path, nd);
                if (unlikely(err < 0))
                        return err;
                inode = d_backing_inode(path.dentry);
                seq = 0;
in handle_lookup_down() and (after the previous change)
        error = follow_managed(&path, nd);
        if (unlikely(error < 0))
                return error;

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
in do_last().  That's begging to fold those followups into follow_managed()
itself, doesn't it?  And having *seqp = 0; equivalent added in lookup_fast()
is not going to hurt the performance - in all callers it's an address of
local variable, right next to the one whose address is passed as inodep.
Which we'd just dirtied, and the cacheline is not going to have been shared
anyway.

Note that after that the arguments for follow_managed() become identical
to those for __follow_mount_rcu().  Which makes a lot of sense, since
the latter is RCU-mode counterpart of the former.

3) have the followup to failing __follow_mount_rcu() taken into it.
After (2) we have this in lookup_fast():

                *seqp = seq;
                status = d_revalidate(dentry, nd->flags);
                if (likely(status > 0)) {
                        /*
                         * Note: do negative dentry check after revalidation in
                         * case that drops it.
                         */
                        if (unlikely(negative))
                                return -ENOENT;
                        path->mnt = mnt;
                        path->dentry = dentry;
                        if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
                                return 1;
                }
                if (unlazy_child(nd, dentry, seq))
                        return -ECHILD;
                if (unlikely(status == -ECHILD))
                        /* we'd been told to redo it in non-rcu mode */
                        status = d_revalidate(dentry, nd->flags);
        } else {   
		...
	}
        if (unlikely(status <= 0)) {
                if (!status)
                        d_invalidate(dentry);
                dput(dentry);
                return status;
        }

        path->mnt = mnt;
        path->dentry = dentry;
        return follow_managed(path, nd, inode, seqp);

Suppose __follow_mount_rcu() returns false; what follows is
                if (unlazy_child(nd, dentry, seq))
                        return -ECHILD;
seq here is equal to *seqp here, dentry - the value of path->dentry at the
time of __follow_mount_rcu() call.
                if (unlikely(status == -ECHILD))
			....
not taken - we know that status must have been positive
        if (unlikely(status <= 0)) {
		...
	}
ditto
        path->mnt = mnt;
        path->dentry = dentry;
        return follow_managed(path, nd, inode, seqp);
we return *path to original and call follow_managed().  IOW, we could bloody
well do all of that in the __follow_mount_rcu() itself, having it return 1
when the original would've returned true and doing that "revert *path,
call unlazy_child() and fall back to follow_mount_rcu() in case of success"
in cases when the original would've returned false.  The caller turns into
                        /*
                         * Note: do negative dentry check after revalidation in
                         * case that drops it.
                         */
                        if (unlikely(negative))
                                return -ENOENT;
                        path->mnt = mnt;
                        path->dentry = dentry;
                        return __follow_mount_rcu(nd, path, inode, seqp);

4) fold __follow_mount_rcu() into follow_managed(), using the latter both in
RCU and non-RCU cases.

5) take the calls of follow_managed() out of lookup_fast() into its callers.
That would be
        err = lookup_fast(nd, &path, &inode, &seq);
        if (unlikely(err <= 0)) {
                if (err < 0)
                        return err;
                path.dentry = lookup_slow(&nd->last, nd->path.dentry,
                                          nd->flags);
                if (IS_ERR(path.dentry))
                        return PTR_ERR(path.dentry);

                path.mnt = nd->path.mnt;
                err = follow_managed(&path, nd, &inode, &seq);
                if (unlikely(err < 0))
                        return err;
        }
turning into
        err = lookup_fast(nd, &path, &inode, &seq);
        if (unlikely(err <= 0)) {
                if (err < 0)
                        return err;
                path.dentry = lookup_slow(&nd->last, nd->path.dentry,
                                          nd->flags);
                if (IS_ERR(path.dentry))
                        return PTR_ERR(path.dentry);

                path.mnt = nd->path.mnt;
	}
	err = follow_managed(&path, nd, &inode, &seq);
        if (unlikely(err < 0))
		return err;
in walk_component() and
                error = lookup_fast(nd, &path, &inode, &seq);
                if (likely(error > 0))
                        goto finish_lookup;
...
        error = follow_managed(&path, nd, &inode, &seq);
        if (unlikely(error < 0))
                return error;
finish_lookup:  
turning into
                error = lookup_fast(nd, &path, &inode, &seq);
                if (likely(error > 0))
                        goto finish_lookup;
...
finish_lookup:
        error = follow_managed(&path, nd, &inode, &seq);
        if (unlikely(error < 0))
                return error;
in do_last().

6) after that we have 3 callers of step_into(); the ones in
walk_component() and in do_last() would be immediately preceded
by the calls of follow_managed().  The last one is in
mountpoint_last().  That's
        if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
                dput(path.dentry);
                return -ENOENT;
        }
        path.mnt = nd->path.mnt;
        return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
And that's where we are missing the mountpoint traversal in symlink case -
sure, the caller does follow_mount(), but it doesn't catch the case when
we have a symlink overmounted - we run into step_into() before that.
Note that smp_load_acquire + d_flags_negative is what we would've done
in follow_managed(), as well as getting d_backing_inode().  So here
we also have an open-coded bastardized variant of follow_managed().
The difference is, we don't want to trigger ->d_automount() and ->d_manage()
in that one.

And at that point the only call of follow_managed() *NOT* followed by
step_into() is in handle_lookup_down().  What it is followed by is
        path_to_nameidata(&path, nd);
        nd->inode = inode;
        nd->seq = seq;
And that's a piece of step_into():
        if (likely(!d_is_symlink(path->dentry)) ||
           !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
                /* not a symlink or should not follow */
                path_to_nameidata(path, nd);
                nd->inode = inode;
                nd->seq = seq;
                return 0;
        }
is the normal path through that sucker.  What's more, we are guaranteed
that this will _not_ be a symlink (it's the starting point of pathwalk,
and path_init() would've told us to sod off were it not a directory).

So if we manage to convert the damn thing in mountpoint_last() into
follow_managed(), we could fold follow_managed() into step_into().
Which suggests the way to do that - not that step_into() takes an
argument containing ORed WALK_... constants.  So we can simply add
WALK_NOAUTOMOUNT and put a check for it into
                if (flags & DCACHE_MANAGE_TRANSIT) {
and
                if (flags & DCACHE_NEED_AUTOMOUNT) {
bodies, so that they would be ignored if that's passed to
follow_mount()/step_into() hybrid.

At that point we have one primitive for moving into child, handling
both the mountpoint traversals and keeping track of symlinks.  Moreover,
there's a fairly strong argument for using it in case of .. as well.
As it is, if the parent is overmounted, we cross into whatever is
mounted on top of it.  And we ignore ->d_manage/->d_automount on
the damn thing.  Which is not an issue for anything other than
autofs (nobody else has ->d_manage() and nfs/afs/cifs automount
points don't have children) and for autofs we *want* those called;
that's not something likely to be encountered, but it's an impossible
setup (autofs direct mount set on an ancestor of somebody's current
directory) and autofs does count upon not walking into something
being set up by the daemon.

I'll put together such series and see how well does it work; it would
fix the idiocies in user_path_mountpoint_at() and make the pathwalk
machinery easier to follow - the boilerplate around mountpoint
crossing and symlink handling is demonstrably easy to get wrong.
If that works and doesn't cause observable slowdown, I'll put it
into -next, either stepping around the changes done by openat2()
series, or rebasing it on top of that.

Another interesting question is whether we want O_PATH open
to trigger automounts.  The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups.
I believe it's a mistake (and mine, at that), but I doubt that
there's anything that can be done about it at that point.
It's a user-visible behaviour and I can easily imagine
a custom /init that ends up relying upon it ;-/  mkdir /root,
mount the final root there, chdir /root, mount --move . /,
remove everything on initramfs using absolute pathnames
and chroot to "." to finish...  Traversing mounts at the
beginning of pathwalk would break the hell out of that,
potentially with root filesystem contents wiped out... ;-/

I wish we could change that, but I'm afraid that's cast
in stone by now (and had been for 20 years or so).  As it is,
we have an unpleasant side effect - O_PATH open does *NOT*
trigger automounts.  So if you do that to e.g. referral point
and try to do ...at() syscalls with that as the origin, you'll
get an unpleasant surprise - automount won't trigger at all.

I think the easiest way to handle that is to have O_PATH
turn LOOKUP_AUTOMOUNT, same as the normal open() does.  That's
trivial to do, but that changes user-visible behaviour.  OTOH,
with the current behaviour nobody can rely upon automount not
getting triggered by somebody else just as they are entering
their open(dir, O_PATH), so I think that's not a problem.

Linus, do you have any objections to such O_PATH semantics
change?

PS: I think I see how to untangle the control flow horrors
in do_last() with this massage done, but I'm not going there
until this is sorted out - by previous experience touching
the damn thing can easily turn into several weeks of digging
through the nfs/gfs2/etc. guts trying to verify something,
with a couple of detours into fixing something in there
found in process... ;-/
Linus Torvalds Jan. 8, 2020, 3:54 a.m.
On Tue, Jan 7, 2020 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, I suspect that we want to do something along the following lines:
>
> 1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been
> O_NOFOLLOW in the mix.

My reaction to that is "Whee, that's a big change".

But:

> Benefit: this fragment in do_last()

you're right.

That's the semantics we have right now (and I think it's the correct
safe semantics when I think about it). But when I first looked at your
email I without thinking more about it actually thought we followed
the symlink, and then did the O_CREAT | O_EXCL on the target (and
potentially succeeded).

So I agree - making O_CREAT | O_EXCL imply O_NOFOLLOW seems to be the
right thing to do, and not only should simplify our code, it's much
more descriptive of what the real semantics are.

Even if my first reaction was that it would act differently.

Slash-and-burn approach to your explanatory subsequent steps:

> 2) make follow_managed() take &inode and &seq.
> 3) have the followup to failing __follow_mount_rcu() taken into it.
> 4) fold __follow_mount_rcu() into follow_managed(), using the latter both in
> RCU and non-RCU cases.
> 5) take the calls of follow_managed() out of lookup_fast() into its callers.
> 6) after that we have 3 callers of step_into(); [..]
> So if we manage to convert the damn thing in mountpoint_last() into
> follow_managed(), we could fold follow_managed() into step_into().

I think that all makes sense. I didn't go to look at the source, but
from the email contents your steps look reasonable to me.

> Another interesting question is whether we want O_PATH open
> to trigger automounts.

It does sound like they shouldn't, but as you say:

>     The thing is, we do *NOT* trigger them
> (or traverse mountpoints) at the starting point of lookups.
> I believe it's a mistake (and mine, at that), but I doubt that
> there's anything that can be done about it at that point.
> It's a user-visible behaviour [..]

Hmm. I wonder how set in stone that is. We may have two decades of
history of not doing it at start point of lookups, but we do *not*
have two decades of history of O_PATH.

So what I think we agree would be sane behavior would be for O_PATH
opens to not trigger automounts (unless there's a slash at the end,
whatever), but _do_ add the mount-point traversal to the beginning of
lookups.

But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.

That way we maintain original behavior: if somebody overmounts your
cwd, you still see the pre-mount directory on lookups, because your
cwd is "under" the mount.

But if you open a file with O_PATH, and somebody does a mount
_afterwards_, the openat() will see that later mount and/or do the
automount.

Don't you think that would be the more sane/obvious semantics of how
O_PATH should work?

> I think the easiest way to handle that is to have O_PATH
> turn LOOKUP_AUTOMOUNT, same as the normal open() does.  That's
> trivial to do, but that changes user-visible behaviour.  OTOH,
> with the current behaviour nobody can rely upon automount not
> getting triggered by somebody else just as they are entering
> their open(dir, O_PATH), so I think that's not a problem.
>
> Linus, do you have any objections to such O_PATH semantics
> change?

See above: I think I'd prefer the O_PATH behavior the other way
around. That seems to be more of a consistent behavior of what
"O_PATH" means - it means "don't really open, we'll do it only when
you use it as a directory".

But I don't have any _strong_ opinions. If you have a good reason to
tell me that I'm being stupid, go ahead and do so and override my
stupidity.

             Linus
Al Viro Jan. 8, 2020, 9:34 p.m.
On Tue, Jan 07, 2020 at 07:54:02PM -0800, Linus Torvalds wrote:

> > Another interesting question is whether we want O_PATH open
> > to trigger automounts.
> 
> It does sound like they shouldn't, but as you say:
> 
> >     The thing is, we do *NOT* trigger them
> > (or traverse mountpoints) at the starting point of lookups.
> > I believe it's a mistake (and mine, at that), but I doubt that
> > there's anything that can be done about it at that point.
> > It's a user-visible behaviour [..]
> 
> Hmm. I wonder how set in stone that is. We may have two decades of
> history of not doing it at start point of lookups, but we do *not*
> have two decades of history of O_PATH.
> 
> So what I think we agree would be sane behavior would be for O_PATH
> opens to not trigger automounts (unless there's a slash at the end,
> whatever), but _do_ add the mount-point traversal to the beginning of
> lookups.
> 
> But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
> 
> That way we maintain original behavior: if somebody overmounts your
> cwd, you still see the pre-mount directory on lookups, because your
> cwd is "under" the mount.
> 
> But if you open a file with O_PATH, and somebody does a mount
> _afterwards_, the openat() will see that later mount and/or do the
> automount.
> 
> Don't you think that would be the more sane/obvious semantics of how
> O_PATH should work?

Maybe, but... note that we do not (and AFAICS never had) follow mounts
on /proc/self/cwd, /proc/self/fd/42, etc.  And there are very good
reasons for that.  First of all, if your stdin is from /tmp/foo,
you'd better get that file when you open /dev/stdin, even if somebody
has done mount --bind /tmp/bar /tmp/foo; another issue is with
the use of stat("/proc/self/fd/42", &buf) - it should be an equivalent
of fstat(42, &buf), even if somebody has overmounted that.  BTW, for
similar reason after
	link(".", "foo");
	fd = open("foo", O_PATH);	// return 42
we really should (and do) have resolution of /proc/self/fd/42 stop at
foo, not .   Reason: consistency of stat() behaviour...

The point is, we'd never followed mounts on /proc/self/cwd et.al.
I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
is that way.  Actually, scratch that - 2.0 behaves the same way
(mountpoint crossing is done in iget() there; is that Minix influence
or straight from the Lions' book?)

Hmm...  Looking through the history, we have

(for reference) v7: mount traversal in iget()
(forward) and namei() (back); due to the way it's done, forward
traversal happens
	* at starting point
	* after any component (. and .. included)
	* on results of forward traversal (due to a loop in iget()).
Back traversal (to covered on .. from root directory) is also
to unlimited depth.

0.01: no mount handling

0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
(not by Lions' Book, then - v6 didn't do back traversals at all).
Forward traversal
	* after any component (. and .. included)
No traversal on starting point, no traversal on result of traversal.
OTOH, mount(2) refuses to mount on top of root, so the lack of the last
one is not an issue.

0.12: symlinks added; no mount traversal on starting point of those either.
We start at the process' root for absolute ones, even if it happens to be
overmounted, and we start from parent for relative ones.  The latter matters
only if we were in the beginning of the pathwalk, since anything else would've
traversed mounts back when we'd picked said parent.  Mount traversal takes
precedence over symlink traversal, but that's not an issue since mount follows
links on mountpoint.  It does not, at that point, reject fs image with
symlink for root, but that actually more or less works.

0.97.3: same, with addition of procfs symlinks.  No mount crossing on their
targets (for normal symlinks we don't do mount crossing in the beginning
and any component inside triggers mount crossing as usual; for procfs ones
there's no components inside)

Situation remains essentially unchanged until 2.1.42.  Next few kernels
are in flux, to put it politely - initial merge had been insane and it
took until 2.1.44 or so for the things to get more or less working.

At 2.1.44: forward traversal in fs/namei.c:lookup(), back traversal in
fs/namei.c:reserved_lookup().  Otherwise the same behaviour as pre-dcache
(wrt mount traversals, that is).

2.1.51pre1: forward traversal moved into real_lookup() and __d_lookup().
Forward traversal happens *ONLY* after normal components - not after . or ..

2.1.61: forward traversal moved into follow_mount(), behaviour reverted to
pre-dcache one.

Previous is from reading through the historical trees; my involvement started
circa 2.1.120-something.

2.3.50pre3: call of follow_mount() moved a bit, reverting to 2.1.51pre1
behaviour (nor traversal on . or ..) *again*.  Not sure whose idea had that
been - might've been mine, but unlike the other patch that went into fs/namei.c
in the same release, I hadn't been able to find anything related to that
one.  If your memories (or mail archives) are better...

2.3.99pre4-5: massive surgery in there.  Preparations to allowing mount on top
of mount; forward traversal adjusted accordingly, back traversal still isn't.

2.3.99pre7-1: more surgery, back traversals are also to unlimited depth now
and mount on top of mount has been allowed.

2.3.99pre9-4: mount --bind taught to mount non-directories on top of
non-directories.  At that point it does *NOT* follow trailing symlinks, so
mounting of symlinks and mounting on top of symlinks becomes possible.
Mount traversal still takes precedence over symlink traversal, symlink traversal
of mount traversal result still generally works, even though it's not something
I considered at the time.

v2.4.5.2: mount --bind started to follow symlinks.  So that source of mounting
of and on the symlinks was no more.

2.5.0.5: forward mount traversal is done after .. (in handle_dotdot()).
That brings back the pre-dcache behaviour for those suckers.  Still no
forward traversal after ., though.

At about the same time I'd been getting rid of the early-boot incestous
relationships with fs/namespace.c (initramfs work) and that was probably
the last time we could realistically switch to following mounts at starting
point; I considered trying to do that, but decided not to.  Pity, that...

2.6.5-rc2: normal mount now checks for corrupt fs with symlink for root.
Since it has always been following symlinks for mountpoint, the remaining
source of mounting of and on symlinks was gone; that lasted until
after O_PATH introduction.

2.6.39-rc1: mount traps support - instead of abusing ->follow_link()
for automounting, we have an explicit pair of methods that can be
called at the same places where we traverse mounts.  None too consistent -
we don't do that on .. results.  That was Dave Howells and Ian Kent.

2.6.39-rc1: O_PATH introduced and, later in the same series, allowed for
symlinks.  That has changed things - now procfs symlink targets could
be symlinks themselves.  Originally an attempt to follow those would
blow up with -ELOOP (there's simply no good way to follow such beast;
it's either "stop even if we are asked to follow" or "give an error").

3.6.0-rc1: nd_jump_link() introduction (hch) had unnoticed side effects -
we'd switched from "fail traversal with -ELOOP" to "stop there".  Mostly it
doesn't change behaviour, but it has opened a way to mount symlinks and
mount on top of symlinks.  Which generally worked.

circa 3.8--3.9: side effects had been noticed; my first reaction had been
"let's make nd_jump_link() return an error, then", but I hadn't been
able to find good reasons when challenged to do so.  Did an audit,
found no obvious problems, went "oh, well - whether it works by accident
or by design, it doesn't break anything".

3.12.0-rc1: lookups for umount(2) are different - we don't want
revalidate on the last component.  Which had been handled by
introduction of path_umountat()/umount_lookup_last(), parallel to
path_lookupat().  Which has gotten quite a few things wrong -
it *did* try to follow symlinks obtained by following procfs
ones (and blew up big way) and it didn't follow mounts on
overmounted trailing symlinks.  Nobody noticed for 6 years,
until folks actually tried to play with mount-on-symlink...
Patches were by Jeff Layton, neither he nor I have spotted the
problem back then.  And I should have, since it had been only
a few months since the audit for exactly that kind of problems...

AFAICS, there'd been no serious semantical changes since then.  What we
have right now:
	* no mount traversal on the starting point
	* mount traversal after any component other than "."
	* symlink traversal consists of possibly jumping to given
point plus following a given (possibly empty) series of components.
It can be both - e.g. symlink to "/foo/bar" is 'jump to root,
then traverse "foo", then traverse "bar"'.  Procfs "magic" symlinks
are not really magical - they behave as symlinks to "/" as far as
the pathwalk semantics is concerned.  The only differences is that
jump might be not to process' root.
	* mount traversal takes precedence over symlink traversal.
	* jump (if any) in symlink traversal is treated the same
as the starting point - it's not followed by mount traversal.
It's also not followed by symlink traversal, even if we are jumping
into a symlink.  Of course, in any position other than the end of
pathname that's an instant error.  That's also not different from
the starting point treatment - if ...at(2) is given a symlink for
starting point, it leaves it as-is if AT_EMPTY_PATH is given and
fails with -ENOTDIR otherwise.
	* umount(2) handles the final component differently -
for one thing, it does not do revalidate, for another - its
mount traversal (if any) does not include automount-related
parts.  And there we *do* want mount traversal at the final
point, for obvious reasons.

> > I think the easiest way to handle that is to have O_PATH
> > turn LOOKUP_AUTOMOUNT, same as the normal open() does.  That's
> > trivial to do, but that changes user-visible behaviour.  OTOH,
> > with the current behaviour nobody can rely upon automount not
> > getting triggered by somebody else just as they are entering
> > their open(dir, O_PATH), so I think that's not a problem.
> >
> > Linus, do you have any objections to such O_PATH semantics
> > change?
> 
> See above: I think I'd prefer the O_PATH behavior the other way
> around. That seems to be more of a consistent behavior of what
> "O_PATH" means - it means "don't really open, we'll do it only when
> you use it as a directory".

How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
vs. faccessat(42, "foo", MAY_READ)?  The latter would trigger automount,
the former would not...  Or would you extend that to "traverse mounts
upon following procfs links, if the file in question had been opened with
O_PATH"?  We could do that (give nd_jump_link() an extra argument telling
if we want mount traversal), but I'm not sure if the resulting semantics
is sane...

Note, BTW, that O_PATH users really can't rely upon automounts _not_
being triggered - all it takes is a lookup on bogus path with such prefix
by anybody who can reach that place...  We are not opening anything,
really, but we are not able to ignore automounts triggered by somebody
else.
Linus Torvalds Jan. 10, 2020, 12:08 a.m.
On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> The point is, we'd never followed mounts on /proc/self/cwd et.al.
> I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
> is that way.

Hmm. If that's the case, maybe they should be marked implicitly as
O_PATH when opened?

> Actually, scratch that - 2.0 behaves the same way
> (mountpoint crossing is done in iget() there; is that Minix influence
> or straight from the Lions' book?)

I don't think I ever had access to Lions' - I've _seen_ a printout of
it later, and obviously maybe others did,

More likely it's from Maurice Bach: the Design of the Unix Operating
System. I'm pretty sure that's where a lot of the FS layer stuff came
from.  Certainly the bad old buffer head interfaces, and quite likely
the iget() stuff too.

> 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()

Whee, you _really_ went back in time.

So I did too.

And looking at that code in iget(), I doubt it came from anywhere.
Christ. It's just looping over a fixed-size array, both when finding
the inode, and finding the superblock.

Cute, but unbelievably stupid. It was a more innocent time.

In other words, I think you can chalk it up to just me, because
blaming anybody else for that garbage would be very very unfair indeed
;)

> How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
> vs. faccessat(42, "foo", MAY_READ)?

I think that in a perfect world, the O_PATH'ness of '42' would be the
deciding factor. Wouldn't those be the best and most consistent
semantics?

And then 'cwd'/'root' always have the O_PATH behavior.

> The latter would trigger automount,
> the former would not...  Or would you extend that to "traverse mounts
> upon following procfs links, if the file in question had been opened with
> O_PATH"?

Exactly.

But you know what? I do not believe this is all that important, and I
doubt it will matter to anybody.

So what matters most is what makes the most sense to the VFS layer,
and what makes the most sense to _you_.

Because my reaction from this thread is that not only have you thought
about this issue and followed the history a whole lot more than I
would ever have done, it's also that I trust you to DTRT.

I think it would be good to have some self-consistency, but at the
same time clearly we already don't really, and our behavior here has
subtly changed over the years (and not so subtly - if you go back
sufficiently far, /proc behavior wrt file descriptors has had both
"dup()" behavior and "make a new file descriptor with the same inode"
behavior, afaik).

               Linus
Al Viro Jan. 10, 2020, 4:15 a.m.
On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
> On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > The point is, we'd never followed mounts on /proc/self/cwd et.al.
> > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
> > is that way.
> 
> Hmm. If that's the case, maybe they should be marked implicitly as
> O_PATH when opened?

I thought you wanted O_PATH as starting point to have mounts traversed?
Confused...

> > Actually, scratch that - 2.0 behaves the same way
> > (mountpoint crossing is done in iget() there; is that Minix influence
> > or straight from the Lions' book?)
> 
> I don't think I ever had access to Lions' - I've _seen_ a printout of
> it later, and obviously maybe others did,
> 
> More likely it's from Maurice Bach: the Design of the Unix Operating
> System. I'm pretty sure that's where a lot of the FS layer stuff came
> from.  Certainly the bad old buffer head interfaces, and quite likely
> the iget() stuff too.
>
> > 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
> 
> Whee, you _really_ went back in time.
> 
> So I did too.
> 
> And looking at that code in iget(), I doubt it came from anywhere.
> Christ. It's just looping over a fixed-size array, both when finding
> the inode, and finding the superblock.
> 
> Cute, but unbelievably stupid. It was a more innocent time.
> 
> In other words, I think you can chalk it up to just me, because
> blaming anybody else for that garbage would be very very unfair indeed
> ;)

See https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c
Exactly the same algorithm, complete with linear searches over those
fixed-sized array.

<grabs Bach> Right, he simply transcribes v7 iget().

So I suspect that you are right - your variant of iget was pretty much
one-to-one implementation of Bach's description of v7 iget.

Your namei wasn't - Bach has 'if the entry points to root and you are
in the root and name is "..", find mount table entry (by device number),
drop your directory inode, grab the inode of mountpount and restart
the search for ".." in there', which gives back traversals to arbitrary
depth.  And v7 namei() (as Bach mentions) uses iget() for starting point
as well as for each component.  You kept pointers instead, which is where
the other difference has come from (no mount traversal at the starting
point)...

Actually, I've misread your code in 0.10 - it does unlimited forward
traversals; it's back traversals that go only one level.  The forward
ones got limited to one level in 0.95, but then mount-over-root had
been banned all along.  I'd read the pre-dcache variant of iget(),
seen it go pretty much all the way back to beginning and hadn't
sorted out the 0.12 -> 0.95 transition...

> > How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
> > vs. faccessat(42, "foo", MAY_READ)?
> 
> I think that in a perfect world, the O_PATH'ness of '42' would be the
> deciding factor. Wouldn't those be the best and most consistent
> semantics?
> 
> And then 'cwd'/'root' always have the O_PATH behavior.

See above - unless I'm misparsing you, you wanted mount traversals in the
starting point if it's ...at() with O_PATH fd.  With O_PATH open() not
doing them.

For cwd and root the situation is opposite - we do NOT traverse mounts
for those.  And that's really too late to change.

> > The latter would trigger automount,
> > the former would not...  Or would you extend that to "traverse mounts
> > upon following procfs links, if the file in question had been opened with
> > O_PATH"?
> 
> Exactly.
> 
> But you know what? I do not believe this is all that important, and I
> doubt it will matter to anybody.

FWIW, digging through the automount-related parts of that stuff has
caught several fun issues.  One (and I'm rather embarrassed by it)
should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount
overput on simultaneous automount).  To quote the commit message:
    The problem is that lock_mount() drops the caller's reference to the
    mountpoint's vfsmount in the case where it finds something already mounted on
    the mountpoint as it transits to the mounted filesystem and replaces path->mnt
    with the new mountpoint vfsmount.
    
    During a pathwalk, however, we don't take a reference on the vfsmount if it is
    the same as the one in the nameidata struct, but do_add_mount() doesn't know
    this.
At which point I should've gone "what the fuck?" - lock_mount() does, indeed,
drop path->mnt in this situation and replaces it with the whatever's come to
cover it.  For mount(2) that's the right thing to do - we _want_ to mount
on top of whatever we have at the mountpoint.  For automounts we very much
don't want that - it's either "mount right on top of the automount trigger"
or discard whatever we'd been about to mount and walk into whatever's got
mounted there (presumably the same thing triggered by another process).
We kinda-sorta get that effect, but in a very convoluted way: do_add_mount()
will refuse to mount something on top of itself -
        /* Refuse the same filesystem on the same mount point */
        err = -EBUSY;
        if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
            path->mnt->mnt_root == path->dentry)
                goto unlock;
which will end up with -EBUSY returned (and recognized by follow_automount()).

First of all, that's unreliable.  If somebody not only has triggered that
automount, but managed to _mount_ something else on top (for example,
has triggered it by lookup of mountpoint-to-be in mount(2)), we'll end
up not triggering that check.  In which case we'll get something like
nfs referral point under nfs automounted there under tmpfs from explicit
overmount under same nfs mount we'd automounted there - identical to what's
been buried under tmpfs.  It's hard to hit, but not impossibly so.

What's more, the whole solution is a kludge - the root of problem is
that lock_mount() is the wrong thing to do in case of finish_automount().
We don't want to go into whatever's overmounting us there, both for
the reasons above *and* because it's a PITA for the caller.  So the
right solution is
	* lift lock_mount() call from do_add_mount() into its callers
(all 2 of them); while we are at it, lift unlock_mount() as well
(makes for simpler failure exits in do_add_mount()).
	* replace the call of lock_mount() in finish_automount()
with variant that doesn't do "unlock, walk deeper and retry locking",
returning ERR_PTR(-EBUSY) in such case.
	* get rid of the kludge introduced in that commit.  Better
yet, don't bother with traversing into the covering mount in case
of success - let the caller of follow_automount() do that.  Which
eliminates the need to pass need_mntput to the sucker and suggests
an even better solution - have this analogue of lock_mount()
return NULL instead of ERR_PTR(-EBUSY) and treat it in finish_automount()
as "OK, discard what we wanted to mount and return 0".  That gets
rid of the entire
        err = finish_automount(mnt, path);
        switch (err) {
        case -EBUSY:
                /* Someone else made a mount here whilst we were busy */
                return 0;
        case 0:
                path_put(path);
                path->mnt = mnt;
                path->dentry = dget(mnt->mnt_root);
                return 0;
        default:
                return err;
        }
chunk in follow_automount() - it would just be
	return finish_automount(mnt, path);

Another thing (in the same area) is not a bug per se, but...
after the call of ->d_automount() we have this:
        if (IS_ERR(mnt)) {
                /*
                 * The filesystem is allowed to return -EISDIR here to indicate
                 * it doesn't want to automount.  For instance, autofs would do
                 * this so that its userspace daemon can mount on this dentry.
                 *
                 * However, we can only permit this if it's a terminal point in
                 * the path being looked up; if it wasn't then the remainder of
                 * the path is inaccessible and we should say so.
                 */
                if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
                        return -EREMOTE;
                return PTR_ERR(mnt);
	}
Except that not a single instance of ->d_automount() has ever returned
-EISDIR.  Certainly not autofs one, despite the what the comment says.
That chunk has come from dhowells, back when the whole mount trap series
had been merged.  After talking that thing over (fun: trying to figure
out what had been intended nearly 9 years ago, when people involved are
in UK, US east coast and AU west coast respectively.  The only way it
could suck more would've been if I were on the west coast - then all
timezone deltas would be 8-hour ones)...  looks like it's a rudiment
of plans that got superseded during the series development, nobody
quite remembers exact details.  Conclusion: it's not even dead, it's
stillborn; bury it.

Unfortunately, there are other interesting questions related to
autofs-specific bits (->d_manage()) and the timezone-related fun
is, of course, still there.  I hope to sort that out today or
tomorrow, at least enough to do a reasonable set of backportable
fixes to put in front of follow_managed()/step_into() queue.
Oh, well...
Linus Torvalds Jan. 10, 2020, 5:03 a.m.
On Thu, Jan 9, 2020 at 8:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Hmm. If that's the case, maybe they should be marked implicitly as
> > O_PATH when opened?
>
> I thought you wanted O_PATH as starting point to have mounts traversed?
> Confused...

No, I'm confused.  I meant "non-O_PATH", just got the rules reversed in my mind.

So cwd/root would always act as it non-O_PATH, and only using an
actual fd would look at the O_PATH flag, and if it was set would walk
the mountpoints.

> <grabs Bach> Right, he simply transcribes v7 iget().
>
> So I suspect that you are right - your variant of iget was pretty much
> one-to-one implementation of Bach's description of v7 iget.

Ok, that makes sense. My copy of Bach literally had the system call
list "marked off" when I implemented them back when.

I may still have that paperbook copy somewhere. I don't _think_ I'd
have thrown it out, it has sentimental value.

> > I think that in a perfect world, the O_PATH'ness of '42' would be the
> > deciding factor. Wouldn't those be the best and most consistent
> > semantics?
> >
> > And then 'cwd'/'root' always have the O_PATH behavior.
>
> See above - unless I'm misparsing you, you wanted mount traversals in the
> starting point if it's ...at() with O_PATH fd.

.. and see above, it was just my confusion about the sense of O_PATH.

> For cwd and root the situation is opposite - we do NOT traverse mounts
> for those.  And that's really too late to change.

Oh, absolutely.

[ snip some more about your automount digging. Looks about right, but
I'm not going to make a peep after getting O_PATH reversed ;) ]

            Linus
Ian Kent Jan. 10, 2020, 6:20 a.m.
On Fri, 2020-01-10 at 04:15 +0000, Al Viro wrote:
> On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 8, 2020 at 1:34 PM Al Viro <viro@zeniv.linux.org.uk>
> > wrote:
> > > The point is, we'd never followed mounts on /proc/self/cwd et.al.
> > > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from
> > > me)
> > > is that way.
> > 
> > Hmm. If that's the case, maybe they should be marked implicitly as
> > O_PATH when opened?
> 
> I thought you wanted O_PATH as starting point to have mounts
> traversed?
> Confused...
> 
> > > Actually, scratch that - 2.0 behaves the same way
> > > (mountpoint crossing is done in iget() there; is that Minix
> > > influence
> > > or straight from the Lions' book?)
> > 
> > I don't think I ever had access to Lions' - I've _seen_ a printout
> > of
> > it later, and obviously maybe others did,
> > 
> > More likely it's from Maurice Bach: the Design of the Unix
> > Operating
> > System. I'm pretty sure that's where a lot of the FS layer stuff
> > came
> > from.  Certainly the bad old buffer head interfaces, and quite
> > likely
> > the iget() stuff too.
> > 
> > > 0.10: forward traversal in iget(), back traversal in
> > > fs/namei.c:find_entry()
> > 
> > Whee, you _really_ went back in time.
> > 
> > So I did too.
> > 
> > And looking at that code in iget(), I doubt it came from anywhere.
> > Christ. It's just looping over a fixed-size array, both when
> > finding
> > the inode, and finding the superblock.
> > 
> > Cute, but unbelievably stupid. It was a more innocent time.
> > 
> > In other words, I think you can chalk it up to just me, because
> > blaming anybody else for that garbage would be very very unfair
> > indeed
> > ;)
> 
> See 
> https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c
> Exactly the same algorithm, complete with linear searches over those
> fixed-sized array.
> 
> <grabs Bach> Right, he simply transcribes v7 iget().
> 
> So I suspect that you are right - your variant of iget was pretty
> much
> one-to-one implementation of Bach's description of v7 iget.
> 
> Your namei wasn't - Bach has 'if the entry points to root and you are
> in the root and name is "..", find mount table entry (by device
> number),
> drop your directory inode, grab the inode of mountpount and restart
> the search for ".." in there', which gives back traversals to
> arbitrary
> depth.  And v7 namei() (as Bach mentions) uses iget() for starting
> point
> as well as for each component.  You kept pointers instead, which is
> where
> the other difference has come from (no mount traversal at the
> starting
> point)...
> 
> Actually, I've misread your code in 0.10 - it does unlimited forward
> traversals; it's back traversals that go only one level.  The forward
> ones got limited to one level in 0.95, but then mount-over-root had
> been banned all along.  I'd read the pre-dcache variant of iget(),
> seen it go pretty much all the way back to beginning and hadn't
> sorted out the 0.12 -> 0.95 transition...
> 
> > > How would your proposal deal with access("/proc/self/fd/42/foo",
> > > MAY_READ)
> > > vs. faccessat(42, "foo", MAY_READ)?
> > 
> > I think that in a perfect world, the O_PATH'ness of '42' would be
> > the
> > deciding factor. Wouldn't those be the best and most consistent
> > semantics?
> > 
> > And then 'cwd'/'root' always have the O_PATH behavior.
> 
> See above - unless I'm misparsing you, you wanted mount traversals in
> the
> starting point if it's ...at() with O_PATH fd.  With O_PATH open()
> not
> doing them.
> 
> For cwd and root the situation is opposite - we do NOT traverse
> mounts
> for those.  And that's really too late to change.
> 
> > > The latter would trigger automount,
> > > the former would not...  Or would you extend that to "traverse
> > > mounts
> > > upon following procfs links, if the file in question had been
> > > opened with
> > > O_PATH"?
> > 
> > Exactly.
> > 
> > But you know what? I do not believe this is all that important, and
> > I
> > doubt it will matter to anybody.
> 
> FWIW, digging through the automount-related parts of that stuff has
> caught several fun issues.  One (and I'm rather embarrassed by it)
> should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount
> overput on simultaneous automount).  To quote the commit message:
>     The problem is that lock_mount() drops the caller's reference to
> the
>     mountpoint's vfsmount in the case where it finds something
> already mounted on
>     the mountpoint as it transits to the mounted filesystem and
> replaces path->mnt
>     with the new mountpoint vfsmount.
>     
>     During a pathwalk, however, we don't take a reference on the
> vfsmount if it is
>     the same as the one in the nameidata struct, but do_add_mount()
> doesn't know
>     this.
> At which point I should've gone "what the fuck?" - lock_mount() does,
> indeed,
> drop path->mnt in this situation and replaces it with the whatever's
> come to
> cover it.  For mount(2) that's the right thing to do - we _want_ to
> mount
> on top of whatever we have at the mountpoint.  For automounts we very
> much
> don't want that - it's either "mount right on top of the automount
> trigger"
> or discard whatever we'd been about to mount and walk into whatever's
> got
> mounted there (presumably the same thing triggered by another
> process).
> We kinda-sorta get that effect, but in a very convoluted way:
> do_add_mount()
> will refuse to mount something on top of itself -
>         /* Refuse the same filesystem on the same mount point */
>         err = -EBUSY;
>         if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
>             path->mnt->mnt_root == path->dentry)
>                 goto unlock;
> which will end up with -EBUSY returned (and recognized by
> follow_automount()).
> 
> First of all, that's unreliable.  If somebody not only has triggered
> that
> automount, but managed to _mount_ something else on top (for example,
> has triggered it by lookup of mountpoint-to-be in mount(2)), we'll
> end
> up not triggering that check.  In which case we'll get something like
> nfs referral point under nfs automounted there under tmpfs from
> explicit
> overmount under same nfs mount we'd automounted there - identical to
> what's
> been buried under tmpfs.  It's hard to hit, but not impossibly so.
> 
> What's more, the whole solution is a kludge - the root of problem is
> that lock_mount() is the wrong thing to do in case of
> finish_automount().
> We don't want to go into whatever's overmounting us there, both for
> the reasons above *and* because it's a PITA for the caller.  So the
> right solution is
> 	* lift lock_mount() call from do_add_mount() into its callers
> (all 2 of them); while we are at it, lift unlock_mount() as well
> (makes for simpler failure exits in do_add_mount()).
> 	* replace the call of lock_mount() in finish_automount()
> with variant that doesn't do "unlock, walk deeper and retry locking",
> returning ERR_PTR(-EBUSY) in such case.
> 	* get rid of the kludge introduced in that commit.  Better
> yet, don't bother with traversing into the covering mount in case
> of success - let the caller of follow_automount() do that.  Which
> eliminates the need to pass need_mntput to the sucker and suggests
> an even better solution - have this analogue of lock_mount()
> return NULL instead of ERR_PTR(-EBUSY) and treat it in
> finish_automount()
> as "OK, discard what we wanted to mount and return 0".  That gets
> rid of the entire
>         err = finish_automount(mnt, path);
>         switch (err) {
>         case -EBUSY:
>                 /* Someone else made a mount here whilst we were busy
> */
>                 return 0;
>         case 0:
>                 path_put(path);
>                 path->mnt = mnt;
>                 path->dentry = dget(mnt->mnt_root);
>                 return 0;
>         default:
>                 return err;
>         }
> chunk in follow_automount() - it would just be
> 	return finish_automount(mnt, path);
> 
> Another thing (in the same area) is not a bug per se, but...
> after the call of ->d_automount() we have this:
>         if (IS_ERR(mnt)) {
>                 /*
>                  * The filesystem is allowed to return -EISDIR here
> to indicate
>                  * it doesn't want to automount.  For instance,
> autofs would do
>                  * this so that its userspace daemon can mount on
> this dentry.
>                  *
>                  * However, we can only permit this if it's a
> terminal point in
>                  * the path being looked up; if it wasn't then the
> remainder of
>                  * the path is inaccessible and we should say so.
>                  */
>                 if (PTR_ERR(mnt) == -EISDIR && (nd->flags &
> LOOKUP_PARENT))
>                         return -EREMOTE;
>                 return PTR_ERR(mnt);
> 	}
> Except that not a single instance of ->d_automount() has ever
> returned
> -EISDIR.  Certainly not autofs one, despite the what the comment
> says.
> That chunk has come from dhowells, back when the whole mount trap
> series
> had been merged.  After talking that thing over (fun: trying to
> figure
> out what had been intended nearly 9 years ago, when people involved
> are
> in UK, US east coast and AU west coast respectively.  The only way it
> could suck more would've been if I were on the west coast - then all
> timezone deltas would be 8-hour ones)...  looks like it's a rudiment
> of plans that got superseded during the series development, nobody
> quite remembers exact details.  Conclusion: it's not even dead, it's
> stillborn; bury it.

Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
we get there it's not relevant any more, so that check looks
redundant. I'm not aware of any other fs automount implementation
that needs that EISDIR pass-thru function.

I didn't notice it at the time of the merge, sorry about that.

While we're at it that:
   if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
       return -EREMOTE;

at the top of follow_automount() isn't going to be be relevant
for autofs because ->d_automount() really must always be defined
for it.

But, at the time of the merge, I didn't object to it because
there were (are) other file systems that use the VFS automount
function which may accidentally not define the method.

> 
> Unfortunately, there are other interesting questions related to
> autofs-specific bits (->d_manage()) and the timezone-related fun
> is, of course, still there.  I hope to sort that out today or
> tomorrow, at least enough to do a reasonable set of backportable
> fixes to put in front of follow_managed()/step_into() queue.
> Oh, well...

Yeah, I know it slows you down but I kink-off like having a chance
to look at what's going and think about your questions before trying
to answer them, rather than replying prematurely, as I usually do ...

It's been a bit of a busy day so far but I'm getting to look into
the questions you've asked.

Ian
Aleksa Sarai Jan. 10, 2020, 9:07 p.m.
On 2020-01-07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, Jan 7, 2020 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > Another interesting question is whether we want O_PATH open
> > to trigger automounts.
> 
> It does sound like they shouldn't, but as you say:
> 
> >     The thing is, we do *NOT* trigger them
> > (or traverse mountpoints) at the starting point of lookups.
> > I believe it's a mistake (and mine, at that), but I doubt that
> > there's anything that can be done about it at that point.
> > It's a user-visible behaviour [..]
> 
> Hmm. I wonder how set in stone that is. We may have two decades of
> history of not doing it at start point of lookups, but we do *not*
> have two decades of history of O_PATH.
> 
> So what I think we agree would be sane behavior would be for O_PATH
> opens to not trigger automounts (unless there's a slash at the end,
> whatever), but _do_ add the mount-point traversal to the beginning of
> lookups.
> 
> But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
> 
> That way we maintain original behavior: if somebody overmounts your
> cwd, you still see the pre-mount directory on lookups, because your
> cwd is "under" the mount.
> 
> But if you open a file with O_PATH, and somebody does a mount
> _afterwards_, the openat() will see that later mount and/or do the
> automount.
> 
> Don't you think that would be the more sane/obvious semantics of how
> O_PATH should work?

If I'm understanding this proposal correctly, this would be a problem
for the libpathrs use-case -- if this is done then there's no way to
avoid a TOCTOU with someone mounting and the userspace program checking
whether something is a mountpoint (unless you have Linux >5.6 and
RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:

  1. Open the candidate directory.
  2. umount2(MNT_EXPIRE) the fd.
    * -EINVAL means it wasn't a mountpoint when we got the fd, and the
	  fd is a stable handle to the underlying directory.
	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
	  mountpoint after the fd was opened (we don't care about that, but
	  fail-safe is better here).
  3. Use the fd from (1) for all operations.

Don't get me wrong, I want to fix this issue *properly* by adding some
new kernel features that allow us to avoid worrying about
mounts-over-magiclinks -- but on old kernels (which libpathrs cares
about) I would be worried about changes like this being backported
resulting in it being not possible to implement the hardening I
mentioned up-thread.
Al Viro Jan. 12, 2020, 9:33 p.m.
On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:

> Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
> we get there it's not relevant any more, so that check looks
> redundant. I'm not aware of any other fs automount implementation
> that needs that EISDIR pass-thru function.
> 
> I didn't notice it at the time of the merge, sorry about that.
> 
> While we're at it that:
>    if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
>        return -EREMOTE;
> 
> at the top of follow_automount() isn't going to be be relevant
> for autofs because ->d_automount() really must always be defined
> for it.
> 
> But, at the time of the merge, I didn't object to it because
> there were (are) other file systems that use the VFS automount
> function which may accidentally not define the method.

OK...

> > Unfortunately, there are other interesting questions related to
> > autofs-specific bits (->d_manage()) and the timezone-related fun
> > is, of course, still there.  I hope to sort that out today or
> > tomorrow, at least enough to do a reasonable set of backportable
> > fixes to put in front of follow_managed()/step_into() queue.
> > Oh, well...
> 
> Yeah, I know it slows you down but I kink-off like having a chance

Nice typo, that ;-)

> to look at what's going and think about your questions before trying
> to answer them, rather than replying prematurely, as I usually do ...
> 
> It's been a bit of a busy day so far but I'm getting to look into
> the questions you've asked.

Here's a bit more of those (I might've missed some of your replies on
IRC; my apologies if that's the case):

1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
->d_automount() here".  If its effect can be delayed until the decision
to call ->d_automount(), the things seem to get simpler.  Is it ever
returned in situation when the sucker _is_ overmounted?

2) can autofs_d_automount() ever be called for a daemon?  Looks like it
shouldn't be...

3) is _anything_ besides root directory ever created in direct autofs
superblocks by anyone?  If not, why does autofs_lookup() even bother to
do anything there?  IOW, why not have it return ERR_PTR(-ENOENT) immediately
for direct ones?  Or am I missing something and it is, in fact, possible
to have the daemon create something in those?

4) Symlinks look like they should qualify for parent being non-empty;
at least autofs_d_manage() seems to think so (simple_empty() use).
So shouldn't we remove the trap from its parent on symlink/restore on
unlink if parent gets empty?  For version 4 or earlier, that is.  Or is
it simply that daemon only creates symlinks in root directory?


Anyway, intermediate state of the series is in #work.namei right now,
and some _very_ interesting possibilities open up.  It definitely
needs more massage around __follow_mount_rcu() (as it is, the
fastpath in there is still too twisted).  Said that
	* call graph is less convoluted
	* follow_managed() calls are folded into step_into().  Interface:
int step_into(nd, flags, dentry, inode, seq), with inode/seq used only
if we are in RCU mode.
	* ".." still doesn't use that; it probably ought to.
	* lookup_fast() doesn't take path - nd, &inode, &seq and returns dentry
	* lookup_open() and fs/namei.c:atomic_open() get similar treatment
- don't take path, return dentry.
	* calls of follow_managed()/step_into() combination returning 1
are always followed by get_link(), and very shortly, at that.  So much
that we can realistically merge pick_link() (in the end of
step_into()) with get_link().  That merge is NOT done in this branch yet.

The last one promises to get rid of a rather unpleasant group of calling
conventions.  Right now we have several functions (step_into()/
walk_component()/lookup_last()/do_last()) with the following calling
conventions:
	-E...	=> error
	0	=> non-symlink or symlink not followed; nd->path points to it
	1	=> picked a symlink to follow; its mount/dentry/seq has been
pushed on nd->stack[]; its inode is stashed into nd->link_inode for
subsequent get_link() to pick.  nd->path is left unchanged.

That way all of those become
	ERR_PTR(-E...)	=> error
	NULL		=> non-symlink, symlink not followed or a pure
jump (bare "/" or procfs ones); nd->path points to where we end up
        string		=> symlink being followed; the sucker's pushed
to stack, initial jump (if any) has been handled and the string returned
is what we need to traverse.

IMO it's less arbitrary that way.  More importantly, the separation between
step_into() committing to symlink traversal and (inevitably following)
get_link() is gone - it's one operation after that change.  No nd->link_inode
either - it's only needed to carry the information from pick_link() to the
next get_link().

Loops turn into
	while (!(err = link_path_walk(nd, s)) &&
	       (s = lookup_last(nd)) != NULL)
		;
and
	while (!(err = link_path_walk(nd, s)) &&
	       (s = do_last(nd, file, op)) != NULL)
		;

trailing_symlink() goes away (folded into pick_link()/get_link() combo,
conditional upon nd->depth at the entry).  And in link_path_walk() we'll
have
                if (unlikely(!*name)) {
                        /* pathname body, done */
                        if (!nd->depth)
                                return 0;
                        name = nd->stack[nd->depth - 1].name;
                        /* trailing symlink, done */
                        if (!name)
                                return 0;
                        /* last component of nested symlink */
                        s = walk_component(nd, WALK_FOLLOW);
                } else {
                        /* not the last component */
                        s = walk_component(nd, WALK_FOLLOW | WALK_MORE);
                }
                if (s) {
                        if (IS_ERR(s))
                                return PTR_ERR(s);
			/* a symlink to follow */
			nd->stack[nd->depth - 1].name = name;
                        name = s;
                        continue;
                }

Anyway, before I try that one I'm going to fold path_openat2() into
that series - that step is definitely going to require some massage
there; it's too close to get_link() changes done in Aleksa's series.

If we do that, we get a single primitive for "here's the result of
lookup; traverse mounts and either move into the result or, if
it's a symlink that needs to be traversed, start the symlink
traversal - jump into the base position for it (if needed) and
return the pathname that needs to be handled".  As it is, mainline
has that logics spread over about a dozen locations...

Diffstat at the moment:
 fs/autofs/dev-ioctl.c |   6 +-
 fs/internal.h         |   1 -
 fs/namei.c            | 460 ++++++++++++++------------------------------------
 fs/namespace.c        |  97 +++++++----
 fs/nfs/nfstrace.h     |   2 -
 fs/open.c             |   4 +-
 include/linux/namei.h |   3 +-
 7 files changed, 197 insertions(+), 376 deletions(-)

In the current form the sucker appears to work (so far - about 30%
into the usual xfstests run) without visible slowdowns...
Ian Kent Jan. 13, 2020, 2:59 a.m.
On Sun, 2020-01-12 at 21:33 +0000, Al Viro wrote:
> On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:
> 
> > Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
> > we get there it's not relevant any more, so that check looks
> > redundant. I'm not aware of any other fs automount implementation
> > that needs that EISDIR pass-thru function.
> > 
> > I didn't notice it at the time of the merge, sorry about that.
> > 
> > While we're at it that:
> >    if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
> >        return -EREMOTE;
> > 
> > at the top of follow_automount() isn't going to be be relevant
> > for autofs because ->d_automount() really must always be defined
> > for it.
> > 
> > But, at the time of the merge, I didn't object to it because
> > there were (are) other file systems that use the VFS automount
> > function which may accidentally not define the method.
> 
> OK...
> 
> > > Unfortunately, there are other interesting questions related to
> > > autofs-specific bits (->d_manage()) and the timezone-related fun
> > > is, of course, still there.  I hope to sort that out today or
> > > tomorrow, at least enough to do a reasonable set of backportable
> > > fixes to put in front of follow_managed()/step_into() queue.
> > > Oh, well...
> > 
> > Yeah, I know it slows you down but I kink-off like having a chance
> 
> Nice typo, that ;-)
> 
> > to look at what's going and think about your questions before
> > trying
> > to answer them, rather than replying prematurely, as I usually do
> > ...
> > 
> > It's been a bit of a busy day so far but I'm getting to look into
> > the questions you've asked.
> 
> Here's a bit more of those (I might've missed some of your replies on
> IRC; my apologies if that's the case):
> 
> 1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
> ->d_automount() here".  If its effect can be delayed until the
> decision
> to call ->d_automount(), the things seem to get simpler.  Is it ever
> returned in situation when the sucker _is_ overmounted?

In theory it shouldn't need to be returned when there is an
actual mount there.

If there is a real mount at this point that should be enough to
prevent walks into that mount until it's mount is complete.

The whole idea of -EISDIR is to prevent processes from walking
into a directory tree that "doesn't have a real mount at its
base" (the so called multi-mount map construct).

> 
> 2) can autofs_d_automount() ever be called for a daemon?  Looks like
> it
> shouldn't be...

Can't do that, it will lead to deadlock very quickly.

> 
> 3) is _anything_ besides root directory ever created in direct autofs
> superblocks by anyone?  If not, why does autofs_lookup() even bother
> to
> do anything there?  IOW, why not have it return ERR_PTR(-ENOENT)
> immediately
> for direct ones?  Or am I missing something and it is, in fact,
> possible
> to have the daemon create something in those?

Short answer is no, longer answer is directories "shouldn't" ever
be created inside direct mount points.

The thing is that the multi-mount map construct can be used with
direct mounts too, but they must always have a real mount at the
base because they are direct mounts. So processes should not be
able to walk into them while they are being mounted (constructed).

But I'm pretty sure it's rare (maybe not done at all) that this
map construct is used with direct mounts.

> 
> 4) Symlinks look like they should qualify for parent being non-empty;
> at least autofs_d_manage() seems to think so (simple_empty() use).
> So shouldn't we remove the trap from its parent on symlink/restore on
> unlink if parent gets empty?  For version 4 or earlier, that is.  Or
> is
> it simply that daemon only creates symlinks in root directory?

Yes, they have to be empty.

If a symlink is to be used (based on autofs config or map option)
and the "browse" option is used for the indirect mount (browse
only makes sense for indirect autofs managed mounts) then the
mount point directory has to be removed and a symlink created
so it must be empty to for this to make sense.

If it's a "nobrowse" autofs mount then nothing should already
exist, it just gets created.

The catch is that a map entry for which a symlink is to be used
instead of a mount can't be a multi-mount. I'm pretty sure I don't
have sufficient error checking for that in the daemon but I also
haven't had reports of problems with it either.

For a very long time the use of symlinks was not common but when
the amd format map parser was added it made sense to use symlinks
in some cases for those. That was partly to reduce the number of
mounts needed and because I deliberately don't support amd map
entries that provide the multi-mount construct. The way amd did
this looked ugly to me, very much a hack to add a Sun format
mount feature.

As far as keeping the trap flags up to date, I don't.

It seemed so much simpler to just leave the flags in place but,
at that time, symlinks were not used (although it was possible to
do so), now that's changed fiddling with the flags might now make
sense.

As I said on IRC:
"DCACHE_NEED_AUTOMOUNT is set on symlink dentries because, when
->lookup() is called the dentry may trigger a callback to the
daemon that will either create a directory (since, in this case,
one does not already exist) and attempt to mount on it or create
a symlink if the autofs config/map requires it.

I didn't think there would be potential simplification by setting
and clearing the DCACHE_NEED_AUTOMOUNT flag based on it being a
directory (mountpoint) or a symlink so the flag is always left set.
Although, as you point out, symlinks won't actually trigger mounts
so the flag being left set when the dentry is a symlink is due to
lazyness, since there's nothing to gain. If you can see potential
simplification in the VFS code by managing this flag better then
that would be worth while."

> 
> 
> Anyway, intermediate state of the series is in #work.namei right now,
> and some _very_ interesting possibilities open up.  It definitely
> needs more massage around __follow_mount_rcu() (as it is, the
> fastpath in there is still too twisted).  Said that
> 	* call graph is less convoluted
> 	* follow_managed() calls are folded into
> step_into().  Interface:
> int step_into(nd, flags, dentry, inode, seq), with inode/seq used
> only
> if we are in RCU mode.
> 	* ".." still doesn't use that; it probably ought to.
> 	* lookup_fast() doesn't take path - nd, &inode, &seq and
> returns dentry
> 	* lookup_open() and fs/namei.c:atomic_open() get similar
> treatment
> - don't take path, return dentry.
> 	* calls of follow_managed()/step_into() combination returning 1
> are always followed by get_link(), and very shortly, at that.  So
> much
> that we can realistically merge pick_link() (in the end of
> step_into()) with get_link().  That merge is NOT done in this branch
> yet.
> 
> The last one promises to get rid of a rather unpleasant group of
> calling
> conventions.  Right now we have several functions (step_into()/
> walk_component()/lookup_last()/do_last()) with the following calling
> conventions:
> 	-E...	=> error
> 	0	=> non-symlink or symlink not followed; nd->path points to it
> 	1	=> picked a symlink to follow; its mount/dentry/seq has been
> pushed on nd->stack[]; its inode is stashed into nd->link_inode for
> subsequent get_link() to pick.  nd->path is left unchanged.
> 
> That way all of those become
> 	ERR_PTR(-E...)	=> error
> 	NULL		=> non-symlink, symlink not followed or a
> pure
> jump (bare "/" or procfs ones); nd->path points to where we end up
>         string		=> symlink being followed; the sucker's
> pushed
> to stack, initial jump (if any) has been handled and the string
> returned
> is what we need to traverse.
> 
> IMO it's less arbitrary that way.  More importantly, the separation
> between
> step_into() committing to symlink traversal and (inevitably
> following)
> get_link() is gone - it's one operation after that change.  No nd-
> >link_inode
> either - it's only needed to carry the information from pick_link()
> to the
> next get_link().
> 
> Loops turn into
> 	while (!(err = link_path_walk(nd, s)) &&
> 	       (s = lookup_last(nd)) != NULL)
> 		;
> and
> 	while (!(err = link_path_walk(nd, s)) &&
> 	       (s = do_last(nd, file, op)) != NULL)
> 		;
> 
> trailing_symlink() goes away (folded into pick_link()/get_link()
> combo,
> conditional upon nd->depth at the entry).  And in link_path_walk()
> we'll
> have
>                 if (unlikely(!*name)) {
>                         /* pathname body, done */
>                         if (!nd->depth)
>                                 return 0;
>                         name = nd->stack[nd->depth - 1].name;
>                         /* trailing symlink, done */
>                         if (!name)
>                                 return 0;
>                         /* last component of nested symlink */
>                         s = walk_component(nd, WALK_FOLLOW);
>                 } else {
>                         /* not the last component */
>                         s = walk_component(nd, WALK_FOLLOW |
> WALK_MORE);
>                 }
>                 if (s) {
>                         if (IS_ERR(s))
>                                 return PTR_ERR(s);
> 			/* a symlink to follow */
> 			nd->stack[nd->depth - 1].name = name;
>                         name = s;
>                         continue;
>                 }
> 
> Anyway, before I try that one I'm going to fold path_openat2() into
> that series - that step is definitely going to require some massage
> there; it's too close to get_link() changes done in Aleksa's series.
> 
> If we do that, we get a single primitive for "here's the result of
> lookup; traverse mounts and either move into the result or, if
> it's a symlink that needs to be traversed, start the symlink
> traversal - jump into the base position for it (if needed) and
> return the pathname that needs to be handled".  As it is, mainline
> has that logics spread over about a dozen locations...
> 
> Diffstat at the moment:
>  fs/autofs/dev-ioctl.c |   6 +-
>  fs/internal.h         |   1 -
>  fs/namei.c            | 460 ++++++++++++++------------------------
> ------------
>  fs/namespace.c        |  97 +++++++----
>  fs/nfs/nfstrace.h     |   2 -
>  fs/open.c             |   4 +-
>  include/linux/namei.h |   3 +-
>  7 files changed, 197 insertions(+), 376 deletions(-)
> 
> In the current form the sucker appears to work (so far - about 30%
> into the usual xfstests run) without visible slowdowns...

Ok, I'll have a look at that branch, ;)

Ian
Ian Kent Jan. 14, 2020, 12:25 a.m.
On Mon, 2020-01-13 at 10:59 +0800, Ian Kent wrote:
> 
> > 3) is _anything_ besides root directory ever created in direct
> > autofs
> > superblocks by anyone?  If not, why does autofs_lookup() even
> > bother
> > to
> > do anything there?  IOW, why not have it return ERR_PTR(-ENOENT)
> > immediately
> > for direct ones?  Or am I missing something and it is, in fact,
> > possible
> > to have the daemon create something in those?
> 
> Short answer is no, longer answer is directories "shouldn't" ever
> be created inside direct mount points.
> 
> The thing is that the multi-mount map construct can be used with
> direct mounts too, but they must always have a real mount at the
> base because they are direct mounts. So processes should not be
> able to walk into them while they are being mounted (constructed).
> 
> But I'm pretty sure it's rare (maybe not done at all) that this
> map construct is used with direct mounts.

This isn't right.

There's actually nothing stopping a user from using a direct map
entry that's a multi-mount without an actual mount at its root.
So there could be directories created under these, it's just not
usually done.

I'm pretty sure I don't check and disallow this.

Ian
Al Viro Jan. 14, 2020, 4:39 a.m.
On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:

> This isn't right.
> 
> There's actually nothing stopping a user from using a direct map
> entry that's a multi-mount without an actual mount at its root.
> So there could be directories created under these, it's just not
> usually done.
> 
> I'm pretty sure I don't check and disallow this.

IDGI...  How the hell will that work in v5?  Who will set _any_
traps outside the one in root in that scenario?  autofs_lookup()
won't (there it's conditional upon indirect mount).  Neither
will autofs_dir_mkdir() (conditional upon version being less
than 5).  Who will, then?

Confused...
Al Viro Jan. 14, 2020, 4:57 a.m.
On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:

> If I'm understanding this proposal correctly, this would be a problem
> for the libpathrs use-case -- if this is done then there's no way to
> avoid a TOCTOU with someone mounting and the userspace program checking
> whether something is a mountpoint (unless you have Linux >5.6 and
> RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> 
>   1. Open the candidate directory.
>   2. umount2(MNT_EXPIRE) the fd.
>     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> 	  fd is a stable handle to the underlying directory.
> 	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
> 	  mountpoint after the fd was opened (we don't care about that, but
> 	  fail-safe is better here).
>   3. Use the fd from (1) for all operations.

... except that foo/../bar *WILL* cross into the covering mount, on any
kernel that supports ...at(2) at all, so I would be very cautious about
any kind "hardening" claims in that case.

I'm not sure about Linus' proposal - it looks rather convoluted and we
get a hard to describe twist of semantics in an area (procfs symlinks
vs. mount traversal) on top of everything else in there...

Anyway, a couple of questions:

1) do you see any problems on your testcases with the current #fixes?
That's commit 7a955b7363b8 as branch tip.

2) do you have any updates you would like to fold into stuff in
#work.openat2?  Right now I have a local variant of #work.namei (with
fairly cosmetical change compared to vfs.git one) that merges clean
with #work.openat2; I would like to do any updates/fold-ins/etc.
of #work.openat2 *before* doing a merge and continuing to work on
top of the merge results...
Ian Kent Jan. 14, 2020, 5:01 a.m.
On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
> 
> > This isn't right.
> > 
> > There's actually nothing stopping a user from using a direct map
> > entry that's a multi-mount without an actual mount at its root.
> > So there could be directories created under these, it's just not
> > usually done.
> > 
> > I'm pretty sure I don't check and disallow this.
> 
> IDGI...  How the hell will that work in v5?  Who will set _any_
> traps outside the one in root in that scenario?  autofs_lookup()
> won't (there it's conditional upon indirect mount).  Neither
> will autofs_dir_mkdir() (conditional upon version being less
> than 5).  Who will, then?
> 
> Confused...

It's easy to miss.

For autofs type direct and offset mounts the flags are set at fill
super time.

They have to be set then because they are direct mounts and offset
mounts behave the same as direct mounts so they need to be set then
too. So, like direct mounts, offset mounts are each distinct autofs
(trigger) mounts.

I could check for this construct and refuse it if that's really
needed. I'm pretty sure this map construct isn't much used by
people using direct mounts.

Ian
Al Viro Jan. 14, 2020, 5:12 a.m.
On Tue, Jan 14, 2020 at 04:57:33AM +0000, Al Viro wrote:
> On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
> 
> > If I'm understanding this proposal correctly, this would be a problem
> > for the libpathrs use-case -- if this is done then there's no way to
> > avoid a TOCTOU with someone mounting and the userspace program checking
> > whether something is a mountpoint (unless you have Linux >5.6 and
> > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> > 
> >   1. Open the candidate directory.
> >   2. umount2(MNT_EXPIRE) the fd.
> >     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> > 	  fd is a stable handle to the underlying directory.
> > 	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
> > 	  mountpoint after the fd was opened (we don't care about that, but
> > 	  fail-safe is better here).
> >   3. Use the fd from (1) for all operations.
> 
> ... except that foo/../bar *WILL* cross into the covering mount, on any
> kernel that supports ...at(2) at all, so I would be very cautious about
> any kind "hardening" claims in that case.
> 
> I'm not sure about Linus' proposal - it looks rather convoluted and we
> get a hard to describe twist of semantics in an area (procfs symlinks
> vs. mount traversal) on top of everything else in there...

PS: one thing that might be interesting is exposing LOOKUP_DOWN via
AT_... flag - it would allow to request mount traversals at the starting
point explicitly.  Pretty much all code needed for that is already there;
all it would take is checking the flag in path_openat() and path_parentat()
and having handle_lookup_down() called there, same as in path_lookupat().

A tricky question is whether such flag should affect absolute symlinks -
i.e.

chdir /foo
ln -s /bar barf
overmount /
do lookup with that flag for /bar/splat
do lookup with that flag for barf/splat

Do we want the same results in both calls?  The first one would
traverse mounts on / and walk into /bar/splat in overmounting;
the second - see no mounts whatsoever on current directory (/foo
in old root), see the symlink to "/bar", jump to process' root
and proceed from there, first for "bar", then "splat" in it...
Ian Kent Jan. 14, 2020, 5:59 a.m.
On Tue, 2020-01-14 at 13:01 +0800, Ian Kent wrote:
> On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> > On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
> > 
> > > This isn't right.
> > > 
> > > There's actually nothing stopping a user from using a direct map
> > > entry that's a multi-mount without an actual mount at its root.
> > > So there could be directories created under these, it's just not
> > > usually done.
> > > 
> > > I'm pretty sure I don't check and disallow this.
> > 
> > IDGI...  How the hell will that work in v5?  Who will set _any_
> > traps outside the one in root in that scenario?  autofs_lookup()
> > won't (there it's conditional upon indirect mount).  Neither
> > will autofs_dir_mkdir() (conditional upon version being less
> > than 5).  Who will, then?
> > 
> > Confused...
> 
> It's easy to miss.
> 
> For autofs type direct and offset mounts the flags are set at fill
> super time.
> 
> They have to be set then because they are direct mounts and offset
> mounts behave the same as direct mounts so they need to be set then
> too. So, like direct mounts, offset mounts are each distinct autofs
> (trigger) mounts.
> 
> I could check for this construct and refuse it if that's really
> needed. I'm pretty sure this map construct isn't much used by
> people using direct mounts.

Ok, once again I'm not exactly accurate is some of what I said.

It turns out that the autofs connectathon tests, one of the tests
that I use, does test direct mounts with offsets both with and
without a real mount at the base of the mount.

Based on that, I have to say this map construct is meant to be
supported with Sun format maps of autofs (even though I think it's
probably not used much).

So not allowing it is probably the wrong thing to do.

OTOH initial testing with the #work.namei branch shows these are
functioning as required.

Ian
Aleksa Sarai Jan. 14, 2020, 8:01 p.m.
On 2020-01-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
> 
> > If I'm understanding this proposal correctly, this would be a problem
> > for the libpathrs use-case -- if this is done then there's no way to
> > avoid a TOCTOU with someone mounting and the userspace program checking
> > whether something is a mountpoint (unless you have Linux >5.6 and
> > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> > 
> >   1. Open the candidate directory.
> >   2. umount2(MNT_EXPIRE) the fd.
> >     * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> > 	  fd is a stable handle to the underlying directory.
> > 	* -EAGAIN or -EBUSY means that it was a mountpoint or became a
> > 	  mountpoint after the fd was opened (we don't care about that, but
> > 	  fail-safe is better here).
> >   3. Use the fd from (1) for all operations.
> 
> ... except that foo/../bar *WILL* cross into the covering mount, on any
> kernel that supports ...at(2) at all, so I would be very cautious about
> any kind "hardening" claims in that case.

In the use-case I have, we would have full control over what the path
being opened is (and thus you wouldn't open "foo/../bar"). But I agree
that generally the MNT_EXPIRE solution is really non-ideal anyway.

Not to mention that we're still screwed when it comes to using
magic-links (because if someone bind-mounts a magic-link over a
magic-link there's absolutely no race-free way to be sure that we're
traversing the right magic-link -- for that we'll need to have a
different solution).

> I'm not sure about Linus' proposal - it looks rather convoluted and we
> get a hard to describe twist of semantics in an area (procfs symlinks
> vs. mount traversal) on top of everything else in there...

Yeah, I agree.

> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I will take a quick look later today, but I'm currently at a conference.

> 2) do you have any updates you would like to fold into stuff in
> #work.openat2?  Right now I have a local variant of #work.namei (with
> fairly cosmetical change compared to vfs.git one) that merges clean
> with #work.openat2; I would like to do any updates/fold-ins/etc.
> of #work.openat2 *before* doing a merge and continuing to work on
> top of the merge results...

Yes, there were two patches I sent a while ago[1]. I can re-send them if
you like. The second patch switches open_how->mode to a u64, but I'm
still on the fence about whether that makes sense to do...

[1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
Aleksa Sarai Jan. 15, 2020, 1:57 p.m.
On 2020-01-14, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I just finished testing the few cases I reported earlier and they both
appear to be fixed with the current #work.namei branch. And I don't have
any troubles booting whatsoever.
Al Viro Jan. 15, 2020, 2:25 p.m.
On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:

> Yes, there were two patches I sent a while ago[1]. I can re-send them if
> you like. The second patch switches open_how->mode to a u64, but I'm
> still on the fence about whether that makes sense to do...

IMO plain __u64 is better than games with __aligned_u64 - all sizes are
fixed, so...

> [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/

Do you want that series folded into "open: introduce openat2(2) syscall"
and "selftests: add openat2(2) selftests" or would you rather have them
appended at the end of the series.  Personally I'd go for "fold them in"
if it had been about my code, but it's really up to you.
Aleksa Sarai Jan. 15, 2020, 2:29 p.m.
On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> 
> > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > you like. The second patch switches open_how->mode to a u64, but I'm
> > still on the fence about whether that makes sense to do...
> 
> IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> fixed, so...
> 
> > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
> 
> Do you want that series folded into "open: introduce openat2(2) syscall"
> and "selftests: add openat2(2) selftests" or would you rather have them
> appended at the end of the series.  Personally I'd go for "fold them in"
> if it had been about my code, but it's really up to you.

"fold them in" would probably be better to avoid making the mainline
history confusing afterwards. Thanks.
Aleksa Sarai Jan. 15, 2020, 2:34 p.m.
On 2020-01-16, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> > 
> > > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > > you like. The second patch switches open_how->mode to a u64, but I'm
> > > still on the fence about whether that makes sense to do...
> > 
> > IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> > fixed, so...
> > 
> > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
> > 
> > Do you want that series folded into "open: introduce openat2(2) syscall"
> > and "selftests: add openat2(2) selftests" or would you rather have them
> > appended at the end of the series.  Personally I'd go for "fold them in"
> > if it had been about my code, but it's really up to you.
> 
> "fold them in" would probably be better to avoid making the mainline
> history confusing afterwards. Thanks.

Also (if you prefer) I can send a v3 which uses u64s rather than
aligned_u64s.
Al Viro Jan. 15, 2020, 2:48 p.m.
On Thu, Jan 16, 2020 at 01:34:59AM +1100, Aleksa Sarai wrote:
> On 2020-01-16, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2020-01-15, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> > > 
> > > > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > > > you like. The second patch switches open_how->mode to a u64, but I'm
> > > > still on the fence about whether that makes sense to do...
> > > 
> > > IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> > > fixed, so...
> > > 
> > > > [1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
> > > 
> > > Do you want that series folded into "open: introduce openat2(2) syscall"
> > > and "selftests: add openat2(2) selftests" or would you rather have them
> > > appended at the end of the series.  Personally I'd go for "fold them in"
> > > if it had been about my code, but it's really up to you.
> > 
> > "fold them in" would probably be better to avoid making the mainline
> > history confusing afterwards. Thanks.
> 
> Also (if you prefer) I can send a v3 which uses u64s rather than
> aligned_u64s.

<mode "lazy bastard">
Could you fold and resend the results of folding (i.e. replacements
for two commits in question)?
</mode>

The hard part is, of course, in updating commit messages ;-)
Aleksa Sarai Jan. 18, 2020, 12:07 p.m.
Patch changelog:
  v3:
   * Merge changes into the original patches to make Al's life easier.
     [Al Viro]
  v2:
   * Add include <linux/types.h> to openat2.h. [Florian Weimer]
   * Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
   * Switch from __aligned_u64 to __u64 since it isn't necessary.
     [David Laight]
  v1: <https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/>

While openat2(2) is still not yet in Linus's tree, we can take this
opportunity to iron out some small warts that weren't noticed earlier:

  * A fix was suggested by Florian Weimer, to separate the openat2
    definitions so glibc can use the header directly. I've put the
    maintainership under VFS but let me know if you'd prefer it belong
    ot the fcntl folks.

  * Having heterogenous field sizes in an extensible struct results in
    "padding hole" problems when adding new fields (in addition the
    correct error to use for non-zero padding isn't entirely clear ).
    The simplest solution is to just copy clone(3)'s model -- always use
    u64s. It will waste a little more space in the struct, but it
    removes a possible future headache.

This patch is intended to replace the corresponding patches in Al's
#work.openat2 tree (and *will not* apply on Linus' tree).

@Al: I will send some additional patches later, but they will require
     proper design review since they're ABI-related features (namely,
	 adding a way to check what features a syscall supports as I
	 outlined in my talk here[1]).

[1]: https://youtu.be/ggD-eb3yPVs

Aleksa Sarai (2):
  open: introduce openat2(2) syscall
  selftests: add openat2(2) selftests

 CREDITS                                       |   4 +-
 MAINTAINERS                                   |   1 +
 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/open.c                                     | 147 +++--
 include/linux/fcntl.h                         |  16 +-
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/fcntl.h                    |   2 +-
 include/uapi/linux/openat2.h                  |  39 ++
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/openat2/.gitignore    |   1 +
 tools/testing/selftests/openat2/Makefile      |   8 +
 tools/testing/selftests/openat2/helpers.c     | 109 ++++
 tools/testing/selftests/openat2/helpers.h     | 106 ++++
 .../testing/selftests/openat2/openat2_test.c  | 312 +++++++++++
 .../selftests/openat2/rename_attack_test.c    | 160 ++++++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
 34 files changed, 1418 insertions(+), 39 deletions(-)
 create mode 100644 include/uapi/linux/openat2.h
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c
Al Viro Jan. 18, 2020, 3:28 p.m.
On Sat, Jan 18, 2020 at 11:07:58PM +1100, Aleksa Sarai wrote:
> Patch changelog:
>   v3:
>    * Merge changes into the original patches to make Al's life easier.
>      [Al Viro]
>   v2:
>    * Add include <linux/types.h> to openat2.h. [Florian Weimer]
>    * Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
>    * Switch from __aligned_u64 to __u64 since it isn't necessary.
>      [David Laight]
>   v1: <https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/>
> 
> While openat2(2) is still not yet in Linus's tree, we can take this
> opportunity to iron out some small warts that weren't noticed earlier:
> 
>   * A fix was suggested by Florian Weimer, to separate the openat2
>     definitions so glibc can use the header directly. I've put the
>     maintainership under VFS but let me know if you'd prefer it belong
>     ot the fcntl folks.
> 
>   * Having heterogenous field sizes in an extensible struct results in
>     "padding hole" problems when adding new fields (in addition the
>     correct error to use for non-zero padding isn't entirely clear ).
>     The simplest solution is to just copy clone(3)'s model -- always use
>     u64s. It will waste a little more space in the struct, but it
>     removes a possible future headache.
> 
> This patch is intended to replace the corresponding patches in Al's
> #work.openat2 tree (and *will not* apply on Linus' tree).
> 
> @Al: I will send some additional patches later, but they will require
>      proper design review since they're ABI-related features (namely,
> 	 adding a way to check what features a syscall supports as I
> 	 outlined in my talk here[1]).

#work.openat2 updated, #for-next rebuilt and force-pushed.  There's
a massive update of #work.namei as well, also pushed out; not in
#for-next yet, will post the patch series for review later today.
Al Viro Jan. 18, 2020, 6:09 p.m.
On Sat, Jan 18, 2020 at 03:28:33PM +0000, Al Viro wrote:

> #work.openat2 updated, #for-next rebuilt and force-pushed.  There's
> a massive update of #work.namei as well, also pushed out; not in
> #for-next yet, will post the patch series for review later today.

BTW, looking through that code again, how could this
static bool legitimize_root(struct nameidata *nd)
{
        /*
         * For scoped-lookups (where nd->root has been zeroed), we need to
         * restart the whole lookup from scratch -- because set_root() is wrong
         * for these lookups (nd->dfd is the root, not the filesystem root).
         */
        if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
                return false;

possibly trigger?  The only things that ever clean ->root.mnt are

	1) failing legitimize_path(nd, &nd->root, nd->root_seq) in
legitimize_root() itself.  If *ANY* legitimize_path() has failed,
we are through - RCU pathwalk is given up.  In particular, if you
look at the call chains leading to legitimize_root(), you'll see
that it's called by unlazy_walk() or unlazy_child() and failure
has either of those buggger off immediately.  The same goes for
their callers; fail any of those and we are done; the very next
thing that will be done with that nameidata is going to be
terminate_walk().  We don't look at its fields, etc. - just return
to the top level ASAP and call terminate_walk() on it.  Which is where
we run into
                if (nd->flags & LOOKUP_ROOT_GRABBED) {
                        path_put(&nd->root);
                        nd->flags &= ~LOOKUP_ROOT_GRABBED;
                }
paired with setting LOOKUP_ROOT_GRABBED just before the attempt
to legitimize in legitimize_root().  The next thing *after*
terminate_walk() is either path_init() or the end of life for
that struct nameidata instance.
	This is really, really fundamental for understanding the whole
thing - a failure of unlazy_walk/unlazy_child means that we are through
with that attempt.

	2) complete_walk() doing
                if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
                        nd->root.mnt = NULL;   
Can't happen with LOOKUP_IS_SCOPED in flags, obviously.

	3) path_init().  Where it's followed either by leaving through
        if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
		....
        }
(and LOOKUP_IS_SCOPED includes LOOKUP_IN_ROOT) or with a failure exit
(no calls of *anything* but terminate_walk() after that or with
        if (flags & LOOKUP_IS_SCOPED) {
                nd->root = nd->path;
... and that makes damn sure nd->root.mnt is not NULL.

And neither of the LOOKUP_IS_SCOPED bits ever gets changed in nd->flags -
they remain as path_init() has set them.

The same, BTW, goes for the check you've added in the beginning of
set_root() - set_root() is called only with NULL nd->root.mnt (trivial to
prove) and that is incompatible with LOOKUP_IS_SCOPED.  I'm kinda-sorta
OK with having WARN_ON() there for a while, but IMO the check in the
beginning of legitimize_root() should go away - this kind of defensive
programming only makes harder to reason about the behaviour of the
entire thing.  And fs/namei.c is too convoluted as it is...
Aleksa Sarai Jan. 18, 2020, 11:03 p.m.
On 2020-01-18, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 18, 2020 at 03:28:33PM +0000, Al Viro wrote:
> 
> > #work.openat2 updated, #for-next rebuilt and force-pushed.  There's
> > a massive update of #work.namei as well, also pushed out; not in
> > #for-next yet, will post the patch series for review later today.
> 
> BTW, looking through that code again, how could this
> static bool legitimize_root(struct nameidata *nd)
> {
>         /*
>          * For scoped-lookups (where nd->root has been zeroed), we need to
>          * restart the whole lookup from scratch -- because set_root() is wrong
>          * for these lookups (nd->dfd is the root, not the filesystem root).
>          */
>         if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
>                 return false;
> 
> possibly trigger?  The only things that ever clean ->root.mnt are

You're quite right -- the codepath I was worried about was pick_link()
failing (which *does* clear nd->path.mnt, and I must've misread it at
the time as nd->root.mnt).

We can drop this check, though now complete_walk()'s main defence
against a NULL nd->root.mnt is that path_is_under() will fail and
trigger -EXDEV (or set_root() will fail at some point in the future).
However, as you pointed out, a NULL nd->root.mnt won't happen with
things as they stand today -- I might be a little too paranoid. :P

> 	This is really, really fundamental for understanding the whole
> thing - a failure of unlazy_walk/unlazy_child means that we are through
> with that attempt.

Yup -- see above, the worry was about pick_link() not about how the
RCU-walk and REF-walk dances operate.

> The same, BTW, goes for the check you've added in the beginning of
> set_root() - set_root() is called only with NULL nd->root.mnt (trivial to
> prove) and that is incompatible with LOOKUP_IS_SCOPED.  I'm kinda-sorta
> OK with having WARN_ON() there for a while, but IMO the check in the
> beginning of legitimize_root() should go away -

You're quite right about dropping the legitimize_root() check, but I'd
like to keep the WARN_ON() in set_root(). The main reason being that it
makes us very damn sure that a future change won't accidentally break
the nd->root contract which all of the LOOKUP_IS_SCOPED changes rely on.
Then again, this might be my paranoia popping up again.

> this kind of defensive programming only makes harder to reason about
> the behaviour of the entire thing.  And fs/namei.c is too convoluted
> as it is...

If you feel that dropping some of these more defensive checks is better
for the codebase as a whole, then I defer to your judgement. I
completely agree that namei is a pretty complicated chunk of code.
Al Viro Jan. 19, 2020, 1:12 a.m.
On Sun, Jan 19, 2020 at 10:03:13AM +1100, Aleksa Sarai wrote:

> > possibly trigger?  The only things that ever clean ->root.mnt are
> 
> You're quite right -- the codepath I was worried about was pick_link()
> failing (which *does* clear nd->path.mnt, and I must've misread it at
> the time as nd->root.mnt).

pick_link() (allocation failure of external stack in RCU case, followed
by failure to legitimize the link) is, unfortunately, subtle and nasty.
We *must* path_put() the link; if we'd managed to legitimize the mount
and failed on dentry, the mount needs to be dropped.  No way around it.
And while everything else there can be left for soon-to-be-reached
terminate_walk(), this cannot.  We have no good way to pass what
we need to drop to the place where that eventual terminate_walk()
drops rcu_read_lock().  So we end up having to do what terminate_walk()
would've done and do it right there, so we could do that path_put(link)
before we bugger off.

I'm not happy about that, but I don't see cleaner solutions, more's the
pity.  However, it doesn't mess with ->root - nor should it, since
we don't have LOOKUP_ROOT_GRABBED (not in RCU mode), so it can and
should be left alone.
 
> We can drop this check, though now complete_walk()'s main defence
> against a NULL nd->root.mnt is that path_is_under() will fail and
> trigger -EXDEV (or set_root() will fail at some point in the future).
> However, as you pointed out, a NULL nd->root.mnt won't happen with
> things as they stand today -- I might be a little too paranoid. :P

The only reason why complete_walk() zeroes nd->root in some cases is
microoptimization - we *know* we won't be using it later, so we don't
care whether it's stale or not and can spare unlazy_walk() a bit of
work.  All there is to that one.

I don't see any reason for adding code that would clear nd->root in later
work; if such thing does get added (again, I don't see what purpose
could that possibly serve), we'll need to watch out for a lot of things.
Starting with LOOKUP_ROOT case...  It's not something likely to slip
in unnoticed.
Al Viro Jan. 19, 2020, 3:14 a.m.
OK, vfs.git #work.namei seems to survive xfstests.  I think
it cleans the things quite a bit, but it obviously needs more
review and testing.

	Review and testing would be _very_ welcome; it does a lot
of massage, so there had been a plenty of opportunities to fuck up
and fail to spot that.  The same goes for profiling - it doesn't
seem to slow the things down, but that needs to be verified.

	It does include #work.openat2.  Topology: 17 commits, followed
by clean merge with #work.openat2, followed by 9 followups.  The
part is #work.openat2 is as posted by Aleksa; I can repost it, but
I don't see much point.  Description of the rest follows; patches
themselves will be in followups.

part 1: follow_automount() cleanups and fixes.

	Quite a bit of that function had been about working around the
wrong calling conventions of finish_automount().  The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to discard
whatever it was going to attach and just cross into what got mounted
there in the meanwhile (most likely - the results of the same automount
triggered by somebody else).  Current mainline kinda-sorta manages to do
that, but it's unreliable and very convoluted.  Much simpler approach
is to stop using lock_mount() in finish_automount() and have it bail
out if something turns out to have been mounted on top where we wanted
to attach.  That allows to get rid of a lot of PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.

	Another thing in the same series is divorcing follow_automount()
from nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().

	4 commits, the second one fixes a hard-to-hit race.  The first
is a prereq for it.

1/17	do_add_mount(): lift lock_mount/unlock_mount into callers
2/17	fix automount/automount race properly
3/17	follow_automount(): get rid of dead^Wstillborn code
4/17	follow_automount() doesn't need the entire nameidata

part 2: unifying mount traversals in pathwalk.

	Handling of mount traversal (follow_managed()) is currently called
in a bunch of places.  Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof.  However, the locations
of those step_into() calls are far from preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to given step_into() call.  It's harder to analyse that it should
be (especially when it comes to liveness analysis) and it forces rather
ugly calling conventions on lookup_fast()/atomic_open()/lookup_open().
The series below massages the code to the point when the calls of
follow_managed() (and __follow_mount_rcu()) move into the beginning of
step_into().

5/17	make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
	gets EEXIST handling in do_last() past the step_into() call there.
6/17	handle_mounts(): start building a sane wrapper for follow_managed()
	rather than mangling follow_managed() itself (and creating conflicts
	with openat2 series), add a wrapper that will absorb the required
	interface changes.
7/17	atomic_open(): saner calling conventions (return dentry on success)
	struct path passed to it is pure out parameter; only dentry part
	ever varies, though - mnt is always nd->path.mnt.  Just return
	the dentry on success, and ERR_PTR(-E...) on failure.
8/17	lookup_open(): saner calling conventions (return dentry on success)
	propagate the same change one level up the call chain.
9/17	do_last(): collapse the call of path_to_nameidata()
	struct path filled in lookup_open() call is eventually given to
	handle_mounts(); the only use it has before that is path_to_nameidata()
	call in "->atomic_open() has actually opened it" case, and there
	path_to_nameidata() is an overkill - we are guaranteed to replace
	only nd->path.dentry.  So have the struct path filled only immediately
	prior to handle_mounts().
10/17	handle_mounts(): pass dentry in, turn path into a pure out argument
	now all callers of handle_mount() are directly preceded by filling
	struct path it gets.  path->mnt is nd->path.mnt in all cases, so we can
	pass just the dentry instead and fill path in handle_mount() itself.
	Some boilerplate gone, path is pure out argument of handle_mount()
	now.
11/17	lookup_fast(): consolidate the RCU success case
	massage to gather what will become an RCU case equivalent of
	handle_mounts(); basically, that's what we do if revalidate succeeds
	in RCU case of lookup_fast(), including unlazy and fallback to
	handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/17	teach handle_mounts() to handle RCU mode
	... and take that into handle_mount() itself.  The other caller of
	__follow_mount_rcu() is fine with the same fallback (it just didn't
	bother since it's in the very beginning of pathwalk), switched to
	handle_mount() as well.
13/17	lookup_fast(): take mount traversal into callers
	Now we are getting somewhere - both RCU and non-RCU success cases of
	lookup_fast() are ended with the same return handle_mounts(...);
	move that to the callers - there it will merge with the identical calls
	that had been on the paths where we had to do slow lookups.
	lookup_fast() returns dentry now.
14/17	new step_into() flag: WALK_NOFOLLOW
	use step_into() instead of open-coding it in handle_lookup_down().
	Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
	that (and eventually, I hope, for .. handling).
	Now *all* calls of handle_mounts() and step_into() are right next to
	each other.
15/17	fold handle_mounts() into step_into()
	... and we can move the call of handle_mounts() into step_into(),
	getting a slightly saner calling conventions out of that.
16/17	LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
	another payoff from 14/17 - we can teach path_lookupat() to do
	what path_mountpointat() used to.  And kill the latter, along with
	its wrappers.
17/17	expand the only remaining call of path_lookup_conditional()
	minor cleanup - RIP path_lookup_conditional().  Only one caller left.

At that point we run out of things that can be done without textual conflicts
with openat2 series.  Changes so far:
	* mount traversal is taken into step_into().
	* lookup_fast(), atomic_open() and lookup_open() calling conventions
are slightly changed.  All of them return dentry now, instead of returning
an int and filling struct path on success.  For lookup_fast() the old
"0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache
miss, dentry - for hit".
	* step_into() can be called in RCU mode as well.  Takes nameidata,
WALK_... flags, dentry and, in RCU case, corresponding inode and seq value.
Handles mount traversals, decides whether it's a symlink to be followed.
Error => returns -E...; symlink to follow => returns 1, puts symlink on stack;
non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
	* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.

Next comes the merge with Aleksa's openat2 patchset; everything up to that point
had been non-conflicting with it.  That patchset has been posted earlier;
it's in #work.openat2.  The next series comes on top of the merge.

part 3: untangling the symlink handling.

	Right now when we decide to follow a symlink it happens this way:
	* step_into() decides that it has been given a symlink that needs to
be followed.
	* it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error.  Symlink's mount/dentry/seq is
stored on stack and the inode is stashed in nd->link_inode.
	* step_into() passes that 1 to its callers, which proceed to pass it
up the call chain for several layers.  In all cases we get to get_link()
call shortly afterwards.
	* get_link() is called, picks the inode stashed in nd->link_inode
by the pick_link(), does some checks, touches the atime, etc.
	* get_link() either picks the link body out of inode or calls
->get_link().  If it's an absolute symlink, we move to the root and return
the relative portion of the body; if it's a relative one - just return the
body.  If it's a procfs-style one, the call of nd_jump_link() has been
made and we'd moved to whatever location is desired.  And return NULL,
same as we do for symlink to "/".
	* the caller proceeds to deal with the string returned to it.

	The sequence is the same in all cases (nested symlink, trailing
symlink on lookup, trailing symlink on open), but its pieces are not close
to each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow.  Moreover, a bunch
of functions (walk_component/lookup_last/do_last) ends up with the same
conventions for return values as step_into().  And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes, especially
when returned 1 is used only to direct control flow on a rather twisted
way to matching get_link() call.  And that path can be seriously twisted.
E.g. when we are trying to open /dev/stdin, we get the following sequence:
	* path_init() has put us into root and returned "/dev/stdin"
	* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
	* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast().  Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing.  Into the stack it goes and we return 1.
	* do_last() sees 1 and returns it.
	* trailing_symlink() is called (in the top-level loop) and it
calls get_link().  OK, we get "/proc/self/fd/0" for body, move to
root again and return "proc/self/fd/0".
	* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
	* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
	* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now).  It's dropped (from step_into()) and we get
to look at what we'd been given.  A symlink to follow, so on the stack
it goes and we return 1.
	* again, do_last() passes 1 to caller
	* trailing_symlink() is called and calls get_link().
	* this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin.  And returns NULL.  But the
fun doesn't stop yet.
	* trailing_symlink() returns "" to the caller
	* link_path_walk() is called on that and does nothing
whatsoever.
	* do_last() is called and sees LAST_BIND left by the get_link().
It calls handle_dots()
	* handle_dots() drops the symlink from stack and returns
	* do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.

	Making sense of the control flow through all of that is not fun,
to put it mildly; debugging anything in that area can be a massive PITA,
and this example has touched only one of 3 cases.  Arguably, the worst
one, but...  Anyway, it turns out that this code can be massaged to
considerably saner shape - both in terms of control flow and wrt calling
conventions.

1/9	merging pick_link() with get_link(), part 1
	prep work: move the "hardening" crap from trailing_symlink() into
get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags).
We'll be moving the calls of get_link() around quite a bit through that
series, and the next step will be to eliminate trailing_symlink().
2/9	merging pick_link() with get_link(), part 2
	fold trailing_symlink() into lookup_last() and do_last().
Now these are returning strings; it's not the final calling conventions,
but it's almost there.  NULL => old 0, we are done.  ERR_PTR(-E...) =>
old -E..., we'd failed.  string => old 1, and the string is the symlink
body to follow.  Just as for trailing_symlink(), "/" and procfs ones
(where get_link() returns NULL) yield "", so the ugly song and dance
with no-op trip through link_path_walk()/handle_dots() still remains.
3/9	merging pick_link() with get_link(), part 3
	elimination of that round-trip.  In *all* cases having
get_link() return NULL on such symlinks means that we'll proceed to
drop the symlink from stack and get back to the point near that
get_link() call - basically, where we would be if it hadn't been
a symlink at all.  The path by which we are getting there depends
upon the call site; the end result is the same in all cases - such
symlinks (procfs ones and symlink to "/") are fully processed by
the time get_link() returns, so we could as well drop them from the
stack right in get_link().  Makes life simpler in terms of control
flow analysis...
	And now the calling conventions for do_last() and lookup_last()
have reached the final shape - ERR_PTR(-E...) for error, NULL for
"we are done", string for "traverse this".
4/9	merging pick_link() with get_link(), part 4
	now all calls of walk_component() are followed by the same
boilerplate - "if it has returned 1, call get_link() and if that
has returned NULL treat that as if walk_component() has returned 0".
Eliminate by folding that into walk_component() itself.  Now
walk_component() return value conventions have joined those of
do_last()/lookup_last().
5/9	merging pick_link() with get_link(), part 5
	same as for the previous, only this time the boilerplate
migrates one level down, into step_into().  Only one caller of
get_link() left, step_into() has joined the same return value
conventions.
6/9	merging pick_link() with get_link(), part 6
	move that thing into pick_link().  Now all traces of
"return 1 if we are following a symlink" are gone.
7/9	finally fold get_link() into pick_link()
	ta-da - expand get_link() into the only caller.  As a side
benefit, we get rid of stashing the inode in nd->link_inode - it
was done only to carry that piece of information from pick_link()
to eventual get_link().  That's not the main benefit, though - the
control flow became considerably easier to reason about.

For what it's worth, the example above (/dev/stdin) becomes
	* path_init() has put us into root and returned "/dev/stdin"
	* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
	* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast().  Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing.  On the stack it goes and we get its body.  Which is
"/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
	* do_last() sees non-NULL and returns it - whether it's an error
or a pathname to traverse, we hadn't reached something we'll be opening.
	* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
	* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
	* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now).  It's dropped (from step_into()) and we get
to look at what we'd been given.  A symlink to follow, so on the stack
it goes.   This time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin.  And returns NULL.  So we
drop symlink from stack and return that NULL to caller.
	* that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
	* do_last() proceeds to open the damn thing.

part 4.  some mount traversal cleanups.

8/9	massage __follow_mount_rcu() a bit
	make it more similar to non-RCU counterpart
9/9	new helper: traverse_mounts()
	the guts of follow_managed() are very similar to
follow_down().  The calling conventions are different (follow_managed()
works with nameidata, follow_down() - with standalone struct path),
but the core loop is pretty much the same in both.  Turned that loop
into a common helper (traverse_mounts()) and since follow_managed()
becomes a very thin wrapper around it, expand follow_managed() at its
only call site (in handle_mounts()),

That's where the series stands right now.  FWIW, at 5.5-rc1 fs/namei.c
had been 4867 lines, at the tip of #work.openat2 - 4998, at the
tip of #work.namei (containing #work.openat2) - 4730...  And IMO
the thing has become considerably easier to follow.

What's more, it might be possible to untangle the control flow in
do_last() now.  Probably a separate series, though - do_last() is
one hell of a tarpit, so I'm not stepping into it for the rest
of this cycle...
Ian Kent Jan. 19, 2020, 2:33 p.m.
On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote:
> 	OK, vfs.git #work.namei seems to survive xfstests.  I think
> it cleans the things quite a bit, but it obviously needs more
> review and testing.
> 
> 	Review and testing would be _very_ welcome; it does a lot
> of massage, so there had been a plenty of opportunities to fuck up
> and fail to spot that.  The same goes for profiling - it doesn't
> seem to slow the things down, but that needs to be verified.

I have run my usual tests (the second run of my submount-test is still
going) and they have run through fine.

I spend what time I can looking through the series tomorrow but will
probably need to complete that when I return from my trip to Albany
(Western Australia) some time on Friday.

> 
> 	It does include #work.openat2.  Topology: 17 commits, followed
> by clean merge with #work.openat2, followed by 9 followups.  The
> part is #work.openat2 is as posted by Aleksa; I can repost it, but
> I don't see much point.  Description of the rest follows; patches
> themselves will be in followups.
> 
> part 1: follow_automount() cleanups and fixes.
> 
> 	Quite a bit of that function had been about working around the
> wrong calling conventions of finish_automount().  The problem is that
> finish_automount() misuses the primitive intended for mount(2) and
> friends, where we want to mount on top of the pile, even if something
> has managed to add to that while we'd been trying to lock the
> namespace.
> For automount that's not the right thing to do - there we want to
> discard
> whatever it was going to attach and just cross into what got mounted
> there in the meanwhile (most likely - the results of the same
> automount
> triggered by somebody else).  Current mainline kinda-sorta manages to
> do
> that, but it's unreliable and very convoluted.  Much simpler approach
> is to stop using lock_mount() in finish_automount() and have it bail
> out if something turns out to have been mounted on top where we
> wanted
> to attach.  That allows to get rid of a lot of PITA in the caller.
> Another simplification comes from not trying to cross into the
> results
> of automount - simply ride through the next iteration of the loop and
> let it move into overmount.
> 
> 	Another thing in the same series is divorcing
> follow_automount()
> from nameidata; that'll play later when we get to unifying
> follow_down()
> with the guts of follow_managed().
> 
> 	4 commits, the second one fixes a hard-to-hit race.  The first
> is a prereq for it.
> 
> 1/17	do_add_mount(): lift lock_mount/unlock_mount into callers
> 2/17	fix automount/automount race properly
> 3/17	follow_automount(): get rid of dead^Wstillborn code
> 4/17	follow_automount() doesn't need the entire nameidata
> 
> part 2: unifying mount traversals in pathwalk.
> 
> 	Handling of mount traversal (follow_managed()) is currently
> called
> in a bunch of places.  Each of them is shortly followed by a call of
> step_into() or an open-coded equivalent thereof.  However, the
> locations
> of those step_into() calls are far from preceding follow_managed();
> moreover, that preceding call might happen on different paths that
> converge to given step_into() call.  It's harder to analyse that it
> should
> be (especially when it comes to liveness analysis) and it forces
> rather
> ugly calling conventions on
> lookup_fast()/atomic_open()/lookup_open().
> The series below massages the code to the point when the calls of
> follow_managed() (and __follow_mount_rcu()) move into the beginning
> of
> step_into().
> 
> 5/17	make build_open_flags() treat O_CREAT | O_EXCL as implying
> O_NOFOLLOW
> 	gets EEXIST handling in do_last() past the step_into() call
> there.
> 6/17	handle_mounts(): start building a sane wrapper for
> follow_managed()
> 	rather than mangling follow_managed() itself (and creating
> conflicts
> 	with openat2 series), add a wrapper that will absorb the
> required
> 	interface changes.
> 7/17	atomic_open(): saner calling conventions (return dentry on
> success)
> 	struct path passed to it is pure out parameter; only dentry
> part
> 	ever varies, though - mnt is always nd->path.mnt.  Just return
> 	the dentry on success, and ERR_PTR(-E...) on failure.
> 8/17	lookup_open(): saner calling conventions (return dentry on
> success)
> 	propagate the same change one level up the call chain.
> 9/17	do_last(): collapse the call of path_to_nameidata()
> 	struct path filled in lookup_open() call is eventually given to
> 	handle_mounts(); the only use it has before that is
> path_to_nameidata()
> 	call in "->atomic_open() has actually opened it" case, and
> there
> 	path_to_nameidata() is an overkill - we are guaranteed to
> replace
> 	only nd->path.dentry.  So have the struct path filled only
> immediately
> 	prior to handle_mounts().
> 10/17	handle_mounts(): pass dentry in, turn path into a pure out
> argument
> 	now all callers of handle_mount() are directly preceded by
> filling
> 	struct path it gets.  path->mnt is nd->path.mnt in all cases,
> so we can
> 	pass just the dentry instead and fill path in handle_mount()
> itself.
> 	Some boilerplate gone, path is pure out argument of
> handle_mount()
> 	now.
> 11/17	lookup_fast(): consolidate the RCU success case
> 	massage to gather what will become an RCU case equivalent of
> 	handle_mounts(); basically, that's what we do if revalidate
> succeeds
> 	in RCU case of lookup_fast(), including unlazy and fallback to
> 	handle_mounts() if __follow_mount_rcu() says "it's too tricky".
> 12/17	teach handle_mounts() to handle RCU mode
> 	... and take that into handle_mount() itself.  The other caller
> of
> 	__follow_mount_rcu() is fine with the same fallback (it just
> didn't
> 	bother since it's in the very beginning of pathwalk), switched
> to
> 	handle_mount() as well.
> 13/17	lookup_fast(): take mount traversal into callers
> 	Now we are getting somewhere - both RCU and non-RCU success
> cases of
> 	lookup_fast() are ended with the same return
> handle_mounts(...);
> 	move that to the callers - there it will merge with the
> identical calls
> 	that had been on the paths where we had to do slow lookups.
> 	lookup_fast() returns dentry now.
> 14/17	new step_into() flag: WALK_NOFOLLOW
> 	use step_into() instead of open-coding it in
> handle_lookup_down().
> 	Add a flag for "don't follow symlinks regardless of
> LOOKUP_FOLLOW" for
> 	that (and eventually, I hope, for .. handling).
> 	Now *all* calls of handle_mounts() and step_into() are right
> next to
> 	each other.
> 15/17	fold handle_mounts() into step_into()
> 	... and we can move the call of handle_mounts() into
> step_into(),
> 	getting a slightly saner calling conventions out of that.
> 16/17	LOOKUP_MOUNTPOINT: fold path_mountpointat() into
> path_lookupat()
> 	another payoff from 14/17 - we can teach path_lookupat() to do
> 	what path_mountpointat() used to.  And kill the latter, along
> with
> 	its wrappers.
> 17/17	expand the only remaining call of path_lookup_conditional()
> 	minor cleanup - RIP path_lookup_conditional().  Only one caller
> left.
> 
> At that point we run out of things that can be done without textual
> conflicts
> with openat2 series.  Changes so far:
> 	* mount traversal is taken into step_into().
> 	* lookup_fast(), atomic_open() and lookup_open() calling
> conventions
> are slightly changed.  All of them return dentry now, instead of
> returning
> an int and filling struct path on success.  For lookup_fast() the old
> "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for
> cache
> miss, dentry - for hit".
> 	* step_into() can be called in RCU mode as well.  Takes
> nameidata,
> WALK_... flags, dentry and, in RCU case, corresponding inode and seq
> value.
> Handles mount traversals, decides whether it's a symlink to be
> followed.
> Error => returns -E...; symlink to follow => returns 1, puts symlink
> on stack;
> non-symlink or symlink not to follow => returns 0, moves nd->path to
> new location.
> 	* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and
> friends
> became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in
> flags.
> 
> Next comes the merge with Aleksa's openat2 patchset; everything up to
> that point
> had been non-conflicting with it.  That patchset has been posted
> earlier;
> it's in #work.openat2.  The next series comes on top of the merge.
> 
> part 3: untangling the symlink handling.
> 
> 	Right now when we decide to follow a symlink it happens this
> way:
> 	* step_into() decides that it has been given a symlink that
> needs to
> be followed.
> 	* it calls pick_link(), which pushes the symlink on stack and
> returns 1 on success / -E... on error.  Symlink's mount/dentry/seq is
> stored on stack and the inode is stashed in nd->link_inode.
> 	* step_into() passes that 1 to its callers, which proceed to
> pass it
> up the call chain for several layers.  In all cases we get to
> get_link()
> call shortly afterwards.
> 	* get_link() is called, picks the inode stashed in nd-
> >link_inode
> by the pick_link(), does some checks, touches the atime, etc.
> 	* get_link() either picks the link body out of inode or calls
> ->get_link().  If it's an absolute symlink, we move to the root and
> return
> the relative portion of the body; if it's a relative one - just
> return the
> body.  If it's a procfs-style one, the call of nd_jump_link() has
> been
> made and we'd moved to whatever location is desired.  And return
> NULL,
> same as we do for symlink to "/".
> 	* the caller proceeds to deal with the string returned to it.
> 
> 	The sequence is the same in all cases (nested symlink, trailing
> symlink on lookup, trailing symlink on open), but its pieces are not
> close
> to each other and the bit between the call of pick_link() and
> (inevitable)
> call of get_link() afterwards is not easy to follow.  Moreover, a
> bunch
> of functions (walk_component/lookup_last/do_last) ends up with the
> same
> conventions for return values as step_into().  And those conventions
> (see above) are not pretty - 0/1/-E... is asking for mistakes,
> especially
> when returned 1 is used only to direct control flow on a rather
> twisted
> way to matching get_link() call.  And that path can be seriously
> twisted.
> E.g. when we are trying to open /dev/stdin, we get the following
> sequence:
> 	* path_init() has put us into root and returned "/dev/stdin"
> 	* link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> 	* we call do_last(), which sees that we have LAST_NORM and
> calls
> lookup_fast().  Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> 	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
> the
> damn thing.  Into the stack it goes and we return 1.
> 	* do_last() sees 1 and returns it.
> 	* trailing_symlink() is called (in the top-level loop) and it
> calls get_link().  OK, we get "/proc/self/fd/0" for body, move to
> root again and return "proc/self/fd/0".
> 	* link_path_walk() is given that string, eventually leading us
> into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> 	* do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> 	* _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now).  It's dropped (from step_into()) and we
> get
> to look at what we'd been given.  A symlink to follow, so on the
> stack
> it goes and we return 1.
> 	* again, do_last() passes 1 to caller
> 	* trailing_symlink() is called and calls get_link().
> 	* this time it's a procfs symlink and its ->get_link() method
> moves us to the mount/dentry of our stdin.  And returns NULL.  But
> the
> fun doesn't stop yet.
> 	* trailing_symlink() returns "" to the caller
> 	* link_path_walk() is called on that and does nothing
> whatsoever.
> 	* do_last() is called and sees LAST_BIND left by the
> get_link().
> It calls handle_dots()
> 	* handle_dots() drops the symlink from stack and returns
> 	* do_last() *FINALLY* proceeds to the point after its call of
> step_into() (finish_open:) and gets around to opening the damn thing.
> 
> 	Making sense of the control flow through all of that is not
> fun,
> to put it mildly; debugging anything in that area can be a massive
> PITA,
> and this example has touched only one of 3 cases.  Arguably, the
> worst
> one, but...  Anyway, it turns out that this code can be massaged to
> considerably saner shape - both in terms of control flow and wrt
> calling
> conventions.
> 
> 1/9	merging pick_link() with get_link(), part 1
> 	prep work: move the "hardening" crap from trailing_symlink()
> into
> get_link() (conditional on the absense of LOOKUP_PARENT in nd-
> >flags).
> We'll be moving the calls of get_link() around quite a bit through
> that
> series, and the next step will be to eliminate trailing_symlink().
> 2/9	merging pick_link() with get_link(), part 2
> 	fold trailing_symlink() into lookup_last() and do_last().
> Now these are returning strings; it's not the final calling
> conventions,
> but it's almost there.  NULL => old 0, we are done.  ERR_PTR(-E...)
> =>
> old -E..., we'd failed.  string => old 1, and the string is the
> symlink
> body to follow.  Just as for trailing_symlink(), "/" and procfs ones
> (where get_link() returns NULL) yield "", so the ugly song and dance
> with no-op trip through link_path_walk()/handle_dots() still remains.
> 3/9	merging pick_link() with get_link(), part 3
> 	elimination of that round-trip.  In *all* cases having
> get_link() return NULL on such symlinks means that we'll proceed to
> drop the symlink from stack and get back to the point near that
> get_link() call - basically, where we would be if it hadn't been
> a symlink at all.  The path by which we are getting there depends
> upon the call site; the end result is the same in all cases - such
> symlinks (procfs ones and symlink to "/") are fully processed by
> the time get_link() returns, so we could as well drop them from the
> stack right in get_link().  Makes life simpler in terms of control
> flow analysis...
> 	And now the calling conventions for do_last() and lookup_last()
> have reached the final shape - ERR_PTR(-E...) for error, NULL for
> "we are done", string for "traverse this".
> 4/9	merging pick_link() with get_link(), part 4
> 	now all calls of walk_component() are followed by the same
> boilerplate - "if it has returned 1, call get_link() and if that
> has returned NULL treat that as if walk_component() has returned 0".
> Eliminate by folding that into walk_component() itself.  Now
> walk_component() return value conventions have joined those of
> do_last()/lookup_last().
> 5/9	merging pick_link() with get_link(), part 5
> 	same as for the previous, only this time the boilerplate
> migrates one level down, into step_into().  Only one caller of
> get_link() left, step_into() has joined the same return value
> conventions.
> 6/9	merging pick_link() with get_link(), part 6
> 	move that thing into pick_link().  Now all traces of
> "return 1 if we are following a symlink" are gone.
> 7/9	finally fold get_link() into pick_link()
> 	ta-da - expand get_link() into the only caller.  As a side
> benefit, we get rid of stashing the inode in nd->link_inode - it
> was done only to carry that piece of information from pick_link()
> to eventual get_link().  That's not the main benefit, though - the
> control flow became considerably easier to reason about.
> 
> For what it's worth, the example above (/dev/stdin) becomes
> 	* path_init() has put us into root and returned "/dev/stdin"
> 	* link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> 	* we call do_last(), which sees that we have LAST_NORM and
> calls
> lookup_fast().  Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> 	* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
> the
> damn thing.  On the stack it goes and we get its body.  Which is
> "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
> 	* do_last() sees non-NULL and returns it - whether it's an
> error
> or a pathname to traverse, we hadn't reached something we'll be
> opening.
> 	* link_path_walk() is given that string, eventually leading us
> into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> 	* do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> 	* _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now).  It's dropped (from step_into()) and we
> get
> to look at what we'd been given.  A symlink to follow, so on the
> stack
> it goes.   This time it's a procfs symlink and its ->get_link()
> method
> moves us to the mount/dentry of our stdin.  And returns NULL.  So we
> drop symlink from stack and return that NULL to caller.
> 	* that NULL is returned by step_into(), same as if we had just
> moved to a non-symlink.
> 	* do_last() proceeds to open the damn thing.
> 
> part 4.  some mount traversal cleanups.
> 
> 8/9	massage __follow_mount_rcu() a bit
> 	make it more similar to non-RCU counterpart
> 9/9	new helper: traverse_mounts()
> 	the guts of follow_managed() are very similar to
> follow_down().  The calling conventions are different
> (follow_managed()
> works with nameidata, follow_down() - with standalone struct path),
> but the core loop is pretty much the same in both.  Turned that loop
> into a common helper (traverse_mounts()) and since follow_managed()
> becomes a very thin wrapper around it, expand follow_managed() at its
> only call site (in handle_mounts()),
> 
> That's where the series stands right now.  FWIW, at 5.5-rc1
> fs/namei.c
> had been 4867 lines, at the tip of #work.openat2 - 4998, at the
> tip of #work.namei (containing #work.openat2) - 4730...  And IMO
> the thing has become considerably easier to follow.
> 
> What's more, it might be possible to untangle the control flow in
> do_last() now.  Probably a separate series, though - do_last() is
> one hell of a tarpit, so I'm not stepping into it for the rest
> of this cycle...
>