[09/17] file: Implement fnext_task

Submitted by Eric W. Biederman on Aug. 17, 2020, 10:04 p.m.

Details

Message ID 20200817220425.9389-9-ebiederm@xmission.com
State New
Series "Series without cover letter"
Headers show

Commit Message

Eric W. Biederman Aug. 17, 2020, 10:04 p.m.
As a companion to fget_task and fcheck_task implement fnext_task that
will return the struct file for the first file descriptor show number
is equal or greater than the fd argument value, or NULL if there is
no such struct file.

This allows file descriptors of foreign processes to be iterated through
safely, without needed to increment the count on files_struct.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 21 +++++++++++++++++++++
 include/linux/fdtable.h |  1 +
 2 files changed, 22 insertions(+)

Patch hide | download patch | download mbox

diff --git a/fs/file.c b/fs/file.c
index 8d4b385055e9..88f9f78869f8 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -876,6 +876,27 @@  struct file *fcheck_task(struct task_struct *task, unsigned int fd)
 	return file;
 }
 
+struct file *fnext_task(struct task_struct *task, unsigned int *ret_fd)
+{
+	/* Must be called with rcu_read_lock held */
+	struct files_struct *files;
+	unsigned int fd = *ret_fd;
+	struct file *file = NULL;
+
+	task_lock(task);
+	files = task->files;
+	if (files) {
+		for (; fd < files_fdtable(files)->max_fds; fd++) {
+			file = fcheck_files(files, fd);
+			if (file)
+				break;
+		}
+	}
+	task_unlock(task);
+	*ret_fd = fd;
+	return file;
+}
+
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
  *
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index def9debd2ce2..a3a054084f49 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -104,6 +104,7 @@  static inline struct file *fcheck_files(struct files_struct *files, unsigned int
  */
 #define fcheck(fd)	fcheck_files(current->files, fd)
 struct file *fcheck_task(struct task_struct *task, unsigned int fd);
+struct file *fnext_task(struct task_struct *task, unsigned int *fd);
 
 struct task_struct;
 

Comments

Linus Torvalds Aug. 17, 2020, 11:54 p.m.
I like the series, but I have to say that the extension of the
horrible "fcheck*()" interfaces raises my hackles..

That interface is very very questionable to begin with, and it's easy
to get wrong.

I don't see you getting it wrong - you do seem to hold the RCU read
lock in the places I checked, but it worries me.

I think my worry could be at least partially mitigated with more
comments and docs:

On Mon, Aug 17, 2020 at 3:10 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> +struct file *fnext_task(struct task_struct *task, unsigned int *ret_fd)

Please please *please* make it clear that because this does *not*
increment any reference counts, you have to be very very careful using
the "struct file" pointer this returns.

I really dislike the naming. The old "fcheck()" naming came from the
fact that at least one original user just mainly checked if the result
was NULL or not. And then others had to be careful to either hold the
file_lock spinlock, or at least the RCU read lock so that the result
didn't go away.

Here, you have "fnext_task()", and it doesn't even have that "check"
warning mark, or any comment about how dangerous it can be to use..

So the rule is that the "struct file" pointer this returns can only be
read under RCU, but the 'fd' it returns has a longer lifetime.

I think your uses are ok, and I think this is an improvement in
general, I just want a bigger "WARNING! Here be dragons!" sign (and
naming could be nicer).

            Linus
Eric W. Biederman Aug. 18, 2020, 1:02 a.m.
Linus Torvalds <torvalds@linux-foundation.org> writes:

> I like the series, but I have to say that the extension of the
> horrible "fcheck*()" interfaces raises my hackles..
>
> That interface is very very questionable to begin with, and it's easy
> to get wrong.
>
> I don't see you getting it wrong - you do seem to hold the RCU read
> lock in the places I checked, but it worries me.
>
> I think my worry could be at least partially mitigated with more
> comments and docs:
>
> On Mon, Aug 17, 2020 at 3:10 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> +struct file *fnext_task(struct task_struct *task, unsigned int *ret_fd)
>
> Please please *please* make it clear that because this does *not*
> increment any reference counts, you have to be very very careful using
> the "struct file" pointer this returns.
>
> I really dislike the naming. The old "fcheck()" naming came from the
> fact that at least one original user just mainly checked if the result
> was NULL or not. And then others had to be careful to either hold the
> file_lock spinlock, or at least the RCU read lock so that the result
> didn't go away.
>
> Here, you have "fnext_task()", and it doesn't even have that "check"
> warning mark, or any comment about how dangerous it can be to use..
>
> So the rule is that the "struct file" pointer this returns can only be
> read under RCU, but the 'fd' it returns has a longer lifetime.
>
> I think your uses are ok, and I think this is an improvement in
> general, I just want a bigger "WARNING! Here be dragons!" sign (and
> naming could be nicer).

Fair.  The naming is my least favorite of these.

I struggle with the fcheck name as I have not seen or at least not
registed on the the user that just checks to see if the result is NULL.
So the name fcheck never made a bit of sense to me.

I will see if I can come up with some good descriptive comments around
these functions.  Along with describing what these things are doing I am
thinking maybe I should put "_rcu" in their names and have a debug check
that verifies "_rcu" is held.

I did the rcu must be held variants because that matches almost exactly
what the existing code is doing, and the smaller the change the fewer
the surprises and bugs.

Eric
Linus Torvalds Aug. 18, 2020, 1:17 a.m.
On Mon, Aug 17, 2020 at 6:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I struggle with the fcheck name as I have not seen or at least not
> registed on the the user that just checks to see if the result is NULL.
> So the name fcheck never made a bit of sense to me.

Yeah, that name is not great. I just don't want to make things even worse.

> I will see if I can come up with some good descriptive comments around
> these functions.  Along with describing what these things are doing I am
> thinking maybe I should put "_rcu" in their names and have a debug check
> that verifies "_rcu" is held.

Yeah, something along the lines of "rcu_lookup_fd_task(tsk,fd)" would
be a *lot* more descriptive than fcheck_task().

And I think "fnext_task()" could be "rcu_lookup_next_fd_task(tsk,fd)".

Yes, those are much longer names, but it's not like you end up typing
them all that often, and I think being descriptive would be worth it.

And "fcheck()" and "fcheck_files()" would be good to rename too along
the same lines.

Something like "rcu_lookup_fd()" and "rcu_lookup_fd_files()" respectively?

I'm obviously trying to go for a "rcu_lookup_fd*()" kind of pattern,
and I'm not married to _that_ particular pattern but I think it would
be better than what we have now.

                       Linus
Christian Brauner Aug. 18, 2020, 11:05 a.m.
On Mon, Aug 17, 2020 at 06:17:35PM -0700, Linus Torvalds wrote:
> On Mon, Aug 17, 2020 at 6:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > I struggle with the fcheck name as I have not seen or at least not
> > registed on the the user that just checks to see if the result is NULL.
> > So the name fcheck never made a bit of sense to me.
> 
> Yeah, that name is not great. I just don't want to make things even worse.
> 
> > I will see if I can come up with some good descriptive comments around
> > these functions.  Along with describing what these things are doing I am
> > thinking maybe I should put "_rcu" in their names and have a debug check
> > that verifies "_rcu" is held.
> 
> Yeah, something along the lines of "rcu_lookup_fd_task(tsk,fd)" would
> be a *lot* more descriptive than fcheck_task().
> 
> And I think "fnext_task()" could be "rcu_lookup_next_fd_task(tsk,fd)".
> 
> Yes, those are much longer names, but it's not like you end up typing
> them all that often, and I think being descriptive would be worth it.
> 
> And "fcheck()" and "fcheck_files()" would be good to rename too along
> the same lines.
> 
> Something like "rcu_lookup_fd()" and "rcu_lookup_fd_files()" respectively?
> 
> I'm obviously trying to go for a "rcu_lookup_fd*()" kind of pattern,
> and I'm not married to _that_ particular pattern but I think it would
> be better than what we have now.

In fs/inode.c and a few other places we have the *_rcu suffix pattern
already so maybe:

fcheck() -> fd_file_rcu() or lookup_fd_rcu()
fcheck_files() -> fd_files_rcu() or lookup_fd_files_rcu()
fnext_task() -> fd_file_from_task_rcu() or lookup_next_fd_from_task_rcu()

rather than as prefix or sm.

Christian
Eric W. Biederman Aug. 19, 2020, 1:22 p.m.
Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Mon, Aug 17, 2020 at 06:17:35PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 17, 2020 at 6:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >
>> > I struggle with the fcheck name as I have not seen or at least not
>> > registed on the the user that just checks to see if the result is NULL.
>> > So the name fcheck never made a bit of sense to me.
>> 
>> Yeah, that name is not great. I just don't want to make things even worse.
>> 
>> > I will see if I can come up with some good descriptive comments around
>> > these functions.  Along with describing what these things are doing I am
>> > thinking maybe I should put "_rcu" in their names and have a debug check
>> > that verifies "_rcu" is held.
>> 
>> Yeah, something along the lines of "rcu_lookup_fd_task(tsk,fd)" would
>> be a *lot* more descriptive than fcheck_task().
>> 
>> And I think "fnext_task()" could be "rcu_lookup_next_fd_task(tsk,fd)".
>> 
>> Yes, those are much longer names, but it's not like you end up typing
>> them all that often, and I think being descriptive would be worth it.
>> 
>> And "fcheck()" and "fcheck_files()" would be good to rename too along
>> the same lines.
>> 
>> Something like "rcu_lookup_fd()" and "rcu_lookup_fd_files()" respectively?
>> 
>> I'm obviously trying to go for a "rcu_lookup_fd*()" kind of pattern,
>> and I'm not married to _that_ particular pattern but I think it would
>> be better than what we have now.
>
> In fs/inode.c and a few other places we have the *_rcu suffix pattern
> already so maybe:
>
> fcheck() -> fd_file_rcu() or lookup_fd_rcu()
> fcheck_files() -> fd_files_rcu() or lookup_fd_files_rcu()
> fnext_task() -> fd_file_from_task_rcu() or lookup_next_fd_from_task_rcu()
>

So I sat down and played with it and here is what I wound up with is:

__fcheck_files -> files_lookup_fd_raw
fcheck_files   -> files_lookup_fd_locked
fcheck_files   -> files_lookup_fd_rcu
fcheck         -> lookup_fd_rcu
...
fcheck_task    -> task_lookup_fd_fcu
fnext_task     -> task_lookup_next_fd_rcu


That seems to match existing patterns of names, makes it relatively
clear what is going on, and has enough expressivity for for the entire
group of functions.  Plus the core name is short enough to remember
easily.

The tricky bit was realizing I needed to split fcheck_files.  As it
currently can be used in both locked and rcu scenarios.

Doing this I found one bug in my code and one bug in the existing
bpf code so.  So being extra aware of the rcu vs locked nature
of the locking makes a difference at least for me.

The bug in the existing code is that bpf_iter does get_file instead
of get_file_rcu.  Does anyone have any sense of how to add debugging
to get_file to notice when it is being called in the wrong context?

Eric
Alexei Starovoitov Aug. 19, 2020, 3:54 p.m.
On Wed, Aug 19, 2020 at 6:25 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The bug in the existing code is that bpf_iter does get_file instead
> of get_file_rcu.  Does anyone have any sense of how to add debugging
> to get_file to notice when it is being called in the wrong context?

That bug is already fixed in bpf tree.
See commit cf28f3bbfca0 ("bpf: Use get_file_rcu() instead of
get_file() for task_file iterator")
Linus Torvalds Aug. 19, 2020, 6:32 p.m.
On Wed, Aug 19, 2020 at 6:25 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> So I sat down and played with it and here is what I wound up with is:
>
> __fcheck_files -> files_lookup_fd_raw
> fcheck_files   -> files_lookup_fd_locked
> fcheck_files   -> files_lookup_fd_rcu
> fcheck         -> lookup_fd_rcu
> ...
> fcheck_task    -> task_lookup_fd_fcu
> fnext_task     -> task_lookup_next_fd_rcu

This certainly looks fine to me. No confusion about what it does. So Ack.

                   Linus
Cyrill Gorcunov Aug. 20, 2020, 9:50 p.m.
On Mon, Aug 17, 2020 at 05:04:17PM -0500, Eric W. Biederman wrote:
> As a companion to fget_task and fcheck_task implement fnext_task that
> will return the struct file for the first file descriptor show number
> is equal or greater than the fd argument value, or NULL if there is
> no such struct file.
> 
> This allows file descriptors of foreign processes to be iterated through
> safely, without needed to increment the count on files_struct.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/file.c               | 21 +++++++++++++++++++++
>  include/linux/fdtable.h |  1 +
>  2 files changed, 22 insertions(+)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 8d4b385055e9..88f9f78869f8 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -876,6 +876,27 @@ struct file *fcheck_task(struct task_struct *task, unsigned int fd)
>  	return file;
>  }
>  
> +struct file *fnext_task(struct task_struct *task, unsigned int *ret_fd)
> +{
> +	/* Must be called with rcu_read_lock held */
> +	struct files_struct *files;
> +	unsigned int fd = *ret_fd;
> +	struct file *file = NULL;
> +
> +	task_lock(task);
> +	files = task->files;
> +	if (files) {
> +		for (; fd < files_fdtable(files)->max_fds; fd++) {
> +			file = fcheck_files(files, fd);
> +			if (file)
> +				break;
> +		}
> +	}
> +	task_unlock(task);
> +	*ret_fd = fd;
> +	return file;
> +}

Eric, if only I'm not missing something obvious you could
escape @fd/@ret_fd operations in case if task->files = NULL,
iow

	if (files) {
		unsigned int fd = *ret_fd;
		for (; fd < files_fdtable(files)->max_fds; fd++) {
			file = fcheck_files(files, fd);
			if (file)
				break;
		}
		*ret_fd = fd;
	}
Eric W. Biederman Aug. 21, 2020, 3:22 p.m.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Aug 19, 2020 at 6:25 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The bug in the existing code is that bpf_iter does get_file instead
>> of get_file_rcu.  Does anyone have any sense of how to add debugging
>> to get_file to notice when it is being called in the wrong context?
>
> That bug is already fixed in bpf tree.
> See commit cf28f3bbfca0 ("bpf: Use get_file_rcu() instead of
> get_file() for task_file iterator")

I wished you had based that change on -rc1 instead of some random
looking place in David's Millers net tree.

I am glad to see that our existing debug checks can catch that
kind of problem when the code is exercised enough.

I am going to pull this change into my tree on top of -rc1 so we won't
have unnecessary conflicts.  Hopefully this will show up in -rc2 so the
final version of this patchset can use an easily describable base.

Eric
Eric W. Biederman Aug. 21, 2020, 4:14 p.m.
Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Mon, Aug 17, 2020 at 05:04:17PM -0500, Eric W. Biederman wrote:
>> As a companion to fget_task and fcheck_task implement fnext_task that
>> will return the struct file for the first file descriptor show number
>> is equal or greater than the fd argument value, or NULL if there is
>> no such struct file.
>> 
>> This allows file descriptors of foreign processes to be iterated through
>> safely, without needed to increment the count on files_struct.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/file.c               | 21 +++++++++++++++++++++
>>  include/linux/fdtable.h |  1 +
>>  2 files changed, 22 insertions(+)
>> 
>> diff --git a/fs/file.c b/fs/file.c
>> index 8d4b385055e9..88f9f78869f8 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -876,6 +876,27 @@ struct file *fcheck_task(struct task_struct *task, unsigned int fd)
>>  	return file;
>>  }
>>  
>> +struct file *fnext_task(struct task_struct *task, unsigned int *ret_fd)
>> +{
>> +	/* Must be called with rcu_read_lock held */
>> +	struct files_struct *files;
>> +	unsigned int fd = *ret_fd;
>> +	struct file *file = NULL;
>> +
>> +	task_lock(task);
>> +	files = task->files;
>> +	if (files) {
>> +		for (; fd < files_fdtable(files)->max_fds; fd++) {
>> +			file = fcheck_files(files, fd);
>> +			if (file)
>> +				break;
>> +		}
>> +	}
>> +	task_unlock(task);
>> +	*ret_fd = fd;
>> +	return file;
>> +}
>
> Eric, if only I'm not missing something obvious you could
> escape @fd/@ret_fd operations in case if task->files = NULL,
> iow
>
> 	if (files) {
> 		unsigned int fd = *ret_fd;
> 		for (; fd < files_fdtable(files)->max_fds; fd++) {
> 			file = fcheck_files(files, fd);
> 			if (file)
> 				break;
> 		}
> 		*ret_fd = fd;
> 	}

You aren't missing anything.  I just don't see what would be gained
by skipping those steps in an uncommon case.

As it stands it is easy to verify that *ret_fd is always read
and always set, and that task_lock is not needed to read or
write to ret_fd.

Eric
Alexei Starovoitov Aug. 21, 2020, 4:17 p.m.
On Fri, Aug 21, 2020 at 8:26 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Wed, Aug 19, 2020 at 6:25 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>
> >> The bug in the existing code is that bpf_iter does get_file instead
> >> of get_file_rcu.  Does anyone have any sense of how to add debugging
> >> to get_file to notice when it is being called in the wrong context?
> >
> > That bug is already fixed in bpf tree.
> > See commit cf28f3bbfca0 ("bpf: Use get_file_rcu() instead of
> > get_file() for task_file iterator")
>
> I wished you had based that change on -rc1 instead of some random
> looking place in David's Millers net tree.

random?
It's a well documented process. Please see:
Documentation/bpf/bpf_devel_QA.rst

> I am glad to see that our existing debug checks can catch that
> kind of problem when the code is exercised enough.

They did not. Please see the commit log of the fix.
It was a NULL pointer dereference.

> I am going to pull this change into my tree on top of -rc1 so we won't
> have unnecessary conflicts.  Hopefully this will show up in -rc2 so the
> final version of this patchset can use an easily describable base.

Please do not cherry pick fixes from other trees. You need to wait
until the bpf tree gets merged into net tree and net into Linus's tree.
It's only a couple days away. Hopefully it's there by -rc2,
but I cannot speak for Dave's schedule.
We'll send bpf tree pull-req to Dave today.