[PATCHv3] locks: Filter /proc/locks output on proc pid ns

Submitted by Nikolay Borisov on Aug. 3, 2016, 2:54 p.m.

Details

Message ID 1470236078-2389-1-git-send-email-kernel@kyup.com
State New
Series "locks: Show only file_locks created in the same pidns as current process"

Commit Message

Nikolay Borisov Aug. 3, 2016, 2:54 p.m.
On busy container servers, reading /proc/locks shows all the locks
created by all clients. This can cause large latency spikes. In my
case I observed lsof taking up to 5-10 seconds while processing around
50k locks. Fix this by limiting the locks shown to only those created
in the same pidns as the one the proc fs was mounted in. When reading
/proc/locks from the init_pid_ns proc instance, perform no filtering.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/locks.c | 5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/locks.c b/fs/locks.c
index ee1b15f6fc13..65e75810a836 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2648,9 +2648,14 @@  static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
+	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
+	if ((proc_pidns != &init_pid_ns) && fl->fl_nspid
+	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
+		return 0;
+
 	lock_get_status(f, fl, iter->li_pos, "");
 
 	list_for_each_entry(bfl, &fl->fl_block, fl_block)

Comments

Jeff Layton Aug. 3, 2016, 3:24 p.m.
On Wed, 2016-08-03 at 17:54 +0300, Nikolay Borisov wrote:
> On busy container servers, reading /proc/locks shows all the locks
> created by all clients. This can cause large latency spikes. In my
> case I observed lsof taking up to 5-10 seconds while processing around
> 50k locks. Fix this by limiting the locks shown to only those created
> in the same pidns as the one the proc fs was mounted in. When reading
> /proc/locks from the init_pid_ns proc instance, perform no filtering.
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index ee1b15f6fc13..65e75810a836 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2648,9 +2648,14 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	if ((proc_pidns != &init_pid_ns) && fl->fl_nspid
> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> +		return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)


Yeah, that makes much more sense to me. I'll plan to merge this for
v4.9 unless there are objections between now and the next merge window.

Thanks,
Eric W. Biederman Aug. 3, 2016, 4:23 p.m.
Nikolay Borisov <kernel@kyup.com> writes:

> On busy container servers, reading /proc/locks shows all the locks
> created by all clients. This can cause large latency spikes. In my
> case I observed lsof taking up to 5-10 seconds while processing around
> 50k locks. Fix this by limiting the locks shown to only those created
> in the same pidns as the one the proc fs was mounted in. When reading
> /proc/locks from the init_pid_ns proc instance, perform no filtering.

If we are going to do this, this should be a recursive belonging test
(because pid namespaces are recursive).

Right now the test looks like it will filter out child pid namespaces.

Special casing the init_pid_ns should be an optimization, not something
that is necessary for correctness (as it appears here).

Eric


> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index ee1b15f6fc13..65e75810a836 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2648,9 +2648,14 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	if ((proc_pidns != &init_pid_ns) && fl->fl_nspid
> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> +		return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
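
Eric's "recursive belonging test" can be made concrete: pid namespaces form a
tree, and a lock created in a child pid namespace should remain visible from a
proc instance mounted in an ancestor namespace.  A minimal sketch of such a
check (a hypothetical helper for illustration only, not part of the posted
patch; as discussed below, pid_nr_ns() answers the same question in O(1)):

	/*
	 * Sketch only: is proc_pidns an ancestor of (or equal to) the
	 * namespace the lock's pid was created in?
	 */
	static bool pidns_can_see(struct pid_namespace *proc_pidns, struct pid *pid)
	{
		struct pid_namespace *ns;

		for (ns = ns_of_pid(pid); ns; ns = ns->parent)
			if (ns == proc_pidns)
				return true;
		return false;
	}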
Jeff Layton Aug. 3, 2016, 4:50 p.m.
On Wed, 2016-08-03 at 11:23 -0500, Eric W. Biederman wrote:
> Nikolay Borisov <kernel@kyup.com> writes:
> 
> > 
> > On busy container servers, reading /proc/locks shows all the locks
> > created by all clients. This can cause large latency spikes. In my
> > case I observed lsof taking up to 5-10 seconds while processing around
> > 50k locks. Fix this by limiting the locks shown to only those created
> > in the same pidns as the one the proc fs was mounted in. When reading
> > /proc/locks from the init_pid_ns proc instance, perform no filtering.
> 
> If we are going to do this, this should be a recursive belonging test
> (because pid namespaces are recursive).
> 
> Right now the test looks like it will filter out child pid namespaces.
> 
> Special casing the init_pid_ns should be an optimization, not something
> that is necessary for correctness (as it appears here).
> 
> Eric
> 
> 

Ok, thanks. I'm still not that namespace savvy -- so there's a
hierarchy of pid_namespaces?

If so, then yeah, that does sound better. Is there an interface that allows
you to tell whether a pid is a descendant of a particular
pid_namespace?

> > 
> > Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> > ---
> >  fs/locks.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/fs/locks.c b/fs/locks.c
> > index ee1b15f6fc13..65e75810a836 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -2648,9 +2648,14 @@ static int locks_show(struct seq_file *f, void *v)
> >  {
> >  	struct locks_iterator *iter = f->private;
> >  	struct file_lock *fl, *bfl;
> > +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> >  
> >  	fl = hlist_entry(v, struct file_lock, fl_link);
> >  
> > +	if ((proc_pidns != &init_pid_ns) && fl->fl_nspid
> > +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> > +		return 0;
> > +
> >  	lock_get_status(f, fl, iter->li_pos, "");
> >  
> >  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
Eric W. Biederman Aug. 3, 2016, 5:40 p.m.
Nikolay Borisov <kernel@kyup.com> writes:

> On busy container servers, reading /proc/locks shows all the locks
> created by all clients. This can cause large latency spikes. In my
> case I observed lsof taking up to 5-10 seconds while processing around
> 50k locks. Fix this by limiting the locks shown to only those created
> in the same pidns as the one the proc fs was mounted in. When reading
> /proc/locks from the init_pid_ns proc instance, perform no filtering.
>
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index ee1b15f6fc13..65e75810a836 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2648,9 +2648,14 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	if ((proc_pidns != &init_pid_ns) && fl->fl_nspid
> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> +		return 0;
> +

With no loss of generality you can simplify this check to:

	if (fl->fl_nspid && !pid_nr_ns(fl->fl_nspid, proc_pidns))
		return 0;

>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)


Of course now I am staring at the crazy use of pid_vnr in
lock_get_status().  That should probably use pid_nr_ns() with proc_pidns as well.


Eric
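
Folding that simplification into the posted hunk, locks_show() would look
roughly like this (a sketch of the suggested direction, not the patch as
submitted; the separate pid_vnr() question in lock_get_status() is left
untouched here):

	static int locks_show(struct seq_file *f, void *v)
	{
		struct locks_iterator *iter = f->private;
		struct file_lock *fl, *bfl;
		struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;

		fl = hlist_entry(v, struct file_lock, fl_link);

		/* pid_nr_ns() returns 0 when fl_nspid has no mapping in proc_pidns,
		 * which also covers the init_pid_ns mount without a special case. */
		if (fl->fl_nspid && !pid_nr_ns(fl->fl_nspid, proc_pidns))
			return 0;

		lock_get_status(f, fl, iter->li_pos, "");

		list_for_each_entry(bfl, &fl->fl_block, fl_block)
			lock_get_status(f, bfl, iter->li_pos, " ->");

		return 0;
	}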
Eric W. Biederman Aug. 3, 2016, 9:09 p.m.
Jeff Layton <jlayton@poochiereds.net> writes:

> On Wed, 2016-08-03 at 11:23 -0500, Eric W. Biederman wrote:
>> Nikolay Borisov <kernel@kyup.com> writes:
>> 
>> > 
>> > On busy container servers, reading /proc/locks shows all the locks
>> > created by all clients. This can cause large latency spikes. In my
>> > case I observed lsof taking up to 5-10 seconds while processing around
>> > 50k locks. Fix this by limiting the locks shown to only those created
>> > in the same pidns as the one the proc fs was mounted in. When reading
>> > /proc/locks from the init_pid_ns proc instance, perform no filtering.
>> 
>> If we are going to do this, this should be a recursive belonging test
>> (because pid namespaces are recursive).
>> 
>> Right now the test looks like it will filter out child pid namespaces.
>> 
>> Special casing the init_pid_ns should be an optimization, not something
>> that is necessary for correctness (as it appears here).
>> 
>> Eric
>> 
>> 
>
> Ok, thanks. I'm still not that namespace savvy -- so there's a
> hierarchy of pid_namespaces?

There is.

> If so, then yeah, that does sound better. Is there an interface that allows
> you to tell whether a pid is a descendant of a particular
> pid_namespace?

Yes.  And each struct pid has an array of the pid namespaces it is in, so it
is an O(1) operation to see whether that struct pid is in a given pid namespace.
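
The O(1) check works because each struct pid carries one struct upid per
namespace level it is visible in; pid_nr_ns() boils down to the following
(lightly simplified from kernel/pid.c of that era):

	pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
	{
		struct upid *upid;
		pid_t nr = 0;

		if (pid && ns->level <= pid->level) {
			upid = &pid->numbers[ns->level];
			if (upid->ns == ns)
				nr = upid->nr;
		}
		return nr;	/* 0 means the pid has no mapping in ns */
	}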

Dumb question: does anyone know the difference between fl_nspid and
fl_pid off the top of your heads?  I am looking at the code and I am
confused about why we have both.  I am afraid that there was some
sloppiness when the pid namespace was implemented and this was the
result.  I remember that file locks were a rough spot during the
conversion, but I don't recall the details off the top of my head.

Eric
Nikolay Borisov Aug. 3, 2016, 9:26 p.m.
On  4.08.2016 00:09, Eric W. Biederman wrote:
> Dumb question does anyone know the difference between fl_nspid and
> fl_pid off the top of your heads?  I am looking at the code and I am
> confused about why we have both.  I am afraid that there was some
> sloppiness when the pid namespace was implemented and this was the
> result.  I remember that file locks were a rough spot during the
> conversion but I don't recall the details off the top of my head.

I think fl_nspid reflects the actual pid namespace where the lock was
created, whereas fl_pid can be just a unique value required by NFS. I
gathered that information while reading the changelogs and the mailing
list discussion here:
https://lists.linuxfoundation.org/pipermail/containers/2007-December/009044.html
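
Paraphrasing how that difference shows up in fs/locks.c of that era (rough,
hypothetical helpers for illustration, not the exact source): the creator is
recorded twice, once as a plain number in fl_pid and once as a namespace-aware
struct pid reference in fl_nspid, and lock_get_status() prefers the latter,
translating it with pid_vnr() into the reader's pid namespace:

	/* Sketch: how the two fields are populated when a lock is recorded. */
	static void sketch_record_owner(struct file_lock *fl)
	{
		/* plain number; lock managers such as lockd may store a
		 * different id here (cf. the svid discussion below) */
		fl->fl_pid = current->tgid;
		/* namespace-aware reference to the creating task */
		fl->fl_nspid = get_pid(task_tgid(current));
	}

	/* Sketch: which value lock_get_status() ends up printing. */
	static pid_t sketch_pid_to_show(struct file_lock *fl)
	{
		if (fl->fl_nspid)
			return pid_vnr(fl->fl_nspid);	/* reader's pid ns */
		return fl->fl_pid;
	}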
Eric W. Biederman Aug. 4, 2016, 4:18 a.m.
Nikolay Borisov <n.borisov.lkml@gmail.com> writes:

> On  4.08.2016 00:09, Eric W. Biederman wrote:
>> Dumb question does anyone know the difference between fl_nspid and
>> fl_pid off the top of your heads?  I am looking at the code and I am
>> confused about why we have both.  I am afraid that there was some
>> sloppiness when the pid namespace was implemented and this was the
>> result.  I remember that file locks were a rough spot during the
>> conversion but I don't recall the details off the top of my head.
>
> I think fl_nspid reflects the actual pid namespace where the lock was
> created, whereas fl_pid can be just a unique value required by NFS. I
> gathered that information while reading the changelogs and the mailing
> list discussion here:
> https://lists.linuxfoundation.org/pipermail/containers/2007-December/009044.html

Thanks for the old thread.

Researching this myself, I see that for posix locks we have struct flock,
which has a field l_pid that must be the pid of the process holding
the lock.

It looks like the explanation given in the old thread was inadequate,
and the code in the kernel is definitely incorrect with respect to pid
namespaces.  If the process holding the lock is in a different pid
namespace than the process waiting for the lock l_pid will be set
incorrectly.

Looking at the code it appears from the perspective of struct file_lock
the only difference between a remote lock and a local lock is that
fl_owner is set to point at a different structure.

Looking at the nfs case, if I am reading my sources correctly, the struct
nlm field named svid is the process id on the remote system, and other
nlm fields distinguish the host.

Is the desired behavior for nfs that for a remote lock we set l_pid
to the remote process id, and don't report a hint about which machine
the process id is on?

It does seem to make sense to have both: fl_pid as a value usable
remotely and otherwise, and fl_nspid (arguably it should be fl_local_pid)
as a value usable on the local machine.

I think the code that sets fl_nspid today is anything but obviously
correct and probably needs to be rewritten.

Eric
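
For reference, l_pid is what an F_GETLK query hands back to userspace when
another process holds a conflicting lock; a minimal (hypothetical) probe looks
like this:

	/* Probe a file with F_GETLK and print the pid reported in l_pid for a
	 * conflicting lock.  Error handling trimmed; the path is just an example. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		struct flock fl = {
			.l_type   = F_WRLCK,	/* ask: would a write lock conflict? */
			.l_whence = SEEK_SET,
			.l_start  = 0,
			.l_len    = 0,		/* the whole file */
		};
		int fd = open("/tmp/lock-probe", O_RDWR | O_CREAT, 0644);

		if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
			return 1;

		if (fl.l_type == F_UNLCK)
			printf("no conflicting lock\n");
		else
			printf("conflicting lock held by pid %ld\n", (long)fl.l_pid);

		close(fd);
		return 0;
	}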
Eric W. Biederman Aug. 4, 2016, 5:07 a.m.
ebiederm@xmission.com (Eric W. Biederman) writes:

> Nikolay Borisov <n.borisov.lkml@gmail.com> writes:
>
>> On  4.08.2016 00:09, Eric W. Biederman wrote:
>>> Dumb question does anyone know the difference between fl_nspid and
>>> fl_pid off the top of your heads?  I am looking at the code and I am
>>> confused about why we have both.  I am afraid that there was some
>>> sloppiness when the pid namespace was implemented and this was the
>>> result.  I remember that file locks were a rough spot during the
>>> conversion but I don't recall the details off the top of my head.
>>
>> I think fl_nspid reflects the actual pid namespace where the lock was
>> created, whereas fl_pid can be just a unique value required by NFS. I
>> gathered that information while reading the changelogs and the mailing
>> list discussion here:
>> https://lists.linuxfoundation.org/pipermail/containers/2007-December/009044.html
>
> Thanks for the old thread.
>
> Researching myself I see that for posix locks we have struct flock
> that has a field l_pid that must be the pid of the process holding
> the lock.
>
> It looks like the explanation given in the old thread was inadequate,
> and the code in the kernel is definitely incorrect with respect to pid
> namespaces.  If the process holding the lock is in a different pid
> namespace than the process waiting for the lock l_pid will be set
> incorrectly.
>
> Looking at the code it appears from the perspective of struct file_lock
> the only difference between a remote lock and a local lock is that
> fl_owner is set to point at a different structure.
>
> Looking at the nfs case if I am reading my sources correctly the struct
> nlm field named svid is the process id on the remote system, and other
> nlm fields distinguish the host.
>
> Is the desired behavior for nfs that for a remote lock we set l_pid
> to the remote process id, and don't report a hint about which machine
> the process id is on?
>
> It does seem to make sense to have both fl_pid as a value usable
> remotely and otherwise and fl_nspid (arguably it should be fl_local_pid)
> as a value usable on the local machine.
>
> I think the code that sets fl_nspid today is anything but obviously
> correct and probably needs to be rewritten.

Bah.  I take it back: F_GETLK is not broken when crossing pid
namespaces, the code is just dangerously clever.

Eric