[RFC] locks: Show only file_locks created in the same pidns as current process

Submitted by Nikolay Borisov on Aug. 2, 2016, 2:42 p.m.

Details

Message ID 1470148943-21835-1-git-send-email-kernel@kyup.com
State New
Series "locks: Show only file_locks created in the same pidns as current process"

Commit Message

Nikolay Borisov Aug. 2, 2016, 2:42 p.m.
Currently when /proc/locks is read it will show all the file locks
which are currently created on the machine. On containers, hosted
on busy servers this means that doing lsof can be very slow. I
observed up to 5 seconds stalls reading 50k locks, while the container
itself had only a small number of relevant entries. Fix it by
filtering the locks listed by the pidns of the current process
and the process which created the lock.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/locks.c | 8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/fs/locks.c b/fs/locks.c
index 6333263b7bc8..53e96df4c583 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2615,9 +2615,17 @@  static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
+	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
+		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));
+	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
+		(pid_ns != ns_of_pid(fl->fl_nspid)))
+		    return 0;
+
 	lock_get_status(f, fl, iter->li_pos, "");
 
 	list_for_each_entry(bfl, &fl->fl_block, fl_block)

Comments

Nikolay Borisov Aug. 2, 2016, 2:45 p.m.
On 08/02/2016 05:42 PM, Nikolay Borisov wrote:
> Currently when /proc/locks is read it will show all the file locks
> which are currently created on the machine. On containers, hosted
> on busy servers this means that doing lsof can be very slow. I
> observed up to 5 seconds stalls reading 50k locks, while the container
> itself had only a small number of relevant entries. Fix it by
> filtering the locks listed by the pidns of the current process
> and the process which created the lock.
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 6333263b7bc8..53e96df4c583 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2615,9 +2615,17 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *pid_ns = task_active_pid_ns(current);
> +
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
> +		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));

Obviously I don't intend on including that in the final submission.

> +	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
> +		(pid_ns != ns_of_pid(fl->fl_nspid)))
> +		    return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>
J. Bruce Fields Aug. 2, 2016, 3:05 p.m.
On Tue, Aug 02, 2016 at 05:42:23PM +0300, Nikolay Borisov wrote:
> Currently when /proc/locks is read it will show all the file locks
> which are currently created on the machine. On containers, hosted
> on busy servers this means that doing lsof can be very slow. I
> observed up to 5 seconds stalls reading 50k locks,

Do you mean just that the reading process itself was blocked, or that
others were getting stuck on blocked_lock_lock?

(And what process was actually reading /proc/locks, out of curiosity?)

> while the container
> itself had only a small number of relevant entries. Fix it by
> filtering the locks listed by the pidns of the current process
> and the process which created the lock.

Thanks, that's interesting.  So you show a lock if it was created by
someone in the current pid namespace.  With a special exception for the
init namespace, so that everything stays visible from there.

If a filesystem is shared between containers that means you won't
necessarily be able to figure out from within a container which lock is
conflicting with your lock.  (I don't know if that's really a problem.
I'm unfortunately short on evidence about what people actually use
/proc/locks for....)

--b.

> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 6333263b7bc8..53e96df4c583 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2615,9 +2615,17 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *pid_ns = task_active_pid_ns(current);
> +
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
> +		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));
> +	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
> +		(pid_ns != ns_of_pid(fl->fl_nspid)))
> +		    return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> -- 
> 2.5.0
Nikolay Borisov Aug. 2, 2016, 3:20 p.m.
On 08/02/2016 06:05 PM, J. Bruce Fields wrote:
> On Tue, Aug 02, 2016 at 05:42:23PM +0300, Nikolay Borisov wrote:
>> Currently when /proc/locks is read it will show all the file locks
>> which are currently created on the machine. On containers, hosted
>> on busy servers this means that doing lsof can be very slow. I
>> observed up to 5 seconds stalls reading 50k locks,
> 
> Do you mean just that the reading process itself was blocked, or that
> others were getting stuck on blocked_lock_lock?

I mean the listing process. Here is a simplified example from cat: 

cat-15084 [010] 3394000.190341: funcgraph_entry:      # 6156.641 us |  vfs_read();
cat-15084 [010] 3394000.196568: funcgraph_entry:      # 6096.618 us |  vfs_read();
cat-15084 [010] 3394000.202743: funcgraph_entry:      # 6060.097 us |  vfs_read();
cat-15084 [010] 3394000.208937: funcgraph_entry:      # 6111.374 us |  vfs_read();


> 
> (And what process was actually reading /proc/locks, out of curiosity?)

lsof in my case

> 
>> while the container
>> itself had only a small number of relevant entries. Fix it by
>> filtering the locks listed by the pidns of the current process
>> and the process which created the lock.
> 
> Thanks, that's interesting.  So you show a lock if it was created by
> someone in the current pid namespace.  With a special exception for the
> init namespace so that 

I admit this is a rather naive approach. Something else I was pondering was 
checking whether the user_ns of the lock's creator pidns is the same as the 
reader's user_ns. That should potentially solve your concerns re. 
shared filesystems, no? Or whether the reader's userns is an ancestor 
of the user_ns of the creator's pidns? Maybe Eric can elaborate whether 
this would make sense?

> 
> If a filesystem is shared between containers that means you won't
> necessarily be able to figure out from within a container which lock is
> conflicting with your lock.  (I don't know if that's really a problem.
> I'm unfortunately short on evidence aobut what people actually use
> /proc/locks for....)
> 
> --b.
> 
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>> ---
>>  fs/locks.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index 6333263b7bc8..53e96df4c583 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -2615,9 +2615,17 @@ static int locks_show(struct seq_file *f, void *v)
>>  {
>>  	struct locks_iterator *iter = f->private;
>>  	struct file_lock *fl, *bfl;
>> +	struct pid_namespace *pid_ns = task_active_pid_ns(current);
>> +
>>  
>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>  
>> +	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
>> +		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));
>> +	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
>> +		(pid_ns != ns_of_pid(fl->fl_nspid)))
>> +		    return 0;
>> +
>>  	lock_get_status(f, fl, iter->li_pos, "");
>>  
>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>> -- 
>> 2.5.0
J. Bruce Fields Aug. 2, 2016, 3:43 p.m.
On Tue, Aug 02, 2016 at 06:20:32PM +0300, Nikolay Borisov wrote:
> On 08/02/2016 06:05 PM, J. Bruce Fields wrote:
> > (And what process was actually reading /proc/locks, out of curiosity?)
> 
> lsof in my case

Oh, thanks, and you said that at the start, and I overlooked
it--apologies.

> >> while the container
> >> itself had only a small number of relevant entries. Fix it by
> >> filtering the locks listed by the pidns of the current process
> >> and the process which created the lock.
> > 
> > Thanks, that's interesting.  So you show a lock if it was created by
> > someone in the current pid namespace.  With a special exception for the
> > init namespace so that 
> 
> I admit this is a rather naive approach. Something else I was pondering was 
> checking whether the user_ns of the lock's creator pidns is the same as the 
> reader's user_ns. That should potentially solve your concerns re. 
> shared filesystems, no? Or whether the reader's userns is an ancestor 
> of the user'ns of the creator's pidns? Maybe Eric can elaborate whether 
> this would make sense?

If I could just imagine myself king of the world for a moment--I wish I
could have an interface that took a path or a filehandle and gave back a
list of locks on the associated filesystem.  Then if lsof wanted a
global list, it would go through /proc/mounts and request the list of
locks for each filesystem.

For /proc/locks it might be nice if we could restrict to locks on
filesystem that are somehow visible to the current process, but I don't
know if there's a simple way to do that.

--b.

> 
> > 
> > If a filesystem is shared between containers that means you won't
> > necessarily be able to figure out from within a container which lock is
> > conflicting with your lock.  (I don't know if that's really a problem.
> > I'm unfortunately short on evidence aobut what people actually use
> > /proc/locks for....)
> > 
> > --b.
> > 
> >>
> >> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> >> ---
> >>  fs/locks.c | 8 ++++++++
> >>  1 file changed, 8 insertions(+)
> >>
> >> diff --git a/fs/locks.c b/fs/locks.c
> >> index 6333263b7bc8..53e96df4c583 100644
> >> --- a/fs/locks.c
> >> +++ b/fs/locks.c
> >> @@ -2615,9 +2615,17 @@ static int locks_show(struct seq_file *f, void *v)
> >>  {
> >>  	struct locks_iterator *iter = f->private;
> >>  	struct file_lock *fl, *bfl;
> >> +	struct pid_namespace *pid_ns = task_active_pid_ns(current);
> >> +
> >>  
> >>  	fl = hlist_entry(v, struct file_lock, fl_link);
> >>  
> >> +	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
> >> +		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));
> >> +	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
> >> +		(pid_ns != ns_of_pid(fl->fl_nspid)))
> >> +		    return 0;
> >> +
> >>  	lock_get_status(f, fl, iter->li_pos, "");
> >>  
> >>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> >> -- 
> >> 2.5.0
Eric W. Biederman Aug. 2, 2016, 4 p.m.
Nikolay Borisov <kernel@kyup.com> writes:

> Currently when /proc/locks is read it will show all the file locks
> which are currently created on the machine. On containers, hosted
> on busy servers this means that doing lsof can be very slow. I
> observed up to 5 seconds stalls reading 50k locks, while the container
> itself had only a small number of relevant entries. Fix it by
> filtering the locks listed by the pidns of the current process
> and the process which created the lock.

The locks always confuse me so I am not 100% sure connecting locks
to a pid namespace is appropriate.

That said if you are going to filter by pid namespace please use the pid
namespace of proc, not the pid namespace of the process reading the
file.

Different contents of files depending on who opens them is generally to
be discouraged.

Eric

> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 6333263b7bc8..53e96df4c583 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2615,9 +2615,17 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *pid_ns = task_active_pid_ns(current);
> +
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
> +		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));
> +	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
> +		(pid_ns != ns_of_pid(fl->fl_nspid)))
> +		    return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
J. Bruce Fields Aug. 2, 2016, 5:40 p.m.
On Tue, Aug 02, 2016 at 11:00:39AM -0500, Eric W. Biederman wrote:
> Nikolay Borisov <kernel@kyup.com> writes:
> 
> > Currently when /proc/locks is read it will show all the file locks
> > which are currently created on the machine. On containers, hosted
> > on busy servers this means that doing lsof can be very slow. I
> > observed up to 5 seconds stalls reading 50k locks, while the container
> > itself had only a small number of relevant entries. Fix it by
> > filtering the locks listed by the pidns of the current process
> > and the process which created the lock.
> 
> The locks always confuse me so I am not 100% connecting locks
> to a pid namespace is appropriate.
> 
> That said if you are going to filter by pid namespace please use the pid
> namespace of proc, not the pid namespace of the process reading the
> file.

Oh, that makes sense, thanks.

What does /proc/mounts use, out of curiosity?  The mount namespace that
/proc was originally mounted in?

--b.

> 
> Different contents of files depending on who opens them is generally to
> be discouraged.
> 
> Eric
> 
> > Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> > ---
> >  fs/locks.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 6333263b7bc8..53e96df4c583 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -2615,9 +2615,17 @@ static int locks_show(struct seq_file *f, void *v)
> >  {
> >  	struct locks_iterator *iter = f->private;
> >  	struct file_lock *fl, *bfl;
> > +	struct pid_namespace *pid_ns = task_active_pid_ns(current);
> > +
> >  
> >  	fl = hlist_entry(v, struct file_lock, fl_link);
> >  
> > +	pr_info ("Current pid_ns: %p init_pid_ns: %p, fl->fl_nspid: %p nspidof:%p\n", pid_ns, &init_pid_ns,
> > +		 fl->fl_nspid, ns_of_pid(fl->fl_nspid));
> > +	if ((pid_ns != &init_pid_ns) && fl->fl_nspid &&
> > +		(pid_ns != ns_of_pid(fl->fl_nspid)))
> > +		    return 0;
> > +
> >  	lock_get_status(f, fl, iter->li_pos, "");
> >  
> >  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
Eric W. Biederman Aug. 2, 2016, 7:09 p.m.
"J. Bruce Fields" <bfields@fieldses.org> writes:

> On Tue, Aug 02, 2016 at 11:00:39AM -0500, Eric W. Biederman wrote:
>> Nikolay Borisov <kernel@kyup.com> writes:
>> 
>> > Currently when /proc/locks is read it will show all the file locks
>> > which are currently created on the machine. On containers, hosted
>> > on busy servers this means that doing lsof can be very slow. I
>> > observed up to 5 seconds stalls reading 50k locks, while the container
>> > itself had only a small number of relevant entries. Fix it by
>> > filtering the locks listed by the pidns of the current process
>> > and the process which created the lock.
>> 
>> The locks always confuse me so I am not 100% connecting locks
>> to a pid namespace is appropriate.
>> 
>> That said if you are going to filter by pid namespace please use the pid
>> namespace of proc, not the pid namespace of the process reading the
>> file.
>
> Oh, that makes sense, thanks.
>
> What does /proc/mounts use, out of curiosity?  The mount namespace that
> /proc was originally mounted in?

/proc/mounts -> /proc/self/mounts

/proc/[pid]/mounts lists mounts from the mount namespace of the
appropriate process.

That is another way to go, but it is a tread-carefully thing: changing
things that way, it is easy to surprise apparmor or selinux rules and
find you broke someone's userspace in a way that prevents booting.
Although I suspect /proc/locks isn't too bad.

Eric
J. Bruce Fields Aug. 2, 2016, 7:44 p.m.
On Tue, Aug 02, 2016 at 02:09:22PM -0500, Eric W. Biederman wrote:
> "J. Bruce Fields" <bfields@fieldses.org> writes:
> 
> > On Tue, Aug 02, 2016 at 11:00:39AM -0500, Eric W. Biederman wrote:
> >> Nikolay Borisov <kernel@kyup.com> writes:
> >> 
> >> > Currently when /proc/locks is read it will show all the file locks
> >> > which are currently created on the machine. On containers, hosted
> >> > on busy servers this means that doing lsof can be very slow. I
> >> > observed up to 5 seconds stalls reading 50k locks, while the container
> >> > itself had only a small number of relevant entries. Fix it by
> >> > filtering the locks listed by the pidns of the current process
> >> > and the process which created the lock.
> >> 
> >> The locks always confuse me so I am not 100% connecting locks
> >> to a pid namespace is appropriate.
> >> 
> >> That said if you are going to filter by pid namespace please use the pid
> >> namespace of proc, not the pid namespace of the process reading the
> >> file.
> >
> > Oh, that makes sense, thanks.
> >
> > What does /proc/mounts use, out of curiosity?  The mount namespace that
> > /proc was originally mounted in?
> 
> /proc/mounts -> /proc/self/mounts

D'oh, I knew that.

> /proc/[pid]/mounts lists mounts from the mount namespace of the
> appropriate process.
> 
> That is another way to go but it is a tread carefully thing as changing
> things that way it is easy to surprise apparmor or selinux rules and be
> surprised you broke someones userspace in a way that prevents booting.
> Although I suspect /proc/locks isn't too bad.

OK, thanks.

/proc/[pid]/locks might be confusing.  I'd expect it to be "all the
locks owned by this task", rather than "all the locks owned by pid's in
the same pid namespace", or whatever criterion we choose.

Uh, I'm still trying to think of the Obviously Right solution here, and
it's not coming.

--b.
Jeff Layton Aug. 2, 2016, 8:01 p.m.
On Tue, 2016-08-02 at 15:44 -0400, J. Bruce Fields wrote:
> On Tue, Aug 02, 2016 at 02:09:22PM -0500, Eric W. Biederman wrote:
> > 
> > > > "J. Bruce Fields" <bfields@fieldses.org> writes:
> > 
> > > 
> > > On Tue, Aug 02, 2016 at 11:00:39AM -0500, Eric W. Biederman wrote:
> > > > 
> > > > > > > > Nikolay Borisov <kernel@kyup.com> writes:
> > > > 
> > > > > 
> > > > > Currently when /proc/locks is read it will show all the file locks
> > > > > which are currently created on the machine. On containers, hosted
> > > > > on busy servers this means that doing lsof can be very slow. I
> > > > > observed up to 5 seconds stalls reading 50k locks, while the container
> > > > > itself had only a small number of relevant entries. Fix it by
> > > > > filtering the locks listed by the pidns of the current process
> > > > > and the process which created the lock.
> > > > 
> > > > The locks always confuse me so I am not 100% connecting locks
> > > > to a pid namespace is appropriate.
> > > > 
> > > > That said if you are going to filter by pid namespace please use the pid
> > > > namespace of proc, not the pid namespace of the process reading the
> > > > file.
> > > 
> > > Oh, that makes sense, thanks.
> > > 
> > > What does /proc/mounts use, out of curiosity?  The mount namespace that
> > > /proc was originally mounted in?
> > 
> > /proc/mounts -> /proc/self/mounts
> 
> D'oh, I knew that.
> 
> > 
> > /proc/[pid]/mounts lists mounts from the mount namespace of the
> > appropriate process.
> > 
> > That is another way to go but it is a tread carefully thing as changing
> > things that way it is easy to surprise apparmor or selinux rules and be
> > surprised you broke someones userspace in a way that prevents booting.
> > Although I suspect /proc/locks isn't too bad.
> 
> OK, thanks.
> 
> /proc/[pid]/locks might be confusing.  I'd expect it to be "all the
> locks owned by this task", rather than "all the locks owned by pid's in
> the same pid namespace", or whatever criterion we choose.
> 
> Uh, I'm still trying to think of the Obviously Right solution here, and
> it's not coming.
> 
> --b.


I'm a little leery of changing how this works. It has always been
maintained as a legacy interface, so do we run the risk of breaking
something if we turn it into a per-namespace thing? This also doesn't
solve the problem of slow traversal in the init_pid_ns -- only in a
container.

I also can't help but feel that /proc/locks is just showing its age. It
was fine in the late 90's, but its limitations are just becoming more
apparent as things get more complex. It was never designed for
performance as you end up thrashing several spinlocks when reading it.

Maybe it's time to think about presenting this info in another way? A
global view of all locks on the system is interesting but maybe it
would be better to present it more granularly somehow?

I guess I should go look at what lsof actually does with this info...
Nikolay Borisov Aug. 2, 2016, 8:11 p.m.
On Tue, Aug 2, 2016 at 11:01 PM, Jeff Layton <jlayton@poochiereds.net> wrote:
> On Tue, 2016-08-02 at 15:44 -0400, J. Bruce Fields wrote:
>> On Tue, Aug 02, 2016 at 02:09:22PM -0500, Eric W. Biederman wrote:
>> >
>> > > > "J. Bruce Fields" <bfields@fieldses.org> writes:
>> >
>> > >
>> > > On Tue, Aug 02, 2016 at 11:00:39AM -0500, Eric W. Biederman wrote:
>> > > >
>> > > > > > > > Nikolay Borisov <kernel@kyup.com> writes:
>> > > >
>> > > > >
>> > > > > Currently when /proc/locks is read it will show all the file locks
>> > > > > which are currently created on the machine. On containers, hosted
>> > > > > on busy servers this means that doing lsof can be very slow. I
>> > > > > observed up to 5 seconds stalls reading 50k locks, while the container
>> > > > > itself had only a small number of relevant entries. Fix it by
>> > > > > filtering the locks listed by the pidns of the current process
>> > > > > and the process which created the lock.
>> > > >
>> > > > The locks always confuse me so I am not 100% connecting locks
>> > > > to a pid namespace is appropriate.
>> > > >
>> > > > That said if you are going to filter by pid namespace please use the pid
>> > > > namespace of proc, not the pid namespace of the process reading the
>> > > > file.
>> > >
>> > > Oh, that makes sense, thanks.
>> > >
>> > > What does /proc/mounts use, out of curiosity?  The mount namespace that
>> > > /proc was originally mounted in?
>> >
>> > /proc/mounts -> /proc/self/mounts
>>
>> D'oh, I knew that.
>>
>> >
>> > /proc/[pid]/mounts lists mounts from the mount namespace of the
>> > appropriate process.
>> >
>> > That is another way to go but it is a tread carefully thing as changing
>> > things that way it is easy to surprise apparmor or selinux rules and be
>> > surprised you broke someones userspace in a way that prevents booting.
>> > Although I suspect /proc/locks isn't too bad.
>>
>> OK, thanks.
>>
>> /proc/[pid]/locks might be confusing.  I'd expect it to be "all the
>> locks owned by this task", rather than "all the locks owned by pid's in
>> the same pid namespace", or whatever criterion we choose.
>>
>> Uh, I'm still trying to think of the Obviously Right solution here, and
>> it's not coming.
>>
>> --b.
>
>
> I'm a little leery of changing how this works. It has always been
> maintained as a legacy interface, so do we run the risk of breaking
> something if we turn it into a per-namespace thing? This also doesn't
> solve the problem of slow traversal in the init_pid_ns -- only in a
> container.
>
> I also can't help but feel that /proc/locks is just showing its age. It
> was fine in the late 90's, but its limitations are just becoming more
> apparent as things get more complex. It was never designed for
> performance as you end up thrashing several spinlocks when reading it.

I believe it's also used by CRIU, though in this case you'd be doing
that from the init ns so I guess it's not that big of a problem there.

>
> Maybe it's time to think about presenting this info in another way? A
> global view of all locks on the system is interesting but maybe it
> would be better to present it more granularly somehow?
>
> I guess I should go look at what lsof actually does with this info...
>
> --
> Jeff Layton <jlayton@poochiereds.net>
J. Bruce Fields Aug. 2, 2016, 8:34 p.m.
On Tue, Aug 02, 2016 at 04:01:22PM -0400, Jeff Layton wrote:
> On Tue, 2016-08-02 at 15:44 -0400, J. Bruce Fields wrote:
> > On Tue, Aug 02, 2016 at 02:09:22PM -0500, Eric W. Biederman wrote:
> > > 
> > > > > "J. Bruce Fields" <bfields@fieldses.org> writes:
> > > 
> > > > 
> > > > On Tue, Aug 02, 2016 at 11:00:39AM -0500, Eric W. Biederman wrote:
> > > > > 
> > > > > > > > > Nikolay Borisov <kernel@kyup.com> writes:
> > > > > 
> > > > > > 
> > > > > > Currently when /proc/locks is read it will show all the file locks
> > > > > > which are currently created on the machine. On containers, hosted
> > > > > > on busy servers this means that doing lsof can be very slow. I
> > > > > > observed up to 5 seconds stalls reading 50k locks, while the container
> > > > > > itself had only a small number of relevant entries. Fix it by
> > > > > > filtering the locks listed by the pidns of the current process
> > > > > > and the process which created the lock.
> > > > > 
> > > > > The locks always confuse me so I am not 100% connecting locks
> > > > > to a pid namespace is appropriate.
> > > > > 
> > > > > That said if you are going to filter by pid namespace please use the pid
> > > > > namespace of proc, not the pid namespace of the process reading the
> > > > > file.
> > > > 
> > > > Oh, that makes sense, thanks.
> > > > 
> > > > What does /proc/mounts use, out of curiosity?  The mount namespace that
> > > > /proc was originally mounted in?
> > > 
> > > /proc/mounts -> /proc/self/mounts
> > 
> > D'oh, I knew that.
> > 
> > > 
> > > /proc/[pid]/mounts lists mounts from the mount namespace of the
> > > appropriate process.
> > > 
> > > That is another way to go but it is a tread carefully thing as changing
> > > things that way it is easy to surprise apparmor or selinux rules and be
> > > surprised you broke someones userspace in a way that prevents booting.
> > > Although I suspect /proc/locks isn't too bad.
> > 
> > OK, thanks.
> > 
> > /proc/[pid]/locks might be confusing.  I'd expect it to be "all the
> > locks owned by this task", rather than "all the locks owned by pid's in
> > the same pid namespace", or whatever criterion we choose.
> > 
> > Uh, I'm still trying to think of the Obviously Right solution here, and
> > it's not coming.
> > 
> > --b.
> 
> 
> I'm a little leery of changing how this works. It has always been
> maintained as a legacy interface, so do we run the risk of breaking
> something if we turn it into a per-namespace thing?

The namespace work is all about making interfaces per-namespace.  I
guess it works as long as it contributes to the illusion that each
container is its own machine.

Thinking about it, I might be sold on the per-pidns approach (with
Eric's modification to use the pidns of /proc not the reader).

My complaint about not being able to see conflicting locks would apply
just as well to conflicts from nfs locks held by other clients.  A disk
filesystem shared across multiple containers is a little like an nfs
filesystem shared between nfs clients.

That'd solve this immediate problem without requiring an lsof upgrade as
well.

> This also doesn't
> solve the problem of slow traversal in the init_pid_ns -- only in a
> container.
> 
> I also can't help but feel that /proc/locks is just showing its age. It
> was fine in the late 90's, but its limitations are just becoming more
> apparent as things get more complex. It was never designed for
> performance as you end up thrashing several spinlocks when reading it.
> 
> Maybe it's time to think about presenting this info in another way? A
> global view of all locks on the system is interesting but maybe it
> would be better to present it more granularly somehow?

But, yes, that might be a good idea.

--b.

> 
> I guess I should go look at what lsof actually does with this info...
> 
> -- 
> Jeff Layton <jlayton@poochiereds.net>