[v2] locks: Filter /proc/locks output on proc pid ns

Submitted by Nikolay Borisov on Aug. 3, 2016, 7:35 a.m.

Details

Message ID 1470209710-30022-1-git-send-email-kernel@kyup.com
State New
Series "locks: Show only file_locks created in the same pidns as current process"

Commit Message

Nikolay Borisov Aug. 3, 2016, 7:35 a.m.
On busy container servers reading /proc/locks shows all the locks
created by all clients. This can cause large latency spikes. In my
case I observed lsof taking up to 5-10 seconds while processing around
50k locks. Fix this by limiting the locks shown only to those created
in the same pidns as the one the proc was mounted in. When reading
/proc/locks from the init_pid_ns show everything.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/locks.c | 6 ++++++
 1 file changed, 6 insertions(+)

Patch

diff --git a/fs/locks.c b/fs/locks.c
index ee1b15f6fc13..751673d7f7fc 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
+	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct pid_namespace *current_pidns = task_active_pid_ns(current);
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
+	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
+	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
+		return 0;
+
 	lock_get_status(f, fl, iter->li_pos, "");
 
 	list_for_each_entry(bfl, &fl->fl_block, fl_block)

Comments

Jeff Layton Aug. 3, 2016, 1:46 p.m.
On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
> On busy container servers reading /proc/locks shows all the locks
> created by all clients. This can cause large latency spikes. In my
> case I observed lsof taking up to 5-10 seconds while processing around
> 50k locks. Fix this by limiting the locks shown only to those created
> in the same pidns as the one the proc was mounted in. When reading
> /proc/locks from the init_pid_ns show everything.
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index ee1b15f6fc13..751673d7f7fc 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid

Ok, so when you read from a process that's in the init_pid_ns
namespace, then you'll get the whole pile of locks, even when reading
this from a filesystem that was mounted in a different pid_ns?

That seems odd to me if so. Any reason not to just uniformly use the
proc_pidns here?

> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> +		return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
Nikolay Borisov Aug. 3, 2016, 2:17 p.m.
On 08/03/2016 04:46 PM, Jeff Layton wrote:
> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>> On busy container servers reading /proc/locks shows all the locks
>> created by all clients. This can cause large latency spikes. In my
>> case I observed lsof taking up to 5-10 seconds while processing around
>> 50k locks. Fix this by limiting the locks shown only to those created
>> in the same pidns as the one the proc was mounted in. When reading
>> /proc/locks from the init_pid_ns show everything.
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>> ---
>>  fs/locks.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index ee1b15f6fc13..751673d7f7fc 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>  {
>>  	struct locks_iterator *iter = f->private;
>>  	struct file_lock *fl, *bfl;
>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>  
>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>  
>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
> 
> Ok, so when you read from a process that's in the init_pid_ns
> namespace, then you'll get the whole pile of locks, even when reading
> this from a filesystem that was mounted in a different pid_ns?
> 
> That seems odd to me if so. Any reason not to just uniformly use the
> proc_pidns here?

[CCing some people from openvz/CRIU]

My train of thought was "we should have means which would be the one
universal truth about everything and this would be a process in the
init_pid_ns". I don't have strong preference as long as I'm not breaking
userspace. As I said before - I think the CRIU guys might be using that
interface.

> 
>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>> +		return 0;
>> +
>>  	lock_get_status(f, fl, iter->li_pos, "");
>>  
>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>
J. Bruce Fields Aug. 3, 2016, 2:28 p.m.
On Wed, Aug 03, 2016 at 05:17:09PM +0300, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
> > On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
> >> On busy container servers reading /proc/locks shows all the locks
> >> created by all clients. This can cause large latency spikes. In my
> >> case I observed lsof taking up to 5-10 seconds while processing around
> >> 50k locks. Fix this by limiting the locks shown only to those created
> >> in the same pidns as the one the proc was mounted in. When reading
> >> /proc/locks from the init_pid_ns show everything.
> >>
> >> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> >> ---
> >>  fs/locks.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/fs/locks.c b/fs/locks.c
> >> index ee1b15f6fc13..751673d7f7fc 100644
> >> --- a/fs/locks.c
> >> +++ b/fs/locks.c
> >> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
> >>  {
> >>  	struct locks_iterator *iter = f->private;
> >>  	struct file_lock *fl, *bfl;
> >> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> >> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
> >>  
> >>  	fl = hlist_entry(v, struct file_lock, fl_link);
> >>  
> >> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
> > 
> > Ok, so when you read from a process that's in the init_pid_ns
> > namespace, then you'll get the whole pile of locks, even when reading
> > this from a filesystem that was mounted in a different pid_ns?
> > 
> > That seems odd to me if so. Any reason not to just uniformly use the
> > proc_pidns here?
> 
> [CCing some people from openvz/CRIU]
> 
> My train of thought was "we should have means which would be the one
> universal truth about everything and this would be a process in the
> init_pid_ns".

OK, but why not make that means be "mount proc from the init_pid_ns and
read /proc/locks there".  So just replace current_pidns with proc_pidns
in the above.  I think that's all Jeff was suggesting.

--b.

> I don't have strong preference as long as I'm not breaking
> userspace. As I said before - I think the CRIU guys might be using that
> interface.
> 
> > 
> >> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> >> +		return 0;
> >> +
> >>  	lock_get_status(f, fl, iter->li_pos, "");
> >>  
> >>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> >
Nikolay Borisov Aug. 3, 2016, 2:33 p.m.
On 08/03/2016 05:28 PM, J. Bruce Fields wrote:
> On Wed, Aug 03, 2016 at 05:17:09PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>>> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>>>> On busy container servers reading /proc/locks shows all the locks
>>>> created by all clients. This can cause large latency spikes. In my
>>>> case I observed lsof taking up to 5-10 seconds while processing around
>>>> 50k locks. Fix this by limiting the locks shown only to those created
>>>> in the same pidns as the one the proc was mounted in. When reading
>>>> /proc/locks from the init_pid_ns show everything.
>>>>
>>>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>>>> ---
>>>>  fs/locks.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/fs/locks.c b/fs/locks.c
>>>> index ee1b15f6fc13..751673d7f7fc 100644
>>>> --- a/fs/locks.c
>>>> +++ b/fs/locks.c
>>>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>>>  {
>>>>  	struct locks_iterator *iter = f->private;
>>>>  	struct file_lock *fl, *bfl;
>>>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>>>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>>>  
>>>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>>>  
>>>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>>>
>>> Ok, so when you read from a process that's in the init_pid_ns
>>> namespace, then you'll get the whole pile of locks, even when reading
>>> this from a filesystem that was mounted in a different pid_ns?
>>>
>>> That seems odd to me if so. Any reason not to just uniformly use the
>>> proc_pidns here?
>>
>> [CCing some people from openvz/CRIU]
>>
>> My train of thought was "we should have means which would be the one
>> universal truth about everything and this would be a process in the
>> init_pid_ns".
> 
> OK, but why not make that means be "mount proc from the init_pid_ns and
> read /proc/locks there".  So just replace current_pidns with proc_pidns
> in the above.  I think that's all Jeff was suggesting.

Oh, you are right. Silly me, yes, I'm happy with this and I will send a
patch.
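
A minimal sketch of how locks_show() could look once the current_pidns special case is dropped (an illustration only, not the actual follow-up patch; taking Bruce's wording literally would compare ns_of_pid(fl->fl_nspid) against proc_pidns, while the sketch below assumes a pid_nr_ns() visibility check so that a proc mounted in init_pid_ns still shows every lock):

static int locks_show(struct seq_file *f, void *v)
{
	struct locks_iterator *iter = f->private;
	struct file_lock *fl, *bfl;
	/* pid namespace this proc instance was mounted in */
	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;

	fl = hlist_entry(v, struct file_lock, fl_link);

	/*
	 * Skip locks whose creator pid is not visible in the pid
	 * namespace the proc instance was mounted in.  Every pid has
	 * a mapping in init_pid_ns, so a proc mounted there still
	 * shows all locks, with no special case needed.
	 */
	if (fl->fl_nspid && !pid_nr_ns(fl->fl_nspid, proc_pidns))
		return 0;

	lock_get_status(f, fl, iter->li_pos, "");

	list_for_each_entry(bfl, &fl->fl_block, fl_block)
		lock_get_status(f, bfl, iter->li_pos, " ->");

	return 0;
}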


> 
> --b.
> 
>> I don't have strong preference as long as I'm not breaking
>> userspace. As I said before - I think the CRIU guys might be using that
>> interface.
>>
>>>
>>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>>> +		return 0;
>>>> +
>>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>>  
>>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>>
Pavel Emelianov Aug. 3, 2016, 2:54 p.m.
On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>>> On busy container servers reading /proc/locks shows all the locks
>>> created by all clients. This can cause large latency spikes. In my
>>> case I observed lsof taking up to 5-10 seconds while processing around
>>> 50k locks. Fix this by limiting the locks shown only to those created
>>> in the same pidns as the one the proc was mounted in. When reading
>>> /proc/locks from the init_pid_ns show everything.
>>>
>>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>>> ---
>>>  fs/locks.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/fs/locks.c b/fs/locks.c
>>> index ee1b15f6fc13..751673d7f7fc 100644
>>> --- a/fs/locks.c
>>> +++ b/fs/locks.c
>>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>>  {
>>>  	struct locks_iterator *iter = f->private;
>>>  	struct file_lock *fl, *bfl;
>>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>>  
>>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>>  
>>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>>
>> Ok, so when you read from a process that's in the init_pid_ns
>> namespace, then you'll get the whole pile of locks, even when reading
>> this from a filesystem that was mounted in a different pid_ns?
>>
>> That seems odd to me if so. Any reason not to just uniformly use the
>> proc_pidns here?
> 
> [CCing some people from openvz/CRIU]

Thanks :)

> My train of thought was "we should have means which would be the one
> universal truth about everything and this would be a process in the
> init_pid_ns". I don't have strong preference as long as I'm not breaking
> userspace. As I said before - I think the CRIU guys might be using that
> interface.

This particular change won't break us mostly because we've switched to
reading the /proc/pid/fdinfo/n files for locks.

-- Pavel

>>
>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>> +		return 0;
>>> +
>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>  
>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>
> .
>
Nikolay Borisov Aug. 3, 2016, 3:00 p.m.
On 08/03/2016 05:54 PM, Pavel Emelyanov wrote:
> On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
>>
>>
[SNIP]
>>
>> [CCing some people from openvz/CRIU]
> 
> Thanks :)
> 
>> My train of thought was "we should have means which would be the one
>> universal truth about everything and this would be a process in the
>> init_pid_ns". I don't have strong preference as long as I'm not breaking
>> userspace. As I said before - I think the CRIU guys might be using that
>> interface.
> 
> This particular change won't break us mostly because we've switched to
> reading the /proc/pid/fdinfo/n files for locks.

[thinking out loud here]

I've never actually looked into those files but now that I have it seems
to make sense to also switch 'lsof' to actually reading the locks from
the available pids directories rather than relying on the global
/proc/locks interface. Oh well :)

[/thinking out loud here]

> 
> -- Pavel
> 
>>>
>>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>>> +		return 0;
>>>> +
>>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>>  
>>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>>
>> .
>>
> 
>
J. Bruce Fields Aug. 3, 2016, 3:06 p.m.
On Wed, Aug 03, 2016 at 06:00:18PM +0300, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 05:54 PM, Pavel Emelyanov wrote:
> > On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
> >>
> >>
> [SNIP]
> >>
> >> [CCing some people from openvz/CRIU]
> > 
> > Thanks :)
> > 
> >> My train of thought was "we should have means which would be the one
> >> universal truth about everything and this would be a process in the
> >> init_pid_ns". I don't have strong preference as long as I'm not breaking
> >> userspace. As I said before - I think the CRIU guys might be using that
> >> interface.
> > 
> > This particular change won't break us mostly because we've switched to
> > reading the /proc/pid/fdinfo/n files for locks.
> 
> [thinking out loud here]
> 
> I've never actually looked into those files but now that I have it seems
> to make sense to also switch 'lsof' to actually reading the locks from
> the available pids directories rather than relying on the global
> /proc/locks interface. Oh well :)

Digging around...  Oh, I see, there's an optional 'lock:..' line in
/proc/[pid]/fdinfo/[fd] file, is that what you're looking at?  I'd
forgotten.  Yeah, maybe that would make more sense long term.

--b.

> 
> [/thinking out loud here]
> 
> > 
> > -- Pavel
> > 
> >>>
> >>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> >>>> +		return 0;
> >>>> +
> >>>>  	lock_get_status(f, fl, iter->li_pos, "");
> >>>>  
> >>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> >>>
> >> .
> >>
> > 
> >
Nikolay Borisov Aug. 3, 2016, 3:10 p.m.
On 08/03/2016 06:06 PM, J. Bruce Fields wrote:
> Digging around...  Oh, I see, there's an optional 'lock:..' line in
> /proc/[pid]/fdinfo/[fd] file, is that what you're looking at?  I'd
> forgotten.  Yeah, maybe that would make more sense long term.

Yep, that's the one but this requires the userspace to be updated to use
that interface. In the meantime we could do away with some maintenance
of the existing /proc/locks :)
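
For reference, a rough userspace sketch of reading per-fd lock info instead of the global /proc/locks (assumptions: only the documented "lock:" line prefix in /proc/<pid>/fdinfo/<fd>; this is not lsof code, and a real tool would walk /proc/<pid>/fd rather than take the descriptor on the command line):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Print any "lock:" lines from /proc/<pid>/fdinfo/<fd>. */
static void print_fd_locks(pid_t pid, int fd)
{
	char path[64], line[256];
	FILE *fp;

	snprintf(path, sizeof(path), "/proc/%d/fdinfo/%d", (int)pid, fd);
	fp = fopen(path, "r");
	if (!fp)
		return;
	while (fgets(line, sizeof(line), fp))
		if (!strncmp(line, "lock:", 5))
			printf("%d/%d%s", (int)pid, fd, line + 5);
	fclose(fp);
}

int main(int argc, char **argv)
{
	if (argc != 3)
		return 1;
	print_fd_locks(atoi(argv[1]), atoi(argv[2]));
	return 0;
}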
Eric W. Biederman Aug. 3, 2016, 5:35 p.m.
Nikolay Borisov <kernel@kyup.com> writes:

> On 08/03/2016 06:06 PM, J. Bruce Fields wrote:
>> Digging around...  Oh, I see, there's an optional 'lock:..' line in
>> /proc/[pid]/fdinfo/[fd] file, is that what you're looking at?  I'd
>> forgotten.  Yeah, maybe that would make more sense long term.
>
> Yep, that's the one but this requires the userspace to be updated to use
> that interface. In the meantime we could do away with some maintenance
> of the existing /proc/locks :)

I am tempted to say let's not change /proc/locks at all, but if locks
really are in a pid namespace then I do think it makes sense to filter
them in /proc just so there is not excessive visibility outside of the
pid namespace.

Excessive visibility is a problem on its own.

Eric