[0/5,RFC] Add an interface to discover relationships between namespaces

Submitted by Andrey Vagin on July 21, 2016, 9:06 p.m.

Details

Message ID 20160721210650.GA10989@outlook.office365.com
State Rejected
Series "Series without cover letter"

Commit Message

Andrey Vagin July 21, 2016, 9:06 p.m.
On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Andrey,
> 
> On 07/14/2016 08:20 PM, Andrey Vagin wrote:

<snip>

> 
> Could you add here an of the API in detail: what do these FDs refer to,
> and how do you use them to solve the use case? And could you you add
> that info to the commit messages please.

Hi Michael,

A patch for man-pages is attached. It adds the following text to
namespaces(7).

Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
pace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning  user  names‐
      pace.

NS_GET_PARENT
      Returns  a  file  descriptor  that refers to a parent namespace.
      This ioctl(2) can be used for pid and user namespaces. For  user
      namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
      ing.

In addition to generic ioctl(2) errors, the following specific ones can
occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is  outside  of the current namespace
      scope.

ENOENT ns_fd refers to the init namespace.
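
As a minimal sketch of how a caller might use NS_GET_PARENT and report
the errors listed above: the helper names ns_get_parent() and
ns_parent_error() are hypothetical, and the sketch assumes the ioctl
constants are exported through <linux/nsfs.h>. The error meanings follow
the proposed text and may change in later versions of the series.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/nsfs.h>         /* NS_GET_PARENT (assumed header location) */

/* Hypothetical helper: returns a new fd for the parent namespace,
 * or -errno if the ioctl fails. */
int ns_get_parent(int ns_fd)
{
        int fd = ioctl(ns_fd, NS_GET_PARENT);

        return fd < 0 ? -errno : fd;
}

/* Hypothetical helper: maps the errors from the proposed man-page text
 * to a short explanation. */
const char *ns_parent_error(int err)
{
        switch (err) {
        case EINVAL:
                return "namespace type is not hierarchical";
        case EPERM:
                return "parent is outside the caller's namespace scope";
        case ENOENT:
                return "ns_fd refers to the init namespace";
        default:
                return "generic ioctl(2) error";
        }
}
```

A caller would pass a file descriptor opened on /proc/PID/ns/pid or
/proc/PID/ns/user and close the returned descriptor when done.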

Thanks,
Andrew

> 
> Thanks,
> 
> Michael
> 
> 
> > [1] https://lkml.org/lkml/2016/7/6/158
> > [2] https://lkml.org/lkml/2016/7/9/101
> > 
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
> > Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> > Cc: "W. Trevor King" <wking@tremily.us>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: Serge Hallyn <serge.hallyn@canonical.com>
> > 
> > --
> > 2.5.5
> > 
> > 
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
From 4b9194026f901c2247150bb3038c41658700f6dd Mon Sep 17 00:00:00 2001
From: Andrey Vagin <avagin@openvz.org>
Date: Thu, 21 Jul 2016 13:58:06 -0700
Subject: [PATCH] namespaces.7: describe NS_GET_USERNS and NS_GET_PARENT ioctls

Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 man7/namespaces.7 | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)


diff --git a/man7/namespaces.7 b/man7/namespaces.7
index 98ed3e5..207e4a5 100644
--- a/man7/namespaces.7
+++ b/man7/namespaces.7
@@ -149,6 +149,49 @@  even if all processes in the namespace terminate.
 The file descriptor can be passed to
 .BR setns (2).
 
+Since Linux 4.X, the following
+.BR ioctl (2)
+calls are supported for namespace file descriptors.
+The correct syntax is:
+.PP
+.RS
+.nf
+.IB fd " = ioctl(" ns_fd ", " ioctl_type ");"
+.fi
+.RE
+.PP
+where
+.I ioctl_type
+is one of the following:
+.TP
+.B NS_GET_USERNS
+Returns a file descriptor that refers to an owning user namespace.
+.TP
+.B NS_GET_PARENT
+Returns a file descriptor that refers to a parent namespace. This
+.BR ioctl (2)
+can be used for pid and user namespaces. For user namespaces,
+.B NS_GET_PARENT
+and
+.B NS_GET_USERNS
+have the same meaning.
+.PP
+In addition to generic
+.BR ioctl (2)
+errors, the following specific ones can occur:
+.PP
+.TP
+.B EINVAL
+.B NS_GET_PARENT
+was called for a nonhierarchical namespace.
+.TP
+.B EPERM
+The requested namespace is outside of the current namespace scope.
+.TP
+.B ENOENT
+.IB ns_fd
+refers to the init namespace.
+.PP
 In Linux 3.7 and earlier, these files were visible as hard links.
 Since Linux 3.8, they appear as symbolic links.
 If two processes are in the same namespace, then the inode numbers of their

Comments

Michael Kerrisk (man-pages) July 22, 2016, 6:48 a.m.
Hi Andrey,

On 07/21/2016 11:06 PM, Andrew Vagin wrote:
> On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Andrey,
>>
>> On 07/14/2016 08:20 PM, Andrey Vagin wrote:
>
> <snip>
>
>>
>> Could you add here an of the API in detail: what do these FDs refer to,
>> and how do you use them to solve the use case? And could you you add
>> that info to the commit messages please.
>
> Hi Michael,
>
> A patch for man-pages is attached. It adds the following text to
> namespaces(7).
>
> Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
> pace file descriptors.  The correct syntax is:
>
>       fd = ioctl(ns_fd, ioctl_type);
>
> where ioctl_type is one of the following:
>
> NS_GET_USERNS
>       Returns a file descriptor that refers to an owning  user  names‐
>       pace.
>
> NS_GET_PARENT
>       Returns  a  file  descriptor  that refers to a parent namespace.
>       This ioctl(2) can be used for pid and user namespaces. For  user
>       namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
>       ing.
>
> In addition to generic ioctl(2) errors, the following specific ones can
> occur:
>
> EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
>
> EPERM  The  requested  namespace  is  outside  of the current namespace
>       scope.
>
> ENOENT ns_fd refers to the init namespace.

Thanks for this. But still part of the question remains unanswered.
How do we (in user-space) use the file descriptors to answer any of
the questions that this patch series was designed to solve? (This
info should be in the commit message and the man-pages patch.)

Thanks,

Michael


>>> [1] https://lkml.org/lkml/2016/7/6/158
>>> [2] https://lkml.org/lkml/2016/7/9/101
>>>
>>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>>> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
>>> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
>>> Cc: "W. Trevor King" <wking@tremily.us>
>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>> Cc: Serge Hallyn <serge.hallyn@canonical.com>
>>>
>>> --
>>> 2.5.5
>>>
>>>
>>
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
Andrei Vagin July 22, 2016, 6:25 p.m.
On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hi Andrey,
>
>
> On 07/21/2016 11:06 PM, Andrew Vagin wrote:
>>
>> On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
>> wrote:
>>>
>>> Hi Andrey,
>>>
>>> On 07/14/2016 08:20 PM, Andrey Vagin wrote:
>>
>>
>> <snip>
>>
>>>
>>> Could you add here an of the API in detail: what do these FDs refer to,
>>> and how do you use them to solve the use case? And could you you add
>>> that info to the commit messages please.
>>
>>
>> Hi Michael,
>>
>> A patch for man-pages is attached. It adds the following text to
>> namespaces(7).
>>
>> Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
>> pace file descriptors.  The correct syntax is:
>>
>>       fd = ioctl(ns_fd, ioctl_type);
>>
>> where ioctl_type is one of the following:
>>
>> NS_GET_USERNS
>>       Returns a file descriptor that refers to an owning  user  names‐
>>       pace.
>>
>> NS_GET_PARENT
>>       Returns  a  file  descriptor  that refers to a parent namespace.
>>       This ioctl(2) can be used for pid and user namespaces. For  user
>>       namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
>>       ing.
>>
>> In addition to generic ioctl(2) errors, the following specific ones can
>> occur:
>>
>> EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
>>
>> EPERM  The  requested  namespace  is  outside  of the current namespace
>>       scope.
>>
>> ENOENT ns_fd refers to the init namespace.
>
>
> Thanks for this. But still part of the question remains unanswered.
> How do we (in user-space) use the file descriptors to answer any of
> the questions that this patch series was designed to solve? (This
> info should be in the commit message and the man-pages patch.)

I'm sorry, but I am not sure that I understand what you are asking.

Here are the original questions:
Someone else then asked me a question that led me to wonder about
generally introspecting on the parental relationships between user
namespaces and the association of other namespaces types with user
namespaces. One use would be visualization, in order to understand the
running system. Another would be to answer the question I already
mentioned: what capability does process X have to perform operations
on a resource governed by namespace Y?

Here is an example which shows how we can get the owning namespace
inode number by using these ioctl-s.

$ ls -l /proc/13929/ns/pid
lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'

$ ./nsowner /proc/13929/ns/pid
user:[4026532227]

The owning user namespace for pid:[4026532228] is user:[4026532227].

The nsowner tool is compiled from this code:

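#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nsfs.h>         /* NS_GET_USERNS */

int main(int argc, char *argv[])
{
        /* The initializer only sizes path[] for the largest fd number. */
        char buf[128], path[] = "/proc/self/fd/0123456789";
        int ns, uns, ret;

        if (argc < 2)
                return 1;

        /* Open the namespace file, e.g. /proc/PID/ns/pid. */
        ns = open(argv[1], O_RDONLY);
        if (ns < 0)
                return 1;

        /* Get a file descriptor for the owning user namespace. */
        uns = ioctl(ns, NS_GET_USERNS);
        if (uns < 0)
                return 1;

        /* The symlink target names the namespace, e.g. "user:[inode]". */
        snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
        ret = readlink(path, buf, sizeof(buf) - 1);
        if (ret < 0)
                return 1;
        buf[ret] = 0;

        printf("%s\n", buf);

        return 0;
}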

Does this example answer the original question? If it doesn't, could
you elaborate on what you expect to see here?

I also wrote one more example, which shows all relationships between
namespaces. It enumerates all processes in a system, collects all
namespaces and determines the parent and owning namespaces for each of
them, then it constructs a namespace tree and shows it.

Here is the code: https://gist.github.com/avagin/db805f95e15ffb0af7e559dbb8de4418

Here is an example of output for my test system:
[root@fc24 nsfs]# ./nstree
user:[4026531837]
 \__  mnt:[4026532203]
 \__  ipc:[4026531839]
 \__  user:[4026532224]
     \__  user:[4026532226]
         \__  user:[4026532227]
             \__  pid:[4026532228]
     \__  pid:[4026532225]
         \__  pid:[4026532228]
 \__  user:[4026532221]
     \__  pid:[4026532222]
     \__  user:[4026532223]
 \__  mnt:[4026532211]
 \__  uts:[4026531838]
 \__  cgroup:[4026531835]
 \__  pid:[4026531836]
     \__  pid:[4026532225]
         \__  pid:[4026532228]
     \__  pid:[4026532222]
 \__  mnt:[4026531857]
 \__  mnt:[4026531840]
 \__  net:[4026531957]
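
The parent chains in a tree like this one can be followed one step at a
time with NS_GET_PARENT. The sketch below is not taken from the nstree
gist: count_visible_ancestors() is a hypothetical helper, and it assumes
the ioctl constants are available from <linux/nsfs.h>. It walks upward
from a pid or user namespace until the ioctl fails and reports the errno
that ended the walk.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nsfs.h>         /* NS_GET_PARENT (assumed header location) */

/* Hypothetical helper: counts how many ancestors of ns_fd can be reached
 * with NS_GET_PARENT. The errno that stopped the walk is stored in
 * *stop_errno when that pointer is non-NULL. The caller's ns_fd is left
 * open; every intermediate descriptor is closed. */
int count_visible_ancestors(int ns_fd, int *stop_errno)
{
        int depth = 0;
        int cur = ns_fd;

        for (;;) {
                int parent = ioctl(cur, NS_GET_PARENT);
                int err = errno;        /* save before close() can change it */

                if (cur != ns_fd)
                        close(cur);
                if (parent < 0) {
                        if (stop_errno)
                                *stop_errno = err;
                        return depth;
                }
                depth++;
                cur = parent;
        }
}
```

With the error codes proposed in this thread, the walk would be expected
to end with ENOENT at the init namespace, EPERM when the next parent is
outside the caller's scope, or EINVAL for a nonhierarchical namespace type.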

Thanks,
Andrew

>
> Thanks,
>
> Michael
>
>
>>>> [1] https://lkml.org/lkml/2016/7/6/158
>>>> [2] https://lkml.org/lkml/2016/7/9/101
>>>>
>>>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
>>>> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
>>>> Cc: "W. Trevor King" <wking@tremily.us>
>>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>>> Cc: Serge Hallyn <serge.hallyn@canonical.com>
>>>>
>>>> --
>>>> 2.5.5
>>>>
>>>>
>>>
>>>
>>> --
>>> Michael Kerrisk
>>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>>> Linux/UNIX System Programming Training: http://man7.org/training/
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Michael Kerrisk (man-pages) July 25, 2016, 11:47 a.m.
Hi Andrey,

On 07/22/2016 08:25 PM, Andrey Vagin wrote:
> On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> Hi Andrey,
>>
>>
>> On 07/21/2016 11:06 PM, Andrew Vagin wrote:
>>>
>>> On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
>>> wrote:
>>>>
>>>> Hi Andrey,
>>>>
>>>> On 07/14/2016 08:20 PM, Andrey Vagin wrote:
>>>
>>>
>>> <snip>
>>>
>>>>
>>>> Could you add here an of the API in detail: what do these FDs refer to,
>>>> and how do you use them to solve the use case? And could you you add
>>>> that info to the commit messages please.
>>>
>>>
>>> Hi Michael,
>>>
>>> A patch for man-pages is attached. It adds the following text to
>>> namespaces(7).
>>>
>>> Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
>>> pace file descriptors.  The correct syntax is:
>>>
>>>       fd = ioctl(ns_fd, ioctl_type);
>>>
>>> where ioctl_type is one of the following:
>>>
>>> NS_GET_USERNS
>>>       Returns a file descriptor that refers to an owning  user  names‐
>>>       pace.
>>>
>>> NS_GET_PARENT
>>>       Returns  a  file  descriptor  that refers to a parent namespace.
>>>       This ioctl(2) can be used for pid and user namespaces. For  user
>>>       namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
>>>       ing.

For each of the above, I think it is worth mentioning that the
close-on-exec flag is set for the returned file descriptor.

>>>
>>> In addition to generic ioctl(2) errors, the following specific ones can
>>> occur:
>>>
>>> EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
>>>
>>> EPERM  The  requested  namespace  is  outside  of the current namespace
>>>       scope.

Perhaps add "and the caller does not have CAP_SYS_ADMIN in the initial
user namespace"?

>>>
>>> ENOENT ns_fd refers to the init namespace.
>>
>>
>> Thanks for this. But still part of the question remains unanswered.
>> How do we (in user-space) use the file descriptors to answer any of
>> the questions that this patch series was designed to solve? (This
>> info should be in the commit message and the man-pages patch.)
>
> I'm sorry, but I am not sure that I understand what you ask.
>
> Here are the origin questions:
> Someone else then asked me a question that led me to wonder about
> generally introspecting on the parental relationships between user
> namespaces and the association of other namespaces types with user
> namespaces. One use would be visualization, in order to understand the
> running system. Another would be to answer the question I already
> mentioned: what capability does process X have to perform operations
> on a resource governed by namespace Y?
>
> Here is an example which shows how we can get the owning namespace
> inode number by using these ioctl-s.
>
> $ ls -l /proc/13929/ns/pid
> lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'
>
> $ ./nsowner /proc/13929/ns/pid
> user:[4026532227]
>
> The owning user namespace for pid:[4026532228] is user:[4026532227].
>
> The nsowner  tool is cimpiled from this code:
>
> int main(int argc, char *argv[])
> {
>         char buf[128], path[] = "/proc/self/fd/0123456789";
>         int ns, uns, ret;
>
>         ns = open(argv[1], O_RDONLY);
>         if (ns < 0)
>                 return 1;
>
>         uns = ioctl(ns, NS_GET_USERNS);
>         if (uns < 0)
>                 return 1;
>
>         snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
>         ret = readlink(path, buf, sizeof(buf) - 1);
>         if (ret < 0)
>                 return 1;
>         buf[ret] = 0;
>
>         printf("%s\n", buf);
>
>         return 0;
> }

So, from my point of view, the important piece that was missing from
your commit message was the note to use readlink("/proc/self/fd/%d")
on the returned FDs. I think that detail needs to be part of the
commit message (and also the man page text). I think it would even be
helpful to include the above program as part of the commit message:
it helps people more quickly grasp the API.

> Does this example answer to the origin question?

Yes.

>If it isn't, could
> you eloborate what you expect to see here.
>
> And I wrote one more example which show all relationships between
> namespaces. It enumirates all processes in a system, collects all
> namespaces and determins parent and owning namespaces for each of
> them, then it constructs a namespace tree and shows it.
>
> Here is a code: https://gist.github.com/avagin/db805f95e15ffb0af7e559dbb8de4418

That's great! Thanks!
  
> Here is an example of output for my test system:
> [root@fc24 nsfs]# ./nstree
> user:[4026531837]
>  \__  mnt:[4026532203]
>  \__  ipc:[4026531839]
>  \__  user:[4026532224]
>      \__  user:[4026532226]
>          \__  user:[4026532227]
>              \__  pid:[4026532228]
>      \__  pid:[4026532225]
>          \__  pid:[4026532228]
>  \__  user:[4026532221]
>      \__  pid:[4026532222]
>      \__  user:[4026532223]
>  \__  mnt:[4026532211]
>  \__  uts:[4026531838]
>  \__  cgroup:[4026531835]
>  \__  pid:[4026531836]
>      \__  pid:[4026532225]
>          \__  pid:[4026532228]
>      \__  pid:[4026532222]
>  \__  mnt:[4026531857]
>  \__  mnt:[4026531840]
>  \__  net:[4026531957]

Cheers,

Michael

>>>>> [1] https://lkml.org/lkml/2016/7/6/158
>>>>> [2] https://lkml.org/lkml/2016/7/9/101
Eric W. Biederman July 25, 2016, 1:18 p.m.
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hi Andrey,
>
> On 07/22/2016 08:25 PM, Andrey Vagin wrote:
>> On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
>> <mtk.manpages@gmail.com> wrote:
>>> Hi Andrey,
>>>
>>>
>>> On 07/21/2016 11:06 PM, Andrew Vagin wrote:
>>>>
>>>> On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
>>>> wrote:
>>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> On 07/14/2016 08:20 PM, Andrey Vagin wrote:
>>>>
>>>>
>>>> <snip>
>>>>
>>>>>
>>>>> Could you add here an of the API in detail: what do these FDs refer to,
>>>>> and how do you use them to solve the use case? And could you you add
>>>>> that info to the commit messages please.
>>>>
>>>>
>>>> Hi Michael,
>>>>
>>>> A patch for man-pages is attached. It adds the following text to
>>>> namespaces(7).
>>>>
>>>> Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
>>>> pace file descriptors.  The correct syntax is:
>>>>
>>>>       fd = ioctl(ns_fd, ioctl_type);
>>>>
>>>> where ioctl_type is one of the following:
>>>>
>>>> NS_GET_USERNS
>>>>       Returns a file descriptor that refers to an owning  user  names‐
>>>>       pace.
>>>>
>>>> NS_GET_PARENT
>>>>       Returns  a  file  descriptor  that refers to a parent namespace.
>>>>       This ioctl(2) can be used for pid and user namespaces. For  user
>>>>       namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
>>>>       ing.
>
> For each of the above, I think it is worth mentioning that the
> close-on-exec flag is set for the returned file descriptor.

Hmm.  That is an odd default.

>>>>
>>>> In addition to generic ioctl(2) errors, the following specific ones can
>>>> occur:
>>>>
>>>> EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
>>>>
>>>> EPERM  The  requested  namespace  is  outside  of the current namespace
>>>>       scope.
>
> Perhaps add "and the caller does not have CAP_SYS_ADMIN" in the initial
> user namespace"?

Having looked at that bit of code I don't think capabilities really
have a role to play.

>>>> ENOENT ns_fd refers to the init namespace.
>>>
>>>
>>> Thanks for this. But still part of the question remains unanswered.
>>> How do we (in user-space) use the file descriptors to answer any of
>>> the questions that this patch series was designed to solve? (This
>>> info should be in the commit message and the man-pages patch.)
>>
>> I'm sorry, but I am not sure that I understand what you ask.
>>
>> Here are the origin questions:
>> Someone else then asked me a question that led me to wonder about
>> generally introspecting on the parental relationships between user
>> namespaces and the association of other namespaces types with user
>> namespaces. One use would be visualization, in order to understand the
>> running system. Another would be to answer the question I already
>> mentioned: what capability does process X have to perform operations
>> on a resource governed by namespace Y?
>>
>> Here is an example which shows how we can get the owning namespace
>> inode number by using these ioctl-s.
>>
>> $ ls -l /proc/13929/ns/pid
>> lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'
>>
>> $ ./nsowner /proc/13929/ns/pid
>> user:[4026532227]
>>
>> The owning user namespace for pid:[4026532228] is user:[4026532227].
>>
>> The nsowner  tool is cimpiled from this code:
>>
>> int main(int argc, char *argv[])
>> {
>>         char buf[128], path[] = "/proc/self/fd/0123456789";
>>         int ns, uns, ret;
>>
>>         ns = open(argv[1], O_RDONLY);
>>         if (ns < 0)
>>                 return 1;
>>
>>         uns = ioctl(ns, NS_GET_USERNS);
>>         if (uns < 0)
>>                 return 1;
>>
>>         snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
>>         ret = readlink(path, buf, sizeof(buf) - 1);
>>         if (ret < 0)
>>                 return 1;
>>         buf[ret] = 0;
>>
>>         printf("%s\n", buf);
>>
>>         return 0;
>> }
>
> So, from my point of view, the important piece that was missing from
> your commit message was the note to use readlink("/proc/self/fd/%d")
> on the returned FDs. I think that detail needs to be part of the
> commit message (and also the man page text). I think it even be
> helpful to include the above program as part of the commit message:
> it helps people more quickly grasp the API.

Please, please make the standard way to compare these things fstat.
That is much less magic than a symlink, and a little more future proof.
Possibly even kcmp.

At some point we will care about migrating a sub-container and we
may have to have some minor changes.

Eric
Michael Kerrisk (man-pages) July 25, 2016, 2:46 p.m.
Hi Eric,

On 07/25/2016 03:18 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>
>> Hi Andrey,
>>
>> On 07/22/2016 08:25 PM, Andrey Vagin wrote:
>>> On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
>>> <mtk.manpages@gmail.com> wrote:
>>>> Hi Andrey,
>>>>
>>>>
>>>> On 07/21/2016 11:06 PM, Andrew Vagin wrote:
>>>>>
>>>>> On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
>>>>> wrote:
>>>>>>
>>>>>> Hi Andrey,
>>>>>>
>>>>>> On 07/14/2016 08:20 PM, Andrey Vagin wrote:
>>>>>
>>>>>
>>>>> <snip>
>>>>>
>>>>>>
>>>>>> Could you add here an of the API in detail: what do these FDs refer to,
>>>>>> and how do you use them to solve the use case? And could you you add
>>>>>> that info to the commit messages please.
>>>>>
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> A patch for man-pages is attached. It adds the following text to
>>>>> namespaces(7).
>>>>>
>>>>> Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
>>>>> pace file descriptors.  The correct syntax is:
>>>>>
>>>>>       fd = ioctl(ns_fd, ioctl_type);
>>>>>
>>>>> where ioctl_type is one of the following:
>>>>>
>>>>> NS_GET_USERNS
>>>>>       Returns a file descriptor that refers to an owning  user  names‐
>>>>>       pace.
>>>>>
>>>>> NS_GET_PARENT
>>>>>       Returns  a  file  descriptor  that refers to a parent namespace.
>>>>>       This ioctl(2) can be used for pid and user namespaces. For  user
>>>>>       namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
>>>>>       ing.
>>
>> For each of the above, I think it is worth mentioning that the
>> close-on-exec flag is set for the returned file descriptor.
>
> Hmm.  That is an odd default.

Why do you say that? It's pretty common as the default for various
APIs that create new FDs these days. (There's of course a strong argument
that the original UNIX default was a design blunder...)

>>>>>
>>>>> In addition to generic ioctl(2) errors, the following specific ones can
>>>>> occur:
>>>>>
>>>>> EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
>>>>>
>>>>> EPERM  The  requested  namespace  is  outside  of the current namespace
>>>>>       scope.
>>
>> Perhaps add "and the caller does not have CAP_SYS_ADMIN" in the initial
>> user namespace"?
>
> Having looked at that bit of code I don't think capabilities really
> have a role to play.

Yes, I caught up with that now. I'll wait to see how this plays out
in the next patch version.

>>>>> ENOENT ns_fd refers to the init namespace.
>>>>
>>>>
>>>> Thanks for this. But still part of the question remains unanswered.
>>>> How do we (in user-space) use the file descriptors to answer any of
>>>> the questions that this patch series was designed to solve? (This
>>>> info should be in the commit message and the man-pages patch.)
>>>
>>> I'm sorry, but I am not sure that I understand what you ask.
>>>
>>> Here are the origin questions:
>>> Someone else then asked me a question that led me to wonder about
>>> generally introspecting on the parental relationships between user
>>> namespaces and the association of other namespaces types with user
>>> namespaces. One use would be visualization, in order to understand the
>>> running system. Another would be to answer the question I already
>>> mentioned: what capability does process X have to perform operations
>>> on a resource governed by namespace Y?
>>>
>>> Here is an example which shows how we can get the owning namespace
>>> inode number by using these ioctl-s.
>>>
>>> $ ls -l /proc/13929/ns/pid
>>> lrwxrwxrwx 1 root root 0 Jul 22 21:03 /proc/13929/ns/pid -> 'pid:[4026532228]'
>>>
>>> $ ./nsowner /proc/13929/ns/pid
>>> user:[4026532227]
>>>
>>> The owning user namespace for pid:[4026532228] is user:[4026532227].
>>>
>>> The nsowner  tool is cimpiled from this code:
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>         char buf[128], path[] = "/proc/self/fd/0123456789";
>>>         int ns, uns, ret;
>>>
>>>         ns = open(argv[1], O_RDONLY);
>>>         if (ns < 0)
>>>                 return 1;
>>>
>>>         uns = ioctl(ns, NS_GET_USERNS);
>>>         if (uns < 0)
>>>                 return 1;
>>>
>>>         snprintf(path, sizeof(path), "/proc/self/fd/%d", uns);
>>>         ret = readlink(path, buf, sizeof(buf) - 1);
>>>         if (ret < 0)
>>>                 return 1;
>>>         buf[ret] = 0;
>>>
>>>         printf("%s\n", buf);
>>>
>>>         return 0;
>>> }
>>
>> So, from my point of view, the important piece that was missing from
>> your commit message was the note to use readlink("/proc/self/fd/%d")
>> on the returned FDs. I think that detail needs to be part of the
>> commit message (and also the man page text). I think it even be
>> helpful to include the above program as part of the commit message:
>> it helps people more quickly grasp the API.
>
> Please, please make the standard way to compare these things fstat.
> That is much less magic than a symlink, and a little more future proof.
> Possibly even kcmp.

As in fstat() to get the st_ino field, right?

Cheers,

Michael

> At some point we will care about migrating a migrating sub-container and we
> may have to have some minor changes.
>
> Eric
>
Serge E. Hallyn July 25, 2016, 2:54 p.m.
Quoting Michael Kerrisk (man-pages) (mtk.manpages@gmail.com):
> Hi Eric,
> 
> On 07/25/2016 03:18 PM, Eric W. Biederman wrote:
> >"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> >
> >>Hi Andrey,
> >>
> >>On 07/22/2016 08:25 PM, Andrey Vagin wrote:
> >>>On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
> >>><mtk.manpages@gmail.com> wrote:
> >>>>Hi Andrey,
> >>>>
> >>>>
> >>>>On 07/21/2016 11:06 PM, Andrew Vagin wrote:
> >>>>>
> >>>>>On Thu, Jul 21, 2016 at 04:41:12PM +0200, Michael Kerrisk (man-pages)
> >>>>>wrote:
> >>>>>>
> >>>>>>Hi Andrey,
> >>>>>>
> >>>>>>On 07/14/2016 08:20 PM, Andrey Vagin wrote:
> >>>>>
> >>>>>
> >>>>><snip>
> >>>>>
> >>>>>>
> >>>>>>Could you add here an of the API in detail: what do these FDs refer to,
> >>>>>>and how do you use them to solve the use case? And could you you add
> >>>>>>that info to the commit messages please.
> >>>>>
> >>>>>
> >>>>>Hi Michael,
> >>>>>
> >>>>>A patch for man-pages is attached. It adds the following text to
> >>>>>namespaces(7).
> >>>>>
> >>>>>Since  Linux 4.X, the following ioctl(2) calls are supported for names‐
> >>>>>pace file descriptors.  The correct syntax is:
> >>>>>
> >>>>>      fd = ioctl(ns_fd, ioctl_type);
> >>>>>
> >>>>>where ioctl_type is one of the following:
> >>>>>
> >>>>>NS_GET_USERNS
> >>>>>      Returns a file descriptor that refers to an owning  user  names‐
> >>>>>      pace.
> >>>>>
> >>>>>NS_GET_PARENT
> >>>>>      Returns  a  file  descriptor  that refers to a parent namespace.
> >>>>>      This ioctl(2) can be used for pid and user namespaces. For  user
> >>>>>      namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
> >>>>>      ing.
> >>
> >>For each of the above, I think it is worth mentioning that the
> >>close-on-exec flag is set for the returned file descriptor.
> >
> >Hmm.  That is an odd default.
> 
> Why do you say that? It's pretty common as the default for various
> APIs that create new FDs these days. (There's of course a strong argument
> that the original UNIX default was a design blunder...)
> 
> >>>>>
> >>>>>In addition to generic ioctl(2) errors, the following specific ones can
> >>>>>occur:
> >>>>>
> >>>>>EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
> >>>>>
> >>>>>EPERM  The  requested  namespace  is  outside  of the current namespace
> >>>>>      scope.
> >>
> >>Perhaps add "and the caller does not have CAP_SYS_ADMIN" in the initial
> >>user namespace"?
> >
> >Having looked at that bit of code I don't think capabilities really
> >have a role to play.
> 
> Yes, I caught up with that now. I await to see how this plays out
> in the next patch version.

Thanks - that had caught my eye but I hadn't had time to look into the
justification for this.  Hiding this kind of thing indeed seems wrong to
me, unless there is a really good justification for it, i.e. a way
to use that info in an exploit.
Eric W. Biederman July 25, 2016, 2:59 p.m.
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hi Eric,
>
> On 07/25/2016 03:18 PM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>>
>>> Hi Andrey,
>>>
>>> On 07/22/2016 08:25 PM, Andrey Vagin wrote:
>>>> On Thu, Jul 21, 2016 at 11:48 PM, Michael Kerrisk (man-pages)
>>>> <mtk.manpages@gmail.com> wrote:
>>>>> Hi Andrey,
>>>>>
>>>>>
>>>>> On 07/21/2016 11:06 PM, Andrew Vagin wrote:
>>>>>>
[snip]
>>>>>> where ioctl_type is one of the following:
>>>>>>
>>>>>> NS_GET_USERNS
>>>>>>       Returns a file descriptor that refers to an owning  user  names‐
>>>>>>       pace.
>>>>>>
>>>>>> NS_GET_PARENT
>>>>>>       Returns  a  file  descriptor  that refers to a parent namespace.
>>>>>>       This ioctl(2) can be used for pid and user namespaces. For  user
>>>>>>       namespaces,  NS_GET_PARENT and NS_GET_USERNS have the same mean‐
>>>>>>       ing.
>>>
>>> For each of the above, I think it is worth mentioning that the
>>> close-on-exec flag is set for the returned file descriptor.
>>
>> Hmm.  That is an odd default.
>
> Why do you say that? It's pretty common as the default for various
> APIs that create new FDs these days. (There's of course a strong argument
> that the original UNIX default was a design blunder...)

Interesting.  I haven't kept up on that, but it seems reasonable.

[snip]
>>> So, from my point of view, the important piece that was missing from
>>> your commit message was the note to use readlink("/proc/self/fd/%d")
>>> on the returned FDs. I think that detail needs to be part of the
>>> commit message (and also the man page text). I think it even be
>>> helpful to include the above program as part of the commit message:
>>> it helps people more quickly grasp the API.
>>
>> Please, please make the standard way to compare these things fstat.
>> That is much less magic than a symlink, and a little more future proof.
>> Possibly even kcmp.
>
> As in fstat() to get the st_ino field, right?

Both the st_ino and st_dev fields.

The most likely change to support checkpoint/restart in the future is to
preserve st_ino across migrations and instantiate a different instance
of nsfs to hold the inode numbers from the previous machine.

We would need to handle the preservation carefully or else there is
a chance that two namespace file descriptors (collected from different
sources) with different st_dev and st_ino fields may actually refer to
the same object.

Which is a long way of saying we have the st_dev field please use it,
it may matter at some point.

Eric
Eric W. Biederman July 25, 2016, 3:17 p.m.
"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Michael Kerrisk (man-pages) (mtk.manpages@gmail.com):
>> Hi Eric,
>> 
>> On 07/25/2016 03:18 PM, Eric W. Biederman wrote:
>> >"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>> >
>> >>Hi Andrey,
>> >>
>> >>On 07/22/2016 08:25 PM, Andrey Vagin wrote:
>> >>Perhaps add "and the caller does not have CAP_SYS_ADMIN" in the initial
>> >>user namespace"?
>> >
>> >Having looked at that bit of code I don't think capabilities really
>> >have a role to play.
>> 
>> Yes, I caught up with that now. I await to see how this plays out
>> in the next patch version.
>
> Thanks - that had caught my eye but I hadn't had time to look into the
> justification for this.  Hiding this kind of thing indeed seems wrong to
> me, unless there is a really good justification for it, i.e. a way
> to use that info in an exploit.

To avoid breaking checkpoint/restart we need to limit information to the
namespaces the caller is a member of for the user and pid namespaces.

This roughly duplicates the parentage checks in ns_capable.

Conceptually this is the same as limiting .. in a chroot environment.

Eric