procfs: change the owner of non-dumpable and writeable files

Submitted by Aleksa Sarai on Jan. 18, 2017, 4:01 a.m.

Details

Message ID 20170118040159.4751-1-asarai@suse.de
State New
Series "procfs: change the owner of non-dumpable and writeable files"
Headers show

Commit Message

Aleksa Sarai Jan. 18, 2017, 4:01 a.m.
In order to protect against ptrace(2) and similar attacks on container
runtimes when they join namespaces, many runtimes set mm->dumpable to
SUID_DUMP_DISABLE. However, doing this means that attempting to set up
an unprivileged user namespace will fail because an unprivileged process
can no longer access /proc/self/{setgroups,{uid,gid}_map} for the
container process (which is the same uid as the runtime process).

Fix this by changing pid_getattr to *also* change the owner of regular
files that have a mode of 0644 (when the process is not dumpable). This
ensures that the important /proc/[pid]/... files mentioned above are
properly accessible by a container runtime in a rootless container
context.

The most blantant issue is that a non-dumpable process in a rootless
container context is unable to open /proc/self/setgroups, because it
doesn't own the file.

int main(void)
{
	prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
	unshare(CLONE_NEWUSER);

	/* This will fail. */
	int fd = open("/proc/self/setgroups", O_WRONLY);
	if (fd < 0)
		abort();

	return 0;
}

Cc: dev@opencontainers.org
Cc: containers@lists.linux-foundation.org
Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
 fs/proc/base.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ca651ac00660..ebabb12f4536 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1729,6 +1729,7 @@  int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 			return -ENOENT;
 		}
 		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
+		    (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) ||
 		    task_dumpable(task)) {
 			cred = __task_cred(task);
 			stat->uid = cred->euid;
@@ -1770,6 +1771,7 @@  int pid_revalidate(struct dentry *dentry, unsigned int flags)
 
 	if (task) {
 		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
+		    (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) ||
 		    task_dumpable(task)) {
 			rcu_read_lock();
 			cred = __task_cred(task);
@@ -2394,7 +2396,7 @@  static int proc_pident_instantiate(struct inode *dir,
 	return -ENOENT;
 }
 
-static struct dentry *proc_pident_lookup(struct inode *dir, 
+static struct dentry *proc_pident_lookup(struct inode *dir,
 					 struct dentry *dentry,
 					 const struct pid_entry *ents,
 					 unsigned int nents)
@@ -2536,7 +2538,7 @@  static const struct pid_entry attr_dir_stuff[] = {
 
 static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
 {
-	return proc_pident_readdir(file, ctx, 
+	return proc_pident_readdir(file, ctx,
 				   attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
 }
 

Comments

Kees Cook Jan. 18, 2017, 11:22 p.m.
On Tue, Jan 17, 2017 at 8:01 PM, Aleksa Sarai <asarai@suse.de> wrote:
> In order to protect against ptrace(2) and similar attacks on container
> runtimes when they join namespaces, many runtimes set mm->dumpable to
> SUID_DUMP_DISABLE. However, doing this means that attempting to set up
> an unprivileged user namespace will fail because an unprivileged process
> can no longer access /proc/self/{setgroups,{uid,gid}_map} for the
> container process (which is the same uid as the runtime process).
>
> Fix this by changing pid_getattr to *also* change the owner of regular
> files that have a mode of 0644 (when the process is not dumpable). This
> ensures that the important /proc/[pid]/... files mentioned above are
> properly accessible by a container runtime in a rootless container
> context.
>
> The most blantant issue is that a non-dumpable process in a rootless
> container context is unable to open /proc/self/setgroups, because it
> doesn't own the file.

This changes a lot more than just setgroups, doesn't it? This bypasses
the task_dumpable check for all kinds of things. Though, I expect the
has_pid_permissions() check to be the harder one to pass. Why does
has_pid_permissions() succeed in the case you've given?

-Kees

> int main(void)
> {
>         prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
>         unshare(CLONE_NEWUSER);
>
>         /* This will fail. */
>         int fd = open("/proc/self/setgroups", O_WRONLY);
>         if (fd < 0)
>                 abort();
>
>         return 0;
> }
>
> Cc: dev@opencontainers.org
> Cc: containers@lists.linux-foundation.org
> Signed-off-by: Aleksa Sarai <asarai@suse.de>
> ---
>  fs/proc/base.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ca651ac00660..ebabb12f4536 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
>                         return -ENOENT;
>                 }
>                 if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
> +                   (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) ||
>                     task_dumpable(task)) {
>                         cred = __task_cred(task);
>                         stat->uid = cred->euid;
> @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
>
>         if (task) {
>                 if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
> +                   (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) ||
>                     task_dumpable(task)) {
>                         rcu_read_lock();
>                         cred = __task_cred(task);
> @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir,
>         return -ENOENT;
>  }
>
> -static struct dentry *proc_pident_lookup(struct inode *dir,
> +static struct dentry *proc_pident_lookup(struct inode *dir,
>                                          struct dentry *dentry,
>                                          const struct pid_entry *ents,
>                                          unsigned int nents)
> @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = {
>
>  static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
>  {
> -       return proc_pident_readdir(file, ctx,
> +       return proc_pident_readdir(file, ctx,
>                                    attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
>  }
>
> --
> 2.11.0
>
Aleksa Sarai Jan. 18, 2017, 11:34 p.m.
>> In order to protect against ptrace(2) and similar attacks on container
>> runtimes when they join namespaces, many runtimes set mm->dumpable to
>> SUID_DUMP_DISABLE. However, doing this means that attempting to set up
>> an unprivileged user namespace will fail because an unprivileged process
>> can no longer access /proc/self/{setgroups,{uid,gid}_map} for the
>> container process (which is the same uid as the runtime process).
>>
>> Fix this by changing pid_getattr to *also* change the owner of regular
>> files that have a mode of 0644 (when the process is not dumpable). This
>> ensures that the important /proc/[pid]/... files mentioned above are
>> properly accessible by a container runtime in a rootless container
>> context.
>>
>> The most blantant issue is that a non-dumpable process in a rootless
>> container context is unable to open /proc/self/setgroups, because it
>> doesn't own the file.
>
> This changes a lot more than just setgroups, doesn't it? This bypasses
> the task_dumpable check for all kinds of things.

Yeah. I can special case /proc/self/setgroups as well as uid_map, 
gid_map and the other *specific* things that runC needs. But ultimately 
I think we should come up with agreement on what things should always 
appear to be owned by the process's user.

> Though, I expect the
> has_pid_permissions() check to be the harder one to pass. Why does
> has_pid_permissions() succeed in the case you've given?

Because the group id of the container process is the same as the parent 
process, so is_group_p() will succeed. Also hide_pid_min is set such 
that it will work in either case.
Michal Hocko Jan. 19, 2017, 9:29 a.m.
Cc Eric

On Wed 18-01-17 15:01:59, Aleksa Sarai wrote:
> In order to protect against ptrace(2) and similar attacks on container
> runtimes when they join namespaces, many runtimes set mm->dumpable to
> SUID_DUMP_DISABLE. However, doing this means that attempting to set up
> an unprivileged user namespace will fail because an unprivileged process
> can no longer access /proc/self/{setgroups,{uid,gid}_map} for the
> container process (which is the same uid as the runtime process).
> 
> Fix this by changing pid_getattr to *also* change the owner of regular
> files that have a mode of 0644 (when the process is not dumpable). This
> ensures that the important /proc/[pid]/... files mentioned above are
> properly accessible by a container runtime in a rootless container
> context.
> 
> The most blantant issue is that a non-dumpable process in a rootless
> container context is unable to open /proc/self/setgroups, because it
> doesn't own the file.
> 
> int main(void)
> {
> 	prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
> 	unshare(CLONE_NEWUSER);
> 
> 	/* This will fail. */
> 	int fd = open("/proc/self/setgroups", O_WRONLY);
> 	if (fd < 0)
> 		abort();
> 
> 	return 0;
> }

I do agree that failing to open anything in /proc/self/ is more than
unexepcted! I cannot judge the patch but my gut feeling tells me that
the fix should be somewhere in the open handler.

One nit below

> 
> Cc: dev@opencontainers.org
> Cc: containers@lists.linux-foundation.org
> Signed-off-by: Aleksa Sarai <asarai@suse.de>
> ---
>  fs/proc/base.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ca651ac00660..ebabb12f4536 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
>  			return -ENOENT;
>  		}
>  		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
> +		    (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) ||
>  		    task_dumpable(task)) {
>  			cred = __task_cred(task);
>  			stat->uid = cred->euid;
> @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
>  
>  	if (task) {
>  		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
> +		    (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) ||
>  		    task_dumpable(task)) {
>  			rcu_read_lock();
>  			cred = __task_cred(task);
> @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir,
>  	return -ENOENT;
>  }
>  
> -static struct dentry *proc_pident_lookup(struct inode *dir, 
> +static struct dentry *proc_pident_lookup(struct inode *dir,
>  					 struct dentry *dentry,
>  					 const struct pid_entry *ents,
>  					 unsigned int nents)
> @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = {
>  
>  static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
>  {
> -	return proc_pident_readdir(file, ctx, 
> +	return proc_pident_readdir(file, ctx,
>  				   attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
>  }

this two are just whitespace noise
Aleksa Sarai Jan. 19, 2017, 1:08 p.m.
>> In order to protect against ptrace(2) and similar attacks on container
>> runtimes when they join namespaces, many runtimes set mm->dumpable to
>> SUID_DUMP_DISABLE. However, doing this means that attempting to set up
>> an unprivileged user namespace will fail because an unprivileged process
>> can no longer access /proc/self/{setgroups,{uid,gid}_map} for the
>> container process (which is the same uid as the runtime process).
>>
>> Fix this by changing pid_getattr to *also* change the owner of regular
>> files that have a mode of 0644 (when the process is not dumpable). This
>> ensures that the important /proc/[pid]/... files mentioned above are
>> properly accessible by a container runtime in a rootless container
>> context.
>>
>> The most blantant issue is that a non-dumpable process in a rootless
>> container context is unable to open /proc/self/setgroups, because it
>> doesn't own the file.
>>
>> int main(void)
>> {
>> 	prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
>> 	unshare(CLONE_NEWUSER);
>>
>> 	/* This will fail. */
>> 	int fd = open("/proc/self/setgroups", O_WRONLY);
>> 	if (fd < 0)
>> 		abort();
>>
>> 	return 0;
>> }
>
> I do agree that failing to open anything in /proc/self/ is more than
> unexepcted! I cannot judge the patch but my gut feeling tells me that
> the fix should be somewhere in the open handler.

Maybe that would suffice as a more specific fix (for the special case of 
/proc/self), but the fact that none of the users and groups are 
correctly set in /proc/[pid] will cause issues for runC and other 
container runtimes (because they don't go through /proc/self -- it's 
accessing /proc/[pid] from another process).

Though I get the feeling that the *correct* fix would be to remove the 
conditional and *always* change the owner -- maybe I'm missing something 
but I can't think of the security issue that this code currently fixes 
(since all of the important permission checks are *in addition* to the 
generic_permission used for /proc/self/..., which use ptrace_may_access).