[v6,5/6] binfmt_*: scope path resolution of interpreters

Submitted by Aleksa Sarai on May 6, 2019, 4:54 p.m.

Details

Message ID 20190506165439.9155-6-cyphar@cyphar.com
State New
Series "namei: resolveat(2) path resolution restriction API"
Headers show

Commit Message

Aleksa Sarai May 6, 2019, 4:54 p.m.
The need to be able to scope path resolution of interpreters became
clear with one of the possible vectors used in CVE-2019-5736 (which
most major container runtimes were vulnerable to).

Naively, it might seem that openat(2) -- which supports path scoping --
can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
binary being executed. Unfortunately, a "bad binary" (usually a symlink)
could be written as a #!-style script with the symlink target as the
interpreter -- which would be completely missed by just scoping the
openat(2). An example of this being exploitable is CVE-2019-5736.

In order to get around this, we need to pass down to each binfmt_*
implementation the scoping flags requested in execveat(2). In order to
maintain backwards-compatibility we only pass the scoping AT_* flags.

To avoid breaking userspace (in the exceptionally rare cases where you
have #!-scripts with a relative path being execveat(2)-ed with dfd !=
AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
set in execveat(2).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/binfmt_elf.c            |  2 +-
 fs/binfmt_elf_fdpic.c      |  2 +-
 fs/binfmt_em86.c           |  4 ++--
 fs/binfmt_misc.c           |  2 +-
 fs/binfmt_script.c         |  2 +-
 fs/exec.c                  | 26 ++++++++++++++++++++++----
 include/linux/binfmts.h    |  1 +
 include/linux/fs.h         |  9 +++++++--
 include/uapi/linux/fcntl.h |  6 ++++++
 9 files changed, 42 insertions(+), 12 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7d09d125f148..473420eed477 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -772,7 +772,7 @@  static int load_elf_binary(struct linux_binprm *bprm)
 			if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
 				goto out_free_interp;
 
-			interpreter = open_exec(elf_interpreter);
+			interpreter = openat_exec(bprm->dfd, elf_interpreter, bprm->flags);
 			retval = PTR_ERR(interpreter);
 			if (IS_ERR(interpreter))
 				goto out_free_interp;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b53bb3729ac1..c463c6428f77 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -263,7 +263,7 @@  static int load_elf_fdpic_binary(struct linux_binprm *bprm)
 			kdebug("Using ELF interpreter %s", interpreter_name);
 
 			/* replace the program with the interpreter */
-			interpreter = open_exec(interpreter_name);
+			interpreter = openat_exec(bprm->dfd, interpreter_name, bprm->flags);
 			retval = PTR_ERR(interpreter);
 			if (IS_ERR(interpreter)) {
 				interpreter = NULL;
diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index dd2d3f0cd55d..3ee46b0dc0d4 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -81,10 +81,10 @@  static int load_em86(struct linux_binprm *bprm)
 
 	/*
 	 * OK, now restart the process with the interpreter's inode.
-	 * Note that we use open_exec() as the name is now in kernel
+	 * Note that we use openat_exec() as the name is now in kernel
 	 * space, and we don't need to copy it.
 	 */
-	file = open_exec(interp);
+	file = openat_exec(binprm->dfd, interp, binprm->flags);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index aa4a7a23ff99..573ef06ff5a1 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -209,7 +209,7 @@  static int load_misc_binary(struct linux_binprm *bprm)
 		if (!IS_ERR(interp_file))
 			deny_write_access(interp_file);
 	} else {
-		interp_file = open_exec(fmt->interpreter);
+		interp_file = openat_exec(bprm->dfd, fmt->interpreter, bprm->flags);
 	}
 	retval = PTR_ERR(interp_file);
 	if (IS_ERR(interp_file))
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index e996174cbfc0..aaff12d12bfd 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -137,7 +137,7 @@  static int load_script(struct linux_binprm *bprm)
 	/*
 	 * OK, now restart the process with the interpreter's dentry.
 	 */
-	file = open_exec(i_name);
+	file = openat_exec(bprm->dfd, i_name, bprm->flags);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..d0f244d9a541 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -846,12 +846,24 @@  static struct file *do_open_execat(int fd, struct filename *name, int flags)
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_BENEATH |
+		       AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+		       AT_THIS_ROOT)) != 0)
 		return ERR_PTR(-EINVAL);
 	if (flags & AT_SYMLINK_NOFOLLOW)
 		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
 	if (flags & AT_EMPTY_PATH)
 		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+	if (flags & AT_BENEATH)
+		open_exec_flags.lookup_flags |= LOOKUP_BENEATH;
+	if (flags & AT_XDEV)
+		open_exec_flags.lookup_flags |= LOOKUP_XDEV;
+	if (flags & AT_NO_MAGICLINKS)
+		open_exec_flags.lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	if (flags & AT_NO_SYMLINKS)
+		open_exec_flags.lookup_flags |= LOOKUP_NO_SYMLINKS;
+	if (flags & AT_THIS_ROOT)
+		open_exec_flags.lookup_flags |= LOOKUP_IN_ROOT;
 
 	file = do_filp_open(fd, name, &open_exec_flags);
 	if (IS_ERR(file))
@@ -879,18 +891,18 @@  static struct file *do_open_execat(int fd, struct filename *name, int flags)
 	return ERR_PTR(err);
 }
 
-struct file *open_exec(const char *name)
+struct file *openat_exec(int dfd, const char *name, int flags)
 {
 	struct filename *filename = getname_kernel(name);
 	struct file *f = ERR_CAST(filename);
 
 	if (!IS_ERR(filename)) {
-		f = do_open_execat(AT_FDCWD, filename, 0);
+		f = do_open_execat(dfd, filename, flags);
 		putname(filename);
 	}
 	return f;
 }
-EXPORT_SYMBOL(open_exec);
+EXPORT_SYMBOL(openat_exec);
 
 int kernel_read_file(struct file *file, void **buf, loff_t *size,
 		     loff_t max_size, enum kernel_read_file_id id)
@@ -1762,6 +1774,12 @@  static int __do_execve_file(int fd, struct filename *filename,
 
 	sched_exec();
 
+	bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
+			       AT_THIS_ROOT);
+	bprm->dfd = AT_FDCWD;
+	if (bprm->flags)
+		bprm->dfd = fd;
+
 	bprm->file = file;
 	if (!filename) {
 		bprm->filename = "none";
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 688ab0de7810..e4da2d36e97f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -50,6 +50,7 @@  struct linux_binprm {
 	unsigned int taso:1;
 #endif
 	unsigned int recursion_depth; /* only for search_binary_handler() */
+	int dfd, flags;		/* passed down to execat_open() */
 	struct file * file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fe94c48481a6..cdcd2e021101 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2945,8 +2945,13 @@  extern int kernel_read_file_from_fd(int, void **, loff_t *, loff_t,
 extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
 extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
 extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
-extern struct file * open_exec(const char *);
- 
+
+extern struct file *openat_exec(int, const char *, int);
+static inline struct file *open_exec(const char *name)
+{
+	return openat_exec(AT_FDCWD, name, 0);
+}
+
 /* fs/dcache.c -- generic fs support functions */
 extern bool is_subdir(struct dentry *, struct dentry *);
 extern bool path_is_under(const struct path *, const struct path *);
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..bfca7f87cd2a 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -93,5 +93,11 @@ 
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+#define AT_RESOLUTION_TYPE	0x1F0000 /* Type of path-resolution scoping we are applying. */
+#define AT_BENEATH		0x010000 /* - Block "lexical" trickery like "..", symlinks, absolute paths, etc. */
+#define AT_XDEV			0x020000 /* - Block mount-point crossings (includes bind-mounts). */
+#define AT_NO_MAGICLINKS	0x040000 /* - Block procfs-style "magic" symlinks. */
+#define AT_NO_SYMLINKS		0x080000 /* - Block all symlinks (implies AT_NO_MAGICLINKS). */
+#define AT_THIS_ROOT		0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
 
 #endif /* _UAPI_LINUX_FCNTL_H */

Comments

Dmitry Safonov via Containers May 6, 2019, 6:37 p.m.
On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> The need to be able to scope path resolution of interpreters became
> clear with one of the possible vectors used in CVE-2019-5736 (which
> most major container runtimes were vulnerable to).
>
> Naively, it might seem that openat(2) -- which supports path scoping --
> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> could be written as a #!-style script with the symlink target as the
> interpreter -- which would be completely missed by just scoping the
> openat(2). An example of this being exploitable is CVE-2019-5736.
>
> In order to get around this, we need to pass down to each binfmt_*
> implementation the scoping flags requested in execveat(2). In order to
> maintain backwards-compatibility we only pass the scoping AT_* flags.
>
> To avoid breaking userspace (in the exceptionally rare cases where you
> have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> set in execveat(2).

This seems extremely dangerous. I like the overall series, but not this patch.

> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
>
>         sched_exec();
>
> +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> +                              AT_THIS_ROOT);
[...]
> +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */

So now what happens if there is a setuid root ELF binary with program
interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
going to let the unprivileged user decide which interpreter the
setuid-root process should use? From a high-level perspective, opening
the interpreter should be controlled by the program that is being
loaded, not by the program that invoked it.


In my opinion, CVE-2019-5736 points out two different problems:

The big problem: The __ptrace_may_access() logic has a special-case
short-circuit for "introspection" that you can't opt out of; this
makes it possible to open things in procfs that are related to the
current process even if the credentials of the process wouldn't permit
accessing another process like it. I think the proper fix to deal with
this would be to add a prctl() flag for "set whether introspection is
allowed for this process", and if userspace has manually un-set that
flag, any introspection special-case logic would be skipped.

An additional problem: /proc/*/exe can be used to open a file for
writing; I think it may have been Andy Lutomirski who pointed out some
time ago that it would be nice if you couldn't use /proc/*/fd/* to
re-open files with more privileges, which is sort of the same thing.
Aleksa Sarai May 6, 2019, 7:17 p.m.
On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > The need to be able to scope path resolution of interpreters became
> > clear with one of the possible vectors used in CVE-2019-5736 (which
> > most major container runtimes were vulnerable to).
> >
> > Naively, it might seem that openat(2) -- which supports path scoping --
> > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> > binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> > could be written as a #!-style script with the symlink target as the
> > interpreter -- which would be completely missed by just scoping the
> > openat(2). An example of this being exploitable is CVE-2019-5736.
> >
> > In order to get around this, we need to pass down to each binfmt_*
> > implementation the scoping flags requested in execveat(2). In order to
> > maintain backwards-compatibility we only pass the scoping AT_* flags.
> >
> > To avoid breaking userspace (in the exceptionally rare cases where you
> > have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> > set in execveat(2).
> 
> This seems extremely dangerous. I like the overall series, but not this patch.
> 
> > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
> >
> >         sched_exec();
> >
> > +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> > +                              AT_THIS_ROOT);
> [...]
> > +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
> 
> So now what happens if there is a setuid root ELF binary with program
> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
> going to let the unprivileged user decide which interpreter the
> setuid-root process should use? From a high-level perspective, opening
> the interpreter should be controlled by the program that is being
> loaded, not by the program that invoked it.

I went a bit nuts with openat_exec(), and I did end up adding it to the
ELF interpreter lookup (and you're completely right that this is a bad
idea -- I will drop it from this patch if it's included in the next
series).

The proposed solutions you give below are much nicer than this patch so
I can drop it and work on fixing those issues separately.

> In my opinion, CVE-2019-5736 points out two different problems:
>
> The big problem: The __ptrace_may_access() logic has a special-case
> short-circuit for "introspection" that you can't opt out of; this
> makes it possible to open things in procfs that are related to the
> current process even if the credentials of the process wouldn't permit
> accessing another process like it. I think the proper fix to deal with
> this would be to add a prctl() flag for "set whether introspection is
> allowed for this process", and if userspace has manually un-set that
> flag, any introspection special-case logic would be skipped.

We could do PR_SET_DUMPABLE=3 for this, I guess?

> An additional problem: /proc/*/exe can be used to open a file for
> writing; I think it may have been Andy Lutomirski who pointed out some
> time ago that it would be nice if you couldn't use /proc/*/fd/* to
> re-open files with more privileges, which is sort of the same thing.

This is something I'm currently working on a series for, which would
boil down to some restrictions on how re-opening of file descriptors
works through procfs.

However, execveat() of a procfs magiclink is a bit hard to block --
there is no way for userspace to to represent a file being "open for
execute" so they are all "open for execute" by default and blocking it
outright seems a bit extreme (though I actually hope to eventually add
the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
why I've reserved some bits).

(Thinking more about it, there is an argument that I should include the
above patch into this series so that we can block re-opening of fds
opened through resolveat(2) without explicit flags from the outset.)
Andy Lutomirski May 6, 2019, 11:41 p.m.
> On May 6, 2019, at 12:17 PM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
>> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
>>> On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>>> The need to be able to scope path resolution of interpreters became
>>> clear with one of the possible vectors used in CVE-2019-5736 (which
>>> most major container runtimes were vulnerable to).
>>> 
>>> Naively, it might seem that openat(2) -- which supports path scoping --
>>> can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
>>> binary being executed. Unfortunately, a "bad binary" (usually a symlink)
>>> could be written as a #!-style script with the symlink target as the
>>> interpreter -- which would be completely missed by just scoping the
>>> openat(2). An example of this being exploitable is CVE-2019-5736.
>>> 
>>> In order to get around this, we need to pass down to each binfmt_*
>>> implementation the scoping flags requested in execveat(2). In order to
>>> maintain backwards-compatibility we only pass the scoping AT_* flags.
>>> 
>>> To avoid breaking userspace (in the exceptionally rare cases where you
>>> have #!-scripts with a relative path being execveat(2)-ed with dfd !=
>>> AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
>>> set in execveat(2).
>> 
>> This seems extremely dangerous. I like the overall series, but not this patch.
>> 
>>> @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
>>> 
>>>        sched_exec();
>>> 
>>> +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
>>> +                              AT_THIS_ROOT);
>> [...]
>>> +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
>> 
>> So now what happens if there is a setuid root ELF binary with program
>> interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
>> unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
>> going to let the unprivileged user decide which interpreter the
>> setuid-root process should use? From a high-level perspective, opening
>> the interpreter should be controlled by the program that is being
>> loaded, not by the program that invoked it.
> 
> I went a bit nuts with openat_exec(), and I did end up adding it to the
> ELF interpreter lookup (and you're completely right that this is a bad
> idea -- I will drop it from this patch if it's included in the next
> series).
> 
> The proposed solutions you give below are much nicer than this patch so
> I can drop it and work on fixing those issues separately.
> 
>> In my opinion, CVE-2019-5736 points out two different problems:
>> 
>> The big problem: The __ptrace_may_access() logic has a special-case
>> short-circuit for "introspection" that you can't opt out of; this
>> makes it possible to open things in procfs that are related to the
>> current process even if the credentials of the process wouldn't permit
>> accessing another process like it. I think the proper fix to deal with
>> this would be to add a prctl() flag for "set whether introspection is
>> allowed for this process", and if userspace has manually un-set that
>> flag, any introspection special-case logic would be skipped.
> 
> We could do PR_SET_DUMPABLE=3 for this, I guess?
> 
>> An additional problem: /proc/*/exe can be used to open a file for
>> writing; I think it may have been Andy Lutomirski who pointed out some
>> time ago that it would be nice if you couldn't use /proc/*/fd/* to
>> re-open files with more privileges, which is sort of the same thing.
> 
> This is something I'm currently working on a series for, which would
> boil down to some restrictions on how re-opening of file descriptors
> works through procfs.
> 
> However, execveat() of a procfs magiclink is a bit hard to block --
> there is no way for userspace to to represent a file being "open for
> execute" so they are all "open for execute" by default and blocking it
> outright seems a bit extreme (though I actually hope to eventually add
> the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
> why I've reserved some bits).

There’s an O_MAYEXEC series floating around.

> 
> (Thinking more about it, there is an argument that I should include the
> above patch into this series so that we can block re-opening of fds
> opened through resolveat(2) without explicit flags from the outset.)
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>
Eric W. Biederman May 8, 2019, 12:38 a.m.
Jann Horn <jannh@google.com> writes:
>
> In my opinion, CVE-2019-5736 points out two different problems:
>
> The big problem: The __ptrace_may_access() logic has a special-case
> short-circuit for "introspection" that you can't opt out of;

Once upon a time in a galaxy far far away I fixed a bug where we missing
ptrace_may_access checks on various proc files and systems using selinux
stopped working.  At the time selinux did not allow ptrace like access
to yourself.  The "introspection" special case was the quick and simple
work-around.

There is nothing fundamental in having the "introspection" special case
except that various lsms have probably grown to depend upon it being
there.  I expect without difficulty we could move the check down
into the various lsms.  Which would get that check out of the core
kernel code.

Then the special case would the lsms challenge to keep or remove.

Eric
Aleksa Sarai May 8, 2019, 12:54 a.m.
On 2019-05-07, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > On Mon, May 6, 2019 at 6:56 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > The need to be able to scope path resolution of interpreters became
> > > clear with one of the possible vectors used in CVE-2019-5736 (which
> > > most major container runtimes were vulnerable to).
> > >
> > > Naively, it might seem that openat(2) -- which supports path scoping --
> > > can be combined with execveat(AT_EMPTY_PATH) to trivially scope the
> > > binary being executed. Unfortunately, a "bad binary" (usually a symlink)
> > > could be written as a #!-style script with the symlink target as the
> > > interpreter -- which would be completely missed by just scoping the
> > > openat(2). An example of this being exploitable is CVE-2019-5736.
> > >
> > > In order to get around this, we need to pass down to each binfmt_*
> > > implementation the scoping flags requested in execveat(2). In order to
> > > maintain backwards-compatibility we only pass the scoping AT_* flags.
> > >
> > > To avoid breaking userspace (in the exceptionally rare cases where you
> > > have #!-scripts with a relative path being execveat(2)-ed with dfd !=
> > > AT_FDCWD), we only pass dfd down to binfmt_* if any of our new flags are
> > > set in execveat(2).
> > 
> > This seems extremely dangerous. I like the overall series, but not this patch.
> > 
> > > @@ -1762,6 +1774,12 @@ static int __do_execve_file(int fd, struct filename *filename,
> > >
> > >         sched_exec();
> > >
> > > +       bprm->flags = flags & (AT_XDEV | AT_NO_MAGICLINKS | AT_NO_SYMLINKS |
> > > +                              AT_THIS_ROOT);
> > [...]
> > > +#define AT_THIS_ROOT           0x100000 /* - Scope ".." resolution to dirfd (like chroot(2)). */
> > 
> > So now what happens if there is a setuid root ELF binary with program
> > interpreter "/lib64/ld-linux-x86-64.so.2" (like /bin/su), and an
> > unprivileged user runs it with execveat(..., AT_THIS_ROOT)? Is that
> > going to let the unprivileged user decide which interpreter the
> > setuid-root process should use? From a high-level perspective, opening
> > the interpreter should be controlled by the program that is being
> > loaded, not by the program that invoked it.
> 
> I went a bit nuts with openat_exec(), and I did end up adding it to the
> ELF interpreter lookup (and you're completely right that this is a bad
> idea -- I will drop it from this patch if it's included in the next
> series).
> 
> The proposed solutions you give below are much nicer than this patch so
> I can drop it and work on fixing those issues separately.

Another possible solution would be to only allow (for instance)
AT_NO_MAGICLINKS for execveat(2). That way you cannot scope the
resolution but you can block the most concerning cases -- those
involving /proc-related access.

I've posted a v7 with this patch dropped (because we can always add AT_*
flags later in time), but I think having at least NO_MAGICLINKS would be
useful.
Dmitry Safonov via Containers May 10, 2019, 8:10 p.m.
On Tue, May 07, 2019 at 07:38:58PM -0500, Eric W. Biederman wrote:
> Jann Horn <jannh@google.com> writes:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of;
> 
> Once upon a time in a galaxy far far away I fixed a bug where we missing
> ptrace_may_access checks on various proc files and systems using selinux
> stopped working.  At the time selinux did not allow ptrace like access
> to yourself.  The "introspection" special case was the quick and simple
> work-around.
> 
> There is nothing fundamental in having the "introspection" special case
> except that various lsms have probably grown to depend upon it being
> there.  I expect without difficulty we could move the check down
> into the various lsms.  Which would get that check out of the core
> kernel code.

Oh, if that's an option, that would be great, I think.


But this means, for example, that a non-root, non-dumpable process can't
open /proc/self/maps anymore, or open /proc/self/fd/*, and things like
that, without making itself dumpable. I would be surprised if there is
no code out there that relies on that.

From what I can tell, without the introspection special case,
introspection would fail in the following cases (assuming that the
process is not capable and isn't using sys_setfs[ug]id()):

 - ruid/euid/suid are not all the same
 - rgid/egid/sgid are not all the same
 - process is not dumpable

I think that there probably should be some way for a non-dumpable
process to look at its own procfs entries? If we could start from a
clean slate, I'd propose an opt-in flag to openat() for that, but
since we don't have a clean slate, I'd be afraid of breaking things
with that. But maybe I'm just being overly careful here?
Dmitry Safonov via Containers May 10, 2019, 8:41 p.m.
On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > In my opinion, CVE-2019-5736 points out two different problems:
> >
> > The big problem: The __ptrace_may_access() logic has a special-case
> > short-circuit for "introspection" that you can't opt out of; this
> > makes it possible to open things in procfs that are related to the
> > current process even if the credentials of the process wouldn't permit
> > accessing another process like it. I think the proper fix to deal with
> > this would be to add a prctl() flag for "set whether introspection is
> > allowed for this process", and if userspace has manually un-set that
> > flag, any introspection special-case logic would be skipped.
> 
> We could do PR_SET_DUMPABLE=3 for this, I guess?

Hmm... I'd make it a new prctl() command, since introspection is
somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
think the introspection flag should be per-thread.

> > An additional problem: /proc/*/exe can be used to open a file for
> > writing; I think it may have been Andy Lutomirski who pointed out some
> > time ago that it would be nice if you couldn't use /proc/*/fd/* to
> > re-open files with more privileges, which is sort of the same thing.
> 
> This is something I'm currently working on a series for, which would
> boil down to some restrictions on how re-opening of file descriptors
> works through procfs.

Ah, nice!

> However, execveat() of a procfs magiclink is a bit hard to block --
> there is no way for userspace to to represent a file being "open for
> execute" so they are all "open for execute" by default and blocking it
> outright seems a bit extreme (though I actually hope to eventually add
> the ability to mark an O_PATH as "open for X" to resolveat(2) -- hence
> why I've reserved some bits).

(For what it's worth, I'm mostly concerned about read vs write, not
really about execute, since execute really is just another form of
reading in my opinion.)

> (Thinking more about it, there is an argument that I should include the
> above patch into this series so that we can block re-opening of fds
> opened through resolveat(2) without explicit flags from the outset.)
Andy Lutomirski May 10, 2019, 9:20 p.m.
On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > In my opinion, CVE-2019-5736 points out two different problems:
> > >
> > > The big problem: The __ptrace_may_access() logic has a special-case
> > > short-circuit for "introspection" that you can't opt out of; this
> > > makes it possible to open things in procfs that are related to the
> > > current process even if the credentials of the process wouldn't permit
> > > accessing another process like it. I think the proper fix to deal with
> > > this would be to add a prctl() flag for "set whether introspection is
> > > allowed for this process", and if userspace has manually un-set that
> > > flag, any introspection special-case logic would be skipped.
> >
> > We could do PR_SET_DUMPABLE=3 for this, I guess?
>
> Hmm... I'd make it a new prctl() command, since introspection is
> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> think the introspection flag should be per-thread.

I've lost track of the context here, but it seems to me that
mitigating attacks involving accidental following of /proc links
shouldn't depend on dumpability.  What's the actual problem this is
trying to solve again?
Dmitry Safonov via Containers May 10, 2019, 10:55 p.m.
On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > >
> > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > short-circuit for "introspection" that you can't opt out of; this
> > > > makes it possible to open things in procfs that are related to the
> > > > current process even if the credentials of the process wouldn't permit
> > > > accessing another process like it. I think the proper fix to deal with
> > > > this would be to add a prctl() flag for "set whether introspection is
> > > > allowed for this process", and if userspace has manually un-set that
> > > > flag, any introspection special-case logic would be skipped.
> > >
> > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> >
> > Hmm... I'd make it a new prctl() command, since introspection is
> > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > think the introspection flag should be per-thread.
> 
> I've lost track of the context here, but it seems to me that
> mitigating attacks involving accidental following of /proc links
> shouldn't depend on dumpability.  What's the actual problem this is
> trying to solve again?

The one actual security problem that I've seen related to this is
CVE-2019-5736. There is a write-up of it at
<https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
under "Successful approach", but it goes more or less as follows:

A container is running that doesn't use user namespaces (because for
some reason I don't understand, apparently some people still do that).
An evil process is running inside the container with UID 0 (as in,
GLOBAL_ROOT_UID); so if the evil process inside the container was able
to reach root-owned files on the host filesystem, it could write into
them.

The container engine wants to spawn a new process inside the container.
It forks off a child that joins the container's namespaces (including
PID and mount namespaces), and then the child calls execve() on some
path in the container.
The attacker replaces the executable in the container with a symlink
to /proc/self/exe and replaces a library inside the container with a
malicious one.
When the container engine calls execve(), intending to run an executable
inside the container, it instead goes through ptrace_may_access() using
the introspection short-circuit and re-executes its own executable
through the jumped symlink /proc/self/exe (which is normally unreachable
for the container). After the execve(), the process loads an evil
library from inside the container and is under the control of the
container.
Now the container controls a process whose /proc/self/exe is a jumped
symlink to a host executable, and the container can write into it.

Some container engines are now using an extremely ugly hack to work
around this - whenever they want to enter a container, they copy the
host binary into a new memfd and execute that to avoid exposing the
original host binary to containers:
<https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>


In my opinion, the problems here are:

 - Apparently some people run untrusted containers without user
   namespaces. It would be really nice if people could not do that.
   (Probably the biggest problem here.)
 - ptrace_may_access() has a short-circuit that permits a process to
   unintentionally look at itself even if it has dropped privileges -
   here, it permits the execve("/proc/self/exe", ...) that would
   normally be blocked by the check for CAP_SYS_PTRACE if the process
   is nondumpable.
 - You can use /proc/*/exe to get a writable fd.
Christian Brauner May 10, 2019, 11:36 p.m.
On Sat, May 11, 2019 at 12:55 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
> > On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
> > > > On 2019-05-06, Jann Horn <jannh@google.com> wrote:
> > > > > In my opinion, CVE-2019-5736 points out two different problems:
> > > > >
> > > > > The big problem: The __ptrace_may_access() logic has a special-case
> > > > > short-circuit for "introspection" that you can't opt out of; this
> > > > > makes it possible to open things in procfs that are related to the
> > > > > current process even if the credentials of the process wouldn't permit
> > > > > accessing another process like it. I think the proper fix to deal with
> > > > > this would be to add a prctl() flag for "set whether introspection is
> > > > > allowed for this process", and if userspace has manually un-set that
> > > > > flag, any introspection special-case logic would be skipped.
> > > >
> > > > We could do PR_SET_DUMPABLE=3 for this, I guess?
> > >
> > > Hmm... I'd make it a new prctl() command, since introspection is
> > > somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
> > > think the introspection flag should be per-thread.
> >
> > I've lost track of the context here, but it seems to me that
> > mitigating attacks involving accidental following of /proc links
> > shouldn't depend on dumpability.  What's the actual problem this is
> > trying to solve again?
>
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
>
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
>
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.
> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.
> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
>
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
>
>
> In my opinion, the problems here are:
>
>  - Apparently some people run untrusted containers without user
>    namespaces. It would be really nice if people could not do that.
>    (Probably the biggest problem here.)

I know I sound like a broken record since I've been going on about this
forever together with a lot of other people but honestly,
the fact that people are running untrusted workloads in privileged containers
is the real issue here.

Aleksa is a good friend of mine and we have discussed this a lot so I hope
he doesn't hate me for saying this again: it is crazy that there are container
runtimes out there that promise (or at least do not state the opposite)
containers without user namespaces or containers with user namespaces
that allow to map the host root id to anything can be safe. They cannot.

Even if this /proc/*/exe thing is somehow blocked there
are other ways of escaping from a privileged container.
We (i.e. LXC) literally do not accept CVEs for privileged containers
because we do not consider them safe by design.
It seems to me to be heading in the wrong direction to keep up the
illusion that with enough effort we can make this all nice and safe.
Yes, the userspace memfd hack we came up with is as ugly as a security
patch can be but if you make promises you can't keep you better be
prepared to pay the price when things start to fall apart.
So if this part of the patch is just needed to handle this do we really
want to do all that tricky work or is there more to gain from this that
makes it worth it.

Christian
Aleksa Sarai May 11, 2019, 3:49 p.m.
On 2019-05-11, Christian Brauner <christian@brauner.io> wrote:
> > In my opinion, the problems here are:
> >
> >  - Apparently some people run untrusted containers without user
> >    namespaces. It would be really nice if people could not do that.
> >    (Probably the biggest problem here.)
>
> I know I sound like a broken record since I've been going on about this
> forever together with a lot of other people but honestly,
> the fact that people are running untrusted workloads in privileged containers
> is the real issue here.

I completely agree. It's a shit-show, and it's caused by bad defaults in
Docker and (now) podman. To be fair, they both now support rootless
containers but the default is still privileged containers.

They do support user namespaces (though it should be noted that LXD's
support is much nicer from a security standpoint) but unless it's the
default the support is almost pointless. In the case of Docker it can
lead to some usability issues when you enable it (which I believe is the
main justification for it not being the default).

> Aleksa is a good friend of mine and we have discussed this a lot so I hope
> he doesn't hate me for saying this again: it is crazy that there are container
> runtimes out there that promise (or at least do not state the opposite)
> containers without user namespaces or containers with user namespaces
> that allow to map the host root id to anything can be safe. They cannot.

Yeah, the fact that we (runc) don't scream from the rooftops that this
setup is insecure is definitely a problem. I have mentioned this
whenever I've had a chance, but the fact that the most popular runtimes
(which use runc) don't use user namespaces compounds the issue. I'm
willing to bet that >90% of users of runc-based runtimes don't use user
namespaces at all, and this is all down to bad defaults.

There are also some other misfeatures we have in runc that we're
basically forced to support because some users use them, and we can't
really break entire projects (even though it's the projects' fault they
have an insecure setup).

> It seems to me to be heading in the wrong direction to keep up the
> illusion that with enough effort we can make this all nice and safe.
> Yes, the userspace memfd hack we came up with is as ugly as a security
> patch can be but if you make promises you can't keep you better be
> prepared to pay the price when things start to fall apart.

> So if this part of the patch is just needed to handle this do we really
> want to do all that tricky work or is there more to gain from this that
> makes it worth it.

I dropped this patch in v7, I don't think it's required for the
overarching feature. Looking back on it, it doesn't make much sense
given the context that privileged containers are unsafe in the first
place.

I do think that being able to block introspection might be a useful
hardening feature though. During attachment it would be nice to be sure
that nothing will be able to touch the attaching process's /proc/$pid --
even itself.
Andy Lutomirski May 11, 2019, 5 p.m.
> On May 10, 2019, at 3:55 PM, Jann Horn <jannh@google.com> wrote:
> 
>> On Fri, May 10, 2019 at 02:20:23PM -0700, Andy Lutomirski wrote:
>>> On Fri, May 10, 2019 at 1:41 PM Jann Horn <jannh@google.com> wrote:
>>> 
>>>> On Tue, May 07, 2019 at 05:17:35AM +1000, Aleksa Sarai wrote:
>>>>> On 2019-05-06, Jann Horn <jannh@google.com> wrote:
>>>>> In my opinion, CVE-2019-5736 points out two different problems:
>>>>> 
>>>>> The big problem: The __ptrace_may_access() logic has a special-case
>>>>> short-circuit for "introspection" that you can't opt out of; this
>>>>> makes it possible to open things in procfs that are related to the
>>>>> current process even if the credentials of the process wouldn't permit
>>>>> accessing another process like it. I think the proper fix to deal with
>>>>> this would be to add a prctl() flag for "set whether introspection is
>>>>> allowed for this process", and if userspace has manually un-set that
>>>>> flag, any introspection special-case logic would be skipped.
>>>> 
>>>> We could do PR_SET_DUMPABLE=3 for this, I guess?
>>> 
>>> Hmm... I'd make it a new prctl() command, since introspection is
>>> somewhat orthogonal to dumpability. Also, dumpability is per-mm, and I
>>> think the introspection flag should be per-thread.
>> 
>> I've lost track of the context here, but it seems to me that
>> mitigating attacks involving accidental following of /proc links
>> shouldn't depend on dumpability.  What's the actual problem this is
>> trying to solve again?
> 
> The one actual security problem that I've seen related to this is
> CVE-2019-5736. There is a write-up of it at
> <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> under "Successful approach", but it goes more or less as follows:
> 
> A container is running that doesn't use user namespaces (because for
> some reason I don't understand, apparently some people still do that).
> An evil process is running inside the container with UID 0 (as in,
> GLOBAL_ROOT_UID); so if the evil process inside the container was able
> to reach root-owned files on the host filesystem, it could write into
> them.
> 
> The container engine wants to spawn a new process inside the container.
> It forks off a child that joins the container's namespaces (including
> PID and mount namespaces), and then the child calls execve() on some
> path in the container.

I think that, at this point, the task should be considered owned by the container.  Maybe we should have a better API than execve() to execute a program in a safer way, but fiddling with dumpability seems like a band-aid.  In fact, the process is arguably pwned even *before* execve.

A better “spawn” API should fix this.  In the mean time, I think it should be assumed that, if you join a container’s namespaces, you are at its mercy.

> The attacker replaces the executable in the container with a symlink
> to /proc/self/exe and replaces a library inside the container with a
> malicious one.

Cute.

> When the container engine calls execve(), intending to run an executable
> inside the container, it instead goes through ptrace_may_access() using
> the introspection short-circuit and re-executes its own executable
> through the jumped symlink /proc/self/exe (which is normally unreachable
> for the container). After the execve(), the process loads an evil
> library from inside the container and is under the control of the
> container.
> Now the container controls a process whose /proc/self/exe is a jumped
> symlink to a host executable, and the container can write into it.
> 
> Some container engines are now using an extremely ugly hack to work
> around this - whenever they want to enter a container, they copy the
> host binary into a new memfd and execute that to avoid exposing the
> original host binary to containers:
> <https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b>
> 
> 
> In my opinion, the problems here are:
> 
> - Apparently some people run untrusted containers without user
>   namespaces. It would be really nice if people could not do that.
>   (Probably the biggest problem here.)

> - ptrace_may_access() has a short-circuit that permits a process to
>   unintentionally look at itself even if it has dropped privileges -
>   here, it permits the execve("/proc/self/exe", ...) that would
>   normally be blocked by the check for CAP_SYS_PTRACE if the process
>   is nondumpable.

I don’t see this as a problem.  Dumpable is about protecting a task from others, not about protecting a task against itself.

> - You can use /proc/*/exe to get a writable fd.

This is IMO the real bug.
Linus Torvalds May 11, 2019, 5:21 p.m.
On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> A better “spawn” API should fix this.

Andy, stop with the "spawn would be better".

Spawn is garbage. It's garbage because it's fundamentally too
inflexible, and it's garbage because it is quite complex to try to
work around the inflexibility by having those complex "action pointer
arrays" to make up for its failings.

And spawn() would fundamentally have all the same permission issues
that you now point to execve() as having, so it doesn't even *solve*
anything.

You've said this whole "spawn would fix things" thing before, and it's
wrong. Spawn isn't better. Really. If fixes absolutely zero things,
and the only reason for spawn existing is because VMS and NT had that
broken and inflexible model.

There's at least one paper from some MS people about how "spawn()" is
wonderful, and maybe you bought into the garbage from that. But that
paper is about how they hate fork(), not because of execve(). And if
you hate fork, use "vfork()" instead (preferably with an immediate
call to a non-returning function in the child to avoid the stack
re-use issue that makes it so simple to screw up vfork() in hard to
debug ways).

execve() is a _fine_ model. That's not the problem in this whole issue
at all - never was, and never will be.

The problem in this discussion is (a) having privileges you shouldn't
have and (b) having other interfaces that make it easyish to change
the filesystem layout to confuse those entities with privileges.

So the reason the open flags can be problematic is exactly because
they effectively change filesystem layout. And no, it's not just
AT_THIS_ROOT, although that's the obvious one. Things like "you can't
follow symlinks" can also effectively change the layout: imagine if
you have a PATH-like lookup model, and you end up having symlinks as
part of the standard filesystem layout.. Now a "don't follow symlinks"
can turn the *standard* executable into something that isn't found,
and then you might end up executing something else instead (think root
having '.' as the last entry in path, which some people used to
suggest as the fix for the completely bad "first entry" case)..

Notice? None of the real problems are about execve or would be solved
by any spawn API. You just think that because you've apparently been
talking to too many MS people that think fork (and thus indirectly
execve()) is bad process management.

                Linus
Aleksa Sarai May 11, 2019, 5:26 p.m.
On 2019-05-11, Andy Lutomirski <luto@amacapital.net> wrote:
> >> I've lost track of the context here, but it seems to me that
> >> mitigating attacks involving accidental following of /proc links
> >> shouldn't depend on dumpability.  What's the actual problem this is
> >> trying to solve again?
> > 
> > The one actual security problem that I've seen related to this is
> > CVE-2019-5736. There is a write-up of it at
> > <https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html>
> > under "Successful approach", but it goes more or less as follows:
> > 
> > A container is running that doesn't use user namespaces (because for
> > some reason I don't understand, apparently some people still do that).
> > An evil process is running inside the container with UID 0 (as in,
> > GLOBAL_ROOT_UID); so if the evil process inside the container was able
> > to reach root-owned files on the host filesystem, it could write into
> > them.
> > 
> > The container engine wants to spawn a new process inside the container.
> > It forks off a child that joins the container's namespaces (including
> > PID and mount namespaces), and then the child calls execve() on some
> > path in the container.
> 
> I think that, at this point, the task should be considered owned by
> the container.  Maybe we should have a better API than execve() to
> execute a program in a safer way, but fiddling with dumpability seems
> like a band-aid.  In fact, the process is arguably pwned even *before*
> execve.

Yeah, execve is just the vector (though in this case it's done in order
to clear mm->dumpable). An earlier CVE (CVE-2016-9962) was very similar
but was attacking a dirfd that runc had open into the container (LXC had
a very similar bug too) -- setting !mm->dumpable was one of the
workarounds we had for this.

> A better “spawn” API should fix this.  In the mean time, I think it
> should be assumed that, if you join a container’s namespaces, you are
> at its mercy.

This is generally how we treat containers as runtime authors, but it's
not a trivial thing to get right. In many cases the kernel APIs are
working against you -- Christian and myself have written a fair few
patches to fix holes in the kernel APIs so we can avoid these kinds of
assumptions.

But yes, one of the most risky parts of a container runtime is when
you're attaching to a running container because all of the helpful
introspection APIs in /proc/ suddenly become a security nightmare. A
better "spawn a process in these namespaces" API might help improve the
situation (or at least, I hope it would).

> > - You can use /proc/*/exe to get a writable fd.
> 
> This is IMO the real bug.

I will try to send an RFC of the patchset I have for this next week or
so. Funnily enough, currently /proc/*/exe has the write bit set in its
"mode" (my series fixes this).
Linus Torvalds May 11, 2019, 5:26 p.m.
On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.

Side note: a good policy has been (and remains) to make suid binaries
not be dynamically linked. And in the absence of that, the dynamic
linker at least resets the library path when it notices itself being
dynamic, and it certainly doesn't inherit any open flags from the
non-trusted environment.

And by the same logic, a suid interpreter must *definitely* should not
inherit any execve() flags from the non-trusted environment. So I
think Aleksa's patch to use the passed-in open flags is *exactly* the
wrong thing to do for security reasons. It doesn't close holes, it
opens them.

                Linus
Aleksa Sarai May 11, 2019, 5:31 p.m.
On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, May 11, 2019 at 1:21 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Notice? None of the real problems are about execve or would be solved
> > by any spawn API. You just think that because you've apparently been
> > talking to too many MS people that think fork (and thus indirectly
> > execve()) is bad process management.
> 
> Side note: a good policy has been (and remains) to make suid binaries
> not be dynamically linked. And in the absence of that, the dynamic
> linker at least resets the library path when it notices itself being
> dynamic, and it certainly doesn't inherit any open flags from the
> non-trusted environment.
> 
> And by the same logic, a suid interpreter must *definitely* should not
> inherit any execve() flags from the non-trusted environment. So I
> think Aleksa's patch to use the passed-in open flags is *exactly* the
> wrong thing to do for security reasons. It doesn't close holes, it
> opens them.

Yup, I've dropped the patch for the next version. (To be honest, I'm not
sure why I included any of the other flags -- the only one that would've
been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
Linus Torvalds May 11, 2019, 5:43 p.m.
On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> Yup, I've dropped the patch for the next version. (To be honest, I'm not
> sure why I included any of the other flags -- the only one that would've
> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)

I do wonder if we could try to just set AT_NO_MAGICLINKS
unconditionally for execve() (and certainly for the suid case).

I'd rather try to do these things across the board, than have "suid
binaries are treated specially" if at all possible.

The main use case for having /proc/<pid>/exe thing is for finding open
file descriptors, and for 'ps' kind of use, or to find the startup
directory when people don't populate the execve() environment fully
(ie "readlink(/proc/self/exe)" is afaik pretty common.

Sadly, googling for

    execve /proc/self/exe

does actually find hits, including one that implies that chrome does
exactly that.  So it might not be possible.

Somewhat odd, but it does just confirm the whole "users will at some
point do everything in their power to use every odd special case,
intended or not".

                  Linus
Christian Brauner May 11, 2019, 5:48 p.m.
On May 11, 2019 7:43:44 PM GMT+02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>>
>> Yup, I've dropped the patch for the next version. (To be honest, I'm
>not
>> sure why I included any of the other flags -- the only one that
>would've
>> been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
>
>I do wonder if we could try to just set AT_NO_MAGICLINKS
>unconditionally for execve() (and certainly for the suid case).
>
>I'd rather try to do these things across the board, than have "suid
>binaries are treated specially" if at all possible.
>
>The main use case for having /proc/<pid>/exe thing is for finding open
>file descriptors, and for 'ps' kind of use, or to find the startup
>directory when people don't populate the execve() environment fully
>(ie "readlink(/proc/self/exe)" is afaik pretty common.
>
>Sadly, googling for
>
>    execve /proc/self/exe
>
>does actually find hits, including one that implies that chrome does
>exactly that.  So it might not be possible.
>
>Somewhat odd, but it does just confirm the whole "users will at some
>point do everything in their power to use every odd special case,
>intended or not".
>
>                  Linus

Sadly I have to admit that we are using this.
Also, execveat on glibc is implemented via
/proc/self/fd/<nr> on kernels that do not
have a proper execveat.
See fexecve...

Christian
Aleksa Sarai May 11, 2019, 6 p.m.
On 2019-05-11, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, May 11, 2019 at 1:31 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > Yup, I've dropped the patch for the next version. (To be honest, I'm not
> > sure why I included any of the other flags -- the only one that would've
> > been necessary to deal with CVE-2019-5736 was AT_NO_MAGICLINKS.)
> 
> I do wonder if we could try to just set AT_NO_MAGICLINKS
> unconditionally for execve() (and certainly for the suid case).
> 
> I'd rather try to do these things across the board, than have "suid
> binaries are treated specially" if at all possible.
> 
> The main use case for having /proc/<pid>/exe thing is for finding open
> file descriptors, and for 'ps' kind of use, or to find the startup
> directory when people don't populate the execve() environment fully
> (ie "readlink(/proc/self/exe)" is afaik pretty common.
> 
> Sadly, googling for
> 
>     execve /proc/self/exe
> 
> does actually find hits, including one that implies that chrome does
> exactly that.  So it might not be possible.
> 
> Somewhat odd, but it does just confirm the whole "users will at some
> point do everything in their power to use every odd special case,
> intended or not".

*sheepishly* Actually we use this in runc very liberally.

It's done because we need to run namespace-related code but runc is
written in Go so (long story short) we re-exec ourselves in order to
run some __attribute__((constructor)) code which sets up the namespaces
and then lets the Go runtime boot.

I suspect just writing everything in C would've been orders of magnitude
simpler, but I wasn't around when that decision was made. :P

Also as Christian mentioned, fexecve(3) in glibc is implemented using
/proc/self/fd on old kernels (then again, if we change the behaviour on
new kernels it won't matter because glibc uses execveat(AT_EMPTY_PATH)
if it's available).
Andy Lutomirski May 11, 2019, 10:39 p.m.
> On May 11, 2019, at 10:21 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Sat, May 11, 2019 at 1:00 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> A better “spawn” API should fix this.
>
> Andy, stop with the "spawn would be better".

It doesn’t have to be spawn per se.  But the current situation sucks.

>
> Notice? None of the real problems are about execve or would be solved
> by any spawn API. You just think that because you've apparently been
> talking to too many MS people that think fork (and thus indirectly
> execve()) is bad process management.
>
>

I’ve literally never spoken to an MS person about it.

What container managers and init systems *want* is a way to drop
privileges, change namespaces, etc and then run something in a
controlled way so that the intermediate states aren’t dangerous. An
API for this could be spawn-like or exec-like — that particular
distinction is beside the point.  Having personally written code that
mucks with namepsaces, I've wanted two particular abilities that are
both quite awkward:

a) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary in the container's
filesystem, all atomically enough that I don't need to worry about
accidentally leaking privileges into the container.  A
super-duper-non-dumpable mode would kind of allow this, but I'd worry
that there's some other hole besides ptrace() and /proc/self.

b) Change all my UIDs and GIDs to match a container, enter that
container's namespaces, and run some binary that is *not* in the
container's filesystem.  This happens, for example, if the container's
mount namespace has no exec mounts at all.  We don't have a fantastic
way to do this at all right now due to /proc/self/exe.

Regardless, the actual CVE at hand would have been nicely avoided if
writing to /proc/self/exe didn’t work, and I see no reason we can’t
make that happen.

I suppose we could also consider a change to disable /proc/self/exe if
it's not reachable from /proc/self/root.  By "disable", I mean that
readlink() should maybe still work, but actually trying to open it
could probably fail safely.
Linus Torvalds May 11, 2019, 11:08 p.m.
[ on mobile again, power is off and the wifi doesn't work, so I'm reading
email on my phone and apologizing once more for horrible html email.. ]

On Sat, May 11, 2019, 18:40 Andy Lutomirski <luto@kernel.org> wrote:

>
> a) Change all my UIDs and GIDs to match a container, enter that
> container's namespaces, and run some binary in the container's
> filesystem, all atomically enough that I don't need to worry about
> accidentally leaking privileges into the container.  A
> super-duper-non-dumpable mode would kind of allow this, but I'd worry
> that there's some other hole besides ptrace() and /proc/self.
>

So I would expect that you'd want to do this even *without* doing an execve
at all, which is why I still don't think this is actually about
spawn/execve at all.

But I agree that what you that what you want sounds reasonable. But I think
the "dumpable" flag has always been a complete hack, and very unreliable
with random meanings (and random ways to then get around it).

We have actually had lots of people wanting to run "lists of system calls"
kinds of things. Sometimes for performance reasons, sometimes for other
random reasons  Maybe that "atomicity" angle would be another one, although
we would have to make the setuid etc things be something that happens at
the end.

So rather than "spawn", is much rather see people think about ways to just
batch operations in general, rather than think it is about batching things
just before a process change.

b) Change all my UIDs and GIDs to match a container, enter that
> container's namespaces, and run some binary that is *not* in the
> container's filesystem.
>

Hey, you could probably do something very close to that by opening the
executable you want to run first, making it O_CLOEXEC, then doing all the
other things (which now makes the executable inaccessible), and finally
doing execveat() on the file descriptor..

I say "something very close" because even though it's O_CLOEXEC, only the
file would be closed, and /proc/self/exe would still exist.

But we really could make that execveat of a O_CLOEXEC file thing also
disable access to /proc/*/exe, and it sounds like a sane and simple
extension in general to do..

       Linus

>
Andy Lutomirski May 11, 2019, 11:37 p.m.
> On May 11, 2019, at 4:08 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> [ on mobile again, power is off and the wifi doesn't work, so I'm reading email on my phone and apologizing once more for horrible html email.. ]
> 
>> On Sat, May 11, 2019, 18:40 Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> a) Change all my UIDs and GIDs to match a container, enter that
>> container's namespaces, and run some binary in the container's
>> filesystem, all atomically enough that I don't need to worry about
>> accidentally leaking privileges into the container.  A
>> super-duper-non-dumpable mode would kind of allow this, but I'd worry
>> that there's some other hole besides ptrace() and /proc/self.
> 
> 
> So I would expect that you'd want to do this even *without* doing an execve at all, which is why I still don't think this is actually about spawn/execve at all.
> 
> But I agree that what you that what you want sounds reasonable. But I think the "dumpable" flag has always been a complete hack, and very unreliable with random meanings (and random ways to then get around it).
> 
> We have actually had lots of people wanting to run "lists of system calls" kinds of things. Sometimes for performance reasons, sometimes for other random reasons  Maybe that "atomicity" angle would be another one, although we would have to make the setuid etc things be something that happens at the end.
> 
> So rather than "spawn", is much rather see people think about ways to just batch operations in general, rather than think it is about batching things just before a process change.
> 
>> b) Change all my UIDs and GIDs to match a container, enter that
>> container's namespaces, and run some binary that is *not* in the
>> container's filesystem.  
> 
> 
> Hey, you could probably do something very close to that by opening the executable you want to run first, making it O_CLOEXEC, then doing all the other things (which now makes the executable inaccessible), and finally doing execveat() on the file descriptor..
> 
> I say "something very close" because even though it's O_CLOEXEC, only the file would be closed, and /proc/self/exe would still exist.
> 
> But we really could make that execveat of a O_CLOEXEC file thing also disable access to /proc/*/exe, and it sounds like a sane and simple extension in general to do..  
> 
>  

I bet this will break something that already exists. An execveat() flag to turn off /proc/self/exe would do the trick, though.
Christian Brauner May 12, 2019, 10:19 a.m.
On Sat, May 11, 2019 at 07:08:49PM -0400, Linus Torvalds wrote:
> [ on mobile again, power is off and the wifi doesn't work, so I'm reading
> email on my phone and apologizing once more for horrible html email.. ]
> 
> On Sat, May 11, 2019, 18:40 Andy Lutomirski <luto@kernel.org> wrote:
> 
> >
> > a) Change all my UIDs and GIDs to match a container, enter that
> > container's namespaces, and run some binary in the container's

For the namespace part, an idea I had and presented at LPC a while ago
was to make setns() interpret the nstype argument as a flag argument and
to introduce an fd that can refer to multiple namespaces at the same
time. This way you could do:

setns(namespaces_fd, CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWPID)

that could still be done. But I since have come to think that there's a
better way now that we have CLONE_PIDFD. We should just make setns()
take a pidfd as argument and then be able to call:

setns(pidfd, 0);

which would cause the calling thread to join all of the namespaces of
the process referred to by pidfd at the same time. That really shouldn't
be hard to do. I could probably get this going rather quickly and it
would really help out container managers a lot.

> > filesystem, all atomically enough that I don't need to worry about
> > accidentally leaking privileges into the container.  A
> > super-duper-non-dumpable mode would kind of allow this, but I'd worry
> > that there's some other hole besides ptrace() and /proc/self.
> >
> 
> So I would expect that you'd want to do this even *without* doing an execve
> at all, which is why I still don't think this is actually about
> spawn/execve at all.
> 
> But I agree that what you that what you want sounds reasonable. But I think

I have pitched an api like that a while ago (see [1]) - when I first
started this fd for processes thing - that would allow you to do things
like that. The gist was:

1. int pidfd_create aka CLONE_PIDFD now
   will return an fd that creates a process context. The fd returned by
   does not refer to any existing process and has not actually been
   started yet. So non-configuration operations on it or trying to
   interact with it would fail with e.g. ESRCH/EINVAL.
   
   We essentially have this now with CLONE_PIDFD. The bit that is missing
   is an "unscheduled" process.

2. int pidfd_config
   takes a CLONE_PIDFD and can be used to configure a process context
   before it is alive.
   Some things that I would like to be able to do with this syscall are:
   - configure signals
   - set clone flags
   - write idmappings if the process runs in a new user namespace
   - configure what happens when all procfds referring to the process are gone
   - ...
3. int pidfd_info
4. int pidfd_manage
   Parts of that are captured in pidfd_send_signal().

Just to get a very rough feel for this without detailing parameters right now:

/* process should have own mountns */
pidfd_config(fd, PROC_SET_NAMESPACE,  CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWUSR, <potentially additional arguments>)

/* process should get SIGKILL when all procfds are closed */
pidfd_config(fd, PROC_SET_CLOSE, SIGKILL,     <potentially additional arguments>)

After the caller is done configuring the process there would be a final step:

pidfd_config(fd, PROC_CREATE, 0, <potentially additional arguments>)

which would create the process and (either as return value or through a
parameter) return the pid of the newly created process.

I had more thoughts on this and had started prototyping some of it but
then there wasn't much interest it seemed. Maybe because it's crazy.

[1]: https://lore.kernel.org/lkml/20181118174148.nvkc4ox2uorfatbm@brauner.io/

> the "dumpable" flag has always been a complete hack, and very unreliable
> with random meanings (and random ways to then get around it).
> 
> We have actually had lots of people wanting to run "lists of system calls"
> kinds of things. Sometimes for performance reasons, sometimes for other
> random reasons  Maybe that "atomicity" angle would be another one, although
> we would have to make the setuid etc things be something that happens at
> the end.
> 
> So rather than "spawn", is much rather see people think about ways to just
> batch operations in general, rather than think it is about batching things
> just before a process change.
> 
> b) Change all my UIDs and GIDs to match a container, enter that
> > container's namespaces, and run some binary that is *not* in the
> > container's filesystem.
> >
> 
> Hey, you could probably do something very close to that by opening the
> executable you want to run first, making it O_CLOEXEC, then doing all the
> other things (which now makes the executable inaccessible), and finally
> doing execveat() on the file descriptor..

I think that's somewhat similar to what I've suggested above.

> 
> I say "something very close" because even though it's O_CLOEXEC, only the
> file would be closed, and /proc/self/exe would still exist.
> 
> But we really could make that execveat of a O_CLOEXEC file thing also
> disable access to /proc/*/exe, and it sounds like a sane and simple
> extension in general to do..
> 
>        Linus
> 
> >
Linus Torvalds May 12, 2019, 10:44 a.m.
On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I bet this will break something that already exists. An execveat() flag to turn off /proc/self/exe would do the trick, though.

Thinking more about it, I suspect it is (once again) wrong to let the
thing that does the execve() control that bit.

Generally, the less we allow people to affect the lifetime and
environment of a suid executable, the better off we are.

But maybe we could limit /proc/*/exe to at least not honor suid'ness
of the target? Or does chrome/runc depend on that too?

             Linus
Aleksa Sarai May 12, 2019, 1:35 p.m.
On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > I bet this will break something that already exists. An execveat()
> > flag to turn off /proc/self/exe would do the trick, though.
> 
> Thinking more about it, I suspect it is (once again) wrong to let the
> thing that does the execve() control that bit.
> 
> Generally, the less we allow people to affect the lifetime and
> environment of a suid executable, the better off we are.
> 
> But maybe we could limit /proc/*/exe to at least not honor suid'ness
> of the target? Or does chrome/runc depend on that too?

Speaking on the runc side, we don't depend on this. It's possible
someone depends on this for fexecve(3) -- but as mentioned before in
newer kernels glibc uses execve(AT_EMPTY_PATH).

I would like to point out though that I'm a little bit cautious about
/proc/self/exe-specific restrictions -- because a trivial way to get
around them would be to just open it with O_PATH (and you end up with a
/proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec
on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of
setuid descriptors.

The patches I mentioned (which Andy and I discussed off-list) would
effectively make the magiclink modes in /proc/ affect how you can
operate on the path (no write bit in the mode, cannot re-open it write).
One aspect of this is how to handle O_PATH and in particular how do we
handle an O_PATH re-open of an already-restricted magiclink.

Maybe we could make it so that setuid is disallowed if you are dealing
with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you
get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a
magiclink this fmode gets dropped and when the fd is given to
execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not
allowed.

[I assume in this discussion "setuid" means "setuid + setcap", right?]
Aleksa Sarai May 12, 2019, 1:38 p.m.
On 2019-05-12, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > > I bet this will break something that already exists. An execveat()
> > > flag to turn off /proc/self/exe would do the trick, though.
> > 
> > Thinking more about it, I suspect it is (once again) wrong to let the
> > thing that does the execve() control that bit.
> > 
> > Generally, the less we allow people to affect the lifetime and
> > environment of a suid executable, the better off we are.
> > 
> > But maybe we could limit /proc/*/exe to at least not honor suid'ness
> > of the target? Or does chrome/runc depend on that too?
> 
> Speaking on the runc side, we don't depend on this. It's possible
> someone depends on this for fexecve(3) -- but as mentioned before in
> newer kernels glibc uses execve(AT_EMPTY_PATH).
> 
> I would like to point out though that I'm a little bit cautious about
> /proc/self/exe-specific restrictions -- because a trivial way to get
> around them would be to just open it with O_PATH (and you end up with a
> /proc/self/fd/ which is equivalent). Unfortunately blocking setuid exec
> on all O_PATH descriptors would break even execve(AT_EMPTY_PATH) of
> setuid descriptors.
> 
> The patches I mentioned (which Andy and I discussed off-list) would
> effectively make the magiclink modes in /proc/ affect how you can
> operate on the path (no write bit in the mode, cannot re-open it write).
> One aspect of this is how to handle O_PATH and in particular how do we
> handle an O_PATH re-open of an already-restricted magiclink.
> 
> Maybe we could make it so that setuid is disallowed if you are dealing
> with an O_PATH fd which was a magiclink. Effectively, on O_PATH open you
> get an fmode_t saying FMODE_SETUID_EXEC_ALLOWED *but* if the path is a
> magiclink this fmode gets dropped and when the fd is given to
> execveat(AT_EMPTY_PATH) the fmode is checked and setuid-exec is not
> allowed.

... and obviously /proc/self/exe would have an fmode
~FMODE_SETUID_EXEC_ALLOWED from the outset. The reason for this slightly
odd semantic would be to continue to allow O_PATH setuid-exec as long as
the O_PATH was opened from an actual path rather than a magiclink.
Andy Lutomirski May 12, 2019, 2:34 p.m.
> On May 12, 2019, at 6:35 AM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
>> On 2019-05-12, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> On Sat, May 11, 2019 at 7:37 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> I bet this will break something that already exists. An execveat()
>>> flag to turn off /proc/self/exe would do the trick, though.
>> 
>> Thinking more about it, I suspect it is (once again) wrong to let the
>> thing that does the execve() control that bit.
>> 
>> Generally, the less we allow people to affect the lifetime and
>> environment of a suid executable, the better off we are.
>> 
>> But maybe we could limit /proc/*/exe to at least not honor suid'ness
>> of the target? Or does chrome/runc depend on that too?
> 
> Speaking on the runc side, we don't depend on this. It's possible
> someone depends on this for fexecve(3) -- but as mentioned before in
> newer kernels glibc uses execve(AT_EMPTY_PATH).

Why are we concerned about suid?  Don’t we already block suid if the path being execed doesn’t come from the current mountns?  That should mostly cover the things we care about, no?

I suppose we could also block suid for deleted files, so that deleting a known-buggy suid binary becomes more reliable. But every sensible package tool should already be chmoding the suid away before unlinking.