[v2,01/24] exec: Move unshare_files to fix posix file locking during exec

Submitted by Eric W. Biederman on Nov. 20, 2020, 11:14 p.m.

Details

Message ID 20201120231441.29911-1-ebiederm@xmission.com
State New
Series "exec: Move unshare_files and guarantee files_struct.count is correct"
Headers show

Commit Message

Eric W. Biederman Nov. 20, 2020, 11:14 p.m.
Many moons ago the binfmts were doing some very questionable things
with file descriptors and an unsharing of the file descriptor table
was added to make things better[1][2].  The helper steal_lockss was
added to avoid breaking the userspace programs[3][4][6].

Unfortunately it turned out that steal_locks did not work for network
file systems[5], so it was removed to see if anyone would
complain[7][8].  It was thought at the time that NPTL would not be
affected as the unshare_files happened after the other threads were
killed[8].  Unfortunately because there was an unshare_files in
binfmt_elf.c before the threads were killed this analysis was
incorrect.

This unshare_files in binfmt_elf.c resulted in the unshares_files
happening whenever threads were present.  Which led to unshare_files
being moved to the start of do_execve[9].

Later the problems were rediscovered and the suggested approach was to
readd steal_locks under a different name[10].  I happened to be
reviewing patches and I noticed that this approach was a step
backwards[11].

I proposed simply moving unshare_files[12] and it was pointed
out that moving unshare_files without auditing the code was
also unsafe[13].

There were then several attempts to solve this[14][15][16] and I even
posted this set of changes[17].  Unfortunately because auditing all of
execve is time consuming this change did not make it in at the time.

Well now that I am cleaning up exec I have made the time to read
through all of the binfmts and the only playing with file descriptors
is either the security modules closing them in
security_bprm_committing_creds or is in the generic code in fs/exec.c.
None of it happens before begin_new_exec is called.

So move unshare_files into begin_new_exec, after the point of no
return.  If memory is very very very low and the application calling
exec is sharing file descriptor tables between processes we might fail
past the point of no return.  Which is unfortunate but no different
than any of the other places where we allocate memory after the point
of no return.

This movement allows another process that shares the file table, or
another thread of the same process and that closes files or changes
their close on exec behavior and races with execve to cause some
unexpected things to happen.  There is only one time of check to time
of use race and it is just there so that execve fails instead of
an interpreter failing when it tries to open the file it is supposed
to be interpreting.   Failing later if userspace is being silly is
not a problem.

With this change it the following discription from the removal
of steal_locks[8] finally becomes true.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

      - have multiple threads during exec (LinuxThreads doesn't kill other
        threads, the app may do it with pthread_kill_other_threads_np())
      - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

      - use clone(CLONE_FILES)
      - rely on POSIX locks being inherited across exec

I have investigated some paths to make it possible to solve this
without moving unshare_files but they all look more complicated[18].

Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Reported-by: Jeff Layton <jlayton@redhat.com>
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
[1] 02cda956de0b ("[PATCH] unshare_files"
[2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
[3] 088f5d7244de ("[PATCH] add steal_locks helper")
[4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
[5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu
[6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org
[7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu
[8] c89681ed7d0e ("[PATCH] remove steal_locks()")
[9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()")
[10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org
[11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com
[12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com
[13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk
[14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org
[15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org
[16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org
[17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com
[18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..fa788988efe9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1238,6 +1238,7 @@  void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
 int begin_new_exec(struct linux_binprm * bprm)
 {
 	struct task_struct *me = current;
+	struct files_struct *displaced;
 	int retval;
 
 	/* Once we are committed compute the creds */
@@ -1257,6 +1258,13 @@  int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	/* Ensure the files table is not shared. */
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out;
+	if (displaced)
+		put_files_struct(displaced);
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1776,7 +1784,6 @@  static int bprm_execve(struct linux_binprm *bprm,
 		       int fd, struct filename *filename, int flags)
 {
 	struct file *file;
-	struct files_struct *displaced;
 	int retval;
 
 	/*
@@ -1784,13 +1791,9 @@  static int bprm_execve(struct linux_binprm *bprm,
 	 */
 	io_uring_task_cancel();
 
-	retval = unshare_files(&displaced);
-	if (retval)
-		return retval;
-
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
-		goto out_files;
+		return retval;
 
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
@@ -1805,8 +1808,12 @@  static int bprm_execve(struct linux_binprm *bprm,
 	bprm->file = file;
 	/*
 	 * Record that a name derived from an O_CLOEXEC fd will be
-	 * inaccessible after exec. Relies on having exclusive access to
-	 * current->files (due to unshare_files above).
+	 * inaccessible after exec.  This allows the code in exec to
+	 * choose to fail when the executable is not mmaped into the
+	 * interpreter and an open file descriptor is not passed to
+	 * the interpreter.  This makes for a better user experience
+	 * than having the interpreter start and then immediately fail
+	 * when it finds the executable is inaccessible.
 	 */
 	if (bprm->fdpath &&
 	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
@@ -1827,8 +1834,6 @@  static int bprm_execve(struct linux_binprm *bprm,
 	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
-	if (displaced)
-		put_files_struct(displaced);
 	return retval;
 
 out:
@@ -1845,10 +1850,6 @@  static int bprm_execve(struct linux_binprm *bprm,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
-out_files:
-	if (displaced)
-		reset_files_struct(displaced);
-
 	return retval;
 }
 

Comments

Linus Torvalds Nov. 20, 2020, 11:58 p.m.
On Fri, Nov 20, 2020 at 3:16 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> @@ -1257,6 +1258,13 @@ int begin_new_exec(struct linux_binprm * bprm)
>         if (retval)
>                 goto out;
>
> +       /* Ensure the files table is not shared. */
> +       retval = unshare_files(&displaced);
> +       if (retval)
> +               goto out;
> +       if (displaced)
> +               put_files_struct(displaced);

It's not obvious from the patch (not enough context), but the new
placement seems to make much more sense - and it's where we do the
de-thread and switch the vm and signals too.

So this does seem to be the much more logical place.

Ack.

      Linus
Al Viro Dec. 7, 2020, 10:22 p.m.
On Fri, Nov 20, 2020 at 05:14:18PM -0600, Eric W. Biederman wrote:
> @@ -1805,8 +1808,12 @@ static int bprm_execve(struct linux_binprm *bprm,
>  	bprm->file = file;
>  	/*
>  	 * Record that a name derived from an O_CLOEXEC fd will be
> -	 * inaccessible after exec. Relies on having exclusive access to
> -	 * current->files (due to unshare_files above).
> +	 * inaccessible after exec.  This allows the code in exec to
> +	 * choose to fail when the executable is not mmaped into the
> +	 * interpreter and an open file descriptor is not passed to
> +	 * the interpreter.  This makes for a better user experience
> +	 * than having the interpreter start and then immediately fail
> +	 * when it finds the executable is inaccessible.
>  	 */
>  	if (bprm->fdpath &&
>  	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))

We do not have rcu_read_lock() here.  What would happen if
	* we get fdt
	* another thread does e.g. dup2() with target descriptor greater than
the current size
	* old fdt gets copied and (RCU-delayed) freed
	* nobody is holding rcu_read_lock(), so it gets really freed
	* we read a bitmap from the already freed sucker

It's a narrow window, but on SMP KVM it's not impossible to hit; if you
have preemption enabled, the race window is not so narrow even when
running on bare metal.  In the mainline it is safe, but only because
the damn thing is guaranteed to be _not_ shared with any other thread
(which is what the comment had been about).  Why not simply say
	if (bprm->fdpath && get_close_on_exec(fd))
anyway?
Eric W. Biederman Dec. 7, 2020, 11:07 p.m.
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Fri, Nov 20, 2020 at 05:14:18PM -0600, Eric W. Biederman wrote:
>> @@ -1805,8 +1808,12 @@ static int bprm_execve(struct linux_binprm *bprm,
>>  	bprm->file = file;
>>  	/*
>>  	 * Record that a name derived from an O_CLOEXEC fd will be
>> -	 * inaccessible after exec. Relies on having exclusive access to
>> -	 * current->files (due to unshare_files above).
>> +	 * inaccessible after exec.  This allows the code in exec to
>> +	 * choose to fail when the executable is not mmaped into the
>> +	 * interpreter and an open file descriptor is not passed to
>> +	 * the interpreter.  This makes for a better user experience
>> +	 * than having the interpreter start and then immediately fail
>> +	 * when it finds the executable is inaccessible.
>>  	 */
>>  	if (bprm->fdpath &&
>>  	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
>
> We do not have rcu_read_lock() here.  What would happen if
> 	* we get fdt
> 	* another thread does e.g. dup2() with target descriptor greater than
> the current size
> 	* old fdt gets copied and (RCU-delayed) freed
> 	* nobody is holding rcu_read_lock(), so it gets really freed
> 	* we read a bitmap from the already freed sucker
>
> It's a narrow window, but on SMP KVM it's not impossible to hit; if you
> have preemption enabled, the race window is not so narrow even when
> running on bare metal.  In the mainline it is safe, but only because
> the damn thing is guaranteed to be _not_ shared with any other thread
> (which is what the comment had been about).  Why not simply say
> 	if (bprm->fdpath && get_close_on_exec(fd))
> anyway?

My mistake.  I missed that the actual code was highly optimized and only
safe in the presence of an unshared files struct.

What I saw and what I thought the comment was talking about was that the
result of the test isn't guaranteed to be stable without having an
exclusive access to the files.  I figured and if something changes later
and it becomes safe to pass the file name to a script later we don't
really care.

In any event thank you for the review.  I will queue up a follow on
patch that makes this use get_close_on_exec.

Eric
Al Viro Dec. 7, 2020, 11:35 p.m.
On Mon, Dec 07, 2020 at 05:07:55PM -0600, Eric W. Biederman wrote:

> My mistake.  I missed that the actual code was highly optimized and only
> safe in the presence of an unshared files struct.

That's a polite way to spell "overoptimized for no good reason" ;-)

> What I saw and what I thought the comment was talking about was that the
> result of the test isn't guaranteed to be stable without having an
> exclusive access to the files.  I figured and if something changes later
> and it becomes safe to pass the file name to a script later we don't
> really care.
> 
> In any event thank you for the review.  I will queue up a follow on
> patch that makes this use get_close_on_exec.

I would rather put that switch to get_close_on_exec() first in the queue
(or fold it into this patch)...
Eric W. Biederman Dec. 8, 2020, 11:16 p.m.
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Dec 07, 2020 at 05:07:55PM -0600, Eric W. Biederman wrote:
>
>> My mistake.  I missed that the actual code was highly optimized and only
>> safe in the presence of an unshared files struct.
>
> That's a polite way to spell "overoptimized for no good reason" ;-)
>
>> What I saw and what I thought the comment was talking about was that the
>> result of the test isn't guaranteed to be stable without having an
>> exclusive access to the files.  I figured and if something changes later
>> and it becomes safe to pass the file name to a script later we don't
>> really care.
>> 
>> In any event thank you for the review.  I will queue up a follow on
>> patch that makes this use get_close_on_exec.
>
> I would rather put that switch to get_close_on_exec() first in the queue
> (or fold it into this patch)...

All else being equal I would like that placement too.  With the rest of
the patches in linux-next I don't want to mess unless it is bad enough
to breaks someone's bisection of the kernel.

Eric