ucount: use-after-free read in inc_ucount & dec_ucount

Submitted by Nikolay Borisov on March 4, 2017, 12:10 p.m.

Details

Message ID c69f8f03-4324-f934-eed2-643c91d703c0@gmail.com
State New
Series "ucount: use-after-free read in inc_ucount & dec_ucount"
Headers show

Commit Message

Nikolay Borisov March 4, 2017, 12:10 p.m.
On  4.03.2017 14:01, Dmitry Vyukov wrote:
> On Sat, Mar 4, 2017 at 12:50 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> <n.borisov.lkml@gmail.com> wrote:
>>>> [Addressing Dmitry Vyukov to ask for syzkaller clarification]
>>>>
>>>> On  3.03.2017 18:30, Eric W. Biederman wrote:
>>>>> Nikolay Borisov <n.borisov.lkml@gmail.com> writes:
>>>>>
>>>>>> [Added containers ml, Eric Biederman and Jan Kara]. Please,
>>>>>> next time don't add random people but take the time to see who touched
>>>>>> the code.
>>>>>
>>>>> Comments below.
>>>>>
>>>>>> On  3.03.2017 14:16, JongHwan Kim wrote:
>>>>>>> I've got the following report with syzkaller fuzzer
>>>>>>>
>>>>>>>
>>>>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .
>>>>>>>
>>>>>>> ==================================================================
>>>>>>> BUG: KASAN: use-after-free in __read_once_size
>>>>>>> include/linux/compiler.h:254 [inline] at addr ffff88006d399bc4
>>>>>>> BUG: KASAN: use-after-free in atomic_read
>>>>>>> arch/x86/include/asm/atomic.h:26 [inline] at addr ffff88006d399bc4
>>>>>>> BUG: KASAN: use-after-free in atomic_dec_if_positive
>>>>>>> include/linux/atomic.h:616 [inline] at addr ffff88006d399bc4
>>>>>>> BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>> at addr ffff88006d399bc4
>>>>>>> Read of size 4 by task syz-executor3/19713
>>>>>>> CPU: 1 PID: 19713 Comm: syz-executor3 Not tainted 4.10.0+ #4
>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>> Call Trace:
>>>>>>>  __dump_stack lib/dump_stack.c:15 [inline]
>>>>>>>  dump_stack+0x115/0x1cf lib/dump_stack.c:51
>>>>>>>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>>>>>>>  print_address_description mm/kasan/report.c:200 [inline]
>>>>>>>  kasan_report_error mm/kasan/report.c:289 [inline]
>>>>>>>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>>>>>>>  kasan_report mm/kasan/report.c:331 [inline]
>>>>>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
>>>>>>>  __read_once_size include/linux/compiler.h:254 [inline]
>>>>>>>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>>>>>>>  atomic_dec_if_positive include/linux/atomic.h:616 [inline]
>>>>>>>  dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>
>>>>>> So PID 19713 is exitting and as part of it it's freeing its file
>>>>>> descriptors, one of which is apparently an inotify fd. And this has
>>>>>> already been freed.
>>>>>>
>>>>>>
>>>>>>> RIP: 0033:0x44fb79
>>>>>>> RSP: 002b:00007ffd0f00f6d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
>>>>>>> RAX: fffffffffffffdfc RBX: 0000000000708024 RCX: 000000000044fb79
>>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708024
>>>>>>> RBP: 00000000000ae8e6 R08: 0000000000708000 R09: 000000160000000d
>>>>>>> R10: 00007ffd0f00f710 R11: 0000000000000206 R12: 0000000000708000
>>>>>>> R13: 0000000000708024 R14: 00000000000ae8a1 R15: 0000000000000016
>>>>>>> Object at ffff88006d399b88, in cache kmalloc-96 size: 96
>>>>>>> Allocated:
>>>>>>> PID = 19691
>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
>>>>>>>  kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
>>>>>>>  kmalloc include/linux/slab.h:490 [inline]
>>>>>>>  kzalloc include/linux/slab.h:663 [inline]
>>>>>>>  get_ucounts kernel/ucount.c:140 [inline]
>>>>>>>  inc_ucount+0x538/0xa70 kernel/ucount.c:195
>>>>>>>  inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
>>>>>>>  SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
>>>>>>>  SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
>>>>>>>  sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
>>>>>>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>>>>
>>>>>> However, it has been actually allocated by a different process with pid
>>>>>> 19691.
>>>>>>
>>>>>>
>>>>>>> Freed:
>>>>>>> PID = 19708
>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
>>>>>>>  slab_free_hook mm/slub.c:1357 [inline]
>>>>>>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>>>>>>>  slab_free mm/slub.c:2961 [inline]
>>>>>>>  kfree+0xe8/0x2c0 mm/slub.c:3882
>>>>>>>  put_ucounts+0x1dd/0x270 kernel/ucount.c:172
>>>>>>>  dec_ucount+0x172/0x210 kernel/ucount.c:220
>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>
>>>>>> And yet we have a third process which freed it, PID 19708. So there is
>>>>>> some dance happening with this fd, being allocated by one process,
>>>>>> handed over to 2 more, which are freeing it. Is this a valid usage
>>>>>> scenario of inotify descriptors?
>>>>>
>>>>> They are file descriptors so passing them around is valid.  That is
>>>>> something unix domain sockets have allowed since the dawn of linux.
>>>>>
>>>>> The dance would need to be the fd being passed to the addtional
>>>>> processes and then closed in the original before being closed
>>>>> in the processes the fd was passed to.
>>>>>
>>>>> If those additional processes last longer than the original process this
>>>>> is easy to achieve.
>>>>>
>>>>> My guess is that someone just taught syskallzer to pass file descriptors
>>>>> around.  So this may be an old bug.  Either that or syskallzer hasn't
>>>>> been looking at linux-next with KASAN enabled in the kernel.
>>>>
>>>> Dmitry, can you tell if syzkaller tests sending file descriptors across
>>>> sockets? Since the calltraces here show multiple processes being
>>>> involved in different operations on the exact same file descriptor.
>>>>
>>>> Also JongHwan, can you provide the full, compilable reproducer to try
>>>> and track this issue down?
>>>
>>>
>>> syzkaller can pass descriptors across sockets, but currently only
>>> within a single multi-threaded process.
>>>
>>> Are you sure it's the same descriptor? It seems to me that it's struct
>>> ucounts, which is shared via the global ucounts_hashtable, so no
>>> sharing is required in user processes.
>>>
>>> Unless I am missing something, we want:
>>>
>>> @@ -154,7 +155,7 @@ static struct ucounts *get_ucounts(struct
>>> user_namespace *ns, kuid_t uid)
>>>                         ucounts = new;
>>>                 }
>>>         }
>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>> +       if (!atomic_add_unless(&ucounts->count, 1, 0))
>>>                 ucounts = NULL;
>>>         spin_unlock_irq(&ucounts_lock);
>>>         return ucounts;
>>>
>>> no?
>>>
>>> put_ucounts drops the last reference, then get_ucounts finds the
>>> ucounts and successfully increments refcount as it's not INT_MAX (it's
>>> 0) and starts using it, meanwhile put_ucounts proceeds to
>>> unconditionally deleting the ucounts.
>>
>>
>> It also seems that a concurrent put_ucounts can make get_ucounts fail
>> _spuriously_, which does not look good.
>> Don't we want something along the following lines?
>>
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 8a11fc0cb459..233c8e46acd5 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct
>> user_namespace *ns, kuid_t uid)
>>
>>                 new->ns = ns;
>>                 new->uid = uid;
>> -               atomic_set(&new->count, 0);
>> +               atomic_set(&new->count, 1);
>>
>>                 spin_lock_irq(&ucounts_lock);
>>                 ucounts = find_ucounts(ns, uid, hashent);
>>                 if (ucounts) {
>> +                       atomic_inc(&ucounts->count);
>>                         kfree(new);
>>                 } else {
>>                         hlist_add_head(&new->node, hashent);
>>                         ucounts = new;
>>                 }
>>         }
>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>> -               ucounts = NULL;
>>         spin_unlock_irq(&ucounts_lock);
>>         return ucounts;
>>  }
>> @@ -166,7 +165,10 @@ static void put_ucounts(struct ucounts *ucounts)
>>
>>         if (atomic_dec_and_test(&ucounts->count)) {
>>                 spin_lock_irqsave(&ucounts_lock, flags);
>> -               hlist_del_init(&ucounts->node);
>> +               if (atomic_read(&ucounts->count) == 0)
>> +                       hlist_del_init(&ucounts->node);
>> +               else
>> +                       ucounts = NULL;
>>                 spin_unlock_irqrestore(&ucounts_lock, flags);
>>
>>                 kfree(ucounts);
> 
> 
> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> 
> This is broken per se. Need something more elaborate.
> 


How about this :


This makes the atomic_dec_and_test and hashtable removal atomic in essence.

Patch hide | download patch | download mbox

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8a11fc0cb459..b817ac0e587c 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -166,11 +166,15 @@  static void put_ucounts(struct ucounts *ucounts)

        if (atomic_dec_and_test(&ucounts->count)) {
                spin_lock_irqsave(&ucounts_lock, flags);
-               hlist_del_init(&ucounts->node);
-               spin_unlock_irqrestore(&ucounts_lock, flags);
-
-               kfree(ucounts);
+               if (!atomic_read(&ucounts->count)) {
+                       hlist_del_init(&ucounts->node);
+                       spin_unlock_irqrestore(&ucounts_lock, flags);
+                       kfree(ucounts);
+                       return;
+               }
        }
+
+       spin_unlock_irqrestore(&ucounts_lock, flags);
 }



Comments

Jann Horn via Containers March 4, 2017, 12:15 p.m.
On Sat, Mar 4, 2017 at 1:10 PM, Nikolay Borisov
<n.borisov.lkml@gmail.com> wrote:
>
>
> On  4.03.2017 14:01, Dmitry Vyukov wrote:
>> On Sat, Mar 4, 2017 at 12:50 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>> <n.borisov.lkml@gmail.com> wrote:
>>>>> [Addressing Dmitry Vyukov to ask for syzkaller clarification]
>>>>>
>>>>> On  3.03.2017 18:30, Eric W. Biederman wrote:
>>>>>> Nikolay Borisov <n.borisov.lkml@gmail.com> writes:
>>>>>>
>>>>>>> [Added containers ml, Eric Biederman and Jan Kara]. Please,
>>>>>>> next time don't add random people but take the time to see who touched
>>>>>>> the code.
>>>>>>
>>>>>> Comments below.
>>>>>>
>>>>>>> On  3.03.2017 14:16, JongHwan Kim wrote:
>>>>>>>> I've got the following report with syzkaller fuzzer
>>>>>>>>
>>>>>>>>
>>>>>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .
>>>>>>>>
>>>>>>>> ==================================================================
>>>>>>>> BUG: KASAN: use-after-free in __read_once_size
>>>>>>>> include/linux/compiler.h:254 [inline] at addr ffff88006d399bc4
>>>>>>>> BUG: KASAN: use-after-free in atomic_read
>>>>>>>> arch/x86/include/asm/atomic.h:26 [inline] at addr ffff88006d399bc4
>>>>>>>> BUG: KASAN: use-after-free in atomic_dec_if_positive
>>>>>>>> include/linux/atomic.h:616 [inline] at addr ffff88006d399bc4
>>>>>>>> BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>> at addr ffff88006d399bc4
>>>>>>>> Read of size 4 by task syz-executor3/19713
>>>>>>>> CPU: 1 PID: 19713 Comm: syz-executor3 Not tainted 4.10.0+ #4
>>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>> Call Trace:
>>>>>>>>  __dump_stack lib/dump_stack.c:15 [inline]
>>>>>>>>  dump_stack+0x115/0x1cf lib/dump_stack.c:51
>>>>>>>>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>>>>>>>>  print_address_description mm/kasan/report.c:200 [inline]
>>>>>>>>  kasan_report_error mm/kasan/report.c:289 [inline]
>>>>>>>>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>>>>>>>>  kasan_report mm/kasan/report.c:331 [inline]
>>>>>>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
>>>>>>>>  __read_once_size include/linux/compiler.h:254 [inline]
>>>>>>>>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>>>>>>>>  atomic_dec_if_positive include/linux/atomic.h:616 [inline]
>>>>>>>>  dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>>
>>>>>>> So PID 19713 is exitting and as part of it it's freeing its file
>>>>>>> descriptors, one of which is apparently an inotify fd. And this has
>>>>>>> already been freed.
>>>>>>>
>>>>>>>
>>>>>>>> RIP: 0033:0x44fb79
>>>>>>>> RSP: 002b:00007ffd0f00f6d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
>>>>>>>> RAX: fffffffffffffdfc RBX: 0000000000708024 RCX: 000000000044fb79
>>>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708024
>>>>>>>> RBP: 00000000000ae8e6 R08: 0000000000708000 R09: 000000160000000d
>>>>>>>> R10: 00007ffd0f00f710 R11: 0000000000000206 R12: 0000000000708000
>>>>>>>> R13: 0000000000708024 R14: 00000000000ae8a1 R15: 0000000000000016
>>>>>>>> Object at ffff88006d399b88, in cache kmalloc-96 size: 96
>>>>>>>> Allocated:
>>>>>>>> PID = 19691
>>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
>>>>>>>>  kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
>>>>>>>>  kmalloc include/linux/slab.h:490 [inline]
>>>>>>>>  kzalloc include/linux/slab.h:663 [inline]
>>>>>>>>  get_ucounts kernel/ucount.c:140 [inline]
>>>>>>>>  inc_ucount+0x538/0xa70 kernel/ucount.c:195
>>>>>>>>  inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
>>>>>>>>  SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
>>>>>>>>  SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
>>>>>>>>  sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
>>>>>>>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>>>>>
>>>>>>> However, it has been actually allocated by a different process with pid
>>>>>>> 19691.
>>>>>>>
>>>>>>>
>>>>>>>> Freed:
>>>>>>>> PID = 19708
>>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
>>>>>>>>  slab_free_hook mm/slub.c:1357 [inline]
>>>>>>>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>>>>>>>>  slab_free mm/slub.c:2961 [inline]
>>>>>>>>  kfree+0xe8/0x2c0 mm/slub.c:3882
>>>>>>>>  put_ucounts+0x1dd/0x270 kernel/ucount.c:172
>>>>>>>>  dec_ucount+0x172/0x210 kernel/ucount.c:220
>>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>>
>>>>>>> And yet we have a third process which freed it, PID 19708. So there is
>>>>>>> some dance happening with this fd, being allocated by one process,
>>>>>>> handed over to 2 more, which are freeing it. Is this a valid usage
>>>>>>> scenario of inotify descriptors?
>>>>>>
>>>>>> They are file descriptors so passing them around is valid.  That is
>>>>>> something unix domain sockets have allowed since the dawn of linux.
>>>>>>
>>>>>> The dance would need to be the fd being passed to the addtional
>>>>>> processes and then closed in the original before being closed
>>>>>> in the processes the fd was passed to.
>>>>>>
>>>>>> If those additional processes last longer than the original process this
>>>>>> is easy to achieve.
>>>>>>
>>>>>> My guess is that someone just taught syskallzer to pass file descriptors
>>>>>> around.  So this may be an old bug.  Either that or syskallzer hasn't
>>>>>> been looking at linux-next with KASAN enabled in the kernel.
>>>>>
>>>>> Dmitry, can you tell if syzkaller tests sending file descriptors across
>>>>> sockets? Since the calltraces here show multiple processes being
>>>>> involved in different operations on the exact same file descriptor.
>>>>>
>>>>> Also JongHwan, can you provide the full, compilable reproducer to try
>>>>> and track this issue down?
>>>>
>>>>
>>>> syzkaller can pass descriptors across sockets, but currently only
>>>> within a single multi-threaded process.
>>>>
>>>> Are you sure it's the same descriptor? It seems to me that it's struct
>>>> ucounts, which is shared via the global ucounts_hashtable, so no
>>>> sharing is required in user processes.
>>>>
>>>> Unless I am missing something, we want:
>>>>
>>>> @@ -154,7 +155,7 @@ static struct ucounts *get_ucounts(struct
>>>> user_namespace *ns, kuid_t uid)
>>>>                         ucounts = new;
>>>>                 }
>>>>         }
>>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>>> +       if (!atomic_add_unless(&ucounts->count, 1, 0))
>>>>                 ucounts = NULL;
>>>>         spin_unlock_irq(&ucounts_lock);
>>>>         return ucounts;
>>>>
>>>> no?
>>>>
>>>> put_ucounts drops the last reference, then get_ucounts finds the
>>>> ucounts and successfully increments refcount as it's not INT_MAX (it's
>>>> 0) and starts using it, meanwhile put_ucounts proceeds to
>>>> unconditionally deleting the ucounts.
>>>
>>>
>>> It also seems that a concurrent put_ucounts can make get_ucounts fail
>>> _spuriously_, which does not look good.
>>> Don't we want something along the following lines?
>>>
>>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>>> index 8a11fc0cb459..233c8e46acd5 100644
>>> --- a/kernel/ucount.c
>>> +++ b/kernel/ucount.c
>>> @@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct
>>> user_namespace *ns, kuid_t uid)
>>>
>>>                 new->ns = ns;
>>>                 new->uid = uid;
>>> -               atomic_set(&new->count, 0);
>>> +               atomic_set(&new->count, 1);
>>>
>>>                 spin_lock_irq(&ucounts_lock);
>>>                 ucounts = find_ucounts(ns, uid, hashent);
>>>                 if (ucounts) {
>>> +                       atomic_inc(&ucounts->count);
>>>                         kfree(new);
>>>                 } else {
>>>                         hlist_add_head(&new->node, hashent);
>>>                         ucounts = new;
>>>                 }
>>>         }
>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>> -               ucounts = NULL;
>>>         spin_unlock_irq(&ucounts_lock);
>>>         return ucounts;
>>>  }
>>> @@ -166,7 +165,10 @@ static void put_ucounts(struct ucounts *ucounts)
>>>
>>>         if (atomic_dec_and_test(&ucounts->count)) {
>>>                 spin_lock_irqsave(&ucounts_lock, flags);
>>> -               hlist_del_init(&ucounts->node);
>>> +               if (atomic_read(&ucounts->count) == 0)
>>> +                       hlist_del_init(&ucounts->node);
>>> +               else
>>> +                       ucounts = NULL;
>>>                 spin_unlock_irqrestore(&ucounts_lock, flags);
>>>
>>>                 kfree(ucounts);
>>
>>
>> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>>
>> This is broken per se. Need something more elaborate.
>>
>
>
> How about this :
>
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 8a11fc0cb459..b817ac0e587c 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -166,11 +166,15 @@ static void put_ucounts(struct ucounts *ucounts)
>
>         if (atomic_dec_and_test(&ucounts->count)) {
>                 spin_lock_irqsave(&ucounts_lock, flags);
> -               hlist_del_init(&ucounts->node);
> -               spin_unlock_irqrestore(&ucounts_lock, flags);
> -
> -               kfree(ucounts);
> +               if (!atomic_read(&ucounts->count)) {
> +                       hlist_del_init(&ucounts->node);
> +                       spin_unlock_irqrestore(&ucounts_lock, flags);
> +                       kfree(ucounts);
> +                       return;
> +               }
>         }
> +
> +       spin_unlock_irqrestore(&ucounts_lock, flags);
>  }
>
>
>
> This makes the atomic_dec_and_test and hashtable removal atomic in essence.


This won't work.
Consider the following scenario:
Thread 0 decrements count to 0 here:
  if (atomic_dec_and_test(&ucounts->count)) {
Then thread 1 calls get_ucounts, increments count to 1, then calls
put_ucounts, decrements count to 0 and unhashes and frees ucounts.
Not thread 0 does:
  if (!atomic_read(&ucounts->count)) {
but ucounts is already freed!
쪼르 March 4, 2017, 12:35 p.m.
Hi, This is my new one report about dec_ucount:
ps.Sorry for my uncomfortable report. This is my first usage of lkml.

Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .

==================================================================
BUG: KASAN: use-after-free in __read_once_size
include/linux/compiler.h:254 [inline] at addr ffff8800372f21fc
BUG: KASAN: use-after-free in atomic_read
arch/x86/include/asm/atomic.h:26 [inline] at addr ffff8800372f21fc
BUG: KASAN: use-after-free in atomic_dec_if_positive
include/linux/atomic.h:616 [inline] at addr ffff8800372f21fc
BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210
kernel/ucount.c:217 at addr ffff8800372f21fc
Read of size 4 by task syz-executor0/19190
CPU: 1 PID: 19190 Comm: syz-executor0 Not tainted 4.10.0+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x115/0x1cf lib/dump_stack.c:51
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
 print_address_description mm/kasan/report.c:200 [inline]
 kasan_report_error mm/kasan/report.c:289 [inline]
 kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
 kasan_report mm/kasan/report.c:331 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
 __read_once_size include/linux/compiler.h:254 [inline]
 atomic_read arch/x86/include/asm/atomic.h:26 [inline]
 atomic_dec_if_positive include/linux/atomic.h:616 [inline]
 dec_ucount+0x1e5/0x210 kernel/ucount.c:217
 dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
 inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
 fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
 fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
 fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
 inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
 __fput+0x327/0x7e0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa45/0x1b20 kernel/exit.c:873
 do_group_exit+0x149/0x400 kernel/exit.c:977
 get_signal+0x7d5/0x1810 kernel/signal.c:2313
 do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
 exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
 prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
 syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
 entry_SYSCALL_64_fastpath+0xc0/0xc2
RIP: 0033:0x44fb79
RSP: 002b:00007f048a1cacf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000708218 RCX: 000000000044fb79
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708218
RBP: 00000000007081f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fff84d701af R14: 00007f048a1cb9c0 R15: 000000000000000e
Object at ffff8800372f21c0, in cache kmalloc-96 size: 96
Allocated:
PID = 19163
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
 kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
 kmalloc include/linux/slab.h:490 [inline]
 kzalloc include/linux/slab.h:663 [inline]
 get_ucounts kernel/ucount.c:140 [inline]
 inc_ucount+0x538/0xa70 kernel/ucount.c:195
 inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
 SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
 SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
 sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
 entry_SYSCALL_64_fastpath+0x1f/0xc2
Freed:
PID = 19163
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
 save_stack+0x43/0xd0 mm/kasan/kasan.c:502
 set_track mm/kasan/kasan.c:514 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0xe8/0x2c0 mm/slub.c:3882
 put_ucounts+0x1dd/0x270 kernel/ucount.c:172
 dec_ucount+0x172/0x210 kernel/ucount.c:220
 dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
 inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
 fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
 fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
 fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
 inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
 __fput+0x327/0x7e0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa45/0x1b20 kernel/exit.c:873
 do_group_exit+0x149/0x400 kernel/exit.c:977
 get_signal+0x7d5/0x1810 kernel/signal.c:2313
 do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
 exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
 prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
 syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
 entry_SYSCALL_64_fastpath+0xc0/0xc2
Memory state around the buggy address:
 ffff8800372f2080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8800372f2100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800372f2180: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                ^
 ffff8800372f2200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8800372f2280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 19190 Comm: syz-executor0 Tainted: G    B           4.10.0+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x115/0x1cf lib/dump_stack.c:51
 panic+0x1b4/0x392 kernel/panic.c:179
 kasan_end_report+0x5b/0x60 mm/kasan/report.c:141
 kasan_report_error mm/kasan/report.c:293 [inline]
 kasan_report.part.1+0x40a/0x4e0 mm/kasan/report.c:311
 kasan_report mm/kasan/report.c:331 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
 __read_once_size include/linux/compiler.h:254 [inline]
 atomic_read arch/x86/include/asm/atomic.h:26 [inline]
 atomic_dec_if_positive include/linux/atomic.h:616 [inline]
 dec_ucount+0x1e5/0x210 kernel/ucount.c:217
 dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
 inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
 fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
 fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
 fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
 inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
 __fput+0x327/0x7e0 fs/file_table.c:208
 ____fput+0x15/0x20 fs/file_table.c:244
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa45/0x1b20 kernel/exit.c:873
 do_group_exit+0x149/0x400 kernel/exit.c:977
 get_signal+0x7d5/0x1810 kernel/signal.c:2313
 do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
 exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
 prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
 syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
 entry_SYSCALL_64_fastpath+0xc0/0xc2
RIP: 0033:0x44fb79
RSP: 002b:00007f048a1cacf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000708218 RCX: 000000000044fb79
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708218
RBP: 00000000007081f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fff84d701af R14: 00007f048a1cb9c0 R15: 000000000000000e
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


Syzkaller reproducer:
# {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
semget$private(0x0, 0x400001003, 0x181)


C reproducer:
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#ifndef __NR_semget
#define __NR_semget 64
#endif

#define __STDC_VERSION__ 201112L

#define _GNU_SOURCE

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <linux/capability.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <linux/kvm.h>
#include <linux/sched.h>
#include <net/if_arp.h>

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

const int kFailStatus = 67;
const int kErrorStatus = 68;
const int kRetryStatus = 69;

__attribute__((noreturn)) void doexit(int status)
{
  volatile unsigned i;
  syscall(__NR_exit_group, status);
  for (i = 0;; i++) {
  }
}

__attribute__((noreturn)) void fail(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(e == ENOMEM ? kRetryStatus : kFailStatus);
}

__attribute__((noreturn)) void exitf(const char* msg, ...)
{
  int e = errno;
  fflush(stdout);
  va_list args;
  va_start(args, msg);
  vfprintf(stderr, msg, args);
  va_end(args);
  fprintf(stderr, " (errno %d)\n", e);
  doexit(kRetryStatus);
}

static int flag_debug;

void debug(const char* msg, ...)
{
  if (!flag_debug)
    return;
  va_list args;
  va_start(args, msg);
  vfprintf(stdout, msg, args);
  va_end(args);
  fflush(stdout);
}

__thread int skip_segv;
__thread jmp_buf segv_env;

static void segv_handler(int sig, siginfo_t* info, void* uctx)
{
  uintptr_t addr = (uintptr_t)info->si_addr;
  const uintptr_t prog_start = 1 << 20;
  const uintptr_t prog_end = 100 << 20;
  if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED) &&
      (addr < prog_start || addr > prog_end)) {
    debug("SIGSEGV on %p, skipping\n", addr);
    _longjmp(segv_env, 1);
  }
  debug("SIGSEGV on %p, exiting\n", addr);
  doexit(sig);
  for (;;) {
  }
}

static void install_segv_handler()
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_NODEFER | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);
}

#define NONFAILING(...)                                                \
  {                                                                    \
    __atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST);               \
    if (_setjmp(segv_env) == 0) {                                      \
      __VA_ARGS__;                                                     \
    }                                                                  \
    __atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST);               \
  }

#define BITMASK_LEN(type, bf_len) (type)((1ull << (bf_len)) - 1)

#define BITMASK_LEN_OFF(type, bf_off, bf_len)                          \
  (type)(BITMASK_LEN(type, (bf_len)) << (bf_off))

#define STORE_BY_BITMASK(type, addr, val, bf_off, bf_len)              \
  if ((bf_off) == 0 && (bf_len) == 0) {                                \
    *(type*)(addr) = (type)(val);                                      \
  } else {                                                             \
    type new_val = *(type*)(addr);                                     \
    new_val &= ~BITMASK_LEN_OFF(type, (bf_off), (bf_len));             \
    new_val |= ((type)(val)&BITMASK_LEN(type, (bf_len))) << (bf_off);  \
    *(type*)(addr) = new_val;                                          \
  }

static uintptr_t execute_syscall(int nr, uintptr_t a0, uintptr_t a1,
                                 uintptr_t a2, uintptr_t a3,
                                 uintptr_t a4, uintptr_t a5,
                                 uintptr_t a6, uintptr_t a7,
                                 uintptr_t a8)
{
  switch (nr) {
  default:
    return syscall(nr, a0, a1, a2, a3, a4, a5);
  }
}

static void setup_main_process()
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = SIG_IGN;
  syscall(SYS_rt_sigaction, 0x20, &sa, NULL, 8);
  syscall(SYS_rt_sigaction, 0x21, &sa, NULL, 8);
  install_segv_handler();

  char tmpdir_template[] = "./syzkaller.XXXXXX";
  char* tmpdir = mkdtemp(tmpdir_template);
  if (!tmpdir)
    fail("failed to mkdtemp");
  if (chmod(tmpdir, 0777))
    fail("failed to chmod");
  if (chdir(tmpdir))
    fail("failed to chdir");
}

static void loop();

static void sandbox_common()
{
  prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
  setpgrp();
  setsid();

  struct rlimit rlim;
  rlim.rlim_cur = rlim.rlim_max = 128 << 20;
  setrlimit(RLIMIT_AS, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 1 << 20;
  setrlimit(RLIMIT_FSIZE, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 1 << 20;
  setrlimit(RLIMIT_STACK, &rlim);
  rlim.rlim_cur = rlim.rlim_max = 0;
  setrlimit(RLIMIT_CORE, &rlim);

  unshare(CLONE_NEWNS);
  unshare(CLONE_NEWIPC);
  unshare(CLONE_IO);
}

static int do_sandbox_setuid(int executor_pid, bool enable_tun)
{
  int pid = fork();
  if (pid)
    return pid;

  sandbox_common();

  const int nobody = 65534;
  if (setgroups(0, NULL))
    fail("failed to setgroups");
  if (syscall(SYS_setresgid, nobody, nobody, nobody))
    fail("failed to setresgid");
  if (syscall(SYS_setresuid, nobody, nobody, nobody))
    fail("failed to setresuid");

  loop();
  doexit(1);
}

static void remove_dir(const char* dir)
{
  DIR* dp;
  struct dirent* ep;
  int iter = 0;
retry:
  dp = opendir(dir);
  if (dp == NULL) {
    if (errno == EMFILE) {
      exitf("opendir(%s) failed due to NOFILE, exiting");
    }
    exitf("opendir(%s) failed", dir);
  }
  while ((ep = readdir(dp))) {
    if (strcmp(ep->d_name, ".") == 0 || strcmp(ep->d_name, "..") == 0)
      continue;
    char filename[FILENAME_MAX];
    snprintf(filename, sizeof(filename), "%s/%s", dir, ep->d_name);
    struct stat st;
    if (lstat(filename, &st))
      exitf("lstat(%s) failed", filename);
    if (S_ISDIR(st.st_mode)) {
      remove_dir(filename);
      continue;
    }
    int i;
    for (i = 0;; i++) {
      debug("unlink(%s)\n", filename);
      if (unlink(filename) == 0)
        break;
      if (errno == EROFS) {
        debug("ignoring EROFS\n");
        break;
      }
      if (errno != EBUSY || i > 100)
        exitf("unlink(%s) failed", filename);
      debug("umount(%s)\n", filename);
      if (umount2(filename, MNT_DETACH))
        exitf("umount(%s) failed", filename);
    }
  }
  closedir(dp);
  int i;
  for (i = 0;; i++) {
    debug("rmdir(%s)\n", dir);
    if (rmdir(dir) == 0)
      break;
    if (i < 100) {
      if (errno == EROFS) {
        debug("ignoring EROFS\n");
        break;
      }
      if (errno == EBUSY) {
        debug("umount(%s)\n", dir);
        if (umount2(dir, MNT_DETACH))
          exitf("umount(%s) failed", dir);
        continue;
      }
      if (errno == ENOTEMPTY) {
        if (iter < 100) {
          iter++;
          goto retry;
        }
      }
    }
    exitf("rmdir(%s) failed", dir);
  }
}

static uint64_t current_time_ms()
{
  struct timespec ts;

  if (clock_gettime(CLOCK_MONOTONIC, &ts))
    fail("clock_gettime failed");
  return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static void test();

void loop()
{
  int iter;
  for (iter = 0;; iter++) {
    char cwdbuf[256];
    sprintf(cwdbuf, "./%d", iter);
    if (mkdir(cwdbuf, 0777))
      fail("failed to mkdir");
    int pid = fork();
    if (pid < 0)
      fail("clone failed");
    if (pid == 0) {
      prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
      setpgrp();
      if (chdir(cwdbuf))
        fail("failed to chdir");
      test();
      doexit(0);
    }
    int status = 0;
    uint64_t start = current_time_ms();
    for (;;) {
      int res = waitpid(-1, &status, __WALL | WNOHANG);
      if (res == pid)
        break;
      usleep(1000);
      if (current_time_ms() - start > 5 * 1000) {
        kill(-pid, SIGKILL);
        kill(pid, SIGKILL);
        while (waitpid(-1, &status, __WALL) != pid) {
        }
        break;
      }
    }
    remove_dir(cwdbuf);
  }
}

long r[1];
void test()
{
  memset(r, -1, sizeof(r));
  r[0] = execute_syscall(__NR_semget, 0x0ul, 0x400001003ul, 0x181ul, 0,
                         0, 0, 0, 0, 0);
}
int main()
{
  setup_main_process();
  int pid = do_sandbox_setuid(0, false);
  int status = 0;
  while (waitpid(pid, &status, __WALL) != pid) {
  }
  return 0;
}



On Sat, Mar 4, 2017 at 9:15 PM, Dmitry Vyukov <dvyukov@google.com> wrote:

> On Sat, Mar 4, 2017 at 1:10 PM, Nikolay Borisov
> <n.borisov.lkml@gmail.com> wrote:
> >
> >
> > On  4.03.2017 14:01, Dmitry Vyukov wrote:
> >> On Sat, Mar 4, 2017 at 12:50 PM, Dmitry Vyukov <dvyukov@google.com>
> wrote:
> >>>> <n.borisov.lkml@gmail.com> wrote:
> >>>>> [Addressing Dmitry Vyukov to ask for syzkaller clarification]
> >>>>>
> >>>>> On  3.03.2017 18:30, Eric W. Biederman wrote:
> >>>>>> Nikolay Borisov <n.borisov.lkml@gmail.com> writes:
> >>>>>>
> >>>>>>> [Added containers ml, Eric Biederman and Jan Kara]. Please,
> >>>>>>> next time don't add random people but take the time to see who
> touched
> >>>>>>> the code.
> >>>>>>
> >>>>>> Comments below.
> >>>>>>
> >>>>>>> On  3.03.2017 14:16, JongHwan Kim wrote:
> >>>>>>>> I've got the following report with syzkaller fuzzer
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on
> commit .
> >>>>>>>>
> >>>>>>>> ============================================================
> ======
> >>>>>>>> BUG: KASAN: use-after-free in __read_once_size
> >>>>>>>> include/linux/compiler.h:254 [inline] at addr ffff88006d399bc4
> >>>>>>>> BUG: KASAN: use-after-free in atomic_read
> >>>>>>>> arch/x86/include/asm/atomic.h:26 [inline] at addr
> ffff88006d399bc4
> >>>>>>>> BUG: KASAN: use-after-free in atomic_dec_if_positive
> >>>>>>>> include/linux/atomic.h:616 [inline] at addr ffff88006d399bc4
> >>>>>>>> BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210
> kernel/ucount.c:217
> >>>>>>>> at addr ffff88006d399bc4
> >>>>>>>> Read of size 4 by task syz-executor3/19713
> >>>>>>>> CPU: 1 PID: 19713 Comm: syz-executor3 Not tainted 4.10.0+ #4
> >>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >>>>>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> >>>>>>>> Call Trace:
> >>>>>>>>  __dump_stack lib/dump_stack.c:15 [inline]
> >>>>>>>>  dump_stack+0x115/0x1cf lib/dump_stack.c:51
> >>>>>>>>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
> >>>>>>>>  print_address_description mm/kasan/report.c:200 [inline]
> >>>>>>>>  kasan_report_error mm/kasan/report.c:289 [inline]
> >>>>>>>>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
> >>>>>>>>  kasan_report mm/kasan/report.c:331 [inline]
> >>>>>>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
> >>>>>>>>  __read_once_size include/linux/compiler.h:254 [inline]
> >>>>>>>>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
> >>>>>>>>  atomic_dec_if_positive include/linux/atomic.h:616 [inline]
> >>>>>>>>  dec_ucount+0x1e5/0x210 kernel/ucount.c:217
> >>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
> >>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_
> fsnotify.c:169
> >>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
> >>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
> >>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
> >>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
> >>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
> >>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
> >>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
> >>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
> >>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
> >>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
> >>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
> >>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
> >>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
> >>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
> >>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
> >>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
> >>>>>>>
> >>>>>>> So PID 19713 is exitting and as part of it it's freeing its file
> >>>>>>> descriptors, one of which is apparently an inotify fd. And this has
> >>>>>>> already been freed.
> >>>>>>>
> >>>>>>>
> >>>>>>>> RIP: 0033:0x44fb79
> >>>>>>>> RSP: 002b:00007ffd0f00f6d8 EFLAGS: 00000206 ORIG_RAX:
> 00000000000000ca
> >>>>>>>> RAX: fffffffffffffdfc RBX: 0000000000708024 RCX: 000000000044fb79
> >>>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708024
> >>>>>>>> RBP: 00000000000ae8e6 R08: 0000000000708000 R09: 000000160000000d
> >>>>>>>> R10: 00007ffd0f00f710 R11: 0000000000000206 R12: 0000000000708000
> >>>>>>>> R13: 0000000000708024 R14: 00000000000ae8a1 R15: 0000000000000016
> >>>>>>>> Object at ffff88006d399b88, in cache kmalloc-96 size: 96
> >>>>>>>> Allocated:
> >>>>>>>> PID = 19691
> >>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
> >>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
> >>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
> >>>>>>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
> >>>>>>>>  kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
> >>>>>>>>  kmalloc include/linux/slab.h:490 [inline]
> >>>>>>>>  kzalloc include/linux/slab.h:663 [inline]
> >>>>>>>>  get_ucounts kernel/ucount.c:140 [inline]
> >>>>>>>>  inc_ucount+0x538/0xa70 kernel/ucount.c:195
> >>>>>>>>  inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_
> user.c:655
> >>>>>>>>  SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
> >>>>>>>>  SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
> >>>>>>>>  sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
> >>>>>>>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> >>>>>>>
> >>>>>>> However, it has been actually allocated by a different process
> with pid
> >>>>>>> 19691.
> >>>>>>>
> >>>>>>>
> >>>>>>>> Freed:
> >>>>>>>> PID = 19708
> >>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
> >>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
> >>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
> >>>>>>>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
> >>>>>>>>  slab_free_hook mm/slub.c:1357 [inline]
> >>>>>>>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
> >>>>>>>>  slab_free mm/slub.c:2961 [inline]
> >>>>>>>>  kfree+0xe8/0x2c0 mm/slub.c:3882
> >>>>>>>>  put_ucounts+0x1dd/0x270 kernel/ucount.c:172
> >>>>>>>>  dec_ucount+0x172/0x210 kernel/ucount.c:220
> >>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
> >>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_
> fsnotify.c:169
> >>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
> >>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
> >>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
> >>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
> >>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
> >>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
> >>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
> >>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
> >>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
> >>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
> >>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
> >>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
> >>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
> >>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
> >>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
> >>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
> >>>>>>>
> >>>>>>> And yet we have a third process which freed it, PID 19708. So
> there is
> >>>>>>> some dance happening with this fd, being allocated by one process,
> >>>>>>> handed over to 2 more, which are freeing it. Is this a valid usage
> >>>>>>> scenario of inotify descriptors?
> >>>>>>
> >>>>>> They are file descriptors so passing them around is valid.  That is
> >>>>>> something unix domain sockets have allowed since the dawn of linux.
> >>>>>>
> >>>>>> The dance would need to be the fd being passed to the addtional
> >>>>>> processes and then closed in the original before being closed
> >>>>>> in the processes the fd was passed to.
> >>>>>>
> >>>>>> If those additional processes last longer than the original process
> this
> >>>>>> is easy to achieve.
> >>>>>>
> >>>>>> My guess is that someone just taught syskallzer to pass file
> descriptors
> >>>>>> around.  So this may be an old bug.  Either that or syskallzer
> hasn't
> >>>>>> been looking at linux-next with KASAN enabled in the kernel.
> >>>>>
> >>>>> Dmitry, can you tell if syzkaller tests sending file descriptors
> across
> >>>>> sockets? Since the calltraces here show multiple processes being
> >>>>> involved in different operations on the exact same file descriptor.
> >>>>>
> >>>>> Also JongHwan, can you provide the full, compilable reproducer to try
> >>>>> and track this issue down?
> >>>>
> >>>>
> >>>> syzkaller can pass descriptors across sockets, but currently only
> >>>> within a single multi-threaded process.
> >>>>
> >>>> Are you sure it's the same descriptor? It seems to me that it's struct
> >>>> ucounts, which is shared via the global ucounts_hashtable, so no
> >>>> sharing is required in user processes.
> >>>>
> >>>> Unless I am missing something, we want:
> >>>>
> >>>> @@ -154,7 +155,7 @@ static struct ucounts *get_ucounts(struct
> >>>> user_namespace *ns, kuid_t uid)
> >>>>                         ucounts = new;
> >>>>                 }
> >>>>         }
> >>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
> >>>> +       if (!atomic_add_unless(&ucounts->count, 1, 0))
> >>>>                 ucounts = NULL;
> >>>>         spin_unlock_irq(&ucounts_lock);
> >>>>         return ucounts;
> >>>>
> >>>> no?
> >>>>
> >>>> put_ucounts drops the last reference, then get_ucounts finds the
> >>>> ucounts and successfully increments refcount as it's not INT_MAX (it's
> >>>> 0) and starts using it, meanwhile put_ucounts proceeds to
> >>>> unconditionally deleting the ucounts.
> >>>
> >>>
> >>> It also seems that a concurrent put_ucounts can make get_ucounts fail
> >>> _spuriously_, which does not look good.
> >>> Don't we want something along the following lines?
> >>>
> >>> diff --git a/kernel/ucount.c b/kernel/ucount.c
> >>> index 8a11fc0cb459..233c8e46acd5 100644
> >>> --- a/kernel/ucount.c
> >>> +++ b/kernel/ucount.c
> >>> @@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct
> >>> user_namespace *ns, kuid_t uid)
> >>>
> >>>                 new->ns = ns;
> >>>                 new->uid = uid;
> >>> -               atomic_set(&new->count, 0);
> >>> +               atomic_set(&new->count, 1);
> >>>
> >>>                 spin_lock_irq(&ucounts_lock);
> >>>                 ucounts = find_ucounts(ns, uid, hashent);
> >>>                 if (ucounts) {
> >>> +                       atomic_inc(&ucounts->count);
> >>>                         kfree(new);
> >>>                 } else {
> >>>                         hlist_add_head(&new->node, hashent);
> >>>                         ucounts = new;
> >>>                 }
> >>>         }
> >>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
> >>> -               ucounts = NULL;
> >>>         spin_unlock_irq(&ucounts_lock);
> >>>         return ucounts;
> >>>  }
> >>> @@ -166,7 +165,10 @@ static void put_ucounts(struct ucounts *ucounts)
> >>>
> >>>         if (atomic_dec_and_test(&ucounts->count)) {
> >>>                 spin_lock_irqsave(&ucounts_lock, flags);
> >>> -               hlist_del_init(&ucounts->node);
> >>> +               if (atomic_read(&ucounts->count) == 0)
> >>> +                       hlist_del_init(&ucounts->node);
> >>> +               else
> >>> +                       ucounts = NULL;
> >>>                 spin_unlock_irqrestore(&ucounts_lock, flags);
> >>>
> >>>                 kfree(ucounts);
> >>
> >>
> >> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
> >>
> >> This is broken per se. Need something more elaborate.
> >>
> >
> >
> > How about this :
> >
> > diff --git a/kernel/ucount.c b/kernel/ucount.c
> > index 8a11fc0cb459..b817ac0e587c 100644
> > --- a/kernel/ucount.c
> > +++ b/kernel/ucount.c
> > @@ -166,11 +166,15 @@ static void put_ucounts(struct ucounts *ucounts)
> >
> >         if (atomic_dec_and_test(&ucounts->count)) {
> >                 spin_lock_irqsave(&ucounts_lock, flags);
> > -               hlist_del_init(&ucounts->node);
> > -               spin_unlock_irqrestore(&ucounts_lock, flags);
> > -
> > -               kfree(ucounts);
> > +               if (!atomic_read(&ucounts->count)) {
> > +                       hlist_del_init(&ucounts->node);
> > +                       spin_unlock_irqrestore(&ucounts_lock, flags);
> > +                       kfree(ucounts);
> > +                       return;
> > +               }
> >         }
> > +
> > +       spin_unlock_irqrestore(&ucounts_lock, flags);
> >  }
> >
> >
> >
> > This makes the atomic_dec_and_test and hashtable removal atomic in
> essence.
>
>
> This won't work.
> Consider the following scenario:
> Thread 0 decrements count to 0 here:
>   if (atomic_dec_and_test(&ucounts->count)) {
> Then thread 1 calls get_ucounts, increments count to 1, then calls
> put_ucounts, decrements count to 0 and unhashes and frees ucounts.
> Not thread 0 does:
>   if (!atomic_read(&ucounts->count)) {
> but ucounts is already freed!
>
Jann Horn via Containers March 4, 2017, 12:38 p.m.
On Sat, Mar 4, 2017 at 1:15 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Sat, Mar 4, 2017 at 1:10 PM, Nikolay Borisov
> <n.borisov.lkml@gmail.com> wrote:
>>
>>
>> On  4.03.2017 14:01, Dmitry Vyukov wrote:
>>> On Sat, Mar 4, 2017 at 12:50 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>> <n.borisov.lkml@gmail.com> wrote:
>>>>>> [Addressing Dmitry Vyukov to ask for syzkaller clarification]
>>>>>>
>>>>>> On  3.03.2017 18:30, Eric W. Biederman wrote:
>>>>>>> Nikolay Borisov <n.borisov.lkml@gmail.com> writes:
>>>>>>>
>>>>>>>> [Added containers ml, Eric Biederman and Jan Kara]. Please,
>>>>>>>> next time don't add random people but take the time to see who touched
>>>>>>>> the code.
>>>>>>>
>>>>>>> Comments below.
>>>>>>>
>>>>>>>> On  3.03.2017 14:16, JongHwan Kim wrote:
>>>>>>>>> I've got the following report with syzkaller fuzzer
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .
>>>>>>>>>
>>>>>>>>> ==================================================================
>>>>>>>>> BUG: KASAN: use-after-free in __read_once_size
>>>>>>>>> include/linux/compiler.h:254 [inline] at addr ffff88006d399bc4
>>>>>>>>> BUG: KASAN: use-after-free in atomic_read
>>>>>>>>> arch/x86/include/asm/atomic.h:26 [inline] at addr ffff88006d399bc4
>>>>>>>>> BUG: KASAN: use-after-free in atomic_dec_if_positive
>>>>>>>>> include/linux/atomic.h:616 [inline] at addr ffff88006d399bc4
>>>>>>>>> BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>>> at addr ffff88006d399bc4
>>>>>>>>> Read of size 4 by task syz-executor3/19713
>>>>>>>>> CPU: 1 PID: 19713 Comm: syz-executor3 Not tainted 4.10.0+ #4
>>>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>>>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>>> Call Trace:
>>>>>>>>>  __dump_stack lib/dump_stack.c:15 [inline]
>>>>>>>>>  dump_stack+0x115/0x1cf lib/dump_stack.c:51
>>>>>>>>>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>>>>>>>>>  print_address_description mm/kasan/report.c:200 [inline]
>>>>>>>>>  kasan_report_error mm/kasan/report.c:289 [inline]
>>>>>>>>>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>>>>>>>>>  kasan_report mm/kasan/report.c:331 [inline]
>>>>>>>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
>>>>>>>>>  __read_once_size include/linux/compiler.h:254 [inline]
>>>>>>>>>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>>>>>>>>>  atomic_dec_if_positive include/linux/atomic.h:616 [inline]
>>>>>>>>>  dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>>>
>>>>>>>> So PID 19713 is exitting and as part of it it's freeing its file
>>>>>>>> descriptors, one of which is apparently an inotify fd. And this has
>>>>>>>> already been freed.
>>>>>>>>
>>>>>>>>
>>>>>>>>> RIP: 0033:0x44fb79
>>>>>>>>> RSP: 002b:00007ffd0f00f6d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
>>>>>>>>> RAX: fffffffffffffdfc RBX: 0000000000708024 RCX: 000000000044fb79
>>>>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708024
>>>>>>>>> RBP: 00000000000ae8e6 R08: 0000000000708000 R09: 000000160000000d
>>>>>>>>> R10: 00007ffd0f00f710 R11: 0000000000000206 R12: 0000000000708000
>>>>>>>>> R13: 0000000000708024 R14: 00000000000ae8a1 R15: 0000000000000016
>>>>>>>>> Object at ffff88006d399b88, in cache kmalloc-96 size: 96
>>>>>>>>> Allocated:
>>>>>>>>> PID = 19691
>>>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
>>>>>>>>>  kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
>>>>>>>>>  kmalloc include/linux/slab.h:490 [inline]
>>>>>>>>>  kzalloc include/linux/slab.h:663 [inline]
>>>>>>>>>  get_ucounts kernel/ucount.c:140 [inline]
>>>>>>>>>  inc_ucount+0x538/0xa70 kernel/ucount.c:195
>>>>>>>>>  inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
>>>>>>>>>  SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
>>>>>>>>>  SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
>>>>>>>>>  sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
>>>>>>>>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>>>>>>
>>>>>>>> However, it has been actually allocated by a different process with pid
>>>>>>>> 19691.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Freed:
>>>>>>>>> PID = 19708
>>>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
>>>>>>>>>  slab_free_hook mm/slub.c:1357 [inline]
>>>>>>>>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>>>>>>>>>  slab_free mm/slub.c:2961 [inline]
>>>>>>>>>  kfree+0xe8/0x2c0 mm/slub.c:3882
>>>>>>>>>  put_ucounts+0x1dd/0x270 kernel/ucount.c:172
>>>>>>>>>  dec_ucount+0x172/0x210 kernel/ucount.c:220
>>>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>>>
>>>>>>>> And yet we have a third process which freed it, PID 19708. So there is
>>>>>>>> some dance happening with this fd, being allocated by one process,
>>>>>>>> handed over to 2 more, which are freeing it. Is this a valid usage
>>>>>>>> scenario of inotify descriptors?
>>>>>>>
>>>>>>> They are file descriptors so passing them around is valid.  That is
>>>>>>> something unix domain sockets have allowed since the dawn of linux.
>>>>>>>
>>>>>>> The dance would need to be the fd being passed to the addtional
>>>>>>> processes and then closed in the original before being closed
>>>>>>> in the processes the fd was passed to.
>>>>>>>
>>>>>>> If those additional processes last longer than the original process this
>>>>>>> is easy to achieve.
>>>>>>>
>>>>>>> My guess is that someone just taught syskallzer to pass file descriptors
>>>>>>> around.  So this may be an old bug.  Either that or syskallzer hasn't
>>>>>>> been looking at linux-next with KASAN enabled in the kernel.
>>>>>>
>>>>>> Dmitry, can you tell if syzkaller tests sending file descriptors across
>>>>>> sockets? Since the calltraces here show multiple processes being
>>>>>> involved in different operations on the exact same file descriptor.
>>>>>>
>>>>>> Also JongHwan, can you provide the full, compilable reproducer to try
>>>>>> and track this issue down?
>>>>>
>>>>>
>>>>> syzkaller can pass descriptors across sockets, but currently only
>>>>> within a single multi-threaded process.
>>>>>
>>>>> Are you sure it's the same descriptor? It seems to me that it's struct
>>>>> ucounts, which is shared via the global ucounts_hashtable, so no
>>>>> sharing is required in user processes.
>>>>>
>>>>> Unless I am missing something, we want:
>>>>>
>>>>> @@ -154,7 +155,7 @@ static struct ucounts *get_ucounts(struct
>>>>> user_namespace *ns, kuid_t uid)
>>>>>                         ucounts = new;
>>>>>                 }
>>>>>         }
>>>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>>>> +       if (!atomic_add_unless(&ucounts->count, 1, 0))
>>>>>                 ucounts = NULL;
>>>>>         spin_unlock_irq(&ucounts_lock);
>>>>>         return ucounts;
>>>>>
>>>>> no?
>>>>>
>>>>> put_ucounts drops the last reference, then get_ucounts finds the
>>>>> ucounts and successfully increments refcount as it's not INT_MAX (it's
>>>>> 0) and starts using it, meanwhile put_ucounts proceeds to
>>>>> unconditionally deleting the ucounts.
>>>>
>>>>
>>>> It also seems that a concurrent put_ucounts can make get_ucounts fail
>>>> _spuriously_, which does not look good.
>>>> Don't we want something along the following lines?
>>>>
>>>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>>>> index 8a11fc0cb459..233c8e46acd5 100644
>>>> --- a/kernel/ucount.c
>>>> +++ b/kernel/ucount.c
>>>> @@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct
>>>> user_namespace *ns, kuid_t uid)
>>>>
>>>>                 new->ns = ns;
>>>>                 new->uid = uid;
>>>> -               atomic_set(&new->count, 0);
>>>> +               atomic_set(&new->count, 1);
>>>>
>>>>                 spin_lock_irq(&ucounts_lock);
>>>>                 ucounts = find_ucounts(ns, uid, hashent);
>>>>                 if (ucounts) {
>>>> +                       atomic_inc(&ucounts->count);
>>>>                         kfree(new);
>>>>                 } else {
>>>>                         hlist_add_head(&new->node, hashent);
>>>>                         ucounts = new;
>>>>                 }
>>>>         }
>>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>>> -               ucounts = NULL;
>>>>         spin_unlock_irq(&ucounts_lock);
>>>>         return ucounts;
>>>>  }
>>>> @@ -166,7 +165,10 @@ static void put_ucounts(struct ucounts *ucounts)
>>>>
>>>>         if (atomic_dec_and_test(&ucounts->count)) {
>>>>                 spin_lock_irqsave(&ucounts_lock, flags);
>>>> -               hlist_del_init(&ucounts->node);
>>>> +               if (atomic_read(&ucounts->count) == 0)
>>>> +                       hlist_del_init(&ucounts->node);
>>>> +               else
>>>> +                       ucounts = NULL;
>>>>                 spin_unlock_irqrestore(&ucounts_lock, flags);
>>>>
>>>>                 kfree(ucounts);
>>>
>>>
>>> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>>>
>>> This is broken per se. Need something more elaborate.
>>>
>>
>>
>> How about this :
>>
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 8a11fc0cb459..b817ac0e587c 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -166,11 +166,15 @@ static void put_ucounts(struct ucounts *ucounts)
>>
>>         if (atomic_dec_and_test(&ucounts->count)) {
>>                 spin_lock_irqsave(&ucounts_lock, flags);
>> -               hlist_del_init(&ucounts->node);
>> -               spin_unlock_irqrestore(&ucounts_lock, flags);
>> -
>> -               kfree(ucounts);
>> +               if (!atomic_read(&ucounts->count)) {
>> +                       hlist_del_init(&ucounts->node);
>> +                       spin_unlock_irqrestore(&ucounts_lock, flags);
>> +                       kfree(ucounts);
>> +                       return;
>> +               }
>>         }
>> +
>> +       spin_unlock_irqrestore(&ucounts_lock, flags);
>>  }
>>
>>
>>
>> This makes the atomic_dec_and_test and hashtable removal atomic in essence.
>
>
> This won't work.
> Consider the following scenario:
> Thread 0 decrements count to 0 here:
>   if (atomic_dec_and_test(&ucounts->count)) {
> Then thread 1 calls get_ucounts, increments count to 1, then calls
> put_ucounts, decrements count to 0 and unhashes and frees ucounts.
> Not thread 0 does:
>   if (!atomic_read(&ucounts->count)) {
> but ucounts is already freed!


What may work is if put_ucounts re-lookups the ucounts. If it can
still find it and count==0, then it is the right time to delete it. If
it can't find the ucounts, then somebody else has beaten it.
Eric W. Biederman March 4, 2017, 11:58 p.m.
Dmitry Vyukov <dvyukov@google.com> writes:

> On Sat, Mar 4, 2017 at 1:15 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Sat, Mar 4, 2017 at 1:10 PM, Nikolay Borisov
>> <n.borisov.lkml@gmail.com> wrote:
>>>
>>>
>>> On  4.03.2017 14:01, Dmitry Vyukov wrote:
>>>> On Sat, Mar 4, 2017 at 12:50 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>>> <n.borisov.lkml@gmail.com> wrote:
>>>>>>> [Addressing Dmitry Vyukov to ask for syzkaller clarification]
>>>>>>>
>>>>>>> On  3.03.2017 18:30, Eric W. Biederman wrote:
>>>>>>>> Nikolay Borisov <n.borisov.lkml@gmail.com> writes:
>>>>>>>>
>>>>>>>>> [Added containers ml, Eric Biederman and Jan Kara]. Please,
>>>>>>>>> next time don't add random people but take the time to see who touched
>>>>>>>>> the code.
>>>>>>>>
>>>>>>>> Comments below.
>>>>>>>>
>>>>>>>>> On  3.03.2017 14:16, JongHwan Kim wrote:
>>>>>>>>>> I've got the following report with syzkaller fuzzer
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit .
>>>>>>>>>>
>>>>>>>>>> ==================================================================
>>>>>>>>>> BUG: KASAN: use-after-free in __read_once_size
>>>>>>>>>> include/linux/compiler.h:254 [inline] at addr ffff88006d399bc4
>>>>>>>>>> BUG: KASAN: use-after-free in atomic_read
>>>>>>>>>> arch/x86/include/asm/atomic.h:26 [inline] at addr ffff88006d399bc4
>>>>>>>>>> BUG: KASAN: use-after-free in atomic_dec_if_positive
>>>>>>>>>> include/linux/atomic.h:616 [inline] at addr ffff88006d399bc4
>>>>>>>>>> BUG: KASAN: use-after-free in dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>>>> at addr ffff88006d399bc4
>>>>>>>>>> Read of size 4 by task syz-executor3/19713
>>>>>>>>>> CPU: 1 PID: 19713 Comm: syz-executor3 Not tainted 4.10.0+ #4
>>>>>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>>>>>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>>>> Call Trace:
>>>>>>>>>>  __dump_stack lib/dump_stack.c:15 [inline]
>>>>>>>>>>  dump_stack+0x115/0x1cf lib/dump_stack.c:51
>>>>>>>>>>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>>>>>>>>>>  print_address_description mm/kasan/report.c:200 [inline]
>>>>>>>>>>  kasan_report_error mm/kasan/report.c:289 [inline]
>>>>>>>>>>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>>>>>>>>>>  kasan_report mm/kasan/report.c:331 [inline]
>>>>>>>>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
>>>>>>>>>>  __read_once_size include/linux/compiler.h:254 [inline]
>>>>>>>>>>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>>>>>>>>>>  atomic_dec_if_positive include/linux/atomic.h:616 [inline]
>>>>>>>>>>  dec_ucount+0x1e5/0x210 kernel/ucount.c:217
>>>>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>>>>
>>>>>>>>> So PID 19713 is exitting and as part of it it's freeing its file
>>>>>>>>> descriptors, one of which is apparently an inotify fd. And this has
>>>>>>>>> already been freed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> RIP: 0033:0x44fb79
>>>>>>>>>> RSP: 002b:00007ffd0f00f6d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000ca
>>>>>>>>>> RAX: fffffffffffffdfc RBX: 0000000000708024 RCX: 000000000044fb79
>>>>>>>>>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000708024
>>>>>>>>>> RBP: 00000000000ae8e6 R08: 0000000000708000 R09: 000000160000000d
>>>>>>>>>> R10: 00007ffd0f00f710 R11: 0000000000000206 R12: 0000000000708000
>>>>>>>>>> R13: 0000000000708024 R14: 00000000000ae8a1 R15: 0000000000000016
>>>>>>>>>> Object at ffff88006d399b88, in cache kmalloc-96 size: 96
>>>>>>>>>> Allocated:
>>>>>>>>>> PID = 19691
>>>>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
>>>>>>>>>>  kmem_cache_alloc_trace+0xfb/0x280 mm/slub.c:2745
>>>>>>>>>>  kmalloc include/linux/slab.h:490 [inline]
>>>>>>>>>>  kzalloc include/linux/slab.h:663 [inline]
>>>>>>>>>>  get_ucounts kernel/ucount.c:140 [inline]
>>>>>>>>>>  inc_ucount+0x538/0xa70 kernel/ucount.c:195
>>>>>>>>>>  inotify_new_group+0x309/0x410 fs/notify/inotify/inotify_user.c:655
>>>>>>>>>>  SYSC_inotify_init1 fs/notify/inotify/inotify_user.c:682 [inline]
>>>>>>>>>>  SyS_inotify_init1 fs/notify/inotify/inotify_user.c:669 [inline]
>>>>>>>>>>  sys_inotify_init+0x17/0x80 fs/notify/inotify/inotify_user.c:696
>>>>>>>>>>  entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>>>>>>>
>>>>>>>>> However, it has been actually allocated by a different process with pid
>>>>>>>>> 19691.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Freed:
>>>>>>>>>> PID = 19708
>>>>>>>>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>>>>>>>>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>>>>>>>>>>  set_track mm/kasan/kasan.c:514 [inline]
>>>>>>>>>>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
>>>>>>>>>>  slab_free_hook mm/slub.c:1357 [inline]
>>>>>>>>>>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>>>>>>>>>>  slab_free mm/slub.c:2961 [inline]
>>>>>>>>>>  kfree+0xe8/0x2c0 mm/slub.c:3882
>>>>>>>>>>  put_ucounts+0x1dd/0x270 kernel/ucount.c:172
>>>>>>>>>>  dec_ucount+0x172/0x210 kernel/ucount.c:220
>>>>>>>>>>  dec_inotify_instances fs/notify/inotify/inotify.h:37 [inline]
>>>>>>>>>>  inotify_free_group_priv+0x6c/0x80 fs/notify/inotify/inotify_fsnotify.c:169
>>>>>>>>>>  fsnotify_final_destroy_group fs/notify/group.c:37 [inline]
>>>>>>>>>>  fsnotify_put_group+0x73/0xa0 fs/notify/group.c:110
>>>>>>>>>>  fsnotify_destroy_group+0xec/0x120 fs/notify/group.c:93
>>>>>>>>>>  inotify_release+0x37/0x50 fs/notify/inotify/inotify_user.c:280
>>>>>>>>>>  __fput+0x327/0x7e0 fs/file_table.c:208
>>>>>>>>>>  ____fput+0x15/0x20 fs/file_table.c:244
>>>>>>>>>>  task_work_run+0x18a/0x260 kernel/task_work.c:116
>>>>>>>>>>  exit_task_work include/linux/task_work.h:21 [inline]
>>>>>>>>>>  do_exit+0xa45/0x1b20 kernel/exit.c:873
>>>>>>>>>>  do_group_exit+0x149/0x400 kernel/exit.c:977
>>>>>>>>>>  get_signal+0x7d5/0x1810 kernel/signal.c:2313
>>>>>>>>>>  do_signal+0x94/0x1f30 arch/x86/kernel/signal.c:807
>>>>>>>>>>  exit_to_usermode_loop+0x162/0x1e0 arch/x86/entry/common.c:156
>>>>>>>>>>  prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
>>>>>>>>>>  syscall_return_slowpath+0x2b6/0x310 arch/x86/entry/common.c:259
>>>>>>>>>>  entry_SYSCALL_64_fastpath+0xc0/0xc2
>>>>>>>>>
>>>>>>>>> And yet we have a third process which freed it, PID 19708. So there is
>>>>>>>>> some dance happening with this fd, being allocated by one process,
>>>>>>>>> handed over to 2 more, which are freeing it. Is this a valid usage
>>>>>>>>> scenario of inotify descriptors?
>>>>>>>>
>>>>>>>> They are file descriptors so passing them around is valid.  That is
>>>>>>>> something unix domain sockets have allowed since the dawn of linux.
>>>>>>>>
>>>>>>>> The dance would need to be the fd being passed to the addtional
>>>>>>>> processes and then closed in the original before being closed
>>>>>>>> in the processes the fd was passed to.
>>>>>>>>
>>>>>>>> If those additional processes last longer than the original process this
>>>>>>>> is easy to achieve.
>>>>>>>>
>>>>>>>> My guess is that someone just taught syskallzer to pass file descriptors
>>>>>>>> around.  So this may be an old bug.  Either that or syskallzer hasn't
>>>>>>>> been looking at linux-next with KASAN enabled in the kernel.
>>>>>>>
>>>>>>> Dmitry, can you tell if syzkaller tests sending file descriptors across
>>>>>>> sockets? Since the calltraces here show multiple processes being
>>>>>>> involved in different operations on the exact same file descriptor.
>>>>>>>
>>>>>>> Also JongHwan, can you provide the full, compilable reproducer to try
>>>>>>> and track this issue down?
>>>>>>
>>>>>>
>>>>>> syzkaller can pass descriptors across sockets, but currently only
>>>>>> within a single multi-threaded process.
>>>>>>
>>>>>> Are you sure it's the same descriptor? It seems to me that it's struct
>>>>>> ucounts, which is shared via the global ucounts_hashtable, so no
>>>>>> sharing is required in user processes.
>>>>>>
>>>>>> Unless I am missing something, we want:
>>>>>>
>>>>>> @@ -154,7 +155,7 @@ static struct ucounts *get_ucounts(struct
>>>>>> user_namespace *ns, kuid_t uid)
>>>>>>                         ucounts = new;
>>>>>>                 }
>>>>>>         }
>>>>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>>>>> +       if (!atomic_add_unless(&ucounts->count, 1, 0))
>>>>>>                 ucounts = NULL;
>>>>>>         spin_unlock_irq(&ucounts_lock);
>>>>>>         return ucounts;
>>>>>>
>>>>>> no?
>>>>>>
>>>>>> put_ucounts drops the last reference, then get_ucounts finds the
>>>>>> ucounts and successfully increments refcount as it's not INT_MAX (it's
>>>>>> 0) and starts using it, meanwhile put_ucounts proceeds to
>>>>>> unconditionally deleting the ucounts.
>>>>>
>>>>>
>>>>> It also seems that a concurrent put_ucounts can make get_ucounts fail
>>>>> _spuriously_, which does not look good.
>>>>> Don't we want something along the following lines?
>>>>>
>>>>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>>>>> index 8a11fc0cb459..233c8e46acd5 100644
>>>>> --- a/kernel/ucount.c
>>>>> +++ b/kernel/ucount.c
>>>>> @@ -143,19 +143,18 @@ static struct ucounts *get_ucounts(struct
>>>>> user_namespace *ns, kuid_t uid)
>>>>>
>>>>>                 new->ns = ns;
>>>>>                 new->uid = uid;
>>>>> -               atomic_set(&new->count, 0);
>>>>> +               atomic_set(&new->count, 1);
>>>>>
>>>>>                 spin_lock_irq(&ucounts_lock);
>>>>>                 ucounts = find_ucounts(ns, uid, hashent);
>>>>>                 if (ucounts) {
>>>>> +                       atomic_inc(&ucounts->count);
>>>>>                         kfree(new);
>>>>>                 } else {
>>>>>                         hlist_add_head(&new->node, hashent);
>>>>>                         ucounts = new;
>>>>>                 }
>>>>>         }
>>>>> -       if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
>>>>> -               ucounts = NULL;
>>>>>         spin_unlock_irq(&ucounts_lock);
>>>>>         return ucounts;
>>>>>  }
>>>>> @@ -166,7 +165,10 @@ static void put_ucounts(struct ucounts *ucounts)
>>>>>
>>>>>         if (atomic_dec_and_test(&ucounts->count)) {
>>>>>                 spin_lock_irqsave(&ucounts_lock, flags);
>>>>> -               hlist_del_init(&ucounts->node);
>>>>> +               if (atomic_read(&ucounts->count) == 0)
>>>>> +                       hlist_del_init(&ucounts->node);
>>>>> +               else
>>>>> +                       ucounts = NULL;
>>>>>                 spin_unlock_irqrestore(&ucounts_lock, flags);
>>>>>
>>>>>                 kfree(ucounts);
>>>>
>>>>
>>>> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
>>>>
>>>> This is broken per se. Need something more elaborate.
>>>>
>>>
>>>
>>> How about this :
>>>
>>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>>> index 8a11fc0cb459..b817ac0e587c 100644
>>> --- a/kernel/ucount.c
>>> +++ b/kernel/ucount.c
>>> @@ -166,11 +166,15 @@ static void put_ucounts(struct ucounts *ucounts)
>>>
>>>         if (atomic_dec_and_test(&ucounts->count)) {
>>>                 spin_lock_irqsave(&ucounts_lock, flags);
>>> -               hlist_del_init(&ucounts->node);
>>> -               spin_unlock_irqrestore(&ucounts_lock, flags);
>>> -
>>> -               kfree(ucounts);
>>> +               if (!atomic_read(&ucounts->count)) {
>>> +                       hlist_del_init(&ucounts->node);
>>> +                       spin_unlock_irqrestore(&ucounts_lock, flags);
>>> +                       kfree(ucounts);
>>> +                       return;
>>> +               }
>>>         }
>>> +
>>> +       spin_unlock_irqrestore(&ucounts_lock, flags);
>>>  }
>>>
>>>
>>>
>>> This makes the atomic_dec_and_test and hashtable removal atomic in essence.
>>
>>
>> This won't work.
>> Consider the following scenario:
>> Thread 0 decrements count to 0 here:
>>   if (atomic_dec_and_test(&ucounts->count)) {
>> Then thread 1 calls get_ucounts, increments count to 1, then calls
>> put_ucounts, decrements count to 0 and unhashes and frees ucounts.
>> Not thread 0 does:
>>   if (!atomic_read(&ucounts->count)) {
>> but ucounts is already freed!
>
>
> What may work is if put_ucounts re-lookups the ucounts. If it can
> still find it and count==0, then it is the right time to delete it. If
> it can't find the ucounts, then somebody else has beaten it.

I believe what we want is atomic_dec_and_lock_irqsave.

As that does not exist we can just do:

@@ -164,13 +164,16 @@ static void put_ucounts(struct ucounts *ucounts)
 {
        unsigned long flags;
 
+       /* Unless the count is 1 decrement the quick way */
+       if (atomic_add_unless(&ucounts->count, -1, 1))
+               return;
+
+       spin_lock_irqsave(&ucounts_lock, flags);
        if (atomic_dec_and_test(&ucounts->count)) {
-               spin_lock_irqsave(&ucounts_lock, flags);
                hlist_del_init(&ucounts->node);
-               spin_unlock_irqrestore(&ucounts_lock, flags);
-
                kfree(ucounts);
        }
+       spin_unlock_irqrestore(&ucounts_lock, flags);
 }
 
 static inline bool atomic_inc_below(atomic_t *v, int u)


AKA take the spin_lock around the dec_and_test.

Arguably always decrementing under the ucounts_lock that means we can
stop reduce ucounts->count to a simple integer and just always take the
lock.  Thus reducing our number of atomic operations and speeding up the
code a little.

But that might be a bit much for a simple bug fix.

If you folks can verify the fix above closes the race and stops the
problems I would appreciate it.

Eric
Eric W. Biederman March 5, 2017, 9 p.m.
쪼르 <zzoru007@gmail.com> writes:

> Hi, This is my new one report about dec_ucount:
> ps.Sorry for my uncomfortable report. This is my first usage of lkml. 
> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit
> .

You are doing well.  Thank you very much for the report.

Thank you for the reproducer.  Unfortunately I am not able to reproduce
the bug with what the code you have posted here.

From the initial mailing the code said:

> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
> Repro:false}
> inotify_init()

The code you posted says:

> Syzkaller reproducer:
> # {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
> semget$private(0x0, 0x400001003, 0x181)

So I expect syzkaller did not create the same code when you ran it
again.  Something easy to miss if you haven't run used a tool like that
much.

If someone knows how to get the code that syzkaller would generate that
matches the original reproducer I would very much appreciate it so that
we can confirm the bug we have spotted in the code is the bug syzkaller
found.

Until that point I am going to fix the obvious bug in the code and hope
that fixes the problem.

Eric
Jann Horn via Containers March 6, 2017, 9:13 a.m.
On Sun, Mar 5, 2017 at 10:00 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> 쪼르 <zzoru007@gmail.com> writes:
>
>> Hi, This is my new one report about dec_ucount:
>> ps.Sorry for my uncomfortable report. This is my first usage of lkml.
>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit
>> .
>
> You are doing well.  Thank you very much for the report.
>
> Thank you for the reproducer.  Unfortunately I am not able to reproduce
> the bug with what the code you have posted here.
>
> From the initial mailing the code said:
>
>> Syzkaller reproducer:
>> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
>> Repro:false}
>> inotify_init()
>
> The code you posted says:
>
>> Syzkaller reproducer:
>> # {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
>> semget$private(0x0, 0x400001003, 0x181)
>
> So I expect syzkaller did not create the same code when you ran it
> again.  Something easy to miss if you haven't run used a tool like that
> much.
>
> If someone knows how to get the code that syzkaller would generate that
> matches the original reproducer I would very much appreciate it so that
> we can confirm the bug we have spotted in the code is the bug syzkaller
> found.
>
> Until that point I am going to fix the obvious bug in the code and hope
> that fixes the problem.


Reliably reproducing such bugs is not possible (how would you expect
it to look like?). Your best bet is to write a stress test that
provokes the bug, add some sleeps into kernel code and run it for a
while with KASAN. Should be reproducible within minutes.
Eric W. Biederman March 6, 2017, 4:33 p.m.
Dmitry Vyukov <dvyukov@google.com> writes:

> On Sun, Mar 5, 2017 at 10:00 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> 쪼르 <zzoru007@gmail.com> writes:
>>
>>> Hi, This is my new one report about dec_ucount:
>>> ps.Sorry for my uncomfortable report. This is my first usage of lkml.
>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit
>>> .
>>
>> You are doing well.  Thank you very much for the report.
>>
>> Thank you for the reproducer.  Unfortunately I am not able to reproduce
>> the bug with what the code you have posted here.
>>
>> From the initial mailing the code said:
>>
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
>>> Repro:false}
>>> inotify_init()
>>
>> The code you posted says:
>>
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
>>> semget$private(0x0, 0x400001003, 0x181)
>>
>> So I expect syzkaller did not create the same code when you ran it
>> again.  Something easy to miss if you haven't run used a tool like that
>> much.
>>
>> If someone knows how to get the code that syzkaller would generate that
>> matches the original reproducer I would very much appreciate it so that
>> we can confirm the bug we have spotted in the code is the bug syzkaller
>> found.
>>
>> Until that point I am going to fix the obvious bug in the code and hope
>> that fixes the problem.
>
>
> Reliably reproducing such bugs is not possible (how would you expect
> it to look like?). Your best bet is to write a stress test that
> provokes the bug, add some sleeps into kernel code and run it for a
> while with KASAN. Should be reproducible within minutes.

I was not asking for a reliable reproducer.  I was asking what code was
run that triggered the error.

I don't have a clue what the randomly generated code that prompted the
original kernel error is and it doesn't appear anyone else does either.

The only hint I have is:
>>> Syzkaller reproducer:
>>> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
>>> Repro:false}
>>> inotify_init()

The code that was posted did not call inotify_init and so I believe that
was a completely different random piece of code, that has nothing to do
with this issue.

I don't know syzkaller and it looks non-trivial to install on my system
and play around with.  So I am going to leave futzing with syzkaller to
people who have been able to figure it out.

Until I have a reasonable understanding of what the code was doing that
triggered the error I can't say with any certainty that the reported bug
was fixed.

I would love to be able to say that it looks like the bug that caused
the error report was fixed.

Eric
Jann Horn via Containers March 6, 2017, 6:43 p.m.
On Mon, Mar 6, 2017 at 5:33 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Dmitry Vyukov <dvyukov@google.com> writes:
>
>> On Sun, Mar 5, 2017 at 10:00 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> 쪼르 <zzoru007@gmail.com> writes:
>>>
>>>> Hi, This is my new one report about dec_ucount:
>>>> ps.Sorry for my uncomfortable report. This is my first usage of lkml.
>>>> Syzkaller hit 'KASAN: use-after-free Read in dec_ucount' bug on commit
>>>> .
>>>
>>> You are doing well.  Thank you very much for the report.
>>>
>>> Thank you for the reproducer.  Unfortunately I am not able to reproduce
>>> the bug with what the code you have posted here.
>>>
>>> From the initial mailing the code said:
>>>
>>>> Syzkaller reproducer:
>>>> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
>>>> Repro:false}
>>>> inotify_init()
>>>
>>> The code you posted says:
>>>
>>>> Syzkaller reproducer:
>>>> # {Threaded:false Collide:false Repeat:true Procs:1 Sandbox:setuid Repro:false}
>>>> semget$private(0x0, 0x400001003, 0x181)
>>>
>>> So I expect syzkaller did not create the same code when you ran it
>>> again.  Something easy to miss if you haven't run used a tool like that
>>> much.
>>>
>>> If someone knows how to get the code that syzkaller would generate that
>>> matches the original reproducer I would very much appreciate it so that
>>> we can confirm the bug we have spotted in the code is the bug syzkaller
>>> found.
>>>
>>> Until that point I am going to fix the obvious bug in the code and hope
>>> that fixes the problem.
>>
>>
>> Reliably reproducing such bugs is not possible (how would you expect
>> it to look like?). Your best bet is to write a stress test that
>> provokes the bug, add some sleeps into kernel code and run it for a
>> while with KASAN. Should be reproducible within minutes.
>
> I was not asking for a reliable reproducer.  I was asking what code was
> run that triggered the error.
>
> I don't have a clue what the randomly generated code that prompted the
> original kernel error is and it doesn't appear anyone else does either.
>
> The only hint I have is:
>>>> Syzkaller reproducer:
>>>> # {Threaded:false Collide:false Repeat:true Procs:4 Sandbox:setuid
>>>> Repro:false}
>>>> inotify_init()
>
> The code that was posted did not call inotify_init and so I believe that
> was a completely different random piece of code, that has nothing to do
> with this issue.
>
> I don't know syzkaller and it looks non-trivial to install on my system
> and play around with.  So I am going to leave futzing with syzkaller to
> people who have been able to figure it out.
>
> Until I have a reasonable understanding of what the code was doing that
> triggered the error I can't say with any certainty that the reported bug
> was fixed.
>
> I would love to be able to say that it looks like the bug that caused
> the error report was fixed.


Here is a C analog of the reproducer:
https://gist.githubusercontent.com/dvyukov/80293e434b1273c6a1cdce5c6b979ab6/raw/a1f5ff3cab17b5e3882b56060efb795b6355bdf2/gistfile1.txt
that JongHwan provided here:
https://groups.google.com/d/msg/syzkaller/crHVWf1QDkU/DrA9uVquDwAJ

It basically just calls inotify_init in parallel processes.