Message ID: 201708311028.v7VASkOD012975@finist_ce7.work
State:      New
Series:     "ms/workqueue: fix ghost PENDING flag while doing MQ IO"

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ec41322a..ffd9f4e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -638,6 +638,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	/*
+	 * The following mb guarantees that previous clear of a PENDING bit
+	 * will not be reordered with any speculative LOADS or STORES from
+	 * work->current_func, which is executed afterwards.  This possible
+	 * reordering can lead to a missed execution on attempt to queue
+	 * the same @work.  E.g. consider this case:
+	 *
+	 *   CPU#0                         CPU#1
+	 *   ----------------------------  --------------------------------
+	 *
+	 * 1  STORE event_indicated
+	 * 2  queue_work_on() {
+	 * 3    test_and_set_bit(PENDING)
+	 * 4 }                             set_..._and_clear_pending() {
+	 * 5                                 set_work_data() # clear bit
+	 * 6                                 smp_mb()
+	 * 7                               work->current_func() {
+	 * 8                                  LOAD event_indicated
+	 *                                 }
+	 *
+	 * Without an explicit full barrier speculative LOAD on line 8 can
+	 * be executed before CPU#0 does STORE on line 1.  If that happens,
+	 * CPU#0 observes the PENDING bit is still set and new execution of
+	 * a @work is not queued in the hope that CPU#1 will eventually
+	 * finish the queued @work.  Meanwhile CPU#1 does not see
+	 * event_indicated is set, because speculative LOAD was executed
+	 * before actual STORE.
+	 */
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)
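
The ordering contract the new smp_mb() establishes can be modelled in user space. Below is a minimal, hypothetical sketch using C11 atomics and pthreads: pending stands in for WORK_STRUCT_PENDING_BIT, event_indicated for the data the work function consumes, and atomic_thread_fence(memory_order_seq_cst) for smp_mb(). None of these names are kernel APIs; the sketch only illustrates why the barrier must be a full one.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool pending;         /* stands in for WORK_STRUCT_PENDING_BIT */
static atomic_int event_indicated;  /* data the work function will consume   */

/* CPU#0: the submitter, modelling a queue_work_on() caller */
static void *submitter(void *arg)
{
	(void)arg;
	atomic_store_explicit(&event_indicated, 1, memory_order_relaxed);
	/* test_and_set_bit() implies a full barrier in the kernel */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_exchange_explicit(&pending, true, memory_order_relaxed))
		printf("PENDING set: trusting the running worker to see the event\n");
	return NULL;
}

/* CPU#1: the worker, modelling set_work_pool_and_clear_pending() */
static void *worker(void *arg)
{
	(void)arg;
	atomic_store_explicit(&pending, false, memory_order_relaxed);
	/*
	 * The smp_mb() this patch adds.  Without it the LOAD below may be
	 * satisfied before our clear of `pending` is globally visible, so
	 * both sides miss each other's store: the submitter backs off and
	 * we read event_indicated == 0 -- the event is lost.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&event_indicated, memory_order_relaxed))
		printf("work->current_func(): event observed\n");
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	atomic_store(&pending, true);	/* the work is queued, about to run */
	pthread_create(&t0, NULL, submitter, NULL);
	pthread_create(&t1, NULL, worker, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}

If the fence in worker() is dropped, both threads can miss each other's store: the submitter backs off because PENDING still looks set, while the worker never observes event_indicated, which is exactly the lost execution the comment describes.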
Please consider releasing it as a ReadyKernel patch.
https://readykernel.com/

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/31/2017 01:28 PM, Konstantin Khorenko wrote:
> The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
> after rh7-3.10.0-514.26.1.vz7.35.5
> ------>
> commit f24bbb53d5035c7b13b5ecb61728d5f12240f139
> Author: Roman Pen <roman.penyaev@profitbricks.com>
> Date: Thu Aug 31 13:28:46 2017 +0300
>
> ms/workqueue: fix ghost PENDING flag while doing MQ IO
>
> We have seen the whole node hang: many processes were stuck on a stack
> similar to this one:
>
> crash> ps -m ffff8802b7f00000
> [0 00:20:36.663] [UN]  PID: 22713  TASK: ffff8802b7f00000  CPU: 1  COMMAND: "worker"
>
> crash> bt ffff8802b7f00000
> PID: 22713  TASK: ffff8802b7f00000  CPU: 1  COMMAND: "worker"
> #0 [ffff88031b04f980] __schedule at ffffffff8256cdd1
> #1 [ffff88031b04f9f8] schedule at ffffffff8256e239
> #2 [ffff88031b04fa18] schedule_timeout at ffffffff82561cea
> #3 [ffff88031b04fb88] io_schedule_timeout at ffffffff8256c0d9
> #4 [ffff88031b04fbb8] wait_for_completion_io at ffffffff8256f3e0
> #5 [ffff88031b04fc90] blkdev_issue_flush at ffffffff8193a207
> #6 [ffff88031b04fe08] ext4_sync_file at ffffffffa0af6d34 [ext4]
> #7 [ffff88031b04fe68] vfs_fsync_range at ffffffff8173212c
> #8 [ffff88031b04fec8] do_fsync at ffffffff817330dc
> #9 [ffff88031b04ff68] sys_fdatasync at ffffffff8173437e
>    RIP: 00007f474a581ddd  RSP: 00007f46ba3fe8a0  RFLAGS: 00000282
>    RAX: 000000000000004b  RBX: ffffffff8258f609  RCX: ffffffffffffffff
>    RDX: 00007f4754ffd458  RSI: 0000000000000000  RDI: 0000000000000011
>    RBP: 0000000000000000  R8:  0000000000000000  R9:  00000000000058b9
>    R10: 00007f46ba3fe8b0  R11: 0000000000000293  R12: 00007f475be25d80
>    R13: ffffffff8173437e  R14: ffff88031b04ff78  R15: 00007f4755141452
>    ORIG_RAX: 000000000000004b  CS: 0033  SS: 002b
>
> crash> ps -m ffff8802b7f00000
> [0 00:20:36.663] [UN]  PID: 22713  TASK: ffff8802b7f00000  CPU: 1  COMMAND: "worker"
>
> The task sleeps for 20 minutes on bio completion:
>
> blkdev_issue_flush:
>         submit_bio(WRITE_FLUSH, bio);
> here>   wait_for_completion_io(&wait);
>
> As bio->bi_rw == (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH), we had:
>
> submit_bio->generic_make_request->dm_make_request->queue_io->queue_work
>
> So in wait_for_completion_io we wait for dm_wq_work to complete this
> bio. But the work is not in the workqueue any more: work->entry is an
> empty list, so the work seems to have completed. That could happen only
> if md->flags had the DMF_BLOCK_IO_FOR_SUSPEND bit set. But it is
> already unset, and when we clear that bit we queue another dm_wq_work
> on this wq in dm_queue_flush.
>
> So what could have happened here is that a reordered LOAD in dm_wq_work
> read the DMF_BLOCK_IO_FOR_SUSPEND bit before it was cleared in
> dm_queue_flush. Adding smp_mb in set_work_pool_and_clear_pending
> should order these operations properly.
>
> https://jira.sw.ru/browse/PSBM-69788
>
> original commit message:
>
> The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
> with the following backtrace:
>
> [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
> [ 601.347574]       Tainted: G O 4.4.5-1-storage+ #6
> [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 601.348142] kworker/u129:5  D ffff880803077988  0  1636  2 0x00000000
> [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
> [ 601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
> [ 601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
> [ 601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
> [ 601.350965] Call Trace:
> [ 601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
> [ 601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
> [ 601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
> [ 601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
> [ 601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
> [ 601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
> [ 601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
> [ 601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
> [ 601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
> [ 601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
> [ 601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
> [ 601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
> [ 601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
> [ 601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
> [ 601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
> [ 601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
> [ 601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
> [ 601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
> [ 601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
> [ 601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
> [ 601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
>
> The underlying device is a null_blk, with default parameters:
>
>   queue_mode    = MQ
>   submit_queues = 1
>
> Verification that nullb0 has something inflight:
>
> root@pserver8:~# cat /sys/block/nullb0/inflight
>        0        1
> root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
> ...
> /sys/block/nullb0/mq/0/cpu2/rq_list
> CTX pending:
>         ffff8838038e2400
> ...
>
> During debugging it became clear that the stalled request is always
> inserted into the rq_list from the following path:
>
>   save_stack_trace_tsk + 34
>   blk_mq_insert_requests + 231
>   blk_mq_flush_plug_list + 281
>   blk_flush_plug_list + 199
>   wait_on_page_bit + 192
>   __filemap_fdatawait_range + 228
>   filemap_fdatawait_range + 20
>   filemap_write_and_wait_range + 63
>   blkdev_fsync + 27
>   vfs_fsync_range + 73
>   blkdev_write_iter + 202
>   __vfs_write + 170
>   vfs_write + 169
>   kernel_write + 56
>
> So blk_flush_plug_list() was called with from_schedule == true.
>
> If from_schedule is true, blk_mq_insert_requests() offloads execution
> of __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
> kblockd_schedule_delayed_work_on().
>
> That means that we race with another CPU, which is about to execute
> the __blk_mq_run_hw_queue() work.
> Further debugging shows the following traces from different CPUs:
>
>   CPU#0                              CPU#1
>   ---------------------------------- -------------------------------
>   request A inserted
>   STORE hctx->ctx_map[0] bit marked
>   kblockd_schedule...() returns 1
>   <schedule to kblockd workqueue>
>                                      request B inserted
>                                      STORE hctx->ctx_map[1] bit marked
>                                      kblockd_schedule...() returns 0
>   *** WORK PENDING bit is cleared ***
>   flush_busy_ctxs() is executed, but
>   bit 1, set by CPU#1, is not observed
>
> As a result, request B remained pending forever.
>
> This behaviour can be explained by a speculative LOAD of hctx->ctx_map
> on CPU#0, which is reordered with the clear of the PENDING bit and
> executed _before_ the actual STORE of bit 1 on CPU#1.
>
> The proper fix is an explicit full barrier (<mfence> on x86), which
> guarantees that the clear of the PENDING bit is executed before any
> possible speculative LOADs or STOREs inside the actual work function.
>
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
> Cc: Michael Wang <yun.wang@profitbricks.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: linux-block@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Tejun Heo <tj@kernel.org>
>
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> ---
>  kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index ec41322a..ffd9f4e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -638,6 +638,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>  	 */
>  	smp_wmb();
>  	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +	/*
> +	 * The following mb guarantees that previous clear of a PENDING bit
> +	 * will not be reordered with any speculative LOADS or STORES from
> +	 * work->current_func, which is executed afterwards.  This possible
> +	 * reordering can lead to a missed execution on attempt to queue
> +	 * the same @work.  E.g. consider this case:
> +	 *
> +	 *   CPU#0                         CPU#1
> +	 *   ----------------------------  --------------------------------
> +	 *
> +	 * 1  STORE event_indicated
> +	 * 2  queue_work_on() {
> +	 * 3    test_and_set_bit(PENDING)
> +	 * 4 }                             set_..._and_clear_pending() {
> +	 * 5                                 set_work_data() # clear bit
> +	 * 6                                 smp_mb()
> +	 * 7                               work->current_func() {
> +	 * 8                                  LOAD event_indicated
> +	 *                                 }
> +	 *
> +	 * Without an explicit full barrier speculative LOAD on line 8 can
> +	 * be executed before CPU#0 does STORE on line 1.  If that happens,
> +	 * CPU#0 observes the PENDING bit is still set and new execution of
> +	 * a @work is not queued in the hope that CPU#1 will eventually
> +	 * finish the queued @work.  Meanwhile CPU#1 does not see
> +	 * event_indicated is set, because speculative LOAD was executed
> +	 * before actual STORE.
> +	 */
> +	smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
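
The quoted CPU#0/CPU#1 trace is an instance of the classic store-buffering pattern. As a rough user-space illustration (again C11 atomics plus pthreads; ctx_map_bit, pending_bit, cpu0 and cpu1 are illustrative names, not kernel symbols), the loop below should never print while both seq_cst fences are in place; remove them and the "ghost PENDING" outcome can appear even on x86, whose store buffer permits exactly this STORE/LOAD reordering.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int ctx_map_bit;  /* models the hctx->ctx_map bit set by CPU#1 */
static atomic_int pending_bit;  /* models the work's PENDING bit             */
static int r_worker, r_submitter;
static pthread_barrier_t round_start, round_end;

/* CPU#0: the kblockd worker -- clears PENDING, then scans ctx_map */
static void *cpu0(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		pthread_barrier_wait(&round_start);
		atomic_store_explicit(&pending_bit, 0, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst); /* the added smp_mb() */
		r_worker = atomic_load_explicit(&ctx_map_bit, memory_order_relaxed);
		pthread_barrier_wait(&round_end);
	}
	return NULL;
}

/* CPU#1: the submitter -- marks its ctx, then tries to re-queue the work */
static void *cpu1(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		pthread_barrier_wait(&round_start);
		atomic_store_explicit(&ctx_map_bit, 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst); /* test_and_set_bit() */
		r_submitter = atomic_load_explicit(&pending_bit, memory_order_relaxed);
		pthread_barrier_wait(&round_end);
		/*
		 * The lost-request outcome: the submitter saw PENDING set and
		 * backed off, yet the worker missed the new ctx_map bit.  With
		 * both fences in place this line must never fire.
		 */
		if (r_submitter == 1 && r_worker == 0)
			printf("ghost PENDING reproduced at iteration %d\n", i);
		atomic_store(&ctx_map_bit, 0);  /* reset for the next round */
		atomic_store(&pending_bit, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	atomic_store(&pending_bit, 1);  /* a work item is pending, about to run */
	pthread_barrier_init(&round_start, NULL, 2);
	pthread_barrier_init(&round_end, NULL, 2);
	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}

Compile with "gcc -O2 -pthread"; the two fences model the patch's smp_mb() and the implicit full barrier of test_and_set_bit() respectively.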