fuse kio: Add comment about why we need to wait for pending read requests

Submitted by Kirill Tkhai on Oct. 23, 2018, 10:01 a.m.


Message ID 154028886677.8068.6960631900459085731.stgit@localhost.localdomain
State New
Series "fuse kio: Add comment about why we need to wait for pending read requests"

Commit Message

Kirill Tkhai Oct. 23, 2018, 10:01 a.m.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)


diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
index 3940d6c255ba..9b45fcbf8941 100644
--- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
+++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
@@ -1081,7 +1081,12 @@  static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
 	r->end = req->end;
 	if (di->size.op == PCS_SIZE_SHRINK) {
 		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
-		/* wait for aio reads in flight */
+		/*
+		 * Wait for already submitted aio reads. Further reads
+		 * (including already queued to bg_queue) will be stopped
+		 * by wait_shrink(), and they will be processed from
+		 * _pcs_shrink_end().
+		 */
 		inode_dio_wait(req->io_inode);
 		/*
 		 * Writebackcache was flushed already so it is safe to

Comments

Pavel Butsykin Oct. 23, 2018, 10:15 a.m.
On 23.10.2018 13:01, Kirill Tkhai wrote:
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>   fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> index 3940d6c255ba..9b45fcbf8941 100644
> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
> @@ -1081,7 +1081,12 @@ static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
>   	r->end = req->end;
>   	if (di->size.op == PCS_SIZE_SHRINK) {
>   		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
> -		/* wait for aio reads in flight */
> +		/*
> +		 * Wait for already submitted aio reads. Further reads
> +		 * (including already queued to bg_queue) will be stopped
> +		 * by wait_shrink(), and they will be processed from
> +		 * _pcs_shrink_end().
> +		 */

Why do we have to wait for already-submitted aio reads?

>   		inode_dio_wait(req->io_inode);
>   		/*
>   		 * Writebackcache was flushed already so it is safe to
>
Kirill Tkhai Oct. 23, 2018, 10:19 a.m.
On 23.10.2018 13:15, Pavel Butsykin wrote:
> 
> 
> On 23.10.2018 13:01, Kirill Tkhai wrote:
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>   fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>> index 3940d6c255ba..9b45fcbf8941 100644
>> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>> @@ -1081,7 +1081,12 @@ static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
>>   	r->end = req->end;
>>   	if (di->size.op == PCS_SIZE_SHRINK) {
>>   		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
>> -		/* wait for aio reads in flight */
>> +		/*
>> +		 * Wait for already submitted aio reads. Further reads
>> +		 * (including already queued to bg_queue) will be stopped
>> +		 * by wait_shrink(), and they will be processed from
>> +		 * _pcs_shrink_end().
>> +		 */
> 
> Why do we have to wait for already-submitted aio reads?

Because there could be grow requests. This place is broken; I'm fixing it at the moment.
A one-dimensional size.op is not enough to handle the race between grow and shrink; we need
to convert PCS_SIZE_* into bit fields, as it used to be before.

>>   		inode_dio_wait(req->io_inode);
>>   		/*
>>   		 * Writebackcache was flushed already so it is safe to
>>
Pavel Butsykin Oct. 23, 2018, 1:08 p.m.
On 23.10.2018 13:19, Kirill Tkhai wrote:
> On 23.10.2018 13:15, Pavel Butsykin wrote:
>>
>>
>> On 23.10.2018 13:01, Kirill Tkhai wrote:
>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>> ---
>>>    fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>> index 3940d6c255ba..9b45fcbf8941 100644
>>> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>> @@ -1081,7 +1081,12 @@ static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
>>>    	r->end = req->end;
>>>    	if (di->size.op == PCS_SIZE_SHRINK) {
>>>    		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
>>> -		/* wait for aio reads in flight */
>>> +		/*
>>> +		 * Wait for already submitted aio reads. Further reads
>>> +		 * (including already queued to bg_queue) will be stopped
>>> +		 * by wait_shrink(), and they will be processed from
>>> +		 * _pcs_shrink_end().
>>> +		 */
>>
>> Why do we have to wait for already-submitted aio reads?
> 
> Because there could be grow requests. This place is broken; I'm fixing it at the moment.
> A one-dimensional size.op is not enough to handle the race between grow and shrink; we need
> to convert PCS_SIZE_* into bit fields, as it used to be before.
> 

I don't understand. There is no chance of a race between grow and
shrink, because shrink is protected from simultaneous execution with
write requests at the fuse level. No write requests -- no grows.

>>>    		inode_dio_wait(req->io_inode);
>>>    		/*
>>>    		 * Writebackcache was flushed already so it is safe to
>>>
Kirill Tkhai Oct. 23, 2018, 1:38 p.m.
On 23.10.2018 16:08, Pavel Butsykin wrote:
> 
> 
> On 23.10.2018 13:19, Kirill Tkhai wrote:
>> On 23.10.2018 13:15, Pavel Butsykin wrote:
>>>
>>>
>>> On 23.10.2018 13:01, Kirill Tkhai wrote:
>>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>> ---
>>>>    fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
>>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>>> index 3940d6c255ba..9b45fcbf8941 100644
>>>> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>>> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>>> @@ -1081,7 +1081,12 @@ static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
>>>>    	r->end = req->end;
>>>>    	if (di->size.op == PCS_SIZE_SHRINK) {
>>>>    		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
>>>> -		/* wait for aio reads in flight */
>>>> +		/*
>>>> +		 * Wait for already submitted aio reads. Further reads
>>>> +		 * (including already queued to bg_queue) will be stopped
>>>> +		 * by wait_shrink(), and they will be processed from
>>>> +		 * _pcs_shrink_end().
>>>> +		 */
>>>
>>> Why do we have to wait for already-submitted aio reads?
>>
>> Because there could be grow requests. This place is broken; I'm fixing it at the moment.
>> A one-dimensional size.op is not enough to handle the race between grow and shrink; we need
>> to convert PCS_SIZE_* into bit fields, as it used to be before.
>>
> 
> I don't understand. There is no chance of a race between grow and
> shrink, because shrink is protected from simultaneous execution with
> write requests at the fuse level. No write requests -- no grows.

Not exactly. Currently we have fallocate requests, which may spawn a grow operation.
Even FALLOC_FL_KEEP_SIZE may do that -- and in that case the inode mutex is not held.
So the question of what is correct, and which place should be fixed, is not so easy.

>>>>    		inode_dio_wait(req->io_inode);
>>>>    		/*
>>>>    		 * Writebackcache was flushed already so it is safe to
>>>>
Kirill Tkhai Oct. 23, 2018, 1:55 p.m.
On 23.10.2018 13:15, Pavel Butsykin wrote:
> 
> 
> On 23.10.2018 13:01, Kirill Tkhai wrote:
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>   fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>> index 3940d6c255ba..9b45fcbf8941 100644
>> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>> @@ -1081,7 +1081,12 @@ static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
>>   	r->end = req->end;
>>   	if (di->size.op == PCS_SIZE_SHRINK) {
>>   		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
>> -		/* wait for aio reads in flight */
>> +		/*
>> +		 * Wait for already submitted aio reads. Further reads
>> +		 * (including already queued to bg_queue) will be stopped
>> +		 * by wait_shrink(), and they will be processed from
>> +		 * _pcs_shrink_end().
>> +		 */
> 
> Why do we have to wait for already-submitted aio reads?

Also, see here:

[cpu0]                                                       [cpu1]
fuse_size_grow_work()
  ...
  spin_lock(&di->lock)
  di->size.op = PCS_SIZE_INACTION;
  spin_unlock(&di->lock)
  ...                                                        pcs_kio_setattr_handle()
                                                               inode_dio_wait() started
                                                               |
                                                               |
  pcs_cc_requeue(di->cluster, &pending_reqs)                   |
  ...                                                          |
    request_end()                                              v
                                                               inode_dio_wait() finished
                                                             proceed FATTR_SIZE shrink
Pavel Butsykin Oct. 23, 2018, 2:37 p.m.
On 23.10.2018 16:38, Kirill Tkhai wrote:
> On 23.10.2018 16:08, Pavel Butsykin wrote:
>>
>>
>> On 23.10.2018 13:19, Kirill Tkhai wrote:
>>> On 23.10.2018 13:15, Pavel Butsykin wrote:
>>>>
>>>>
>>>> On 23.10.2018 13:01, Kirill Tkhai wrote:
>>>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>>> ---
>>>>>     fs/fuse/kio/pcs/pcs_fuse_kdirect.c |    7 ++++++-
>>>>>     1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>>>> index 3940d6c255ba..9b45fcbf8941 100644
>>>>> --- a/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>>>> +++ b/fs/fuse/kio/pcs/pcs_fuse_kdirect.c
>>>>> @@ -1081,7 +1081,12 @@ static void pcs_kio_setattr_handle(struct fuse_inode *fi, struct fuse_req *req)
>>>>>     	r->end = req->end;
>>>>>     	if (di->size.op == PCS_SIZE_SHRINK) {
>>>>>     		BUG_ON(!mutex_is_locked(&req->io_inode->i_mutex));
>>>>> -		/* wait for aio reads in flight */
>>>>> +		/*
>>>>> +		 * Wait for already submitted aio reads. Further reads
>>>>> +		 * (including already queued to bg_queue) will be stopped
>>>>> +		 * by wait_shrink(), and they will be processed from
>>>>> +		 * _pcs_shrink_end().
>>>>> +		 */
>>>>
>>>> Why do we have to wait for already-submitted aio reads?
>>>
>>> Because there could be grow requests. This place is broken; I'm fixing it at the moment.
>>> A one-dimensional size.op is not enough to handle the race between grow and shrink; we need
>>> to convert PCS_SIZE_* into bit fields, as it used to be before.
>>>
>>
>> I don't understand. There is no chance of a race between grow and
>> shrink, because shrink is protected from simultaneous execution with
>> write requests at the fuse level. No write requests -- no grows.
> 
> Not exactly. Currently we have fallocate requests, which may spawn a grow operation.
> Even FALLOC_FL_KEEP_SIZE may do that -- and in that case the inode mutex is not held.
> So the question of what is correct, and which place should be fixed, is not so easy.
> 

So, we have two places where fallocate is possible without the mutex:

1.
static size_t fuse_send_unmap(struct fuse_req *req, struct fuse_io_priv *io,
		loff_t pos, size_t count, fl_owner_t owner)
{

..
	inarg->mode = FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE;
	req->in.h.opcode = FUSE_FALLOCATE;
..

	if (io->async)
		return fuse_async_req_send(fc, req, count, io);

	fuse_request_send(fc, req);
	return count;
}

It's definitely a mistake; we should protect this fallocate
from simultaneous execution with shrink.

2.
static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
				loff_t length)
{
...
	bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) ||
			   (mode & (FALLOC_FL_PUNCH_HOLE|FALLOC_FL_ZERO_RANGE));

	/* Return error if mode is not supported */
	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
		     FALLOC_FL_ZERO_RANGE))
		return -EOPNOTSUPP;

...

So, fallocate without the mutex is only possible
if (mode & FALLOC_FL_KEEP_SIZE) == FALLOC_FL_KEEP_SIZE.

Hmm, what does it mean? Looks like it will be the same as
FALLOC_FL_ZERO_RANGE|FALLOC_FL_KEEP_SIZE.

I wonder: do we need to handle such a fallocate for vstorage at all?

It seems that NO, it's just PCS_IREQ_NOOP.

Alexey?

>>>>>     		inode_dio_wait(req->io_inode);
>>>>>     		/*
>>>>>     		 * Writebackcache was flushed already so it is safe to
>>>>>