[Devel,vz7] fuse: relax i_mutex coverage in fuse_fsync

Submitted by Maxim Patlasov on Nov. 30, 2016, 1:02 a.m.

Details

Message ID 148046775173.10135.14560536277093193938.stgit@maxim-thinkpad
State New
Series "fuse: relax i_mutex coverage in fuse_fsync"
Headers show

Commit Message

Maxim Patlasov Nov. 30, 2016, 1:02 a.m.
fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
fuse_flush_mtime(). But when those operations are done, it's actually
doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
or not: we ensured that all relevant data were already seen by
userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
anyway; if the user screws up by leaking new data updates in-between, it
is up to the user and doesn't violate fsync(2) semantics.

https://jira.sw.ru/browse/PSBM-55919

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
---
 fs/fuse/file.c |    3 +++
 1 file changed, 3 insertions(+)

Patch hide | download patch | download mbox

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 464b2f5..559dfd9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -697,6 +697,8 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 		goto out;
 	}
 
+	mutex_unlock(&inode->i_mutex);
+
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;
 	inarg.fsync_flags = datasync ? 1 : 0;
@@ -715,6 +717,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 			fc->no_fsync = 1;
 		err = 0;
 	}
+	return err;
 out:
 	mutex_unlock(&inode->i_mutex);
 	return err;

Comments

Denis V. Lunev Nov. 30, 2016, 4:24 a.m.
On 11/30/2016 04:02 AM, Maxim Patlasov wrote:
> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
> fuse_flush_mtime(). But when those operations are done, it's actually
> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
> or not: we ensured that all relevant data were already seen by
> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
> anyway; if the user screws up by leaking new data updates in-between, it
> is up to the user and doesn't violate fsync(2) semantics.
>
> https://jira.sw.ru/browse/PSBM-55919
>
> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Can you pls think about merging the same into PCS6-Up12?


> ---
>  fs/fuse/file.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 464b2f5..559dfd9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>  		goto out;
>  	}
>  
> +	mutex_unlock(&inode->i_mutex);
> +
>  	memset(&inarg, 0, sizeof(inarg));
>  	inarg.fh = ff->fh;
>  	inarg.fsync_flags = datasync ? 1 : 0;
> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>  			fc->no_fsync = 1;
>  		err = 0;
>  	}
> +	return err;
>  out:
>  	mutex_unlock(&inode->i_mutex);
>  	return err;
>
Alexey Kuznetsov Nov. 30, 2016, 12:47 p.m.
Hello!

I do not think you got it right.

i_mutex in fsync is not about some atomicity,
it is about stopping data feed while fsync is executed
to prevent livelock.

I cannot tell anything about mtime update, it is just some voodoo
magic for me.

What's about fsync semantics, I see two different ways:

A.

1. Remove useless write_inode_now. Its work is done
    by filemap_write_and_wait_range(), there is no need to repeat it
under mutex.
2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
    fuse continuation forfilemap_write_and_wait_range().
3. i_mutex is preserved only around fsync call.

B.
1. Remove  write_inode_now as well.
2. Remove i_mutex _completely_. (No idea about mtime voodo though)
2. Replace fuse_sync_writes() with fuse_set_nowrite()
    and add release after call to FSYNC.

Both prevent livelock. B is obviosly optimal.

But A preserves historic fuse protocol semantics.
F.e. I have no idea would user space survive truncate
racing with fsync. pstorage should survice, though this
path was never tested.





On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov@virtuozzo.com> wrote:
> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
> fuse_flush_mtime(). But when those operations are done, it's actually
> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
> or not: we ensured that all relevant data were already seen by
> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
> anyway; if the user screws up by leaking new data updates in-between, it
> is up to the user and doesn't violate fsync(2) semantics.
>
> https://jira.sw.ru/browse/PSBM-55919
>
> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
> ---
>  fs/fuse/file.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 464b2f5..559dfd9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>                 goto out;
>         }
>
> +       mutex_unlock(&inode->i_mutex);
> +
>         memset(&inarg, 0, sizeof(inarg));
>         inarg.fh = ff->fh;
>         inarg.fsync_flags = datasync ? 1 : 0;
> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>                         fc->no_fsync = 1;
>                 err = 0;
>         }
> +       return err;
>  out:
>         mutex_unlock(&inode->i_mutex);
>         return err;
>
Alexey Kuznetsov Nov. 30, 2016, 1:09 p.m.
Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
can be done only under i_mutex.

IMHO it is only due to bad implementation.
If fuse_set_nowrite would be done with separate
count instead of adding negative bias, it would
be possible.


On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov <kuznet@virtuozzo.com> wrote:
> Hello!
>
> I do not think you got it right.
>
> i_mutex in fsync is not about some atomicity,
> it is about stopping data feed while fsync is executed
> to prevent livelock.
>
> I cannot tell anything about mtime update, it is just some voodoo
> magic for me.
>
> What's about fsync semantics, I see two different ways:
>
> A.
>
> 1. Remove useless write_inode_now. Its work is done
>     by filemap_write_and_wait_range(), there is no need to repeat it
> under mutex.
> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>     fuse continuation forfilemap_write_and_wait_range().
> 3. i_mutex is preserved only around fsync call.
>
> B.
> 1. Remove  write_inode_now as well.
> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>     and add release after call to FSYNC.
>
> Both prevent livelock. B is obviosly optimal.
>
> But A preserves historic fuse protocol semantics.
> F.e. I have no idea would user space survive truncate
> racing with fsync. pstorage should survice, though this
> path was never tested.
>
>
>
>
>
> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov@virtuozzo.com> wrote:
>> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
>> fuse_flush_mtime(). But when those operations are done, it's actually
>> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
>> or not: we ensured that all relevant data were already seen by
>> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
>> anyway; if the user screws up by leaking new data updates in-between, it
>> is up to the user and doesn't violate fsync(2) semantics.
>>
>> https://jira.sw.ru/browse/PSBM-55919
>>
>> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
>> ---
>>  fs/fuse/file.c |    3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 464b2f5..559dfd9 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>                 goto out;
>>         }
>>
>> +       mutex_unlock(&inode->i_mutex);
>> +
>>         memset(&inarg, 0, sizeof(inarg));
>>         inarg.fh = ff->fh;
>>         inarg.fsync_flags = datasync ? 1 : 0;
>> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>                         fc->no_fsync = 1;
>>                 err = 0;
>>         }
>> +       return err;
>>  out:
>>         mutex_unlock(&inode->i_mutex);
>>         return err;
>>
Maxim Patlasov Nov. 30, 2016, 6:01 p.m.
Alexey,


You're right. And while composing the patch I well understood that it's 
possible to rework fuse_sync_writes() using a counter instead of 
negative bias. But the problem with flush_mtime still exists anyway. 
Think about it: we firstly acquire local mtime from local inode, then 
fill and submit mtime-update-request. Since then, we don't know when 
exactly fuse daemon will apply that new mtime to its metadata 
structures. If another mtime-update is generated in-between (e.g. "touch 
-d <date> file", or even simplier -- just a single direct write 
implicitly updating mtime), we wouldn't know which of those two 
mtime-update-requests are processed by fused first. That comes from a 
general FUSE protocol limitation: when kernel fuse queues request A, 
then request B, it cannot be sure if they will be processed by userspace 
as <A, then B> or <B, then A>.


The big advantage of the patch I sent is that it's very simple, 
straightforward and presumably will remove 99% of contention between 
fsync and io_submit (assuming we spend most of time waiting for 
userspace ACK for FUSE_FSYNC request. There are actually three questions 
to answer:


1) Do we really must honor a crazy app who mixes a lot of fsyncs with a 
lot of io_submits? The goal of fsync is to ensure that some state is 
actually went to platters. An app who races io_submit-s with fsync-s 
actually doesn't care which state will come to platters. I'm not sure 
that it's reasonable to work very hard to achieve the best possible 
performance for such a marginal app.


2) Will the patch (in the form I sent it) break something? I think no. 
If you know some usecase that can be broken, let's discuss it in more 
details.


3) Should we expect some noticeable (or significant) improvement in 
performance comparing fuse_fsync with no locking at all vs. the locking 
we have with that patch applied? I tend to think that the answer is "no" 
because handling FUSE_FSYNC is notoriously heavy-weight operation. If 
you disagree, let's firstly measure that difference in performance 
(simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then 
start to think if it's really worthy to fully re-work locking scheme to 
preserve flush_mtime correctness w/o i_mutex.


Thanks,

Maxim


On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
> can be done only under i_mutex.
>
> IMHO it is only due to bad implementation.
> If fuse_set_nowrite would be done with separate
> count instead of adding negative bias, it would
> be possible.
>
>
> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov <kuznet@virtuozzo.com> wrote:
>> Hello!
>>
>> I do not think you got it right.
>>
>> i_mutex in fsync is not about some atomicity,
>> it is about stopping data feed while fsync is executed
>> to prevent livelock.
>>
>> I cannot tell anything about mtime update, it is just some voodoo
>> magic for me.
>>
>> What's about fsync semantics, I see two different ways:
>>
>> A.
>>
>> 1. Remove useless write_inode_now. Its work is done
>>      by filemap_write_and_wait_range(), there is no need to repeat it
>> under mutex.
>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>      fuse continuation forfilemap_write_and_wait_range().
>> 3. i_mutex is preserved only around fsync call.
>>
>> B.
>> 1. Remove  write_inode_now as well.
>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>      and add release after call to FSYNC.
>>
>> Both prevent livelock. B is obviosly optimal.
>>
>> But A preserves historic fuse protocol semantics.
>> F.e. I have no idea would user space survive truncate
>> racing with fsync. pstorage should survice, though this
>> path was never tested.
>>
>>
>>
>>
>>
>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov@virtuozzo.com> wrote:
>>> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
>>> fuse_flush_mtime(). But when those operations are done, it's actually
>>> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
>>> or not: we ensured that all relevant data were already seen by
>>> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
>>> anyway; if the user screws up by leaking new data updates in-between, it
>>> is up to the user and doesn't violate fsync(2) semantics.
>>>
>>> https://jira.sw.ru/browse/PSBM-55919
>>>
>>> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
>>> ---
>>>   fs/fuse/file.c |    3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>> index 464b2f5..559dfd9 100644
>>> --- a/fs/fuse/file.c
>>> +++ b/fs/fuse/file.c
>>> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>                  goto out;
>>>          }
>>>
>>> +       mutex_unlock(&inode->i_mutex);
>>> +
>>>          memset(&inarg, 0, sizeof(inarg));
>>>          inarg.fh = ff->fh;
>>>          inarg.fsync_flags = datasync ? 1 : 0;
>>> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>                          fc->no_fsync = 1;
>>>                  err = 0;
>>>          }
>>> +       return err;
>>>   out:
>>>          mutex_unlock(&inode->i_mutex);
>>>          return err;
>>>
Dmitry Monakhov Dec. 1, 2016, 8:06 a.m.
Maxim Patlasov <mpatlasov@virtuozzo.com> writes:

> Alexey,
>
>
> You're right. And while composing the patch I well understood that it's 
> possible to rework fuse_sync_writes() using a counter instead of 
> negative bias. But the problem with flush_mtime still exists anyway. 
> Think about it: we firstly acquire local mtime from local inode, then 
> fill and submit mtime-update-request. Since then, we don't know when 
> exactly fuse daemon will apply that new mtime to its metadata 
> structures. If another mtime-update is generated in-between (e.g. "touch 
> -d <date> file", or even simplier -- just a single direct write 
> implicitly updating mtime), we wouldn't know which of those two 
> mtime-update-requests are processed by fused first. That comes from a 
> general FUSE protocol limitation: when kernel fuse queues request A, 
> then request B, it cannot be sure if they will be processed by userspace 
> as <A, then B> or <B, then A>.
>
>
> The big advantage of the patch I sent is that it's very simple, 
> straightforward and presumably will remove 99% of contention between 
> fsync and io_submit (assuming we spend most of time waiting for 
> userspace ACK for FUSE_FSYNC request. There are actually three questions 
> to answer:

>
>
> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a 
> lot of io_submits? The goal of fsync is to ensure that some state is 
> actually went to platters. An app who races io_submit-s with fsync-s 
> actually doesn't care which state will come to platters. I'm not sure 
> that it's reasonable to work very hard to achieve the best possible 
> performance for such a marginal app.
Obiously any filesystem behave like this.
Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s
All that io may happens in parallel, fs guarantee only that metadata
will be serialized. So all that concurent IO flowa to blockdevice which
does no have i_mutex so all IO indeed happen concurrently.
But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex
on file. For general filesystem (xfs/ext4) we grab i_mutex only on write
path, fsync is lockless. But int case of fuse we artificially introduce
i_mutex inside fsync which basically kill concurrency for upper FS.
As result we have SMP scalability as we have in Linux-v2.2 with single
mutex in VFS.

BTW: I'm wondering why do we care about mtime at all. for fs-in-file
we can relax that, for example flush mtime only on fsync, and not for fdatasync.

>
>
> 2) Will the patch (in the form I sent it) break something? I think no. 
> If you know some usecase that can be broken, let's discuss it in more 
> details.
>
>
> 3) Should we expect some noticeable (or significant) improvement in 
> performance comparing fuse_fsync with no locking at all vs. the locking 
> we have with that patch applied? I tend to think that the answer is "no" 
> because handling FUSE_FSYNC is notoriously heavy-weight operation. If 
> you disagree, let's firstly measure that difference in performance 
> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then 
> start to think if it's really worthy to fully re-work locking scheme to 
> preserve flush_mtime correctness w/o i_mutex.
>
>
> Thanks,
>
> Maxim
>
>
> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>> can be done only under i_mutex.
>>
>> IMHO it is only due to bad implementation.
>> If fuse_set_nowrite would be done with separate
>> count instead of adding negative bias, it would
>> be possible.
>>
>>
>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov <kuznet@virtuozzo.com> wrote:
>>> Hello!
>>>
>>> I do not think you got it right.
>>>
>>> i_mutex in fsync is not about some atomicity,
>>> it is about stopping data feed while fsync is executed
>>> to prevent livelock.
>>>
>>> I cannot tell anything about mtime update, it is just some voodoo
>>> magic for me.
>>>
>>> What's about fsync semantics, I see two different ways:
>>>
>>> A.
>>>
>>> 1. Remove useless write_inode_now. Its work is done
>>>      by filemap_write_and_wait_range(), there is no need to repeat it
>>> under mutex.
>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>      fuse continuation forfilemap_write_and_wait_range().
>>> 3. i_mutex is preserved only around fsync call.
>>>
>>> B.
>>> 1. Remove  write_inode_now as well.
>>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>      and add release after call to FSYNC.
>>>
>>> Both prevent livelock. B is obviosly optimal.
>>>
>>> But A preserves historic fuse protocol semantics.
>>> F.e. I have no idea would user space survive truncate
>>> racing with fsync. pstorage should survice, though this
>>> path was never tested.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov@virtuozzo.com> wrote:
>>>> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
>>>> fuse_flush_mtime(). But when those operations are done, it's actually
>>>> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
>>>> or not: we ensured that all relevant data were already seen by
>>>> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
>>>> anyway; if the user screws up by leaking new data updates in-between, it
>>>> is up to the user and doesn't violate fsync(2) semantics.
>>>>
>>>> https://jira.sw.ru/browse/PSBM-55919
>>>>
>>>> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
>>>> ---
>>>>   fs/fuse/file.c |    3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>> index 464b2f5..559dfd9 100644
>>>> --- a/fs/fuse/file.c
>>>> +++ b/fs/fuse/file.c
>>>> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>                  goto out;
>>>>          }
>>>>
>>>> +       mutex_unlock(&inode->i_mutex);
>>>> +
>>>>          memset(&inarg, 0, sizeof(inarg));
>>>>          inarg.fh = ff->fh;
>>>>          inarg.fsync_flags = datasync ? 1 : 0;
>>>> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>                          fc->no_fsync = 1;
>>>>                  err = 0;
>>>>          }
>>>> +       return err;
>>>>   out:
>>>>          mutex_unlock(&inode->i_mutex);
>>>>          return err;
>>>>
Maxim Patlasov Dec. 1, 2016, 7:09 p.m.
On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:

> Maxim Patlasov <mpatlasov@virtuozzo.com> writes:
>
>> Alexey,
>>
>>
>> You're right. And while composing the patch I well understood that it's
>> possible to rework fuse_sync_writes() using a counter instead of
>> negative bias. But the problem with flush_mtime still exists anyway.
>> Think about it: we firstly acquire local mtime from local inode, then
>> fill and submit mtime-update-request. Since then, we don't know when
>> exactly fuse daemon will apply that new mtime to its metadata
>> structures. If another mtime-update is generated in-between (e.g. "touch
>> -d <date> file", or even simplier -- just a single direct write
>> implicitly updating mtime), we wouldn't know which of those two
>> mtime-update-requests are processed by fused first. That comes from a
>> general FUSE protocol limitation: when kernel fuse queues request A,
>> then request B, it cannot be sure if they will be processed by userspace
>> as <A, then B> or <B, then A>.
>>
>>
>> The big advantage of the patch I sent is that it's very simple,
>> straightforward and presumably will remove 99% of contention between
>> fsync and io_submit (assuming we spend most of time waiting for
>> userspace ACK for FUSE_FSYNC request. There are actually three questions
>> to answer:
>>
>> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
>> lot of io_submits? The goal of fsync is to ensure that some state is
>> actually went to platters. An app who races io_submit-s with fsync-s
>> actually doesn't care which state will come to platters. I'm not sure
>> that it's reasonable to work very hard to achieve the best possible
>> performance for such a marginal app.
> Obiously any filesystem behave like this.
> Task A(mail-server) may perform write/fsync, task B(mysql) do a lot of io_submit-s
> All that io may happens in parallel, fs guarantee only that metadata
> will be serialized. So all that concurent IO flowa to blockdevice which
> does no have i_mutex so all IO indeed happen concurrently.

Looks as you're comparing an app doing POSIX open/read/write/fsync/close 
with fs doing submit_bio. This is a stretch. But OK, there is a 
similarity. But I don't think this rather vague similarity proves something.

> But when we dealt with fs-in-file (loop/ploop/qemu-nbd) we face i_mutex
> on file.

This really makes sense. If an app inside a VM loops over ordinary 
direct writes, while another app (in the same VM) does fsync, it's not 
fair to suspend the first app for long while just because fuse holds 
i_mutex for long somewhere deep in fuse_fsync.

> For general filesystem (xfs/ext4)

What makes xfs/ext4 general? They are *local* fs having almost direct 
access to underlying HDD. Have you explored how CEPH implements fsync?

> we grab i_mutex only on write
> path, fsync is lockless.

What about the following part of ext4_sync_file:

>     if (!journal) {
>         ret = generic_file_fsync(file, start, end, datasync);
? Is it lockless?

> But int case of fuse we artificially introduce
> i_mutex inside fsync

exactly the same as generic_file_fsync()

> which basically kill concurrency for upper FS.
> As result we have SMP scalability as we have in Linux-v2.2 with single
> mutex in VFS.

You can't derive "kill concurrency" from "introduce i_mutex" as simple 
as that. It really does matter for how long the mutex is held and how 
often it contends with writes.

>
> BTW: I'm wondering why do we care about mtime at all. for fs-in-file
> we can relax that, for example flush mtime only on fsync, and not for fdatasync.

Good catch, makes sense!


>
>>
>> 2) Will the patch (in the form I sent it) break something? I think no.
>> If you know some usecase that can be broken, let's discuss it in more
>> details.
>>
>>
>> 3) Should we expect some noticeable (or significant) improvement in
>> performance comparing fuse_fsync with no locking at all vs. the locking
>> we have with that patch applied? I tend to think that the answer is "no"
>> because handling FUSE_FSYNC is notoriously heavy-weight operation. If
>> you disagree, let's firstly measure that difference in performance
>> (simply commenting out lock/unlock(i_mutex) in fuse_fsync) and then
>> start to think if it's really worthy to fully re-work locking scheme to
>> preserve flush_mtime correctness w/o i_mutex.
>>
>>
>> Thanks,
>>
>> Maxim
>>
>>
>> On 11/30/2016 05:09 AM, Alexey Kuznetsov wrote:
>>> Sorry, missed that pair fuse_set_nowrite/fuse_release_writes
>>> can be done only under i_mutex.
>>>
>>> IMHO it is only due to bad implementation.
>>> If fuse_set_nowrite would be done with separate
>>> count instead of adding negative bias, it would
>>> be possible.
>>>
>>>
>>> On Wed, Nov 30, 2016 at 3:47 PM, Alexey Kuznetsov <kuznet@virtuozzo.com> wrote:
>>>> Hello!
>>>>
>>>> I do not think you got it right.
>>>>
>>>> i_mutex in fsync is not about some atomicity,
>>>> it is about stopping data feed while fsync is executed
>>>> to prevent livelock.
>>>>
>>>> I cannot tell anything about mtime update, it is just some voodoo
>>>> magic for me.
>>>>
>>>> What's about fsync semantics, I see two different ways:
>>>>
>>>> A.
>>>>
>>>> 1. Remove useless write_inode_now. Its work is done
>>>>       by filemap_write_and_wait_range(), there is no need to repeat it
>>>> under mutex.
>>>> 2. move mutex_lock _after_  fuse_sync_writes(), which is essentially
>>>>       fuse continuation forfilemap_write_and_wait_range().
>>>> 3. i_mutex is preserved only around fsync call.
>>>>
>>>> B.
>>>> 1. Remove  write_inode_now as well.
>>>> 2. Remove i_mutex _completely_. (No idea about mtime voodo though)
>>>> 2. Replace fuse_sync_writes() with fuse_set_nowrite()
>>>>       and add release after call to FSYNC.
>>>>
>>>> Both prevent livelock. B is obviosly optimal.
>>>>
>>>> But A preserves historic fuse protocol semantics.
>>>> F.e. I have no idea would user space survive truncate
>>>> racing with fsync. pstorage should survice, though this
>>>> path was never tested.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 30, 2016 at 4:02 AM, Maxim Patlasov <mpatlasov@virtuozzo.com> wrote:
>>>>> fuse_fsync_common() does need i_mutex for fuse_sync_writes() and
>>>>> fuse_flush_mtime(). But when those operations are done, it's actually
>>>>> doesn't matter whether to hold the lock over fuse_request_send(FUSE_FSYNC)
>>>>> or not: we ensured that all relevant data were already seen by
>>>>> userspace fuse daemon, and so it will sync them (by handling FUSE_FSYNC)
>>>>> anyway; if the user screws up by leaking new data updates in-between, it
>>>>> is up to the user and doesn't violate fsync(2) semantics.
>>>>>
>>>>> https://jira.sw.ru/browse/PSBM-55919
>>>>>
>>>>> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
>>>>> ---
>>>>>    fs/fuse/file.c |    3 +++
>>>>>    1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>> index 464b2f5..559dfd9 100644
>>>>> --- a/fs/fuse/file.c
>>>>> +++ b/fs/fuse/file.c
>>>>> @@ -697,6 +697,8 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>>                   goto out;
>>>>>           }
>>>>>
>>>>> +       mutex_unlock(&inode->i_mutex);
>>>>> +
>>>>>           memset(&inarg, 0, sizeof(inarg));
>>>>>           inarg.fh = ff->fh;
>>>>>           inarg.fsync_flags = datasync ? 1 : 0;
>>>>> @@ -715,6 +717,7 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>>>>                           fc->no_fsync = 1;
>>>>>                   err = 0;
>>>>>           }
>>>>> +       return err;
>>>>>    out:
>>>>>           mutex_unlock(&inode->i_mutex);
>>>>>           return err;
>>>>>
Denis V. Lunev Dec. 1, 2016, 7:27 p.m.
On 12/01/2016 10:09 PM, Maxim Patlasov wrote:
> On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:
>
>> Maxim Patlasov <mpatlasov@virtuozzo.com> writes:
>>
>>> Alexey,
>>>
>>>
>>> You're right. And while composing the patch I well understood that it's
>>> possible to rework fuse_sync_writes() using a counter instead of
>>> negative bias. But the problem with flush_mtime still exists anyway.
>>> Think about it: we firstly acquire local mtime from local inode, then
>>> fill and submit mtime-update-request. Since then, we don't know when
>>> exactly fuse daemon will apply that new mtime to its metadata
>>> structures. If another mtime-update is generated in-between (e.g.
>>> "touch
>>> -d <date> file", or even simplier -- just a single direct write
>>> implicitly updating mtime), we wouldn't know which of those two
>>> mtime-update-requests are processed by fused first. That comes from a
>>> general FUSE protocol limitation: when kernel fuse queues request A,
>>> then request B, it cannot be sure if they will be processed by
>>> userspace
>>> as <A, then B> or <B, then A>.
>>>
>>>
>>> The big advantage of the patch I sent is that it's very simple,
>>> straightforward and presumably will remove 99% of contention between
>>> fsync and io_submit (assuming we spend most of time waiting for
>>> userspace ACK for FUSE_FSYNC request. There are actually three
>>> questions
>>> to answer:
>>>
>>> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
>>> lot of io_submits? The goal of fsync is to ensure that some state is
>>> actually went to platters. An app who races io_submit-s with fsync-s
>>> actually doesn't care which state will come to platters. I'm not sure
>>> that it's reasonable to work very hard to achieve the best possible
>>> performance for such a marginal app.
>> Obiously any filesystem behave like this.
>> Task A(mail-server) may perform write/fsync, task B(mysql) do a lot
>> of io_submit-s
>> All that io may happens in parallel, fs guarantee only that metadata
>> will be serialized. So all that concurent IO flowa to blockdevice which
>> does no have i_mutex so all IO indeed happen concurrently.
>
> Looks as you're comparing an app doing POSIX
> open/read/write/fsync/close with fs doing submit_bio. This is a
> stretch. But OK, there is a similarity. But I don't think this rather
> vague similarity proves something.

we are speaking about VM process, which essentially
re-submits IO from the guest to host like above. For sure
QEMU and VM_app have this IO pattern. Thus this
pattern MUST be optimized as this is one of our
main loads.

That is why I think that this case is not marginal
and important.

Den
Maxim Patlasov Dec. 1, 2016, 7:30 p.m.
On 12/01/2016 11:27 AM, Denis V. Lunev wrote:

> On 12/01/2016 10:09 PM, Maxim Patlasov wrote:
>> On 12/01/2016 12:06 AM, Dmitry Monakhov wrote:
>>
>>> Maxim Patlasov <mpatlasov@virtuozzo.com> writes:
>>>
>>>> Alexey,
>>>>
>>>>
>>>> You're right. And while composing the patch I well understood that it's
>>>> possible to rework fuse_sync_writes() using a counter instead of
>>>> negative bias. But the problem with flush_mtime still exists anyway.
>>>> Think about it: we firstly acquire local mtime from local inode, then
>>>> fill and submit mtime-update-request. Since then, we don't know when
>>>> exactly fuse daemon will apply that new mtime to its metadata
>>>> structures. If another mtime-update is generated in-between (e.g.
>>>> "touch
>>>> -d <date> file", or even simplier -- just a single direct write
>>>> implicitly updating mtime), we wouldn't know which of those two
>>>> mtime-update-requests are processed by fused first. That comes from a
>>>> general FUSE protocol limitation: when kernel fuse queues request A,
>>>> then request B, it cannot be sure if they will be processed by
>>>> userspace
>>>> as <A, then B> or <B, then A>.
>>>>
>>>>
>>>> The big advantage of the patch I sent is that it's very simple,
>>>> straightforward and presumably will remove 99% of contention between
>>>> fsync and io_submit (assuming we spend most of time waiting for
>>>> userspace ACK for FUSE_FSYNC request. There are actually three
>>>> questions
>>>> to answer:
>>>>
>>>> 1) Do we really must honor a crazy app who mixes a lot of fsyncs with a
>>>> lot of io_submits? The goal of fsync is to ensure that some state is
>>>> actually went to platters. An app who races io_submit-s with fsync-s
>>>> actually doesn't care which state will come to platters. I'm not sure
>>>> that it's reasonable to work very hard to achieve the best possible
>>>> performance for such a marginal app.
>>> Obiously any filesystem behave like this.
>>> Task A(mail-server) may perform write/fsync, task B(mysql) do a lot
>>> of io_submit-s
>>> All that io may happens in parallel, fs guarantee only that metadata
>>> will be serialized. So all that concurent IO flowa to blockdevice which
>>> does no have i_mutex so all IO indeed happen concurrently.
>> Looks as you're comparing an app doing POSIX
>> open/read/write/fsync/close with fs doing submit_bio. This is a
>> stretch. But OK, there is a similarity. But I don't think this rather
>> vague similarity proves something.
> we are speaking about VM process, which essentially
> re-submits IO from the guest to host like above. For sure
> QEMU and VM_app have this IO pattern. Thus this
> pattern MUST be optimized as this is one of our
> main loads.

Yes, I agree. That's exactly why I wrote in the same email (next paragraph):

> This really makes sense. If an app inside a VM loops over ordinary 
> direct writes, while another app (in the same VM) does fsync, it's not 
> fair to suspend the first app for long while just because fuse holds 
> i_mutex for long somewhere deep in fuse_fsync. 

Max

>
> That is why I think that this case is not marginal
> and important.
>
> Den