[Devel,RFC] mm: Limit number of busy-looped shrinking processes

Submitted by Kirill Tkhai on Sept. 5, 2017, 8:46 a.m.

Details

Message ID 150460101177.5887.10913656633993385863.stgit@localhost.localdomain
State New
Series "mm: Limit number of busy-looped shrinking processes"
Headers show

Commit Message

Kirill Tkhai Sept. 5, 2017, 8:46 a.m.
When a FUSE process is making shrink, it must not wait
on page writeback. Otherwise, it may meet a page,
that is being writebacked by him, and the process will stall.

So, our kernel does not wait writeback after commit a9707947010d
"mm: vmscan: never wait on writeback pages".

But in case of huge number of writebacked pages and
memory pressure, this lead to busy loop: many process
in the system are trying to shrink memory and have
no success. And the node shows high time, spent in kernel.

This patch reduces the number of processes, which may
busy looping on shrink. Only one userspace process --
vstorage -- will be allowed not to sleep on writeback.
Other processes will sleep up to 5 seconds to wait
writeback completion on every page.

The detection of vstorage is very simple and it based
on process name. It seems, there is no a way to detect
all FUSE processes, especially from !ve0, because FUSE
mount is tricky, and a process doing mount may not be
a FUSE daemon. So, we remain the vanila kernel behaviour,
but we don't wait forever, just 5 second. This will save
us from lookup messages from kernel and will allow
to kill FUSE daemon if necessary.

https://jira.sw.ru/browse/PSBM-69296

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 mm/vmscan.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Patch hide | download patch | download mbox

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5db5940bb1..e72d515c111 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -959,8 +959,16 @@  static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/* Case 3 above */
 			} else {
-				nr_immediate++;
-				goto keep_locked;
+				/*
+				 * Currently, vstorage is the only fuse process,
+				 * exercising writeback; it mustn't sleep to avoid
+				 * deadlocks.
+				 */
+				if (!strncmp(current->comm, "vstorage", 8) ||
+				    wait_on_page_bit_killable_timeout(page, PG_writeback, 5 * HZ) != 0) {
+					nr_immediate++;
+					goto keep_locked;
+				}
 			}
 		}
 
@@ -1592,9 +1600,10 @@  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_writeback && nr_writeback == nr_taken)
 		zone_set_flag(zone, ZONE_WRITEBACK);
 
-	if (!global_reclaim(sc) && nr_immediate)
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
-
+	/*
+	 * memcg will stall in page writeback so only consider forcibly
+	 * stalling for global reclaim
+	 */
 	if (global_reclaim(sc)) {
 		/*
 		 * Tag a zone as congested if all the dirty pages scanned were

Comments

Dmitry Monakhov Sept. 5, 2017, 9:49 a.m.
Kirill Tkhai <ktkhai@virtuozzo.com> writes:

> When a FUSE process is making shrink, it must not wait
> on page writeback. Otherwise, it may meet a page,
> that is being writebacked by him, and the process will stall.
>
> So, our kernel does not wait writeback after commit a9707947010d
> "mm: vmscan: never wait on writeback pages".
>
> But in case of huge number of writebacked pages and
> memory pressure, this lead to busy loop: many process
> in the system are trying to shrink memory and have
> no success. And the node shows high time, spent in kernel.
>
> This patch reduces the number of processes, which may
> busy looping on shrink. Only one userspace process --
> vstorage -- will be allowed not to sleep on writeback.
> Other processes will sleep up to 5 seconds to wait
> writeback completion on every page.
>
> The detection of vstorage is very simple and it based
> on process name. It seems, there is no a way to detect
NAK. Detection by name is very very bad design style.
fused and others should mark iself as writeback-proof explicitly
via API similar ioctl/madvice/ionice/ulimit,
may be it is reasonable to place such app to speciffic cgroup,
you may pick any recepy you like. But please do not do comm-name
matching.

> all FUSE processes, especially from !ve0, because FUSE
> mount is tricky, and a process doing mount may not be
> a FUSE daemon. So, we remain the vanila kernel behaviour,
> but we don't wait forever, just 5 second. This will save
> us from lookup messages from kernel and will allow
> to kill FUSE daemon if necessary.
>
> https://jira.sw.ru/browse/PSBM-69296
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  mm/vmscan.c |   19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5db5940bb1..e72d515c111 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -959,8 +959,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  			/* Case 3 above */
>  			} else {
> -				nr_immediate++;
> -				goto keep_locked;
> +				/*
> +				 * Currently, vstorage is the only fuse process,
> +				 * exercising writeback; it mustn't sleep to avoid
> +				 * deadlocks.
> +				 */
> +				if (!strncmp(current->comm, "vstorage", 8) ||
> +				    wait_on_page_bit_killable_timeout(page, PG_writeback, 5 * HZ) != 0) {
> +					nr_immediate++;
> +					goto keep_locked;
> +				}
>  			}
>  		}
>  
> @@ -1592,9 +1600,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	if (nr_writeback && nr_writeback == nr_taken)
>  		zone_set_flag(zone, ZONE_WRITEBACK);
>  
> -	if (!global_reclaim(sc) && nr_immediate)
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> +	/*
> +	 * memcg will stall in page writeback so only consider forcibly
> +	 * stalling for global reclaim
> +	 */
>  	if (global_reclaim(sc)) {
>  		/*
>  		 * Tag a zone as congested if all the dirty pages scanned were