[RHEL7,COMMIT] ms/mm: throttle on IO only when there are too many dirty and writeback pages

Submitted by Konstantin Khorenko on Jan. 31, 2018, 3:29 p.m.

Details

Message ID 201801311529.w0VFTXsH024302@finist_ce7.work
State New
Series "Series without cover letter"

Commit Message

The commit is pushed to "branch-rh7-3.10.0-693.11.6.vz7.42.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-693.11.6.vz7.42.4
------>
commit 00a46c2c9e50f959777dcb17ce0127e6b1108e66
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Jan 31 18:29:32 2018 +0300

    ms/mm: throttle on IO only when there are too many dirty and writeback pages
    
    wait_iff_congested has been used to throttle the allocator before it
    retried another round of direct reclaim, to allow the writeback to make
    some progress and to prevent reclaim from looping over dirty/writeback
    pages without making any progress.
    
    We used to do congestion_wait before commit 0e093d99763e ("writeback: do
    not sleep on the congestion queue if there are no congested BDIs or if
    significant congestion is not being encountered in the current zone")
    but that led to undesirable stalls and sleeping for the full timeout
    even when the BDI wasn't congested.  Hence wait_iff_congested was used
    instead.
    
    But it seems that even wait_iff_congested doesn't work as expected.  We
    might have a small file LRU list with all pages dirty/writeback, and yet
    the bdi is not congested, so this is just a cond_resched in the end and
    can end up triggering a premature OOM.
    
    This patch replaces the unconditional wait_iff_congested with
    congestion_wait, which is executed only if we _know_ that the last round
    of direct reclaim didn't make any progress and dirty+writeback pages
    amount to more than half of the reclaimable pages in the zone which
    might be usable for our target allocation.  This shouldn't reintroduce
    the stalls fixed by 0e093d99763e, because congestion_wait is called only
    when the situation is getting hopeless and sleeping is a better choice
    than OOM with many pages under IO.
    
    We have to preserve the logic introduced by commit 373ccbe59270 ("mm,
    vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
    progress") in __alloc_pages_slowpath now that wait_iff_congested is not
    used anymore.  As the only remaining user of wait_iff_congested is
    shrink_inactive_list, we can remove the WQ-specific short sleep from
    wait_iff_congested, because that sleep needs to be done only once per
    allocation retry cycle.
    
    [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
     Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Joonsoo Kim <js1304@gmail.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    
    https://jira.sw.ru/browse/PSBM-61409
    (cherry-picked from ede37713737834d98ec72ed299a305d53e909f73)
    Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 mm/internal.h   |  2 ++
 mm/page_alloc.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 mm/vmscan.c     |  2 +-
 3 files changed, 45 insertions(+), 4 deletions(-)


diff --git a/mm/internal.h b/mm/internal.h
index 5c15f27c6823..2072b9b04b6b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -14,6 +14,8 @@ 
 #include <linux/mm.h>
 #include <linux/migrate_mode.h>
 
+unsigned long zone_reclaimable_pages(struct zone *zone);
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4e443d34b18..cd8ed1f5543e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3069,9 +3069,48 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, did_some_progress,
 						pages_reclaimed)) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
-		goto rebalance;
+		struct zone *zone;
+		struct zoneref *z;
+
+		/*
+		 * Keep reclaiming pages while there is a chance this will lead somewhere.
+		 * If none of the target zones can satisfy our allocation request even
+		 * if all reclaimable pages are considered then we are screwed and have
+		 * to go OOM.
+		 */
+		for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+						nodemask) {
+			unsigned long writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
+			unsigned long dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+			unsigned long reclaimable = zone_reclaimable_pages(zone);
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent from pre mature OOM
+			 */
+			if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				goto rebalance;
+			}
+
+			/*
+			 * Memory allocation/reclaim might be called from a WQ
+			 * context and the current implementation of the WQ
+			 * concurrency control doesn't recognize that
+			 * a particular WQ is congested if the worker thread is
+			 * looping without ever sleeping. Therefore we have to
+			 * do a short sleep here rather than calling
+			 * cond_resched().
+			 */
+			if (current->flags & PF_WQ_WORKER)
+				schedule_timeout(1);
+			else
+				cond_resched();
+
+			goto rebalance;
+		}
 	} else {
 		/*
 		 * High-order allocations do not necessarily loop after
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcd450c1064a..f974f57dd546 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -186,7 +186,7 @@  static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	int nr;