[RHEL7,COMMIT] Revert "mm/memcg: use seqlock to protect reclaim_iter updates"

Submitted by Vasily Averin on March 3, 2021, 6:25 a.m.

Details

Message ID 202103030625.1236PoUC008983@vz7build.vvs.sw.ru
State New
Series "mm/mem_cgroup_iter: Reduce the number of iterator restarts upon cgroup removals"
Headers show

Commit Message

Vasily Averin March 3, 2021, 6:25 a.m.
The commit is pushed to "branch-rh7-3.10.0-1160.15.2.vz7.173.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1160.15.2.vz7.173.1
------>
commit e5d473d881f3c6f62e6349d9b326514b7904f61e
Author: Konstantin Khorenko <khorenko@virtuozzo.com>
Date:   Wed Mar 3 09:25:50 2021 +0300

    Revert "mm/memcg: use seqlock to protect reclaim_iter updates"
    
    Patch-set description:
    May thanks to Kirill Tkhai for his bright ideas and review!
    
    Problem description from the user point of view:
      * the Node is slow
      * the Node has a lot of free RAM
      * the Node has a lot of swapin/swapout
      * kswapd is always running
    
    Problem in a nutshell from technical point of view:
      * kswapd is looping in shrink_zone() inside the loop
          do {} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
        (and never goes trough the outer loop)
      * there are a quite a number of memory cgroups of the Node (~1000)
      * some cgroups are hard to reclaim (reclaim may take ~3 seconds),
        this is because of very busy disk due to permanent swapin/swapout
      * mem_cgroup_iter() does not have success scanning all cgroups
        in a row, it restarts from the root cgroup one time after
        another (after different number of cgroups scanned)
    
    Q: Why does mem_cgroup_iter() restart from the root memcg?
    A: Because it is invalidated once some memory cgroup is
       destroyed on the Node.
       Note: ANY memory cgroup destroy on the Node leads to iter
       restart.
    
    The following patchset solves this problem in the following way:
    there is no need to restart the iter until we see the iter has
    the position which is exactly the memory cgroup being destroyed.
    
    The patchset ensures the iter->last_visited is NULL-ified on
    invalidation and thus restarts only in the unlikely case when
    the iter points to the memcg being destroyed.
    
    Testing: i've tested this patchset using modified kernel which breaks
    the memcg iterator in case of global reclaim with probability of 2%.
    
    3 kernels have been tested: "release", KASAN-only, "debug" kernels.
    Each worked for 12 hours, no issues, from 12000 to 26000 races were
    caught during this period (i.e. dying memcg was found in some iterator
    and wiped).
    
    The testing scenario is documented in the jira issue.
    
    https://jira.sw.ru/browse/PSBM-123655
    +++ Current patch description:
    This reverts commit 5a2d13cf16faedb8a2c318d50cca71d74d2be264.
    
    We are going to make 'iter->last_visited' always valid to skip
    verification 'iter->last_dead_count' vs 'root->dead_count',
    thus there will be no need to update 'last_visited' and
    'last_dead_count' consistently (we'll remove 'iter->last_dead_count'
    field at all), thus dropping the seqlock.
    
    https://jira.sw.ru/browse/PSBM-123655
    Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
    
    Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 mm/memcontrol.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

Patch hide | download patch | download mbox

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 57181e8..68dfb8f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -197,7 +197,6 @@  struct mem_cgroup_reclaim_iter {
 	 */
 	struct mem_cgroup *last_visited;
 	unsigned long last_dead_count;
-	seqlock_t last_visited_lock;
 
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
@@ -1584,8 +1583,6 @@  mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
 		     int *sequence)
 {
 	struct mem_cgroup *position = NULL;
-	unsigned seq;
-
 	/*
 	 * A cgroup destruction happens in two stages: offlining and
 	 * release.  They are separated by a RCU grace period.
@@ -1595,13 +1592,9 @@  mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter,
 	 * released, tryget will fail if we lost the race.
 	 */
 	*sequence = atomic_read(&root->dead_count);
-retry:
-	seq = read_seqbegin(&iter->last_visited_lock);
 	if (iter->last_dead_count == *sequence) {
-		position = READ_ONCE(iter->last_visited);
-
-		if (read_seqretry(&iter->last_visited_lock, seq))
-			goto retry;
+		smp_rmb();
+		position = iter->last_visited;
 
 		/*
 		 * We cannot take a reference to root because we might race
@@ -1632,10 +1625,9 @@  static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter,
 	 * don't lose destruction events in between.  We could have
 	 * raced with the destruction of @new_position after all.
 	 */
-	write_seqlock(&iter->last_visited_lock);
 	iter->last_visited = new_position;
+	smp_wmb();
 	iter->last_dead_count = sequence;
-	write_sequnlock(&iter->last_visited_lock);
 }
 
 /**
@@ -6595,15 +6587,11 @@  static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		return 1;
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-		int i;
-
 		mz = &pn->zoneinfo[zone];
 		lruvec_init(&mz->lruvec);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->memcg = memcg;
-		for (i = 0; i < ARRAY_SIZE(mz->reclaim_iter); i++)
-			seqlock_init(&mz->reclaim_iter[i].last_visited_lock);
 	}
 	memcg->info.nodeinfo[node] = pn;
 	return 0;