[RHEL7,COMMIT] ms/mm/page_alloc: fix memmap_init_zone pageblock alignment

Submitted by Konstantin Khorenko on April 27, 2018, 10 a.m.

Details

Message ID 201804271000.w3RA0srQ032091@finist_ce7.work
State New
Series "mm/page_alloc: fix false positive BUG_ON on uninitialized page structure due to pageblock alignment"
Headers show

Commit Message

Konstantin Khorenko April 27, 2018, 10 a.m.
The commit is pushed to "branch-rh7-3.10.0-693.21.1.vz7.47.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-693.21.1.vz7.47.1
------>
commit 8786a9d026a3b8b3f2e23814bab66b0a02569f69
Author: Daniel Vacek <neelx@redhat.com>
Date:   Fri Apr 27 13:00:54 2018 +0300

    ms/mm/page_alloc: fix memmap_init_zone pageblock alignment
    
    Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
    where possible") introduced a bug where move_freepages() triggers a
    VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
    To fix this, simply align the skipped pfns in memmap_init_zone() the
    same way as in move_freepages_block().
    
    Seen in one of the RHEL reports:
    
      crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
      kernel BUG at mm/page_alloc.c:1389!
      invalid opcode: 0000 [#1] SMP
      --
      RIP: 0010:[<ffffffff8118833e>]  [<ffffffff8118833e>] move_freepages+0x15e/0x160
      RSP: 0018:ffff88054d727688  EFLAGS: 00010087
      --
      Call Trace:
       [<ffffffff811883b3>] move_freepages_block+0x73/0x80
       [<ffffffff81189e63>] __rmqueue+0x263/0x460
       [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
       [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
      --
      RIP  [<ffffffff8118833e>] move_freepages+0x15e/0x160
       RSP <ffff88054d727688>
    
      crash> page_init_bug -v | grep RAM
      <struct resource 0xffff88067fffd2f8>          1000 -        9bfff     System RAM (620.00 KiB)
      <struct resource 0xffff88067fffd3a0>        100000 -     430bffff     System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
      <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff     System RAM ( 14.83 MiB = 15188.00 KiB)
      <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff     System RAM (391.02 MiB = 400408.00 KiB)
      <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff     System RAM (480.00 KiB)
      <struct resource 0xffff88067fffd640>     100000000 -    67fffffff     System RAM ( 22.00 GiB)
    
      crash> page_init_bug | head -6
      <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff     System RAM (480.00 KiB)
      <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
      <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
      <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
      <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
      BUG, zones differ!
    
    Note that this range follows two not populated sections
    68000000-77ffffff in this zone.  7b788000-7b7fffff is the first one
    after a gap.  This makes memmap_init_zone() skip all the pfns up to the
    beginning of this range.  But this range is not pageblock (2M) aligned.
    In fact no range has to be.
    
      crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
            PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
      ffffea0001e00000  78000000                0        0  0 0
      ffffea0001ed7fc0  7b5ff000                0        0  0 0
      ffffea0001ed8000  7b600000                0        0  0 0     <<<<
      ffffea0001ede1c0  7b787000                0        0  0 0
      ffffea0001ede200  7b788000                0        0  1 1fffff00000000
    
    Top part of page flags should contain nodeid and zonenr, which is not
    the case for page ffffea0001ed8000 here (<<<<).
    
      crash> log | grep -o fffea0001ed[^\ ]* | sort -u
      fffea0001ed8000
      fffea0001eded20
      fffea0001edffc0
    
      crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
      fffea0001ed8000
      fffea0001eded00
      fffea0001eded20
      fffea0001edffc0
    
    Initialization of the whole beginning of the section is skipped up to
    the start of the range due to the commit b92df1de5d28.  Now any code
    calling move_freepages_block() (like reusing the page from a freelist as
    in this example) with a page from the beginning of the range will get
    the page rounded down to start_page ffffea0001ed8000 and passed to
    move_freepages() which crashes on assertion getting wrong zonenr.
    
      >         VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
    
    Note, page_zone() derives the zone from page flags here.
    
    >From similar machine before commit b92df1de5d28:
    
      crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
            PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
      fffff73941e00000  78000000                0        0  1 1fffff00000000
      fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
      fffff73941ed8000  7b600000                0        0  1 1fffff00000000
      fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
      fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk
    
    All the pages since the beginning of the section are initialized.
    move_freepages()' not gonna blow up.
    
    The same machine with this fix applied:
    
      crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
            PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
      ffffea0001e00000  78000000                0        0  0 0
      ffffea0001e00000  7b5ff000                0        0  0 0
      ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
      ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
      ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk
    
    At least the bare minimum of pages is initialized preventing the crash
    as well.
    
    Customers started to report this as soon as 7.4 (where b92df1de5d28 was
    merged in RHEL) was released.  I remember reports from
    September/October-ish times.  It's not easily reproduced and happens on
    a handful of machines only.  I guess that's why.  But that does not make
    it less serious, I think.
    
    Though there actually is a report here:
      https://bugzilla.kernel.org/show_bug.cgi?id=196443
    
    And there are reports for Fedora from July:
      https://bugzilla.redhat.com/show_bug.cgi?id=1473242
    and CentOS:
      https://bugs.centos.org/view.php?id=13964
    and we internally track several dozens reports for RHEL bug
      https://bugzilla.redhat.com/show_bug.cgi?id=1525121
    
    Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek <neelx@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Paul Burton <paul.burton@imgtec.com>
    Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    
    https://jira.sw.ru/browse/PSBM-83746
    
    (cherry picked from commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae)
    Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
---
 mm/page_alloc.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15575b24a97c..2647fe4d484b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4720,9 +4720,14 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			/*
 			 * Skip to the pfn preceding the next valid one (or
 			 * end_pfn), such that we hit a valid pfn (or end_pfn)
-			 * on our next iteration of the loop.
+			 * on our next iteration of the loop. Note that it needs
+			 * to be pageblock aligned even when the region itself
+			 * is not. move_freepages_block() can shift ahead of
+			 * the valid region but still depends on correct page
+			 * metadata.
 			 */
-			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
+			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
+					~(pageblock_nr_pages-1)) - 1;
 #endif
 			continue;
 		}