Looking at the aarch64 failure (issue #415)

Submitted by Adrian Reber on Feb. 20, 2018, 10:19 a.m.

Details

Message ID 20180220101934.GB8899@lisas.de
State New
Series "Looking at the aarch64 failure (issue #415)"

Commit Message

Adrian Reber Feb. 20, 2018, 10:19 a.m.
I had one more look at:

 Complete aarch64 failure 
 https://github.com/checkpoint-restore/criu/issues/415

I now get the following error:

(00.027519) pagemap-cache: created for pid 21225 (takes 4096 bytes)
(00.027532) page-pipe: Create page pipe for 624 segs
(00.027546) page-pipe: Will grow page pipe (iov off is 0)
(00.027756) pagemap-cache: filling VMA 400000-410000 (64K) [l:400000 h:600000]
(00.027773) pagemap-cache: 	          400000-410000           nr:1     cov:65536
(00.027784) pagemap-cache: 	          410000-420000           nr:2     cov:131072
(00.027794) pagemap-cache: 	          420000-430000           nr:3     cov:196608
(00.027806) pagemap-cache: 	cache  mode [l:400000 h:600000]
(00.027857) Pagemap generated: 0 pages (0 lazy) 0 holes
(00.027867) Pagemap generated: 0 pages (0 lazy) 0 holes
(00.027881) Pagemap generated: 0 pages (0 lazy) 0 holes
(00.027889) pagemap-cache: filling VMA ffffaa5d0000-ffffaa730000 (1408K) [l:ffffaa400000 h:ffffaa600000]
(00.027912) Error (criu/pagemap-cache.c:159): pagemap-cache: Can't read 21225's pagemap file: No such file or directory
(00.027924) Error (criu/pagemap-cache.c:175): pagemap-cache: Failed to fill cache for 21225 (ffffaa5d0000-ffffaa730000)
(00.027990) page-pipe: Killing page pipe

The dump itself 'finished successfully' (returning 0), but the restore failed.

A closer look with strace showed:

4843  write(1023, "(00.091065) Pagemap generated: 0"..., 56) = 56
4843  write(1023, "(00.091148) pagemap-cache: filli"..., 107) = 107
4843  pread64(8, "", 204928, 549751300864) = 0
                             ^^^^^^^^^^^^ this looks wrong
4843  write(1023, "(00.091309) Error (criu/pagemap-"..., 119) = 119
4843  write(1023, "(00.091395) Error (criu/pagemap-"..., 119) = 119
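
Note that the pread64() call above returned 0 (end of file), not -1. A zero
return does not set errno, so the "No such file or directory" text in the log
is likely a stale errno value from an earlier call. The following is a minimal
sketch of how the two cases could be told apart; pread_exact() and
enum read_status are hypothetical names used only for illustration and are not
taken from criu/pagemap-cache.c.

```c
#define _POSIX_C_SOURCE 200809L /* for pread() under strict -std= modes */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes, for illustration only. */
enum read_status { READ_OK, READ_EOF, READ_ERR };

/*
 * Read exactly len bytes at offset off.
 * READ_EOF means the file ended before len bytes were available
 * (errno is not meaningful in that case); READ_ERR means pread()
 * failed and errno describes why.
 */
static enum read_status pread_exact(int fd, void *buf, size_t len, off_t off)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = pread(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted, retry */
			return READ_ERR;
		}
		if (n == 0)
			return READ_EOF; /* offset is at or past the end */

		p += n;
		len -= (size_t)n;
		off += n;
	}
	return READ_OK;
}
```

With such a helper the caller could report "offset past the end of the file"
for READ_EOF instead of printing strerror(errno).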

pagemap-cache.c: 157:

 if (pread(pmc->fd, pmc->map, size_map, PAGEMAP_PFN_OFF(pmc->start)) != size_map) {

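So the offset computed by PAGEMAP_PFN_OFF was wrong, which is caused by the
wrong PAGE_SHIFT. CRIU assumes 12 (4K pages) on aarch64, but this kernel is
configured with 16 (64K pages):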

# grep SHIFT config*
CONFIG_ARM64_PAGE_SHIFT=16
CONFIG_ARM64_CONT_SHIFT=5
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_NODES_SHIFT=2
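
To make the effect of the shift concrete, here is a small sketch of the offset
arithmetic. It assumes that PAGEMAP_PFN_OFF boils down to
(addr >> PAGE_SHIFT) * 8, which matches the layout of /proc/<pid>/pagemap (one
64-bit entry per virtual page); pagemap_off() and show_offsets() are
hypothetical helpers, not CRIU functions. The address is the "l:" bound of the
failing cache window in the log above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* /proc/<pid>/pagemap holds one 64-bit entry per virtual page. */
#define PAGEMAP_ENTRY_SIZE sizeof(uint64_t)

/* Hypothetical helper: byte offset of the pagemap entry for vaddr. */
static uint64_t pagemap_off(uint64_t vaddr, unsigned int page_shift)
{
	return (vaddr >> page_shift) * PAGEMAP_ENTRY_SIZE;
}

/* Print the offset for both page sizes. */
static void show_offsets(void)
{
	uint64_t start = 0xffffaa400000ULL; /* "l:" value from the log */

	/* 4K pages (shift 12):  0x7fffd52000 */
	printf("shift 12: 0x%" PRIx64 "\n", pagemap_off(start, 12));
	/* 64K pages (shift 16): 0x7fffd5200, 16 times smaller */
	printf("shift 16: 0x%" PRIx64 "\n", pagemap_off(start, 16));
}
```

With the 4K assumption the offset is 16 times too large. Assuming a 48-bit
user address space with 64K pages, the pagemap file covers at most 2^32 pages,
i.e. 2^35 bytes of entries, so an offset close to 2^39 (like the 549751300864
in the strace) lies far past the end, which would explain why pread64()
returned 0.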

The hack in the patch at the end of this message fixes the dump failure.

But the restore still segfaults:

[ 5281.926998] busyloop00[2000]: unhandled level 3 translation fault (11) at 0xffffa95a06c0, esr 0x82000007
[ 5281.936485] pgd = ffff800f60a0e000
[ 5281.939884] [ffffa95a06c0] *pgd=0000000f62e10003, *pud=0000000f62e10003, *pmd=0000000f65da0003, *pte=0000000000000000

[ 5281.951989] CPU: 1 PID: 2000 Comm: busyloop00 Not tainted 4.11.0-44.4.1.el7a.aarch64 #1
[ 5281.967370] task: ffff800fd49e3000 task.stack: ffff800fd4a04000
[ 5281.973285] PC is at 0xffffa95a06c0
[ 5281.976761] LR is at 0xffffa95a06c0
[ 5281.980244] pc : [<0000ffffa95a06c0>] lr : [<0000ffffa95a06c0>] pstate: 00000000
[ 5281.987629] sp : 0000ffffc623ebe0
[ 5281.990948] x29: 0000ffffc623fe30 x28: 0000000000000000 
[ 5281.996251] x27: 0000000000000000 x26: 0000000000000000 
[ 5282.001559] x25: 0000000000000000 x24: 0000000000000000 
[ 5282.006862] x23: 0000000000000000 x22: 0000000000000000 
[ 5282.012170] x21: 0000000000401c30 x20: 0000000000000000 
[ 5282.017472] x19: 0000000000000001 x18: 0000ffffc61ffb20 
[ 5282.022779] x17: 0000ffffa36067f0 x16: 0000000000420228 
[ 5282.028081] x15: 00007b2545efc518 x14: 002ba74863b89f91 
[ 5282.033389] x13: 00000003e8000000 x12: 0000000000000018 
[ 5282.038710] x11: 00000000000b4c65 x10: 00000000ffffffff 
[ 5282.044014] x9 : 003b9aca00000000 x8 : 0000000000000062 
[ 5282.049322] x7 : 0000000000420000 x6 : 0000000000420000 
[ 5282.054624] x5 : 0000000000000000 x4 : 0000000000000000 
[ 5282.059932] x3 : 0000000000000000 x2 : 000000007fffffff 
[ 5282.065234] x1 : 0000000000000001 x0 : 0000000000000000 

From zdtm:

# ./zdtm.py run  -f h -t zdtm/static/busyloop00 
=== Run 1/1 ================ zdtm/static/busyloop00

======================= Run zdtm/static/busyloop00 in h ========================
Start test
./busyloop00 --pidfile=busyloop00.pid --outfile=busyloop00.out
Run criu dump
Run criu restore
Send the 15 signal to  24
Wait for zdtm/static/busyloop00(24) to die for 0.100000
Wait for zdtm/static/busyloop00(24) to die for 0.200000
tail: cannot open ‘zdtm/static/busyloop00.out’ for reading: No such file or directory
==================== zdtm/static/busyloop00.out.inprogress =====================

==================== zdtm/static/busyloop00.out.inprogress =====================
############### Test zdtm/static/busyloop00 FAIL at result check ###############
##################################### FAIL #####################################

There are no obvious errors in restore.log.

Is there another place where the page size needs to be adapted so that the
restore does not segfault?
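
For comparison, hard-coding 16 only moves the problem to kernels built with 4K
pages. One possible direction, sketched here under the assumption that a
runtime lookup is acceptable in the affected code paths, is to ask the running
kernel via sysconf(_SC_PAGESIZE) (the aarch64 page.h already includes
<unistd.h>). runtime_page_shift() is a hypothetical helper, not an existing
CRIU function.

```c
#include <unistd.h>

/*
 * Hypothetical helper: derive the page shift from the running kernel
 * instead of a compile-time constant.
 * Returns 0 if the page size cannot be determined.
 */
static unsigned int runtime_page_shift(void)
{
	long size = sysconf(_SC_PAGESIZE);
	unsigned int shift = 0;

	/* sysconf() failed, or the size is not a power of two */
	if (size <= 0 || (size & (size - 1)) != 0)
		return 0;

	while ((1L << shift) < size)
		shift++;

	return shift;
}
```

On the machine above this would be expected to yield 16, and 12 on a kernel
with 4K pages. Code that runs where libc is not available would need to get
the value some other way.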

		Adrian

Patch

diff --git a/criu/include/image.h b/criu/include/image.h
index d9c4bdb..9ce9565 100644
--- a/criu/include/image.h
+++ b/criu/include/image.h
@@ -15,7 +15,7 @@ 
 #ifdef _ARCH_PPC64
 #define PAGE_IMAGE_SIZE        65536
 #else
-#define PAGE_IMAGE_SIZE        4096
+#define PAGE_IMAGE_SIZE        65536
 #endif /* _ARCH_PPC64 */
 #define PAGE_RSS       1
 #define PAGE_ANON      2
diff --git a/include/common/arch/aarch64/asm/page.h b/include/common/arch/aarch64/asm/page.h
index de1fe54..4309018 100644
--- a/include/common/arch/aarch64/asm/page.h
+++ b/include/common/arch/aarch64/asm/page.h
@@ -4,7 +4,7 @@ 
 #include <unistd.h>
 
 #ifndef PAGE_SHIFT
-# define PAGE_SHIFT    12
+# define PAGE_SHIFT    16
 #endif
 
 #ifndef PAGE_SIZE