[v2,12/14] files: Make tasks set their own service_fd_base

Submitted by Kirill Tkhai on Dec. 29, 2017, 8:58 a.m.

Details

Message ID f877cfe3-c2e3-3ab7-ff26-5dc5f848cae1@virtuozzo.com
State New
Series "Introduce custom per-task service fds placement"

Commit Message

Kirill Tkhai Dec. 29, 2017, 8:58 a.m.
On 28.12.2017 22:00, Andrei Vagin wrote:
> On Thu, Dec 28, 2017 at 04:10:29PM +0300, Kirill Tkhai wrote:
>> Currently, we set rlim(RLIMIT_NOFILE) unlimited
>> and use service_fd_rlim_cur to place service fds.
>> This leads to a significant problem: every task uses
>> the biggest possible files_struct in the kernel, and
>> it consumes excess memory after restore
>> in comparison to dump. In some situations this
>> may even make restore fail, as there is not enough
>> memory in the memory cgroup or on the node.
>>
>> The patch fixes the problem by introducing
>> a per-task service_fd_base. It is calculated
>> from the maximum used file fd and is placed
>> near the right border of the kernel-allocated memory
>> chunk for the task's fds (see alloc_fdtable() for
>> details). This reduces the kernel-allocated files_struct
>> to 512 fds for most processes on a standard Linux
>> system (I've analysed the processes on my work system).
>>
>> Also, since such "standard processes" will have the same
>> service_fd_base, clone_service_fd() won't have to
>> actually dup() their service fds like it does
>> at the moment. This is one of the reasons why
>> we still keep service fds as a range of fds,
>> and do not try to use unused holes in the task's fds.
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>
>> v2: Add handling of very big fd numbers near service_fd_rlim_cur.
>> ---
>>  criu/util.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/criu/util.c b/criu/util.c
>> index d45bb7d61..8a25ab2cf 100644
>> --- a/criu/util.c
>> +++ b/criu/util.c
>> @@ -527,7 +527,7 @@ int close_service_fd(enum sfd_type type)
>>  static void move_service_fd(struct pstree_item *me, int type, int new_id, int new_base)
>>  {
>>  	int old = get_service_fd(type);
>> -	int new = __get_service_fd(type, new_id);
>> +	int new = new_base - type - SERVICE_FD_MAX * new_id;
>>  	int ret;
>>  
>>  	if (old < 0)
>> @@ -540,24 +540,73 @@ static void move_service_fd(struct pstree_item *me, int type, int new_id, int ne
>>  		close(old);
>>  }
>>  
>> +static int choose_service_fd_base(struct pstree_item *me)
>> +{
>> +	int nr, real_nr, fdt_nr = 1, id = rsti(me)->service_fd_id;
>> +
>> +	if (rsti(me)->fdt) {
>> +		/* The base is set by owner of fdt (id 0) */
>> +		if (id != 0)
>> +			return service_fd_base;
>> +		fdt_nr = rsti(me)->fdt->nr;
>> +	}
>> +	/* Now find process's max used fd number */
>> +	if (!list_empty(&rsti(me)->fds))
>> +		nr = list_entry(rsti(me)->fds.prev,
>> +				struct fdinfo_list_entry, ps_list)->fe->fd;
>> +	else
>> +		nr = -1;
>> +
>> +	nr = max(nr, inh_fd_max);
>> +	/*
>> +	 * Service fds go after max fd near right border of alignment:
>> +	 *
>> +	 * ...|max_fd|max_fd+1|...|sfd first|...|sfd last (aligned)|
>> +	 *
>> +	 * So, they take maximum numbers of area allocated by kernel.
>> +	 * See linux alloc_fdtable() for details.
>> +	 */
>> +	nr += (SERVICE_FD_MAX - SERVICE_FD_MIN) * fdt_nr;
>> +	nr += 16; /* Safety pad */
>> +	real_nr = nr;
>> +
>> +	nr /= (1024 / sizeof(void *));
>> +	nr = 1 << (32 - __builtin_clz(nr + 1));
>> +	nr *= (1024 / sizeof(void *));
>> +
>> +	if (nr > service_fd_rlim_cur) {
>> +		/* Right border is bigger, than rlim. OK, then just aligned value is enough */
>> +		nr = round_down(service_fd_rlim_cur, (1024 / sizeof(void *)));
>> +		if (nr < real_nr) {
>> +			pr_err("Can't chose service_fd_base: %d %d\n", nr, real_nr);
>> +			return -1;
>> +		}
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>>  int clone_service_fd(struct pstree_item *me)
>>  {
>>  	int id, new_base, i, ret = -1;
>>  
>> -	new_base = service_fd_base;
>> +	new_base = choose_service_fd_base(me);
>>  	id = rsti(me)->service_fd_id;
>>  
>> -	if (service_fd_id == id)
>> +	if (new_base == -1)
>> +		return -1;
>> +	if (service_fd_base == new_base && service_fd_id == id)
>>  		return 0;
> 
> For the root process, service_fd_base is changed 1048576 -> 256 and
> there is one service fd, so for the root process fdtable will be
> bigger than it has to be. Actually we have a problem each time when
> a child process has a smaller fdtable than a parent process.

That's because the process's fds are read in root_prepare_shared()->prepare_fd_pid().
This is the reason why I moved setup_newborn_fds() in "files: Do setup_newborn_fds() later".
It seems this was done to finish the CR_STATE_PREPARE_NAMESPACES stage faster (it is finished
before root_prepare_shared()), because it's important
 
> And here is a test which shows this problem with more collors
> t a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> index 1feabfa9f..2505cb54c 100644
> --- a/test/zdtm/static/env00.c
> +++ b/test/zdtm/static/env00.c
> @@ -1,6 +1,7 @@
>  #include <errno.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <sched.h>
>  
>  #include "zdtmtst.h"
>  
> @@ -13,9 +14,22 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>  int main(int argc, char **argv)
>  {
>         char *env;
> +       int i;
> +       pid_t pid;
>  
>         test_init(argc, argv);
>  
> +       for (i = 0; i < 10; i++) {
> +               pid = fork();
> +               if (pid == 0) {
> +                       while (1)
> +                               sleep(1);
> +                       return 0;
> +               }
> +       }
> +
> +       dup2(1, 1000000);
> +
>         if (setenv(envname, test_author, 1)) {
>                 pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>                 exit(1);
> 
> before dump:
> [root@fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
> 12276
> 
> after restore:
> [root@fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
> 108432

It's not necessarily connected with sfds. It's very difficult to measure the memory
connected with the file table only. See below:


before dump:
root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
11849728

after restore:
root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
20877312

CRIU restore changes many in-kernel structures, and this is a global problem, not
specific to sfds.
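
The file-table share of that number can at least be estimated, though. The kernel grows an
fdtable in steps (see alloc_fdtable()), and choose_service_fd_base() above mirrors that
rounding; the dominant cost is one pointer per slot plus the fd bitmaps. A rough stand-alone
sketch of such an estimate (illustration only: the rounding is copied from the patch, may
overshoot the kernel's exact step by one power of two, and ignores slab/kvmalloc overhead):

#include <stdio.h>

/*
 * Mirror the rounding used in choose_service_fd_base() above: the table
 * grows in power-of-two multiples of 1024-byte blocks of file pointers.
 */
static unsigned int fdtable_slots(unsigned int nr)
{
	nr /= (1024 / sizeof(void *));
	nr = 1 << (32 - __builtin_clz(nr + 1));
	nr *= (1024 / sizeof(void *));
	return nr;
}

int main(void)
{
	unsigned int fds[] = { 200, 1000, 10000, 1000000 };
	int i;

	for (i = 0; i < 4; i++) {
		unsigned int nr = fdtable_slots(fds[i]);
		/* fd array plus roughly nr/4 bytes of open_fds/close_on_exec bitmaps */
		unsigned long bytes = (unsigned long)nr * sizeof(void *) + nr / 4;

		printf("top fd %7u -> ~%u slots, ~%lu KiB\n", fds[i], nr, bytes >> 10);
	}
	return 0;
}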

>>  
>>  	/* Dup sfds in memmove() style: they may overlap */
>> -	if (get_service_fd(LOG_FD_OFF) > __get_service_fd(LOG_FD_OFF, id))
>> +	if (get_service_fd(LOG_FD_OFF) > new_base - LOG_FD_OFF - SERVICE_FD_MAX * id)
>>  		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>  			move_service_fd(me, i, id, new_base);
>>  	else
>>  		for (i = SERVICE_FD_MAX - 1; i > SERVICE_FD_MIN; i--)
>>  			move_service_fd(me, i, id, new_base);
>>  
>> +	service_fd_base = new_base;
>>  	service_fd_id = id;
>>  	ret = 0;
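
For clarity, the "memmove() style" comment above just means choosing the dup2() direction so
that an overlapping source and destination range of fds don't clobber each other, the same way
memmove() picks its copy direction for overlapping memory. A stand-alone illustration of the
idea with plain dup2() (not CRIU's helper, just the technique):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/*
 * Move n descriptors starting at 'from' so that they start at 'to',
 * picking the copy direction so overlapping ranges stay intact.
 */
static void move_fd_range(int from, int to, int n)
{
	int i;

	if (to > from)
		for (i = n - 1; i >= 0; i--) {	/* copy from the top down */
			dup2(from + i, to + i);
			close(from + i);
		}
	else
		for (i = 0; i < n; i++) {	/* copy from the bottom up */
			dup2(from + i, to + i);
			close(from + i);
		}
}

int main(void)
{
	int i;

	for (i = 0; i < 5; i++)		/* occupy fds 10..14 */
		dup2(0, 10 + i);

	move_fd_range(10, 13, 5);	/* overlapping move: 10..14 -> 13..17 */

	for (i = 13; i < 18; i++)
		printf("fd %d valid: %s\n", i, fcntl(i, F_GETFD) >= 0 ? "yes" : "no");
	return 0;
}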

Kirill

Patch

diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
index 1feabfa9f..d2517b835 100644
--- a/test/zdtm/static/env00.c
+++ b/test/zdtm/static/env00.c
@@ -12,10 +12,23 @@  TEST_OPTION(envname, string, "environment variable name", 1);
 
 int main(int argc, char **argv)
 {
+	int i;
+	pid_t pid;
 	char *env;
 
 	test_init(argc, argv);
 
+	for (i = 0; i < 10; i++) {
+		pid = fork();
+		if (pid == 0) {
+			while (1)
+				sleep(1);
+			return 0;
+		}
+	}
+
+//	dup2(1, 1000000);
+
 	if (setenv(envname, test_author, 1)) {
 		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
 		exit(1);

Comments

Andrei Vagin Dec. 29, 2017, 6:18 p.m.
On Fri, Dec 29, 2017 at 11:58:25AM +0300, Kirill Tkhai wrote:
> On 28.12.2017 22:00, Andrei Vagin wrote:
> > On Thu, Dec 28, 2017 at 04:10:29PM +0300, Kirill Tkhai wrote:
> >> Currently, we set rlim(RLIMIT_NOFILE) unlimited
> >> and service_fd_rlim_cur to place service fds.
> >> This leads to a signify problem: every task uses
> >> the biggest possible files_struct in kernel, and
> >> it consumes excess memory after restore
> >> in comparation to dump. In some situations this
> >> may end in restore fail as there is no enough
> >> memory in memory cgroup of on node.
> >>
> >> The patch fixes the problem by introducing
> >> task-measured service_fd_base. It's calculated
> >> in dependence of max used file fd and is placed
> >> near the right border of kernel-allocated memory
> >> hunk for task's fds (see alloc_fdtable() for
> >> details). This reduces kernel-allocated files_struct
> >> to 512 fds for the most process in standard linux
> >> system (I've analysed the processes in my work system).
> >>
> >> Also, since the "standard processes" will have the same
> >> service_fd_base, clone_service_fd() won't have to
> >> actualy dup() their service fds for them like we
> >> have at the moment. This is the one of reasons why
> >> we still keep service fds as a range of fds,
> >> and do not try to use unused holes in task fds.
> >>
> >> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> >>
> >> v2: Add a handle for very big fd numbers near service_fd_rlim_cur.
> >> ---
> >>  criu/util.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >>  1 file changed, 53 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/criu/util.c b/criu/util.c
> >> index d45bb7d61..8a25ab2cf 100644
> >> --- a/criu/util.c
> >> +++ b/criu/util.c
> >> @@ -527,7 +527,7 @@ int close_service_fd(enum sfd_type type)
> >>  static void move_service_fd(struct pstree_item *me, int type, int new_id, int new_base)
> >>  {
> >>  	int old = get_service_fd(type);
> >> -	int new = __get_service_fd(type, new_id);
> >> +	int new = new_base - type - SERVICE_FD_MAX * new_id;
> >>  	int ret;
> >>  
> >>  	if (old < 0)
> >> @@ -540,24 +540,73 @@ static void move_service_fd(struct pstree_item *me, int type, int new_id, int ne
> >>  		close(old);
> >>  }
> >>  
> >> +static int choose_service_fd_base(struct pstree_item *me)
> >> +{
> >> +	int nr, real_nr, fdt_nr = 1, id = rsti(me)->service_fd_id;
> >> +
> >> +	if (rsti(me)->fdt) {
> >> +		/* The base is set by owner of fdt (id 0) */
> >> +		if (id != 0)
> >> +			return service_fd_base;
> >> +		fdt_nr = rsti(me)->fdt->nr;
> >> +	}
> >> +	/* Now find process's max used fd number */
> >> +	if (!list_empty(&rsti(me)->fds))
> >> +		nr = list_entry(rsti(me)->fds.prev,
> >> +				struct fdinfo_list_entry, ps_list)->fe->fd;
> >> +	else
> >> +		nr = -1;
> >> +
> >> +	nr = max(nr, inh_fd_max);
> >> +	/*
> >> +	 * Service fds go after max fd near right border of alignment:
> >> +	 *
> >> +	 * ...|max_fd|max_fd+1|...|sfd first|...|sfd last (aligned)|
> >> +	 *
> >> +	 * So, they take maximum numbers of area allocated by kernel.
> >> +	 * See linux alloc_fdtable() for details.
> >> +	 */
> >> +	nr += (SERVICE_FD_MAX - SERVICE_FD_MIN) * fdt_nr;
> >> +	nr += 16; /* Safety pad */
> >> +	real_nr = nr;
> >> +
> >> +	nr /= (1024 / sizeof(void *));
> >> +	nr = 1 << (32 - __builtin_clz(nr + 1));
> >> +	nr *= (1024 / sizeof(void *));
> >> +
> >> +	if (nr > service_fd_rlim_cur) {
> >> +		/* Right border is bigger, than rlim. OK, then just aligned value is enough */
> >> +		nr = round_down(service_fd_rlim_cur, (1024 / sizeof(void *)));
> >> +		if (nr < real_nr) {
> >> +			pr_err("Can't chose service_fd_base: %d %d\n", nr, real_nr);
> >> +			return -1;
> >> +		}
> >> +	}
> >> +
> >> +	return nr;
> >> +}
> >> +
> >>  int clone_service_fd(struct pstree_item *me)
> >>  {
> >>  	int id, new_base, i, ret = -1;
> >>  
> >> -	new_base = service_fd_base;
> >> +	new_base = choose_service_fd_base(me);
> >>  	id = rsti(me)->service_fd_id;
> >>  
> >> -	if (service_fd_id == id)
> >> +	if (new_base == -1)
> >> +		return -1;
> >> +	if (service_fd_base == new_base && service_fd_id == id)
> >>  		return 0;
> > 
> > For the root process, service_fd_base is changed 1048576 -> 256 and
> > there is one service fd, so for the root process fdtable will be
> > bigger than it has to be. Actually we have a problem each time when
> > a child process has a smaller fdtable than a parent process.
> 
> It's because of process's fds are read in root_prepare_shared()->prepare_fd_pid().
> This is the reason, why I moved setup_newborn_fds() in "files: Do setup_newborn_fds() later".
> It seems this was made to finish CR_STATE_PREPARE_NAMESPACES stage faster (it's being finished
> before root_prepare_shared()), because it's important
> 
>  
> > And here is a test which shows this problem with more collors
> > t a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> > index 1feabfa9f..2505cb54c 100644
> > --- a/test/zdtm/static/env00.c
> > +++ b/test/zdtm/static/env00.c
> > @@ -1,6 +1,7 @@
> >  #include <errno.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> > +#include <sched.h>
> >  
> >  #include "zdtmtst.h"
> >  
> > @@ -13,9 +14,22 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >  int main(int argc, char **argv)
> >  {
> >         char *env;
> > +       int i;
> > +       pid_t pid;
> >  
> >         test_init(argc, argv);
> >  
> > +       for (i = 0; i < 10; i++) {
> > +               pid = fork();
> > +               if (pid == 0) {
> > +                       while (1)
> > +                               sleep(1);
> > +                       return 0;
> > +               }
> > +       }
> > +
> > +       dup2(1, 1000000);
> > +
> >         if (setenv(envname, test_author, 1)) {
> >                 pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >                 exit(1);
> > 
> > before dump:
> > [root@fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
> > 12276
> > 
> > after restore:
> > [root@fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
> > 108432
> 
> It's not not obligatory connected with sfds. It's very difficult to measure the memory
> connected with file table only. See below:
> 
> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> index 1feabfa9f..d2517b835 100644
> --- a/test/zdtm/static/env00.c
> +++ b/test/zdtm/static/env00.c
> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>  
>  int main(int argc, char **argv)
>  {
> +	int i;
> +	pid_t pid;
>  	char *env;
>  
>  	test_init(argc, argv);
>  
> +	for (i = 0; i < 10; i++) {
> +		pid = fork();
> +		if (pid == 0) {
> +			while (1)
> +				sleep(1);
> +			return 0;
> +		}
> +	}
> +
> +//	dup2(1, 1000000);
> +
>  	if (setenv(envname, test_author, 1)) {
>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>  		exit(1);
> 
> before dump:
> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> 11849728
> 
> after restore:
> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> 20877312
> 
> Criu restore changes many in-kernel structures, and this is a global problem, not
> about sfds.

In this case, the delta is 10MB. If you uncomment dup2(), the delta will
be 100MB. This is not a global problem; it is the problem of moving
service descriptors in child processes.
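
Rough arithmetic (an illustration only: it counts just the fd arrays at 8 bytes per slot and
assumes each of the 11 tasks ends up with a table big enough to cover fd 1000000 while the
service fds pass through high numbers) gives the same order of magnitude as that delta:

#include <stdio.h>

int main(void)
{
	unsigned long big   = 1UL << 20;	/* slots needed to cover fd 1000000 */
	unsigned long small = 512;		/* typical table when all fds stay small */
	unsigned long tasks = 11;		/* test root + 10 forked children */

	printf("11 big tables:   ~%lu MiB\n", tasks * big * sizeof(void *) >> 20);
	printf("11 small tables: ~%lu KiB\n", tasks * small * sizeof(void *) >> 10);
	return 0;
}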

> 
> >>  
> >>  	/* Dup sfds in memmove() style: they may overlap */
> >> -	if (get_service_fd(LOG_FD_OFF) > __get_service_fd(LOG_FD_OFF, id))
> >> +	if (get_service_fd(LOG_FD_OFF) > new_base - LOG_FD_OFF - SERVICE_FD_MAX * id)
> >>  		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
> >>  			move_service_fd(me, i, id, new_base);
> >>  	else
> >>  		for (i = SERVICE_FD_MAX - 1; i > SERVICE_FD_MIN; i--)
> >>  			move_service_fd(me, i, id, new_base);
> >>  
> >> +	service_fd_base = new_base;
> >>  	service_fd_id = id;
> >>  	ret = 0;
> 
> Kirill
Andrei Vagin Dec. 29, 2017, 7:59 p.m.
> > diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> > index 1feabfa9f..d2517b835 100644
> > --- a/test/zdtm/static/env00.c
> > +++ b/test/zdtm/static/env00.c
> > @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >  
> >  int main(int argc, char **argv)
> >  {
> > +	int i;
> > +	pid_t pid;
> >  	char *env;
> >  
> >  	test_init(argc, argv);
> >  
> > +	for (i = 0; i < 10; i++) {
> > +		pid = fork();
> > +		if (pid == 0) {
> > +			while (1)
> > +				sleep(1);
> > +			return 0;
> > +		}
> > +	}
> > +
> > +//	dup2(1, 1000000);
> > +
> >  	if (setenv(envname, test_author, 1)) {
> >  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >  		exit(1);
> > 
> > before dump:
> > root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> > 11849728
> > 
> > after restore:
> > root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> > 20877312
> > 
> > Criu restore changes many in-kernel structures, and this is a global problem, not
> > about sfds.
> 
> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> be 100MB. This is not a global problem, it is the problem that we move
> service descriptors in child processes.

I created a small patch for zdtm.py to get kmem before and after c/r:
git clone https://github.com/avagin/criu -b 2195

And here is how I do my experiments:

[root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
[root@fc24 criu]# ulimit -n 1024
[root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
=== Run 1/1 ================ zdtm/static/env00

========================== Run zdtm/static/env00 in h ==========================
Start test
./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
memory.kmem.usage_in_bytes = 9920512
Pause at pre-dump. Press any key to continue.
Run criu dump

Unable to kill 36: [Errno 3] No such process
Pause at pre-restore. Press any key to continue.Run criu restore
Pause at post-restore. Press any key to continue.
memory.kmem.usage_in_bytes = 22913024
before: memory.kmem.usage_in_bytes =              9920512
after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
Send the 15 signal to  36
Wait for zdtm/static/env00(36) to die for 0.100000
Unable to kill 36: [Errno 3] No such process
Removing dump/zdtm/static/env00/36
========================= Test zdtm/static/env00 PASS ==========================

[root@fc24 criu]# 
[root@fc24 criu]# ulimit -n $((1 << 20))
[root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
[root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
=== Run 1/1 ================ zdtm/static/env00

========================== Run zdtm/static/env00 in h ==========================
Start test
./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
memory.kmem.usage_in_bytes = 19714048
Pause at pre-dump. Press any key to continue.
Run criu dump
Unable to kill 36: [Errno 3] No such process
Pause at pre-restore. Press any key to continue.
Run criu restore
Pause at post-restore. Press any key to continue.
memory.kmem.usage_in_bytes = 914178048
before: memory.kmem.usage_in_bytes =             19714048
after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
Send the 15 signal to  36
Wait for zdtm/static/env00(36) to die for 0.100000
Unable to kill 36: [Errno 3] No such process
Removing dump/zdtm/static/env00/36
========================= Test zdtm/static/env00 PASS ==========================

When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
the test has only descriptors with small numbers. In this case,
kmem consumption increases by 130%. We need to investigate this fact,
but it isn't so critical if we compare it with the next case.

When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
creates descriptor 1000000, and kmem consumption after c/r increases by
4537.19%. This is a real problem, which we have to solve with high priority.
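
For what it's worth, the second case can also be reproduced outside zdtm; a minimal sketch
(hypothetical test code, raising RLIMIT_NOFILE itself instead of relying on ulimit, so it
needs a high enough hard limit or CAP_SYS_RESOURCE):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit lim = { 1 << 20, 1 << 20 };

	if (setrlimit(RLIMIT_NOFILE, &lim))
		fprintf(stderr, "setrlimit: %s\n", strerror(errno));

	/*
	 * One descriptor with a huge number is enough: the kernel grows the
	 * fdtable to cover it (around 1<<20 slots, roughly 8 MB of kmem) and
	 * never shrinks it afterwards.
	 */
	if (dup2(1, 1000000) < 0)
		fprintf(stderr, "dup2: %s\n", strerror(errno));
	else
		pause();	/* check memory.kmem.usage_in_bytes now */

	return 0;
}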

> 
> > 
> > >>  
> > >>  	/* Dup sfds in memmove() style: they may overlap */
> > >> -	if (get_service_fd(LOG_FD_OFF) > __get_service_fd(LOG_FD_OFF, id))
> > >> +	if (get_service_fd(LOG_FD_OFF) > new_base - LOG_FD_OFF - SERVICE_FD_MAX * id)
> > >>  		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
> > >>  			move_service_fd(me, i, id, new_base);
> > >>  	else
> > >>  		for (i = SERVICE_FD_MAX - 1; i > SERVICE_FD_MIN; i--)
> > >>  			move_service_fd(me, i, id, new_base);
> > >>  
> > >> +	service_fd_base = new_base;
> > >>  	service_fd_id = id;
> > >>  	ret = 0;
> > 
> > Kirill
> _______________________________________________
> CRIU mailing list
> CRIU@openvz.org
> https://lists.openvz.org/mailman/listinfo/criu
Kirill Tkhai Dec. 30, 2017, 8:36 a.m.
On 29.12.2017 21:18, Andrei Vagin wrote:
> On Fri, Dec 29, 2017 at 11:58:25AM +0300, Kirill Tkhai wrote:
>> On 28.12.2017 22:00, Andrei Vagin wrote:
>>> On Thu, Dec 28, 2017 at 04:10:29PM +0300, Kirill Tkhai wrote:
>>>> Currently, we set rlim(RLIMIT_NOFILE) unlimited
>>>> and service_fd_rlim_cur to place service fds.
>>>> This leads to a signify problem: every task uses
>>>> the biggest possible files_struct in kernel, and
>>>> it consumes excess memory after restore
>>>> in comparation to dump. In some situations this
>>>> may end in restore fail as there is no enough
>>>> memory in memory cgroup of on node.
>>>>
>>>> The patch fixes the problem by introducing
>>>> task-measured service_fd_base. It's calculated
>>>> in dependence of max used file fd and is placed
>>>> near the right border of kernel-allocated memory
>>>> hunk for task's fds (see alloc_fdtable() for
>>>> details). This reduces kernel-allocated files_struct
>>>> to 512 fds for the most process in standard linux
>>>> system (I've analysed the processes in my work system).
>>>>
>>>> Also, since the "standard processes" will have the same
>>>> service_fd_base, clone_service_fd() won't have to
>>>> actualy dup() their service fds for them like we
>>>> have at the moment. This is the one of reasons why
>>>> we still keep service fds as a range of fds,
>>>> and do not try to use unused holes in task fds.
>>>>
>>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>>
>>>> v2: Add a handle for very big fd numbers near service_fd_rlim_cur.
>>>> ---
>>>>  criu/util.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>>>  1 file changed, 53 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/criu/util.c b/criu/util.c
>>>> index d45bb7d61..8a25ab2cf 100644
>>>> --- a/criu/util.c
>>>> +++ b/criu/util.c
>>>> @@ -527,7 +527,7 @@ int close_service_fd(enum sfd_type type)
>>>>  static void move_service_fd(struct pstree_item *me, int type, int new_id, int new_base)
>>>>  {
>>>>  	int old = get_service_fd(type);
>>>> -	int new = __get_service_fd(type, new_id);
>>>> +	int new = new_base - type - SERVICE_FD_MAX * new_id;
>>>>  	int ret;
>>>>  
>>>>  	if (old < 0)
>>>> @@ -540,24 +540,73 @@ static void move_service_fd(struct pstree_item *me, int type, int new_id, int ne
>>>>  		close(old);
>>>>  }
>>>>  
>>>> +static int choose_service_fd_base(struct pstree_item *me)
>>>> +{
>>>> +	int nr, real_nr, fdt_nr = 1, id = rsti(me)->service_fd_id;
>>>> +
>>>> +	if (rsti(me)->fdt) {
>>>> +		/* The base is set by owner of fdt (id 0) */
>>>> +		if (id != 0)
>>>> +			return service_fd_base;
>>>> +		fdt_nr = rsti(me)->fdt->nr;
>>>> +	}
>>>> +	/* Now find process's max used fd number */
>>>> +	if (!list_empty(&rsti(me)->fds))
>>>> +		nr = list_entry(rsti(me)->fds.prev,
>>>> +				struct fdinfo_list_entry, ps_list)->fe->fd;
>>>> +	else
>>>> +		nr = -1;
>>>> +
>>>> +	nr = max(nr, inh_fd_max);
>>>> +	/*
>>>> +	 * Service fds go after max fd near right border of alignment:
>>>> +	 *
>>>> +	 * ...|max_fd|max_fd+1|...|sfd first|...|sfd last (aligned)|
>>>> +	 *
>>>> +	 * So, they take maximum numbers of area allocated by kernel.
>>>> +	 * See linux alloc_fdtable() for details.
>>>> +	 */
>>>> +	nr += (SERVICE_FD_MAX - SERVICE_FD_MIN) * fdt_nr;
>>>> +	nr += 16; /* Safety pad */
>>>> +	real_nr = nr;
>>>> +
>>>> +	nr /= (1024 / sizeof(void *));
>>>> +	nr = 1 << (32 - __builtin_clz(nr + 1));
>>>> +	nr *= (1024 / sizeof(void *));
>>>> +
>>>> +	if (nr > service_fd_rlim_cur) {
>>>> +		/* Right border is bigger, than rlim. OK, then just aligned value is enough */
>>>> +		nr = round_down(service_fd_rlim_cur, (1024 / sizeof(void *)));
>>>> +		if (nr < real_nr) {
>>>> +			pr_err("Can't chose service_fd_base: %d %d\n", nr, real_nr);
>>>> +			return -1;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return nr;
>>>> +}
>>>> +
>>>>  int clone_service_fd(struct pstree_item *me)
>>>>  {
>>>>  	int id, new_base, i, ret = -1;
>>>>  
>>>> -	new_base = service_fd_base;
>>>> +	new_base = choose_service_fd_base(me);
>>>>  	id = rsti(me)->service_fd_id;
>>>>  
>>>> -	if (service_fd_id == id)
>>>> +	if (new_base == -1)
>>>> +		return -1;
>>>> +	if (service_fd_base == new_base && service_fd_id == id)
>>>>  		return 0;
>>>
>>> For the root process, service_fd_base is changed 1048576 -> 256 and
>>> there is one service fd, so for the root process fdtable will be
>>> bigger than it has to be. Actually we have a problem each time when
>>> a child process has a smaller fdtable than a parent process.
>>
>> It's because of process's fds are read in root_prepare_shared()->prepare_fd_pid().
>> This is the reason, why I moved setup_newborn_fds() in "files: Do setup_newborn_fds() later".
>> It seems this was made to finish CR_STATE_PREPARE_NAMESPACES stage faster (it's being finished
>> before root_prepare_shared()), because it's important
>>
>>  
>>> And here is a test which shows this problem with more collors
>>> t a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>> index 1feabfa9f..2505cb54c 100644
>>> --- a/test/zdtm/static/env00.c
>>> +++ b/test/zdtm/static/env00.c
>>> @@ -1,6 +1,7 @@
>>>  #include <errno.h>
>>>  #include <stdlib.h>
>>>  #include <string.h>
>>> +#include <sched.h>
>>>  
>>>  #include "zdtmtst.h"
>>>  
>>> @@ -13,9 +14,22 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>  int main(int argc, char **argv)
>>>  {
>>>         char *env;
>>> +       int i;
>>> +       pid_t pid;
>>>  
>>>         test_init(argc, argv);
>>>  
>>> +       for (i = 0; i < 10; i++) {
>>> +               pid = fork();
>>> +               if (pid == 0) {
>>> +                       while (1)
>>> +                               sleep(1);
>>> +                       return 0;
>>> +               }
>>> +       }
>>> +
>>> +       dup2(1, 1000000);
>>> +
>>>         if (setenv(envname, test_author, 1)) {
>>>                 pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>                 exit(1);
>>>
>>> before dump:
>>> [root@fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
>>> 12276
>>>
>>> after restore:
>>> [root@fc24 criu]# echo  $((`cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes` >> 10))
>>> 108432
>>
>> It's not not obligatory connected with sfds. It's very difficult to measure the memory
>> connected with file table only. See below:
>>
>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>> index 1feabfa9f..d2517b835 100644
>> --- a/test/zdtm/static/env00.c
>> +++ b/test/zdtm/static/env00.c
>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>  
>>  int main(int argc, char **argv)
>>  {
>> +	int i;
>> +	pid_t pid;
>>  	char *env;
>>  
>>  	test_init(argc, argv);
>>  
>> +	for (i = 0; i < 10; i++) {
>> +		pid = fork();
>> +		if (pid == 0) {
>> +			while (1)
>> +				sleep(1);
>> +			return 0;
>> +		}
>> +	}
>> +
>> +//	dup2(1, 1000000);
>> +
>>  	if (setenv(envname, test_author, 1)) {
>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>  		exit(1);
>>
>> before dump:
>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>> 11849728
>>
>> after restore:
>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>> 20877312
>>
>> Criu restore changes many in-kernel structures, and this is a global problem, not
>> about sfds.
> 
> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> be 100MB. This is not a global problem, it is the problem that we move
> service descriptors in child processes.

Yes, but it's just a consequence of the service fds design in general, and it
is not a result of this patchset.
 
>>
>>>>  
>>>>  	/* Dup sfds in memmove() style: they may overlap */
>>>> -	if (get_service_fd(LOG_FD_OFF) > __get_service_fd(LOG_FD_OFF, id))
>>>> +	if (get_service_fd(LOG_FD_OFF) > new_base - LOG_FD_OFF - SERVICE_FD_MAX * id)
>>>>  		for (i = SERVICE_FD_MIN + 1; i < SERVICE_FD_MAX; i++)
>>>>  			move_service_fd(me, i, id, new_base);
>>>>  	else
>>>>  		for (i = SERVICE_FD_MAX - 1; i > SERVICE_FD_MIN; i--)
>>>>  			move_service_fd(me, i, id, new_base);
>>>>  
>>>> +	service_fd_base = new_base;
>>>>  	service_fd_id = id;
>>>>  	ret = 0;
>>
>> Kirill
Andrei Vagin Dec. 30, 2017, 4:48 p.m.
On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
> On 29.12.2017 22:59, Andrei Vagin wrote:
> >>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> >>> index 1feabfa9f..d2517b835 100644
> >>> --- a/test/zdtm/static/env00.c
> >>> +++ b/test/zdtm/static/env00.c
> >>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >>>  
> >>>  int main(int argc, char **argv)
> >>>  {
> >>> +	int i;
> >>> +	pid_t pid;
> >>>  	char *env;
> >>>  
> >>>  	test_init(argc, argv);
> >>>  
> >>> +	for (i = 0; i < 10; i++) {
> >>> +		pid = fork();
> >>> +		if (pid == 0) {
> >>> +			while (1)
> >>> +				sleep(1);
> >>> +			return 0;
> >>> +		}
> >>> +	}
> >>> +
> >>> +//	dup2(1, 1000000);
> >>> +
> >>>  	if (setenv(envname, test_author, 1)) {
> >>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >>>  		exit(1);
> >>>
> >>> before dump:
> >>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>> 11849728
> >>>
> >>> after restore:
> >>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>> 20877312
> >>>
> >>> Criu restore changes many in-kernel structures, and this is a global problem, not
> >>> about sfds.
> >>
> >> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> >> be 100MB. This is not a global problem, it is the problem that we move
> >> service descriptors in child processes.
> > 
> > I create a small path for zdtm.py to get kmem before and after c/r:
> > git clone https://github.com/avagin/criu -b 2195
> > 
> > And here are steps how I do my experiments:
> > 
> > [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> > [root@fc24 criu]# ulimit -n 1024
> > [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> > === Run 1/1 ================ zdtm/static/env00
> > 
> > ========================== Run zdtm/static/env00 in h ==========================
> > Start test
> > ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> > memory.kmem.usage_in_bytes = 9920512
> > Pause at pre-dump. Press any key to continue.
> > Run criu dump
> > 
> > Unable to kill 36: [Errno 3] No such process
> > Pause at pre-restore. Press any key to continue.Run criu restore
> > Pause at post-restore. Press any key to continue.
> > memory.kmem.usage_in_bytes = 22913024
> > before: memory.kmem.usage_in_bytes =              9920512
> > after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
> > Send the 15 signal to  36
> > Wait for zdtm/static/env00(36) to die for 0.100000
> > Unable to kill 36: [Errno 3] No such process
> > Removing dump/zdtm/static/env00/36
> > ========================= Test zdtm/static/env00 PASS ==========================
> > 
> > [root@fc24 criu]# 
> > [root@fc24 criu]# ulimit -n $((1 << 20))
> > [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> > [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> > === Run 1/1 ================ zdtm/static/env00
> > 
> > ========================== Run zdtm/static/env00 in h ==========================
> > Start test
> > ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> > memory.kmem.usage_in_bytes = 19714048
> > Pause at pre-dump. Press any key to continue.
> > Run criu dump
> > Unable to kill 36: [Errno 3] No such process
> > Pause at pre-restore. Press any key to continue.
> > Run criu restore
> > Pause at post-restore. Press any key to continue.
> > memory.kmem.usage_in_bytes = 914178048
> > before: memory.kmem.usage_in_bytes =             19714048
> > after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
> > Send the 15 signal to  36
> > Wait for zdtm/static/env00(36) to die for 0.100000
> > Unable to kill 36: [Errno 3] No such process
> > Removing dump/zdtm/static/env00/36
> > ========================= Test zdtm/static/env00 PASS ==========================
> > 
> > When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> > the test has only descriptors with small numbers. In this case,
> > a kmem consumption increases on 130%. We need to investigate this fact,
> > but it isn't so critical, if we compare it with the next case.
> > 
> > When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> > creates the 1000000 descriptor, and a kmem consumption after c/r increases on
> > 4537.19%. This is a real problem, what we have to solve with a high priority.
> 
> It's not a problem of my patchset, it's the problem of current engine. My patchset
> does not aim to exploit the engine and rework everything. It just improves
> the engine and it works better then current engine in many cases and not worse
> in the rest of cases. Yes, there are the cases, when the patchset works like
> current engine, but there is no a case, when it behaves worse. Strange to
> call me to solve all the file or memory problems at once, isn't it?!

I like this work. Thank you for it. But I see nothing strange in
discussing the other problems of this code and trying to solve them now or
later.

> 
> The goals of the patchset are:
> 1)Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
> i.e. make the placement to depend on task fds, not on prlimit().
> 2)Make people able to restore big fd numbers without increasing
> memory usage on all tasks in a container (as currently this requires
> to increase global limits).
> 
> #ulimit -n $((1 << 20))
> --- a/test/zdtm/static/env00.c
> +++ b/test/zdtm/static/env00.c
> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>         for (i = 0; i < 100; i++) {
>                 pid = fork();
>                 if (pid == 0) {
> +                       dup2(1, 10000);
>                         while (1)
>                                 sleep(1);
>                         return 0;
>                 }
>         }
>  
> -       dup2(1, 1000000);
> -
>         test_daemon();
>         test_waitsig();
> 
> Before patchset:
> before: memory.kmem.usage_in_bytes =             24076288
> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
> 
> After patchset:
> before: memory.kmem.usage_in_bytes =             23855104
> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
> 
> Someone may rework this later, and implement new engine using holes in task fd numbers.
> I'm not against that, everybody is welcome! I'm just not sure this will take small time for
> writing/stabilization/etc. And I don't think this is the first priority problem in criu,
> so this is why I'm not going to solve it now.

I think we can solve this problem in the context of these changes.

We can create service descriptors with a small base in criu, then we can
fork all processes, and then we can move the service descriptors to a
calculated base for each process. It is not optimal, but it will work.

It can be optimized if we predict the case when a child process has a
smaller service fd base...

> 
> Also, keep in mind, that after the rework is happened, you still meet the situation, when
> there are no spare fds of a task (1,..., pow(2,n)*128) are occupied, and the spare
> scheme will still require more memory than before the dump. Just one more boundary case
> like you pointed in your message.
> 
> Just analyse the tasks files on your linux system and find out how the patchset will help
> to reduce memory usage on restore of generic case.
> 
> Thanks,
> Kirill
Kirill Tkhai Jan. 9, 2018, 11:54 a.m.
On 30.12.2017 19:48, Andrei Vagin wrote:
> On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
>> On 29.12.2017 22:59, Andrei Vagin wrote:
>>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>>>> index 1feabfa9f..d2517b835 100644
>>>>> --- a/test/zdtm/static/env00.c
>>>>> +++ b/test/zdtm/static/env00.c
>>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>>>  
>>>>>  int main(int argc, char **argv)
>>>>>  {
>>>>> +	int i;
>>>>> +	pid_t pid;
>>>>>  	char *env;
>>>>>  
>>>>>  	test_init(argc, argv);
>>>>>  
>>>>> +	for (i = 0; i < 10; i++) {
>>>>> +		pid = fork();
>>>>> +		if (pid == 0) {
>>>>> +			while (1)
>>>>> +				sleep(1);
>>>>> +			return 0;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +//	dup2(1, 1000000);
>>>>> +
>>>>>  	if (setenv(envname, test_author, 1)) {
>>>>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>>>  		exit(1);
>>>>>
>>>>> before dump:
>>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>> 11849728
>>>>>
>>>>> after restore:
>>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>> 20877312
>>>>>
>>>>> Criu restore changes many in-kernel structures, and this is a global problem, not
>>>>> about sfds.
>>>>
>>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
>>>> be 100MB. This is not a global problem, it is the problem that we move
>>>> service descriptors in child processes.
>>>
>>> I create a small path for zdtm.py to get kmem before and after c/r:
>>> git clone https://github.com/avagin/criu -b 2195
>>>
>>> And here are steps how I do my experiments:
>>>
>>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>> [root@fc24 criu]# ulimit -n 1024
>>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>> === Run 1/1 ================ zdtm/static/env00
>>>
>>> ========================== Run zdtm/static/env00 in h ==========================
>>> Start test
>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>> memory.kmem.usage_in_bytes = 9920512
>>> Pause at pre-dump. Press any key to continue.
>>> Run criu dump
>>>
>>> Unable to kill 36: [Errno 3] No such process
>>> Pause at pre-restore. Press any key to continue.Run criu restore
>>> Pause at post-restore. Press any key to continue.
>>> memory.kmem.usage_in_bytes = 22913024
>>> before: memory.kmem.usage_in_bytes =              9920512
>>> after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
>>> Send the 15 signal to  36
>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>> Unable to kill 36: [Errno 3] No such process
>>> Removing dump/zdtm/static/env00/36
>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>
>>> [root@fc24 criu]# 
>>> [root@fc24 criu]# ulimit -n $((1 << 20))
>>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>> === Run 1/1 ================ zdtm/static/env00
>>>
>>> ========================== Run zdtm/static/env00 in h ==========================
>>> Start test
>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>> memory.kmem.usage_in_bytes = 19714048
>>> Pause at pre-dump. Press any key to continue.
>>> Run criu dump
>>> Unable to kill 36: [Errno 3] No such process
>>> Pause at pre-restore. Press any key to continue.
>>> Run criu restore
>>> Pause at post-restore. Press any key to continue.
>>> memory.kmem.usage_in_bytes = 914178048
>>> before: memory.kmem.usage_in_bytes =             19714048
>>> after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
>>> Send the 15 signal to  36
>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>> Unable to kill 36: [Errno 3] No such process
>>> Removing dump/zdtm/static/env00/36
>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>
>>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
>>> the test has only descriptors with small numbers. In this case,
>>> a kmem consumption increases on 130%. We need to investigate this fact,
>>> but it isn't so critical, if we compare it with the next case.
>>>
>>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
>>> creates the 1000000 descriptor, and a kmem consumption after c/r increases on
>>> 4537.19%. This is a real problem, what we have to solve with a high priority.
>>
>> It's not a problem of my patchset, it's the problem of current engine. My patchset
>> does not aim to exploit the engine and rework everything. It just improves
>> the engine and it works better then current engine in many cases and not worse
>> in the rest of cases. Yes, there are the cases, when the patchset works like
>> current engine, but there is no a case, when it behaves worse. Strange to
>> call me to solve all the file or memory problems at once, isn't it?!
> 
> I like this work. Thank you for it. But I see nothing strange to
> discuss the other problem of this code and try to solve them now or
> later.
> 
>>
>> The goals of the patchset are:
>> 1)Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
>> i.e. make the placement to depend on task fds, not on prlimit().
>> 2)Make people able to restore big fd numbers without increasing
>> memory usage on all tasks in a container (as currently this requires
>> to increase global limits).
>>
>> #ulimit -n $((1 << 20))
>> --- a/test/zdtm/static/env00.c
>> +++ b/test/zdtm/static/env00.c
>> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>>         for (i = 0; i < 100; i++) {
>>                 pid = fork();
>>                 if (pid == 0) {
>> +                       dup2(1, 10000);
>>                         while (1)
>>                                 sleep(1);
>>                         return 0;
>>                 }
>>         }
>>  
>> -       dup2(1, 1000000);
>> -
>>         test_daemon();
>>         test_waitsig();
>>
>> Before patchset:
>> before: memory.kmem.usage_in_bytes =             24076288
>> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
>>
>> After patchset:
>> before: memory.kmem.usage_in_bytes =             23855104
>> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
>>
>> Someone may rework this later, and implement new engine using holes in task fd numbers.
>> I'm not against that, everybody is welcome! I'm just not sure this will take small time for
>> writing/stabilization/etc. And I don't think this is the first priority problem in criu,
>> so this is why I'm not going to solve it now.
> 
> I think we can solve this problem in contex of these changes.
> 
> We can create service descriptors with a small base in criu, then we can
> fork all processes and then we can move service descriptors to a
> calculated base for each process. It is not optimal, but it will work.
> 
> It can be optimazed if we predict a case when a child process has a
> smaller service fd base...

I've tried this one, but it doesn't look good. Tasks with a shared fd table still require
that the fds relocation is made before forking of children. Helpers are included there too.
Service fds relocation and closing of the old files break up into several cases, which
(I assume) won't be easy to maintain.

In my opinion, a parent with big fds is an unlikely case, and we may skip it for a while.
We may get a better profit if we think about setting up a lower service_fd_base before
forking of root_item, as that always happens.

What do you think about this?

>>
>> Also, keep in mind, that after the rework is happened, you still meet the situation, when
>> there are no spare fds of a task (1,..., pow(2,n)*128) are occupied, and the spare
>> scheme will still require more memory than before the dump. Just one more boundary case
>> like you pointed in your message.
>>
>> Just analyse the tasks files on your linux system and find out how the patchset will help
>> to reduce memory usage on restore of generic case.
>>
>> Thanks,
>> Kirill
Andrei Vagin Jan. 9, 2018, 4:30 p.m.
On Tue, Jan 09, 2018 at 02:54:31PM +0300, Kirill Tkhai wrote:
> On 30.12.2017 19:48, Andrei Vagin wrote:
> > On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
> >> On 29.12.2017 22:59, Andrei Vagin wrote:
> >>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
> >>>>> index 1feabfa9f..d2517b835 100644
> >>>>> --- a/test/zdtm/static/env00.c
> >>>>> +++ b/test/zdtm/static/env00.c
> >>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
> >>>>>  
> >>>>>  int main(int argc, char **argv)
> >>>>>  {
> >>>>> +	int i;
> >>>>> +	pid_t pid;
> >>>>>  	char *env;
> >>>>>  
> >>>>>  	test_init(argc, argv);
> >>>>>  
> >>>>> +	for (i = 0; i < 10; i++) {
> >>>>> +		pid = fork();
> >>>>> +		if (pid == 0) {
> >>>>> +			while (1)
> >>>>> +				sleep(1);
> >>>>> +			return 0;
> >>>>> +		}
> >>>>> +	}
> >>>>> +
> >>>>> +//	dup2(1, 1000000);
> >>>>> +
> >>>>>  	if (setenv(envname, test_author, 1)) {
> >>>>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
> >>>>>  		exit(1);
> >>>>>
> >>>>> before dump:
> >>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>>>> 11849728
> >>>>>
> >>>>> after restore:
> >>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
> >>>>> 20877312
> >>>>>
> >>>>> Criu restore changes many in-kernel structures, and this is a global problem, not
> >>>>> about sfds.
> >>>>
> >>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
> >>>> be 100MB. This is not a global problem, it is the problem that we move
> >>>> service descriptors in child processes.
> >>>
> >>> I create a small path for zdtm.py to get kmem before and after c/r:
> >>> git clone https://github.com/avagin/criu -b 2195
> >>>
> >>> And here are steps how I do my experiments:
> >>>
> >>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> >>> [root@fc24 criu]# ulimit -n 1024
> >>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> >>> === Run 1/1 ================ zdtm/static/env00
> >>>
> >>> ========================== Run zdtm/static/env00 in h ==========================
> >>> Start test
> >>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> >>> memory.kmem.usage_in_bytes = 9920512
> >>> Pause at pre-dump. Press any key to continue.
> >>> Run criu dump
> >>>
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Pause at pre-restore. Press any key to continue.Run criu restore
> >>> Pause at post-restore. Press any key to continue.
> >>> memory.kmem.usage_in_bytes = 22913024
> >>> before: memory.kmem.usage_in_bytes =              9920512
> >>> after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
> >>> Send the 15 signal to  36
> >>> Wait for zdtm/static/env00(36) to die for 0.100000
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Removing dump/zdtm/static/env00/36
> >>> ========================= Test zdtm/static/env00 PASS ==========================
> >>>
> >>> [root@fc24 criu]# 
> >>> [root@fc24 criu]# ulimit -n $((1 << 20))
> >>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
> >>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
> >>> === Run 1/1 ================ zdtm/static/env00
> >>>
> >>> ========================== Run zdtm/static/env00 in h ==========================
> >>> Start test
> >>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
> >>> memory.kmem.usage_in_bytes = 19714048
> >>> Pause at pre-dump. Press any key to continue.
> >>> Run criu dump
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Pause at pre-restore. Press any key to continue.
> >>> Run criu restore
> >>> Pause at post-restore. Press any key to continue.
> >>> memory.kmem.usage_in_bytes = 914178048
> >>> before: memory.kmem.usage_in_bytes =             19714048
> >>> after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
> >>> Send the 15 signal to  36
> >>> Wait for zdtm/static/env00(36) to die for 0.100000
> >>> Unable to kill 36: [Errno 3] No such process
> >>> Removing dump/zdtm/static/env00/36
> >>> ========================= Test zdtm/static/env00 PASS ==========================
> >>>
> >>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
> >>> the test has only descriptors with small numbers. In this case,
> >>> a kmem consumption increases on 130%. We need to investigate this fact,
> >>> but it isn't so critical, if we compare it with the next case.
> >>>
> >>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
> >>> creates the 1000000 descriptor, and a kmem consumption after c/r increases on
> >>> 4537.19%. This is a real problem, what we have to solve with a high priority.
> >>
> >> It's not a problem of my patchset, it's the problem of current engine. My patchset
> >> does not aim to exploit the engine and rework everything. It just improves
> >> the engine and it works better then current engine in many cases and not worse
> >> in the rest of cases. Yes, there are the cases, when the patchset works like
> >> current engine, but there is no a case, when it behaves worse. Strange to
> >> call me to solve all the file or memory problems at once, isn't it?!
> > 
> > I like this work. Thank you for it. But I see nothing strange to
> > discuss the other problem of this code and try to solve them now or
> > later.
> > 
> >>
> >> The goals of the patchset are:
> >> 1)Disconnect prlimit(RLIMIT_NOFILE) and service fds placement,
> >> i.e. make the placement to depend on task fds, not on prlimit().
> >> 2)Make people able to restore big fd numbers without increasing
> >> memory usage on all tasks in a container (as currently this requires
> >> to increase global limits).
> >>
> >> #ulimit -n $((1 << 20))
> >> --- a/test/zdtm/static/env00.c
> >> +++ b/test/zdtm/static/env00.c
> >> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
> >>         for (i = 0; i < 100; i++) {
> >>                 pid = fork();
> >>                 if (pid == 0) {
> >> +                       dup2(1, 10000);
> >>                         while (1)
> >>                                 sleep(1);
> >>                         return 0;
> >>                 }
> >>         }
> >>  
> >> -       dup2(1, 1000000);
> >> -
> >>         test_daemon();
> >>         test_waitsig();
> >>
> >> Before patchset:
> >> before: memory.kmem.usage_in_bytes =             24076288
> >> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
> >>
> >> After patchset:
> >> before: memory.kmem.usage_in_bytes =             23855104
> >> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
> >>
> >> Someone may rework this later, and implement new engine using holes in task fd numbers.
> >> I'm not against that, everybody is welcome! I'm just not sure this will take small time for
> >> writing/stabilization/etc. And I don't think this is the first priority problem in criu,
> >> so this is why I'm not going to solve it now.
> > 
> > I think we can solve this problem in contex of these changes.
> > 
> > We can create service descriptors with a small base in criu, then we can
> > fork all processes and then we can move service descriptors to a
> > calculated base for each process. It is not optimal, but it will work.
> > 
> > It can be optimazed if we predict a case when a child process has a
> > smaller service fd base...
> 
> I've tried this one, but it looks not good. Tasks with shared fd table still require
> the fds relocation are made before forking of children. Helpers are included there too.
> Service fds relocation and closing of old files break up in several cases, which
> (I assume) won't be easy to maintain.
> 
> In my opinion parent with big fds is unlikely case, and we may skip them for a while.
> We may get a better profit if we thing about setting up lower service_fds_base before
> forking of root_item, as it happens always.
> 
> What do you think about this?

Ok, let's postpone this task. If you have time, could you describe the
current state of service fds on the CRIU wiki, with all known issues?

Thanks,
Andrei

> 
> >>
> >> Also, keep in mind, that after the rework is happened, you still meet the situation, when
> >> there are no spare fds of a task (1,..., pow(2,n)*128) are occupied, and the spare
> >> scheme will still require more memory than before the dump. Just one more boundary case
> >> like you pointed in your message.
> >>
> >> Just analyse the tasks files on your linux system and find out how the patchset will help
> >> to reduce memory usage on restore of generic case.
> >>
> >> Thanks,
> >> Kirill
Kirill Tkhai Jan. 10, 2018, 7:39 a.m.
On 09.01.2018 19:30, Andrei Vagin wrote:
> On Tue, Jan 09, 2018 at 02:54:31PM +0300, Kirill Tkhai wrote:
>> On 30.12.2017 19:48, Andrei Vagin wrote:
>>> On Sat, Dec 30, 2017 at 12:22:32PM +0300, Kirill Tkhai wrote:
>>>> On 29.12.2017 22:59, Andrei Vagin wrote:
>>>>>>> diff --git a/test/zdtm/static/env00.c b/test/zdtm/static/env00.c
>>>>>>> index 1feabfa9f..d2517b835 100644
>>>>>>> --- a/test/zdtm/static/env00.c
>>>>>>> +++ b/test/zdtm/static/env00.c
>>>>>>> @@ -12,10 +12,23 @@ TEST_OPTION(envname, string, "environment variable name", 1);
>>>>>>>  
>>>>>>>  int main(int argc, char **argv)
>>>>>>>  {
>>>>>>> +	int i;
>>>>>>> +	pid_t pid;
>>>>>>>  	char *env;
>>>>>>>  
>>>>>>>  	test_init(argc, argv);
>>>>>>>  
>>>>>>> +	for (i = 0; i < 10; i++) {
>>>>>>> +		pid = fork();
>>>>>>> +		if (pid == 0) {
>>>>>>> +			while (1)
>>>>>>> +				sleep(1);
>>>>>>> +			return 0;
>>>>>>> +		}
>>>>>>> +	}
>>>>>>> +
>>>>>>> +//	dup2(1, 1000000);
>>>>>>> +
>>>>>>>  	if (setenv(envname, test_author, 1)) {
>>>>>>>  		pr_perror("Can't set env var \"%s\" to \"%s\"", envname, test_author);
>>>>>>>  		exit(1);
>>>>>>>
>>>>>>> before dump:
>>>>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>>>> 11849728
>>>>>>>
>>>>>>> after restore:
>>>>>>> root@pro:/home/kirill# cat /sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes
>>>>>>> 20877312
>>>>>>>
>>>>>>> Criu restore changes many in-kernel structures, and this is a global problem, not
>>>>>>> about sfds.
>>>>>>
>>>>>> In this case, the delta is 10MB. If you uncomment dup2(), the delta will
>>>>>> be 100MB. This is not a global problem, it is the problem that we move
>>>>>> service descriptors in child processes.
>>>>>
>>>>> I create a small path for zdtm.py to get kmem before and after c/r:
>>>>> git clone https://github.com/avagin/criu -b 2195
>>>>>
>>>>> And here are steps how I do my experiments:
>>>>>
>>>>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>>>> [root@fc24 criu]# ulimit -n 1024
>>>>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>>>> === Run 1/1 ================ zdtm/static/env00
>>>>>
>>>>> ========================== Run zdtm/static/env00 in h ==========================
>>>>> Start test
>>>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>>>> memory.kmem.usage_in_bytes = 9920512
>>>>> Pause at pre-dump. Press any key to continue.
>>>>> Run criu dump
>>>>>
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Pause at pre-restore. Press any key to continue.Run criu restore
>>>>> Pause at post-restore. Press any key to continue.
>>>>> memory.kmem.usage_in_bytes = 22913024
>>>>> before: memory.kmem.usage_in_bytes =              9920512
>>>>> after:  memory.kmem.usage_in_bytes =             22913024 (230.97%)
>>>>> Send the 15 signal to  36
>>>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Removing dump/zdtm/static/env00/36
>>>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>>>
>>>>> [root@fc24 criu]# 
>>>>> [root@fc24 criu]# ulimit -n $((1 << 20))
>>>>> [root@fc24 criu]# rmdir /sys/fs/cgroup/memory/test/xxx
>>>>> [root@fc24 criu]# python test/zdtm.py run -t zdtm/static/env00 --sbs -f h --memcg /sys/fs/cgroup/memory/test/xxx
>>>>> === Run 1/1 ================ zdtm/static/env00
>>>>>
>>>>> ========================== Run zdtm/static/env00 in h ==========================
>>>>> Start test
>>>>> ./env00 --pidfile=env00.pid --outfile=env00.out --envname=ENV_00_TEST
>>>>> memory.kmem.usage_in_bytes = 19714048
>>>>> Pause at pre-dump. Press any key to continue.
>>>>> Run criu dump
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Pause at pre-restore. Press any key to continue.
>>>>> Run criu restore
>>>>> Pause at post-restore. Press any key to continue.
>>>>> memory.kmem.usage_in_bytes = 914178048
>>>>> before: memory.kmem.usage_in_bytes =             19714048
>>>>> after:  memory.kmem.usage_in_bytes =            914178048 (4637.19%)
>>>>> Send the 15 signal to  36
>>>>> Wait for zdtm/static/env00(36) to die for 0.100000
>>>>> Unable to kill 36: [Errno 3] No such process
>>>>> Removing dump/zdtm/static/env00/36
>>>>> ========================= Test zdtm/static/env00 PASS ==========================
>>>>>
>>>>> When we call ulimit -n 1024, dup2(1, 1000000) returns an error and
>>>>> the test only has descriptors with small numbers. In this case,
>>>>> kmem consumption increases by about 130%. We need to investigate this,
>>>>> but it isn't as critical as the next case.
>>>>>
>>>>> When we call ulimit -n $((1 << 20)) before running the test, dup2(1, 1000000)
>>>>> creates descriptor 1000000, and kmem consumption after c/r grows to 4637.19%
>>>>> of the original value. This is a real problem, which we have to solve with
>>>>> high priority.
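As a minimal stand-alone illustration of why the two runs differ (a sketch, not code from the thread): dup2() cannot create a descriptor at or above the RLIMIT_NOFILE soft limit, so with ulimit -n 1024 fd 1000000 is never created at all, while with the raised limit the kernel has to allocate an fd table covering roughly a million slots.

/* Minimal sketch (not from the patchset): dup2() to a number at or
 * above the RLIMIT_NOFILE soft limit fails (on Linux with EBADF),
 * which is why fd 1000000 only exists in the second run above. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_NOFILE, &rlim)) {
		perror("getrlimit");
		return 1;
	}
	printf("RLIMIT_NOFILE soft limit: %llu\n",
	       (unsigned long long)rlim.rlim_cur);

	if (dup2(1, 1000000) < 0)
		printf("dup2(1, 1000000) failed: %s\n", strerror(errno));
	else
		printf("dup2(1, 1000000) succeeded, the fd table now spans ~1M slots\n");

	return 0;
}

Running it once under "ulimit -n 1024" and once under "ulimit -n $((1 << 20))" reproduces the difference between the two zdtm runs.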
>>>>
>>>> It's not a problem of my patchset, it's a problem of the current engine. My patchset
>>>> does not aim to exploit the engine and rework everything. It just improves
>>>> the engine, and it works better than the current engine in many cases and no worse
>>>> in the rest. Yes, there are cases when the patchset works like the current
>>>> engine, but there is no case when it behaves worse. It's strange to
>>>> ask me to solve all the file or memory problems at once, isn't it?!
>>>
>>> I like this work. Thank you for it. But I see nothing strange in
>>> discussing the other problems of this code and trying to solve them now or
>>> later.
>>>
>>>>
>>>> The goals of the patchset are:
>>>> 1) Disconnect prlimit(RLIMIT_NOFILE) from service fds placement,
>>>> i.e. make the placement depend on a task's fds, not on prlimit().
>>>> 2) Allow people to restore big fd numbers without increasing
>>>> memory usage for all tasks in a container (currently this requires
>>>> increasing global limits).
>>>>
>>>> #ulimit -n $((1 << 20))
>>>> --- a/test/zdtm/static/env00.c
>>>> +++ b/test/zdtm/static/env00.c
>>>> @@ -25,14 +25,13 @@ int main(int argc, char **argv)
>>>>         for (i = 0; i < 100; i++) {
>>>>                 pid = fork();
>>>>                 if (pid == 0) {
>>>> +                       dup2(1, 10000);
>>>>                         while (1)
>>>>                                 sleep(1);
>>>>                         return 0;
>>>>                 }
>>>>         }
>>>>  
>>>> -       dup2(1, 1000000);
>>>> -
>>>>         test_daemon();
>>>>         test_waitsig();
>>>>
>>>> Before patchset:
>>>> before: memory.kmem.usage_in_bytes =             24076288
>>>> after:  memory.kmem.usage_in_bytes =            921894912 (3829.06%)
>>>>
>>>> After patchset:
>>>> before: memory.kmem.usage_in_bytes =             23855104
>>>> after:  memory.kmem.usage_in_bytes =             39591936 (165.97%)
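A rough back-of-the-envelope estimate (an illustration under the assumption that, before the patchset, every restored task's fd table has to reach the rlimit of 1 << 20; not a measurement from the thread) shows where the ~900 MB "before patchset" figure comes from:

/* Back-of-the-envelope sketch (illustration only): if every restored
 * task's fd table must span 1 << 20 slots, the fd array alone costs
 * about 8 MB per task on a 64-bit kernel. */
#include <stdio.h>

int main(void)
{
	unsigned long slots = 1UL << 20;			/* fds per table */
	unsigned long per_task = slots * sizeof(void *);	/* ~8 MB on 64-bit */
	unsigned long tasks = 100;				/* children forked by the test */

	printf("per-task fd array: ~%lu MB\n", per_task >> 20);
	printf("%lu tasks together: ~%lu MB\n", tasks, (per_task * tasks) >> 20);
	return 0;
}

About 8 MB of fd array per task times roughly a hundred tasks already accounts for ~800 MB; the rest of the measured delta is other per-task kernel objects.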
>>>>
>>>> Someone may rework this later and implement a new engine using holes in task fd numbers.
>>>> I'm not against that, everybody is welcome! I'm just not sure it would take little time to
>>>> write and stabilize, and I don't think this is the highest priority problem in criu,
>>>> which is why I'm not going to solve it now.
>>>
>>> I think we can solve this problem in the context of these changes.
>>>
>>> We can create service descriptors with a small base in criu, then we can
>>> fork all processes, and then we can move the service descriptors to a
>>> base calculated for each process. It is not optimal, but it will work.
>>>
>>> It can be optimized if we predict the case when a child process has a
>>> smaller service fd base...
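A very rough sketch of that scheme (hypothetical helper and constants, not CRIU code): each task is forked with its service fds packed at a low common base, and after the fork it dup2()s them up to the base calculated from its own maximal used fd and closes the low copies.

/* Hypothetical sketch of the proposal above, not CRIU code: relocate
 * service fds from a low common base to a per-task base after fork. */
#include <unistd.h>

#define SFD_NR		10	/* assumed number of service fd types */
#define SFD_LOW_BASE	64	/* assumed common base used while forking */

int relocate_service_fds(int new_base)
{
	int type;

	for (type = 0; type < SFD_NR; type++) {
		int old_fd = SFD_LOW_BASE + type;
		int new_fd = new_base + type;

		if (new_fd == old_fd)
			continue;
		/* Copy the service fd to its final, task-specific slot... */
		if (dup2(old_fd, new_fd) < 0)
			return -1;
		/* ...and free the low number for the task's own files. */
		close(old_fd);
	}
	return 0;
}

In the proposal this would run in each forked task once its final base is known, before its own files are restored at their target numbers.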
>>
>> I've tried this, but it doesn't look good. Tasks with a shared fd table still require
>> the fd relocation to be done before forking children, and helpers are affected too.
>> Service fd relocation and closing of the old files split into several cases, which
>> (I assume) won't be easy to maintain.
>>
>> In my opinion a parent with big fds is an unlikely case, and we may skip it for a while.
>> We may get more profit if we think about setting up a lower service_fd_base before
>> forking root_item, as that always happens.
>>
>> What do you think about this?
> 
> Ok, let's postpone this task. If you have time, could you describe the
> current state of service fd-s on the criu wiki with all known issues.

Ok. Should some existing files wiki page be used, or should I introduce a new one?

Kirill
 
> Thanks,
> Andrei
> 
>>
>>>>
>>>> Also, keep in mind that after the rework happens, you can still meet the situation when
>>>> there are no spare fds, i.e. all of a task's fds (1, ..., pow(2,n)*128) are occupied, and the spare-fd
>>>> scheme will still require more memory than before the dump. Just one more boundary case
>>>> like the one you pointed out in your message.
>>>>
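For reference, a small stand-alone sketch of the pow(2,n)*128 sizing mentioned above (an approximation of the 64-bit fd table rounding, for illustration only):

/* Approximation (illustration only) of how the fd table size is
 * rounded on 64-bit: up to the next power-of-two multiple of 128
 * slots, as mentioned above. */
#include <stdio.h>

static unsigned long fdtable_slots(unsigned long max_fd)
{
	unsigned long need = max_fd + 1, slots = 128;

	while (slots < need)
		slots <<= 1;
	return slots;
}

int main(void)
{
	printf("max fd 100   -> %lu slots\n", fdtable_slots(100));	/* 128   */
	printf("max fd 128   -> %lu slots\n", fdtable_slots(128));	/* 256   */
	printf("max fd 1000  -> %lu slots\n", fdtable_slots(1000));	/* 1024  */
	printf("max fd 10000 -> %lu slots\n", fdtable_slots(10000));	/* 16384 */
	return 0;
}

If every fd up to one of those boundaries is already occupied, placing anything above it, service fd or not, doubles the allocation.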
>>>> Just analyse the task files on your Linux system and find out how the patchset helps
>>>> to reduce memory usage on restore in the generic case.
>>>>
>>>> Thanks,
>>>> Kirill
Andrei Vagin Jan. 11, 2018, 5:45 p.m.
On Wed, Jan 10, 2018 at 10:39:56AM +0300, Kirill Tkhai wrote:
> On 09.01.2018 19:30, Andrei Vagin wrote:
> >> What do you think about this?
> > 
> > Ok, let's postpone this task. If you have time, could you describe the
> > current state of service fd-s on the criu wiki with all known issues.
> 
> Ok. Should some existing files wiki page be used, or should I introduce a new one?

https://criu.org/Service_descriptors


> 
> Kirill
>  
> > Thanks,
> > Andrei
> > 
> >>
> >>>>
> >>>> Also, keep in mind that after the rework happens, you can still meet the situation when
> >>>> there are no spare fds, i.e. all of a task's fds (1, ..., pow(2,n)*128) are occupied, and the spare-fd
> >>>> scheme will still require more memory than before the dump. Just one more boundary case
> >>>> like the one you pointed out in your message.
> >>>>
> >>>> Just analyse the task files on your Linux system and find out how the patchset helps
> >>>> to reduce memory usage on restore in the generic case.
> >>>>
> >>>> Thanks,
> >>>> Kirill
Kirill Tkhai Jan. 12, 2018, 8:43 a.m.
On 11.01.2018 20:45, Andrei Vagin wrote:
> On Wed, Jan 10, 2018 at 10:39:56AM +0300, Kirill Tkhai wrote:
>> On 09.01.2018 19:30, Andrei Vagin wrote:
>>>> What do you think about this?
>>>
>>> Ok, let's postpone this task. If you have time, could you describe the
>>> current state of service fd-s on the criu wiki with all known issues.
>>
>> Ok. Should some existing files wiki page be used, or should I introduce a new one?
> 
> https://criu.org/Service_descriptors

Ok, thanks. I'll update it with the actual state of service fds after we finish with the patchset.
 
>>
>> Kirill
>>  
>>> Thanks,
>>> Andrei
>>>
>>>>
>>>>>>
>>>>>> Also, keep in mind that after the rework happens, you can still meet the situation when
>>>>>> there are no spare fds, i.e. all of a task's fds (1, ..., pow(2,n)*128) are occupied, and the spare-fd
>>>>>> scheme will still require more memory than before the dump. Just one more boundary case
>>>>>> like the one you pointed out in your message.
>>>>>>
>>>>>> Just analyse the task files on your Linux system and find out how the patchset helps
>>>>>> to reduce memory usage on restore in the generic case.
>>>>>>
>>>>>> Thanks,
>>>>>> Kirill