[libvzctl] nsops: Extend cg_attach_task to skip certain cgroups

Submitted by Kirill Gorkunov on Nov. 2, 2017, 3:48 p.m.

Details

Message ID 20171102154801.GA27304@uranus
State New
Series "nsops: Extend cg_attach_task to skip certain cgroups"

Commit Message

During restore we rely on criu to move the task being restored into
the appropriate cgroups. But due to restore specifics we already start
criu inside the designated memory cgroup. Moreover, once the userns
daemon has started inside criu (to perform privileged operations from
the ve0 context), we poke the container with the "init" task and pass
its pid via "START $pid", although the tasks have not yet been moved
into their target cgroups. At that point the kernel marks the existing
roots with CGRP_VE_ROOT to hide toplevel cgroup paths from the
container's view.

As a result, cgroups that tasks are assigned to later no longer obtain
that bit, so their paths are not mangled and /proc/$pid/cgroup ends up
pointing into toplevel entries.

We get

 | 16:ve:/
 | 11:freezer:/machine.slice/101
 | 10:devices:/machine.slice/101
 | 9:net_prio,net_cls:/machine.slice/101
 | 8:cpuacct,cpu:/machine.slice/101
 | 7:pids:/machine.slice/101
 | 6:hugetlb:/machine.slice/101
 | 5:perf_event:/machine.slice/101
 | 4:name=systemd:/101
 | 3:beancounter:/
 | 2:memory:/
 | 1:blkio:/machine.slice/101

Instead of proper

 | 16:ve:/
 | 14:devices:/
 | 13:freezer:/
 | 12:pids:/
 | 9:cpuacct,cpu:/
 | 8:net_prio,net_cls:/
 | 7:hugetlb:/
 | 5:perf_event:/
 | 4:name=systemd:/
 | 3:beancounter:/
 | 2:memory:/
 | 1:blkio:/
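
To tell the two listings apart mechanically, one can check whether the
path field of each /proc/$pid/cgroup line is the bare root "/". The
snippet below is only a sketch for illustration: cg_line_is_root() is a
hypothetical helper, not part of libvzctl or of this patch, and it
assumes the usual "hierarchy-ID:controller-list:path" line layout.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not in libvzctl): inspect one line of
 * /proc/$pid/cgroup, assumed to look like "ID:controllers:path".
 * Returns 1 if the path is exactly "/", 0 if it is anything else,
 * and -1 if the line does not have three colon-separated fields.
 */
static int cg_line_is_root(const char *line)
{
	const char *p;

	assert(line != NULL);

	p = strchr(line, ':');		/* end of the hierarchy ID */
	if (!p)
		return -1;
	p = strchr(p + 1, ':');		/* end of the controller list */
	if (!p)
		return -1;
	p++;				/* first character of the path */

	/* Accept "/" followed by end of string or a trailing newline. */
	return (p[0] == '/' && (p[1] == '\0' || p[1] == '\n')) ? 1 : 0;
}
```

With the lines quoted above, "2:memory:/" is reported as root, while
"11:freezer:/machine.slice/101" is not.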

Thus in the patch we extend cg_attach_task to allow filtering the
VE cgroup out of attachment on restore while the rest of the cgroups
are attached. CRIU will join veX by itself.
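
The only/except rule described above can be sketched in isolation as
follows. This is an illustration, not the patched code: cg_skip_subsys()
and cg_count_attached() are hypothetical helpers, and the subsystem
names used with them (such as "ve" and "memory") are placeholder
strings, not the actual values of the CG_* macros.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not in libvzctl): returns 1 when a subsystem
 * should be skipped, following the same only/except checks as the
 * loop in the patched cg_attach_task().
 */
static int cg_skip_subsys(const char *subsys, const char *only,
			  const char *except)
{
	assert(subsys != NULL);

	/* "only" given: skip every subsystem that does not match it. */
	if (only && strcmp(subsys, only))
		return 1;
	/* "except" given: skip exactly the matching subsystem. */
	if (except && !strcmp(subsys, except))
		return 1;
	return 0;
}

/* Count how many of the given subsystem names would be attached. */
static size_t cg_count_attached(const char *const *names, size_t n,
				const char *only, const char *except)
{
	size_t i, attached = 0;

	for (i = 0; i < n; i++)
		if (!cg_skip_subsys(names[i], only, except))
			attached++;
	return attached;
}
```

Passing NULL for both filters keeps every subsystem, which matches the
plain start path; passing only an "except" name drops that single
subsystem, which is what the restore path relies on.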

https://jira.sw.ru/browse/PSBM-64756

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---

Guys, please review carefully! I tested it locally and all went fine,
still I'm a bit nervous about the device cgroup (at the moment of
restore all permissions we need for device access are already set up
by libvzctl, but still).

 lib/cgroup.c    |  8 ++++++--
 lib/cgroup.h    |  2 +-
 lib/env_nsops.c | 18 +++++++++---------
 3 files changed, 16 insertions(+), 12 deletions(-)


diff --git a/lib/cgroup.c b/lib/cgroup.c
index d5cc9e5..95467c8 100644
--- a/lib/cgroup.c
+++ b/lib/cgroup.c
@@ -620,12 +620,16 @@  int cg_disable_pseudosuper(const int pseudosuper_fd)
 	return do_write_data(pseudosuper_fd, NULL, "0", 1);
 }
 
-int cg_attach_task(const char *ctid, pid_t pid, char *cg_subsys)
+int cg_attach_task(const char *ctid, pid_t pid, char *cg_subsys_only, char *cg_subsys_except)
 {
 	int ret, i;
 
 	for (i = 0; i < sizeof(cg_ctl_map)/sizeof(cg_ctl_map[0]); i++) {
-		if (cg_subsys && strcmp(cg_ctl_map[i].subsys, cg_subsys))
+		if (cg_subsys_only &&
+		    strcmp(cg_ctl_map[i].subsys, cg_subsys_only))
+			continue;
+		else if (cg_subsys_except &&
+			 !strcmp(cg_ctl_map[i].subsys, cg_subsys_except))
 			continue;
 		ret = cg_set_ul(ctid, cg_ctl_map[i].subsys, "tasks", pid);
 		if (ret == -1)
diff --git a/lib/cgroup.h b/lib/cgroup.h
index a27ca96..2407d07 100644
--- a/lib/cgroup.h
+++ b/lib/cgroup.h
@@ -58,7 +58,7 @@  int cg_destroy_cgroup(const char *ctid);
 int cg_enable_pseudosuper(const char *ctid);
 int cg_pseudosuper_open(const char *ctid, int *fd);
 int cg_disable_pseudosuper(const int pseudosuper_fd);
-int cg_attach_task(const char *ctid, pid_t pid, char *cg_subsys);
+int cg_attach_task(const char *ctid, pid_t pid, char *cg_subsys_only, char *cg_subsys_except);
 int cg_set_param(const char *ctid, const char *subsys, const char *name, const char *data);
 int cg_get_param(const char *ctid, const char *subsys, const char *name, char *out, int size);
 int cg_get_ul(const char *ctid, const char *subsys, const char *name,
diff --git a/lib/env_nsops.c b/lib/env_nsops.c
index e2a826a..249b63b 100644
--- a/lib/env_nsops.c
+++ b/lib/env_nsops.c
@@ -771,21 +771,21 @@  static int do_env_create(struct vzctl_env_handle *h, struct start_param *param)
 	 * When plain container start we should
 	 * exec init from inside of VE and other
 	 * cgroups, in turn restore procedure
-	 * always start on VE0 and criu moves
-	 * children into appropriate cgroups.
+	 * always start on VE0 so joining inside
+	 * VEX made by CRIU. Still we have to
+	 * enter the rest of cgroups to properly
+	 * hide cgroup roots in /proc/$pid/cgroup
+	 * from inside of container (grep CGRP_VE_ROOT
+	 * in kernel source code).
 	 */
 	if (!param->fn) {
-		ret = cg_attach_task(h->ctid, getpid(), NULL);
+		ret = cg_attach_task(h->ctid, getpid(), NULL, NULL);
 		if (ret)
 			goto err;
 	} else {
-		ret = cg_attach_task(h->ctid, getpid(), CG_MEMORY);
+		ret = cg_attach_task(h->ctid, getpid(), NULL, CG_VE);
 		if (ret)
 			goto err;
-		ret = cg_attach_task(h->ctid, getpid(), CG_UB);
-		if (ret)
-			goto err;
-
 	}
 
 #if 0
@@ -947,7 +947,7 @@  static int ns_env_enter(struct vzctl_env_handle *h, int flags)
 	if (dp == NULL)
 		return vzctl_err(-1, errno, "Unable to open dir %s", path);
 
-	ret = cg_attach_task(EID(h), getpid(), NULL);
+	ret = cg_attach_task(EID(h), getpid(), NULL, NULL);
 	if (ret)
 		goto err;