[Devel,RH7] ve/pid: Export kernel.pid_max via ve cgroup

Submitted by Pavel Tikhomirov on July 19, 2016, 11 a.m.

Details

Message ID 1468926020-25609-1-git-send-email-ptikhomirov@virtuozzo.com
State New
Series "ve/pid: Export kernel.pid_max via ve cgroup"
Headers show

Commit Message

Pavel Tikhomirov July 19, 2016, 11 a.m.
This member represents kernel.pid_max sysctl it is vz-specific but
lays on pid namespace. To be able to c/r from libvzctl script it is
better put pid_max in ve cgroup, these way we do not need to enter
container root pid namespace to get/set these sysctl.

Add ve_write_running_u64 helper to be able to set pid_max on running
container, as we can't set pid_max before we have ve's pidns.

Will send libvzctl PR after these one is commited.

https://jira.sw.ru/browse/PSBM-48397
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 kernel/ve/ve.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index b5399e1..f7eb445 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -1194,6 +1194,7 @@  enum {
 	VE_CF_CLOCK_MONOTONIC,
 	VE_CF_CLOCK_BOOTBASED,
 	VE_CF_AIO_MAX_NR,
+	VE_CF_PID_MAX,
 };
 
 static int ve_ts_read(struct cgroup *cg, struct cftype *cft, struct seq_file *m)
@@ -1255,6 +1256,11 @@  static u64 ve_read_u64(struct cgroup *cg, struct cftype *cft)
 		return cgroup_ve(cg)->is_pseudosuper;
 	else if (cft->private == VE_CF_AIO_MAX_NR)
 		return cgroup_ve(cg)->aio_max_nr;
+	else if (cft->private == VE_CF_PID_MAX) {
+		struct ve_struct *ve = cgroup_ve(cg);
+		if (ve->ve_ns && ve->ve_ns->pid_ns)
+			return ve->ve_ns->pid_ns->pid_max;
+	}
 	return 0;
 }
 
@@ -1287,7 +1293,26 @@  static int ve_write_pseudosuper(struct cgroup *cg,
 	return 0;
 }
 
-static int ve_write_u64(struct cgroup *cg, struct cftype *cft, u64 value)
+extern int pid_max_min, pid_max_max;
+
+static int ve_write_pid_max(struct cgroup *cg,
+			    struct cftype *cft,
+			    u64 value)
+{
+	struct ve_struct *ve = cgroup_ve(cg);
+	if (!ve->ve_ns || !ve->ve_ns->pid_ns)
+		return -EBUSY;
+
+	if (pid_max_min > value ||
+	     pid_max_max < value)
+		return -EINVAL;
+
+	ve->ve_ns->pid_ns->pid_max = value;
+	return 0;
+}
+
+static int _ve_write_u64(struct cgroup *cg, struct cftype *cft,
+                         u64 value, int running)
 {
 	struct ve_struct *ve = cgroup_ve(cg);
 
@@ -1296,7 +1321,7 @@  static int ve_write_u64(struct cgroup *cg, struct cftype *cft, u64 value)
 		return -EPERM;
 
 	down_write(&ve->op_sem);
-	if (ve->is_running || ve->ve_ns) {
+	if (!running && (ve->is_running || ve->ve_ns)) {
 		up_write(&ve->op_sem);
 		return -EBUSY;
 	}
@@ -1309,10 +1334,26 @@  static int ve_write_u64(struct cgroup *cg, struct cftype *cft, u64 value)
 #endif
 	else if (cft->private == VE_CF_AIO_MAX_NR)
 		ve->aio_max_nr = value;
+	else if (cft->private == VE_CF_PID_MAX) {
+		int ret;
+		ret = ve_write_pid_max(cg, cft, value);
+		up_write(&ve->op_sem);
+		return ret;
+	}
 	up_write(&ve->op_sem);
 	return 0;
 }
 
+static int ve_write_u64(struct cgroup *cg, struct cftype *cft, u64 value)
+{
+	return _ve_write_u64(cg, cft, value, 0);
+}
+
+static int ve_write_running_u64(struct cgroup *cg, struct cftype *cft, u64 value)
+{
+	return _ve_write_u64(cg, cft, value, 1);
+}
+
 static struct cftype ve_cftypes[] = {
 	{
 		.name			= "state",
@@ -1384,6 +1425,13 @@  static struct cftype ve_cftypes[] = {
 		.write_u64		= ve_write_u64,
 		.private		= VE_CF_AIO_MAX_NR,
 	},
+	{
+		.name			= "pid_max",
+		.flags			= CFTYPE_NOT_ON_ROOT,
+		.read_u64		= ve_read_u64,
+		.write_u64		= ve_write_running_u64,
+		.private		= VE_CF_PID_MAX,
+	},
 	{ }
 };
 

Comments

Kirill Gorkunov July 19, 2016, 11:17 a.m.
On Tue, Jul 19, 2016 at 02:00:20PM +0300, Pavel Tikhomirov wrote:
> This member represents kernel.pid_max sysctl it is vz-specific but
> lays on pid namespace. To be able to c/r from libvzctl script it is
> better put pid_max in ve cgroup, these way we do not need to enter
> container root pid namespace to get/set these sysctl.

Wait, kernel.pid_max is not ve-specific (see kernel/sysctl.c in
vanilla kernel). Why do we have to c/r it at all?
Pavel Tikhomirov July 19, 2016, 11:26 a.m.
On 07/19/2016 02:17 PM, Cyrill Gorcunov wrote:
> On Tue, Jul 19, 2016 at 02:00:20PM +0300, Pavel Tikhomirov wrote:
>> This member represents kernel.pid_max sysctl it is vz-specific but
>> lays on pid namespace. To be able to c/r from libvzctl script it is
>> better put pid_max in ve cgroup, these way we do not need to enter
>> container root pid namespace to get/set these sysctl.
>
> Wait, kernel.pid_max is not ve-specific (see kernel/sysctl.c in
> vanilla kernel). Why do we have to c/r it at all?

It is virtualized only in VZ7(see proc_dointvec_pidmax), so in 
mainstream it is global sysctl unlike our case.

>
Kirill Gorkunov July 19, 2016, 11:46 a.m.
On Tue, Jul 19, 2016 at 02:26:06PM +0300, Pavel Tikhomirov wrote:
> 
> 
> On 07/19/2016 02:17 PM, Cyrill Gorcunov wrote:
> > On Tue, Jul 19, 2016 at 02:00:20PM +0300, Pavel Tikhomirov wrote:
> > > This member represents kernel.pid_max sysctl it is vz-specific but
> > > lays on pid namespace. To be able to c/r from libvzctl script it is
> > > better put pid_max in ve cgroup, these way we do not need to enter
> > > container root pid namespace to get/set these sysctl.
> > 
> > Wait, kernel.pid_max is not ve-specific (see kernel/sysctl.c in
> > vanilla kernel). Why do we have to c/r it at all?
> 
> It is virtualized only in VZ7(see proc_dointvec_pidmax), so in mainstream it
> is global sysctl unlike our case.

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

p.s. I'm not really follow why this feature is needed in container
at all, i mean the @pid_max virtualization. Presume due to hist. reasons.
Pavel Tikhomirov July 19, 2016, 12:25 p.m.
On 07/19/2016 02:46 PM, Cyrill Gorcunov wrote:
> On Tue, Jul 19, 2016 at 02:26:06PM +0300, Pavel Tikhomirov wrote:
>>
>>
>> On 07/19/2016 02:17 PM, Cyrill Gorcunov wrote:
>>> On Tue, Jul 19, 2016 at 02:00:20PM +0300, Pavel Tikhomirov wrote:
>>>> This member represents kernel.pid_max sysctl it is vz-specific but
>>>> lays on pid namespace. To be able to c/r from libvzctl script it is
>>>> better put pid_max in ve cgroup, these way we do not need to enter
>>>> container root pid namespace to get/set these sysctl.
>>>
>>> Wait, kernel.pid_max is not ve-specific (see kernel/sysctl.c in
>>> vanilla kernel). Why do we have to c/r it at all?
>>
>> It is virtualized only in VZ7(see proc_dointvec_pidmax), so in mainstream it
>> is global sysctl unlike our case.
>
> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
>
> p.s. I'm not really follow why this feature is needed in container
> at all, i mean the @pid_max virtualization. Presume due to hist. reasons.

If the only reason was(as far as I understood 
https://jira.sw.ru/browse/PSBM-6437) to have more pids available on 
host(for pid mapping from containers pids) but not in CT, than actually 
we can instead make kernel.pid_max readonly in CT and we won't need 
c/r-ing it.

Why pid_max is writable in pidns in Upstream is also a riddle - one can 
restrict number of processes to 301 from unprivileged user on hole node, 
like:
unshare -Upm --fork --mount-proc sysctl -w kernel.pid_max=301

>
Kirill Gorkunov July 19, 2016, 1:02 p.m.
On Tue, Jul 19, 2016 at 03:25:08PM +0300, Pavel Tikhomirov wrote:
> > 
> > p.s. I'm not really follow why this feature is needed in container
> > at all, i mean the @pid_max virtualization. Presume due to hist. reasons.
> 
> If the only reason was(as far as I understood
> https://jira.sw.ru/browse/PSBM-6437) to have more pids available on host(for
> pid mapping from containers pids) but not in CT, than actually we can
> instead make kernel.pid_max readonly in CT and we won't need c/r-ing it.
> 
> Why pid_max is writable in pidns in Upstream is also a riddle - one can
> restrict number of processes to 301 from unprivileged user on hole node,
> like:
> unshare -Upm --fork --mount-proc sysctl -w kernel.pid_max=301

Fair enough, thanks!
Konstantin Khorenko Aug. 9, 2016, 1:57 p.m.
Why we had to virtualize "pid_max": https://jira.sw.ru/browse/PCLIN-27054

In brief: x86_64 node + x86 CT + pid_max > 65536 = non-working ps, top because inode numbers become greater than 32bit and compat_sys_getdents()->compat_filldir() fails to handle them

=> as we still can have 32bit Containers on Virtuozzo 7 (although there are no 32bit templates in vz7 - legacy Containers, migrated from PCS6 are still possible)
=> we have to keep sysctl virtualized for now
=> applying the patch for ease of migration.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 07/19/2016 02:46 PM, Cyrill Gorcunov wrote:
> On Tue, Jul 19, 2016 at 02:26:06PM +0300, Pavel Tikhomirov wrote:
>>
>>
>> On 07/19/2016 02:17 PM, Cyrill Gorcunov wrote:
>>> On Tue, Jul 19, 2016 at 02:00:20PM +0300, Pavel Tikhomirov wrote:
>>>> This member represents kernel.pid_max sysctl it is vz-specific but
>>>> lays on pid namespace. To be able to c/r from libvzctl script it is
>>>> better put pid_max in ve cgroup, these way we do not need to enter
>>>> container root pid namespace to get/set these sysctl.
>>>
>>> Wait, kernel.pid_max is not ve-specific (see kernel/sysctl.c in
>>> vanilla kernel). Why do we have to c/r it at all?
>>
>> It is virtualized only in VZ7(see proc_dointvec_pidmax), so in mainstream it
>> is global sysctl unlike our case.
>
> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
>
> p.s. I'm not really follow why this feature is needed in container
> at all, i mean the @pid_max virtualization. Presume due to hist. reasons.
> .
>