cgroup/cpuset: emulate cgroup in container

Submitted by Stanislav Kinsburskiy on Dec. 13, 2017, 10:37 a.m.

Details

Message ID 20171213103745.13570.92356.stgit@localhost.localdomain
State New
Series "cgroup/cpuset: emulate cgroup in container"

Commit Message

Stanislav Kinsburskiy Dec. 13, 2017, 10:37 a.m.
Any changes to this cgroup are skipped in a container, but a success code is
returned.
The idea is to fool Docker/Kubernetes.

https://jira.sw.ru/browse/PSBM-58423

This patch obsoletes "ve/proc/cpuset: do not show cpuset in CT"

Signed-off-by: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
---
 kernel/cpuset.c |    9 +++++++++
 1 file changed, 9 insertions(+)


diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 26d88eb..dfac505 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1441,6 +1441,9 @@  static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	struct task_struct *task;
 	int ret;
 
+	if (!ve_is_super(get_exec_env()))
+		return 0;
+
 	mutex_lock(&cpuset_mutex);
 
 	ret = -ENOSPC;
@@ -1470,6 +1473,9 @@  static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 static void cpuset_cancel_attach(struct cgroup *cgrp,
 				 struct cgroup_taskset *tset)
 {
+	if (!ve_is_super(get_exec_env()))
+		return;
+
 	mutex_lock(&cpuset_mutex);
 	cgroup_cs(cgrp)->attach_in_progress--;
 	mutex_unlock(&cpuset_mutex);
@@ -1494,6 +1500,9 @@  static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	struct cpuset *cs = cgroup_cs(cgrp);
 	struct cpuset *oldcs = cgroup_cs(oldcgrp);
 
+	if (!ve_is_super(get_exec_env()))
+		return;
+
 	mutex_lock(&cpuset_mutex);
 
 	/* prepare for attach */
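
For readers outside the thread, the effect of the first hunk can be sketched in userspace. This is only a toy model, not kernel code: `struct model_cpuset`, `model_in_container` and `model_can_attach` are hypothetical names, and plain bitmasks stand in for the real cpumask and nodemask. It assumes, going by the `ret = -ENOSPC;` context line in the hunk, that the code being skipped is what rejects an attach to a cpuset that has no CPUs or memory nodes configured.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct cpuset: only the two masks that the
 * modeled -ENOSPC check looks at, stored as plain bitmasks. */
struct model_cpuset {
	unsigned long cpus_allowed;
	unsigned long mems_allowed;
};

/* Hypothetical stand-in for !ve_is_super(get_exec_env()). */
static bool model_in_container;

/* Models cpuset_can_attach() with the guard from the patch in place. */
static int model_can_attach(const struct model_cpuset *cs)
{
	/* Inside a container: skip every check and report success. */
	if (model_in_container)
		return 0;

	/* On the host: an empty cpuset cannot accept tasks. */
	if (cs->cpus_allowed == 0 || cs->mems_allowed == 0)
		return -ENOSPC;

	return 0;
}
```

In this model, with `model_in_container` set, attaching to a freshly created and still empty cpuset reports success, whereas the same call on the host returns -ENOSPC.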

Comments

Pavel Tikhomirov Dec. 13, 2017, 12:43 p.m.
Personally, I don't like this, as we still have no answer to "If cpusets 
are optional for docker, why can't k8s work without them?" It seems there 
is not enough explanation in VZAP-31.

We also need to revert the patch below to show cpuset in CT:
commit 5160bd34c9bd ("ve/proc/cpuset: do not show cpuset in CT")

It seems I can still attach a process to a nested cgroup in CT with this 
patch:

CT-6ecd9be1 /# cat /proc/cgroups | grep cpuset
cpuset	16	1	1
CT-6ecd9be1 /# ls /sys/fs/cgroup/cpuset/cpuset.cpus
/sys/fs/cgroup/cpuset/cpuset.cpus
CT-6ecd9be1 /# mkdir /sys/fs/cgroup/cpuset/test
CT-6ecd9be1 /# sleep 1000 &
[1] 678
CT-6ecd9be1 /# echo  678 > /sys/fs/cgroup/cpuset/test/tasks
CT-6ecd9be1 /# cat /sys/fs/cgroup/cpuset/test/tasks
678

Stanislav Kinsburskiy Dec. 13, 2017, 1:24 p.m.
Hi Pavel, please see my comments/questions below.

13.12.2017 13:43, Pavel Tikhomirov wrote:
> Personally, I don't like this, as we still have no answer to "If cpusets are optional for docker, why can't k8s work without them?" It seems there is not enough explanation in VZAP-31.
> 

Well... I have to admit that I don't like it either.
Frankly, I don't like the whole idea of putting kuber into a container.
And the reason is simple: CT is a cheating technique. Those who like "Russian dolls" should use VMs with nested virtualization.
But the truth is that we do care about our customers (not sure why, since they pay us less and less).
And because of this we've already put so much sh*t into our kernel that I got tired of throwing it away on each major rebase a long time ago.
But all of this is a digression.
There is a task, and this is the fix.

> We also need to revert the patch below to show cpuset in CT:
> commit 5160bd34c9bd ("ve/proc/cpuset: do not show cpuset in CT")
> 

Sure! It's mentioned in the patch description.

> It seems I can still attach a process to a nested cgroup in CT with this patch:
> 
> CT-6ecd9be1 /# cat /proc/cgroups | grep cpuset
> cpuset    16    1    1
> CT-6ecd9be1 /# ls /sys/fs/cgroup/cpuset/cpuset.cpus
> /sys/fs/cgroup/cpuset/cpuset.cpus
> CT-6ecd9be1 /# mkdir /sys/fs/cgroup/cpuset/test
> CT-6ecd9be1 /# sleep 1000 &
> [1] 678
> CT-6ecd9be1 /# echo  678 > /sys/fs/cgroup/cpuset/test/tasks
> CT-6ecd9be1 /# cat /sys/fs/cgroup/cpuset/test/tasks
> 678
> 

Isn't it wonderful? :)
Poor me, I have to admit that I didn't know that this task would even be visible in the nested cgroup... :(
I thought that the emulation would be less effective.

Nevertheless, if you have a better way to solve the issue, please share.

Pavel Tikhomirov Dec. 13, 2017, 3:18 p.m.
On 12/13/2017 04:24 PM, Stanislav Kinsburskiy wrote:
> Hi Pavel, please see my comments/questions below.
> 
> 13.12.2017 13:43, Pavel Tikhomirov wrote:
>> Personally, I don't like this, as we still have no answer to "If cpusets are optional for docker, why can't k8s work without them?" It seems there is not enough explanation in VZAP-31.
>>
> 
> Well... I have to admit that I don't like it either.
> Frankly, I don't like the whole idea of putting kuber into a container.
> And the reason is simple: CT is a cheating technique. Those who like "Russian dolls" should use VMs with nested virtualization.
> But the truth is that we do care about our customers (not sure why, since they pay us less and less).
> And because of this we've already put so much sh*t into our kernel that I got tired of throwing it away on each major rebase a long time ago.
> But all of this is a digression.
> There is a task, and this is the fix.

Sure, I agree.

> 
>> We also need to revert the patch below to show cpuset in CT:
>> commit 5160bd34c9bd ("ve/proc/cpuset: do not show cpuset in CT")
>>
> 
> Sure! It's mentioned in the patch description.

Sorry, missed it, now I see. Thanks for pointing that out!

> 
>> It seems I can still attach a process to a nested cgroup in CT with this patch:
>>
>> CT-6ecd9be1 /# cat /proc/cgroups | grep cpuset
>> cpuset    16    1    1
>> CT-6ecd9be1 /# ls /sys/fs/cgroup/cpuset/cpuset.cpus
>> /sys/fs/cgroup/cpuset/cpuset.cpus
>> CT-6ecd9be1 /# mkdir /sys/fs/cgroup/cpuset/test
>> CT-6ecd9be1 /# sleep 1000 &
>> [1] 678
>> CT-6ecd9be1 /# echo  678 > /sys/fs/cgroup/cpuset/test/tasks
>> CT-6ecd9be1 /# cat /sys/fs/cgroup/cpuset/test/tasks
>> 678
>>
> 
> Isn't it wonderful? :)
> Poor me, I have to admit that I didn't know that this task would even be visible in the nested cgroup... :(
> I thought that the emulation would be less effective.

Maybe I don't understand your patch completely, but my guess was that you 
want to fake attaching to cpuset cgroups, so that the attach does not 
actually move the process into the cgroup but just says OK. Then the 
configuration of nested cgroups would not matter, as there would be no 
tasks in them, and we would be OK. Now, AFAICS, we have a partial state 
where the task is moved to the other cgroup but its cpuset is left unchanged.

I can still apply a cpuset to a task from the CT (sorry for the many lines):
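
The partial state described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel's data structures: every `model_*` name is hypothetical, CPU sets are plain bitmasks, and removal from the previous cgroup is left out. It assumes that the cgroup core still records the membership while the cpuset attach callback is skipped, and that a later write to cpuset.cpus is pushed to every task listed in the cgroup.

```c
#include <stddef.h>

#define MODEL_MAX_TASKS 8

struct model_cgroup;

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_task {
	int pid;
	unsigned long cpus_allowed;	/* bitmask of CPUs the task may use */
	struct model_cgroup *cgrp;	/* membership kept by the cgroup core */
};

struct model_cgroup {
	unsigned long cpus;		/* cpuset.cpus as a bitmask */
	struct model_task *tasks[MODEL_MAX_TASKS];
	size_t nr_tasks;
};

/* Models writing a PID to "tasks" inside a container with the patch
 * applied: membership changes, the cpuset attach step does not run. */
static int model_attach_in_ct(struct model_cgroup *cg, struct model_task *t)
{
	if (cg->nr_tasks == MODEL_MAX_TASKS)
		return -1;
	cg->tasks[cg->nr_tasks++] = t;
	t->cgrp = cg;
	/* A real attach would apply cg->cpus to the task here; skipped. */
	return 0;
}

/* Models writing cpuset.cpus: the new mask reaches every member task. */
static void model_write_cpus(struct model_cgroup *cg, unsigned long mask)
{
	size_t i;

	cg->cpus = mask;
	for (i = 0; i < cg->nr_tasks; i++)
		cg->tasks[i]->cpus_allowed = mask;
}
```

In this model a task attached via `model_attach_in_ct()` keeps its old mask even though it is listed in the cgroup, and the first `model_write_cpus()` call afterwards changes it, which is the sequence the numbered steps below walk through on a real host.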

1) Host is idle.

[root@silo ~]# mpstat 1 1 -P ALL
Linux 3.10.0-693.11.1.ovz.39.4 (silo.sw.ru) 	12/13/2017 	_x86_64_	(4 CPU)

06:02:08 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:02:09 PM  all    0.00    0.00    0.00    0.00    0.00    0.25    0.00    0.00    0.00   99.75
06:02:09 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:02:09 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:02:09 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:02:09 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.00    0.00    0.00    0.00    0.00    0.25    0.00    0.00    0.00   99.75
Average:       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

2) Run something in CT:
CT-6ecd9be1 /# dd if=/dev/zero of=/dev/null &
[1] 758

3) It burns cpu0:

[root@silo ~]# mpstat 1 1 -P ALL
Linux 3.10.0-693.11.1.ovz.39.4 (silo.sw.ru) 	12/13/2017 	_x86_64_	(4 CPU)

06:02:18 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:02:19 PM  all    9.75    0.00   15.50    0.00    0.00    0.00    0.00    0.00    0.00   74.75
06:02:19 PM    0   38.61    0.00   61.39    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:19 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:02:19 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:02:19 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    9.75    0.00   15.50    0.00    0.00    0.00    0.00    0.00    0.00   74.75
Average:       0   38.61    0.00   61.39    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

4) Create and set up a cgroup and attach 758 to it in CT:

CT-6ecd9be1 /# mkdir /sys/fs/cgroup/cpuset/test
CT-6ecd9be1 /# echo 3 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
CT-6ecd9be1 /# echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems
CT-6ecd9be1 /# echo 758  >/sys/fs/cgroup/cpuset/test/tasks
CT-6ecd9be1 /# cat /sys/fs/cgroup/cpuset/test/tasks
758

5) It still burns cpu0; the cpuset is not applied to 758!

[root@silo ~]# mpstat 1 1 -P ALL
Linux 3.10.0-693.11.1.ovz.39.4 (silo.sw.ru) 	12/13/2017 	_x86_64_	(4 CPU)

06:03:06 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:03:07 PM  all    9.77    0.00   15.29    0.00    0.00    0.00    0.00    0.00    0.00   74.94
06:03:07 PM    0   39.00    0.00   61.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:03:07 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:03:07 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:03:07 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    9.77    0.00   15.29    0.00    0.00    0.00    0.00    0.00    0.00   74.94
Average:       0   39.00    0.00   61.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

6) Change cpuset.cpus for this cgroup:

echo 2 > /sys/fs/cgroup/cpuset/test/cpuset.cpus

7) Now it burns cpu2:

[root@silo ~]# mpstat 1 1 -P ALL
Linux 3.10.0-693.11.1.ovz.39.4 (silo.sw.ru) 	12/13/2017 	_x86_64_	(4 CPU)

06:03:29 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
06:03:30 PM  all    9.48    0.00   15.71    0.00    0.00    0.00    0.00    0.00    0.00   74.81
06:03:30 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:03:30 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:03:30 PM    2   37.00    0.00   63.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:03:30 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    9.48    0.00   15.71    0.00    0.00    0.00    0.00    0.00    0.00   74.81
Average:       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       2   37.00    0.00   63.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

So the change to cpuset.cpus is actually applied to the task; that's what I 
mean. Correct me if I'm wrong.
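
Besides watching mpstat on the host, the allowed CPU list of a task can be read directly from /proc. The helper below is a minimal sketch, not part of the patch: `read_cpus_allowed_list` is a hypothetical name, and it relies only on the `Cpus_allowed_list:` line that Linux exposes in /proc/<pid>/status. What it would print for PID 758 at each step above was not captured in this thread, so no expected values are given.

```c
#include <stdio.h>
#include <string.h>

/* Copies the Cpus_allowed_list value of the given task into buf.
 * pid is "self" or a decimal PID string such as "758".
 * Returns 0 on success, -1 if the file or the line cannot be read. */
static int read_cpus_allowed_list(const char *pid, char *buf, size_t len)
{
	static const char key[] = "Cpus_allowed_list:";
	char path[64];
	char line[512];
	FILE *f;
	int n;
	int ret = -1;

	if (len == 0)
		return -1;
	n = snprintf(path, sizeof(path), "/proc/%s/status", pid);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		const char *val;

		if (strncmp(line, key, sizeof(key) - 1) != 0)
			continue;
		/* Skip the key and the whitespace that follows it. */
		val = line + sizeof(key) - 1;
		val += strspn(val, " \t");
		snprintf(buf, len, "%s", val);
		buf[strcspn(buf, "\n")] = '\0';
		ret = 0;
		break;
	}
	fclose(f);
	return ret;
}
```

Calling it with the PID of the dd process before step 4, after step 4 and after step 6 would show whether the mask follows the attach or only the later write to cpuset.cpus.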

> 
> Nevertheless, if you have a better way to solve the issue, please share.

I see two options: 1) the emulation way; 2) faking the attach to cpuset 
cgroups in CT, similar to what you do, but __completely__, so that the 
task is left in the container's root cgroup.

> 
Stanislav Kinsburskiy Dec. 13, 2017, 4:02 p.m.
Nice catch, thanks!
I'll first try to tweak this gently, so that it looks like the cpuset cgroup works while it actually doesn't.

13.12.2017 16:18, Pavel Tikhomirov wrote:
> So the change to cpuset.cpus is actually applied to the task; that's what I mean. Correct me if I'm wrong.