[Devel] ms/sched: Guarantee new group-entities always have weight

Submitted by Kirill Tkhai on Oct. 7, 2016, 8:11 a.m.

Details

Message ID 147582774743.23266.4565575957013755686.stgit@localhost.localdomain
State New
Series "ms/sched: Guarantee new group-entities always have weight"
Headers show

Commit Message

Kirill Tkhai Oct. 7, 2016, 8:11 a.m.
There is a node with never scheduled cpu cgroup. Its cfs_rq looks like:

struct cfs_rq {
  load = {
    weight = 0,
    inv_weight = 0
  },
  nr_running = 2,
  throttled = 0,

while its sched_entity has on_rq = 0.

The only way, where not-empty cfs_rq may be deqeued,
is when it was throttled. But now it's not throttled,
this means that unthrottle_cfs_rq() earlier skipped
its queueing. This may happen only
if cfs_rq->load.weight is 0, and we see that.

cfs_rq.load.weight is assigned by update_load_set(),
and it's possible in two places:

update_cfs_shares()
  shares = calc_cfs_shares(cfs_rq, tg); /* returns shares > 0 */
  reweight_entity(shares);
     update_load_set(shares);
and

init_tg_cfs_entry()
  update_load_set(0);

As it's clearly seen, assignment of 0 is possible in only place,
and there is a race, which has already fixed in mainstream.
So, let's backport their patch to fix the problem:

ms commit 0ac9b1c21874 from Paul Turner <pjt@google.com>
"sched: Guarantee new group-entities always have weight"

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
earlty use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_ra(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

https://jira.sw.ru/browse/PSBM-53136

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 kernel/sched/fair.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Patch hide | download patch | download mbox

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index be0c3dd..96fe1d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8039,7 +8039,8 @@  void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	}
 
 	se->my_q = cfs_rq;
-	update_load_set(&se->load, 0);
+	/* guarantee group entities always have weight */
+	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
 
 #ifdef CONFIG_SCHEDSTATS

Comments

Cyrill Gorcunov Oct. 7, 2016, 8:28 a.m.
On Fri, Oct 07, 2016 at 11:11:15AM +0300, Kirill Tkhai wrote:
> There is a node with never scheduled cpu cgroup. Its cfs_rq looks like:
> 
> struct cfs_rq {
>   load = {
>     weight = 0,
>     inv_weight = 0
>   },
>   nr_running = 2,
>   throttled = 0,
> 
> while its sched_entity has on_rq = 0.
> 
> The only way, where not-empty cfs_rq may be deqeued,
> is when it was throttled. But now it's not throttled,
> this means that unthrottle_cfs_rq() earlier skipped
> its queueing. This may happen only
> if cfs_rq->load.weight is 0, and we see that.
> 
> cfs_rq.load.weight is assigned by update_load_set(),
> and it's possible in two places:
> 
> update_cfs_shares()
>   shares = calc_cfs_shares(cfs_rq, tg); /* returns shares > 0 */
>   reweight_entity(shares);
>      update_load_set(shares);
> and
> 
> init_tg_cfs_entry()
>   update_load_set(0);
> 
> As it's clearly seen, assignment of 0 is possible in only place,
> and there is a race, which has already fixed in mainstream.
> So, let's backport their patch to fix the problem:
> 
> ms commit 0ac9b1c21874 from Paul Turner <pjt@google.com>
> "sched: Guarantee new group-entities always have weight"
> 
> Currently, group entity load-weights are initialized to zero. This
> admits some races with respect to the first time they are re-weighted in
> earlty use. ( Let g[x] denote the se for "g" on cpu "x". )
> 
> Suppose that we have root->a and that a enters a throttled state,
> immediately followed by a[0]->t1 (the only task running on cpu[0])
> blocking:
> 
>   put_prev_task(group_cfs_rq(a[0]), t1)
>   put_prev_entity(..., t1)
>   check_cfs_rq_runtime(group_cfs_rq(a[0]))
>   throttle_cfs_rq(group_cfs_rq(a[0]))
> 
> Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
> time:
> 
>   enqueue_task_fair(rq[0], t2)
>   enqueue_entity(group_cfs_rq(b[0]), t2)
>   enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
>   account_entity_enqueue(group_cfs_ra(b[0]), t2)
>   update_cfs_shares(group_cfs_rq(b[0]))
>   < skipped because b is part of a throttled hierarchy >
>   enqueue_entity(group_cfs_rq(a[0]), b[0])
>   ...
> 
> We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
> which violates invariants in several code-paths. Eliminate the
> possibility of this by initializing group entity weight.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> https://jira.sw.ru/browse/PSBM-53136
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>