[Devel,RHEL7,COMMIT] kvm/x86: skip async_pf when in guest mode

Submitted by Konstantin Khorenko on Dec. 2, 2016, 2:35 p.m.

Details

Message ID 201612021435.uB2EZfUJ023099@finist_cl7.x64_64.work.ct
State New
Series "kvm/x86: skip async_pf when in guest mode"

Commit Message

Konstantin Khorenko Dec. 2, 2016, 2:35 p.m.
The commit is pushed to "branch-rh7-3.10.0-327.36.1.vz7.20.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.36.1.vz7.20.9
------>
commit 5173f45a28cdf3d5808e236eab882273a760a363
Author: Roman Kagan <rkagan@virtuozzo.com>
Date:   Fri Dec 2 18:35:41 2016 +0400

    kvm/x86: skip async_pf when in guest mode
    
    Async pagefault machinery assumes communication with L1 guests only: all
    the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
    currently doesn't check if the vCPU is running L1 or L2, and may inject
    a #PF into whatever context is currently executing.
    
    To reproduce the problem, use a host with swap enabled, run a VM on it,
    run a nested VM on top, and set RSS limit for L1 on the host via
    /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
    to swap it out (you may need to tighten and release it once or twice, or
    create some memory load inside L1).  Very quickly L2 guest starts
    receiving pagefaults with bogus %cr2 (apf tokens from the host
    actually), and L1 guest starts accumulating tasks stuck in D state in
    kvm_async_pf_task_wait.
    
    To avoid that, only do async_pf stuff when executing L1 guest.
    
    Note: this patch only fixes x86; other async_pf-capable arches may also
    need something similar.
    
    Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
    Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    (cherry picked from commit 80e2a7bb8d7050d2ea6d8961c526a65d30d5eb08)
    
    https://jira.sw.ru/browse/PSBM-54491
---
 arch/x86/kvm/mmu.c | 2 +-
 arch/x86/kvm/x86.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)


diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 17973ed..c82bf5f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3481,7 +3481,7 @@  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && can_do_async_pf(vcpu)) {
+	if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(gva, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 78ea28c..4edeb8a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6780,7 +6780,8 @@  static int __vcpu_run(struct kvm_vcpu *vcpu)
 			++vcpu->stat.request_irq_exits;
 		}
 
-		kvm_check_async_pf_completion(vcpu);
+		if (!is_guest_mode(vcpu))
+			kvm_check_async_pf_completion(vcpu);
 
 		if (signal_pending(current)) {
 			r = -EINTR;
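
The two hunks above add the same guard at two call sites. As a rough illustration only, the following is a minimal user-space sketch of that logic. It is not kernel code: struct toy_vcpu and every toy_* function are hypothetical stand-ins invented for this example, and the counters merely model "an async pagefault was queued" and "a completion was delivered to L1".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for struct kvm_vcpu. */
struct toy_vcpu {
	bool guest_mode;      /* true while the vCPU is executing L2 */
	bool apf_enabled;     /* models can_do_async_pf(): L1 has set up async_pf */
	int  pending_apf;     /* async pagefaults queued, not yet completed */
	int  delivered_to_l1; /* completions handed to L1 */
};

/* Models is_guest_mode(): is the vCPU currently running a nested (L2) guest? */
static bool toy_is_guest_mode(const struct toy_vcpu *v)
{
	return v->guest_mode;
}

/*
 * Models the patched condition in try_async_pf(): an async pagefault is
 * only queued when the fault is not a prefault, the vCPU is running L1,
 * and L1 has enabled the mechanism. Returning false stands for "handle
 * the fault synchronously instead".
 */
static bool toy_try_async_pf(struct toy_vcpu *v, bool prefault)
{
	if (!prefault && !toy_is_guest_mode(v) && v->apf_enabled) {
		v->pending_apf++;
		return true;
	}
	return false;
}

/*
 * Models the patched call site in __vcpu_run(): completions are only
 * processed while L1 is executing, so they stay queued during L2.
 */
static void toy_check_completion(struct toy_vcpu *v)
{
	if (toy_is_guest_mode(v))
		return;
	v->delivered_to_l1 += v->pending_apf;
	v->pending_apf = 0;
}

/* Small self-check: nothing is queued or delivered while in guest mode. */
static void toy_selfcheck(void)
{
	struct toy_vcpu v = { .guest_mode = true, .apf_enabled = true };

	assert(!toy_try_async_pf(&v, false));
	toy_check_completion(&v);
	assert(v.pending_apf == 0 && v.delivered_to_l1 == 0);
}
```

In this toy model a fault taken while guest_mode is set always falls back to the synchronous path, and a completion queued earlier waits until the vCPU is back in L1, which is the behaviour the commit message describes.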

Comments

Konstantin Khorenko Dec. 7, 2016, 3:27 p.m.
Please consider this for RK.

Den, let us know if you don't think it's needed.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

Denis Lunev Dec. 7, 2016, 4:06 p.m.
On 12/07/2016 06:27 PM, Konstantin Khorenko wrote:
> Please consider this for RK.
>
> Den, let us know if you don't think it's needed.
>
Which branch are you speaking about? For RK in UP3?
How can we be sure that all QA nodes will be updated
in this case? They are running unreleased kernels.

Den

Vasily Averin Dec. 7, 2016, 4:11 p.m.
Den,
it's for our customers,
both for VZ7-rtm (vz7.15.2) and VZ7-u1 (vz7.18.7) kernels.
Is it important for customers?

thank you,
	Vasily Averin

Denis Lunev Dec. 7, 2016, 4:13 p.m.
On 12/07/2016 07:11 PM, Vasily Averin wrote:
> Den,
> it's for our customers,
> both for VZ7-rtm (vz7.15.2) and VZ7-u1 (vz7.18.7) kernels 
> Is it important for customers?
>
> thank you,
> 	Vasily Averin
No, we do not care. This functionality is not announced.
We do not officially support nesting for customers.

Den