[RFC,rh7] vznetstat: Move the code to drop redundant skb marks to *_xmit() functions #PSBM-122082

Submitted by Konstantin Khorenko on Nov. 16, 2020, 7:40 p.m.

Details

Message ID 20201116194050.622-1-khorenko@virtuozzo.com
State New
Series "vznetstat: Move the code to drop redundant skb marks to *_xmit() functions #PSBM-122082"
Headers show

Commit Message

Konstantin Khorenko Nov. 16, 2020, 7:40 p.m.
We have added code to clear vz specific skb marks, but it's done in the
hook NF_INET_LOCAL_IN (which is not the very first one) and if someone
sets a mark with xt_mark in PREROUTING chain, the mark will be set in
NF_INET_PRE_ROUTING hook (which is prior to NF_INET_LOCAL_IN) and
dropped by our hook venet_acct_in_hook() in later NF_INET_LOCAL_IN hook.

Example of the rule (taken from Portainer):
 -A PREROUTING -p tcp -m tcp --dport 443 -j MARK --set-xmark 0x100/0xffffffff

Calltrace example where the skb mark is set by the xt_mark rule:

telnet 908809 [001] probe:mark_tg: mark=0x100 skb=0xffff8b53a4ec4af8
        mark_tg+0x21 (xt_mark.ko.xz)
        ipt_do_table+0x30b ([ip_tables])
        iptable_mangle_hook+0x43 ([iptable_mangle])
 // here we have ops->hooknum == 0 == NF_INET_PRE_ROUTING
 // priority=-150 == NF_IP_PRI_MANGLE

        nf_iterate+0x98
        nf_hook_slow+0xa8
        ip_rcv+0x339
        __netif_receive_skb_core+0x729
        __netif_receive_skb+0x18
        process_backlog+0xae
        net_rx_action+0x27f
        __do_softirq+0x125
        call_softirq+0x1c
        do_softirq+0x65
        __local_bh_enable_ip+0x9b
        local_bh_enable+0x17
        ip_finish_output+0x284
        ip_output+0x7b
        ip_local_out_sk+0x37
        ip_queue_xmit+0x144
        tcp_transmit_skb+0x4e4
        tcp_connect+0x70d
        tcp_v4_connect+0x345
        __inet_stream_connect+0xc3
        inet_stream_connect+0x38
        SYSC_connect+0xed
        sys_connect+0xe

Calltrace example where the skb mark is dropped by vznetstat hook:

telnet 908809 [001] probe:venet_acct_classify_add_incoming: veid=0x142d mark=0x100 skb=0xffff8b53a4ec4af8
        venet_acct_classify_add_incoming+0x1 (vznetstat.ko.xz)
 // it's hooknum == NF_INET_LOCAL_IN == 1
 // priority      = NF_IP_PRI_FIRST

        nf_iterate+0x98
        nf_hook_slow+0xa8
        ip_local_deliver+0xc6
        ip_rcv_finish+0x90
        ip_rcv+0x2c0
        __netif_receive_skb_core+0x729
        __netif_receive_skb+0x18
        process_backlog+0xae
        net_rx_action+0x27f
        __do_softirq+0x125
        call_softirq+0x1c
        do_softirq+0x65
        __local_bh_enable_ip+0x9b
        local_bh_enable+0x17
        ip_finish_output+0x284
        ip_output+0x7b
        ip_local_out_sk+0x37
        ip_queue_xmit+0x144
        tcp_transmit_skb+0x4e4
        tcp_connect+0x70d
        tcp_v4_connect+0x345
        __inet_stream_connect+0xc3
        inet_stream_connect+0x38
        SYSC_connect+0xed
        sys_connect+0xe

What do we do to fix this issue:

1) Let's set skb marks only if the skb is going to be sent via venet or
   veth interface which are known to be a boundary between VE0 and VE.

   Well, in fact we need to set marks only on VE->VE0 path, but in
   venet acct "out" hook seems it's impossible to detect the direction,
   so we have to put marks on both directions.

   We could probably move the code for setting marks somewhere else but
   netfilter "out" hook is a convenient place for VE traffic accounting
   and as a side effect we get the correct accounting vz traffic class
   which is needed for setting correct skb mark as well,
   so let's we just here put marks on both directions.

2) As "in" netfilter hook does not provide info about the skb direction
   (VE0->VE or VE->VE0), we have to move the code which drops mark from
   VE0->VE direction skbs to somewhere else.

   We do have knowledge of direction in *_xmit() functions,
   so let's drop redundant packet marks in venet_xmit() and veth_xmit()
   functions (as we don't currently have other network interfaces
   between VE<->VE0).

3) As a side effect let's fix vz traffic accounting a bit:
   before this patch any traffic in Containers (except localhost) was
   accounted. For example the traffic between Docker Containers inside
   a single VZ Container - was accounted.
   Add a check into "in" and "out" hook to account data only in case
   "in"/"out" interfaces have NETIF_F_VENET features (i.e. are
   boundaries between VE<->VE0).

   Note: the VE0 accounting remains the same - there can be extra
   accounting in case of Docker Containers on host for example.

Fixes: f4943221d710 ("vznetstat: Clear skb->mark on entering to VE's IP
stack")
https://jira.sw.ru/browse/PSBM-122082

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>

Note: the very final version (added extra check for letting VE0
accounting unchanged) not tested yet, will test it carefully tomorrow.

Comments are very welcome!

---
 drivers/net/venetdev.c             |  6 ++++++
 drivers/net/veth.c                 |  8 ++++++++
 kernel/ve/vznetstat/ip_vznetstat.c | 20 ++++++++++++++++++++
 kernel/ve/vznetstat/vznetstat.c    |  8 +++++---
 4 files changed, 39 insertions(+), 3 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 28d06bca1f86..5385e3e7f30f 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -464,6 +464,12 @@  static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	stats = venet_stats(dev, smp_processor_id());
 	ve = dev_net(dev)->owner_ve;
+	/*
+	 * If the packet goes in VE0 -> VE direction, drop the skb mark used
+	 * for vz traffic shaping.
+	 */
+	if (ve_is_super(ve))
+		skb->mark = 0;
 
 	if (skb->protocol == __constant_htons(ETH_P_IP)) {
 		struct iphdr *iph;
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 2fc8b913a49a..150142843777 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -149,6 +149,14 @@  static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	/*
+	 * skb is going from VE0 to VE via veth =>
+	 * drop the vz spesific mark used for traffic shaping
+	 */
+	if (dev->features & NETIF_F_VENET &&
+	    ve_is_super(dev_net(dev)->owner_ve))
+		skb->mark = 0;
+
 	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
diff --git a/kernel/ve/vznetstat/ip_vznetstat.c b/kernel/ve/vznetstat/ip_vznetstat.c
index 5ea978d6dd88..a33c41c05dba 100644
--- a/kernel/ve/vznetstat/ip_vznetstat.c
+++ b/kernel/ve/vznetstat/ip_vznetstat.c
@@ -43,6 +43,17 @@  static unsigned int venet_acct_in_hook(const struct nf_hook_ops *hook,
 	/* Skip loopback dev */
 	if (in == dev_net(in)->loopback_dev)
 		goto out;
+	/*
+	 * For VE:  let's account only traffic which crosses the coundary
+	 *	    VE0->VE.
+	 *
+	 * For VE0: let's account all the traffic as it always was.
+	 * There can be extra accounting in case of traffic between
+	 * Docker Containers in VE0.
+	 */
+	if (!ve_is_super(in->nd_net->owner_ve) &&
+	    !(in->features & NETIF_F_VENET))
+		goto out;
 
 #if VZNS_DEBUG
 	printk("%s: in %s, out %s, size %d, in->owner_env=%s\n",
@@ -92,6 +103,15 @@  static unsigned int venet_acct_out_hook(const struct nf_hook_ops *hook,
 	 */
 	if (dst && (dst->dev->flags & IFF_LOOPBACK))
 		goto out;
+	/*
+	 * Don't account and mark VE traffic if it does not cross VE/VE0
+	 * boundary.
+	 * For VE0 there can be extra accounting in case of traffic
+	 * between Docker Containers in VE0.
+	 */
+	if (!(ve_is_super(out->nd_net->owner_ve)) &&
+	    dst && !(dst->dev->features & NETIF_F_VENET))
+		goto out;
 
 	/* Paranoia */
 	if (unlikely(!pskb_may_pull(skb, sizeof(struct iphdr))))
diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
index aa8d007adbbe..e03d98df19ef 100644
--- a/kernel/ve/vznetstat/vznetstat.c
+++ b/kernel/ve/vznetstat/vznetstat.c
@@ -726,10 +726,12 @@  void venet_acct_classify_add_incoming(struct venet_stat *stat, struct sk_buff *s
 {
 	acct_one_skb(stat, skb, ACCT_IN, venet_acct_skb_size(skb));
 	/*
-	 * Every incomming skb must have zero mark, since here is its first
-	 * come into VE's IP stack.
+	 * We do set mark on skb in venet acct "out" hook so it would be
+	 * reasonable to drop redundant marks in venet acct "in" hook.
+	 * But seems it's impossible to understand the direction of
+	 * packet here (if it goes VE->VE0 or VE0->VE), thus mark is
+	 * dropped in venet_xmit() and veth_xmit() instead.
 	 */
-	venet_clear_mark(skb);
 }
 
 static inline void venet_acct_mark(struct venet_stat *stat,

Comments

Vasily Averin Nov. 17, 2020, 5:55 a.m.
See comments below
On 11/16/20 10:40 PM, Konstantin Khorenko wrote:
> We have added code to clear vz specific skb marks, but it's done in the
> hook NF_INET_LOCAL_IN (which is not the very first one) and if someone
> sets a mark with xt_mark in PREROUTING chain, the mark will be set in
> NF_INET_PRE_ROUTING hook (which is prior to NF_INET_LOCAL_IN) and
> dropped by our hook venet_acct_in_hook() in later NF_INET_LOCAL_IN hook.
> 

> diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
> index 28d06bca1f86..5385e3e7f30f 100644
> --- a/drivers/net/venetdev.c
> +++ b/drivers/net/venetdev.c
> @@ -464,6 +464,12 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	stats = venet_stats(dev, smp_processor_id());
>  	ve = dev_net(dev)->owner_ve;
> +	/*
> +	 * If the packet goes in VE0 -> VE direction, drop the skb mark used
> +	 * for vz traffic shaping.

Please make comment more clear.
Why we should drop vz traffic shaping mark here?
You know that it is used for outgoing traffic only, but reader does not.

> diff --git a/kernel/ve/vznetstat/ip_vznetstat.c b/kernel/ve/vznetstat/ip_vznetstat.c
> index 5ea978d6dd88..a33c41c05dba 100644
> --- a/kernel/ve/vznetstat/ip_vznetstat.c
> +++ b/kernel/ve/vznetstat/ip_vznetstat.c
> @@ -92,6 +103,15 @@ static unsigned int venet_acct_out_hook(const struct nf_hook_ops *hook,
>  	 */
>  	if (dst && (dst->dev->flags & IFF_LOOPBACK))
>  		goto out;
> +	/*
> +	 * Don't account and mark VE traffic if it does not cross VE/VE0
> +	 * boundary.
> +	 * For VE0 there can be extra accounting in case of traffic
> +	 * between Docker Containers in VE0.
> +	 */
> +	if (!(ve_is_super(out->nd_net->owner_ve)) &&
> +	    dst && !(dst->dev->features & NETIF_F_VENET))
> +		goto out;

Are you sure that we can use dst for correct "outgoing traffic"-check here?
Can be traffic re-routed after this chain?

I mean the following scenario:
dst can  be "internal" first but the skb is rerouted to outside
or vice versa:
when "external" traffic is re-routed to local netdevices?

Thank you,
	Vasily Averin
Konstantin Khorenko Nov. 17, 2020, 10:07 a.m.
On 11/17/2020 08:55 AM, Vasily Averin wrote:
> See comments below
> On 11/16/20 10:40 PM, Konstantin Khorenko wrote:
>> We have added code to clear vz specific skb marks, but it's done in the
>> hook NF_INET_LOCAL_IN (which is not the very first one) and if someone
>> sets a mark with xt_mark in PREROUTING chain, the mark will be set in
>> NF_INET_PRE_ROUTING hook (which is prior to NF_INET_LOCAL_IN) and
>> dropped by our hook venet_acct_in_hook() in later NF_INET_LOCAL_IN hook.
>>
>
>> diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
>> index 28d06bca1f86..5385e3e7f30f 100644
>> --- a/drivers/net/venetdev.c
>> +++ b/drivers/net/venetdev.c
>> @@ -464,6 +464,12 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>  	stats = venet_stats(dev, smp_processor_id());
>>  	ve = dev_net(dev)->owner_ve;
>> +	/*
>> +	 * If the packet goes in VE0 -> VE direction, drop the skb mark used
>> +	 * for vz traffic shaping.
>
> Please make comment more clear.
> Why we should drop vz traffic shaping mark here?
> You know that it is used for outgoing traffic only, but reader does not.

Ok, will do in the next version.

>
>> diff --git a/kernel/ve/vznetstat/ip_vznetstat.c b/kernel/ve/vznetstat/ip_vznetstat.c
>> index 5ea978d6dd88..a33c41c05dba 100644
>> --- a/kernel/ve/vznetstat/ip_vznetstat.c
>> +++ b/kernel/ve/vznetstat/ip_vznetstat.c
>> @@ -92,6 +103,15 @@ static unsigned int venet_acct_out_hook(const struct nf_hook_ops *hook,
>>  	 */
>>  	if (dst && (dst->dev->flags & IFF_LOOPBACK))
>>  		goto out;
>> +	/*
>> +	 * Don't account and mark VE traffic if it does not cross VE/VE0
>> +	 * boundary.
>> +	 * For VE0 there can be extra accounting in case of traffic
>> +	 * between Docker Containers in VE0.
>> +	 */
>> +	if (!(ve_is_super(out->nd_net->owner_ve)) &&
>> +	    dst && !(dst->dev->features & NETIF_F_VENET))
>> +		goto out;
>
> Are you sure that we can use dst for correct "outgoing traffic"-check here?
> Can be traffic re-routed after this chain?
>
> I mean the following scenario:
> dst can  be "internal" first but the skb is rerouted to outside
> or vice versa:
> when "external" traffic is re-routed to local netdevices?

1) i do not change the place for accounting and setting mark by this patch,
    so i definitely don't make it worse.

2) i do see packets in venet_acct_out_hook() with "venet0" "out" net device and
    "lo" net device in skb->dst, so at least _some_ rerouting to local netdevice
    is already done up to the moment.

So the only question left - can dst change after NF_INET_LOCAL_OUT netfilter hook?
Cannot find the answer up to the moment, will dig further.

+ i've just found that i forgot to handle ipv6 venet hooks in the similar manner.
Konstantin Khorenko Nov. 17, 2020, 11:54 a.m.
On 11/17/2020 01:07 PM, Konstantin Khorenko wrote:
> On 11/17/2020 08:55 AM, Vasily Averin wrote:
>> See comments below
>> On 11/16/20 10:40 PM, Konstantin Khorenko wrote:
>>> We have added code to clear vz specific skb marks, but it's done in the
>>> hook NF_INET_LOCAL_IN (which is not the very first one) and if someone
>>> sets a mark with xt_mark in PREROUTING chain, the mark will be set in
>>> NF_INET_PRE_ROUTING hook (which is prior to NF_INET_LOCAL_IN) and
>>> dropped by our hook venet_acct_in_hook() in later NF_INET_LOCAL_IN hook.
>>>
>>
>>> diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
>>> index 28d06bca1f86..5385e3e7f30f 100644
>>> --- a/drivers/net/venetdev.c
>>> +++ b/drivers/net/venetdev.c
>>> @@ -464,6 +464,12 @@ static int venet_xmit(struct sk_buff *skb, struct net_device *dev)
>>>
>>>  	stats = venet_stats(dev, smp_processor_id());
>>>  	ve = dev_net(dev)->owner_ve;
>>> +	/*
>>> +	 * If the packet goes in VE0 -> VE direction, drop the skb mark used
>>> +	 * for vz traffic shaping.
>>
>> Please make comment more clear.
>> Why we should drop vz traffic shaping mark here?
>> You know that it is used for outgoing traffic only, but reader does not.
>
> Ok, will do in the next version.
>
>>
>>> diff --git a/kernel/ve/vznetstat/ip_vznetstat.c b/kernel/ve/vznetstat/ip_vznetstat.c
>>> index 5ea978d6dd88..a33c41c05dba 100644
>>> --- a/kernel/ve/vznetstat/ip_vznetstat.c
>>> +++ b/kernel/ve/vznetstat/ip_vznetstat.c
>>> @@ -92,6 +103,15 @@ static unsigned int venet_acct_out_hook(const struct nf_hook_ops *hook,
>>>  	 */
>>>  	if (dst && (dst->dev->flags & IFF_LOOPBACK))
>>>  		goto out;
>>> +	/*
>>> +	 * Don't account and mark VE traffic if it does not cross VE/VE0
>>> +	 * boundary.
>>> +	 * For VE0 there can be extra accounting in case of traffic
>>> +	 * between Docker Containers in VE0.
>>> +	 */
>>> +	if (!(ve_is_super(out->nd_net->owner_ve)) &&
>>> +	    dst && !(dst->dev->features & NETIF_F_VENET))
>>> +		goto out;
>>
>> Are you sure that we can use dst for correct "outgoing traffic"-check here?
>> Can be traffic re-routed after this chain?
>>
>> I mean the following scenario:
>> dst can  be "internal" first but the skb is rerouted to outside
>> or vice versa:
>> when "external" traffic is re-routed to local netdevices?
>
> 1) i do not change the place for accounting and setting mark by this patch,
>     so i definitely don't make it worse.
>
> 2) i do see packets in venet_acct_out_hook() with "venet0" "out" net device and
>     "lo" net device in skb->dst, so at least _some_ rerouting to local netdevice
>     is already done up to the moment.
>
> So the only question left - can dst change after NF_INET_LOCAL_OUT netfilter hook?
> Cannot find the answer up to the moment, will dig further.

Well, the code is ... not trivial.
There are places (most of cases) where dst is changed in NF_INET_LOCAL_OUT hook,
but our venet hook goes with NF_IP_PRI_LAST priority and no other hook uses that priority.
So this is fine.

Nevertheless there are cases when dst is changed after this case, for example in IPSec:

iptable_nat_ipv4_out()
  nf_nat_ipv4_out()
   nf_xfrm_me_harder()
{
...
         dst = xfrm_lookup(net, dst, &fl, sk, 0);
         if (IS_ERR(dst))
                 return PTR_ERR(dst);

         skb_dst_drop(skb);
         skb_dst_set(skb, dst);
...
}
         {
                 .hook           = iptable_nat_ipv4_out,
                 .owner          = THIS_MODULE,
                 .pf             = NFPROTO_IPV4,
                 .hooknum        = NF_INET_POST_ROUTING,
                 .priority       = NF_IP_PRI_NAT_SRC,
         },

And dst is changed in some exotic cases even in some *_xmit() functions.
Well, may be for the sake of fairness we better move the vznetstat accounting
(and setting mark) code to venet_xmit()/veth_xmit() functions as well,
but honestly i do not make such big changes in vz7.
We live with this kind of accounting for a long time and it works well for most usecases.

So i'd leave setting mark on the same place along with accounting.


> + i've just found that i forgot to handle ipv6 venet hooks in the similar manner.