[rh7,1/2] net/skbuff: Don't waste memory reserves

Submitted by Andrey Ryabinin on April 8, 2019, 12:09 p.m.


Message ID 20190408120949.7574-1-aryabinin@virtuozzo.com

Commit Message

We were observing network performance issues due to packets
being dropped by sk_filter_trim_cap() since the 'skb' was allocated
from pfmemalloc reserves:

    /*
     * If the skb was allocated from pfmemalloc reserves, only
     * allow SOCK_MEMALLOC sockets to use it as this socket is
     * helping free memory
     */
    if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
        return -ENOMEM;

Memalloc sockets are used for things like swap over NBD or NFS,
and only memalloc sockets can process memalloc skbs. Since we
don't have any memalloc sockets in our setups, we shouldn't have
memalloc skbs either. It simply doesn't make sense to waste
memory reserves on an skb only to drop the packet later.

It appears that __dev_alloc_pages() unconditionally uses
__GFP_MEMALLOC, so unless the caller adds __GFP_NOMEMALLOC,
__dev_alloc_pages() may dip into memory reserves.
Later, build_skb() or __skb_fill_page_desc() sets skb->pfmemalloc = 1,
so such an skb is always dropped by sk_filter_trim_cap().

Instead of wasting memory reserves, we should simply avoid using
them when there are no memalloc sockets in the system. Do this
by adding __GFP_MEMALLOC only when such a socket is present in
the system.

https://pmc.acronis.com/browse/VSTOR-21390
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 include/linux/skbuff.h | 21 +++++++++++++++++++--
 include/net/sock.h     | 15 ---------------
 2 files changed, 19 insertions(+), 17 deletions(-)


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 58515be85153..28a112d06ece 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2535,6 +2535,21 @@  void napi_consume_skb(struct sk_buff *skb, int budget);
 void __kfree_skb_flush(void);
 void __kfree_skb_defer(struct sk_buff *skb);
 
+#ifdef CONFIG_NET
+extern struct static_key memalloc_socks;
+static inline int sk_memalloc_socks(void)
+{
+	return static_key_false(&memalloc_socks);
+}
+#else
+
+static inline int sk_memalloc_socks(void)
+{
+	return 0;
+}
+
+#endif
+
 /**
  * __dev_alloc_pages - allocate page for network Rx
  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
@@ -2555,7 +2570,9 @@  static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
 	 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
 	 *     code in gfp_to_alloc_flags that should be enforcing this.
 	 */
-	gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;
+	gfp_mask |= __GFP_COLD | __GFP_COMP;
+	if (sk_memalloc_socks())
+		gfp_mask |= __GFP_MEMALLOC;
 
 	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
 }
@@ -2601,7 +2618,7 @@  static inline struct page *__skb_alloc_pages(gfp_t gfp_mask,
 
 	gfp_mask |= __GFP_COLD;
 
-	if (!(gfp_mask & __GFP_NOMEMALLOC))
+	if (sk_memalloc_socks() && !(gfp_mask & __GFP_NOMEMALLOC))
 		gfp_mask |= __GFP_MEMALLOC;
 
 	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
diff --git a/include/net/sock.h b/include/net/sock.h
index 1b86b37fe884..6ac59b86b3a2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -785,21 +785,6 @@  static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
 	return test_bit(flag, &sk->sk_flags);
 }
 
-#ifdef CONFIG_NET
-extern struct static_key memalloc_socks;
-static inline int sk_memalloc_socks(void)
-{
-	return static_key_false(&memalloc_socks);
-}
-#else
-
-static inline int sk_memalloc_socks(void)
-{
-	return 0;
-}
-
-#endif
-
 static inline gfp_t sk_gfp_atomic(struct sock *sk, gfp_t gfp_mask)
 {
 	return GFP_ATOMIC | (sk->sk_allocation & __GFP_MEMALLOC);

Comments

Alexey Kuznetsov April 8, 2019, 1:24 p.m.
Hello!

Doesn't this mean that we now have a reserve
which nobody ever uses?

I am aware this reserve can in theory be used
by mm itself. When? Tracing showed that the only
function ever taking the ALLOC_NO_WATERMARKS path
used to be dev_alloc_pages.


Andrey Ryabinin April 8, 2019, 4:29 p.m.
On 4/8/19 4:24 PM, Alexey Kuznetsov wrote:
> Hello!
> 
> Doesn't this mean that we now have a reserve
> which nobody ever uses?
> 

Don't think so.

> I am aware this reserve can in theory be used
> by mm itself.

Yes, that is the whole idea of reserves - to use them only for memory reclaim.


> When?

The most common case must be kswapd (it has PF_MEMALLOC set in task->flags, which tells the page allocator it may use reserves) doing ->writepage(); e.g. see the trace below.
It could also be any task in direct reclaim (again, PF_MEMALLOC in ->flags), or an allocation from a ->scan_objects() callback in a slab shrinker.

            7fffa98279bd __alloc_pages_nodemask ([kernel.kallsyms])
            7fffa987d077 alloc_pages_current ([kernel.kallsyms])
            7fffa9887d1d new_slab ([kernel.kallsyms])
            7fffa9889ae7 ___slab_alloc ([kernel.kallsyms])
            7fffa9889c70 __slab_alloc ([kernel.kallsyms])
            7fffa988a927 __kmalloc ([kernel.kallsyms])
            7fffc04b4a7d alloc_indirect.isra.11 ([kernel.kallsyms])
            7fffc04b5030 virtqueue_add_sgs ([kernel.kallsyms])
            7fffc051b6e8 __virtblk_add_req ([kernel.kallsyms])
            7fffc051b9ff virtio_queue_rq ([kernel.kallsyms])
            7fffa99d0982 blk_mq_dispatch_rq_list ([kernel.kallsyms])
            7fffa99d59af blk_mq_do_dispatch_sched ([kernel.kallsyms])
            7fffa99d638e blk_mq_sched_dispatch_requests ([kernel.kallsyms])
            7fffa99ce8b0 __blk_mq_run_hw_queue ([kernel.kallsyms])
            7fffa99cea38 __blk_mq_delay_run_hw_queue ([kernel.kallsyms])
            7fffa99ceb41 blk_mq_run_hw_queue ([kernel.kallsyms])
            7fffa99d685e blk_mq_sched_insert_requests ([kernel.kallsyms])
            7fffa99d1989 blk_mq_flush_plug_list ([kernel.kallsyms])
            7fffa99c6a5e blk_flush_plug_list ([kernel.kallsyms])
            7fffa99d1596 blk_mq_make_request ([kernel.kallsyms])
            7fffa99c4ea2 generic_make_request ([kernel.kallsyms])
            7fffa99c5171 submit_bio ([kernel.kallsyms])
            7fffa986d400 __swap_writepage ([kernel.kallsyms])
            7fffa986d5ee swap_writepage ([kernel.kallsyms])
            7fffa98341ae shrink_page_list ([kernel.kallsyms])
            7fffa983511a shrink_inactive_list ([kernel.kallsyms])
            7fffa9835b54 shrink_zone_memcg ([kernel.kallsyms])
            7fffa9836007 shrink_zone ([kernel.kallsyms])
            7fffa98371c0 balance_pgdat ([kernel.kallsyms])
            7fffa98376b4 kswapd ([kernel.kallsyms])
            7fffa972d941 kthread ([kernel.kallsyms])
            7fffa9e132b7 ret_from_fork_nospec_end ([kernel.kallsyms])


Another use-case for reserves is oom-killed tasks, which need to allocate some memory to finish and free their resources (see the TIF_MEMDIE thread flag).

> Tracing showed that the only
> function ever taking the ALLOC_NO_WATERMARKS path
> used to be dev_alloc_pages.
> 

I did a little testing myself and managed to trigger only the trace above.
However, there are many different ->writepage() and ->scan_objects() methods,
and I'm pretty sure that some of them, besides swap_writepage(), allocate memory.