TCP_REPAIR MSS issue

Submitted by Andrey Vagin on June 16, 2016, 9:09 p.m.

Details

Message ID 20160616210902.GB6010@outlook.office365.com
State Rejected
Series "TCP_REPAIR MSS issue"

Commit Message

Andrey Vagin June 16, 2016, 9:09 p.m.
On Thu, Jun 16, 2016 at 07:51:22AM +0000, Eggert, Lars wrote:
> Hi,
> 
> On 2016-06-14, at 23:21, Andrey Vagin <avagin@virtuozzo.com> wrote:
> > On my host, I see that dst is set in tcp_v4_connect() -> sk_setup_caps()
> 
> sorry, are you saying that you don't see the issue with TCP_MSS_DEFAULT-sized segments after TCP_REPAIR on your kernel? Or are you saying my quick attempt at analyzing the cause was wrong?

I can't reproduce this issue, now I'm trying to understand why it works
for me and doesn't work for you.

I've read your explanation of the cause:

> When TCP_REPAIR is on, tcp_connect() directly calls tcp_finish_connect() before
> returning, passing NULL for skb, which causes sk_rx_dst_set() to be bypassed.
> Later, when TCP_REPAIR is being turned off, do_tcp_setsockopt() just does
> tcp_send_window_probe(), but apparently all the "dst" stuff is being bypassed
> then also, so the mss remains at TCP_MSS_DEFAULT.

I found where dst is set for a socket when a TCP connection is restored. Then I
added a debug message to tcp_sync_mss() and found that the MSS is initialized to
TCP_MSS_DEFAULT, but is then updated once the network is unlocked. So the
question is why the MSS isn't updated in your case.


[   86.095286] tcp_sync_mss:1372: pmtu = 1500 mss = 524 (536)
[   86.095292] CPU: 0 PID: 12474 Comm: criu ve: 101 Not tainted 3.10.0-327.18.2.ovz.14.14-00004-g4ba9241-dirty #9 14.14
[   86.095294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[   86.095297]  ffff8804094ec400 00000000b1bcc4c2 ffff880427b9bcf8 ffffffff8164c988
[   86.095301]  ffff880427b9bd18 ffffffff815a4aca ffff8804275c0780 ffff8804094ec400
[   86.095303]  ffff880427b9bd98 ffffffff815a70c8 ffffffff815911f0 ffffffff81a43500
[   86.095307] Call Trace:
[   86.095315]  [<ffffffff8164c988>] dump_stack+0x19/0x1b
[   86.095320]  [<ffffffff815a4aca>] tcp_sync_mss+0x19a/0x1a0
[   86.095323]  [<ffffffff815a70c8>] tcp_connect+0x98/0x9d0
[   86.095327]  [<ffffffff815911f0>] ? inet_unhash+0xc0/0xc0
[   86.095333]  [<ffffffff81543e0b>] ? secure_ipv4_port_ephemeral+0x5b/0x80
[   86.095337]  [<ffffffff815ac4da>] tcp_v4_connect+0x2da/0x4d0
[   86.095342]  [<ffffffff811af5f9>] ? __do_fault+0x589/0x670
[   86.095347]  [<ffffffff815c376d>] __inet_stream_connect+0xbd/0x330
[   86.095351]  [<ffffffff811b4db1>] ? handle_mm_fault+0x521/0x920
[   86.095354]  [<ffffffff815c3a18>] inet_stream_connect+0x38/0x50
[   86.095358]  [<ffffffff815314a3>] SYSC_connect+0x73/0xf0
[   86.095363]  [<ffffffff81657d63>] ? trace_do_page_fault+0x43/0x110
[   86.095366]  [<ffffffff81657389>] ? do_async_page_fault+0x29/0xe0
[   86.095369]  [<ffffffff81531c8e>] SyS_connect+0xe/0x10
[   86.095373]  [<ffffffff8165c749>] system_call_fastpath+0x16/0x1b
[   91.813519] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   91.814600] device veth51e6d765 entered promiscuous mode
[   91.814654] br0: port 2(veth51e6d765) entered forwarding state
[   91.814661] br0: port 2(veth51e6d765) entered forwarding state
[  106.853351] br0: port 2(veth51e6d765) entered forwarding state
[  116.224891] tcp_sync_mss:1372: pmtu = 1500 mss = 1448 (524)
[  116.224929] CPU: 1 PID: 0 Comm: swapper/1 ve: 0 Not tainted 3.10.0-327.18.2.ovz.14.14-00004-g4ba9241-dirty #9 14.14
[  116.224935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  116.224941]  000000002cd1a562 9139c0c08fdb34c5 ffff88043fc83a88 ffffffff8164c988
[  116.224948]  ffff88043fc83aa8 ffffffff815a4aca ffff8804275c0780 0000000000004100
[  116.224954]  ffff88043fc83b48 ffffffff8159fa24 ffff88043fc83be8 ffffffffa0289299
[  116.224960] Call Trace:
[  116.224965]  <IRQ>  [<ffffffff8164c988>] dump_stack+0x19/0x1b
[  116.224980]  [<ffffffff815a4aca>] tcp_sync_mss+0x19a/0x1a0
[  116.224986]  [<ffffffff8159fa24>] tcp_ack+0x394/0x11a0
[  116.225005]  [<ffffffffa0289299>] ? ipt_do_table+0x339/0x700 [ip_tables]
[  116.225014]  [<ffffffffa0289299>] ? ipt_do_table+0x339/0x700 [ip_tables]
[  116.225024]  [<ffffffff815a23d6>] tcp_rcv_established+0x1c6/0x740
[  116.225031]  [<ffffffff815ad6fa>] tcp_v4_do_rcv+0x10a/0x3b0
[  116.225039]  [<ffffffff815914f7>] ? __inet_lookup_established+0x47/0x140
[  116.225045]  [<ffffffff815aec03>] tcp_v4_rcv+0x823/0xa90
[  116.225051]  [<ffffffff815873b6>] ip_local_deliver_finish+0xe6/0x220
[  116.225060]  [<ffffffff81587695>] ip_local_deliver+0x55/0xd0
[  116.225066]  [<ffffffff815872d0>] ? ip_rcv_finish+0x350/0x350
[  116.225071]  [<ffffffff81586ffd>] ip_rcv_finish+0x7d/0x350
[  116.225077]  [<ffffffff815879cc>] ip_rcv+0x2bc/0x3e0
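
The two mss values in the traces above are consistent with simple header
arithmetic. The following is a minimal sketch of that arithmetic, not the
kernel's tcp_mtu_to_mss(). It assumes IPv4 without IP options, a 20-byte TCP
header, and a 12-byte timestamp option on every segment (an assumption that
fits the logged numbers; the log itself does not say so). The function name
and the SKETCH_* constants are hypothetical.

```c
#include <stdio.h>

/* Illustrative constants; the names are hypothetical, not kernel identifiers. */
enum {
	SKETCH_IPV4_HDR_LEN   = 20,  /* IPv4 header without options */
	SKETCH_TCP_HDR_LEN    = 20,  /* TCP header without options */
	SKETCH_TSTAMP_OPT_LEN = 12,  /* padded timestamp option (assumed in use) */
	SKETCH_MSS_DEFAULT    = 536  /* numeric value of TCP_MSS_DEFAULT */
};

/*
 * Derive a payload size from a path MTU: subtract the fixed headers,
 * apply the clamp, then subtract the per-segment TCP option bytes.
 * Returns 0 if the inputs leave no room for payload.
 */
unsigned int sketch_mtu_to_mss(unsigned int pmtu, unsigned int mss_clamp,
			       unsigned int opt_len)
{
	unsigned int hdrs = SKETCH_IPV4_HDR_LEN + SKETCH_TCP_HDR_LEN;
	unsigned int mss;

	if (pmtu <= hdrs)
		return 0;
	mss = pmtu - hdrs;
	if (mss > mss_clamp)
		mss = mss_clamp;
	if (mss <= opt_len)
		return 0;
	return mss - opt_len;
}

/* Print the value for one pmtu/clamp pair, assuming timestamps are on. */
void sketch_print_mss(unsigned int pmtu, unsigned int mss_clamp)
{
	printf("pmtu = %u clamp = %u mss = %u\n", pmtu, mss_clamp,
	       sketch_mtu_to_mss(pmtu, mss_clamp, SKETCH_TSTAMP_OPT_LEN));
}
```

With pmtu = 1500 and the clamp still at 536, this gives 524, the first value
logged. Once the clamp no longer limits the result, the same pmtu gives 1448,
the second value logged.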


> 
> Thanks,
> Lars

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95c0b50..b0d323f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1367,6 +1367,13 @@  unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
        icsk->icsk_pmtu_cookie = pmtu;
        if (icsk->icsk_mtup.enabled)
                mss_now = min(mss_now, tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low));
+
+       static struct tcp_sock *tp_s = NULL;
+        if (tp->repair || tp == tp_s) {
+                printk("%s:%d: pmtu = %d mss = %d (%d)\n", __func__, __LINE__, pmtu, mss_now, tp->mss_cache);
+               tp_s = tp;
+                dump_stack();
+        }
        tp->mss_cache = mss_now;
 
        return mss_now;

Comments

Eggert, Lars July 18, 2016, 9:06 a.m.
Hi,


On 2016-06-16, at 23:09, Andrey Vagin <avagin@virtuozzo.com> wrote:
> I can't reproduce this issue, now I'm trying to understand why it works
> for me and doesn't work for you.

just to conclude this thread for the list:

Andrey and I debugged this off-list. The issue arose because my code did a bind() to 0.0.0.0 in TCP_REPAIR mode. After TCP_REPAIR was turned off, data sent into the socket was transmitted in minimum-MSS-sized segments. The issue goes away when I bind() to the IP address of my local egress interface.
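
The workaround can be sketched in user space roughly as follows. This is a minimal sketch under stated assumptions, not the code from this thread: the helper names are hypothetical, the fallback value 19 for TCP_REPAIR is the one from the Linux UAPI header and is only used if <netinet/tcp.h> does not define it, setting TCP_REPAIR requires CAP_NET_ADMIN, and a real restore would also set sequence numbers, queues and options, which are omitted here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19 /* fallback for older libc headers (assumption) */
#endif

/*
 * Fill *out with an IPv4 address suitable for bind() in repair mode.
 * Returns -1 if local_ip does not parse or is the wildcard 0.0.0.0,
 * which is the case that led to minimum-MSS-sized segments.
 */
int sketch_make_bind_addr(const char *local_ip, uint16_t port,
			  struct sockaddr_in *out)
{
	memset(out, 0, sizeof(*out));
	out->sin_family = AF_INET;
	out->sin_port = htons(port);
	if (inet_pton(AF_INET, local_ip, &out->sin_addr) != 1)
		return -1;
	if (out->sin_addr.s_addr == htonl(INADDR_ANY))
		return -1;
	return 0;
}

/*
 * Enter repair mode, bind to a specific local address, connect, and
 * leave repair mode. Returns the socket fd, or -1 on any failure.
 */
int sketch_repair_connect(const char *local_ip, uint16_t local_port,
			  const struct sockaddr_in *peer)
{
	struct sockaddr_in local;
	int on = 1, off = 0;
	int fd;

	if (sketch_make_bind_addr(local_ip, local_port, &local) != 0)
		return -1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) != 0 ||
	    bind(fd, (const struct sockaddr *)&local, sizeof(local)) != 0 ||
	    connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) != 0 ||
	    setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off)) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Rejecting the wildcard address up front is a design choice of this sketch; the kernel itself accepts the bind(), which is the behavior discussed here.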

To me at least, this is a POLA violation (excuse the FreeBSD terminology :-). Either binding to 0.0.0.0 should fail, or it should succeed and full-sized segments should be sent. But at least I have a workaround now.

Thanks,
Lars