
TCP Auto-Tuning

November 17th, 2014

There is a field in each TCP segment called the “receive window”. With it, the receiver signals the amount of data it is able and willing to accept. This post describes the relevant metrics and overhead.

The overhead is window/2^tcp_adv_win_scale (tcp_adv_win_scale defaults to 2). So for the Linux default parameters for the receive window (tcp_rmem), the usable window is:
87380 - (87380 / 2^2) ≈ 65536.

Given a transatlantic link (150 ms RTT), the maximum performance ends up at:
65536/0.150 = 436906 bytes/s (or about 400 kbyte/s), which is really slow today.

Here is a link to a formula for receive buffer: http://www.acc.umu.se/~maswan/linux-netperf.txt.

With the increased default size, the same calculation gives: (873800 - 873800/2^2)/0.150 = 4369000 bytes/s, or about 4 Mbytes/s, which is reasonable for a modern network.

And note that this is only the default; if the sender is configured with a larger window size, it will happily scale up to 10 times this (8738000 * 0.75 / 0.150 ≈ 40 Mbytes/s), which is pretty good for a modern network.
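
To repeat the arithmetic above for your own links, here is a small Python sketch (the buffer sizes and the 150 ms RTT are just the example figures used above) that applies the same formula: usable window = buffer - buffer/2^tcp_adv_win_scale, and maximum throughput = usable window / RTT.

# Rough throughput ceiling imposed by the receive buffer at a given RTT.
def max_throughput(rmem_bytes, rtt_seconds, tcp_adv_win_scale=2):
    # Part of the buffer is reserved as overhead; the remainder is usable window.
    usable_window = rmem_bytes - rmem_bytes // (2 ** tcp_adv_win_scale)
    return usable_window / rtt_seconds  # bytes per second

for rmem in (87380, 873800, 8738000):
    bps = max_throughput(rmem, 0.150)
    print(f"buffer={rmem:>8}  ->  {bps:>12,.0f} bytes/s")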

FIND LINUX VERSION (because the default Linux TCP window sizing parameters before 2.6.17 are not optimal):

[daren@Shimla 20-oct]$ cat /proc/version
Linux version 2.6.32-431.20.5.el6.x86_64 (mockbuild@c6b8.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Jul 25 08:34:44 UTC 2014
[daren@Shimla 20-oct]$

FIND WINDOW SIZING PARAMETERS:

The default Linux TCP window sizing parameters before 2.6.17 are not optimal.
– To check what setting your system is using, use ‘sysctl name’ (e.g. ‘sysctl net.ipv4.tcp_rmem’).
– To change a setting, use ‘sysctl -w’. To make the setting permanent, add it to the file ‘/etc/sysctl.conf’.
EXAMPLE:
sysctl -w net.ipv4.tcp_rmem='4096 87380 8388608'

For this version (Linux version 2.6.32-431.20.5.el6.x86_64):

[daren@Shimla 20-oct]$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096        87380   4194304
[daren@Shimla 20-oct]$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096        16384   4194304
[daren@Shimla 20-oct]$ sysctl net.core.rmem_max
net.core.rmem_max = 124928

TCP autotuning settings (the three tcp_rmem values):
– The first value tells the kernel the minimum receive buffer for each TCP connection; this buffer is always allocated to a TCP socket, even under high memory pressure on the system.
– The second value tells the kernel the default receive buffer allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default value used by other protocols.
– The third and last value specifies the maximum receive buffer that can be allocated for a TCP socket.

The sender, on seeing the receive window size in the segment sent by the receiver, will make a note of it. The sender cannot have more data outstanding than the receive window advertised by the receiver until that data is acknowledged. Once the acknowledgement is received, and a new receive window value is sent by the receiver, the sender can send the next batch of data (again, only as much data as the receiver has advertised in the receive window).

If a receiver advertises a receive window size of 0, the sender cannot send any more data until an acknowledgement is received for the previously sent data and a new receive window size is sent by the receiver.

RFC 1323 (http://tools.ietf.org/html/rfc1323) contains details about performance improvements in TCP.

The limitation was that the window field in the TCP header is only 16 bits, so the maximum receive window size that can be advertised in a TCP segment is 65,535 bytes (note that the usable part of the Linux default buffer, 87380 - 87380/2^2 ≈ 65536, sits right at this limit).

The modification introduced something called window scaling, which raises the limit on the receive window size from 65,535 bytes to a maximum of 1,073,725,440 bytes (65,535 × 2^14, since the scale factor is at most 14), which is very close to 1 gigabyte. To understand why such large windows matter, let’s dive into a little bit of calculation, known as the Bandwidth Delay Product.
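
The 1,073,725,440 figure follows directly from the 16-bit window field combined with the maximum scale factor of 14; a quick check in Python:

# Largest advertisable window: 16-bit field shifted left by the maximum
# window scale factor of 14 (RFC 1323).
print(65535 << 14)  # 1073725440 bytes, just short of 1 GB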

BANDWIDTH DELAY PRODUCT:
The term bandwidth delay product is self-explanatory: it is the product of the bandwidth and the delay of the path between two end points.

The value by which we multiply the bandwidth is the delay involved in sending a packet and getting an acknowledgement back, i.e. the round-trip time (RTT), which can be measured with a tool such as ping.

Now if you have a bandwidth of 10 Mbps, and an RTT (Round Trip Time) of around 200 ms to your target receiver, then the bandwidth delay product will be:

Bandwidth Delay Product = 10 x 10^6 b/s x 200 x 10^-3 s = 2 x 10^6 bits = 250,000 bytes ≈ 244.14 kilobytes

The bandwidth delay product is the amount of data that must be in flight to use the connection speed efficiently. However, as our operating system has a default window size of 65,535 bytes (roughly 65 kilobytes), the connection is not used efficiently at all.

If you calculate 244.14 - 65 = 179, roughly 179 kilobytes of capacity is left unused per round trip. So the more round-trip time you have, the more data needs to be in flight at once to fully utilise the link speed (more delay means more data must be outstanding to fill the bandwidth). The bandwidth delay product keeps growing as the latency (round-trip time) grows.
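
The same calculation as a short Python sketch (the 10 Mbit/s bandwidth and 200 ms RTT are simply the example numbers from above):

# Bandwidth-delay product: how much data must be in flight to keep the pipe full.
bandwidth_bits_per_s = 10 * 10**6   # 10 Mbit/s
rtt_seconds = 200e-3                # 200 ms round trip

bdp_bytes = bandwidth_bits_per_s * rtt_seconds / 8
print(f"BDP: {bdp_bytes:,.0f} bytes (~{bdp_bytes / 1024:.2f} kilobytes)")

# Compare against a 64 KB window to see how much capacity goes unused per RTT.
window = 65535
print(f"Unused per round trip: ~{(bdp_bytes - window) / 1024:.0f} kilobytes")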

SOLUTION:
To solve this problem we need to use a larger window size. As mentioned before, performance improvements were brought to TCP with a new modification in the form of RFC 1323.

(WSCALE)
Increasing the window size for performance is implemented in the form of a mechanism called TCP Window Scaling. You can check whether it is enabled (1 means enabled) with:

[daren@Shimla 20-oct]$ sysctl net.ipv4.tcp_window_scaling
net.ipv4.tcp_window_scaling = 1

RWIN
Even after enabling the window scaling option, the maximum amount of data that can be sent to a receiver without getting an acknowledgement back depends on one more factor, called the receive window size. This is the maximum amount of data that a receiver can buffer before it is processed by the receiving application.

Now if the receiver’s receive window size is small, then even with window scaling enabled the sender can only send as much data as the receive window configured at the receiver end allows.

Hence we need to raise the maximum receive window size. This configuration is also made in the sysctl.conf file, with the option below.
net.core.rmem_max = 16777216

Modifying the maximum send window size is similar to the way we modified the maximum receive window size.
net.core.wmem_max = 16777216

Besides the maximum receive and send window sizes mentioned above, there is one more set of values the operating system uses to size these buffers under different conditions (also set in the sysctl.conf file). Let’s look at the receive window values first.
net.ipv4.tcp_rmem = 4096 87380 16777216

(remember that there are three values):
– The first value is the minimum receive buffer that will be allocated to each TCP connection, even if the system is under extremely high memory pressure.
– The second is the default buffer allocated to each TCP connection.
– The third is the maximum that can be allocated to a TCP connection.

Similarly, the send window settings are shown here:
net.ipv4.tcp_wmem = 4096        16384   16777216
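
If you prefer to read these triples programmatically rather than via sysctl, here is a minimal Python sketch (it simply parses the standard /proc files used throughout this post):

# Print the (min, default, max) triples for the TCP receive and send buffers.
def read_triple(path):
    with open(path) as f:
        return tuple(int(v) for v in f.read().split())

for name in ("tcp_rmem", "tcp_wmem"):
    mn, default, mx = read_triple(f"/proc/sys/net/ipv4/{name}")
    print(f"{name}: min={mn} default={default} max={mx}")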

With all these combined, the sysctl.conf file will look something like the below.
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216

Once /etc/sysctl.conf has been modified, the settings become permanent (they are applied at every boot); to load them immediately, issue:
sysctl -p /etc/sysctl.conf

—–
QUICK FIX: [wirespeed for Gigabit Ethernet within 5 ms RTT and Fast Ethernet within 50 ms RTT]:

in /etc/sysctl.conf

net/core/rmem_max = 8738000
net/core/wmem_max = 6553600

net/ipv4/tcp_rmem = 8192 873800 8738000
net/ipv4/tcp_wmem = 4096 655360 6553600

Alternatively, you can modify the above values on the fly by writing the required values to the corresponding files under /proc (note that changes made this way do not survive a reboot). This can be done as shown below.
echo '16777216' > /proc/sys/net/core/rmem_max
echo '16777216' > /proc/sys/net/core/wmem_max

Notes for Linux 2.4 users:

The RFC1323 window scale value is initially calculated as
roof(ln(x/65536)/ln(2)), where in 2.4 based kernels x = initial receive buffer (or tcp_rmem[default]) and in 2.6 kernels x = max(tcp_rmem[max], core_rmem_max)

This effectively limits 2.4-kernel TCP windows to ~tcp_rmem[default], but from 2.4.27 (or possibly earlier) the wscale is set to
max(calculated_wscale, tcp_default_win_scale), which means that one can set tcp_default_win_scale to the same value a 2.6 kernel would use.

So if there is a tcp_default_win_scale in /proc/sys/net/ipv4, you should add net/ipv4/tcp_default_win_scale = roof(ln(tcp_rmem[max]/65536)/ln(2)) to /etc/sysctl.conf.
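
As a worked example of that formula, a small Python sketch (the buffer sizes are just values that appear elsewhere in this post):

import math

# wscale = roof(ln(x/65536)/ln(2)); "roof" is the ceiling function.
def wscale(x):
    return max(0, math.ceil(math.log(x / 65536, 2)))

print(wscale(87380))     # default-sized buffer -> 1
print(wscale(4194304))   # 4 MB buffer  -> 6
print(wscale(16777216))  # 16 MB buffer -> 8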

DISABLING AUTO-TUNING:
NB: to turn off auto-tuning of the TCP receive buffer size, run one of the following on the receiver:

$ sudo sysctl net.ipv4.tcp_moderate_rcvbuf=0
or
$ echo 0 | sudo tee /proc/sys/net/ipv4/tcp_moderate_rcvbuf
(a plain ‘sudo echo 0 > file’ would not work, because the redirection is performed by your non-root shell)

What is rcv_space in the ‘ss --info’ output, and why is its value larger than net.core.rmem_max?

” rcv_space is used in TCP’s internal auto-tuning to grow socket buffers based on how much data the kernel estimates the sender can send. It will change over the life of any connection. It’s measured in bytes. You can see where the value is populated by reading the tcp_get_info() function in the kernel.

The value is not measuring the actual socket buffer size, which is what net.ipv4.tcp_rmem controls. You’d need to call getsockopt() within the application to check the buffer size. You can see current buffer usage with the Recv-Q and Send-Q fields of ss.
Note that if the buffer size is set with setsockopt(), the value returned with getsockopt() is always double the size requested, to allow for overhead. This is described in man 7 socket.

You can check the value of kernel tunables with sysctl tunable.name, for example sysctl net.ipv4.tcp_rmem, or you can just run sysctl -a to see all the tunables. I find it easier to just see all of them and grep, like sysctl -a | egrep tcp..*mem”
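
The setsockopt()/getsockopt() doubling mentioned above is easy to observe with a few lines of Python on a Linux host (the 64 KB request is an arbitrary example value, kept small enough to stay below a typical net.core.rmem_max):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

requested = 65536  # ask for a 64 KB receive buffer
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# Linux doubles the requested value to allow for bookkeeping overhead
# (see man 7 socket), so this typically prints 131072. Requests above
# net.core.rmem_max are silently capped.
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
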
Sources:
http://www.acc.umu.se/~maswan/linux-netperf.txt

https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php

https://www.frozentux.net/ipsysctl-tutorial/chunkyhtml/tcpvariables.html

http://www.slashroot.in/linux-network-tcp-performance-tuning-sysctl

http://fasterdata.es.net/host-tuning/

http://support.microsoft.com/kb/983528

https://www.duckware.com/blog/how-windows-is-killing-internet-download-speeds/index.html

https://access.redhat.com/discussions/782343

C:\Windows\system32>netsh int tcp show global
Querying active state…

TCP Global Parameters
———————————————-
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
NetDMA State                        : enabled
Direct Cache Acess (DCA)            : disabled
Receive Window Auto-Tuning Level    : normal
Add-On Congestion Control Provider  : none
ECN Capability                      : disabled
RFC 1323 Timestamps                 : disabled

C:\Windows\system32>netsh interface tcp show heuristics
TCP Window Scaling heuristics Parameters
———————————————-
Window Scaling heuristics         : enabled
Qualifying Destination Threshold  : 3
Profile type unknown              : normal
Profile type public               : normal
Profile type private              : normal
Profile type domain               : normal
