TCP Window Scaling Bottleneck

Advanced Performance

Bulk TCP transfers over high-bandwidth, high-latency links (e.g., intercontinental links, satellite, or WAN) achieve only a small fraction of the available bandwidth. The bandwidth-delay product (BDP) of the link far exceeds the maximum TCP receive window being advertised, causing the sender to stall waiting for ACKs before it can send more data. This is a classic long fat network (LFN) problem.

Symptoms

⚠ iperf3 shows throughput of 10-100 Mbps on a 10 Gbps link when RTT is above 50ms
⚠ Throughput improves dramatically when using multiple parallel TCP streams (-P flag in iperf3)
⚠ ss -ti shows small receive windows on active connections (rcv_space and rcv_ssthresh; newer iproute2 releases also print rcv_wnd)
⚠ CPU utilisation is low during transfers, confirming the bottleneck is not computational
⚠ Wireshark capture shows TCP window size plateauing without Zero Window conditions
⚠ Throughput is predictable: measured single-stream throughput matches window / RTT and falls far short of the link rate implied by BDP = bandwidth x RTT

Possible Root Causes

• TCP receive window too small for the bandwidth-delay product: the 65,535-byte maximum of an unscaled window is sufficient for local LANs but not for intercontinental or satellite links
• TCP window scaling option disabled or stripped by a firewall or middlebox in the path, preventing the window from growing beyond 65535 bytes
• Suboptimal TCP congestion control algorithm (e.g., CUBIC), which can be slow to refill a high-BDP pipe after packet loss; BBR often performs better on such links
• Socket buffer size limits set too low in the kernel (net.core.rmem_max, net.core.wmem_max), capping the window at values well below what the link could support
• Operating system auto-tuning disabled, leaving static small window sizes that are appropriate for LAN but not WAN workloads

Diagnosis Steps

Step 1 — Calculate the bandwidth-delay product

The theoretical maximum throughput of a single TCP stream is:

Max throughput = TCP window size / RTT

Example: 65535 bytes window / 0.100 sec RTT = 655,350 bytes/sec = ~5.2 Mbps
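
The arithmetic above can be wrapped in two small helpers, shown here as a minimal sketch. The function names max_throughput_mbps and bdp_bytes are illustrative and not part of any tool; the inputs are the same figures used in this article (a 65,535-byte window, 100 ms RTT, a 10 Gbps link).

```shell
# Window-limited throughput ceiling for one TCP stream, in Mbps.
# usage: max_throughput_mbps WINDOW_BYTES RTT_MS
max_throughput_mbps() {
    awk -v w="$1" -v rtt="$2" 'BEGIN { printf "%.2f\n", (w * 8) / (rtt / 1000) / 1e6 }'
}

# Window needed to keep a link full (the bandwidth-delay product), in bytes.
# usage: bdp_bytes LINK_MBPS RTT_MS
bdp_bytes() {
    awk -v bw="$1" -v rtt="$2" 'BEGIN { printf "%.0f\n", bw * 1e6 / 8 * rtt / 1000 }'
}

max_throughput_mbps 65535 100   # unscaled window at 100 ms RTT  -> 5.24
bdp_bytes 10000 100             # 10 Gbps link at 100 ms RTT     -> 125000000
```

Comparing the two outputs shows the gap: the link needs roughly 125 MB in flight, while an unscaled window allows only about 64 KB.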

Measure your RTT first:

# Measure RTT to the remote host
ping -c 20 remote-host.com

# Run iperf3 and print the sender-side RTT statistics (reported in microseconds)
iperf3 -c remote-host.com -t 30 --json | python3 -c \
  "import sys,json; s=json.load(sys.stdin)['end']['streams'][0]['sender']; print('RTT min/mean/max (us):', s['min_rtt'], s['mean_rtt'], s['max_rtt'])"

Step 2 — Check current TCP window sizes

# Check kernel TCP buffer settings
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_window_scaling

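# Check window size on active connections (ss -n prints numeric addresses, so filter by IP, not hostname)
ss -tin dst REMOTE_IP   # replace REMOTE_IP with the remote host's address
# Look for: rcv_space and rcv_ssthresh (receive side), cwnd, and snd_wnd / rcv_wnd on newer iproute2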
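
To turn the sysctl output into a yes/no answer, the buffer ceiling can be compared with the BDP calculated in Step 1. The following is a sketch only: rmem_max_field and buffer_verdict are hypothetical helper names, and the tcp_rmem triple is sample input rather than a measurement from any particular host.

```shell
# Extract the third field (max) from a "min default max" sysctl triple.
# usage: rmem_max_field "MIN DEFAULT MAX"
rmem_max_field() {
    echo "$1" | awk '{ print $3 }'
}

# Compare a buffer ceiling with the bandwidth-delay product (both in bytes).
# usage: buffer_verdict MAX_BUFFER_BYTES BDP_BYTES
buffer_verdict() {
    if [ "$1" -ge "$2" ]; then
        echo "ok: buffer limit $1 covers BDP $2"
    else
        echo "too small: buffer limit $1 is below BDP $2"
    fi
}

# Sample input; on a live host use: "$(sysctl -n net.ipv4.tcp_rmem)"
sample_rmem='4096 131072 6291456'
buffer_verdict "$(rmem_max_field "$sample_rmem")" 125000000
```

The kernel reserves part of each socket buffer for overhead, so the usable window is smaller than the configured limit; treat an "ok" that only barely clears the BDP with caution.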

Step 3 — Test single-stream vs. multi-stream throughput

# Single stream (limited by window / RTT)
iperf3 -c remote-host.com -t 30

# Multi-stream (each stream has its own window, so the aggregate is not capped by a single window)
iperf3 -c remote-host.com -t 30 -P 8

# If multi-stream >> single-stream, a per-connection limit (window size or per-flow loss recovery) is the likely bottleneck
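
The comparison can be made explicit by dividing the two results. This is a minimal sketch: stream_ratio is a hypothetical helper and the two bit rates are invented sample values. With iperf3 --json, the aggregate receive rate is reported under end.sum_received.bits_per_second.

```shell
# Ratio of multi-stream to single-stream throughput.
# usage: stream_ratio SINGLE_BPS MULTI_BPS
stream_ratio() {
    awk -v s="$1" -v m="$2" 'BEGIN {
        if (s <= 0) { print "invalid"; exit 1 }
        printf "%.1f\n", m / s
    }'
}

# Sample values only: 52 Mbps with one stream, 410 Mbps with eight.
stream_ratio 52000000 410000000
```

A ratio close to the number of parallel streams suggests each stream is held to the same per-connection ceiling; a ratio near 1 points to a shared bottleneck such as link capacity instead.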

Step 4 — Verify TCP window scaling is negotiated

# Capture a connection with tcpdump and check SYN options
tcpdump -i eth0 -w /tmp/cap.pcap host remote-host.com &
curl -s https://remote-host.com/largefile > /dev/null
kill %1

# Inspect the SYN packet for Window Scale option
tcpdump -r /tmp/cap.pcap -v | grep -i 'wscale'
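
Once the wscale value is visible, the largest window the connection can advertise follows directly: the 16-bit window field is shifted left by the announced count (at most 14 under RFC 7323). The sketch below assumes a tcpdump-style line; the sample line is hand-written for illustration, and wscale_from_line and max_window_bytes are hypothetical helper names.

```shell
# Pull the shift count out of a tcpdump line; prints nothing if wscale is absent.
# usage: wscale_from_line "TCPDUMP_LINE"
wscale_from_line() {
    echo "$1" | sed -n 's/.*wscale \([0-9][0-9]*\).*/\1/p'
}

# Largest advertisable window for a given shift count, in bytes.
# usage: max_window_bytes SHIFT
max_window_bytes() {
    echo $(( 65535 << $1 ))
}

# Hand-written sample in tcpdump's output style, not a real capture.
line='Flags [S], seq 1, win 64240, options [mss 1460,sackOK,TS val 1 ecr 0,nop,wscale 7], length 0'
shift_count=$(wscale_from_line "$line")
if [ -n "$shift_count" ]; then
    echo "wscale $shift_count -> max window $(max_window_bytes "$shift_count") bytes"
else
    echo "no wscale option: window capped at 65535 bytes"
fi
```

Scaling is used only if both the SYN and the SYN-ACK carry the option, so run the check against each of them.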

Step 5 — Check for middlebox interference

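# Record baseline single-connection download speed (bytes/sec) for comparison
curl -o /dev/null -w '%{speed_download}\n' https://remote-host.com/bigfile

# Check if the window scaling option survives the path: run this capture on BOTH
# hosts (adjust the host filter on the remote side) while opening a connection,
# then compare the options seen in the SYN and SYN-ACK at each end
sudo tcpdump -ni eth0 -c 2 -v 'host remote-host.com and tcp[tcpflags] & tcp-syn != 0'

# Test whether traffic with a different DSCP marking is treated differently by QoS policy
iperf3 -c remote-host.com -S 0x28  # TOS 0x28 = DSCP AF11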

Step 6 — Profile the congestion control algorithm in use

# Check which TCP congestion control algorithm is active
sysctl net.ipv4.tcp_congestion_control

# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
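
The available list is space-separated, so a whole-word match avoids confusing one algorithm with a similarly named one. A minimal sketch, assuming a sample list; has_algo is a hypothetical helper name.

```shell
# Succeeds if NAME appears as a whole word in the space-separated LIST.
# usage: has_algo "LIST" NAME
has_algo() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample list; on a live host use:
#   available=$(sysctl -n net.ipv4.tcp_available_congestion_control)
available='reno cubic bbr'
if has_algo "$available" bbr; then
    echo "bbr is available"
else
    echo "bbr is not listed; the tcp_bbr module may need loading"
fi
```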

Solution

Fix 1 — Increase TCP socket buffer sizes

Apply on both sender and receiver. The sysctl -w commands take effect immediately; the file under /etc/sysctl.d/ persists them across reboots:

# Calculate required buffer: BDP = bandwidth (bytes/s) x RTT (seconds)
# Example: 10 Gbps link, 100ms RTT: BDP = 1,250,000,000 x 0.1 = 125 MB

# Set maximum socket buffer to cover the BDP (double for headroom)
sudo sysctl -w net.core.rmem_max=268435456         # 256 MB
sudo sysctl -w net.core.wmem_max=268435456         # 256 MB
sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 268435456'
sudo sysctl -w net.ipv4.tcp_wmem='4096 65536 268435456'
sudo sysctl -w net.ipv4.tcp_window_scaling=1       # Ensure window scaling is enabled
sudo sysctl -w net.ipv4.tcp_timestamps=1           # PAWS and accurate RTT estimates with large windows
sudo sysctl -w net.ipv4.tcp_sack=1                 # Selective ACK for efficiency

# Persist the settings
cat << 'EOF' | sudo tee -a /etc/sysctl.d/99-tcp-tuning.conf
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
EOF
sudo sysctl -p /etc/sysctl.d/99-tcp-tuning.conf
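
For links other than the 10 Gbps, 100 ms example, the buffer ceiling can be derived rather than copied. The sketch below doubles the BDP and rounds up to a power of two, which reproduces the 268435456 figure used above; the rounding rule and the function name tuning_conf are assumptions made for illustration, not a kernel requirement.

```shell
# Print buffer-related sysctl lines sized to 2 x BDP, rounded up to a power of two.
# usage: tuning_conf LINK_MBPS RTT_MS
tuning_conf() {
    awk -v bw="$1" -v rtt="$2" 'BEGIN {
        bdp = bw * 1e6 / 8 * rtt / 1000      # bandwidth-delay product in bytes
        buf = 1
        while (buf < 2 * bdp) buf *= 2        # headroom: 2 x BDP, next power of two
        printf "net.core.rmem_max = %.0f\n", buf
        printf "net.core.wmem_max = %.0f\n", buf
        printf "net.ipv4.tcp_rmem = 4096 87380 %.0f\n", buf
        printf "net.ipv4.tcp_wmem = 4096 65536 %.0f\n", buf
    }'
}

tuning_conf 10000 100   # 10 Gbps at 100 ms RTT
```

The output can be reviewed and then written to the sysctl.d file by hand. Larger ceilings allow each connection to pin more kernel memory, so size them for the links a host actually serves.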

Fix 2 — Switch to BBR congestion control

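BBR (Bottleneck Bandwidth and Round-trip propagation time) paces sending from a model of path bandwidth and RTT rather than backing off on every loss, so it often fills high-BDP links better than CUBIC. It does not lift a receive-window limit, so apply Fix 1 as well: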

# Check kernel version (BBR requires 4.9+)
uname -r

# Enable BBR (load the module first in case bbr is not yet listed as available)
sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sudo sysctl -w net.core.default_qdisc=fq

# Persist
echo "net.ipv4.tcp_congestion_control = bbr" | sudo tee -a /etc/sysctl.d/99-tcp-tuning.conf
echo "net.core.default_qdisc = fq" | sudo tee -a /etc/sysctl.d/99-tcp-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-tcp-tuning.conf

# Verify
sysctl net.ipv4.tcp_congestion_control

Fix 3 — Investigate and fix middlebox window scaling stripping

If tcpdump confirms the Window Scale option is missing from the SYN or SYN-ACK packets (both must carry it for scaling to be used):

# On the firewall (iptables), look for rules that strip or rewrite TCP options
# The TCPOPTSTRIP target (mangle table) can remove options such as wscale
iptables -t mangle -S | grep -iE 'TCPOPTSTRIP|TCPMSS'

# If using TCPMSS, clamp to the path MTU rather than a hard-coded small value
# (MSS clamping does not affect window scaling, but an overly small MSS also hurts throughput)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Fix 4 — Use QUIC/HTTP3 for application-layer transfers

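QUIC (used by HTTP/3) runs over UDP with its own flow control, so it is unaffected by TCP window scaling problems or by middleboxes that strip TCP options. Its own flow-control windows still need to be large enough for the BDP:

# Test HTTP/3 download speed (requires a curl build with HTTP/3 support and a server with QUIC support)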
curl --http3 -o /dev/null -w '%{speed_download}' https://remote-host.com/largefile

Verification

# After applying fixes, re-run iperf3 single stream test
iperf3 -c remote-host.com -t 60

# Expected: throughput should now approach the link rate (BDP / RTT)
# With a 256 MiB (268,435,456-byte) buffer and 100ms RTT: ceiling = 268,435,456 B / 0.1s = ~2.7 GB/s (~21 Gbps), well above the 10 Gbps link speed

Prevention

• Apply TCP buffer tuning (large rmem/wmem) as a standard baseline in all server provisioning playbooks, especially for WAN-facing servers
• Enable BBR congestion control by default on servers that serve intercontinental traffic or large file downloads
• Audit firewall and NAT device configurations to ensure they do not strip TCP timestamp or window scaling options
• Include iperf3 single-stream vs multi-stream benchmarks in deployment tests to detect window scaling issues before production
• Use QUIC or HTTP/3 for applications with strict throughput requirements on high-latency links, since QUIC's flow control avoids TCP window limitations
