TCP Window Scaling Bottleneck
Bulk TCP transfers over high-bandwidth, high-latency links (e.g., intercontinental links, satellite, or WAN) achieve only a small fraction of the available bandwidth. The bandwidth-delay product (BDP) of the link far exceeds the maximum TCP receive window being advertised, causing the sender to stall waiting for ACKs before it can send more data. This is a classic long fat network (LFN) problem.
Symptoms
- ⚠ iperf3 shows throughput of 10-100 Mbps on a 10 Gbps link when RTT is above 50ms
- ⚠ Throughput improves dramatically when using multiple parallel TCP streams (-P flag in iperf3)
- ⚠ ss -ti shows small receive window values on active connections (rcv_wnd on recent iproute2 releases; rcv_space and rcv_ssthresh on older ones)
- ⚠ CPU utilisation is low during transfers, confirming the bottleneck is not computational
- ⚠ Wireshark capture shows TCP window size plateauing without Zero Window conditions
- ⚠ Throughput scales predictably with RTT: measured single-stream throughput matches window / RTT, well below what the link's BDP (bandwidth x RTT) would allow
Possible Root Causes
- • TCP receive window too small for the bandwidth-delay product: without window scaling the window is capped at 65,535 bytes, which is sufficient for local LANs but not for intercontinental or satellite links
- • TCP window scaling option disabled or stripped by a firewall or middlebox in the path, preventing the window from growing beyond 65535 bytes
- • Congestion control algorithm poorly suited to the path (e.g., CUBIC, which can be slow to refill a high-BDP pipe after packet loss); BBR often performs better on such links
- • Socket buffer size limits set too low in the kernel (net.core.rmem_max, net.core.wmem_max), capping the window at values well below what the link could support
- • Operating system auto-tuning disabled, leaving static small window sizes that are appropriate for LAN but not WAN workloads
Diagnosis Steps
Step 1 — Calculate the bandwidth-delay product
The theoretical maximum throughput of a single TCP stream is:
Max throughput = TCP window size / RTT
Example: 65535 bytes window / 0.100 sec RTT = 655,350 bytes/sec = ~5.2 Mbps
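The same arithmetic can be scripted so it is easy to rerun for other links. The following is a minimal sketch; the function names are illustrative and not part of any tool mentioned here.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput, in bits per second."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps / 8 * rtt_seconds


# A 65,535-byte window at 100 ms RTT (the example above)
print(f"{max_throughput_bps(65535, 0.100) / 1e6:.2f} Mbps")

# Window needed to fill a 10 Gbps link at 100 ms RTT
print(f"{required_window_bytes(10e9, 0.100) / 1e6:.0f} MB")
```

Comparing the second figure with the window actually advertised shows how far a connection is from being able to fill the link.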
Measure your RTT first:
# Measure RTT to the remote host
ping -c 20 remote-host.com
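# Run iperf3 and print the sender-side RTT (iperf3 reports these fields in microseconds)
iperf3 -c remote-host.com -t 30 --json | python3 -c \
"import sys,json; s=json.load(sys.stdin)['end']['streams'][0]['sender']; print('RTT min/avg/max (ms):', s['min_rtt']/1000, s['mean_rtt']/1000, s['max_rtt']/1000)"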
Step 2 — Check current TCP window sizes
# Check kernel TCP buffer settings
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_window_scaling
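# Check window size on active connections (-n prints numeric addresses, so filter by IP; replace REMOTE_IP)
ss -tin dst REMOTE_IP
# Look for: rcv_wnd (receive window) and snd_wnd (send window); older iproute2 versions show rcv_space instead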
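To judge whether the configured limits are large enough, the maximum field of tcp_rmem or tcp_wmem can be compared against the BDP. The sketch below assumes a Linux host; the helper names and the sample values are illustrative only.

```python
from pathlib import Path


def parse_tcp_mem(value: str) -> tuple:
    """Split a tcp_rmem/tcp_wmem value into (min, default, max) bytes."""
    minimum, default, maximum = (int(field) for field in value.split())
    return minimum, default, maximum


def buffer_covers_bdp(tcp_mem_value: str, bandwidth_bps: float, rtt_seconds: float) -> bool:
    """True if the maximum buffer is at least one bandwidth-delay product."""
    _, _, maximum = parse_tcp_mem(tcp_mem_value)
    bdp_bytes = bandwidth_bps / 8 * rtt_seconds
    return maximum >= bdp_bytes


def read_sysctl(name: str) -> str:
    """Read a sysctl such as 'net.ipv4.tcp_rmem' from /proc/sys (Linux only)."""
    return Path("/proc/sys", *name.split(".")).read_text().strip()


# Illustrative value: a 6 MiB maximum against a 10 Gbps, 100 ms path
print(buffer_covers_bdp("4096 131072 6291456", 10e9, 0.100))
```

On a live system the first argument would come from `read_sysctl("net.ipv4.tcp_rmem")` rather than a hard-coded string.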
Step 3 — Test single-stream vs. multi-stream throughput
# Single stream (limited by window / RTT)
iperf3 -c remote-host.com -t 30
# Multi-stream (each stream has its own window, so the aggregate is not bound by a single window / RTT)
iperf3 -c remote-host.com -t 30 -P 8
# If multi-stream >> single-stream, a per-connection limit (window or buffer size) is the likely bottleneck
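The comparison can be automated by running both tests with `--json` and comparing the receiver-side totals. This is a rough sketch: the 2x threshold is an arbitrary assumption for illustration, and the JSON documents below are synthetic stand-ins for real iperf3 output.

```python
import json

# Hypothetical threshold: treat a 2x or greater aggregate gain as significant.
SPEEDUP_THRESHOLD = 2.0


def received_bps(iperf3_json: str) -> float:
    """Receiver-side throughput in bits per second from `iperf3 --json` output."""
    return json.loads(iperf3_json)["end"]["sum_received"]["bits_per_second"]


def looks_window_limited(single_json: str, multi_json: str) -> bool:
    """True when parallel streams deliver far more than a single stream."""
    speedup = received_bps(multi_json) / received_bps(single_json)
    return speedup >= SPEEDUP_THRESHOLD


# Synthetic results standing in for real iperf3 output.
single = json.dumps({"end": {"sum_received": {"bits_per_second": 52e6}}})
multi = json.dumps({"end": {"sum_received": {"bits_per_second": 410e6}}})
print(looks_window_limited(single, multi))
```

A large speedup is consistent with a window limit but does not prove it; packet loss limiting the congestion window of a single flow produces a similar pattern.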
Step 4 — Verify TCP window scaling is negotiated
# Capture a connection with tcpdump and check SYN options
tcpdump -i eth0 -w /tmp/cap.pcap host remote-host.com &
curl -s https://remote-host.com/largefile > /dev/null
kill %1
# Inspect both the SYN and the SYN-ACK: window scaling is only active if each carries a wscale option
tcpdump -n -r /tmp/cap.pcap -v | grep -i 'wscale'
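The wscale value is a shift count applied to the 16-bit window field in that host's later segments, so it determines the largest window the host can ever advertise. A small sketch for reading it out of a tcpdump -v line follows; the sample line uses documentation addresses and is made up for illustration.

```python
import re

WSCALE_RE = re.compile(r"wscale (\d+)")


def syn_window_info(tcpdump_line: str):
    """Return (shift, max_window_bytes) for a SYN line, or None if wscale is absent."""
    match = WSCALE_RE.search(tcpdump_line)
    if match is None:
        return None
    shift = int(match.group(1))
    # The 16-bit window field (max 65535) is left-shifted by the scale factor.
    return shift, 65535 << shift


# Made-up example line in the style of `tcpdump -v` output.
line = (
    "203.0.113.5.51514 > 198.51.100.7.443: Flags [S], seq 1, win 64240, "
    "options [mss 1460,sackOK,TS val 1 ecr 0,nop,wscale 7], length 0"
)
print(syn_window_info(line))
```

A result of None for either the SYN or the SYN-ACK means that connection is limited to 65,535-byte windows in both directions.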
Step 5 — Check for middlebox interference
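# Record a baseline single-connection download speed (bytes/sec) through the path
curl -o /dev/null -w '%{speed_download}\n' https://remote-host.com/bigfile
# Check whether the window scaling option survives the path: capture incoming handshakes on the
# remote host and compare their options with the capture taken on the sender in Step 4
tcpdump -i eth0 -n -v -l tcp | grep -i 'wscale'
# Repeat the test with a different DSCP marking in case a middlebox treats traffic classes differently
iperf3 -c remote-host.com -S 0x28 # TOS 0x28 = DSCP 10 (AF11)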
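The iperf3 -S option takes the full TOS byte rather than the DSCP code point, which is a common source of confusion. DSCP occupies the upper six bits of that byte, so the conversion is a two-bit shift. The helper names below are illustrative.

```python
def tos_to_dscp(tos: int) -> int:
    """DSCP is the upper six bits of the 8-bit TOS / Traffic Class byte."""
    return (tos & 0xFF) >> 2


def dscp_to_tos(dscp: int) -> int:
    """Place a six-bit DSCP value into the TOS byte (ECN bits left at zero)."""
    return (dscp & 0x3F) << 2


print(tos_to_dscp(0x28))     # DSCP for the iperf3 example above
print(hex(dscp_to_tos(10)))  # TOS byte for AF11 (DSCP 10)
```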
Step 6 — Profile the congestion control algorithm in use
# Check which TCP congestion control algorithm is active
sysctl net.ipv4.tcp_congestion_control
# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
Solution
Fix 1 — Increase TCP socket buffer sizes
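Apply the following on both sender and receiver. The sysctl -w commands take effect immediately; the drop-in file under /etc/sysctl.d/ makes them persistent: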
# Calculate required buffer: BDP = bandwidth (bytes/s) x RTT (seconds)
# Example: 10 Gbps link, 100ms RTT: BDP = 1,250,000,000 x 0.1 = 125 MB
# Set maximum socket buffer to cover the BDP (double for headroom)
sudo sysctl -w net.core.rmem_max=268435456 # 256 MB
sudo sysctl -w net.core.wmem_max=268435456 # 256 MB
sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 268435456'
sudo sysctl -w net.ipv4.tcp_wmem='4096 65536 268435456'
sudo sysctl -w net.ipv4.tcp_window_scaling=1 # Ensure window scaling is enabled
sudo sysctl -w net.ipv4.tcp_timestamps=1 # Enables PAWS and precise RTT estimation, both useful with large windows
sudo sysctl -w net.ipv4.tcp_sack=1 # Selective ACK for efficiency
# Persist the settings
cat << 'EOF' | sudo tee -a /etc/sysctl.d/99-tcp-tuning.conf
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
EOF
sudo sysctl -p /etc/sysctl.d/99-tcp-tuning.conf
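For links other than the 10 Gbps, 100 ms example, the buffer limits can be derived from the BDP instead of being copied. The sketch below doubles the BDP for headroom, as above, and then rounds up to a power of two; the rounding is only an assumption made here so the result matches the 256 MiB figure, not a kernel requirement.

```python
import math


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def tcp_tuning_conf(bandwidth_bps: float, rtt_seconds: float, headroom: float = 2.0) -> str:
    """Build sysctl.d lines sized to headroom x BDP, rounded up to a power of two."""
    bdp_bytes = bandwidth_bps / 8 * rtt_seconds
    max_buffer = next_power_of_two(math.ceil(bdp_bytes * headroom))
    return "\n".join([
        f"net.core.rmem_max = {max_buffer}",
        f"net.core.wmem_max = {max_buffer}",
        f"net.ipv4.tcp_rmem = 4096 87380 {max_buffer}",
        f"net.ipv4.tcp_wmem = 4096 65536 {max_buffer}",
    ]) + "\n"


# 10 Gbps link at 100 ms RTT
print(tcp_tuning_conf(10e9, 0.100), end="")
```

The output can be reviewed and written to a file under /etc/sysctl.d/ by hand; very large limits increase the memory a single connection may consume, so they are worth sizing to the real path rather than to a worst case.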
Fix 2 — Switch to BBR congestion control
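BBR (Bottleneck Bandwidth and RTT) often fills high-BDP links more effectively than CUBIC, particularly when the path has occasional packet loss. It does not remove a window or buffer limit, so apply it together with Fix 1:
# Check kernel version (BBR requires 4.9+)
uname -r
# Enable BBR (load the tcp_bbr module first if bbr is not listed in net.ipv4.tcp_available_congestion_control)
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sudo sysctl -w net.core.default_qdisc=fq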
# Persist
echo "net.ipv4.tcp_congestion_control = bbr" | sudo tee -a /etc/sysctl.d/99-tcp-tuning.conf
echo "net.core.default_qdisc = fq" | sudo tee -a /etc/sysctl.d/99-tcp-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-tcp-tuning.conf
# Verify
sysctl net.ipv4.tcp_congestion_control
Fix 3 — Investigate and fix middlebox window scaling stripping
If tcpdump confirms the Window Scale option is missing from SYN or SYN-ACK packets after they cross the firewall:
# On a Linux firewall (iptables), check the mangle table for rules that rewrite TCP options
# The TCPOPTSTRIP target removes options such as wscale; TCPMSS only rewrites the MSS value
iptables -t mangle -S | grep -i "TCPOPTSTRIP\|TCPMSS"
# If MSS clamping is in use, make sure it is not pinned to an overly small fixed value
# Correct: clamp to PMTU
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
Fix 4 — Use QUIC/HTTP3 for application-layer transfers
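QUIC (used by HTTP/3) runs over UDP and implements its own flow control, so it is not subject to TCP window scaling or to middleboxes that strip TCP options. Its own flow-control windows and UDP socket buffers still have to be large enough for the BDP:
# Test HTTP/3 download speed (requires a curl build with HTTP/3 support and a server that speaks QUIC)
curl --http3 -o /dev/null -w '%{speed_download}' https://remote-host.com/largefile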
Verification
# After applying fixes, re-run iperf3 single stream test
iperf3 -c remote-host.com -t 60
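# Expected: single-stream throughput should now approach the link rate (BDP / RTT)
# With a 256 MiB buffer and 100 ms RTT the ceiling is about 268 MB / 0.1 s = 2.7 GB/s (roughly 21 Gbps); the usable window is smaller because the kernel reserves part of the buffer for overhead, but the ceiling remains above the 10 Gbps link speed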
Prevention
- Apply TCP buffer tuning (large rmem/wmem) as a standard baseline in all server provisioning playbooks, especially for WAN-facing servers
- Enable BBR congestion control by default on servers that serve intercontinental traffic or large file downloads
- Audit firewall and NAT device configurations to ensure they do not strip TCP timestamp or window scaling options
- Include iperf3 single-stream vs multi-stream benchmarks in deployment tests to detect window scaling issues before production
- Consider QUIC or HTTP/3 for applications with strict throughput requirements on high-latency links, since QUIC's flow control is independent of TCP window scaling and of middleboxes that strip TCP options