
Complete Network Performance Audit Checklist

A systematic network performance audit covering baseline measurement, bandwidth testing, latency mapping, packet loss analysis, DNS, TLS, HTTP/2, and TCP tuning.

Why Network Performance Audits Matter

A network performance audit answers a simple but important question: is the network performing as well as it could? Without a baseline, you cannot know whether a 200ms page load is normal for your infrastructure or a regression caused by a configuration change. This audit provides a repeatable process for measuring, analyzing, and documenting network performance across every layer of the stack.

Baseline Measurement

Before optimizing anything, capture a baseline. This is your reference point for future comparisons.

# Record the date, time, and network conditions
LOG="audit-$(date +%Y%m%d).log"   # set once so every line lands in the same file
echo "Audit start: $(date)" > "$LOG"
echo "Host: $(hostname)" >> "$LOG"
echo "Public IP: $(curl -s ifconfig.me)" >> "$LOG"

# Interface statistics
ip -s link show
netstat -s | head -50   # Cumulative counters since boot

# System load at time of audit
uptime
vmstat 1 5

Run audits during both peak and off-peak hours to capture the effect of network load. A 10am weekday measurement and a 2am Sunday measurement will often differ significantly.

Bandwidth Testing

Measure available bandwidth at multiple layers: from the server's perspective, from the client's perspective, and at the application layer.

# Server-to-server bandwidth (iperf3)
# On server (listen mode)
iperf3 -s

# On client (connect and test)
iperf3 -c server-ip -t 30          # 30-second TCP test
iperf3 -c server-ip -u -b 100M     # UDP at 100 Mbps
iperf3 -c server-ip -P 4           # 4 parallel streams (saturate more capacity)

# Example iperf3 output:
# [SUM] 0.00-30.00 sec  3.32 GBytes  950 Mbits/sec    sender
# [SUM] 0.00-30.00 sec  3.32 GBytes  949 Mbits/sec    receiver

# From external perspective (Speedtest CLI)
speedtest-cli --simple
# Or
fast --upload

# Web server throughput (single file download benchmark)
# Generate a test file
dd if=/dev/urandom of=/var/www/html/100mb.bin bs=1M count=100
# Download from remote:
curl -o /dev/null -s -w "Speed: %{speed_download} bytes/s\n" \
  https://yourserver.com/100mb.bin

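Document both TCP and UDP throughput. A UDP test has no retransmission or congestion control, so the loss and jitter it reports at the offered rate (-b) show what the path can actually carry; raise -b until loss appears to find the raw pipe capacity. TCP throughput additionally reflects protocol overhead and congestion control behavior.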
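
curl reports speed_download in bytes per second, while iperf3 reports Mbits/sec, so the two figures need a unit conversion before they can be compared. A minimal sketch, assuming a hypothetical helper named to_mbps (not part of curl or iperf3):

```shell
# Hypothetical helper: convert curl's speed_download (bytes/s) to Mbit/s
# so it can be compared with iperf3's Mbits/sec figures.
to_mbps() {
    awk -v bps="$1" 'BEGIN { printf "%.1f\n", bps * 8 / 1000000 }'
}

# Example: 118750000 bytes/s corresponds to 950.0 Mbit/s
to_mbps 118750000

# Usage with a live download (requires network):
#   to_mbps "$(curl -o /dev/null -s -w '%{speed_download}' https://yourserver.com/100mb.bin)"
```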

Latency Mapping

Map latency to all critical endpoints your application depends on.

# Ping with statistics
ping -c 100 -i 0.2 target.host

# Example output analysis
# rtt min/avg/max/mdev = 1.234/1.456/3.789/0.234 ms
# mdev (mean deviation) indicates jitter — important for real-time applications

# Measure latency to each dependency
for host in google.com github.com api.yourprovider.com db.internal; do
    echo -n "$host: "
    ping -c 20 -q "$host" | grep rtt
done

# Latency to your CDN edge
curl -w "time_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\n" \
     -o /dev/null -s https://cdn.yourdomain.com/

# Comprehensive path latency with mtr
mtr --report --report-cycles 50 --no-dns target.host

Latency benchmarks for reference:

| Connection Type | Typical Latency (RTT) |
|-----------------|-----------------------|
| Same datacenter | < 1 ms |
| Same city, different DC | 1-5 ms |
| Cross-country (US) | 40-80 ms |
| Trans-Atlantic | 80-120 ms |
| Trans-Pacific | 150-200 ms |
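
To log these measurements alongside the reference values, the summary line that ping prints can be split into fields. The sketch below assumes the Linux iputils format ("rtt min/avg/max/mdev = ..."); other platforms label the line differently, and parse_rtt is a hypothetical helper name.

```shell
# Hypothetical helper: extract average RTT and mdev (ms) from ping's summary line,
# e.g. "rtt min/avg/max/mdev = 1.234/1.456/3.789/0.234 ms" (Linux iputils format).
# Splitting on "/" and " " puts avg in field 8 and mdev in field 10.
parse_rtt() {
    awk -F'[/ ]' '/^rtt/ { printf "avg=%s mdev=%s\n", $8, $10 }'
}

# Usage (requires network):
#   ping -c 20 -q target.host | parse_rtt
```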

Packet Loss Analysis

Even 0.1% packet loss can significantly degrade TCP throughput due to retransmissions and congestion window reduction.
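
A rough way to see why small loss rates matter is the Mathis et al. approximation, which bounds steady-state TCP throughput at about (MSS / RTT) × (1.22 / √loss). It models classic loss-based congestion control, so CUBIC and BBR will not follow it exactly; treat the result as an order-of-magnitude estimate. The MSS of 1460 bytes and RTT of 80 ms below are illustrative assumptions.

```shell
# Rough TCP throughput ceiling under loss (Mathis et al. approximation):
#   throughput <= (MSS / RTT) * (1.22 / sqrt(loss))
# Arguments: MSS in bytes, RTT in ms, loss as a fraction (0.001 = 0.1%).
mathis_mbps() {
    awk -v mss="$1" -v rtt_ms="$2" -v loss="$3" 'BEGIN {
        rtt = rtt_ms / 1000
        printf "%.2f\n", (mss * 8 / rtt) * (1.22 / sqrt(loss)) / 1000000
    }'
}

mathis_mbps 1460 80 0.001    # 0.1% loss at 80 ms RTT: prints 5.63 (Mbit/s)
```

Under these assumptions a single flow on a cross-country path tops out at a few Mbit/s, regardless of how fast the link is.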

# Extended packet loss test
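sudo ping -c 1000 -i 0.01 target.host | grep -E "transmitted|loss"
# -i 0.01 = 100 pings per second; intervals this short may require root,
# depending on the ping version. Only run this against hosts you control.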

# mtr for continuous monitoring
mtr --report --report-cycles 100 target.host

# Monitor interface error counters
watch -n 1 'ip -s link show eth0 | grep -A1 -E "RX:|TX:"'
# The errors and dropped columns follow the bytes and packets counters

# Check TCP retransmission rate
netstat -s | grep -E "retransmit|failed"
ss -s   # socket statistics summary

# Kernel TCP statistics
grep '^Tcp:' /proc/net/snmp
# The first Tcp: line is the header, the second holds the values.
# Relevant columns include InSegs, OutSegs, RetransSegs, InErrs, OutRsts
# RetransSegs/OutSegs * 100 = retransmission percentage
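
The retransmission percentage can be computed directly from /proc/net/snmp. The sketch below looks columns up by header name rather than position, since the column order is not something to rely on; retrans_pct is a hypothetical helper name. The counters are cumulative since boot, so sample before and after a test to get the rate for that interval.

```shell
# Compute RetransSegs / OutSegs * 100 from /proc/net/snmp.
# Optional argument: path to a file in the same format (defaults to /proc/net/snmp).
retrans_pct() {
    awk '
        $1 == "Tcp:" && !seen {          # first Tcp: line is the header
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1
            next
        }
        $1 == "Tcp:" {                   # second Tcp: line holds the values
            out = $(col["OutSegs"]); re = $(col["RetransSegs"])
            if (out > 0) printf "%.3f\n", re / out * 100
        }
    ' "${1:-/proc/net/snmp}"
}

# Usage on a Linux host:
#   retrans_pct
```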

Causes of packet loss by layer:

| Layer | Cause | How to Identify |
|-------|-------|-----------------|
| Physical | Bad cable, SFP | Interface error counters |
| Link | Duplex mismatch | Collisions in interface stats |
| Network | Congested link | Loss at specific hop in mtr |
| Transport | Application bug | Loss only for specific service |

DNS Performance

DNS lookup time adds directly to page load time. Each uncached DNS lookup blocks the connection from starting.

# Measure DNS resolution time for your domains
time dig yourdomain.com @8.8.8.8
time dig yourdomain.com @1.1.1.1
time dig yourdomain.com @YOUR_CURRENT_RESOLVER

# DNSPerf-style batch test
for domain in yourdomain.com api.yourdomain.com cdn.yourdomain.com; do
    echo -n "$domain: "
    dig +stats $domain @1.1.1.1 | grep "Query time"
done

# Test from server (eliminates client-side caching)
time dig yourdomain.com

# Check DNSSEC validation performance
dig +dnssec yourdomain.com @8.8.8.8 | grep "Query time"

# DNS propagation check (compare multiple resolvers)
for ns in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
    echo -n "NS $ns: "
    dig +short yourdomain.com @$ns
done

Target DNS performance:

| Resolver Location | Target Query Time |
|-------------------|-------------------|
| Authoritative NS | < 5 ms (within DC) |
| Recursive resolver (cached) | < 1 ms (local) |
| Recursive resolver (uncached) | < 50 ms |
| Global CDN DNS (Cloudflare/AWS) | < 5 ms worldwide |
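
Checking measured query times against these targets is easy to script. A minimal sketch, assuming dig's usual ";; Query time: N msec" line and a hypothetical helper named check_query_time:

```shell
# Hypothetical helper: read dig output on stdin, pull out the "Query time"
# value (ms) and compare it with the target passed as the first argument.
check_query_time() {
    awk -v target="$1" '/Query time:/ {
        status = ($4 < target) ? "PASS" : "FAIL"
        printf "%s ms (target < %s ms): %s\n", $4, target, status
    }'
}

# Usage (requires network):
#   dig yourdomain.com @1.1.1.1 | check_query_time 50
```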

TLS Handshake Timing

TLS handshake time directly impacts TTFB (Time to First Byte) on every new connection. A slow handshake can add hundreds of milliseconds before the first request is even sent.

# Measure TLS handshake time
curl -w "time_namelookup:    %{time_namelookup}s\n\
time_connect:       %{time_connect}s\n\
time_appconnect:    %{time_appconnect}s\n\
time_pretransfer:   %{time_pretransfer}s\n\
time_redirect:      %{time_redirect}s\n\
time_starttransfer: %{time_starttransfer}s\n\
time_total:         %{time_total}s\n" \
  -o /dev/null -s https://yourdomain.com

# TLS handshake time = time_appconnect - time_connect
# Target: < 50ms on same continent, < 200ms trans-oceanic

# Test TLS protocol and cipher negotiation
# (</dev/null closes the session after the handshake instead of waiting for input)
openssl s_client -connect yourdomain.com:443 -tls1_3 -brief </dev/null
openssl s_client -connect yourdomain.com:443 -tls1_2 -brief </dev/null

# Check TLS session resumption (eliminates full handshake on repeat visits)
openssl s_client -connect yourdomain.com:443 -reconnect </dev/null 2>/dev/null | grep -E "Session-ID|Reused"
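
The subtraction time_appconnect - time_connect is simple to automate so the handshake time can be logged in milliseconds. A sketch with a hypothetical helper, tls_handshake_ms; the input values in the example are made up for illustration.

```shell
# Hypothetical helper: TLS handshake time in ms from curl's
# time_connect and time_appconnect values (both reported in seconds).
tls_handshake_ms() {
    awk -v c="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - c) * 1000 }'
}

tls_handshake_ms 0.012345 0.047345    # prints 35.0

# Usage with a live measurement (requires network and bash):
#   read -r c a < <(curl -o /dev/null -s -w '%{time_connect} %{time_appconnect}' https://yourdomain.com)
#   tls_handshake_ms "$c" "$a"
```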

Optimizations that reduce TLS overhead:

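| Optimization | Impact |
|--------------|--------|
| TLS 1.3 | Removes 1 round-trip from the full handshake (0-RTT possible on resumption) |
| Session resumption | Skips full handshake on reconnect |
| OCSP stapling | Eliminates OCSP lookup by client |
| Certificate size | Smaller certs = faster handshake |
| ECDSA vs RSA | ECDSA P-256 keys and signatures are several times smaller than RSA-2048 ones, which shrinks the certificate chain |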

HTTP/2 Multiplexing

HTTP/2 multiplexes multiple requests over a single TCP connection, eliminating head-of-line blocking at the HTTP layer. Verify that your server actually supports it and is configured correctly.

# Check HTTP version in use
curl -sv --http2 -o /dev/null https://yourdomain.com 2>&1 | grep "< HTTP"
# "< HTTP/2 200" = HTTP/2 working
# "< HTTP/1.1 200" = HTTP/2 not negotiated

# Use h2load for HTTP/2 specific benchmarking
# apt install nghttp2-client
h2load -n 1000 -c 10 -m 100 https://yourdomain.com/
# -n 1000 = 1000 requests, -c 10 = 10 clients, -m 100 = 100 multiplexed streams

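# Verify HTTP/2 Server Push (if configured). The curl command-line tool does not
# report pushes, so use nghttp (from nghttp2-client). Note that major browsers
# such as Chrome have removed push support.
nghttp -nv https://yourdomain.com | grep PUSH_PROMISE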

# Check ALPN negotiation
openssl s_client -connect yourdomain.com:443 -alpn h2 </dev/null 2>/dev/null | grep ALPN
# "ALPN protocol: h2" = HTTP/2 TLS negotiation working

HTTP/2 pays off most when many small requests run concurrently over one connection. For pages and APIs that fetch many resources, multiplexing removes most of the "connection per resource" overhead.
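
For scripted checks, curl's %{http_version} write-out variable is easier to parse than verbose output. The sketch below interprets its value; describe_http_version is a hypothetical helper name.

```shell
# Hypothetical helper: interpret the value of curl's %{http_version} write-out variable.
describe_http_version() {
    case "$1" in
        2)   echo "HTTP/2 negotiated" ;;
        3)   echo "HTTP/3 negotiated" ;;
        1.*) echo "HTTP/2 not negotiated (HTTP/$1)" ;;
        *)   echo "unknown version: $1" ;;
    esac
}

# Usage (requires network):
#   describe_http_version "$(curl -s -o /dev/null --http2 -w '%{http_version}' https://yourdomain.com)"
```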

TCP Window Sizing

The TCP receive window limits how much data can be in-flight before waiting for acknowledgment. On high-bandwidth, high-latency links (satellite, trans-oceanic), small windows become the bottleneck.

# Check current TCP buffer settings
cat /proc/sys/net/ipv4/tcp_rmem   # read buffer: min, default, max
cat /proc/sys/net/ipv4/tcp_wmem   # write buffer: min, default, max
cat /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_max

# Required buffer size = Bandwidth × RTT (Bandwidth-Delay Product)
# Example: 1 Gbps link, 200ms RTT
# BDP = 1,000,000,000 bps * 0.200s = 200,000,000 bits = 25 MB

# Tune TCP buffers for high-BDP links
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_max=134217728

# Verify TCP window scaling is enabled (should be on by default)
sysctl net.ipv4.tcp_window_scaling
# Should return: net.ipv4.tcp_window_scaling = 1

# Check if BBR congestion control is available (often better on high-latency paths)
sysctl net.ipv4.tcp_available_congestion_control
# If bbr is not listed, the kernel module may need loading first: sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

# Make permanent in /etc/sysctl.d/99-tcp-tune.conf
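
One way to make the settings persistent is a sysctl.d drop-in file. The sketch below writes the same values used above to a path you choose, so the file can be reviewed before it is installed; write_tcp_tune is a hypothetical helper name, and the bbr line assumes the module is available on your kernel.

```shell
# Sketch: write the tuning values to a sysctl.d-style file.
# Argument: output path (write to a scratch location first, then install as root).
write_tcp_tune() {
    cat > "$1" <<'EOF'
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = bbr
EOF
}

# Usage:
#   write_tcp_tune /tmp/99-tcp-tune.conf
#   sudo install -m 644 /tmp/99-tcp-tune.conf /etc/sysctl.d/99-tcp-tune.conf
#   sudo sysctl --system    # reload all sysctl configuration files
```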

Verify the tuning worked with iperf3 — the -P 4 option (4 parallel streams) helps fully utilize a high-BDP connection that a single stream may not saturate.
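
To size buffers for a specific path rather than reusing the example values, the bandwidth-delay product can be computed from the measured bandwidth and RTT. A minimal sketch using integer shell arithmetic; bdp_bytes is a hypothetical helper name.

```shell
# Hypothetical helper: bandwidth-delay product in bytes.
# Arguments: bandwidth in bits per second, RTT in milliseconds.
bdp_bytes() {
    echo $(( $1 * $2 / 1000 / 8 ))
}

bdp_bytes 1000000000 200    # 1 Gbps at 200 ms RTT: prints 25000000 (25 MB)
```

The maximum values in tcp_rmem and tcp_wmem should be at least this large for a single stream to fill the link.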

Final Report Template

Document your findings consistently to enable trend analysis over time.

# Network Performance Audit Report
**Date**: 2026-02-26
**Auditor**: ops-team
**Environment**: production / apps-us

## Summary
| Metric | Measured | Target | Status |
|--------|---------|--------|--------|
| Bandwidth (to nearest PoP) | 850 Mbps | >800 Mbps | PASS |
| RTT to CDN edge | 3.2 ms | <10 ms | PASS |
| Packet loss (1000 pings) | 0.0% | <0.1% | PASS |
| DNS query time (cached) | 0.4 ms | <1 ms | PASS |
| TLS handshake | 35 ms | <50 ms | PASS |
| HTTP/2 enabled | Yes | Yes | PASS |
| TCP congestion control | cubic | bbr | FAIL |

## Issues Found
1. TCP congestion control is `cubic`, not `bbr`; throughput on trans-oceanic paths is likely suboptimal
2. DNS TTL for api.yourdomain.com is 60s — extend to 300s to reduce resolver load

## Recommendations
1. Enable BBR: `sysctl -w net.ipv4.tcp_congestion_control=bbr`
2. Update DNS TTL to 300 seconds for all A/AAAA records

## Raw Data
[Attach iperf3 logs, mtr reports, curl timing outputs]

Store audit reports in version control alongside runbooks. Compare consecutive audits to catch performance regressions before users notice them.
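
To keep reports comparable between audits, rows of the summary table can be generated rather than typed by hand. The sketch below handles "lower is better" metrics only; report_row is a hypothetical helper name, and the row layout follows the template above.

```shell
# Hypothetical helper: emit one summary-table row for a "lower is better" metric.
# Arguments: metric name, measured value, target value, unit.
report_row() {
    awk -v name="$1" -v m="$2" -v t="$3" -v u="$4" 'BEGIN {
        status = (m < t) ? "PASS" : "FAIL"
        printf "| %s | %s %s | <%s %s | %s |\n", name, m, u, t, u, status
    }'
}

report_row "TLS handshake" 35 50 ms    # prints: | TLS handshake | 35 ms | <50 ms | PASS |
```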