Packet Loss During Peak Hours

Users report intermittent connectivity drops, slow page loads, and degraded quality during evenings or business hours, but performance returns to normal during off-peak times. The pattern strongly suggests congestion-related packet loss on an upstream link, local network segment, or ISP backhaul that becomes saturated under high traffic load.

Symptoms

  • Ping packet loss of 1-10% that is absent outside peak hours
  • MTR shows loss concentrated at one or two specific hops in the path
  • TCP retransmissions increase significantly during peak periods (visible in netstat/ss)
  • Throughput degrades during peak hours but recovers automatically at off-peak times
  • VoIP and video calls break up or disconnect during peak windows
  • Users on the same ISP or network segment all report the issue simultaneously

Possible Root Causes

  • ISP backhaul or transit link congestion during peak hours — the ISP has insufficient capacity for peak demand
  • Local network switch or router interface reaching bandwidth saturation, so packets are tail-dropped from its full egress queue
  • A single application or server consuming disproportionate bandwidth during peak times (backup jobs, video streams, P2P)
  • QoS not configured — all traffic treated equally, allowing bulk transfers to starve latency-sensitive flows
  • Shared infrastructure (co-location, cloud provider) with noisy neighbours consuming shared bandwidth

Diagnosis Steps

Step 1 — Confirm the time-correlated pattern

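# Run continuous pings to the default gateway and an external target, logging each to a file
# (ts is part of moreutils; -O makes ping print "no answer yet" for every lost probe)
GATEWAY=$(ip route show default | awk '{print $3; exit}')
ping -O -i 1 -W 1 "$GATEWAY" | ts '%Y-%m-%d %H:%M:%S' >> /tmp/ping_gateway.txt &
ping -O -i 1 -W 1 8.8.8.8 | ts '%Y-%m-%d %H:%M:%S' >> /tmp/ping_log.txt &

# Run during peak and off-peak hours and compare
# After collecting data, count the lost probes
grep -c "no answer yet" /tmp/ping_log.txt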
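
A raw count is hard to compare across time windows, so the log can be summarised per hour instead. The following is a minimal sketch, not a definitive tool: it assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (as added by `ts`) and that ping was run with `-O`, so each lost probe appears as a "no answer yet" line. The function name `ping_loss_by_hour` is hypothetical.

```shell
# Summarise probes and loss per hour from a timestamped ping log.
# Assumption: lines look like "2024-01-01 19:00:01 64 bytes from ..." for replies
# and "2024-01-01 19:00:04 no answer yet for icmp_seq=3" for lost probes.
ping_loss_by_hour() {
    awk '
        /bytes from/    { ok[$1 " " substr($2, 1, 2)]++ }
        /no answer yet/ { lost[$1 " " substr($2, 1, 2)]++ }
        END {
            for (h in ok)   seen[h] = 1
            for (h in lost) seen[h] = 1
            for (h in seen) {
                total = ok[h] + lost[h]
                printf "%s:00 probes=%d lost=%d loss=%.1f%%\n", h, total, lost[h], 100 * lost[h] / total
            }
        }
    ' "$1" | sort
}
```

Running `ping_loss_by_hour /tmp/ping_log.txt` prints one line per hour, which makes a peak-hour loss pattern easy to spot.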

Step 2 — Isolate the congested hop with MTR

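# Run MTR during peak hours and save the report
mtr --report --report-cycles 100 --interval 1 8.8.8.8 > /tmp/mtr_peak.txt

# Repeat with identical options during off-peak hours, then compare the two reports
mtr --report --report-cycles 100 --interval 1 8.8.8.8 > /tmp/mtr_offpeak.txt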

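Note the first hop at which loss appears and then persists through every later hop to the destination; that hop marks the congested segment. Loss shown at a single intermediate hop that disappears further along the path usually reflects ICMP rate limiting on that router, not real packet loss.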
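
Picking out the relevant hop can be automated. The sketch below walks a saved report backwards from the destination and prints the earliest hop from which loss is sustained all the way to the end. It assumes the standard `mtr --report` column layout (hop number, host, Loss%); the function name `first_lossy_hop` and the 1% default threshold are illustrative assumptions.

```shell
# Print the first hop from which loss persists to the final hop.
# Usage: first_lossy_hop REPORT_FILE [MIN_LOSS_PERCENT]
first_lossy_hop() {
    awk -v min="${2:-1}" '
        /^ *[0-9]+[.][|]--/ {
            n++
            hop[n]  = $1 + 0   # "4.|--" converts to 4
            host[n] = $2
            loss[n] = $3 + 0   # "6.0%" converts to 6
        }
        END {
            first = 0
            # Walk back from the destination while loss stays at or above the threshold
            for (i = n; i >= 1 && loss[i] >= min; i--) first = i
            if (first) printf "hop %d (%s) loss=%.1f%%\n", hop[first], host[first], loss[first]
            else print "no sustained loss"
        }
    ' "$1"
}
```

For example, `first_lossy_hop /tmp/mtr_offpeak.txt 2` checks the off-peak report with a 2% threshold. A hop that shows loss in isolation is skipped because the walk stops at the first loss-free hop it meets.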

Step 3 — Check local interface utilisation

# Monitor interface utilisation in real-time
sar -n DEV 1 60

# Or use nload/iftop for visual bandwidth usage
nload eth0
iftop -i eth0

# Check interface errors and drops
ip -s link show eth0
ethtool -S eth0 | grep -i 'drop\|miss\|error\|overflow'
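
If none of these tools is installed, utilisation can be estimated directly from the kernel's byte counters. This is a rough sketch assuming a Linux host that exposes `/sys/class/net/<iface>/statistics` and a physical interface that reports its negotiated speed in `/sys/class/net/<iface>/speed` (virtual interfaces may report -1 or nothing). Both function names are hypothetical.

```shell
# Pure helper: percentage of link capacity used between two byte-counter samples.
# Arguments: BYTES_BEFORE BYTES_AFTER SECONDS LINK_SPEED_MBIT
util_pct() {
    awk -v b0="$1" -v b1="$2" -v secs="$3" -v mbit="$4" \
        'BEGIN { printf "%.1f\n", (b1 - b0) * 8 / secs / (mbit * 1000000) * 100 }'
}

# Sample rx/tx counters twice and report utilisation for each direction.
# Usage: iface_util eth0 [SECONDS]
iface_util() {
    iface=$1
    secs=${2:-5}
    stats=/sys/class/net/$iface/statistics
    speed=$(cat "/sys/class/net/$iface/speed")   # negotiated speed in Mbit/s
    rx0=$(cat "$stats/rx_bytes"); tx0=$(cat "$stats/tx_bytes")
    sleep "$secs"
    rx1=$(cat "$stats/rx_bytes"); tx1=$(cat "$stats/tx_bytes")
    echo "rx $(util_pct "$rx0" "$rx1" "$secs" "$speed")%  tx $(util_pct "$tx0" "$tx1" "$secs" "$speed")%"
}
```

Calling `iface_util eth0 10` during a peak window gives a quick reading; sustained values close to 100% in either direction point to the local interface as the bottleneck.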

Step 4 — Check TCP retransmission rate

# Watch the system-wide TCP retransmission counter (nstat is part of iproute2)
watch -n 1 'nstat -az TcpRetransSegs'
netstat -s | grep -i retransmit

# For a more detailed view
ss -tin dst your-server.com | grep -i retrans
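
Raw counters are easier to interpret as a ratio of retransmitted to sent segments. The sketch below derives that ratio from the `Tcp:` header and value lines in `/proc/net/snmp`, assuming a Linux host; the function name `tcp_retrans_pct` is hypothetical. The counters are cumulative since boot, so for a peak versus off-peak comparison it is better to sample twice and work with the differences.

```shell
# Print retransmitted segments as a percentage of all segments sent.
# Usage: tcp_retrans_pct [SNMP_FILE]   (defaults to /proc/net/snmp)
tcp_retrans_pct() {
    awk '
        # First "Tcp:" line holds the column names; remember each position
        $1 == "Tcp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        # Second "Tcp:" line holds the values
        $1 == "Tcp:" {
            out = $(col["OutSegs"])
            retrans = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", 100 * retrans / out
            else print "0.00"
        }
    ' "${1:-/proc/net/snmp}"
}
```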

Step 5 — Identify top bandwidth consumers

# Find which processes are consuming bandwidth
nethogs eth0

# Find which connections have the highest throughput
iftop -i eth0 -n -P

# Capture 60 seconds of peak-hour traffic to a file, then analyse it offline to see whether a single host dominates
tcpdump -i eth0 -w /tmp/peak_capture.pcap -G 60 -W 1
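
One way to analyse the capture offline is to rank source addresses by bytes sent. The following sketch reads the text output of `tcpdump -nn -q -r` on standard input and assumes the usual quiet-mode IPv4 layout, in which the third field is the source (address plus port) and the last field is the payload length. The function name `top_talkers` is hypothetical, and IPv6 lines are ignored.

```shell
# Rank IPv4 source addresses by total payload bytes.
# Usage: tcpdump -nn -q -r /tmp/peak_capture.pcap | top_talkers [COUNT]
top_talkers() {
    awk '
        $2 == "IP" && $NF ~ /^[0-9]+$/ {
            # Source looks like "10.0.0.5.443"; keep only the four address octets
            split($3, a, ".")
            bytes[a[1] "." a[2] "." a[3] "." a[4]] += $NF
        }
        END { for (ip in bytes) printf "%d %s\n", bytes[ip], ip }
    ' | sort -rn | head -n "${1:-10}"
}
```

A single address far ahead of the rest at the top of this list is a strong candidate for the bandwidth hog.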

Step 6 — Check ISP link utilisation

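# Measure achievable throughput to an external iperf3 server (run off-peak for a baseline, then again at peak)
iperf3 -c iperf.he.net -t 30 -R   # Download test (server sends)
iperf3 -c iperf.he.net -t 30       # Upload test (client sends)

# Show the negotiated NIC speed; this is the local port rate and may be higher than the rate provisioned by your ISP
ethtool eth0 | grep Speed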

Solution

Step 1 — Implement QoS traffic shaping

Use tc (traffic control) to prioritise latency-sensitive traffic and rate-limit bulk flows:

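# Create HTB qdisc on egress interface; unclassified traffic falls into the normal class (1:20),
# so bulk traffic must be steered into 1:30 with a filter
tc qdisc add dev eth0 root handle 1: htb default 20

# Total link bandwidth: 1Gbit (set this slightly below the real bottleneck rate so that queuing happens here, where it can be prioritised)
tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit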

# High priority class: 500Mbit (interactive/voice/video)
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 500mbit ceil 1gbit prio 1
# Normal class: 400Mbit (web, DNS)
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 400mbit ceil 1gbit prio 2
# Bulk class: 100Mbit (backups, P2P)
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 100mbit ceil 200mbit prio 3

# Add SFQ for fair queuing within each class
tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
tc qdisc add dev eth0 parent 1:30 handle 30: sfq perturb 10

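# Classify SSH, SIP signalling, and DSCP EF-marked packets (commonly used for VoIP media) as high priority
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip dport 22 0xffff flowid 1:10
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip dport 5060 0xffff flowid 1:10
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip tos 0xb8 0xfc flowid 1:10
# Send bulk traffic to the bulk class (rsync on port 873 shown as an example)
tc filter add dev eth0 protocol ip parent 1:0 prio 2 u32 match ip dport 873 0xffff flowid 1:30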
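
Once the classes are in place, per-class drop counters show whether shaping is discarding traffic where intended (ideally in the bulk class and not in the high-priority one). The sketch below parses the output of `tc -s class show dev eth0` from standard input, assuming the usual layout of a `class htb <id> ...` line followed by a `Sent ... (dropped N, ...)` line. The function name `class_drops` is hypothetical.

```shell
# Print the drop counter for each traffic class.
# Usage: tc -s class show dev eth0 | class_drops
class_drops() {
    awk '
        $1 == "class" { id = $3 }              # e.g. "class htb 1:30 parent 1:1 ..."
        $1 == "Sent" {
            for (i = 1; i < NF; i++) {
                if ($i == "(dropped") {
                    d = $(i + 1)
                    sub(/,/, "", d)            # "42," becomes "42"
                    printf "%s dropped=%d\n", id, d
                }
            }
        }
    '
}
```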

Step 2 — Reschedule bulk jobs to off-peak hours

Move backup, log shipping, and batch processing jobs away from peak windows:

# Reschedule cron jobs to off-peak (e.g., 2-5 AM)
crontab -e
# 0 2 * * * /usr/local/bin/backup.sh    # Run at 2 AM instead of business hours
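
Rescheduling does not stop someone from launching a bulk job by hand in the middle of the day. As an extra safeguard, the job script can check the clock itself. This is a minimal sketch: the function name `in_peak_window` and the 09:00 to 18:00 window in the commented example are assumptions to adapt to your own peak hours.

```shell
# Return 0 (true) if HOUR falls inside the peak window [START, END).
# Usage: in_peak_window HOUR START END    (hours 0-23, END exclusive)
in_peak_window() {
    h=${1#0}      # strip one leading zero so "08" is read as 8
    h=${h:-0}     # a bare "0" would otherwise become empty
    [ "$h" -ge "$2" ] && [ "$h" -lt "$3" ]
}

# Example guard for the top of a bulk-job script:
# if in_peak_window "$(date +%H)" 9 18; then
#     echo "Refusing to run during peak hours" >&2
#     exit 1
# fi
```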

Step 3 — Upgrade or add capacity

If ISP congestion is confirmed, escalate with the ISP, citing specific MTR evidence of their congested link. Consider:

  • Upgrading to a higher-capacity plan
  • Adding a secondary ISP for failover and load balancing
  • Using a CDN to offload bandwidth from the origin

Step 4 — Verify improvement

# After changes, re-run MTR during peak hours
mtr --report --report-cycles 100 8.8.8.8

# Monitor TCP retransmission rates
watch -n 5 'netstat -s | grep retransmit'

Prevention

  • Schedule bandwidth-intensive jobs (database dumps, log uploads, software updates) outside peak hours using cron
  • Deploy QoS policies on routers and switches to prioritise interactive traffic over bulk transfers at all times
  • Monitor interface utilisation with time-series metrics (Prometheus + node_exporter) and alert at 70% sustained utilisation
  • Negotiate SLAs with your ISP that include congestion measurements and escalation procedures
  • Use a CDN to serve static assets and cached responses, reducing the amount of traffic that must traverse the upstream link
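
The "sustained utilisation" condition in the monitoring item above can be illustrated without a full metrics stack. The sketch below is only a stand-in for a proper alerting rule: it reads one utilisation percentage per line on standard input and succeeds only if the most recent N samples are all at or above the threshold. The function name `sustained_above` is hypothetical.

```shell
# Exit 0 if the last N samples on stdin are all at or above THRESHOLD.
# Usage: sustained_above THRESHOLD N < samples.txt
sustained_above() {
    threshold=$1
    n=$2
    tail -n "$n" | awk -v t="$threshold" -v n="$n" '
        $1 + 0 >= t { hits++ }
        END { exit !(NR == n && hits == n) }   # need N samples, all over the threshold
    '
}
```

Requiring every recent sample to exceed the threshold avoids alerting on a single short burst, which is the point of a sustained threshold.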
