Packet Loss: Detection, Diagnosis & Resolution

Measuring Packet Loss

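Even 1% packet loss noticeably degrades performance. TCP must retransmit each lost packet and also treats loss as a congestion signal, cutting its sending rate. Real-time voice and video (usually carried over UDP) typically cannot wait for a retransmission, so lost packets show up directly as glitches and dropouts.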

# Measure packet loss with ping (100 packets gives 1% resolution)
ping -c 100 8.8.8.8
# Look for the summary line, e.g.: 100 packets transmitted, 98 received, 2% packet loss

# Extended test for intermittent loss: 1000 packets at a 100 ms interval
# (older ping versions only allow intervals below 0.2 s for root)
ping -c 1000 -i 0.1 8.8.8.8

# Use mtr for per-hop loss analysis
mtr --report --report-cycles 200 8.8.8.8

# Test towards multiple destinations to isolate the location
ping -c 100 8.8.8.8       # Google DNS
ping -c 100 1.1.1.1       # Cloudflare DNS
ping -c 100 192.168.1.1   # Your router
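To compare several destinations at a glance, each run can be reduced to its loss figure. The sketch below assumes the Linux iputils or macOS summary format ("... 2% packet loss"); loss_pct and compare_loss are hypothetical helper names, not standard tools.

```shell
# loss_pct (hypothetical helper): read ping output on stdin and print the loss
# percentage from the summary line, e.g. "100 packets transmitted, 98 received, 2% packet loss"
loss_pct() {
  sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# compare_loss (hypothetical helper): ping each host given as an argument
# 100 times in quiet mode (-q prints only the summary) and print "host loss%"
compare_loss() {
  for host in "$@"; do
    printf '%s %s%%\n' "$host" "$(ping -c 100 -q "$host" | loss_pct)"
  done
}
```

Usage: compare_loss 192.168.1.1 8.8.8.8 1.1.1.1 prints one "host loss%" line per destination, which maps directly onto the location table.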

Interpreting loss by location:

Where Loss Occurs	Likely Cause
Already at your router (192.168.1.1)	Local hardware, Wi-Fi signal, cable
Starting at the first hop past your router	Modem or ISP access equipment
At specific ISP hop	ISP network congestion or hardware fault
Only to certain destinations	Routing or peering issue
Random across all hops	ISP line quality or modem signal

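Note on ICMP deprioritization: Many routers rate-limit or deprioritize the ICMP replies they generate themselves, especially under load. If mtr shows loss at one hop (even 100%) but 0% at later hops and at the destination, that router is simply not answering every probe; this is not real packet loss. Real loss starts at some hop and persists through every later hop to the destination.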
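The rule that real loss persists to the destination can be checked mechanically. A heuristic sketch, assuming mtr's default --report column layout (hop, host, Loss%, ...); mtr_loss_onset is a hypothetical name. It reports the first hop of the unbroken run of lossy hops that ends at the destination, which may still include a rate-limiting hop just before the real problem.

```shell
# mtr_loss_onset (hypothetical helper): read "mtr --report" output on stdin.
# Prints "no end-to-end loss" if the final hop shows 0% loss; otherwise the
# final-hop loss and the first hop of the trailing run of lossy hops.
mtr_loss_onset() {
  awk '
    index($0, "|--") > 0 {
      n++
      hop[n] = $2                     # hostname/IP, or ??? for silent hops
      loss = $3; sub(/%/, "", loss)   # Loss% column, e.g. 12.5%
      pct[n] = loss + 0
    }
    END {
      if (n == 0) { print "no hops found"; exit 1 }
      if (pct[n] == 0) { print "no end-to-end loss"; exit 0 }
      start = n
      while (start > 1 && pct[start - 1] > 0) start--
      printf "end-to-end loss %.1f%%, starting at hop %d (%s)\n", pct[n], start, hop[start]
    }'
}
```

Usage: mtr --report --report-cycles 200 8.8.8.8 | mtr_loss_onset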

Layer 1: Cable and Hardware

Physical layer problems cause packet loss that no software change can fix.

# Check network card error counters (Linux; replace eth0 with your interface name from ip link)
ip -s link show eth0
# Look for non-zero values in:
# RX: errors, dropped, overrun
# TX: errors, dropped, carrier, collisions

# Check for hardware errors in kernel log (sudo is needed where kernel.dmesg_restrict=1)
sudo dmesg | grep -i -E "eth|nic|network|link|carrier|reset" | tail -20

# Check cable quality (requires managed switch)
# Connect to switch CLI and check interface statistics:
# show interfaces GigabitEthernet 0/1
# Look for: CRC errors, giants, runts, input/output errors
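Reading the ip -s link counters by eye is error-prone, so the following sketch prints only the non-zero error counters. It assumes the iproute2 layout in which an "RX:" or "TX:" header line is followed by a line of values; nic_errors is a hypothetical helper name.

```shell
# nic_errors (hypothetical helper): read "ip -s link show <dev>" on stdin and
# print every non-zero RX/TX counter except traffic totals (bytes, packets, mcast)
nic_errors() {
  awk '
    want {                                   # values line following a header
      for (i = 1; i <= ncols; i++)
        if (col[i] != "bytes" && col[i] != "packets" && col[i] != "mcast" && $i + 0 > 0)
          printf "%s %s=%s\n", dir, col[i], $i
      want = 0
      next
    }
    $1 == "RX:" || $1 == "TX:" {             # header line, e.g. "RX: bytes packets errors dropped ..."
      dir = $1
      ncols = NF - 1
      for (i = 2; i <= NF; i++) col[i - 1] = $i
      want = 1
    }'
}
```

Usage: ip -s link show eth0 | nic_errors. Run it twice a few minutes apart; counters that keep climbing matter more than old totals.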

Physical layer checklist:

Cable quality — Cat5e is rated to 1 Gbps at 100m. Longer runs, damaged jackets, or tight bends cause errors.
Cable connectors — Badly crimped or damaged connectors often cause intermittent issues. Re-terminate or replace the cable, and verify it with a cable tester.
Duplex mismatch — If one end auto-negotiates and the other is forced to full duplex, the auto-negotiating end falls back to half duplex. It then sees late collisions while the full-duplex end logs CRC errors. Symptoms: loss under load, not at idle.
NIC failure — Partial hardware failure shows as intermittent errors. Test with a USB Ethernet adapter to rule out the NIC.

# Check duplex and speed (Linux)
ethtool eth0
# Look for: Speed: 1000Mb/s, Duplex: Full
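# Degraded: Speed: 100Mb/s or Duplex: Half on a gigabit link usually means a damaged
# cable pair (gigabit needs all four pairs, 100 Mb/s only two) or failed negotiation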

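# Force speed/duplex only if auto-negotiation is genuinely broken, and configure the
# switch port identically (forcing only one end creates a duplex mismatch).
# 1000BASE-T requires auto-negotiation, so at gigabit restrict what is advertised
# instead of disabling it (0x020 = 1000baseT/Full):
sudo ethtool -s eth0 advertise 0x020
# On 10/100 links, speed and duplex can be forced outright:
sudo ethtool -s eth0 speed 100 duplex full autoneg off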
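The ethtool check can be scripted for several machines or repeated runs. A minimal sketch, assuming ethtool's usual "Speed: ..." and "Duplex: ..." lines; link_check is a hypothetical name, and the 1000 Mb/s default is an assumption for a gigabit link.

```shell
# link_check (hypothetical helper): read "ethtool <dev>" output on stdin and
# report "degraded" if duplex is not Full or speed is below the expected
# value in Mb/s (first argument, default 1000)
link_check() {
  awk -F': ' -v want="${1:-1000}" '
    $1 ~ /Speed$/  { speed = $2 }     # e.g. "1000Mb/s"
    $1 ~ /Duplex$/ { duplex = $2 }    # e.g. "Full"
    END {
      status = "ok"
      # "1000Mb/s" + 0 evaluates to 1000; "Unknown!" evaluates to 0
      if (duplex != "Full" || speed + 0 < want + 0) status = "degraded"
      printf "%s (Speed: %s, Duplex: %s)\n", status, speed, duplex
    }'
}
```

Usage: ethtool eth0 | link_check, or ethtool eth0 | link_check 100 on a Fast Ethernet port.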

Layer 2: Switch Issues

Layer 2 (Data Link) problems manifest as CRC errors, broadcast storms, or STP (Spanning Tree Protocol) convergence issues.

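# Check for excessive broadcasts (sign of a broadcast storm or misconfiguration)
# Per-NIC counters; names vary by driver. Run twice and compare how fast they climb
ethtool -S eth0 | grep -i -E 'broadcast|multicast'
# On a Linux bridge, list member ports and their STP state
bridge link show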

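# Check the ARP (neighbour) table for IP conflicts (can cause packet loss)
ip neigh show
# The table holds one MAC per IP at a time, so a conflict shows up as the MAC for
# an IP changing between runs. To probe one address directly (iputils arping):
sudo arping -D -c 3 -I eth0 192.168.1.50   # example IP; any reply means another host uses it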

# Test for STP issues
# A switch port in blocking (BLK) state drops user traffic by design
# On managed switches:
# show spanning-tree
# BLK is normal on redundant inter-switch links, but the port serving your host should be FWD;
# ports stuck in LIS/LRN, or repeatedly changing state, cause brief loss

Common Layer 2 causes:

IP address conflict — Two devices with the same IP cause intermittent loss as other hosts' ARP caches flip between the two MAC addresses.
Bad switch port — Hardware failure on one port. Test by moving the cable to a different port.
Loop without STP — If someone connects two ports of the same switch together without STP, a broadcast storm occurs, bringing the entire network down.
Faulty SFP module — In fiber setups, a marginal SFP transceiver causes bit errors.
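For the IP conflict case: the neighbour table keeps a single MAC address per IP, so a conflict is easier to catch by comparing two snapshots taken a minute or so apart. A sketch, assuming the ip neigh show line format (IP first, MAC after "lladdr"); mac_changes is a hypothetical name. A changed MAC can also be legitimate, for example after a router failover or NIC replacement.

```shell
# mac_changes (hypothetical helper): compare two saved "ip neigh show" snapshots
# (files $1 and $2) and print every IP whose MAC address differs between them
mac_changes() {
  awk '
    {
      mac = ""
      for (i = 2; i < NF; i++) if ($i == "lladdr") mac = $(i + 1)
    }
    mac == "" { next }                            # skip FAILED/INCOMPLETE entries
    FILENAME == ARGV[1] { seen[$1] = mac; next }  # first snapshot
    ($1 in seen) && seen[$1] != mac { print $1 ": " seen[$1] " -> " mac }
  ' "$1" "$2"
}
```

Usage: ip neigh show > /tmp/neigh1; sleep 60; ip neigh show > /tmp/neigh2; mac_changes /tmp/neigh1 /tmp/neigh2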

Layer 3: Routing Problems

Routing issues cause loss at specific destinations or along specific paths.

# Check routing table for unexpected routes
ip route show
# Verify: default gateway is correct, no duplicate default routes

# Check for ICMP redirects (type 5): a router sending them means hosts are using
# a suboptimal or wrong gateway, so traffic may be misrouted
sudo tcpdump -n -i eth0 'icmp[icmptype] == icmp-redirect'

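# Test with different packet sizes and Don't Fragment set (-M do, Linux iputils) to reveal MTU problems
# -s sets the ICMP payload; the IPv4 packet is 28 bytes larger (20-byte IP + 8-byte ICMP header)
ping -M do -s 100 8.8.8.8
ping -M do -s 500 8.8.8.8
ping -M do -s 1400 8.8.8.8
ping -M do -s 1472 8.8.8.8   # 1472 + 28 = 1500, the standard Ethernet MTU
# If large packets time out silently while small ones pass: MTU black hole
# (a local "message too long" error instead means your own interface MTU is smaller)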

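MTU black hole: A common problem where packets above a certain size are silently dropped. A link along the path has a smaller MTU, the packets carry the Don't Fragment bit (most TCP stacks set it for path MTU discovery), and the ICMP "fragmentation needed" message that should tell the sender to use smaller packets is never sent or is filtered by a firewall. Symptoms: ping works, curl hangs, SSH sessions freeze when transferring data.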
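Rather than trying sizes by hand, the largest payload that passes with Don't Fragment set can be found by binary search. A sketch, assuming Linux iputils ping (-M do, -W 1) and that every size below the limit passes while everything above fails; probe_size and find_max_payload are hypothetical names. A single probe lost to ordinary packet loss makes the search undershoot, so run it more than once.

```shell
# probe_size (hypothetical helper): one DF-flagged ping with a payload of $1
# bytes to host $2; succeeds if a reply arrives within 1 second
probe_size() {
  ping -M do -c 1 -W 1 -s "$1" "$2" > /dev/null 2>&1
}

# find_max_payload (hypothetical helper): binary search between lo (assumed to
# pass, default 0) and hi (default 1472) for the largest payload that gets through.
# Path MTU = result + 28 (20-byte IPv4 header + 8-byte ICMP header).
find_max_payload() {
  local host=$1 lo=${2:-0} hi=${3:-1472} mid
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))     # round up so the loop always makes progress
    if probe_size "$mid" "$host"; then
      lo=$mid                        # mid passes: answer is mid or larger
    else
      hi=$((mid - 1))                # mid fails: answer is below mid
    fi
  done
  echo "$lo"
}
```

Usage: find_max_payload 8.8.8.8 prints the largest working payload; add 28 to get the path MTU (1472 corresponds to a clean 1500-byte path).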

Fix: lower your interface MTU:

# Temporarily lower MTU
sudo ip link set dev eth0 mtu 1400

# Test if this fixes the loss
curl https://example.com

ISP Congestion

ISP network congestion causes loss during peak hours on specific segments of the network.

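# Time-stamped packet loss test over several hours (stop with Ctrl+C)
while true; do
    echo -n "$(date '+%H:%M:%S') "
    ping -c 10 -q 8.8.8.8 | grep 'packet loss'   # keep only the summary line
    sleep 60
done | tee /tmp/loss_log.txt

# Analyze the log: show only samples with non-zero loss
# (the leading ", " stops lines such as "10% packet loss" from matching)
grep "packet loss" /tmp/loss_log.txt | grep -v ", 0% packet loss"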

# Test at peak vs off-peak
# 7-10 PM local time is typically peak for residential ISPs
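To compare peak and off-peak hours, the log can be summarized per hour of day. A sketch, assuming each log line holds an HH:MM:SS timestamp followed by ping's "... X% packet loss" summary line; loss_by_hour is a hypothetical name.

```shell
# loss_by_hour (hypothetical helper): average the loss percentage per hour in a
# log file (default /tmp/loss_log.txt) with lines like
# "19:05:12 10 packets transmitted, 9 received, 10% packet loss, time 9011ms"
loss_by_hour() {
  awk '
    match($0, /[0-9.]+% packet loss/) {
      hour = substr($1, 1, 2)                  # "19" from "19:05:12"
      pct = substr($0, RSTART, RLENGTH) + 0    # "10% packet loss" + 0 -> 10
      sum[hour] += pct
      n[hour]++
    }
    END {
      for (h in n) printf "%s:00 %.1f%% average loss over %d samples\n", h, sum[h] / n[h], n[h]
    }
  ' "${1:-/tmp/loss_log.txt}" | sort
}
```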

If loss only occurs at peak hours (7-10 PM), the most likely cause is an oversubscribed ISP network (for example, oversold backhaul). Document the pattern and report it:

Record timestamps and percentage loss
Note whether wired and wireless connections are both affected
Check if other devices on the network have the same loss
Submit a formal service complaint with the data

Buffer Bloat

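Buffer bloat occurs when oversized buffers in networking equipment fill with packets during congestion. Packets wait in the queue instead of being dropped, so latency climbs sharply and TCP receives its congestion signal late; the problem shows up as delay rather than as visible packet loss.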

# Test for buffer bloat
# The "Bufferbloat" test at waveform.com
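# Or use flent (Flexible Network Tester); it drives netperf, so it also needs the
# netperf client installed and a server you control running netserver
sudo apt install flent
flent rrul -p all_scaled -l 60 -H netperf.example.com -t "Buffer Bloat Test"   # placeholder host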

# Manual test: measure latency under load
# In one terminal, start a download
wget -q -O /dev/null http://speedtest.tele2.net/100MB.zip &

# In another terminal, measure latency simultaneously
ping -c 60 8.8.8.8

# Buffer bloat signature: idle latency 5ms, latency under load 200ms+
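The manual test can be reduced to one number: how much latency the download adds. A sketch, assuming ping's min/avg/max summary format (Linux "rtt min/avg/max/mdev" or macOS "round-trip min/avg/max/stddev"); avg_rtt and bloat_ms are hypothetical names.

```shell
# avg_rtt (hypothetical helper): print the average RTT in ms from ping's final
# summary line, e.g. "rtt min/avg/max/mdev = 4.9/5.2/6.0/0.3 ms"
avg_rtt() {
  sed -n 's|.*min/avg/max/[a-z]* = [0-9.]*/\([0-9.]*\)/.*|\1|p'
}

# bloat_ms (hypothetical helper): latency added under load, in ms,
# given the idle average ($1) and the loaded average ($2)
bloat_ms() {
  awk -v idle="$1" -v loaded="$2" 'BEGIN { printf "%.1f\n", loaded - idle }'
}
```

Usage: idle=$(ping -c 20 -q 8.8.8.8 | avg_rtt), then start the download, then loaded=$(ping -c 20 -q 8.8.8.8 | avg_rtt) and bloat_ms "$idle" "$loaded". A few milliseconds is healthy; hundreds of milliseconds matches the bufferbloat signature.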

Fix: Enable CAKE or fq_codel:

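# Check current qdisc
tc qdisc show dev eth0

# Replace with CAKE (best option for home networks); set bandwidth slightly below
# your real uplink speed (95mbit for a 100 Mb/s line) so the queue forms here
sudo tc qdisc replace dev eth0 root cake bandwidth 95mbit

# Or use fq_codel (also excellent, but it has no shaper: it only helps where eth0
# itself is the bottleneck, or when combined with an HTB/TBF rate limit)
sudo tc qdisc replace dev eth0 root fq_codel

# Make permanent via /etc/rc.local or a systemd service
# Note: a root qdisc on eth0 only manages outgoing (upload) traffic; download
# bloat must be handled on the router or with ingress shaping (e.g. an IFB device)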
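CAKE's bandwidth setting should sit slightly below the real bottleneck rate so that the queue builds inside CAKE, where it is managed, rather than in the modem. A sketch that derives the value from a measured uplink speed; cake_cmd is a hypothetical name, and the 95% default is an assumption mirroring the 95mbit example, not a fixed rule. It prints the command instead of running it.

```shell
# cake_cmd (hypothetical helper): print (not run) a tc command that shapes
# egress on device $1 to pct% (default 95, an assumption) of uplink $2 Mbit/s
cake_cmd() {
  local dev=$1 link_mbit=$2 pct=${3:-95}
  echo "tc qdisc replace dev $dev root cake bandwidth $(( link_mbit * pct / 100 ))mbit"
}
```

Usage: cake_cmd eth0 100 prints the command for a 100 Mb/s uplink; once it looks right, cake_cmd eth0 100 | sudo sh applies it.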

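CAKE and fq_codel are active queue management (AQM) algorithms: they drop (or ECN-mark) packets once queueing delay has stayed above a small target (5 ms by default) for a full interval (100 ms by default), signalling senders to slow down before the buffer fills. Their per-flow queuing also keeps one bulk download from delaying other traffic, so latency stays low under load.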
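The core idea of CoDel, reacting to how long packets wait rather than to how full the buffer is, can be illustrated with a simplified model. This is a sketch only: it omits parts of the real algorithm (RFC 8289), such as carrying the drop count over when dropping resumes shortly after it stopped. It assumes input lines of dequeue time and queueing delay (sojourn time) in milliseconds; codel_sketch is a hypothetical name, and the 5 ms target and 100 ms interval defaults match CoDel's.

```shell
# codel_sketch (hypothetical helper): simplified CoDel drop decision.
# Input lines: "<dequeue time ms> <sojourn time ms>"; output: "<time> keep|drop".
# Arguments: target ms (default 5), interval ms (default 100).
codel_sketch() {
  awk -v target="${1:-5}" -v interval="${2:-100}" '
    {
      t = $1; sojourn = $2; action = "keep"
      if (sojourn < target) {
        first_above = 0; dropping = 0; count = 0      # delay is fine again: reset
      } else if (!dropping && first_above == 0) {
        first_above = t + interval                    # start the grace interval
      } else if (!dropping && t >= first_above) {
        dropping = 1; count = 1; action = "drop"      # above target for a whole interval
        drop_next = t + interval / sqrt(count)
      } else if (dropping && t >= drop_next) {
        count++; action = "drop"                      # still above target: drop sooner
        drop_next = t + interval / sqrt(count)
      }
      print t, action
    }'
}
```

With a constant 20 ms delay and packets every 10 ms, the sketch drops at 100, 200 and 280 ms: the gap between drops shrinks as interval/sqrt(count) until senders slow down and the delay falls below target.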

QoS Prioritization

When packet loss is selective — affecting some traffic more than others — QoS configuration is either the cause or the solution.

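# Check if certain traffic is being policed (rate limited)
# Use iperf3 against a server you control; public resolvers such as 8.8.8.8 do not run iperf3
# On the server: iperf3 -s -p 5201   and, in a second session, sudo iperf3 -s -p 80
iperf3 -c iperf.example.com -p 5201 -t 30   # Standard port (placeholder server name)
iperf3 -c iperf.example.com -p 80 -t 30     # Port 80

# If different ports give very different results, QoS policing or shaping is likely active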
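To compare the two runs, the receiver-side throughput can be extracted from each iperf3 summary. A sketch, assuming iperf3's default text output (the final line ending in "receiver"); iperf_mbps and throughput_ratio are hypothetical names, and iperf.example.com is a placeholder server.

```shell
# iperf_mbps (hypothetical helper): print receiver throughput in Mbit/s from
# iperf3 text output, e.g. "[  5]   0.00-30.00  sec   336 MBytes  94.0 Mbits/sec   receiver"
iperf_mbps() {
  awk '/receiver/ {
    for (i = 2; i <= NF; i++) {
      if ($i == "Gbits/sec") { print $(i - 1) * 1000; exit }
      if ($i == "Mbits/sec") { print $(i - 1) + 0;    exit }
      if ($i == "Kbits/sec") { print $(i - 1) / 1000; exit }
    }
  }'
}

# throughput_ratio (hypothetical helper): second result as a percentage of the first
throughput_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f%%\n", 100 * b / a }'
}
```

Usage: a=$(iperf3 -c iperf.example.com -p 5201 -t 30 | iperf_mbps); b=$(iperf3 -c iperf.example.com -p 80 -t 30 | iperf_mbps); throughput_ratio "$a" "$b"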

# For Linux routers: list all tc qdiscs and classes
tc qdisc show
tc class show dev eth0
tc filter show dev eth0
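Adding -s to tc shows per-qdisc and per-class statistics, including a dropped counter that points to the queue discarding packets. A sketch that filters for non-zero drops, assuming tc's usual layout where a "qdisc ..." or "class ..." line is followed by a "Sent ... (dropped N, ...)" line; qdisc_drops is a hypothetical name.

```shell
# qdisc_drops (hypothetical helper): read "tc -s qdisc show" or
# "tc -s class show dev <dev>" output on stdin and print each qdisc/class
# whose dropped counter is non-zero
qdisc_drops() {
  awk '
    $1 == "qdisc" || $1 == "class" { name = $1 " " $2 " " $3 }   # e.g. "qdisc fq_codel 8001:"
    match($0, /dropped [0-9]+/) {
      n = substr($0, RSTART + 8, RLENGTH - 8)                   # digits after "dropped "
      if (n + 0 > 0) print name, "dropped", n
    }'
}
```

Usage: tc -s qdisc show dev eth0 | qdisc_drops, and tc -s class show dev eth0 | qdisc_drops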

When QoS causes unintended loss:

Overly aggressive policing drops packets that exceed a rate limit. Fix by increasing rate limits or using traffic shaping (smooth the rate) instead of policing (hard drop):

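# Change policing (drop) to shaping (delay) for better behavior
# Both examples assume an existing HTB setup: root qdisc 1: with parent class 1:1
# Before (policing: drops packets): UDP (IP protocol 17) above 10 Mbit/s is discarded
sudo tc filter add dev eth0 protocol ip parent 1: \
  u32 match ip protocol 17 0xff \
  police rate 10mbit burst 15k drop flowid 1:30

# After (shaping: delays packets instead of dropping)
# Remove the policing filter first (tc filter show dev eth0 lists it with its pref);
# use "change" instead of "add" if class 1:30 already exists
sudo tc class add dev eth0 parent 1:1 classid 1:30 \
  htb rate 10mbit ceil 10mbit burst 15k
# Steer UDP into the shaped class, this time without a police action
sudo tc filter add dev eth0 protocol ip parent 1: \
  u32 match ip protocol 17 0xff flowid 1:30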

The key principle: shaping queues packets and releases them at the target rate. Policing drops packets that exceed the rate. For most traffic, shaping produces better outcomes for the user experience.
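The difference shows up in a small token-bucket model where a policer and a shaper share the same rate and burst, and only the action on a non-conforming packet differs. A simplified sketch: rates are in bytes per millisecond (1250 bytes/ms is 10 Mbit/s), the shaper's queue is unbounded (a real shaper drops once its queue overflows), and police_sim and shape_sim are hypothetical names.

```shell
# Input lines for both helpers: "<arrival time ms> <packet size bytes>", in order.
# Arguments: rate in bytes/ms, burst (bucket size) in bytes.

# police_sim (hypothetical helper): a packet that does not fit in the bucket is dropped
police_sim() {
  awk -v rate="$1" -v burst="$2" '
    BEGIN { tokens = burst; last = 0 }
    {
      tokens += ($1 - last) * rate; if (tokens > burst) tokens = burst   # refill
      last = $1
      if ($2 <= tokens) { tokens -= $2; print $1, "pass" }
      else print $1, "drop"
    }'
}

# shape_sim (hypothetical helper): a packet that does not fit waits for tokens
shape_sim() {
  awk -v rate="$1" -v burst="$2" '
    BEGIN { tokens = burst; last = 0; ready = 0 }
    {
      now = ($1 > ready) ? $1 : ready      # FIFO: cannot leave before the previous packet
      tokens += (now - last) * rate; if (tokens > burst) tokens = burst
      last = now
      if ($2 > tokens) {                   # not enough tokens yet: delay the packet
        now += ($2 - tokens) / rate
        tokens = $2; last = now
      }
      tokens -= $2; ready = now
      printf "%s sent at %.1f (delay %.1f ms)\n", $1, now, now - $1
    }'
}
```

Feeding both five back-to-back 1500-byte packets with rate 1000 and burst 3000, the policer passes two and drops three, while the shaper sends all five, the last one 4.5 ms late.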
