Aggressive Web Scraper

Low | Scanning & Reconnaissance

An aggressive web scraper sends high-frequency HTTP requests to enumerate and download site content at a rate far exceeding normal human browsing, often ignoring robots.txt restrictions. Unlike legitimate crawlers, these bots typically rotate User-Agent strings, ignore Crawl-delay directives, and can consume significant server resources. They are commonly used for unauthorized price intelligence, content theft, or data harvesting.

🔍Indicators

Extremely high request rate from a single IP or ASN (hundreds of requests per minute)
Sequential URL enumeration patterns (e.g., /product/1, /product/2, …)
Ignoring or violating robots.txt Disallow rules
Rotating or spoofed User-Agent headers (e.g., cycling through browser strings)
Missing or minimal browser fingerprints: no Accept-Language, no cookie handling, no JS execution
Requests clustered around specific resource paths (sitemap.xml, API endpoints)
Missing or empty Referer headers across thousands of requests
Abnormally uniform inter-request timing (e.g., exactly 100 ms between requests)
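
Two of the indicators above, uniform inter-request timing and sequential URL enumeration, can be scored per client once request timestamps and numeric IDs have been extracted from the logs. The following is a minimal sketch, not a tuned detector: the function names and the thresholds (`min_requests=20`, `max_cv=0.1`) are illustrative assumptions and would need calibrating against real traffic.

```python
from statistics import mean, pstdev


def timing_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive requests.

    Timestamps are seconds (float). Values near 0 mean almost constant spacing.
    """
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None  # too few requests to judge
    avg = mean(gaps)
    if avg == 0:
        return 0.0
    return pstdev(gaps) / avg


def looks_machine_timed(timestamps, min_requests=20, max_cv=0.1):
    """Flag a client with many requests and nearly constant spacing.

    Both thresholds are assumptions chosen for illustration.
    """
    if len(timestamps) < min_requests:
        return False
    cv = timing_cv(timestamps)
    return cv is not None and cv < max_cv


def sequential_run(ids):
    """Length of the longest run of consecutive integers among requested IDs."""
    longest = current = 0
    previous = None
    for value in sorted(set(ids)):
        if previous is not None and value == previous + 1:
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = value
    return longest
```

A client requesting /product/1 through /product/500 at fixed 100 ms intervals would score a long sequential run and a coefficient of variation close to zero, whereas human browsing tends to produce irregular gaps and scattered IDs.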

🛡Detection Methods

Nginx / Apache access log analysis

# Top IPs by request count across the whole log file (no time filter is applied)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

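# Detect sequential product/page enumeration: distinct numeric-ID URLs per client IP and path prefix
# (assumes the default "combined" log format, where field 1 is the IP and field 7 is the request path)
awk '$7 ~ /^\/[a-z-]+\/[0-9]+/ {print $1, $7}' /var/log/nginx/access.log | sort -u | awk '{sub(/\/[0-9]+.*$/, "/", $2); print}' | sort | uniq -c | sort -rn | head -20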
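
The one-liners above count over the entire log file. Restricting the count to a time window requires parsing each entry's timestamp, which can be sketched in Python as follows. This assumes the default Nginx "combined" log format; `LINE_RE`, `parse_line`, and `top_ips` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Matches the start of a "combined" log line: client IP, timestamp, method, path.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)


def parse_line(line):
    """Return (ip, timezone-aware timestamp, path), or None if the line does not match."""
    match = LINE_RE.match(line)
    if not match:
        return None
    ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), ts, match.group("path")


def top_ips(lines, now, window=timedelta(hours=1), limit=20):
    """Count requests per IP whose timestamp falls within [now - window, now]."""
    cutoff = now - window
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed and cutoff <= parsed[1] <= now:
            counts[parsed[0]] += 1
    return counts.most_common(limit)
```

Passing an open file handle for the access log as `lines` and a timezone-aware current time as `now` yields the busiest clients in the last hour as `(ip, count)` pairs.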

fail2ban rule for excessive crawling

[nginx-scraper]
enabled  = true
filter   = nginx-scraper
logpath  = /var/log/nginx/access.log
maxretry = 300
findtime = 60
bantime  = 3600

fail2ban filter (/etc/fail2ban/filter.d/nginx-scraper.conf)

[Definition]
failregex = ^<HOST> .* "(GET|HEAD) .*
ignoreregex =
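
To reason about how the jail parameters interact, it can help to model them outside fail2ban. The sketch below approximates the idea behind maxretry, findtime, and bantime (ban an IP once it accumulates maxretry matching lines within findtime seconds, for bantime seconds). It is an illustrative model, not fail2ban's actual implementation, and the class name `SlidingWindowBanner` is hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowBanner:
    """Toy model of a fail2ban-style jail; times are plain seconds."""

    def __init__(self, maxretry=300, findtime=60, bantime=3600):
        self.maxretry = maxretry
        self.findtime = findtime
        self.bantime = bantime
        self._hits = defaultdict(deque)  # ip -> timestamps of recent matches
        self._banned_until = {}          # ip -> time at which the ban expires

    def is_banned(self, ip, now):
        return self._banned_until.get(ip, 0) > now

    def record(self, ip, now):
        """Register one matching request; return True if the IP is banned."""
        if self.is_banned(ip, now):
            return True
        hits = self._hits[ip]
        hits.append(now)
        # Drop matches that have aged out of the findtime window.
        while hits and hits[0] <= now - self.findtime:
            hits.popleft()
        if len(hits) >= self.maxretry:
            self._banned_until[ip] = now + self.bantime
            hits.clear()
            return True
        return False
```

With the values from the jail above, a client is banned for an hour once it makes 300 GET or HEAD requests inside any 60-second window, while a client spreading the same requests over a longer period is never banned.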

Snort rule

alert tcp any any -> $HTTP_SERVERS $HTTP_PORTS \
  (msg:"Aggressive Web Scraper - High Request Rate"; \
   flow:to_server,established; \
   content:"GET"; http_method; \
   threshold:type threshold, track by_src, count 200, seconds 60; \
   classtype:web-application-activity; sid:9100018; rev:1;)

✅Mitigation

Publish a robots.txt policy and monitor for violations: the file is advisory, so legitimate crawlers respect it, and IPs that repeatedly ignore its Disallow rules can be banned.
Rate-limit by IP at the reverse proxy (e.g., Nginx limit_req_zone):
  # http context
  limit_req_zone $binary_remote_addr zone=scraper:10m rate=30r/m;
  # server or location context
  limit_req zone=scraper burst=10 nodelay;
Deploy a Web Application Firewall (WAF) with bot-detection rules (Cloudflare Bot Management, AWS WAF, ModSecurity).
List a honeypot path in robots.txt (Disallow: /trap/) that no legitimate page links to; any request to it triggers an automated ban.
Present CAPTCHA or JavaScript challenges to suspicious clients on high-value endpoints (search, product listings) to distinguish real browsers from HTTP clients that do not execute scripts.
Monitor and block ASNs associated with known scraping-as-a-service providers (e.g., datacenter ASNs with no residential traffic).
Rotate and obscure API endpoints for sensitive data; require authenticated sessions with CSRF tokens.
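
The robots.txt monitoring and honeypot items above can be combined into a single per-request check. The following minimal sketch uses Python's standard `urllib.robotparser`. The `/trap/` path comes from the example above, while the `/private/` rule, the `classify` function, and its return labels are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example policy; /private/ is a hypothetical rule added for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
Disallow: /private/
"""

HONEYPOT_PREFIX = "/trap/"  # honeypot path from the mitigation list


def build_parser(robots_txt):
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def classify(parser, user_agent, path):
    """Return 'ban', 'violation', or 'ok' for a single logged request."""
    if path.startswith(HONEYPOT_PREFIX):
        return "ban"  # nothing legitimate links to the honeypot
    if not parser.can_fetch(user_agent, path):
        return "violation"  # disallowed path, worth counting per client
    return "ok"
```

Feeding each logged request through `classify` gives an immediate ban signal for honeypot hits and a per-client violation count for other disallowed paths, which could then drive a firewall or fail2ban action.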

📋Real-World Examples

In 2017, hiQ Labs sued LinkedIn after LinkedIn sent a cease-and-desist letter demanding that hiQ stop scraping public profile data at scale. The resulting case, hiQ Labs v. LinkedIn, became a landmark legal battle over whether scraping publicly accessible data violates the Computer Fraud and Abuse Act (CFAA). Ryanair has likewise taken Booking.com to US federal court under the CFAA over screen-scraping of flight prices without authorization.

Related Terms

API Gateway
Firewall
Proxy Server
Rate Limiting