Aggressive Web Scraper

Low Scanning & Reconnaissance

An aggressive web scraper sends high-frequency HTTP requests to enumerate and download site content at a rate far exceeding normal human browsing, often ignoring robots.txt rules. Unlike legitimate crawlers, these bots typically rotate User-Agent strings, ignore Crawl-delay directives, and can consume significant server resources. They are commonly used for price intelligence, content theft, or unauthorized data harvesting.

🔍 Indicators

  • Extremely high request rate from a single IP or ASN (hundreds of requests per minute)
  • Sequential URL enumeration patterns (e.g., /product/1, /product/2, …)
  • Ignoring or violating robots.txt Disallow rules
  • Rotating or spoofed User-Agent headers (e.g., cycling through browser strings)
  • Missing or minimal browser fingerprints: no Accept-Language, no cookie handling, no JS execution
  • Requests clustered around specific resource paths (sitemap.xml, API endpoints)
  • Missing or empty Referer headers across thousands of requests
  • Abnormally uniform inter-request timing (e.g., exactly 100 ms between requests)
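The timing indicator above can be checked numerically. The following is a minimal sketch, assuming request timestamps (in seconds) have already been collected per client; the function names and the thresholds `cv_threshold` and `min_requests` are illustrative assumptions, not tuned values.

```python
from statistics import mean, pstdev


def timing_regularity(timestamps):
    """Return the coefficient of variation (stdev / mean) of inter-request gaps.

    Values near 0 mean near-constant spacing, which is typical of a scripted
    client; human browsing tends to be far more irregular.
    """
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None  # not enough data to judge
    avg = mean(gaps)
    if avg == 0:
        return 0.0  # all requests share one timestamp: perfectly uniform
    return pstdev(gaps) / avg


def looks_scripted(timestamps, cv_threshold=0.1, min_requests=20):
    """Flag a client whose request spacing is suspiciously uniform.

    cv_threshold and min_requests are hypothetical defaults for illustration.
    """
    if len(timestamps) < min_requests:
        return False
    cv = timing_regularity(timestamps)
    return cv is not None and cv < cv_threshold
```

A client firing every 100 ms yields a coefficient of variation close to zero, while irregular gaps push it well above the threshold. On its own this is a weak signal and is best combined with the other indicators.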

🛡 Detection Methods

Nginx / Apache access log analysis

# Top 20 client IPs by request count across the whole log file
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

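# Detect product/page enumeration: count numeric-ID path hits per client IP and path prefix
# (assumes the default combined log format, where field 7 is the request path)
awk '$7 ~ /^\/[a-z-]+\/[0-9]+/ {split($7, p, "/"); print $1, "/" p[2] "/"}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20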
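Raw counts do not show whether the requested IDs are actually consecutive. A rough sketch of that check, assuming combined-format log lines are supplied as strings, could look like this; `LINE_RE` and `longest_sequential_runs` are hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict

# Captures the client IP, a path prefix such as /product/, and a numeric ID.
LINE_RE = re.compile(r'^(\S+) .*?"(?:GET|HEAD) (/[a-z-]+/)(\d+)[ ?]')


def longest_sequential_runs(lines):
    """Map (ip, path_prefix) -> length of the longest run of consecutive IDs.

    Only the set of requested IDs is considered, not the order of requests.
    """
    ids = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ip, prefix, num = m.groups()
            ids[(ip, prefix)].add(int(num))

    runs = {}
    for key, seen in ids.items():
        best = 0
        for n in seen:
            if n - 1 not in seen:  # n is the start of a run
                length = 1
                while n + length in seen:
                    length += 1
                best = max(best, length)
        runs[key] = best
    return runs
```

A client that requested /product/1 through /product/500 would show a run of 500, whereas ordinary visitors usually produce runs of only a few IDs.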

fail2ban rule for excessive crawling

[nginx-scraper]
enabled  = true
filter   = nginx-scraper
logpath  = /var/log/nginx/access.log
maxretry = 300
findtime = 60
bantime  = 3600

fail2ban filter (/etc/fail2ban/filter.d/nginx-scraper.conf)

[Definition]
failregex = ^<HOST> .* "(GET|HEAD) .*
ignoreregex =
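To make the jail settings concrete, here is a minimal sketch of the sliding-window idea behind maxretry, findtime, and bantime. It only approximates the behavior and is not fail2ban's implementation; the class name `SlidingWindowBan` is an assumption made for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowBan:
    """Approximates the maxretry/findtime/bantime idea; not fail2ban's code."""

    def __init__(self, maxretry=300, findtime=60, bantime=3600):
        self.maxretry = maxretry
        self.findtime = findtime
        self.bantime = bantime
        self.hits = defaultdict(deque)  # ip -> timestamps of recent matches
        self.banned_until = {}          # ip -> time at which the ban expires

    def record(self, ip, now):
        """Record one matching log line; return True if the IP is banned."""
        if self.banned_until.get(ip, 0) > now:
            return True
        window = self.hits[ip]
        window.append(now)
        # Drop matches that have fallen out of the findtime window.
        while window and window[0] <= now - self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned_until[ip] = now + self.bantime
            window.clear()
            return True
        return False
```

With the values from the jail above, a client is banned for an hour once 300 matching requests land within any 60-second window, while a client that stays below that rate is never banned.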

Snort rule

# Counts GET requests per source rather than raw packets: 200 within 60 seconds
alert tcp any any -> $HTTP_SERVERS $HTTP_PORTS \
  (msg:"Aggressive Web Scraper - High Request Rate"; \
   flow:to_server,established; \
   content:"GET"; http_method; \
   threshold:type threshold, track by_src, count 200, seconds 60; \
   classtype:web-application-activity; sid:9100018; rev:1;)

Mitigation

  1. Publish a robots.txt and monitor for violations. It is advisory only, but legitimate crawlers honor it, so clients that repeatedly request disallowed paths can be rate-limited or banned.
  2. Rate-limit by IP at the reverse proxy, e.g., with Nginx limit_req_zone (in the http block) and limit_req (in a server or location block):

       limit_req_zone $binary_remote_addr zone=scraper:10m rate=30r/m;
       limit_req zone=scraper burst=10 nodelay;
  3. Deploy a Web Application Firewall (WAF) with bot-detection rules (Cloudflare Bot Management, AWS WAF, ModSecurity).
  4. List a honeypot path in robots.txt (Disallow: /trap/) that is linked nowhere else; any client that requests it has ignored the rule and can be banned automatically.
  5. Implement CAPTCHA challenges on high-value endpoints (search, product listings) for suspicious clients.
  6. Use JavaScript challenges to distinguish real browsers from plain HTTP clients that do not execute scripts.
  7. Monitor and block ASNs associated with known scraping-as-a-service providers (e.g., datacenter ASNs with no residential traffic).
  8. Rotate and obscure API endpoints for sensitive data; require authenticated sessions with CSRF tokens.
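The rate limit in item 2 can be reasoned about with a small leaky-bucket model. The sketch below approximates how rate=30r/m with burst=10 nodelay behaves for a single client key. It is a simplified illustration under that assumption, not Nginx's implementation, and `LeakyBucket` is a hypothetical name.

```python
class LeakyBucket:
    """Rough model of `rate=30r/m` with `burst=10 nodelay`; not Nginx's code."""

    def __init__(self, rate_per_sec=30 / 60, burst=10):
        self.rate = rate_per_sec
        self.burst = burst
        self.state = {}  # client key -> (excess, last_seen_time)

    def allow(self, key, now):
        """Return True if the request is admitted, False if it is rejected."""
        if key not in self.state:
            self.state[key] = (0.0, now)  # first request is always admitted
            return True
        excess, last = self.state[key]
        # The bucket drains at `rate` per second, then this request adds 1.
        excess = max(0.0, excess - self.rate * (now - last)) + 1.0
        if excess > self.burst:
            return False  # rejected; stored state is left unchanged
        self.state[key] = (excess, now)
        return True
```

In this model a client can send one request plus a burst of ten back to back, after which it is held to one request every two seconds. Nginx answers rejected requests with status 503 unless limit_req_status is set to something else.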

📋 Real-World Examples

In 2017, hiQ Labs sued LinkedIn after LinkedIn sent it a cease-and-desist letter over large-scale scraping of public profile data, starting a landmark legal battle over the Computer Fraud and Abuse Act (CFAA): hiQ Labs v. LinkedIn. Ryanair has likewise litigated against Booking.com over screen-scraping of its flight prices without authorization, including a CFAA suit filed in U.S. federal court in Delaware in 2020.
