Amazonbot and Bingbot excessively crawl websites, query parameters, and shop filters

Symptoms

At times a website might appear extremely busy, with tens of gigabytes of bandwidth usage per month, and it doesn't make sense. When you check the log file you see this repeating every second:

[root@server ~]# tail -f /usr/local/apache/domlogs/domain-name-ssl_log

3.89.170.186 - - [01/Jun/2025:08:48:25 +0200] "GET /shop/page/10/?filter_product_brand=82,84,81,85,86,76,78,80,90,77,88,87&filtering=1 HTTP/1.1" 301 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
34.236.41.241 - - [01/Jun/2025:08:48:26 +0200] "GET /shop/page/12/?filter_product_brand=82,81,80,85,92,86,90,88,78,77,76,84&filtering=1 HTTP/1.1" 301 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
34.239.197.197 - - [01/Jun/2025:08:48:27 +0200] "GET /shop/page/4/?filter_product_brand=82,84,81,92,90,76,88,80,77,86,78,87&filtering=1 HTTP/1.1" 301 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
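
Before blaming a single bot, it helps to quantify which user agents are generating the traffic. A quick sketch, assuming the same combined log format and log path as above (split each line on the double quotes and count the user-agent field):

awk -F'"' '{print $6}' /usr/local/apache/domlogs/domain-name-ssl_log | sort | uniq -c | sort -rn | head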

If this goes on for days or weeks, this is the next step:

Robots.txt solution for Amazon bot

You can try to block `Amazonbot` in robots.txt. Unfortunately Amazon has long since abandoned its bot, so it doesn't respect this setting:

cat /home/domain-name/public_html/robots.txt

# Block Amazonbot due to excessive crawling of shop filters
User-agent: Amazonbot
Disallow: /
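
To confirm the bot is ignoring robots.txt rather than simply never fetching it, you can grep the same access log for robots.txt requests, for example:

grep -i "GET /robots.txt" /usr/local/apache/domlogs/domain-name-ssl_log | grep -i amazonbot | tail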

.htaccess solution for Amazon bot

Since robots.txt didn't work, I moved on to blocking it in .htaccess:

# BEGIN Block Amazonbot
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC]
RewriteRule .* - [F,L]
</IfModule>
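
A quick way to check that the rule works is to spoof the user agent with curl against your own domain; the request should now come back as 403 Forbidden:

curl -I -A "Amazonbot/0.1" https://domain-name/shop/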

IP address madness for Amazon bot

I tried logging and blocking all the IPs for Amazonbot:

tail -f /usr/local/apache/domlogs/the-username-example.com-ssl_log | egrep -i amazonbot | awk '{print $1, $4, "Amazonbot"}'
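
Blocking each address by hand looked something like this (a sketch assuming CSF on a cPanel server, where csf -d adds an IP to the deny list with a comment):

csf -d 3.89.170.186 "Amazonbot excessive crawling"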

After an hour I gave up. The more I blocked, the more the IPs changed. So I switched to brute-force protection and wrote a script that protects 100 servers instead of just one. The script runs from a central server that controls the firewall and tails the log on the server under attack. The biggest challenge was output buffering between the remote tail and the local pipeline.

ssh root@server-under-attack.example.com 'tail -F /usr/local/apache/domlogs/diebraaigat.co.za-ssl_log' \
    | stdbuf -oL grep -i amazonbot \
    | stdbuf -oL awk '{print $1}' \
    | while read -r ip; do
        echo "$ip" >> crawl.amazonbot.amazon
        pvesh create /cluster/firewall/ipset/amazonbot -cidr "$ip" -comment "$(date '+%F_%T')"
    done
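
Note that the pvesh call above only adds entries to an ipset; for it to have any effect, the ipset must exist and a cluster firewall rule must reference it. A minimal sketch using the Proxmox VE API paths (verify the options against your PVE version):

# Create the cluster-wide ipset once
pvesh create /cluster/firewall/ipset -name amazonbot -comment "Amazonbot crawl IPs"

# Add an inbound DROP rule that uses the ipset as source
pvesh create /cluster/firewall/rules -type in -action DROP -source +amazonbot -enable 1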

I've ended up with a nice list of most of Amazonbot's IPs. You're welcome to contact me for the list, including their reverse DNS entries.
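
If you want to check the reverse DNS yourself, a quick sketch over the collected file (uses the host utility from bind-utils):

sort -u crawl.amazonbot.amazon | while read -r ip; do
    printf '%s %s\n' "$ip" "$(host "$ip" | awk '{print $NF}')"
done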

Bing Bot

You might see the same with Bingbot:

cat /usr/local/apache/domlogs/username/domain-name2-ssl_log | grep -i bing

40.77.167.58 - - [01/Jun/2025:08:44:13 +0200] "GET /?m=184521425 HTTP/1.1" 301 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36"
40.77.167.58 - - [01/Jun/2025:08:44:14 +0200] "GET /?m=1334325 HTTP/1.1" 301 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36"
40.77.167.58 - - [01/Jun/2025:08:44:16 +0200] "GET /?m=297123225 HTTP/1.1" 301 - "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36"

Solution for Bing bot:

cat /home/domain-name2/public_html/robots.txt

# Block Bingbot due to redundant or excessive crawling of query parameters
User-agent: bingbot
Disallow: /
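
Unlike Amazonbot, the real Bingbot does respect robots.txt, so it is worth checking whether the traffic is genuine before blocking anything. Microsoft's advice is a reverse/forward DNS round trip; a sketch with one of the IPs from the log above:

# Reverse lookup: genuine Bingbot IPs resolve to a *.search.msn.com hostname
host 40.77.167.58

# Then forward-resolve the returned hostname; it should come back to the same IP
# host <hostname-from-previous-output>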

But Why

In this day and age of AI, we suspect Anthropic may be using a disguised Amazonbot, that Bing might be doing some AI crawling of its own, or that another (Chinese) bot is impersonating them to fool us. We don't know, and we don't care. We just want these bots to back off and stop consuming excessive resources on our servers.
