A simple bot defense for crawlers
I recently noticed that a web service I had used before started showing an “are you human” prompt. The challenge page is written in PHP, and since my main domain host runs Apache and PHP, I figured I should try it out, so I asked the author of the page how they do it.
They replied with a blog post describing the PHP script, which is set up for lighttpd, so I tried to adapt it for Apache mod_rewrite.
For now I am using it on my DokuWiki site to block crawlers that walk through every page with idx, backlinks and so on, while still allowing the plain pages, since I have no problem with my stuff ending up in the collective AI mind.
The main problem was that I needed a rule that checks whether any cookie is set, and I could not get the rule for the empty string right; I ended up using a regex, which works fine. The rules serve a few pages and URLs without any check and block every URL that has a query string, so e.g. wiki.domain/?do=backlink gets blocked. Once any page has been loaded, the cookie rule lets the requests pass, so bookmarked pages or inbound links that lead to a URL with ? will ask the bot question only once.
The PHP script is exactly the same as in the blog article. I will provide my own version if I make any substantial changes; for now you can just use the one from the original post.
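For reference, mod_rewrite offers a few ways to express "the Cookie header is empty". The fragment below is only a sketch and has not been tested on this setup; the environment variable name NO_COOKIE is made up for the illustration, and only one of the three conditions should be active at a time.

```apache
RewriteEngine On

# Variant 1: regex anchored at both ends (the form used in this post)
RewriteCond %{HTTP_COOKIE} ^$
# Variant 2: negated regex, true when not even one character is present
# RewriteCond %{HTTP_COOKIE} !.
# Variant 3: plain string comparison against the empty string
# RewriteCond %{HTTP_COOKIE} =""

# Hypothetical action: only mark the request, do not rewrite anything
RewriteRule ^ - [E=NO_COOKIE:1]
```

All three conditions are true only when the client sent no cookie at all, so any cookie, no matter which one, lets the request through.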
The mod_rewrite rules are the following:
# run all urls that have a query through the bot check
# allow cookiecheck urls in all cases
RewriteRule ^/cookiecheck/ - [L]
# allow local files like robots.txt and favicon.ico
RewriteRule ^/[^/]*\.[^/]*$ - [L]
# redirect to the bot check if there is a query string and no cookie is set
RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_COOKIE} "^$"
RewriteRule ^(/.*) /cookiecheck/check.php?redirect=$1?%{QUERY_STRING} [R=302,L]
The script lives in /cookiecheck/check.php and the article URLs are in the root directory, e.g. wiki.domain/page:start. A few files like robots.txt can be accessed as well.
I will start monitoring the access log of the wiki server. There may be some mistakes in the rules, but for now it looks quite good.
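One thing to keep in mind: the patterns in the rules start with a slash, which is the form mod_rewrite matches against in the server or virtual host configuration. In a .htaccess file the leading slash is stripped before matching, so there the rules would have to look roughly like the following. This is an untested sketch that assumes the .htaccess file sits in the document root of the wiki.

```apache
# Sketch for a .htaccess file in the document root (assumption, untested)
RewriteEngine On

# allow cookiecheck urls in all cases
RewriteRule ^cookiecheck/ - [L]
# allow local files like robots.txt and favicon.ico
RewriteRule ^[^/]*\.[^/]*$ - [L]
# redirect to the bot check if there is a query string and no cookie is set
RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_COOKIE} ^$
# the slash that .htaccess matching strips is put back in front of $1
RewriteRule ^(.*)$ /cookiecheck/check.php?redirect=/$1?%{QUERY_STRING} [R=302,L]
```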