A simple bot defense against crawlers

I recently noticed that a web service I had used before started showing an “are you human” prompt. The challenge page is written in PHP, and since my main domain host runs Apache and PHP, I figured I should try it out, so I asked the author of the page how they are doing it.

They replied with a link to a blog post that describes the PHP script, which is set up for lighttpd, so I tried to adapt it to Apache mod_rewrite.

For now I am using it on my DokuWiki site to block crawlers that walk through all the pages with idx, backlinks and so on, while still allowing the plain pages, since I have no problem with my stuff ending up in the collective AI mind.

The main problem I had was that I needed a rule that checks whether any cookie is set, and I couldn’t get a plain empty-string comparison to work. I ended up using a regex, which works fine. The rules serve a few pages and URLs without any check and send every URL that has a query string to the bot check, so e.g. wiki.domain/?do=backlink gets blocked. Once any page has been loaded, the cookie rule lets further requests pass, so bookmarked pages or inbound links that lead to a URL with ? ask the bot question only once.

The PHP script is exactly the same as in the blog article. I will provide my own version if I make any substantial changes; for now you can just use the one from the original post.

The mod_rewrite rules are the following:

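# run all urls that have a query through the bot check
# (the patterns start with /, so these rules are meant for server or
#  virtual host context with RewriteEngine On, not for a .htaccess file)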

# allow cookiecheck urls in all cases
  RewriteRule ^/cookiecheck/    -        [L]

# allow local files like robots.txt and favicon.ico
  RewriteRule ^/[^/]*\.[^/]*$   -       [L]

# no cookie at all and a non-empty query string: redirect to the bot check
  RewriteCond %{QUERY_STRING}   .
  RewriteCond %{HTTP_COOKIE}    "^$"
  RewriteRule ^(/.*) /cookiecheck/check.php?redirect=$1?%{QUERY_STRING} [R=302,L]

The script lives at /cookiecheck/check.php and the article URLs are in the root directory, e.g. wiki.domain/page:start. A few files such as robots.txt can be accessed without a check as well.
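To sanity-check rules like these without touching the server, the decision logic can be modelled in a few lines of Python. This is only a rough sketch of how I read the three rules above: the two regexes are copied from the RewriteRule patterns, while the function name `decide` and its return values are made up for illustration. It does not model any escaping that Apache applies to the redirect target.

```python
import re

# Patterns copied from the RewriteRule lines; everything else is illustrative.
ALLOW_COOKIECHECK = re.compile(r"^/cookiecheck/")
ALLOW_ROOT_FILE = re.compile(r"^/[^/]*\.[^/]*$")


def decide(url, cookie_header=""):
    """Return "pass" or the redirect target the rules would roughly produce.

    `url` is the path plus optional query string, e.g. "/?do=backlink".
    `cookie_header` is the raw Cookie header ("" when none was sent).
    """
    # Like Apache, split at the first "?" into URL path and query string.
    path, _, query = url.partition("?")

    # Rule 1: the check script itself is always reachable.
    if ALLOW_COOKIECHECK.search(path):
        return "pass"

    # Rule 2: single-segment paths containing a dot (robots.txt, favicon.ico).
    if ALLOW_ROOT_FILE.search(path):
        return "pass"

    # Rule 3: non-empty query string and no cookie at all -> bot check.
    if query and cookie_header == "":
        return f"/cookiecheck/check.php?redirect={path}?{query}"

    return "pass"


if __name__ == "__main__":
    for url, cookie in [
        ("/page:start", ""),
        ("/?do=backlink", ""),
        ("/?do=backlink", "DokuWiki=abc"),
        ("/robots.txt", ""),
    ]:
        print(url, repr(cookie), "->", decide(url, cookie))
```

One thing the model makes visible: because the second rule only looks at the path, any root-level URL whose name contains a dot passes even when it carries a query string. Whether that matters depends on how the wiki names its pages.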

I will start monitoring the access log of the wiki server. There may still be mistakes in the rules, but for now it looks quite good.
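For the log monitoring, a small script along these lines could give a first overview. It is a sketch under assumptions: it expects Apache's common or combined log format, the function name `summarize` and the bucket names are invented here, and treating every 302 on a URL with a query string as a bot-check redirect is only an approximation, since the wiki itself may also answer with a 302.

```python
import re
from collections import Counter

# Picks the request line and status code out of a common/combined log entry.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) '
)


def summarize(lines):
    """Count requests per rough category (bucket names are illustrative)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            counts["unparsed"] += 1
            continue
        target = match.group("target")
        status = match.group("status")
        path, _, query = target.partition("?")
        if path.startswith("/cookiecheck/"):
            # The challenge page itself was requested.
            counts["challenge_served"] += 1
        elif status == "302" and query:
            # Probably sent to the bot check (approximation, see above).
            counts["redirected_to_challenge"] += 1
        elif query:
            # Query URL that was let through, e.g. because a cookie was set.
            counts["query_passed"] += 1
        else:
            counts["plain_page"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < access.log
    for bucket, number in summarize(sys.stdin).most_common():
        print(f"{number:8d}  {bucket}")
```

A high share of redirected requests that never come back with a cookie would suggest the rules are catching crawlers, while many unexpected entries in the passed bucket would be a hint to look at the rules again.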
