Spider Spoof Detection
Many spiders identify themselves. Quite a few others spoof their identity and pretend to be a regular browser.
The spoofers can wreak havoc with site visitor page-view and click-through statistics. They may also follow no-follow links, submit forms and, in general, snoop where they have no business being.
This article will help you identify spiders, even spiders that pretend to be a regular browser. It does not describe how to put the kibosh on them (misdirect, ban, …) because every site and situation is different. But it does identify all who wander into a special link trap.
The special link trap is in a div or paragraph with a CSS display:none declaration. To soothe the suspicion of especially wary spiders, JavaScript is present to change the CSS display property, albeit for a highly unlikely situation.
The special link, which no site visitor with a modern browser will ever see (unless they view source code like a spider does), leads to a script that logs the IP address of the spider and the user-agent string it is presenting as its identity.
This is the special link, which can be put anywhere on a PHP web page (with correct link URL):
<p id="link-display" style="display:none;">
<a href="https://example.com/special.php?<?php echo rawurlencode($_SERVER['PHP_SELF']); ?>">A special link</a>
</p>
<script type="text/javascript">
if(location.search=="?unlikely-to-ever-happen") {
   document.getElementById("link-display").style.display="block";
}
</script>
You'll see the link to the special.php logging script within a p tag having a display:none; CSS declaration. The link has a bit of PHP code to insert the URL path of the current web page.
Below the paragraph is the suspicion-soothing JavaScript. It will actually display the special link to the site visitor if they arrive with ?unlikely-to-ever-happen appended to the URL in the browser's address bar. That is highly unlikely to happen unintentionally.
The special.php logging script needs to be installed so the special link can identify spiders.
Here is the source code for the PHP script. Comments follow.
<?php
/*
Access Log Intended to Identify Spiders
Version 1.0
April 25, 2020
Will Bontrager Software LLC
https://www.willmaster.com/
*/

/* CUSTOMIZATIONS */
/* Two places to customize. */

// Place 1 --
// Specify the location of the log file
//    to record access to this script.
//    The file is CSV formatted.

$LogFileLocation = "spiderTrapLog.csv";

// Place 2 --
// Between the lines containing the PAGECONTENT
//    word, specify the content to show to
//    the spider when it follows the special
//    link. HTML markup may be used.

$PageContent = <<<PAGECONTENT
<p>
This is a page.
</p>
PAGECONTENT;

/* END OF CUSTOMIZATION */

$LogLine = array();
$LogLine[] = date('r');
$LogLine[] = $_SERVER['REMOTE_ADDR'];
// Embedded double quotes are doubled so the CSV fields stay intact.
$LogLine[] = str_replace('"','""', (isset($_SERVER['QUERY_STRING']) ? rawurldecode($_SERVER['QUERY_STRING']) : '') );
// Some spiders send no user-agent header at all; log an empty field for those.
$LogLine[] = str_replace('"','""', (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '') );
// LOCK_EX keeps simultaneous requests from interleaving their log lines.
file_put_contents( $LogFileLocation, '"'.implode('","',$LogLine)."\"\n", FILE_APPEND | LOCK_EX );
echo $PageContent;
exit;
?>
Customizations:
Two places need to be customized.
1. Replace spiderTrapLog.csv with the location for your log file. It is a CSV file, so give it a .csv file name extension.
2. Between the lines with the word PAGECONTENT, replace
<p>
This is a page.
</p>
with whatever content you wish the spiders to see when they follow the link trap. If you wish, you can leave it as is.
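To get a feel for what ends up in the log file, here is a minimal sketch that builds one record in the same quoted CSV format the logging script writes and then parses it back. The buildLogRecord function name and the sample values (date, IP address, page path, user-agent string) are made up for illustration and are not part of the logging script. The empty escape argument to str_getcsv assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: builds one CSV record in the log file's format,
// with every field wrapped in double quotes and embedded quotes doubled.
function buildLogRecord(array $fields): string
{
    $escaped = array_map(
        function ($field) {
            return str_replace('"', '""', (string) $field);
        },
        $fields
    );
    return '"' . implode('","', $escaped) . "\"\n";
}

// Made-up sample values, for illustration only.
$record = buildLogRecord([
    'Sat, 25 Apr 2020 12:00:00 +0000', // same format as date('r')
    '203.0.113.7',                     // address from a documentation-only range
    '/index.php',                      // page path passed in the query string
    'Example "quoted" agent/1.0',      // user-agent string containing quotes
]);

echo $record;
// "Sat, 25 Apr 2020 12:00:00 +0000","203.0.113.7","/index.php","Example ""quoted"" agent/1.0"

// str_getcsv() reverses the quoting, so the four fields come back unchanged.
// The empty escape character (PHP 7.4+) means only doubled quotes are special.
$parsed = str_getcsv(rtrim($record, "\n"), ',', '"', '');
print_r($parsed);
```

Because the fields are quoted, the comma inside the date does not split the record, and the file should open cleanly in a spreadsheet program.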
Upload the customized PHP script to your server in a place where a browser can access it. Name it special.php or any other .php name you prefer.
Replace https://example.com/special.php in the special link code with the URL to the customized PHP script you uploaded.
Put the special link code on one or more pages of your website.
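If the link goes on several pages, one option is to keep the markup in a small helper function so the script URL is specified in only one place. The sketch below is an assumption about how that might look, not part of the original code: the spiderTrapLink function name is hypothetical, and https://example.com/special.php stands in for the URL of your uploaded script.

```php
<?php
// Hypothetical helper: returns the hidden paragraph containing the trap link.
// $scriptUrl is the URL of the logging script, $pagePath the current page's path.
function spiderTrapLink(string $scriptUrl, string $pagePath): string
{
    // rawurlencode() keeps the page path safe inside the query string;
    // the logging script decodes it again with rawurldecode().
    $href = $scriptUrl . '?' . rawurlencode($pagePath);

    return '<p id="link-display" style="display:none;">'
        . '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">A special link</a>'
        . '</p>';
}

// Example call on a page; the URL is a placeholder for your own script.
echo spiderTrapLink('https://example.com/special.php', $_SERVER['PHP_SELF'] ?? '/');
```

The JavaScript from the special link code still needs to follow the paragraph on each page, because it looks up the link-display id.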
The special links can be tested by loading the web page in your browser with ?unlikely-to-ever-happen appended to the URL in the address bar. When the page reloads, the special link should be visible to click on for testing.
You are now in a position to detect spiders that visit the pages with the special link code. The user-agent string in the log file can tell you which ones are passing themselves off as a regular web browser.
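Reading the log by eye works for a small site. For a longer log, a short script can do a first pass. The following is a rough sketch rather than a finished tool. It assumes the log file is named spiderTrapLog.csv, that the user-agent string is the fourth field of each record, and that self-identifying spiders include one of a few common words in their user-agent string. The token list and the function names are illustrative assumptions, and the empty escape argument to fgetcsv requires PHP 7.4 or later.

```php
<?php
// Assumed, illustrative list of words that self-identifying spiders
// commonly put in their user-agent strings. Adjust it for your own site.
const SPIDER_TOKENS = ['bot', 'crawl', 'spider', 'slurp'];

// Returns true when the user-agent string does not admit to being a spider.
function looksLikeSpoofer(string $userAgent): bool
{
    foreach (SPIDER_TOKENS as $token) {
        if (stripos($userAgent, $token) !== false) {
            return false; // identifies itself as a spider
        }
    }
    return true; // followed the hidden link but presents as something else
}

// Reads the log and returns [IP address => list of user agents]
// for every entry that looks like a spoofer.
function findSpoofers(string $logFile): array
{
    $spoofers = [];
    $handle = fopen($logFile, 'r');
    if ($handle === false) {
        return $spoofers;
    }
    while (($row = fgetcsv($handle, 0, ',', '"', '')) !== false) {
        if (count($row) < 4) {
            continue; // skip blank or malformed lines
        }
        $ip = $row[1];
        $userAgent = (string) $row[3];
        if (looksLikeSpoofer($userAgent)) {
            $spoofers[$ip][$userAgent] = true; // array keys remove duplicates
        }
    }
    fclose($handle);
    return array_map('array_keys', $spoofers);
}

$logFile = 'spiderTrapLog.csv'; // assumed log file location
if (is_readable($logFile)) {
    foreach (findSpoofers($logFile) as $ip => $agents) {
        echo $ip, ': ', implode(' | ', $agents), "\n";
    }
}
```

Anything this flags followed a link that regular visitors never see, yet did not identify itself as a spider. Treat the output as a list of candidates to review, because a person viewing the page source and clicking the link would be flagged too.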
(This article first appeared with an issue of the Possibilities newsletter.)
Will Bontrager