Controlling the Spiders
You can control the good spiders. Most search engine indexing spiders are good spiders.
Controlling the spiders means leaving instructions for them. Good spiders follow your instructions.
Rogue spiders ignore your instructions.
Really bad spiders go after directories and files you specifically disallow.
At the end of this article, I'll mention a way to control the really bad ones. But that's just a few paragraphs. Most of this article is dedicated to controlling the good spiders.
A website is personal property. Just like guests in your home, spiders should behave in a manner consistent with your wishes.
There are two ways to let spiders know how they should behave:
- At the individual page level, with meta tags to instruct whether or not the page may be indexed and whether or not any links on the page may be followed.
- At the domain level, with a file named robots.txt in the document root directory that lists all the directories and/or pages that may not be indexed and, sometimes, a list of directories and/or pages that may be indexed.
Controlling Spiders with Meta Tags
If you're unfamiliar with meta tags, they are optional HTML tags that reside in the HEAD area of web pages.
The name of the meta tag is "robots" and the instructions are in the content= attribute. Here is an example with the instructions blank:
<meta name="robots" content="_____,_____" />
Notice that between the quotes are two blank areas for the instructions, separated by a comma.
The first blank area is for the instruction whether or not to index the current page. The instructions are either "index" or "noindex".
The second blank area is for the instruction whether or not to follow any links. The instructions are either "follow" or "nofollow".
To tell spiders it is okay to index the current page but not to follow any links, this would be the meta tag:
<meta name="robots" content="index,nofollow" />
Specify index or noindex before the comma, and follow or nofollow after the comma, to indicate your wishes.
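As a reminder of where the tag goes, here is a minimal page skeleton (the title and content are placeholders) with the robots meta tag in the HEAD area:

<html>
<head>
<title>Example Page</title>
<meta name="robots" content="index,nofollow" />
</head>
<body>
...page content...
</body>
</html>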
Controlling Spiders Using a robots.txt File
To control spiders with the robots.txt file, the file must be in your document root directory. If you can access the file with your browser using the URL http://yourdomainhere.com/robots.txt, then it's in the right location. (Name the file with all lower-case characters. The file name is case-sensitive.)
Now that you know where to put the file, here is what it should contain.
robots.txt contains instructions for individual spiders and/or for all spiders. The instructions simply specify what may not be indexed.
The spiders the instructions pertain to are named first. The instructions follow.
The line(s) naming the spiders start with
User-agent:
(notice the colon) and is followed by the spider's identification.
The lines specifying what may not be indexed start with
Disallow:
(again, the colon) followed by the directory or file name to be disallowed.
There may be more than one spider identification line and there may be more than one directory/file name line.
Here is an example that tells the googlebot to stay away from all files in the /cgi-bin/ directory:
User-agent: googlebot
Disallow: /cgi-bin/
If you have instructions for other spiders, insert a blank line in the file, then follow the same format: spiders first, then instructions.
Here is an example robots.txt file with 3 instruction blocks.
- Tell the googlebot and Googlebot-Image (Google's image indexing spider) to stay away from the /hotbotstuff/ directory.
- Tell HotBot's spider to stay away from the /googlestuff/ directory.
- Tell all spiders (indicated by an asterisk in place of the spider identification) to stay out of the /cgi-bin/ and /secret/ directories and out of the mypasswords.html file located in the document root.
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
When a good spider reads the robots.txt file, it keeps the instructions at hand while it's spidering the website.
Before the spider indexes a page, it consults the robots.txt file, omitting any instruction blocks that do not pertain to it. For example, the googlebot would read blocks with a user-agent of "*" and a user-agent of "googlebot". In the above example, it would read only the first and last instruction blocks.
The spider reads from top to bottom, stopping whenever there's a directory or file name match. Thus, a spider consulting the above robots.txt file before spidering http://example.com/secret/index.html would see the
Disallow: /secret/
line and not index the /secret/index.html page.
Some spiders will recognize an Allow: instruction in the robots.txt file. If you use an allow instruction to allow indexing certain directories or files, realize that those spiders that don't recognize it will simply ignore it.
An example of use would be to let Google index only the http://example.com/mostlysecret/googlefood.html page in that directory while disallowing the entire /mostlysecret/ directory to all other robots. Because the robots.txt file is read from top to bottom, the allow should be in the file above the "all robots" disallow. Example:
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/
Allow: /mostlysecret/googlefood.html

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /mostlysecret/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
If you want to disallow nothing, use
User-agent: *
Disallow:
(The above is what good spiders assume when they don't find a robots.txt file)
If you want to disallow everything, use
User-agent: *
Disallow: /
Notice that the difference between allowing your whole site to be indexed and preventing any indexing of your site at all is one character, the "/".
Disallow: with nothing following it means disallow nothing (which actually means allow everything). And Disallow: with the single character "/" following it means disallow everything (which means nothing can be indexed).
The reason the "/" character is so powerful is that it represents the root directory. When the root directory is disallowed, everything in it, including all subdirectories, is disallowed, too.
Spider Names
The spider name (following the "User-agent:" line) is obtained from the user-agent signature the spider provides when it visits a website. That user-agent information is recorded in server log files.
One way to get the spider names is to scan your server logs for robots.txt request entries. These will almost always be spiders.
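For illustration, here is roughly what such a request might look like in an Apache-style access log (the IP address, date, and byte count below are made up, and your log format may differ):

66.249.66.1 - - [14/Mar/2011:06:25:09 -0500] "GET /robots.txt HTTP/1.0" 200 324 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The quoted string at the end is the user-agent signature. The name within it, Googlebot in this case, is what goes on the User-agent: line.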
Another way, and far easier, is to consult compiled lists of spider names, although you may still want to consult your logs for spiders not yet listed. Here are several websites with such lists:
http://www.useragentstring.com/pages/useragentstring.php
http://www.robotstxt.org/db.html
http://www.botsvsbrowsers.org/
Resources
This page has specific instructions for creating a robots.txt file:
http://www.robotstxt.org/robotstxt.html
If you decide to view other websites' robots.txt files (which are always at the http://example.com/robots.txt URL) as a resource, realize they may be informative but they may also contain errors you do not want to repeat. Use the resources above to inform yourself rather than assuming others are doing it right.
Controlling Really Bad Spiders
There's not a whole lot you can do about these guys, but you can make it uncomfortable for some.
The first thing to do is to identify them. One way to do that is to put a "Disallow:" line in your robots.txt file that disallows a non-existent directory. Make sure it's a directory name that isn't referenced anywhere else. An unlikely name, like "skeidafkewe", would be good, because some well-meaning people try to guess where certain information might be rather than go through a navigation menu ("/contact" or "/checkout", for example).
When you have your robots.txt file updated to disallow a non-existent directory, it's fairly certain that only very bad spiders will attempt to access it. They would read the entry in robots.txt and try to spider the disallowed directory anyway.
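For example, using the made-up directory name from above, the decoy entry could be added to the "all spiders" block of the earlier example like this:

User-agent: *
Disallow: /skeidafkewe/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html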
You can find those attempts in your error log. Each such error log entry will contain the spider's IP address.
Use your .htaccess file to ban that IP address from your domain. Simply add this line to your .htaccess file, replacing #.#.#.# with the IP address you're banning:
deny from #.#.#.#
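If the error log shows attempts from more than one address, add one deny from line per address. For instance (the addresses below are placeholders from a reserved documentation range, not real offenders):

deny from 203.0.113.45
deny from 198.51.100.7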
Another thing a person might do, which can be especially annoying for really bad spiders in the service of email address harvesters, is to actually create the disallowed directory and give it some web page files. These web page files can then be filled with invalid email addresses.
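As a rough sketch, a decoy page in that directory might be nothing more than this (the addresses are invented and use a reserved domain ending so no real mailbox is affected):

<html>
<head><title>Contacts</title></head>
<body>
<p>sales@nowhere.invalid</p>
<p>support@nowhere.invalid</p>
<p>info@nowhere.invalid</p>
</body>
</html>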
If you use this method, the IP addresses can be found in the regular server logs instead of in the error logs.
Will Bontrager