Controlling the Spiders
You can control the good spiders. Most search engine
indexing spiders are good spiders.
Controlling the spiders means leaving instructions for
them. Good spiders follow your instructions.
Rogue spiders ignore your instructions.
Really bad spiders go after directories and files you
specifically disallow.
At the end of this article, I'll mention a way to control
the really bad ones. But that's just a few paragraphs. Most
of this article is dedicated to controlling the good
spiders.
A web site is personal property. Just like guests in your
home, spiders should behave in a manner consistent with
your wishes.
There are two ways to let spiders know how they should
behave:
- At the individual page level, with meta tags to
  instruct whether or not the page may be indexed
  and whether or not any links on the page may be
  followed.
- At the domain level, with a file named robots.txt
  in the document root directory that lists all the
  directories and/or pages that may not be indexed
  and, sometimes, a list of directories and/or pages
  that may be indexed.
Controlling Spiders with Meta Tags
If you're unfamiliar with meta tags, they are optional HTML
tags that reside in the HEAD area of web pages.
The name of the meta tag is "robots" and the instructions
are in the content= attribute. Here is an example with the
instructions blank:
<meta
name="robots"
content="_____,_____" />
Notice the two blank areas between the quotes, separated
by a comma. They hold the two instructions.
The first blank area is for the instruction whether or not
to index the current page. The instruction is either
"index" or "noindex".
The second blank area is for the instruction whether or
not to follow any links. The instruction is either
"follow" or "nofollow".
To tell spiders it is okay to index the current page but
not to follow any links, this would be the meta tag:
<meta
name="robots"
content="index,nofollow" />
Specify index or noindex before the comma, and follow or
nofollow after the comma, to indicate your wishes.
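To see how a spider might read these instructions, here is a minimal Python sketch using the standard html.parser module. The class and function names are made up for illustration, and treating a missing tag as "index,follow" is an assumption about the usual default, not something every spider guarantees.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the instructions from a <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        # Assumed defaults when the page has no robots meta tag.
        self.may_index = True
        self.may_follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The instructions are separated with a comma.
        for instruction in content.lower().split(","):
            instruction = instruction.strip()
            if instruction == "index":
                self.may_index = True
            elif instruction == "noindex":
                self.may_index = False
            elif instruction == "follow":
                self.may_follow = True
            elif instruction == "nofollow":
                self.may_follow = False


def read_robots_meta(html):
    """Return (may_index, may_follow) for a page's HTML source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.may_index, parser.may_follow


page = '<html><head><meta name="robots" content="index,nofollow" /></head></html>'
print(read_robots_meta(page))
```

The function returns a pair: whether the page may be indexed, then whether its links may be followed.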
Controlling Spiders Using a robots.txt File
To control spiders with the robots.txt file, the file must
be in your document root directory. If you can access the
file with your browser using the URL
http://yourdomainhere.com/robots.txt then it's in the right
location. (Name the file with all lower-case characters. It
is case-sensitive.)
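If it helps, the robots.txt location can be worked out from any page URL on the site. Here is a small Python sketch using the standard urllib.parse module. The function name is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://yourdomainhere.com/articles/spiders.html?page=2"))
```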
Now that you know where to put the file, here is what it
should contain.
robots.txt contains instructions for individual spiders
and/or for all spiders. The instructions simply specify
what may not be indexed.
Each block first names the spiders it pertains to. The
instructions follow.
The line(s) naming the spiders start with
User-agent:
(notice the colon) followed by the spider's
identification.
The lines specifying what may not be indexed start with
Disallow:
(again, the colon) followed by the directory or file name
to be disallowed.
There may be more than one spider identification line and
there may be more than one directory/file name line.
Here is an example that tells the googlebot to stay away
from all files in the /cgi-bin/ directory:
User-agent: googlebot
Disallow: /cgi-bin/
If you have instructions for other spiders, insert a blank
line in the file, then follow the same format: spiders
first, then instructions.
Here is an example robots.txt file with three instruction
blocks.
- Tell the googlebot and Googlebot-Image (Google's
  image indexing spider) to stay away from the
  /hotbotstuff/ directory.
- Tell HotBot's spider to stay away from the
  /googlestuff/ directory.
- Tell all spiders (indicated by an asterisk in place
  of the spider identification) to stay out of the
  /cgi-bin/ and /secret/ directories and out of the
  mypasswords.html file located in the document root.
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
When a good spider reads the robots.txt file, it keeps the
instructions at hand while it's spidering the web site.
Before the spider indexes a page, it consults the robots.txt
file, omitting any instruction blocks that do not pertain
to it. A spider that finds a block naming it follows that
block. The block with a user-agent of "*" is the default
for spiders that no other block names. In the above
example, the googlebot would read only the first
instruction block, and a spider not named anywhere in the
file would read only the last.
The spider reads its instructions from top to bottom,
stopping at the first directory or file name match. Thus, a
spider with no block of its own consulting the above
robots.txt file before spidering
http://example.com/secret/index.html would see the
Disallow: /secret/
line and not index the /secret/index.html page.
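The matching described above can be tried out with Python's standard urllib.robotparser module, which also checks the lines of a block from top to bottom and stops at the first match. This is only a sketch of how one parser behaves, not a statement about any particular search engine. "examplebot" is a hypothetical spider name with no block of its own, and the build_parser helper is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no download)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "examplebot" is hypothetical; no block names it, so "*" applies.
print(parser.can_fetch("examplebot", "http://example.com/secret/index.html"))
print(parser.can_fetch("examplebot", "http://example.com/index.html"))

# Spiders named in a block are checked against that block.
print(parser.can_fetch("googlebot", "http://example.com/hotbotstuff/page.html"))
print(parser.can_fetch("slurp", "http://example.com/googlestuff/page.html"))
```

can_fetch returns False when a Disallow line matches the start of the requested path, and True when nothing matches.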
Some spiders will recognize an Allow: instruction in the
robots.txt file. If you use an allow instruction to allow
indexing certain directories or files, realize that those
spiders that don't recognize it will simply ignore it.
An example of use would be to let google index only the
http://example.com/mostlysecret/googlefood.html page in
that directory and disallow the entire /mostlysecret/
directory to all other robots. Because a spider may obey
only the block that names it and skip the "all robots"
block, the google block carries its own disallow for the
directory. And because the instructions are read from top
to bottom, the allow should be above that disallow.
Example:
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/
Allow: /mostlysecret/googlefood.html
Disallow: /mostlysecret/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /mostlysecret/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
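One way to check an Allow line before relying on it is to run it through Python's standard urllib.robotparser module, which recognizes Allow. The sketch below uses a trimmed-down file. Its googlebot block carries its own Disallow for the directory because this parser applies only the block that names the spider. "examplebot" is a hypothetical spider name, the allowed helper is made up for illustration, and other parsers may resolve Allow and Disallow differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Allow: /mostlysecret/googlefood.html
Disallow: /mostlysecret/

User-agent: *
Disallow: /mostlysecret/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(spider, path):
    """Ask the parser whether the spider may fetch a path on example.com."""
    return parser.can_fetch(spider, "http://example.com" + path)


# The Allow line comes first, so it wins for this one page.
print(allowed("googlebot", "/mostlysecret/googlefood.html"))
# Everything else in the directory hits the Disallow line.
print(allowed("googlebot", "/mostlysecret/other.html"))
# A spider without its own block falls back to the "*" block.
print(allowed("examplebot", "/mostlysecret/googlefood.html"))
```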
If you want to disallow nothing, use
User-agent: *
Disallow:
(The above is what good spiders assume when they don't find
a robots.txt file.)
If you want to disallow everything, use
User-agent: *
Disallow: /
Notice that the difference between allowing your whole site
to be indexed and preventing any indexing of your site at
all is one character, the "/".
Disallow: with nothing following it means disallow nothing
(which actually means allow everything). And Disallow: with
the single character "/" following it means disallow
everything (which means nothing can be indexed).
The "/" character is so powerful because it represents the
root directory. When the root directory is disallowed,
everything in it, including all subdirectories, is
disallowed, too.
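The one-character difference is easy to confirm with Python's standard urllib.robotparser module. This is a minimal sketch. The may_fetch helper and the "examplebot" spider name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def may_fetch(robots_txt, path, spider="examplebot"):
    """Return True if the given robots.txt text lets the spider fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(spider, "http://example.com" + path)


# An empty Disallow: blocks nothing.
print(may_fetch(ALLOW_ALL, "/any/page.html"))
# Disallow: / matches the start of every path on the site.
print(may_fetch(BLOCK_ALL, "/any/page.html"))
```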
Spider Names
The spider name (following the "User-agent:" line) is
obtained from the user-agent signature the spider provides
when it visits a web site. That user-agent information is
recorded in server log files.
One way to get the spider names is to scan your server logs
for robots.txt request entries. These will almost always be
spiders.
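Scanning the logs by hand gets tedious, so here is a rough Python sketch that pulls out the user-agent of every robots.txt request. It assumes the log uses the common "combined" format, with the user-agent as the last quoted field. Your server's format may differ. The sample lines are made up for the demonstration, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed layout: combined log format, user-agent in the last quoted field.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(log_lines):
    """Count requests for /robots.txt per user-agent string."""
    agents = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            agents[match.group("agent")] += 1
    return agents


# Made-up sample entries using documentation-only IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2004:13:55:36 -0700] "GET /robots.txt HTTP/1.0" '
    '200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.44 - - [10/Oct/2004:13:56:02 -0700] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.10 - - [10/Oct/2004:14:01:10 -0700] "GET /robots.txt HTTP/1.0" '
    '200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

for agent, count in robots_txt_requesters(SAMPLE_LOG).most_common():
    print(count, agent)
```

To run it against a real log, pass an open file object in place of SAMPLE_LOG.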
Another way, and far easier, is to consult compiled lists
of spider names, although you may still want to consult
your logs for currently unlisted spiders. Here are three
web sites with such lists:
http://www.robotstxt.org/wc/active.html
http://www.jafsoft.com/searchengines/webbots.html
http://searchenginewatch.com/webmasters/article.php/2167991
Common Problems
This article contains enough to get you going. But it is
far from everything there is to know about the robots.txt
file.
A list of common problems found during a large crawl is at
http://www.searchengineworld.com/misc/robots_txt_crawl.htm
Validators
Here are two robots.txt validators (the first doesn't
recognize the Allow: instruction):
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
http://www.ukoln.ac.uk/web-focus/webwatch/services/robots-txt/
Resources
Here are two pages with robots.txt specifications:
http://www.robotstxt.org/wc/norobots-rfc.html
http://www.robotstxt.org/wc/norobots.html
If you decide to view other web sites' robots.txt files
(which are always at the http://example.com/robots.txt URL)
as a resource, realize they may be informative but they may
also contain errors you do not want to repeat. Use the
"common problems" and "specifications" URLs above to inform
yourself rather than assuming others are doing it right.
Controlling Really Bad Spiders
There's not a whole lot you can do about these guys, but
you can make it uncomfortable for some.
The first thing to do is to identify them. One way to do
that is to put a "Disallow:" line in your robots.txt file
that disallows a non-existent directory. Make sure it's a
directory name that isn't referenced anywhere else. An
unlikely name, like "skeidafkewe", would be good, because
some good people try to guess where certain information
might be rather than go through a navigation menu
("/contact" or "/checkout", for example), and a guessable
name could catch them by mistake.
When you have your robots.txt file updated to disallow a
non-existent directory, it's fairly certain that only very
bad spiders will attempt to access it. They would read the
entry in robots.txt and try to spider the disallowed
directory anyway.
You can find those attempts in your error log. The error
log will contain an IP address.
Use your .htaccess file to ban that IP address from your
domain. Simply add this line to your .htaccess file,
replacing #.#.#.# with the IP address you're banning:
deny from #.#.#.#
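If the trap catches more than a few visitors, the ban lines can be generated rather than typed. Error log formats vary between servers, so this Python sketch assumes the failed requests also show up in an access log in the common log format. That is an assumption to check against your own server. The trap directory name, the sample lines, and the function name are all hypothetical.

```python
import re

# Hypothetical trap directory; use whatever name robots.txt disallows.
TRAP_PATH = "/skeidafkewe/"

# Assumed layout: common log format, IP address first, request in quotes.
REQUEST = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "\S+ (?P<path>\S+)')


def deny_lines(log_lines, trap_path=TRAP_PATH):
    """Return one 'deny from' line per IP address that requested the trap."""
    offenders = []
    for line in log_lines:
        match = REQUEST.match(line)
        if not match:
            continue
        ip = match.group("ip")
        if match.group("path").startswith(trap_path) and ip not in offenders:
            offenders.append(ip)
    return ["deny from " + ip for ip in offenders]


# Made-up sample entries using documentation-only IP addresses.
SAMPLE_LOG = [
    '198.51.100.7 - - [12/Oct/2004:03:14:07 -0700] "GET /skeidafkewe/ HTTP/1.0" 404 208',
    '192.0.2.44 - - [12/Oct/2004:03:15:00 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '198.51.100.7 - - [12/Oct/2004:03:15:09 -0700] "GET /skeidafkewe/a.html HTTP/1.0" 404 208',
]

for line in deny_lines(SAMPLE_LOG):
    print(line)
```

Look over the output before pasting it into .htaccess, so that an address you recognize doesn't get banned by accident.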
Another thing a person might do, which can be especially
annoying for really bad spiders in the service of email
address harvesters, is to actually create the disallowed
directory and give it some web page files. These web page
files can then be filled with invalid email addresses.
If you use this method, the IP addresses can be found in
the regular server logs instead of in the error logs.
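The decoy pages themselves can be generated. The Python sketch below is one possible approach, not a tested defense. It builds addresses under the ".invalid" top-level domain, which is reserved and never resolves, so no real mailbox can be hit. The function names, address lengths, and page layout are made up for illustration.

```python
import random
import string


def decoy_addresses(count, seed=None):
    """Return made-up email addresses under the reserved .invalid domain."""
    rng = random.Random(seed)
    letters = string.ascii_lowercase
    addresses = []
    for _ in range(count):
        user = "".join(rng.choice(letters) for _ in range(8))
        host = "".join(rng.choice(letters) for _ in range(10))
        addresses.append(f"{user}@{host}.invalid")
    return addresses


def decoy_page(count, seed=None):
    """Return the HTML for a page listing the decoy addresses as links."""
    links = "\n".join(
        f'<a href="mailto:{address}">{address}</a><br />'
        for address in decoy_addresses(count, seed)
    )
    return (
        "<html><head><title>Contacts</title></head><body>\n"
        f"{links}\n"
        "</body></html>"
    )


print(decoy_page(3, seed=1))
```

Passing a seed makes the output repeatable. Leave it out to get different addresses on each run.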
Will Bontrager
©2004 Bontrager Connection, LLC