One Way robots.txt Can Be a Security Risk
The robots.txt file can be a security risk if this one thing is present: a disallow line for a directory containing sensitive information.
If no sensitive directory is listed, this particular risk doesn't exist.
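For example, a disallow line like the one below tells every reader of the file exactly where the sensitive material lives. The directory name /private-reports/ is made up for illustration:

User-agent: *
Disallow: /private-reports/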
I'll describe why listing a sensitive directory is a security risk, and also a way to provide some protection for sensitive information if you can't get away from listing the directory in robots.txt.
The risk exists because robots.txt is a public file, readable not just by spiders but also by humans.
To see for yourself, request the robots.txt file at a domain; most domains have one. To request the file, type "robots.txt" after the slash following the domain name. Example:
https://www.willmaster.com/robots.txt
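If you prefer to fetch the file programmatically, here is a minimal Python sketch using only the standard library. It requests and prints the same example file:

# Fetch a site's robots.txt and print its contents.
from urllib.request import urlopen

with urlopen("https://www.willmaster.com/robots.txt") as response:
    print(response.read().decode("utf-8"))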
Rogue bots tend to be especially interested in directories listed as disallowed in the robots.txt file. See the Controlling the Spiders article for more information about this.
One reason a directory with sensitive data may be listed as disallowed in the robots.txt file is that its content tends to end up in search indexes.
Perhaps the sensitive information is linked to from other websites, or perhaps the search spider goes after every page the browser loads that isn't already in its index, or the content gets there some other way.
Listing the directory with the sensitive information in robots.txt may eventually get the directory's pages removed from the search indexes.
There's a way to provide some protection from rogue bots even when directories with sensitive information are listed in robots.txt.
There are two rules to follow regarding the index page of the directory with sensitive information (a minimal example page follows the list):
- Don't put sensitive information on the index page itself.
- Don't link to sensitive information from the index page.
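Here is a sketch of what such an index page might contain. The wording is arbitrary; the point is that the page holds nothing sensitive and links to nothing sensitive:

<!DOCTYPE html>
<html>
<head>
<title>Index</title>
</head>
<body>
<p>There is no public content in this directory.</p>
</body>
</html>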
Rogue spiders that go after the disallowed directory get the index page. There, they find none of your sensitive information and no links to any of it.
All they get is the information on the index page and whatever is at the destination of links found there.
Legitimate, well-behaved spiders don't index the pages with sensitive information because the pages are in a disallowed directory.
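As a sketch of what well-behaved means in practice, Python's standard library includes a robots.txt parser that a compliant crawler can use. With the made-up /private-reports/ directory from the earlier example:

# How a compliant crawler decides whether it may fetch a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.willmaster.com/robots.txt")
rp.read()  # download and parse the live robots.txt file

# can_fetch() returns False when the URL falls under a Disallow rule
# for this user agent. The /private-reports/ path is the made-up example
# directory, not a real location on the site.
print(rp.can_fetch("ExampleCrawler", "https://www.willmaster.com/private-reports/page.html"))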
Rogue robots do follow directories disallowed in the robots.txt file. When disallowed directories contain files with sensitive information, the technique described above can alleviate the security risk.
Will Bontrager