One Way robots.txt Can Be a Security Risk

The robots.txt file can be a security risk if this one thing is present: a disallow line naming a directory that contains sensitive information.

If you don't specify a sensitive directory, the risk isn't there.
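For example, a robots.txt file with a line like the following (the directory name is a hypothetical example) announces the directory's existence to anyone who reads the file:

```
User-agent: *
Disallow: /private-reports/
```

The disallow line tells well-behaved spiders to stay out, but it also tells everyone else exactly where to look.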

I'll describe why specifying a sensitive directory is a security risk, and also a way to provide some protection for sensitive information if you can't avoid listing it in robots.txt.

The risk exists because robots.txt is a public file. Not just to spiders, but also to humans.

To see for yourself, request the robots.txt file at a domain and you'll see what it contains. Most domains have a robots.txt file. To request the file, type "robots.txt" after the slash following the domain name. Example:

https://www.willmaster.com/robots.txt

Rogue bots tend to be especially interested in directories listed as disallowed in the robots.txt file. See the Controlling the Spiders article for more information about this.

One reason directories with sensitive data end up listed as disallowed in the robots.txt file is that their content tends to show up in search indexes.

Perhaps the sensitive information is linked to from other websites, perhaps the search spider indexes every page the browser loads that isn't already in its index, or perhaps it gets there some other way.

Listing the directory with sensitive information in robots.txt may eventually remove the directory from the search indexes.

There's a way to provide some protection from rogue bots even when directories with sensitive information are listed in robots.txt.

There are two rules to follow in regard to the index page of the directory with sensitive information:

  1. Don't put sensitive information on the index page itself.

  2. Don't link to sensitive information from the index page.
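An index page that follows both rules can be as plain as this placeholder (the wording and filename are hypothetical; any page with no sensitive content and no sensitive links will do):

```
<!DOCTYPE html>
<html>
<head><title>No Public Index</title></head>
<body>
<!-- Rule 1: no sensitive information on this page. -->
<!-- Rule 2: no links to files containing sensitive information. -->
<p>This directory has no public index.</p>
<p><a href="/">Return to the home page</a></p>
</body>
</html>
```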

Rogue spiders that follow the disallowed directory get the index page. There, they find none of your sensitive information and no links to it.

All they get is the information on the index page and information at the destination of links found there.

Legitimate, well-behaved spiders don't index the pages with sensitive information because the pages are in a disallowed directory.
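That respect for the disallow rules can be seen in Python's standard-library robots.txt parser, which well-behaved crawlers use (or reimplement). The rules and URLs below are hypothetical examples:

```python
# A well-behaved crawler checks robots.txt before fetching a page.
# urllib.robotparser is Python's standard-library parser; the
# disallow rule and URLs here are hypothetical examples.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private-reports/
""".splitlines())

# Pages inside the disallowed directory are off-limits...
print(rp.can_fetch("*", "https://example.com/private-reports/report.html"))  # False
# ...while the rest of the site may be crawled.
print(rp.can_fetch("*", "https://example.com/index.html"))  # True
```

A rogue bot simply skips this check, which is why the index-page rules above are worth following.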

Rogue robots do follow directories disallowed in the robots.txt file. When disallowed directories contain files with sensitive information, the technique described above can alleviate the security risk.

Will Bontrager


© 1998-2001 William and Mari Bontrager
© 2001-2011 Bontrager Connection, LLC
© 2011-2024 Will Bontrager Software LLC