Server Log Mining
Every request for a file on your server is logged. Web pages, images, CSS files — every file. Whether by browser, spider, robot, or other software, the request is logged.
The server access log holds a treasure trove of data mining opportunities.
Some statistics services, like Google Analytics and Yandex, do a pretty good job. Some hit counter services do a better job than others. However, none of those services has all the information available in your server logs.
Even log analysis software installed on your server can miss treasures.
I'll show you how to mine those treasures.
Some of the data mining opportunities to be found in your server access log:
- Which blog indexers and aggregators slurp your RSS feed.
- Which search engine spiders visit your site, when they come, and which pages they slurp.
- Which links to your web pages Facebook users are sending to other Facebook users.
- How many visitors a certain website sent to a specific web page.
- How many times a certain web page was loaded with a specific code in the URL.
- How often Bing's spider, or any other search engine's spider, visits your website.
The server access log referred to in this article is the log file created by the server on Unix/Linux hosting accounts. Most use the Apache server, which creates the access log. Other operating systems may also use Apache.
If you are unsure where on your server the access log is located, your hosting company can help you.
Download a PDF document describing how to read the server access log.
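As a rough illustration of how an entry breaks down into fields, here is a minimal PHP sketch. It assumes the log uses Apache's "combined" format, which the example entries in this article appear to follow. The function name parseLogEntry and the array keys are made up for this illustration; they are not part of any library.

```php
<?php
// Sketch only: assumes Apache's "combined" log format.
// parseLogEntry() is a hypothetical helper written for this illustration.
function parseLogEntry(string $line): ?array
{
    // ip, identity, user, [time], "request", status, bytes, "referrer", "user-agent"
    $pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/';
    if (!preg_match($pattern, $line, $m)) {
        return null; // the line is not in the expected format
    }
    return [
        'ip'        => $m[1],
        'identity'  => $m[2],
        'user'      => $m[3],
        'time'      => $m[4],
        'request'   => $m[5],
        'status'    => (int) $m[6],
        'bytes'     => $m[7] === '-' ? 0 : (int) $m[7], // "-" means no body was sent
        'referrer'  => $m[8],
        'userAgent' => $m[9],
    ];
}

$entry = parseLogEntry('207.46.13.96 - - [22/Nov/2024:03:42:48 -0500] "GET /robots.txt HTTP/1.1" 200 213 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"');
echo $entry['ip'], ' requested "', $entry['request'], '" with status ', $entry['status'], "\n";
// prints: 207.46.13.96 requested "GET /robots.txt HTTP/1.1" with status 200
```

The pattern is deliberately simple. It would not handle a field that contains an escaped quotation mark, so treat it as a starting point rather than a complete parser.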
In this article, I'll first provide representative examples of log entries for the first three data mining opportunities in the above list.
Then, I'll provide software for mining additional data from your server logs, such as the last three treasures in the above list.
Aggregators, Indexers, and Facebook User Links
Here are example access log entries from requests by several blog indexers and aggregators.
173.192.238.41 - - [22/Nov/2024:03:04:37 -0500] "GET /index.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.19; aggregator:Spinn3r (Spinn3r 3.1); http://spinn3r.com/robot) Gecko/2010040121 Firefox/3.0.19"
72.14.199.90 - - [22/Nov/2024:03:10:44 -0500] "GET /index.xml HTTP/1.1" 304 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=1180860485387498323)"
66.235.116.128 - - [22/Nov/2024:03:29:20 -0500] "GET /index.xml HTTP/1.1" 304 - "-" "Bloglines/3.1 (http://www.bloglines.com; 3 subscribers)"
Legitimate spiders, including indexers and aggregators, generally provide a valid URL within the user-agent information where you can find out more about the spider.
You'll find a URL in each of the above log entries that leads to more information about that particular spider.
The following 3 example log entries show that 3 search engine spiders made requests for web pages — Google's Googlebot, Yahoo!'s Yahoo! Slurp, and Bing's bingbot.
66.249.71.75 - - [22/Nov/2024:03:11:25 -0500] "GET /lightstories/ HTTP/1.1" 200 7634 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
67.195.111.176 - - [22/Nov/2024:03:18:24 -0500] "GET /2002travel-log/Park-Rapids-MN.php HTTP/1.0" 200 6563 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"
207.46.13.96 - - [22/Nov/2024:03:42:48 -0500] "GET /robots.txt HTTP/1.1" 200 213 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
You'll see a URL in each of the user-agent information fields of the above log entries. The page at the URL contains information about the particular spider that made the request.
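Pulling that URL out of a user-agent field can be sketched in a few lines of PHP. This is only an illustration: spiderInfoUrl is a hypothetical function name, and the pattern assumes the URL ends at a space, a semicolon, or a closing parenthesis, which holds for the entries shown in this article but may not hold for every spider.

```php
<?php
// Sketch only: spiderInfoUrl() is a hypothetical helper, not part of Logminer.
function spiderInfoUrl(string $userAgent): ?string
{
    // Take the first http(s) URL, stopping at whitespace, ";" or ")".
    if (preg_match('~https?://[^\s;)]+~i', $userAgent, $m)) {
        return $m[0];
    }
    return null; // no URL in this user-agent field
}

echo spiderInfoUrl('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), "\n";
// prints: http://www.google.com/bot.html
```

A user-agent field without a URL returns null, which is itself a hint that the request may not have come from a legitimate spider.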
Here is an example log entry made when Facebook's external hit spider requested a page.
66.220.149.251 - - [22/Nov/2024:03:42:02 -0500] "GET /recipes/Vegetable-Beef-Soup-Recipe.php HTTP/1.1" 206 6280 "-" "Facebookexternalhit/1.1 (+http://www.Facebook.com/externalhit_uatext.php)"
Following the URL in the user-agent field, you'll see a web page that says a Facebook user sent a link to other Facebook users.
Mining Data
Before I show you how to mine specific information from the server access log, here is a snippet showing that a request for a web page often results in additional requests.
For example, a search at Google for "how long to let a pie cool" could find the "How to Make a Very Good Pie" page at example.com.
When the browser retrieves the page, it would need an external CSS file and several images in order to display the page for the user. Thus, it would send a separate request to the server for each of those files.
74.216.112.159 - - [22/Nov/2024:14:00:27 -0500] "GET /recipes/How-to-Make-a-Very-Good-Pie.php HTTP/1.1" 200 12044 "http://www.google.ca/url?sa=t&source=web&cd=1&ved=0CBQQFjAA&url=http%3A%2F%2Fexample.com%2Frecipes%2FHow-to-Make-a-Very-Good-Pie.php&rct=j&q=how%20long%20to%20let%20a%20pie%20cool&ei=vkaeTJXINIiPnwehiNGtDQ&usg=AFQjCNHjfVRKxTH2-AehlqewSZiaLvEuXA" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB6.5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; VER#5A#80837548716745484952484849; InfoPath.2)"
74.216.112.159 - - [22/Nov/2024:14:00:28 -0500] "GET /lfstyle.css HTTP/1.1" 200 3965 "http://example.com/recipes/How-to-Make-a-Very-Good-Pie.php" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB6.5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; VER#5A#80837548716745484952484849; InfoPath.2)"
74.216.112.159 - - [22/Nov/2024:14:00:28 -0500] "GET /images/wmlogolightfocus_sm.gif HTTP/1.1" 200 2001 "http://example.com/recipes/How-to-Make-a-Very-Good-Pie.php" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB6.5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; VER#5A#80837548716745484952484849; InfoPath.2)"
74.216.112.159 - - [22/Nov/2024:14:00:28 -0500] "GET /images/dot.gif HTTP/1.1" 200 44 "http://example.com/recipes/How-to-Make-a-Very-Good-Pie.php" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB6.5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; VER#5A#80837548716745484952484849; InfoPath.2)"
74.216.112.159 - - [22/Nov/2024:14:00:29 -0500] "GET /terra/thblue-midnite.jpg HTTP/1.1" 200 2992 "http://example.com/recipes/How-to-Make-a-Very-Good-Pie.php" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB6.5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; VER#5A#80837548716745484952484849; InfoPath.2)"
74.216.112.159 - - [22/Nov/2024:14:00:30 -0500] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; GTB6.5; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; VER#5A#80837548716745484952484849; InfoPath.2)"
Each request was from the same IP address by the same browser.
The last of the above log entries is the browser requesting the favicon.ico file (custom icon for browser address bar and bookmarks) in the document root directory.
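The referrer field of the first entry above also carries the phrase the visitor searched for. As a hedged sketch, the PHP below recovers it with the built-in parse_url and parse_str functions. The function name searchPhrase is hypothetical, and the code assumes the search engine passes the phrase in a parameter named q, as the Google referrer above does. The referrer in the example call is an abbreviated version of the one in the log entry.

```php
<?php
// Sketch only: searchPhrase() is a hypothetical helper. It assumes the
// search phrase travels in a "q" query parameter, as in the entry above.
function searchPhrase(string $referrer): ?string
{
    $query = parse_url($referrer, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, e.g. a referrer logged as "-"
    }
    parse_str($query, $params); // also URL-decodes each value
    if (isset($params['q']) && is_string($params['q']) && $params['q'] !== '') {
        return $params['q'];
    }
    return null; // the referrer has no usable "q" parameter
}

// Abbreviated form of the referrer in the first log entry above.
echo searchPhrase('http://www.google.ca/url?sa=t&source=web&q=how%20long%20to%20let%20a%20pie%20cool'), "\n";
// prints: how long to let a pie cool
```

Many search engines no longer include the phrase in the referrer, so expect null for a large share of current log entries.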
The Actual Data Mining
To make the search through your server logs faster and much more efficient, use the Logminer PHP software (source code below).
Below is a representation of the form Logminer provides when it's launched for mining data in your server access log.
Location of log file:
Search for (blank to match all):
Maximum matches (blank or 0 for unlimited):
When the form is submitted and the software finds a match, it displays the entire log entry within which the match was found.
If the log file is huge, consider specifying a maximum number of matches.
If your server's permissions are set so that the above software cannot read the access log, put a copy of the log into an accessible area of your server.
Here is the source code of the Logminer PHP software.
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Access Log Data Mining</title>
<style type="text/css">
body { font-family:sans-serif; font-size:12pt; }
</style>
</head>
<body>
<pre>
<?php
// Run the search only when the form was submitted with a log file location.
if( isset($_POST['submit']) and isset($_POST['logfile']) and strlen(trim($_POST['logfile']))>0 )
{
   // Maximum number of matches to display; 0 means unlimited.
   $max = ( isset($_POST['max']) and intval($_POST['max'])>0 ) ? intval($_POST['max']) : 0;
   // Search term; an empty string matches every log entry.
   $term = ( isset($_POST['term']) and strlen(trim($_POST['term']))>0 ) ? trim($_POST['term']) : '';
   $f = @fopen(trim($_POST['logfile']),'rt');
   if($f)
   {
      $count = 0;
      // Read one entry at a time so a huge log never has to fit in memory.
      while( ($logline = fgets($f)) !== false )
      {
         // Case-insensitive literal match; "/" and "." in the term need no escaping.
         if( $term === '' or stripos($logline,$term) !== false )
         {
            // Escape the entry so markup in a logged request can't run in the browser.
            echo htmlspecialchars($logline);
            $count++;
            if( $max and ($count == $max) ) { break; }
         }
      }
      fclose($f);
   }
   else
   {
      echo '<h1 style="margin:100px;">Unable to open file "'.htmlspecialchars($_POST['logfile']).'"</h1>';
   }
}
?>
</pre>
<div style="margin:100px auto 100px auto; width:400px;">
<h3 style="text-align:center;">
<span style="letter-spacing:3px;">Access Log Mining</span><br>
<span style="font-weight:normal;"><span style="font-size:smaller;">from</span><br>
<a href="//www.willmaster.com/">Willmaster.com</a></span>
</h3>
<form method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
<p>
Location of log file:<br>
<input type="text" style="width:400px;" name="logfile" value="<?php echo htmlspecialchars($_POST['logfile'] ?? ''); ?>">
</p>
<p>
Search for (blank to match all):<br>
<input type="text" style="width:400px;" name="term" value="<?php echo htmlspecialchars($_POST['term'] ?? ''); ?>">
</p>
<p>
Maximum matches (blank or 0 for unlimited):
<input type="text" style="width:75px;" name="max" value="<?php echo htmlspecialchars($_POST['max'] ?? ''); ?>">
</p>
<p>
<input type="submit" style="width:400px;" name="submit" value="Mine It">
</p>
</form>
</div>
</body>
</html>
Copy the source code and save it as logminer.php or another file name with a .php extension. Upload it to your server.
If you are concerned that others might snoop into your log files (the concern is valid, because the software will display any file on the server that PHP can read), put the logminer.php file into a password-protected directory. (See How To Password Protect Domain Directories/Folders)
To mine, type the URL of logminer.php into your browser's address bar.
Some search terms to get you started:
- Feedfetcher-Google
- Yahoo! Slurp
- bingbot
- download (to see if anyone is snooping for downloads)
- The file name of your RSS feed
- The domain name of a referrer
- An image file name (to see if someone is publishing it on their site)
- wp-login.php (to see if hackers are looking for a WordPress login page)
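To answer a "how often" question, such as how often Bing's spider visits, the matching entries can be tallied by day instead of read one by one. The PHP below is a minimal sketch of that idea, not part of Logminer. The function name countByDay is hypothetical, and the code assumes each entry's timestamp looks like [22/Nov/2024:03:42:48 -0500], as in the examples in this article. The two sample lines are entries shown earlier.

```php
<?php
// Sketch only: countByDay() is a hypothetical helper, not part of Logminer.
// Assumes timestamps of the form [22/Nov/2024:03:42:48 -0500].
function countByDay(iterable $lines, string $term): array
{
    $perDay = [];
    foreach ($lines as $line) {
        if (stripos($line, $term) === false) {
            continue; // this entry does not contain the search term
        }
        // Capture the date part of the timestamp, e.g. 22/Nov/2024.
        if (preg_match('~\[(\d{2}/[A-Za-z]{3}/\d{4}):~', $line, $m)) {
            $perDay[$m[1]] = ($perDay[$m[1]] ?? 0) + 1;
        }
    }
    return $perDay;
}

$sample = [
    '66.249.71.75 - - [22/Nov/2024:03:11:25 -0500] "GET /lightstories/ HTTP/1.1" 200 7634 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '207.46.13.96 - - [22/Nov/2024:03:42:48 -0500] "GET /robots.txt HTTP/1.1" 200 213 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
];
print_r(countByDay($sample, 'bingbot')); // one bingbot entry on 22/Nov/2024
```

For a real log, the lines could come from PHP's file() function or from an SplFileObject, which reads the file one line at a time. The same function works for a referrer's domain name or a code in the URL; only the search term changes.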
With this software, you can mine any treasure available in your server's access log file.
Will Bontrager