Uncontaminated Logs

Contamination from illegitimate robot log entries can seriously skew conversion calculations. Many robots spoof their identity as real browsers.

Conversion calculations of sales from sales pages and opt-in forms filled in on landing pages are especially important to get right. With incorrect percentages:

  1. Traffic may be wasted because the statistics indicated one thing worked better than another when the reverse was actually true.

  2. Ad money may be spent uselessly, or not spent where it would have done more good.

  3. Affiliates may get widely different results from your own measurements.

Server logs don't filter out robots; they log every request. Statistics software that uses server logs will identify some robot and spider tracks, but not all of them, perhaps no more than a small percentage.

To have correct page view counts with which to do conversion calculations, page requests by illegitimate robots and spiders need to be eliminated.

The trick presented here uses JavaScript, which many robots don't follow — especially not when the JavaScript is constructed in the manner I'm going to demonstrate. Very few robots actually run JavaScript.

Legitimate bots and spiders, such as Googlebot, do follow and run JavaScript. Legitimate ones identify themselves, so no deception there. Their log entries are easily discounted.

The deception is from robots that spoof their identity as real browsers. A significant percentage, perhaps a very high percentage, don't follow JavaScript at all. Even fewer follow JavaScript pulled in from an external file when the URL is fragmented.

I'll describe how to do that.

Undependable Measurements

With illegitimate bots contaminating page request logs, conversion measurements are undependable. Some days there will be more illegitimate bots and some days fewer. The difference can be significant.

If your stats say you have 1000 page loads, how many legitimate page views do you really have? 999? 550? ("Legitimate" being real people viewing a page the way it's meant to be viewed.)

If you knew that between 11% and 14% of your page loads are always robot generated, then your calculations can take the range into account. But generally, it's not that predictable. Some days it might be 5% and other days 35%. Or more.

It's the not knowing that makes things uncertain. You just don't know if an opt-in form is converting at, for example, 7% or 12%. When a change is made to the form or the form page and the conversion percentage goes up, is it because a higher percentage of people used the form or because there were fewer robot entries in the log?

With illegitimate robot tracks removed from the log file, conversion percentages are higher: fewer page views logged with the same conversion count means a higher percentage.

Let's use 10 sales as an example. If there were 1000 page loads including illegitimate robot log entries, the conversion percentage would be 1.0%. Without those robot log entries, there may be only 700 page load entries, for a conversion percentage of about 1.4%, roughly 43% higher.
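The arithmetic as a small JavaScript helper (the function name is mine, for illustration only):

```javascript
// Conversion rate for the example in the text: the same 10 sales
// measured against raw vs. robot-filtered page load counts.
function conversionPercent(sales, pageLoads) {
  return 100 * sales / pageLoads;
}

conversionPercent(10, 1000); // 1.0% with contaminated counts
conversionPercent(10, 700);  // about 1.43% with clean counts
```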

What the Clean Log Software Is For

Clean Log won't replace your regular server logs.

Instead, it's for measuring page loads of special pages where an accurate page load count is imperative for calculating actual conversion rates critical to your business.

You can see how the correct calculation results can have an effect on business decisions.

How Clean Log Works

The JavaScript triggers a PHP logging script. Spiders and robots that don't follow the JavaScript won't trigger the logging software. Thus, you end up with relatively clean logs.

Set up one page or many pages for clean logs.

The log files are CSV, ready to import into your spreadsheet software.

Each log entry contains

  • the date and time the page was loaded,
  • the IP address of the internet connection used to view the page,
  • the URL of the page being viewed,
  • the URL of any referrer, and
  • the user-agent string to identify the type of browser and operating system being used.

A new log file is created every day, so you have a daily view.
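Because a new file is created daily, the date is part of the file name. Here is a JavaScript sketch of the same naming pattern the PHP script's date("Y_m_d") produces (the function name dailyLogName is hypothetical, not part of the Clean Log package):

```javascript
// Build the daily log file name, e.g. "2016_01_23_log.csv",
// mirroring the PHP pattern date("Y_m_d") . "_log.csv".
function dailyLogName(d) {
  var pad = function (n) { return n < 10 ? '0' + n : String(n); };
  return d.getFullYear() + '_' + pad(d.getMonth() + 1) + '_' +
         pad(d.getDate()) + '_log.csv';
}
```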

Implementing Clean Log

There's one PHP script. And there's JavaScript for every web page to be logged.

1. The PHP Script.

Let's install the Clean Log PHP script, first. Customization notes follow the source code.

<?php
/* 
   Clean Log
   Version 1.0
   January 23, 2016

   Will Bontrager Software LLC
   https://www.willmaster.com/
   Copyright 2016 Will Bontrager Software LLC

   This software is provided "AS IS," without 
   any warranty of any kind, without even any 
   implied warranty such as merchantability 
   or fitness for a particular purpose.
   Will Bontrager Software LLC grants 
   you a royalty free license to use or 
   modify this software provided this 
   notice appears on all copies. 
*/
/* Customization. */
//
// Specify the log file name.
// Example: date("Y_m_d") . "_log.csv"
// In the example, date("Y_m_d") inserts the current date 
//   into the log file name, causing a new file every day.
//   To have one log file, rather than split up by days, 
//   omit the date and specify your preferred log file name.

$logfilename = date("Y_m_d") . "_log.csv";

// Specify the directory where the log file will be maintained.

$logfiledirectory = "/stats/log";

/* End of customization. */
/* ** ** ** * * ** ** ** */

mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');
if( ! ini_get('date.timezone') ) { date_default_timezone_set('UTC'); }
$line = array();
$line[] = CSVize( 'datetime='.date('Y-m-d H:i:s') );
$line[] = CSVize( "IP={$_SERVER['REMOTE_ADDR']}" );
$line[] = CSVize( 'page='.(isset($_GET['here'])?$_GET['here']:'') );
$line[] = CSVize( 'referrer='.(isset($_GET['ref'])?$_GET['ref']:'') );
$line[] = CSVize( "UA={$_SERVER['HTTP_USER_AGENT']}" );
$logfiledirectory = $_SERVER["DOCUMENT_ROOT"] . preg_replace('!/*$!','',$logfiledirectory);
file_put_contents("$logfiledirectory/$logfilename",implode(',',$line)."\n",FILE_APPEND);
header('Content-type: text/javascript');
echo '/* nothing to see here */';
function CSVize($s)
{
   if( preg_match('/[",\r\n]/',$s) )
   {
      $s = str_replace('"','""',$s);
      $s = "\"$s\"";
   }
   return $s;
} # function CSVize()
?>

Customization —

The PHP script source code has two places to customize.

  1. The log file name. The copied source code has this log file name:

    date("Y_m_d") . "_log.csv"

    The date() function inserts the current date into the file name. That causes a new log file to be created every day for that day's log entries.

    To have log entries for all days in one log file, remove the date() function and the dot that follows it:

    date("Y_m_d") .

    If appropriate, change "_log.csv" to the name of the log file you prefer; "cleanlog.csv" for example.

  2. The log file directory. Specify the directory where the log file is to be placed. It must be writable which, depending on how your server is set up, may require the directory to have 777 permissions.

That's all for customization.
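For reference, the CSVize() function in the PHP script quotes a field only when it contains a comma, double quote, or line break, doubling any internal quotes. Here is the same rule as a JavaScript sketch (the name csvize is mine, not part of the Clean Log package):

```javascript
// Quote a CSV field the way the PHP CSVize() function does:
// wrap in double quotes only when the field contains a quote,
// comma, CR, or LF, and double any internal quotes.
function csvize(s) {
  if (/[",\r\n]/.test(s)) {
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}
```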

Save the file as CleanLog.php or other .php file name if you prefer.

Upload CleanLog.php to your server. Make a note of its URL.

2. The JavaScript.

The JavaScript code is where illegitimate robots and spiders are eliminated.

They can follow fully-constructed URLs just fine. But many are unable to follow URLs that are segmented as I'll show you how to do with the JavaScript.

Further, many robots, perhaps most, don't run JavaScript at all. It takes a LOT of software code to run JavaScript efficiently. Hobby robot makers just don't spend the time to learn how it works and how to implement it. Thus, few illegitimate bots have it.

The Clean Log JavaScript obfuscates the CleanLog.php URL so illegitimates don't notice it's there. And it embeds the call to CleanLog.php within a JavaScript statement.

To implement, the JavaScript needs to first be customized for the URL to your CleanLog.php. Then obfuscated.

Here's the basic JavaScript code, before obfuscation. A customization note follows.

<script>
var now = new Date();
document.write('<'+'scr'+'ipt sr'+'c="http://example.com/CleanLog.php?JS='+now.getTime()+'&here='+document.URL+'&ref='+document.referrer+'"'+'><'+'/scr'+'ipt'+'>');
</script>

Customization — The URL http://example.com/CleanLog.php needs to be changed to the URL of your installation of CleanLog.php.

In a moment, we'll obfuscate the URL.

You'll notice that, except for the first and last line, the words "script" and some other elements of the source code are segmented. That serves two purposes: (i) It keeps the browser from running the script until after it's written to the page. (ii) It helps to obfuscate the code.

How that works is beyond the scope of this article. In essence, bots that parse source code looking for certain key words are unlikely to spot the code for what it is.
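You can verify the effect of the segmentation yourself: the pieces concatenate into the complete markup only when the script actually runs, so a parser scanning the raw source never sees the literal word "script".

```javascript
// The segmented strings from the code above, reassembled at runtime.
var openTag  = '<' + 'scr' + 'ipt';
var closeTag = '<' + '/scr' + 'ipt' + '>';
```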

Now, let's obfuscate the URL.

To obfuscate, insert the 3-character code

'+'

into the URL in strategic places:

  • Somewhere inside "http"
  • Between the "//" characters
  • Between the first and second character of the top level domain name ("com" in this case)
  • Somewhere inside the "php" file name extension

Additional places in the URL may have the 3-character code inserted. But the above should be sufficient.

With the URL obfuscated, the JavaScript becomes:

<script>
var now = new Date();
document.write('<'+'scr'+'ipt sr'+'c="ht'+'tp:/'+'/example.c'+'om/CleanLog.p'+'hp?JS='+now.getTime()+'&here='+document.URL+'&ref='+document.referrer+'"'+'><'+'/scr'+'ipt'+'>');
</script>

Obfuscate the JavaScript after updating it with the URL of your installation of CleanLog.php.
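A quick way to sanity-check your insertions is to confirm that the fragments concatenate back to your working URL:

```javascript
// The obfuscated fragments from the script above must reassemble
// into the original, working URL of CleanLog.php.
var url = 'ht' + 'tp:/' + '/example.c' + 'om/CleanLog.p' + 'hp';
```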

The obfuscation is designed to defeat text parsing routines so they don't notice the URL to CleanLog.php and, therefore, don't cause it to add an entry to the page view log …

… so you have uncontaminated logs, and your conversion calculations are reliable.

(This article first appeared in Possibilities ezine.)

Will Bontrager

© 1998-2001 William and Mari Bontrager
© 2001-2011 Bontrager Connection, LLC
© 2011-2024 Will Bontrager Software LLC