Grab the Source Code

There are a number of reasons to programmatically retrieve web page source code. Here are a few.

  • Testing URLs when doing automated links checking.

  • Verifying that certain HTML or JavaScript code either is present or is absent from the web page at the URL. (It comes in handy when maintaining a list of pages that link back to your website.)

  • Updating your own search engine's database.

  • Seeing the source code as provided by the server rather than as it might be changed by the browser.

The software you'll find in this article retrieves the source code straight from the server for each of a list of URLs. It lists any 404 Not Found, 500 Internal Server Error, or other non-200 OK server responses that the software may encounter.

Optionally, it will save the retrieved source code (from any public URL on the internet) to your server for your later review.

The software has a place clearly marked where you can add your own code to scan the retrieved source code.

Don't make the retrieved source code public or publish it as your own, of course. This software is for ethical use.

The software's method of grabbing a page respects server resources. Only the source code of the web page is retrieved. Image, CSS, and other supporting files are ignored.

If you later load the saved source code into your browser, the images, CSS, and other external files the page requires will be retrieved, assuming the URLs to those external files are valid.

Some web pages use absolute URLs for locations of external resources. Some use relative URLs. And some use both within the page.

Relative URLs to external files are likely to be broken because you are loading the web page from an unexpected location.
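One way to work around broken relative URLs is to add a base tag to the saved copy so the browser resolves them against the page's original address. The following is a minimal sketch, not part of Grab Source Code; the function name AddBaseTag() is hypothetical, and it assumes the page has either a head tag or no head at all.

```php
<?php
// Hypothetical helper for illustration; not part of Grab Source Code.
// Inserts <base href="..."> right after the opening <head> tag so that
// relative URLs in a saved copy resolve against the original address.
function AddBaseTag($html, $url)
{
   // A page that already declares its own base is left untouched.
   if( preg_match('/<base\b/i', $html) ) { return $html; }
   $base = '<base href="' . htmlspecialchars($url, ENT_QUOTES) . '">';
   $count = 0;
   $result = preg_replace_callback(
      '/<head\b[^>]*>/i',
      function ($match) use ($base) { return $match[0] . "\n" . $base; },
      $html,
      1,
      $count
   );
   // No <head> tag found: put the base tag at the very top instead.
   return $count ? $result : $base . "\n" . $html;
}
```

If this were used with the software below, it could be called on the retrieved source code just before the file is saved. The URL to pass would probably be the final address after any redirects, which the software has available as $info['url'].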

The Software

Below is the PHP software source code. Save it as GrabSourceCode.php or any other *.php file name you prefer.

The software source code has 3 places that may be customized. Customization notes follow.

<?php
/*
Grab Source Code
Version 1.0
May 31, 2020
Will Bontrager Software LLC
https://www.willmaster.com/
*/

/* Customizations */
// Three places to customize.

// Place 1:
// The directory to store retrieved source code, 
//   location relative to document root. To omit 
//   storing retrieved code, leave the value blank.

$DirectoryToStoreWebPages = '/php/storedpages';

// Place 2:
// Between the lines containing the word 
//   URLSLIST
//   type the URLs of the source code to grab. 
//   Blank lines are acceptable.

$URLs = <<<URLSLIST
http://example.com
https://www.example.com/page.php
URLSLIST;

// Place 3:
// If you wish to process the source code 
//   as it is retrieved, the PHP code to scan 
//   or manipulate it can be put into this 
//   function ProcessWebPageSourceCode(). 
//   Admittedly, you may need to write other 
//   functions depending on how elaborate or 
//   extensive is the processing you want done.

function ProcessWebPageSourceCode($page)
{
// [YOUR CODE HERE]
}

/* End of customization */
/* *** *** *** *** *** */

mb_internal_encoding('UTF-8');
$content = '';
$ListOfURLs = array();
$DirectoryToStoreWebPages = empty($DirectoryToStoreWebPages) ? false : "{$_SERVER['DOCUMENT_ROOT']}$DirectoryToStoreWebPages";
foreach( preg_split('/[\r\n]/',trim($URLs)) as $url )
{
   if( ! preg_match('!//!',$url) ) { continue; }
   $url = trim($url);
   $filename = false;
   if( $DirectoryToStoreWebPages )
   {
      $match = array();
      preg_match('!^.+//([^/]+)!',$url,$match);
      $domain = isset($match[1]) ? $match[1] : false;
      if( ! $domain )
      {
         $ListOfURLs[] = "(malformed URL) $url";
         continue;
      }
      $filename = preg_replace('/\./','_',$domain);
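      // A short hash of the URL keeps two pages from the same domain, 
      //   retrieved within the same second, from overwriting each other.
      $filename .= '_' . time() . '_' . substr(md5($url),0,8) . '.html';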
   }
   $info = GetSourceCode($url,$content);
   $message = '';
   if( $info['http_code'] == 200 ) { ProcessWebPageSourceCode($content); }
   else
   {
      $filename = false;
      $message = "({$info['http_code']}) ";
   }
   $message .= $url;
   if( $info['redirect_count'] > 0 )  { $message .= "<br>(redirected to {$info['url']})"; }
   if( $info['errno'] > 0 )  { $message .= "<br>Error - {$info['errmsg']}"; }
   $ListOfURLs[] = $message;
   if( $DirectoryToStoreWebPages and $filename ) { file_put_contents("$DirectoryToStoreWebPages/$filename",$content); }
   $content = '';
}
echo '<p>' . implode("</p>\n<p>",$ListOfURLs) . '</p>';
echo '-DONE-';

function GetSourceCode($url,&$content)
{
   $options = array(
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_HEADER         => false,
      CURLOPT_CONNECTTIMEOUT => 120,
      CURLOPT_TIMEOUT        => 120,
      CURLOPT_FOLLOWLOCATION => true,
      CURLOPT_MAXREDIRS      => 10,
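      // No browser user agent exists when run from the command line (cron, for example), so fall back to a fixed string.
      CURLOPT_USERAGENT      => isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'GrabSourceCode/1.0',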
      CURLOPT_VERBOSE        => false
   );
   $info = array();
   $ch = curl_init($url);
   curl_setopt_array($ch,$options);
   $content = curl_exec($ch);
   $info = curl_getinfo($ch);
   $info['errno'] = curl_errno($ch);
   $info['errmsg'] = curl_error($ch);
   curl_close($ch);
   return $info;
}
exit;
?>

Customization —

The 3 places that may be customized are where to put the retrieved source code, the URLs of the source code to retrieve, and custom handling code.

  1. Where to put the retrieved source code:

    In the PHP software source code, you'll see this line.

    $DirectoryToStoreWebPages = '/php/storedpages';
    

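    If you wish to save retrieved source code, replace /php/storedpages with the directory location for the files, relative to the document root. The directory must already exist and be writable by PHP; the software does not create it.

    To omit saving the retrieved source code, remove /php/storedpages so that only the two quotation marks remain.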

  2. The URLs of the source code to retrieve:

    In the PHP software source code, you'll see this block of text.

    $URLs = <<<URLSLIST
    http://example.com
    https://www.example.com/page.php
    URLSLIST;
    

    Replace the two example URLs with your own list of URLs of source code to retrieve, one URL per line.

  3. Custom handling code:

    In the PHP software source code, you'll see this block of text.

    function ProcessWebPageSourceCode($page)
    {
    // [YOUR CODE HERE]
    }
    

    Currently, the function ProcessWebPageSourceCode() does nothing. If you want it to do a job for you with the source code at each URL, replace
    // [YOUR CODE HERE]
    with the code to do the job.

    No example code can be provided at this point because I don't know what job you want it to do.

    Something a person might want to do is scan the code for a specific string of characters and, perhaps, log the URL or send an email when a match is found.

    Another might be to count the number of images the page uses and whether any are pulled in from another domain. Or a person might want to gather the URLs the page links to.

    Whatever additional processing you wish to do with the retrieved source code, if any, replace // [YOUR CODE HERE] with your custom code.
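Purely as an illustration of the kinds of jobs mentioned in item 3, here is a minimal sketch that scans the source code for a string and gathers the URLs the page links to. The helper names PageContainsString() and GatherLinkURLs(), the string being searched for, and the use of error_log() as the log destination are all assumptions made for the example, not part of Grab Source Code. The link gathering assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helpers for illustration; not part of Grab Source Code.

// True when $needle occurs anywhere in the source code (case-insensitive).
function PageContainsString($page, $needle)
{
   return $needle !== '' && stripos($page, $needle) !== false;
}

// Returns the unique href values of the <a> tags in the source code.
function GatherLinkURLs($page)
{
   if( trim($page) === '' ) { return array(); }
   // Real-world HTML is rarely perfect; keep parser warnings off the screen.
   $previous = libxml_use_internal_errors(true);
   $doc = new DOMDocument();
   $doc->loadHTML($page);
   libxml_clear_errors();
   libxml_use_internal_errors($previous);
   $urls = array();
   foreach( $doc->getElementsByTagName('a') as $a )
   {
      $href = trim($a->getAttribute('href'));
      if( $href !== '' ) { $urls[] = $href; }
   }
   return array_values(array_unique($urls));
}

// One possible body for the function marked [YOUR CODE HERE].
function ProcessWebPageSourceCode($page)
{
   // Assumption: the string to look for is a placeholder.
   if( PageContainsString($page, 'https://www.example.com/') )
   {
      error_log('Match found; the page links to ' . count(GatherLinkURLs($page)) . ' URLs.');
   }
}
```

Note that ProcessWebPageSourceCode() receives only the source code, not the URL it came from. Logging the URL along with the match would require passing it in as a second parameter where the function is called.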

When the Grab Source Code software has been customized, upload it to your server.

To run the software, put the software's URL into your browser's address bar. The software can also be run on a schedule set up with cron or other scheduler.
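A scheduler can run the software either by requesting its URL or by handing the file to command-line PHP. In the second case there is no browser, so $_SERVER['HTTP_USER_AGENT'] is not set, and $_SERVER['DOCUMENT_ROOT'] is typically empty. The following minimal sketch shows one way to supply fallbacks. The function name ResolveStorageDirectory(), the choice of the script's own directory as the fallback base, and the fallback user agent string are assumptions for illustration.

```php
<?php
// Hypothetical helper for illustration; not part of Grab Source Code.
// Builds the full storage path, using $fallbackBase when no document
// root is available (as is typical under command-line PHP).
function ResolveStorageDirectory($relativeDir, $documentRoot, $fallbackBase)
{
   if( $relativeDir === '' ) { return false; }   // blank means "do not store"
   $base = ($documentRoot !== '') ? $documentRoot : $fallbackBase;
   return rtrim($base, '/') . '/' . ltrim($relativeDir, '/');
}

$documentRoot = isset($_SERVER['DOCUMENT_ROOT']) ? (string) $_SERVER['DOCUMENT_ROOT'] : '';

// Assumption: fall back to the directory this script lives in.
$directory = ResolveStorageDirectory('/php/storedpages', $documentRoot, __DIR__);

// Assumption: a fixed user agent string when no browser supplied one.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'GrabSourceCode/1.0';
```

When the scheduler requests the software's URL instead, the web server supplies the document root as usual and the fallback base is never used.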

When the software runs, it lists each URL on the screen. A URL that returned anything other than a 200 status code is preceded by that code in parentheses. Any redirect or cURL error message is noted beneath the URL. When done, -DONE- is put on the screen.

If the software is run with the example URLs that come with the above source code, it would print these three lines (as of the date this article was published).

http://example.com
(404) https://www.example.com/page.php
-DONE-

Grab Source Code works well for retrieving web page source code at your list of URLs and, optionally, storing it. And, you can include your own processing code.

(This article first appeared with an issue of the Possibilities newsletter.)

Will Bontrager
