Grab the Source Code
There are a number of reasons to programmatically retrieve web page source code. Here are a few.
-
Testing URLs when doing automated link checking.
-
Verifying that certain HTML or JavaScript code is either present in or absent from the web page at the URL. (This comes in handy when maintaining a list of pages that link back to your website.)
-
Updating your own search engine's database.
-
Seeing the source code as provided by the server rather than as it might be changed by the browser.
The software you'll find in this article retrieves the source code straight from the server for each URL in a list. It lists any 404 Not Found, 500 Internal Server Error, or other non-200 OK server responses that it encounters.
Optionally, it will save the retrieved source code (from any public URL on the internet) to your server for your later review.
The software has a place clearly marked where you can add your own code to scan the retrieved source code.
Don't make the retrieved source code public or publish it as your own, of course. This software is for ethical use.
The Software
Below is the PHP software source code. Save it as GrabSourceCode.php or any other *.php file name you prefer.
The software source code has 3 places that may be customized. Customization notes follow.
<?php
/*
Grab Source Code
Version 1.0
May 31, 2020
Will Bontrager Software LLC
https://www.willmaster.com/
*/

/* Customizations */
// Three places to customize.

// Place 1:
// The directory to store retrieved source code,
// location relative to document root. To omit
// storing retrieved code, leave the value blank.

$DirectoryToStoreWebPages = '/php/storedpages';

// Place 2:
// Between the lines containing the word
// URLSLIST
// type the URLs of the source code to grab.
// Blank lines are acceptable.

$URLs = <<<URLSLIST
http://example.com
https://www.example.com/page.php
URLSLIST;

// Place 3:
// If you wish to process the source code
// as it is retrieved, the PHP code to scan
// or manipulate it can be put into this
// function ProcessWebPageSourceCode().
// Admittedly, you may need to write other
// functions depending on how elaborate or
// extensive is the processing you want done.

function ProcessWebPageSourceCode($page)
{
   // [YOUR CODE HERE]
}

/* End of customization */
/* *** *** *** *** *** */

mb_internal_encoding('UTF-8');
$content = '';
$ListOfURLs = array();
$DirectoryToStoreWebPages = empty($DirectoryToStoreWebPages) ? false : "{$_SERVER['DOCUMENT_ROOT']}$DirectoryToStoreWebPages";
foreach( preg_split('/[\r\n]/',trim($URLs)) as $url )
{
   // Skip blank lines and anything that does not look like a URL.
   if( ! preg_match('!//!',$url) ) { continue; }
   $url = trim($url);
   $filename = false;
   if( $DirectoryToStoreWebPages )
   {
      // Build the storage file name from the domain and the current time.
      $match = array();
      preg_match('!^.+//([^/]+)!',$url,$match);
      $domain = isset($match[1]) ? $match[1] : false;
      if( ! $domain )
      {
         $ListOfURLs[] = "(malformed URL) $url";
         continue;
      }
      $filename = preg_replace('/\./','_',$domain);
      $filename .= '_' . time() . '.html';
   }
   $info = GetSourceCode($url,$content);
   $message = '';
   if( $info['http_code'] == 200 ) { ProcessWebPageSourceCode($content); }
   else
   {
      // Non-200 responses are reported but never stored.
      $filename = false;
      $message = "({$info['http_code']}) ";
   }
   $message .= $url;
   if( $info['redirect_count'] > 0 ) { $message .= "<br>(redirected to {$info['url']})"; }
   if( $info['errno'] > 0 ) { $message .= "<br>Error - {$info['errmsg']}"; }
   $ListOfURLs[] = $message;
   if( $DirectoryToStoreWebPages and $filename )
   {
      file_put_contents("$DirectoryToStoreWebPages/$filename",$content);
   }
   $content = '';
}
echo '<p>' . implode("</p>\n<p>",$ListOfURLs) . '</p>';
echo '-DONE-';

function GetSourceCode($url,&$content)
{
   // Fall back to a placeholder user agent when the browser supplies none
   // (for example, when the script is run from the command line by cron).
   $UserAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'GrabSourceCode/1.0';
   $options = array(
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_HEADER         => false,
      CURLOPT_CONNECTTIMEOUT => 120,
      CURLOPT_TIMEOUT        => 120,
      CURLOPT_FOLLOWLOCATION => true,
      CURLOPT_MAXREDIRS      => 10,
      CURLOPT_USERAGENT      => $UserAgent,
      CURLOPT_VERBOSE        => false
   );
   $info = array();
   $ch = curl_init($url);
   curl_setopt_array($ch,$options);
   $content = curl_exec($ch);
   $info = curl_getinfo($ch);
   $info['errno'] = curl_errno($ch);
   $info['errmsg'] = curl_error($ch);
   curl_close($ch);
   return $info;
}
exit;
?>
Customization
The 3 places that may be customized are where to put the retrieved source code, the URLs of the source code to retrieve, and custom handling code.
-
Where to put the retrieved source code:
In the PHP software source code, you'll see this line.
$DirectoryToStoreWebPages = '/php/storedpages';
If you wish to save retrieved source code, replace /php/storedpages with the directory location for the files. To omit saving the retrieved source code, remove /php/storedpages so that the value between the quotes is blank.
-
The URLs of the source code to retrieve:
In the PHP software source code, you'll see this block of text.
$URLs = <<<URLSLIST
http://example.com
https://www.example.com/page.php
URLSLIST;
Replace the example URLs between the two URLSLIST lines with your own list of URLs of source code to retrieve, one URL per line.
-
Custom handling code:
In the PHP software source code, you'll see this block of text.
function ProcessWebPageSourceCode($page)
{
   // [YOUR CODE HERE]
}
Currently, the function ProcessWebPageSourceCode() does nothing. If you want it to do a job for you with the source code at each URL, replace // [YOUR CODE HERE] with the code to do the job. No specific code is provided here because I don't know what job you want it to do.
Something a person might want to do is scan the code for a specific string of characters and, perhaps, log the URL or send an email when a match is found.
Another might be to count the number of images the page uses and whether any are pulled in from another domain. Or a person might want to gather the URLs the page links to.
Whatever additional processing you wish to do with the retrieved source code, if any, replace // [YOUR CODE HERE] with your custom code.
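As one illustration of the jobs mentioned above, here is a rough sketch of what could replace the empty function. It is an assumption about a typical job, not part of the original software. The helper ScanSourceCode(), the search string, and the layout of the returned array are all hypothetical, and the patterns are quick matches rather than a full HTML parser. Note that ProcessWebPageSourceCode() receives only the source code, so logging the URL would also require passing the URL in.

```php
<?php
// Hypothetical helper, not part of Grab Source Code 1.0.
// Reports whether $needle occurs in $page (case-insensitive),
// collects quoted href values from <a> tags, and counts <img> tags.
function ScanSourceCode($page, $needle)
{
   $result = array(
      'found'  => ($needle !== '' && stripos($page, $needle) !== false),
      'links'  => array(),
      'images' => 0
   );

   // Rough pattern: <a ... href="value"> or <a ... href='value'>.
   $match = array();
   if( preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $page, $match) )
   {
      $result['links'] = $match[2];
   }

   // Count opening <img> tags, whatever their letter case.
   $result['images'] = (int) preg_match_all('/<img\b/i', $page);

   return $result;
}

// Replacement for the empty function in the software.
function ProcessWebPageSourceCode($page)
{
   // 'willmaster.com' is only an example search string.
   $scan = ScanSourceCode($page, 'willmaster.com');
   if( $scan['found'] )
   {
      error_log('Match found; ' . count($scan['links']) . ' links, ' . $scan['images'] . ' images.');
   }
}
```

error_log() writes to the server's PHP error log. Sending an email or appending to your own log file would work as well, depending on where you want the results to go.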
When the Grab Source Code software has been customized, upload it to your server.
To run the software, put the software's URL into your browser's address bar. The software can also be run on a schedule set up with cron or other scheduler.
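If the scheduler starts the script with the PHP command-line program instead of requesting its URL, there is no browser, so $_SERVER['HTTP_USER_AGENT'] is not set and $_SERVER['DOCUMENT_ROOT'] is typically empty. A minimal sketch of fallbacks follows. The function names, the placeholder user agent string, and the choice to fall back to the script's own directory are assumptions for illustration, not part of the original software.

```php
<?php
// Hypothetical helpers, not part of Grab Source Code 1.0.

// Returns the browser's user agent for web requests, or $fallback
// when none is available (for example, under the command line).
function RequestUserAgent($server, $fallback = 'GrabSourceCode/1.0')
{
   return empty($server['HTTP_USER_AGENT']) ? $fallback : $server['HTTP_USER_AGENT'];
}

// Returns the document root for web requests, or $scriptDirectory
// when the document root is missing or empty.
function StorageBaseDirectory($server, $scriptDirectory)
{
   return empty($server['DOCUMENT_ROOT']) ? $scriptDirectory : rtrim($server['DOCUMENT_ROOT'], '/');
}

$userAgent = RequestUserAgent($_SERVER);
$baseDirectory = StorageBaseDirectory($_SERVER, __DIR__);
```

With this approach, a storage directory such as /php/storedpages would be resolved relative to the script's own directory when run from the command line, so that directory would need to exist there.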
When the software runs, it lists each URL it retrieves on the screen. A URL that returned a non-200 status is preceded by the status code, and a URL that caused an error is followed by the error message. When done, -DONE- is put on the screen.
If the software is run with the example URLs that come with the above source code, it would print these three lines (as of the date this article was published).
http://example.com
(404) https://www.example.com/page.php
-DONE-
Grab Source Code retrieves the web page source code for your list of URLs and, optionally, stores it. You can also include your own processing code.
(This article first appeared with an issue of the Possibilities newsletter.)
Will Bontrager