A web crawler is a program that downloads pages, identifies the hyperlinks within them, and adds those links to a database for future crawling. Crawlers are sometimes called web spiders, robots, worms, or wanderers and can be thought of as automated text browsers. A crawler’s downloaded pages are consumed by a scraper, which parses certain pieces of information out of those pages, such as hyperlinks to other pages.
A crawler can be written to be autonomous, populating its own list of fresh URLs to crawl, but it is normally distributed across many machines and controlled centrally. Sample PHP crawler code is shown in Listing 18.1. These crawlers (which can be written in any language able to connect to the web) begin their work with a list of URLs to retrieve, called the seeds. For a brand-new search engine, the initial seeds might be the URLs of web directories. Unlike an HTTP request from within a browser, a crawler does not immediately download a page's images, styles, and JavaScript files. The links to them, however, can be identified so that those resources can be downloaded later.
class Crawler {
  private $URLList;
  private $nextIndex;
  function __construct(){
    $this->nextIndex = 0;
    $this->URLList = array("http://SEEDWEBSITE/");
  }
  private function getNextURLToCrawl(){
    return $this->URLList[$this->nextIndex++];
  }
  private function printSummary(){
    echo count($this->URLList)." links. Index: ".
         $this->nextIndex."<br>";
    foreach($this->URLList as $link){
      echo $link."<br>";
    }
  }
  // THIS CAN BE CALLED FROM LOOP OR CRON
  public function doIteration(){
    $url = $this->getNextURLToCrawl();
    // Do not crawl if not allowed
    if (robotsDisallow($url))
      return;
    echo "Crawling ".$url."<br>";
    // This function finds the <a> links
    scrapeHyperlinks($url);
    $this->printSummary();
  }
}

In the early days of web crawlers there was no protocol governing how often to request pages, or which pages to include, so some crawlers requested entire sites at once, putting stress on the servers. Moreover, some crawlers retrieved content that the author did not really want or expect to appear in a public directory. These issues gave crawlers a bad reputation. As search engines began to take off, more and more crawlers appeared, indexing more and more pages.
To address the issue of politeness, Martijn Koster, the creator of ALIWEB, drafted a set of guidelines enshrined as the Robots Exclusion Standard, which is still used today.2,3 These guidelines help webmasters discourage certain pages from being crawled and indexed. The simple crawler in Listing 18.1 adheres to the standard by calling the function robotsDisallow().
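The robotsDisallow() function is used but not defined in Listing 18.1. A minimal sketch of it might look like the following; the simplified matching (honoring only Disallow: rules in the "User-agent: *" section) and the optional second parameter are assumptions made for illustration, and a production crawler should use a complete robots.txt parser.

```php
// Minimal sketch of the robotsDisallow() helper used in Listing 18.1.
// ASSUMPTION: only "Disallow:" rules under "User-agent: *" are honored;
// Allow rules, wildcards, and per-agent sections are ignored.
function robotsDisallow($url, $robotsTxt = null){
  $parts = parse_url($url);
  if($robotsTxt === null){
    // Fetch the site's robots.txt (the @ suppresses warnings if it is absent)
    $robotsTxt = @file_get_contents($parts['scheme']."://".$parts['host']."/robots.txt");
    if($robotsTxt === false)
      return false;  // no robots.txt means crawling is permitted
  }
  $path = isset($parts['path']) ? $parts['path'] : "/";
  $applies = false;
  foreach(preg_split('/\r?\n/', $robotsTxt) as $line){
    $line = trim(preg_replace('/#.*/', '', $line));  // strip comments
    if(stripos($line, "User-agent:") === 0){
      $applies = (trim(substr($line, 11)) === "*");
    }else if($applies && stripos($line, "Disallow:") === 0){
      $rule = trim(substr($line, 9));
      // An empty Disallow allows everything; otherwise match by path prefix
      if($rule !== "" && strpos($path, $rule) === 0)
        return true;
    }
  }
  return false;
}
```

Passing the robots.txt text directly, as the second parameter allows, also makes the rule matching easy to test without a network connection.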
A crawler typically requests a page and then downloads its contents to be processed later. Scrapers are programs that identify certain pieces of information within those pages to be stored in databases. Although crawlers and scrapers can be combined into a single program, they are kept separate in many distributed systems.
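The scrapeHyperlinks() function called in Listing 18.1 is likewise left undefined there. One possible sketch is shown below; the function name comes from Listing 18.1, but the body, the optional $html parameter, and the decision to skip relative URLs are assumptions made for this illustration.

```php
// Sketch of the scrapeHyperlinks() helper called in Listing 18.1: fetch a
// page, pull out its <a href> values, and return the absolute links found.
// ASSUMPTION: relative URLs are skipped; a real scraper would resolve them.
function scrapeHyperlinks($url, $html = null){
  if($html === null)
    $html = @file_get_contents($url);
  if($html === false)
    return array();  // unreachable page: nothing to scrape
  $DOM = new DOMDocument();
  @$DOM->loadHTML($html);  // @ suppresses warnings on messy real-world HTML
  $found = array();
  foreach($DOM->getElementsByTagName("a") as $link){
    $href = $link->getAttribute("href");
    if(preg_match('#^https?://#i', $href))
      $found[] = $href;
  }
  return array_unique($found);
}
```

In the crawler of Listing 18.1, the returned links would then be appended to $URLList (skipping URLs already present) rather than discarded.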
URL scrapers identify URLs inside a page by seeking out all the <a> tags and extracting the value of the href attribute. This can be done through string matching, seeking the <a> tag, or, more robustly, by parsing the HTML page into a DOM tree and using the built-in DOM search functionality of PHP, as shown in Listing 18.2. Needless to say, a real scraper would store the data somewhere like a database rather than simply echo it out.
$DOM = new DOMDocument();
$DOM->loadHTML($HTMLDOCUMENT);
$aTags = $DOM->getElementsByTagName("a");
foreach($aTags as $link){
  echo $link->getAttribute("href")." - ".$link->nodeValue."<br>";
}

Email scrapers are not inherently malicious, but usually the intent of harvesting email addresses is to send a broadcast message, commonly known as spam. To harvest email addresses, a scraper seeks the string mailto: in the href attribute of a link. A slight modification to the loop from Listing 18.2, shown in Listing 18.3, only prints the attribute if it is an email address.
Although early crawlers did not have the benefit of PHP's DOMDocument class, they applied a similar approach to extract content.
foreach($aTags as $link){
  $mailpos = strpos($link->getAttribute('href'), "mailto:");
  if($mailpos !== false){
    echo substr($link->getAttribute('href'), $mailpos+7)."<br>";
  }
}

Vulnerability scrapers scan a website for information about its underlying software. A site's OS and server versions, along with its list of JavaScript plugins and CMS versions, create a range of indexable data points that characterize the site (CentOS 5, Apache 2.2, WordPress 5.1, etc.). This signature can then be searched against known vulnerabilities, allowing malicious attackers to automatically determine which attack to use on your site!
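Much of this signature is visible in ordinary HTTP response headers. The short sketch below records two especially revealing ones; the function name and the choice to keep only these headers are assumptions, though Server and X-Powered-By are standard header names, and PHP's built-in get_headers() can retrieve them from a live site.

```php
// Sketch: collect fingerprint data points from a site's response headers.
function fingerprintHeaders($headers){
  $signature = array();
  foreach($headers as $h){
    // e.g. "Server: Apache/2.2.15 (CentOS)" or "X-Powered-By: PHP/5.4.16"
    if(stripos($h, "Server:") === 0 || stripos($h, "X-Powered-By:") === 0)
      $signature[] = trim($h);
  }
  return $signature;
}

// $headers = get_headers("http://example.com/");  // for a live request
$headers = array("HTTP/1.1 200 OK",
                 "Server: Apache/2.2.15 (CentOS)",
                 "X-Powered-By: PHP/5.4.16");
print_r(fingerprintHeaders($headers));
```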
The final thing a scraper may want to parse out is all of the text within a web page. These words will eventually be reverse indexed (covered below) so that the search engine knows they appear at this URL. Words are the most difficult content to parse, since the tags they appear in reflect how important they are to the page overall. Words in a large heading are surely more important than small words at the bottom of a page. Also, words that appear next to one another should somehow be linked, while words at opposite ends of a page or sentence are less related.
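One simple way to capture the idea that prominent words matter more is to assign each word a weight based on the tag it appears in. The sketch below gives heading words a higher weight than body words; the specific weights and the h1-h3 cutoff are illustrative assumptions, not a real engine's scheme.

```php
// Sketch: weight each word on a page by where it appears.
// ASSUMPTION: heading words (h1-h3) get weight 3, all other words weight 1.
function wordWeights($html){
  $DOM = new DOMDocument();
  @$DOM->loadHTML($html);  // @ suppresses warnings on messy real-world HTML
  $weights = array();
  // Every word on the page starts with the baseline weight
  foreach(str_word_count(strtolower($DOM->textContent), 1) as $w)
    $weights[$w] = 1;
  // Words inside headings are bumped to the higher weight
  foreach(array("h1", "h2", "h3") as $tag){
    foreach($DOM->getElementsByTagName($tag) as $heading)
      foreach(str_word_count(strtolower($heading->nodeValue), 1) as $w)
        $weights[$w] = 3;
  }
  return $weights;
}
```

A real indexer would also record each word's position on the page so that nearby words can be linked, as described above.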