18.4 PageRank and Result Order

PageRank is an algorithm, published by Google’s founders in 1998.4 This early discussion of search engines and the thinking behind them is essential reading for anyone interested in search engines. The PageRank algorithm is the basis for search engine ranking, although in practice it has been modified and changed in the decades since its publication. According to the authors, PageRank is

a method for computing a ranking for every web page based on the graph of the web.

The graph of the web being referred to looks at the hyperlinks between web pages, and how that creates a web of pages with links. Links into a site are termed backlinks, and those backlinks are key to determining which pages are more important. Sites with thousands of backlinks (from other domains) are surely more important than sites with only a handful of backlinks into them.

Note

The remainder of this section describes the mathematics of the PageRank algorithm. While most web developers do not need to master this math, it is helpful for understanding why search engine optimization works.

The simplified definition of a site n’s PageRank is:

PR(n)=vBnPR(v)Nv

In this formula the PageRank of a page, that is, PR(n), is determined by collecting every page v that links to n (v  Bn), and summing their PageRanks PR(v) divided by the number of links out (Nv). In order to apply this algorithm, we begin by assigning each page the same rank: 1/(number of pages). With these initial ranks in place, we can iteratively calculate the updated PageRank using the formula above.

To illustrate this concept look at the four web pages listed in Figure 18.4. Intuitively A is the most important since all other pages link to it, but to formalize this notion, let’s calculate the actual PageRank. To begin, assign the default rank to all pages:

Figure 18.4 Webpages A, B, C, and D and their links

The figure shows the Webpages "A", B, C, and D and their links.
Figure 18.4 Full Alternative Text
PR(A)=PR(B)=PR(C)=PR(D)=14

Beginning with Page A, we calculate the updated PageRank.

PR(A)=vBAPR(v)Nv

Since all three other pages link to A, we must substitute all three components in our sum.

PR(A)=PR(B)NB+PR(C)NC+PR(D)ND

We know the page ranks of B, C, D and can count the links out of each NB, NC, and ND.

PR(A)=1/42+1/43+1/42=13

Since B has A and C backlinking to it:

PR(B)=PR(A)NA+PR(C)NC=>14+1/43=>13

C has only D backlinking to it so:

PR(C)=PR(D)ND=>1/42=>18

Finally, D has B and C backlinks so:

PR(D)=PR(B)NB+PR(C)NC=>1/42+1/43=>524

Figure 18.5 shows the four pages with PageRanks after two iterations. See if you can arrive at the same values for iteration 2. Interestingly, Page B has a higher calculated rank than A, defying our initial guess.

Figure 18.5 Illustration of two iterations of PageRank

The figure illustrates two iterations of PageRank.
Figure 18.5 Full Alternative Text

In practice the links can change between iterations as well if the page was re-crawled so the formula must be dynamically interpreted every time. Interestingly, the updated ranks always sum together to make one. This is not the case if one of the pages was a rank sink, that is, a page with no links as shown in Figure 18.6 where Page A has no links to other pages. There you can see with each iteration the total PageRank decreases. A more sophisticated PageRank algorithm introduces a scalar factor to prevent rank sinks.5

Figure 18.6 Iterations of PageRank with a rank sink (A)

The figure illustrates Iterations of Page Rank with a rank sink ("A").
Figure 18.6 Full Alternative Text