One of the main reasons that Google is the most popular search engine on the Internet is its page
ranking system. Everybody now wants their site to rank highly on Google, and there are some great tools like
RankMeter for doing just that. The algorithm Google uses has become so famous that it is now known simply as
"PageRank". PageRank has been so widely hailed
that any search system without it seems to be deemed
immature, behind the times or just plain useless.
Brilliant as Google is, the funny thing about PageRank is that unless you are writing an Internet search
engine (come on, are you really going to be doing that?), it is probably the worst possible way to sort
search results. In fact you should never use the PageRank algorithm when returning results from a single site.
Note: This article first appeared at kuro5hin. The algorithms used have been slightly modified from that article.

Before Google became a private company, its founders, Sergey Brin and Larry
Page, were both working on doctorates on the topic of Internet Search Engines at
Stanford University. Luckily for us, the details of Google's engine were
published so anyone could see how it worked. In The
Anatomy of a Large-Scale Hypertextual Web Search Engine, the authors
detailed the now famous PageRank algorithm. It is very simple, and can be stated
as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Where PR(A) is the PageRank of Page A (the one we want to work out).
d is a damping factor. Nominally this is set to 0.85. PR(T1) is the
PageRank of a page pointing to Page A. C(T1) is the number of links going out of that
page. The sum PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) means we do that for each page pointing to Page A.
You apply the PageRank algorithm by first guessing a PageRank for
all the pages you have indexed and then recalculating it repeatedly until the values
converge. This process is described in detail in PageRank Uncovered by Chris
Ridings and Mike Shishigin. PageRank Uncovered is a very thorough and clearly
written examination of PageRank, what it is, what it is not and how to exploit
it. It makes great bedtime reading.
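To make the iteration concrete, here is a minimal Python sketch of the formula above. It is an illustration under stated assumptions, not Google's implementation: the function name `pagerank`, the starting guess of 1.0, the convergence tolerance and the three-page graph are all made up for the example, and every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) until it settles.

    `links` maps each page to the list of pages it links to.
    Assumption: every linked page is itself a key of `links`.
    """
    pages = list(links)
    # Initial guess (an assumption): every page starts at 1.0.
    pr = {page: 1.0 for page in pages}

    # Work out which pages point at each page.
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    for _ in range(max_iter):
        new_pr = {
            page: (1 - d) + d * sum(pr[t] / len(links[t]) for t in inbound[page])
            for page in pages
        }
        # Largest change in any page's value during this pass.
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:
            break
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C collects votes from both A and B, so it ends up ranked highest, with A second and B last.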
Although it's not obvious from the
algorithm, what PageRank in effect says is that pages "vote" for other pages on
the Internet. So if Page A links to Page B, it is saying B is an important page.
In addition, if lots of pages link to a page, then it has more votes and its
worth should be higher. These assumptions have been widely criticised, but, perhaps
because nobody has been able to come up with a better system that can be tested
on a live search engine, PageRank has evolved to become the de facto standard for
rating search results.
Most search systems are not written to index the
Internet. They are written to merely index a particular web site (for instance,
the search box at the bottom of this page). If you searched for a term that was
present on the home page and some other pages as well, PageRank would always
rank the home page as the first result. This is not a good thing. Let's look at
a practical example to see why.
Say you had a web site that was meant to
provide information on models and you called it www.bikini.com. Let's say that you knew there
was a model on the site called Lola Corwin and that she had a personal page. If
you were looking for Lola with a search system that employed PageRank and Lola
happened to be on the Home page as featured model of the month, the home page
would come up first and Lola's page second. I actually tried this on Google at
the time of writing this article and indeed, the search term:
Corwin site:www.bikini.com
brought the home page up first and her personal page
second. Clearly, you really wanted Lola's home page to be first, not the site's
home page. PageRank has failed you.
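The same effect shows up on a toy link graph. The sketch below applies the PageRank formula quoted earlier to a hypothetical four-page site in which every page links back to the home page, as most site navigation does. The page names, the link structure and the fixed count of 100 passes are assumptions made for illustration only.

```python
def pagerank(links, d=0.85, iterations=100):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) a fixed number of times."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            for page in links
        }
    return pr


# Hypothetical site: the home page features Lola, and every page
# links back to the home page through its navigation.
site = {
    "home": ["lola", "models", "about"],
    "lola": ["home"],
    "models": ["home", "lola"],
    "about": ["home"],
}
ranks = pagerank(site)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered[0])  # the page with the highest PageRank
```

With these assumed links the home page scores about 1.77 against about 0.93 for Lola's page, so any term that appears on both pages would list the home page first if PageRank decided the order.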
I am being a little unfair to Google
as PageRank is only one of the factors Google employs to rank pages. The others
include word position, font, capitalisation and search term appearance in title
tags. These 'others', however, are the only ones you should use when ranking
search results within a given site. PageRank is entirely meaningless because it
places such undue importance on the home page where detailed information is
almost never found. Unfortunately, a detailed description of how these other
factors should be calculated is not documented anywhere and it's difficult to
know how Google uses them.
Since Google became a private company, these
'other' factors and the PageRank algorithm itself have apparently undergone
modifications, but the company no longer
documents its technology. If you want to use them in your software, I guess
you will have to wait until a smart person somewhere creates a better system in
a publicly released Ph.D. thesis before they run off to become a billionaire.
These thoughts evolved as I was trying to write a search result ranking
system for a search engine I have developed called the Yider. It is a free
product that is designed for Windows servers. I had to have some kind of ranking
system, so I decided to use the word count and position as my page rank. It
works like this:
a) Assume a web page contains the following text
between the <body> tags:
"Ph.D.'s on search engines should be
banned because their final findings only become workable when a company is
established to produce a practical result from an incomplete research paper.
This denies everyone who funded them a chance to see the benefits of their
research. It's one of the reason I hate search engines in general. Haven't you
got anything better to do than search the Internet all day anyway? Why not
fiddle with real engines like the one in your car?"
This text consists of 467 characters.
b) Let's say we were searching for the phrase "search
engines". This phrase occurs at characters 12 and 304.
c) Rank phrase matches with a score of 1 but penalise them linearly depending on their distance
from the beginning of the text:
Phrase rank = 1 x (467 - 12) / 467 + 1 x (467 - 304) / 467
= 0.9743 + 0.3490 = 1.3233
d) We now need to take
account of partial phrase matches. I do this as follows. The two words in the
phrase can be found at the following locations:
search - 12, 304, 374
engines - 19, 311, 435
e) We will score these in the same way as the
phrase matches, but give each one a reduced weight of 0.5 divided by the number of words in the phrase "search engines",
i.e. a weight of 0.5 / 2 = 0.25 instead of 1.
Word rank = word rank "search" + word rank "engines"
= 0.25 x (467 - 12) / 467
+ 0.25 x (467 - 304) / 467
+ 0.25 x (467 - 374) / 467
+ 0.25 x (467 - 19) / 467
+ 0.25 x (467 - 311) / 467
+ 0.25 x (467 - 435) / 467
= 0.2436 + 0.0873 + 0.0498 + 0.2398 + 0.0835 + 0.0171 = 0.7211
f) The total page rank = phrase rank + word rank = 1.3233 + 0.7211 = 2.0444
Note that my page rank is not an absolute measure of a page's worth. It
is simply a measure of the relevance of this page relative to other pages in the
same site.
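Steps a) to f) can be sketched in a few lines of Python. This is an illustration of the arithmetic above, not the Yider's actual code: the function names are invented for the example, and whole-word, case-insensitive matching is an assumption (the positions listed in step d skip the "search" inside "research", which suggests whole words only).

```python
import re

# The sample text from step a), joined with single spaces.
TEXT = (
    "Ph.D.'s on search engines should be banned because their final "
    "findings only become workable when a company is established to "
    "produce a practical result from an incomplete research paper. "
    "This denies everyone who funded them a chance to see the benefits "
    "of their research. It's one of the reason I hate search engines in "
    "general. Haven't you got anything better to do than search the "
    "Internet all day anyway? Why not fiddle with real engines like the "
    "one in your car?"
)


def find_positions(text, term):
    """Return 1-based character positions of whole-word matches of `term`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.start() + 1 for m in re.finditer(pattern, text, re.IGNORECASE)]


def position_score(positions, length, weight):
    """Sum weight * (length - position) / length over every match."""
    return sum(weight * (length - pos) / length for pos in positions)


def rank_page(text, phrase):
    """Phrase rank (weight 1) plus word rank (weight 0.5 / number of words)."""
    length = len(text)
    words = phrase.split()
    phrase_rank = position_score(find_positions(text, phrase), length, 1.0)
    word_weight = 0.5 / len(words)
    word_rank = sum(
        position_score(find_positions(text, word), length, word_weight)
        for word in words
    )
    return phrase_rank + word_rank


print(find_positions(TEXT, "search"))   # [12, 304, 374]
print(find_positions(TEXT, "engines"))  # [19, 311, 435]
print(round(rank_page(TEXT, "search engines"), 4))  # 2.0444
```

Running it on the sample text reproduces the positions and the total from the worked example.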
Before presenting search results to users, I rank every page
that contains a full or partial phrase match using this algorithm. This is a
simple, common-sense approach that seems to provide good page ranking for me
and is certainly better than using Google's PageRank.