This page is under construction and is in a bit of a mess. You might find some useful bits, but I have not even ordered the topics yet.
Google PageRank the Myth the Legend
PageRank the Misnomer
PageRank is now such a widely used term that we sometimes assume pages on Google are the only pages that have an associated rank. This is one of the myths I have hinted at. Every search engine ranks its pages using some method or other. The rank a page achieves may not be obvious to the user, or permanent, but to produce sorted results some form of ranking must be used. It was Google that first decided to attach a PageRank to a page on a semi-permanent basis, and they coined the phrase. Yes, their method of ranking was different from the others at the time, but ranking pages was nothing new; only the method was different, and it was this, as we all know, that proved to be a very good idea.
Yahoo, Alta Vista, Gigablast and all the other search engines rank their pages; it's only the method that distinguishes them, not the name. So in my eyes naming the method PageRank was smart, but it's a bit of a misnomer.
What is Google PageRank
To cut a long story short, and to avoid mathematical equations that, for the most part, you don't really need in order to understand PageRank, the following is my definition of Google PageRank.
Google PageRank is one method Google uses to partially determine a sort order for a set of Uniform Resource Locators (URLs) produced from an indeterminable search request. It sounds terribly fancy but it isn't. A more digestible definition is as follows.
Google PageRank is one method Google uses to sort search results.
Something important to note about both definitions
The most important thing to note about both definitions is my use of the words 'one method'. The reason for this is quite simple: Google does not rely on PageRank alone. Lots of people are under the impression that PageRank is the only method Google employs to rank pages. I can assure you this is absolute nonsense, and Google would be a pretty crap search engine if it were true.
PageRank and Search Results
There is a distinct difference between Google PageRank and retrieved search results. What I mean by this is that Google might not necessarily use PageRank to determine which results it retrieves from its indexes; it might only use PageRank to sort them. This means that for a webpage to have a hope of getting into the top ten it has to be retrieved first. If it has not managed to get into the search results, then no amount of PageRank will get it into the top ten. So before you start trying to increase your PageRank, make sure you are getting into the search results for the keywords you want people to find you by.
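To make the retrieve-first, sort-second idea concrete, here is a minimal sketch. It is not how Google actually works; the page URLs, the texts and the `rank` numbers are all made up for illustration, and `search` is a hypothetical helper.

```python
# Hypothetical pages: some text plus a made-up rank score.
# These are NOT real PageRank values, just numbers to sort by.
PAGES = {
    "a.example/law": {"text": "law firm offering law advice", "rank": 2.0},
    "b.example/news": {"text": "daily news and weather", "rank": 9.0},
    "c.example/legal": {"text": "legal aid and law help", "rank": 5.0},
}


def search(query, pages):
    """Retrieve pages containing every query word, then sort them by rank."""
    words = query.lower().split()
    # Step 1: retrieval. A page without the query words never gets in,
    # however high its rank score is.
    retrieved = [
        url
        for url, page in pages.items()
        if all(word in page["text"].lower().split() for word in words)
    ]
    # Step 2: ordering. The rank score only decides the order of what
    # was already retrieved.
    return sorted(retrieved, key=lambda url: pages[url]["rank"], reverse=True)


results = search("law", PAGES)
print(results)
```

The page with the highest made-up rank score never shows up for the query 'law', because it does not contain the word and so is never retrieved.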
Precision and Recall
Precision and Recall are the two measures used to determine the effectiveness of an Information Retrieval System (IRS). Since search engines are IRSs, we can use Precision and Recall to measure their effectiveness. The following gives a better idea of what Precision and Recall are.
Set of all relevant documents in the system for a query == Relevant_Documents
Set of all documents returned by the IRS for a query == Retrieved_Documents
Set of all relevant documents retrieved by the IRS for a query == Relevant_Retrieved_Documents
This looks a bit confusing when you first read it but it isn't. Google contains several billion documents in its IRS. For each query there is a hypothetical number of Relevant_Documents. The aim of an ideal IRS is to locate and return all of these Relevant_Documents and no others. In the real world, however, it is almost impossible to achieve this. Google's IRS will return a set of Retrieved_Documents which may or may not contain all of the Relevant_Documents. It might also contain Irrelevant_Documents that have little or no value. What we do know is that within the set of Retrieved_Documents there will be a set of Relevant_Retrieved_Documents. Using the sizes of these three sets we can then work out the Precision and Recall of the system as follows.
Precision = number of Relevant_Retrieved_Documents / number of Retrieved_Documents
Recall = number of Relevant_Retrieved_Documents / number of Relevant_Documents
These two factors go hand in hand in understanding how search engines do their stuff. Neither one is more important than the other, although some people would argue for one over the other, because in practice they tend to be inversely related. This causes some problems: if Recall goes up, Precision usually goes down, and vice versa, so all search engines are trying to strike a balance between the two.
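The two formulas can be tried out with a small sketch. The document IDs and the two result sets below are invented for illustration, and `precision_recall` is a hypothetical helper, not part of any search engine.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document IDs."""
    relevant_retrieved = relevant & retrieved  # Relevant_Retrieved_Documents
    # Guard against dividing by zero when a set is empty.
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four documents are truly relevant to the query.
relevant_documents = {"d1", "d2", "d3", "d4"}

# A cautious system returns few documents, all of them relevant.
narrow = {"d1", "d2"}
# A greedy system returns more documents, including irrelevant ones.
broad = {"d1", "d2", "d3", "d5", "d6", "d7"}

narrow_scores = precision_recall(relevant_documents, narrow)
broad_scores = precision_recall(relevant_documents, broad)
print("narrow:", narrow_scores)  # (1.0, 0.5)
print("broad:", broad_scores)    # (0.5, 0.75)
```

In this toy case, casting a wider net lifts Recall from 0.5 to 0.75 but drops Precision from 1.0 to 0.5, which is the balancing act described above.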
Reverse Index
The most common method in use today to get results from an IRS is to use a reverse index (more often called an inverted index) of all the words found in each document. This means that unless you have the words the user is searching for in your page, you will not appear in the search engine's results, so we are back to picking the words that you would like your page to be found by.
Your first priority should be to determine which words you would like your site to be found on. It is keywords that people search on, and although it's tempting to try to cover as many words as possible, you will probably dilute the relevance of your pages.
Many people miss what I am getting at when I say that one of the most important factors in determining a page's rank (not its PageRank) is how many times a particular word appears in the document. This is still the most widely used method of determining which documents will be returned from a much larger set. It's very easy to visualise that if we find eight occurrences of the word 'law' in one document and two occurrences of the word 'law' in another, then the first of the two must have more to do with the law than the second. This is a reasonable assumption in a perfect world under perfect circumstances. Unfortunately we do not live in a perfect world.
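A minimal sketch of such an index, assuming simple whitespace splitting and lower-casing (real search engines do far more). The page names and texts are made up, and `build_index` and `lookup` are hypothetical helpers.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return the documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the documents for the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up pages for illustration.
documents = {
    "page1": "cheap flights to paris",
    "page2": "paris hotels and flights",
    "page3": "gardening tips",
}
index = build_index(documents)
print(lookup(index, "paris flights"))  # page1 and page2
print(lookup(index, "law"))            # empty: no page contains the word
```

A page that does not contain a query word simply has no entry under that word, so it can never come back for that query.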
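The word-counting idea can be sketched as follows. This is the naive, perfect-world version only; the two documents are invented, and `term_count` is a hypothetical helper rather than anything a real search engine exposes.

```python
import re
from collections import Counter


def term_count(text, term):
    """Count how many times a word appears in a text, ignoring case."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)[term.lower()]


# Made-up documents: one mentions 'law' eight times, the other twice.
docs = {
    "doc_a": "law " * 8 + "and order",
    "doc_b": "law and the law of averages",
}

# Order the documents by how often the search word occurs, highest first.
ranked = sorted(docs, key=lambda name: term_count(docs[name], "law"), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Under this simple assumption the document with eight occurrences comes out ahead of the one with two.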
NOT FINISHED
APPENDIX
Anatomy of a search engine
There is a great article, freely available to anyone with an internet connection, on what Sergey Brin and Lawrence Page did while studying at Stanford. This document should be required reading for anyone who is interested in search engine technology, particularly anything to do with Google, since it is the document that describes Google's inception. I have read the paper many times, and their ideas and their simplicity never cease to amaze me.