
Building A Vector Space Search Engine

Disclaimer

None of the code below comes with any guarantee or warranty as to its suitability for any purpose; use it at your own risk. This document is also under construction as of 31 Oct 2003. Please bear with me.

Why build a Search Engine

For me it's a challenge. I wanted to know whether I could build a search engine, and it is something that has always interested me. I had also just written two web spiders, so there was no reason not to try. It also gives me some experience in a field that I am very interested in. I need a job, so if you know of someone looking to hire, or are looking yourself, then look no further: you just found someone, hint, hint ;-)

Why's it taking so long

Because building a search engine is not easy. Anyone can throw together a rudimentary search engine. Have a look at the uklug website to see one that searches in excess of 10000 jobs. The uklug engine is constantly getting new jobs and removing old ones, but it's a rudimentary engine at best.

Where to start

I started by reading and inwardly digesting as much as I could about the various methods employed by search engines. The method that seemed the most promising was Latent Semantic Indexing (LSI). I quickly ran into a problem when looking for a C++ library that would allow me to carry out Singular Value Decomposition on large sparse matrices. I was unable to find a suitable one (I did find one, but it was not written to be used on a humble x86). This meant that implementing LSI was no longer possible, so I decided to try the simpler Vector Space model with no LSI thrown in. Here is a list of links that I have found fairly useful over the last few days.

  • http://instruct.uwo.ca/gplis/601/week3/tfidf.html
  • http://www2002.org/CDROM/refereed/643/node5.html
  • http://www.eng.mu.edu/corlissg/168Search.03F/ch1.html
  • http://www9.org/w9cdrom/159/159.html
  • http://www-db.stanford.edu/~backrub/google.html
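
To give a flavour of what the plain Vector Space model mentioned above involves, here is a minimal sketch. It is written in Python for brevity rather than in the Perl and C++ used for the engine, and it is not the engine's code. The function names (tokenize, weight, build_index, cosine, search), the whitespace tokenizer and the tf * log(N/df) weighting are all assumptions made for illustration. Each document and the query become a sparse vector of term weights, and documents are ranked by the cosine of the angle between their vector and the query vector.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real engine would
    # also strip punctuation and drop stop-list words.
    return text.lower().split()


def weight(tokens, idf):
    # Sparse tf-idf vector: term -> (term frequency * inverse document frequency).
    # Terms that never appeared in the indexed documents are ignored.
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_index(docs):
    # Returns one tf-idf vector per document plus the idf table.
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    n = len(docs)
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [weight(tokens, idf) for tokens in tokenized], idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors, idf):
    # Returns (score, document number) pairs, best match first,
    # leaving out documents that share no weighted term with the query.
    q = weight(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted([pair for pair in scored if pair[0] > 0.0], reverse=True)
```

Calling build_index on a list of page texts and then search("some query", vectors, idf) gives back the matching document numbers in ranked order. Note that with this weighting a term that occurs in every document gets a weight of zero, so it contributes nothing to the ranking.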

    Some Preliminaries

    This is not a tutorial, but if you have some rudimentary knowledge of Perl and C++ then you should be well equipped to use it as one. Even if you do not know how to program, there are topics in here that you may find of interest. The following links will take you to the various stages that I went through in building the engine. At the moment this document is not in any real order; I am writing things up as I encounter them and will fill in the blanks later.

    This first link will take you through my thought process as I designed and built a robot. It also explains some of the things that need to be considered in building a robot and why.

    Building a Spider / Robot

    This link will explain some simple methods used to reduce the size of the data to be searched.

    Indexing The Internet

    Here, I try to explain a relatively simple information retrieval method without using too much maths.

    Vector Space