Building A Vector Space Search Engine
Disclaimer
None of the code below comes with any guarantee or or warranty as to its suitablity for any purpose use it at your own risk. This document is also under construction as of the 12 Nov 2003.
What is a Vector Space
I recommend getting a pencil and some graph paper for this. I am afraid that I am not able to get any diagrams loaded at the moment but as soon as I have Mathcad working I will put them up. Please do not be offended if I am teaching you something you already know or am being too basic. It will get more complicated towards the end I promise.
If you cast your mind back to your time in maths class you may remember using graph paper. Graph paper is two dimensional ie. we have an X axis and a Y axis. If we move along the X axis by +3 and up the Y axis by +4 we can represent this as follows V[3,4], this is a vector representation of our movement in 2 dimensional space. There is more than one way to represent this movement though. If we join all the point up we then form a triangle that has an Adjacent "a = 3" an Opposite "o = 4" and an unknown Hypotenuse "h". We can work out the length of the hypotenuse using Pythagoras's theorm.
h = sqroot(sqr(3) + sqr(4)) = 5
We can also use some trigonometry to work out the angle between the Adjacent and the Hypotenuse as follows.
θ = atan(opp/adj) ≈ 43 degrees approximately
So we now know the length (magnitude) of "h" and its angle θ from the X axis, this is its direction. We can now represent the movement as follows <5,θ>, this is know as its polar form. We could now carry out the original movement of V[3,4] by turning to face θ and walking 5 units in that direction. If this sounds like an odd way to do things then do not worry it will start to become a little clearer as we go along. What we have been doing is moving around in a 2 dimensional space using Vectors as a sort of map. Going back to what I said earlier, we could also visualise them as pointers, ie. they provide both direction and distance so they lead us to a point on our graph. We have been working in 2 dimensions but this can easily be done in 3 dimensions as well but it is much harder to visualise. For instance, if we have V[3,4,6] this tells us to walk along the X axis +3 up the Y axis by +4 and along the Z axis by 6 (The Z axis can be thought of as depth). This is no different from what we did before except we have added an axis.
How can we use this vector space to represent our documents
Before describing how this is done we really need to discuss the fact that our 2 Dimensional space is not really a very good model. We are limited to only two axis. What is really a bit of a head spin is that we need to think of an N dimensional space where N can represent any positive integer. I know that this sounds a bit crazy but just because we cannot see something or imagine it does not mean it does not exist. If you are wondering what I mean by this then try drawing V[3,4,6,8] on a piece of paper. Up to the Z axis we are OK but where can the next axis go. We are limited to visualising 3 dimensions but we are not limited to only using three dimensions in mathematics. Now that we know we can have millions of dimensions we can then use each axis to represent a different word. Lets take the following sentence's as an example.
- Building a Vector Space Search Engine
- Parsing file and extracting the words for the index
- File storage for our search engine
We can use our stop list and remove some surplus words so we have.
- Building Vector Space Search Engine
- Parsing file extracting words index
- File storage search engine
We can now assign numbers to these words to uniquely indentify them and store the word occourance as an example [building,1] = [1,1]
[1,1], [2,1], [3,1], [4,2], [5,2], [6,1], [7,2], [8,1], [9,1], [10,1], [11,1]
We can see that [8,1] is the 8th word "extracting" and that it occours once in this document.
This document has produced an eleven colum vector as follows.
V[1,1,1,2,2,1,2,1,1,1,1]
This vector is a pointer to a place in an N dimensional space where N = 11. The word count is used to tell us how far along each axis we need to travel for that particualr word, we will refer to it as the "term weighting" for our simple model.
Million document problem
The fun really starts when we have several million documents and millions of terms to represent. Each term must have its own axis which means that the majority of every vector is going to be empty because most terms are not going to be found in every document. If our document set produced 100,000 unique terms then our vector is going to need 100,000 entries, this makes it very sparse and effectively full of zero;s. See below.
V[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,9,0,0,0,0,0,0...
As an example, in the vector above the term identified by the integer 27 has a word count of 9 in whatever document this vector belongs to.
If we imagine that there are 8 documents "m", 11 terms "n", then we have an "8x11" Matrix "M" where the doc_id represents a column and the word_id represents a row as follows
| 0 0 0 0 0 0 1 0 | --- This row is the word building
| 0 0 0 6 0 0 1 1 | --- This row is the word vector
| 0 2 0 0 5 0 1 0 | --- This row is the word space
| 0 0 0 1 0 0 2 1 | --- This row is the word search
| 4 0 0 0 5 0 2 0 | --- This row is the word engine
| 0 0 1 0 0 0 1 0 | --- .................... parsing
| 0 1 0 0 2 0 2 1 | --- .................... file
| 1 0 0 1 0 0 1 0 | --- .................... extracting
| 0 0 1 0 0 0 1 5 | --- .................... words
| 3 0 0 0 0 1 1 0 | --- .................... index
| 3 0 0 0 0 0 1 0 | --- .................... file
If you look carefully at the above matrix M you will be able to find our
previous Vector, V[1,1,1,2,2,1,2,1,1,1,1] is represented there. The above
structure is know as a Term Document Matrix (TDM not to be confused with
Time Divisional Mutiplexing ;-). For clarity I have also put in what word
each row represents. Our original vector would be in document 7 in this
example.
Storage Problem for our model
The first thing that we can see in the above model is that it is rather sparse, in fact, it is known as a sparse matrix. As a data structure this poses some problems. To find out how big our storage requirements are when storing the zeros we can do the following.
Storage = S = (Column Count)x(Row Count)x(Integer Type Size in Bytes)
Storage = S = (Number Of Documents)x(Number Of Terms)x( 4 )
For 3 billion documents of 1 million terms, remember google's has over 10 million terms (as far as I know).
Storage = S = (3Billion)x(1Million)x(4) = 12,000,000,000,000,000 bytes
This is 12 Petabytes of storage. This is a bit big, to say the least. The trick is not to store the zeros. If we do this and make the assumption that each page has an average of 1000 words then we can reduce the size to 12,000,000,000,000 Terabytes which is a significant saving.
Storing a Large Sparse Matrix
I was unable to find a library that allowed me to work with large sparse matrices on my machine so I wrote the following bit of code to do it myself. This was written in C++
typedef map<int,int> TVector;
typedef TVector::iterator TVectorMapIter;
class RowVector {
private:
TVector term_vector;
TVectorMapIter term_vector_iterator;
public:
RowVector();
~RowVector();
void insert_term(int key , int value );
int get_value(int key);
int get_size();
TVectorMapIter get_vector_begin();
TVectorMapIter get_vector_end();
TVector get_term_vector();
};
The way I represented a large sparse matrix is as follows.
// First I create a map where the key values are doc_id's and the
// values are the object above which represent all the keywords in
// each document.
typedef map<int, RowVector*> DocumentMap;
This map gets filled with all the document ID's and the associated RowVector object that belong to it. The RowVector object only stores the terms that have been found in that document. The RowVector object is simply another "STL map object" which I have provided a simple wrapper to. When I have built up the DocumentMap collection I can then access each Documents assosicated vector using the doc_id. From there I can access the value associated to each term. I do this by passing the word_id as a key value to the "get_value" method. If the word_id exists in the map then return the value, otherwise it will return 0, and thats the trick, the 0 has not been stored but it will appear to be so when we ask for a term that is not in that document. This effectively represents a sparse matrix. It is probably not the best way to do it but it seems to work for the small document set that I am testing with.