
Road Map for an Open Source Search Engine

Here are some details of what I have done and what I am currently working on. I have several other projects that take up a substantial amount of my time, and I am also doing a maths degree, so this project does not get as much work as I would like to put into it, but it's getting there. I am always looking for help, so if you want to get involved, let me know.

I have just started building the lexicon. This is a simple parser written in Perl that stores the data in a PostgreSQL database. I have been quite strict with the lexicon, so I am not expecting it to become too huge; I don't have the processing power or the room to cope with something massive. Unfortunately I need more SCSI disks because the IO involved is really slowing the parser down: Perl is finding words quicker than I can store them. I suppose I should look at Berkeley DB or some other storage method.
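One common way to ease this kind of IO bottleneck is to count words in memory and write them to the database in batches, one transaction per batch rather than one per word. The sketch below is only an illustration of that idea, not the parser described above: it is written in Python rather than Perl, it uses the standard-library sqlite3 module as a stand-in for PostgreSQL, and the table name `lexicon`, the 3 to 20 letter "strict" filter, and the function names are all assumptions made for the example.

```python
import re
import sqlite3
from collections import Counter

# Assumption: a "strict" lexicon keeps only ASCII words of 3 to 20 letters.
WORD_RE = re.compile(r"[a-z]+")
MIN_LEN, MAX_LEN = 3, 20


def extract_words(text):
    """Yield lowercased alphabetic words that pass the length filter."""
    for word in WORD_RE.findall(text.lower()):
        if MIN_LEN <= len(word) <= MAX_LEN:
            yield word


def flush(conn, pending):
    """Write all pending counts in one transaction, then clear them."""
    items = list(pending.items())
    with conn:  # commits once per batch instead of once per word
        conn.executemany(
            "INSERT OR IGNORE INTO lexicon (word, freq) VALUES (?, 0)",
            [(word,) for word, _ in items],
        )
        conn.executemany(
            "UPDATE lexicon SET freq = freq + ? WHERE word = ?",
            [(count, word) for word, count in items],
        )
    pending.clear()


def build_lexicon(pages, conn, batch_size=1000):
    """Count words page by page and flush to the database in batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon "
        "(word TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    pending = Counter()
    for text in pages:
        pending.update(extract_words(text))
        # Flush once enough distinct words have accumulated in memory.
        if len(pending) >= batch_size:
            flush(conn, pending)
    flush(conn, pending)  # write whatever is left over


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_lexicon(["The cat sat on the mat", "A polite spider"], conn)
    for word, freq in conn.execute("SELECT word, freq FROM lexicon ORDER BY word"):
        print(word, freq)
```

A larger `batch_size` trades memory for fewer disk writes, which is the direction to push when the parser is outrunning the storage.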

Task | State | Skills required
Write Polite Spiders | Done | Perl, Postgres and HTTP Protocol
Collect 1 million Test Pages | 600,000 Collected | Postgres, Perl, Linux
Build Lexicon | Current Work: 90,000 entries found | C/C++
Build Reverse Index for 1 million pages | Current Work | C/C++
Write C++ for handling the reverse indexes | TODO | C/C++, Linux
Research Ranking Algorithm | TODO | C/C++, Maths, Comp Sci
Build Front end to the search engine | TODO | Web design, HTML
Buy or scrounge an online test machine for the search engine | TODO | Sales
Get it hosted | TODO | Sales, Marketing, Money Money Money
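
For readers unfamiliar with the "reverse index" tasks in the list above, the idea is to map each word to the pages that contain it, so a query can be answered without scanning every page. The following is a minimal in-memory sketch in Python, not the planned C/C++ implementation; the `{page_id: text}` input layout, the whitespace tokeniser, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_reverse_index(pages):
    """Map each word to the sorted list of page IDs that contain it.

    `pages` is a dict of {page_id: text}; this layout is an assumption.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the IDs of pages that contain every word in the query."""
    postings = [set(index.get(word, ())) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    pages = {
        1: "open source search engine",
        2: "search engine spider",
        3: "polite spider",
    }
    index = build_reverse_index(pages)
    print(search(index, "search engine"))
    print(search(index, "spider engine"))
```

An index for a million pages would not fit this simple dictionary approach comfortably, which is presumably part of why on-disk structures and C/C++ appear in the task list.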