Distributed Search Engine
I am surprised that no one is using Lucene to do something like this.
Like everything on this website, this document is a work in progress, but I thought I should jot down some notes about how I envisage this search engine growing. Of course the first thing I need to do is get the search engine software written and a prototype up, but this seems to be gathering some momentum, so I have started thinking about how it will move forward. The following notes are here only to provoke thought in the reader. The diagrams, although pretty, are not factually correct, i.e. we may have 20 harvester machines to each manager, etc. Also note that the job roles are not complete. Text in bold marks areas where I envisage problems that need to be addressed.
Definitions for some key terms
- Searcher: Customer-facing user interface
- Indexer: Creates and manages reverse (inverted) indexes
- Store: Large repository of pages and indexes
- Harvester: Runs spiders on a designated subset of the internet
- Manager: Manages harvesters and allocates URLs to each harvester
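As a rough illustration of the Manager role, here is a minimal sketch of how URLs might be allocated to harvesters. Nothing here is decided: the function names, the harvester IDs and the idea of hashing the host name so that each site stays with one harvester are all my assumptions for the sake of the example.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_harvester(url, harvesters):
    """Pick a harvester for a URL by hashing its host name.

    Hashing the host (an assumption, not a settled design) keeps every
    page of one site with the same harvester.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(harvesters)
    return harvesters[index]


def allocate(urls, harvesters):
    """Group a batch of URLs by the harvester that should spider them."""
    batches = defaultdict(list)
    for url in urls:
        batches[assign_harvester(url, harvesters)].append(url)
    return dict(batches)


if __name__ == "__main__":
    # Hypothetical harvester IDs and URLs, for illustration only.
    harvesters = ["harvester-1", "harvester-2", "harvester-3"]
    urls = [
        "http://example.org/a",
        "http://example.org/b",
        "http://example.com/",
    ]
    for harvester, batch in allocate(urls, harvesters).items():
        print(harvester, batch)
```

A plain modulo like this moves most hosts to a different harvester whenever a volunteer joins or leaves, so a real manager would probably want something like consistent hashing instead.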
Stage 1 (Embryonic)

At this stage the main objective of the system is to recruit willing volunteers to help harvest pages from the internet. You will note that the top machine is multitasking as Manager, Indexer, Searcher and everything else. The only thing this machine is not doing is spidering.
Stage 2

Eventually there will be more harvester machines than a single server can cope with. This is where we introduce the concept of promotion. If there are willing volunteers with the computing resources available, then we promote them to take over some of the tasks previously carried out by the single server. This reduces the load and also enables us to recruit more harvesters. The main function of these new recruits is to manage the harvesters and check for poisoning of the search results by bad harvesters.
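One way a manager might check for poisoning is to hand the same URL to several harvesters and compare what comes back. The sketch below is only that, a sketch: the majority-vote rule, the function name and the harvester IDs are my assumptions, and it would wrongly flag pages that legitimately differ between fetches (adverts, timestamps and so on).

```python
import hashlib
from collections import Counter


def cross_check(reports):
    """Compare copies of one page fetched by several harvesters.

    reports maps a harvester ID to the page bytes it returned.
    Returns (accepted_digest, suspects). If no copy has a strict
    majority, nothing is accepted and every reporter is a suspect.
    """
    if not reports:
        return None, set()

    digests = {
        harvester: hashlib.sha256(body).hexdigest()
        for harvester, body in reports.items()
    }
    digest, votes = Counter(digests.values()).most_common(1)[0]

    # Require more than half of the harvesters to agree.
    if votes * 2 <= len(reports):
        return None, set(reports)

    suspects = {h for h, d in digests.items() if d != digest}
    return digest, suspects


if __name__ == "__main__":
    # Hypothetical reports for a single URL.
    reports = {
        "harvester-1": b"<html>real page</html>",
        "harvester-2": b"<html>real page</html>",
        "harvester-3": b"<html>buy cheap pills</html>",
    }
    accepted, suspects = cross_check(reports)
    print(accepted, suspects)
```

A harvester that keeps turning up in the suspects set could then be demoted or dropped, at the cost of spidering every checked page more than once.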
Stage 3

I imagine there will be many iterations of the process in stage 2 before getting this far. This is where there are more managers than can be conveniently handled by the top level machines, and another level of abstraction is required to deal with the load. The concept of a cell is also introduced here and appears as a cloud in the diagram.
This is a long way to look ahead, because who is to say whether each cell should be in charge of its own indexes, with the larger store machines only handling the merging of the collected indexes, or whether each cell should send full documents to the store machines and let them handle the indexing?
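To make the first of those two options a little more concrete, here is a toy sketch in which each cell builds its own reverse index and a store machine only merges them. The data layout (a dictionary from term to a sorted list of document IDs) and the function names are assumptions for illustration; a real index, such as the ones Lucene builds, holds far more than this.

```python
def build_index(documents):
    """Build a tiny reverse index for one cell.

    documents maps a document ID to its text. The result maps each
    lower-cased word to a sorted list of the IDs that contain it.
    """
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(cell_indexes):
    """Merge the indexes sent in by several cells into one."""
    merged = {}
    for index in cell_indexes:
        for term, doc_ids in index.items():
            merged.setdefault(term, set()).update(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}


if __name__ == "__main__":
    # Two hypothetical cells, each holding one document.
    cell_a = build_index({"a1": "distributed search"})
    cell_b = build_index({"b1": "search engine"})
    print(merge_indexes([cell_a, cell_b]))
```

Under the second option the cells would skip build_index altogether and ship the raw documents, which moves the indexing work and most of the bandwidth cost onto the store machines.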
Conclusion
I hope the little description above has provoked at least a few thoughts as to how a distributed search engine could grow. I doubt it would all run half as smoothly as the above suggests, and I imagine the real thing would need many more diagrams to get from each stage to the next, but if it has sparked a little interest in you about the concept of distributed search then I have achieved my aim.
Some Make or Break Points
Support: Gathering support will be a major problem but it can be done; the Grub project is testament to this.
Recruitment: Developers of the calibre required for a project of this size are in short supply, and being an open source project with no wages at the end of the month is not going to do it any favours.
Financial: It's a lovely thought that this could all happen and that hardware and bandwidth would be donated, but unfortunately this is not utopia, and if it were we would not need search engines. Gathering financial backing and support will be a major hurdle for this project.
Trust: How do we manage the harvesters and managers? Can we trust them? Of course there will always be criminals trying to poison the results, so how do we minimise the damage?
Opposition: Would there be any opposition to this sort of project from competitors?
Buy-in from Webmasters: A possibility in this model is to allow webmasters of large trusted sites to spider their own material. This allows the data to be kept as fresh as possible, which for sites like the BBC and CNN is very important. This of course would come with its own set of management headaches.