After a very long time, I am writing an entry. I have started attending webinars on search technologies and plan to blog what I learn. These posts will probably read like notes on the webinars.
To get a high-level overview of search architecture, read "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Brin and Page, which they published during their time at Stanford.
Search involves 3 major processes.
1. The Spider / Crawler that crawls all the web pages.
The crawlers we write should follow the established conventions: they should identify themselves and where they come from (typically through the User-Agent string), and they should obey the robots.txt file, which indicates which parts of the site may be crawled and by which crawlers. We should also make sure our crawlers do not bring sites down by sending too many requests too quickly.
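To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name, the info URL and the sample robots.txt content are all made up for illustration; a real crawler would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity: use your own bot name and contact page.
USER_AGENT = "NotesBot/0.1 (+http://example.com/bot-info)"

# Invented robots.txt content, kept in memory so the sketch needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser, url):
    """True if the rules allow our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(parser, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


parser = build_parser(SAMPLE_ROBOTS_TXT)
for url in ("http://example.com/index.html", "http://example.com/private/data.html"):
    print(url, "allowed" if may_crawl(parser, url) else "blocked")
print("wait between requests:", polite_delay(parser), "seconds")
```

In a real crawler the same `USER_AGENT` string would also be sent as the User-Agent header on every request, and the crawler would sleep for `polite_delay(...)` seconds between requests to the same site.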
2. The Indexing process
Inverted indexes are used to store the crawled data. The indexing process generates a list of keywords from the content of each page, along with details such as the proximity of the keywords and the location of the words, and stores all of this in the index under a uniquely generated document ID.
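As a rough illustration of the idea (not how a production engine stores its index), an inverted index can be sketched as a dictionary from each word to the documents that contain it, with the word positions kept so that location and proximity can be worked out later. The function names, the tokenizer and the sample pages below are my own assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build a positional inverted index.

    pages: dict mapping URL -> page text.
    Returns (doc_ids, index) where doc_ids maps a generated document ID
    to its URL and index maps word -> {doc_id: [positions in that page]}.
    """
    doc_ids = {}
    index = defaultdict(dict)
    for doc_id, (url, text) in enumerate(pages.items(), start=1):
        doc_ids[doc_id] = url
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return doc_ids, dict(index)


# Invented sample pages, for illustration only.
pages = {
    "http://example.com/a": "Search engines crawl the web",
    "http://example.com/b": "The web is large; search is hard",
}
doc_ids, index = build_index(pages)
print(index["search"])
print(index["web"])
```

The positions are what make proximity checks possible: two query words are close together in a document when their position lists for that document contain nearby numbers.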
3. The Lookup or Retrieval Process with Ranking.
The search servers provide the UI where we enter the terms we want to search for. Once the terms are submitted, the search servers look up the inverted index to obtain the documents that have the search terms as keywords. Once a server has that list, it ranks the documents based on the frequency with which the search terms appear, the location of the terms, the anchor text pointing to the document, the importance of the sites that point to that particular document, and so on. The documents are then displayed in the order given by the ranking process. One famous ranking algorithm is Google’s PageRank.
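Here is a small sketch of the lookup and ranking steps. It covers only two of the signals mentioned above: `search` ranks matching documents by how often the terms appear, and `pagerank` is a simplified, normalized power-iteration version of the PageRank idea (a page is important if important pages link to it). It is not Google's actual implementation; the sample index, the link graph, the function names and the iteration count are assumptions for illustration, and 0.85 is just the commonly quoted damping factor.

```python
def search(index, query):
    """Return IDs of documents containing every query term,
    ranked by total term frequency (highest first)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    scores = {doc: sum(len(p[doc]) for p in postings) for doc in matching}
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. links maps page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Invented index (word -> {doc_id: [positions]}) and link graph.
sample_index = {"web": {1: [4], 2: [1, 5]}, "search": {1: [0], 2: [4]}}
print(search(sample_index, "web search"))

sample_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(sample_links).items()):
    print(page, round(score, 3))
```

A real engine would combine many such signals (term location, anchor text, link-based importance) into one score rather than using any one of them alone.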
Some interesting links to learn about search engines:
Happy Learning !!