Search Engine Application – This project deals with the design and implementation of a content-based search engine. Content-based means that the system uses the information available in the web documents in a holistic manner to determine what might be interesting to the user. We focus on textual content written in a natural language, as opposed to, say, images included in the documents. We call the presented system a search engine because it contains components to retrieve and index web documents, and it provides a mechanism to return a ranked subset of those documents according to the user's requests.
The system should be able to process millions of documents in a reasonable time and respond to queries with a low average latency. The starting point is a Web Crawler (or spider) that retrieves Web pages: it traverses the entire Web, or a certain subset of it, downloads the pages or files it encounters, and saves them for other components to use. The actual traversal algorithm varies depending on the implementation; depth-first, breadth-first, and random traversal are all used to meet different design goals. The parser takes the downloaded raw results, analyzes them, and tries to make sense of them. In the case of a text search engine, this is done by extracting keywords and checking their locations and/or frequencies. Hidden HTML meta tags, such as KEYWORDS and DESCRIPTION, are also considered. Usually a scoring system is involved that assigns a final score to each keyword on each page.
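To make the crawler and parser more concrete, the following is a minimal sketch, assuming a breadth-first traversal, only Python's standard library, and a naive regular-expression link extractor; the function names, page limit, and timeout are illustrative choices for this sketch rather than details of the implemented system.

    import collections
    import re
    import urllib.parse
    import urllib.request

    def crawl_breadth_first(seed_url, max_pages=100):
        """Breadth-first crawl from seed_url; returns a {url: html} map."""
        queue = collections.deque([seed_url])
        seen = {seed_url}
        pages = {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip unreachable or malformed pages
            pages[url] = html
            # Enqueue outgoing links that have not been seen yet.
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urllib.parse.urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return pages

    def keyword_frequencies(html):
        """Strip markup and count how often each keyword occurs on a page."""
        text = re.sub(r"<[^>]+>", " ", html).lower()
        return collections.Counter(re.findall(r"[a-z]+", text))

A real parser would additionally weight keywords by where they appear on the page (title, headings, META tags) before the per-page scores are stored.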
Simple or complicated, a search engine must have a way to determine which pages are more important than others and present them to users in a particular order. This is called the Ranking System. The most famous example is the PageRank algorithm published by the Google founders [Brin 1998].
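As a rough illustration of the idea behind PageRank, the sketch below runs the usual power iteration over a link graph given as a dictionary of outgoing links; the damping factor of 0.85 is the value commonly associated with the original paper, while the dangling-page handling is just one common convention.

    def pagerank(links, damping=0.85, iterations=50):
        """Power-iteration PageRank; links maps each page to its outgoing pages."""
        pages = list(links)
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in pages}
            for page, outgoing in links.items():
                if not outgoing:
                    # Dangling page: spread its rank evenly over all pages.
                    for other in pages:
                        new_rank[other] += damping * rank[page] / n
                else:
                    share = damping * rank[page] / len(outgoing)
                    for target in outgoing:
                        if target in new_rank:
                            new_rank[target] += share
            rank = new_rank
        return rank

For example, pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}) gives pages "a" and "b" noticeably higher scores than "c", reflecting how much rank flows into each of them.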
A reliable repository system is critical for any application. A search engine also requires everything to be stored in the most efficient way to ensure maximum performance. The choice of database vendor and the schema design can make a big difference in performance for metadata such as URL descriptions, crawling dates, and keywords. A more challenging part is the huge volume of downloaded files that must be saved before they are picked up by other modules.
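As one possible illustration of the metadata side, the snippet below creates a small SQLite schema for page metadata and per-keyword scores; the table and column names are assumptions made for this sketch, and the bulk of the downloaded raw files would typically live on disk rather than inside these tables.

    import sqlite3

    def create_repository(path="repository.db"):
        """Create illustrative tables for page metadata and keyword scores."""
        connection = sqlite3.connect(path)
        connection.executescript("""
            CREATE TABLE IF NOT EXISTS page (
                page_id     INTEGER PRIMARY KEY,
                url         TEXT UNIQUE NOT NULL,
                description TEXT,
                crawled_at  TEXT               -- ISO-8601 crawling date
            );
            CREATE TABLE IF NOT EXISTS keyword (
                page_id INTEGER REFERENCES page(page_id),
                term    TEXT NOT NULL,
                score   REAL NOT NULL,         -- per-keyword score from the parser
                PRIMARY KEY (page_id, term)
            );
            CREATE INDEX IF NOT EXISTS idx_keyword_term ON keyword(term);
        """)
        connection.commit()
        return connection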
Finally, a front-end interface for users: this is the face and presentation of the search engine. When a user submits a query, usually in the form of a list of textual terms, an internal scoring function is applied to each Web page in the repository [Pandey 2005], and the list of results is presented, usually in order of relevance and importance.
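Such a scoring function might, for instance, combine the parser's per-keyword scores with the query-independent importance computed by the Ranking System; the sketch below assumes those two structures are already available as plain dictionaries and is only meant to show the shape of the computation, not the scoring function of any particular engine.

    def score_query(query_terms, keyword_scores, page_rank, top_k=10):
        """Rank pages for a query.

        keyword_scores: {term: {page_id: score}} built by the parser
        page_rank:      {page_id: importance} from the Ranking System
        """
        totals = {}
        for term in query_terms:
            for page_id, score in keyword_scores.get(term, {}).items():
                totals[page_id] = totals.get(page_id, 0.0) + score
        # Weight textual relevance by query-independent importance.
        ranked = sorted(
            totals.items(),
            key=lambda item: item[1] * page_rank.get(item[0], 0.0),
            reverse=True,
        )
        return ranked[:top_k]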
Google has been known for its simple and straightforward interface, while some more recent competitors, such as Ask.com, provide a much richer user experience by adding features like previews or hierarchical display. This project focuses on how to build a search engine system that can gather information from all corners of the Web and index and rank it, while maintaining a simple yet rich user interface for users to query information.