Title: Enrichment in Performance of Focused Web Crawlers
Authors: Routhu, Ravikiran
Supervisor: Kumar, Ravinder
Keywords: Crawlers;Search Engines;WWW
Issue Date: 16-Sep-2010
Abstract: The World Wide Web (WWW) is an interlinked collection of billions of documents formatted using HTML. Since its inception in 1990, the WWW has grown exponentially in size. As of today, it is estimated to contain approximately 50 billion publicly accessible, indexable web documents distributed all over the world on thousands of web servers. Searching for information in such a huge collection is very difficult, because web pages are not organized like books on shelves in a library, nor are they completely catalogued at one central location. Even a user who knows where to look, by knowing the URL, is not guaranteed to retrieve the information, because the web is constantly changing. A search engine is a tool that addresses these problems by finding specific information on the WWW. The Internet would not have become so popular had search engines not been developed, and it would be almost impossible to locate anything on the web without knowing a specific URL. Most search engines save a copy of web pages in a central repository and then build appropriate indexes over them for later search and retrieval of information. Because the storage of these databases/repositories is limited, a search engine cannot accommodate each and every page available on the WWW. The databases of search engines are therefore maintained with the help of software so that they store the most relevant pages from the WWW. The software that traverses the web and downloads web pages is called a “crawler”. Web crawlers are also known as “spiders”, “robots”, “ants”, “automatic indexers”, etc. This thesis discusses crawler basics, the commonly used web crawling techniques, and the pseudocode of various basic crawling algorithms and their implementations in the C language, along with simplified flowcharts.
Description: M.E. (CSED)
Appears in Collections: Masters Theses@CSED

