Enrichment in Performance of Focused Web Crawlers
Abstract
The World Wide Web (WWW) is an interlinked collection of billions of documents
formatted using HTML. Since its inception in 1990, the WWW has grown exponentially in size.
Today it is estimated to contain approximately 50 billion publicly accessible/indexable web
documents distributed across thousands of web servers around the world. Searching for
information in such a huge collection is very difficult: web pages are not organized like
books on shelves in a library, nor are they completely catalogued at one central location.
Even a user who knows where to look, i.e. who knows the relevant URLs, is not guaranteed to
retrieve the information, because the web is constantly changing. The search engine is a tool
that solves these problems by finding specific information on the WWW.
The Internet would not have become so popular if search engines had not been developed;
without them it would be almost impossible to locate anything on the web unless a specific
URL address were already known. Most search engines save a copy of web pages in a central
repository and then build appropriate indexes over them for later search and retrieval of
information. Because the storage of these databases/repositories is limited, a search engine
cannot accommodate each and every page available on the WWW. The databases of search engines
are therefore maintained with the help of software that stores the most relevant pages from
the WWW. The software that traverses the web and downloads web pages is called a "crawler".
Web crawlers are also known as "spiders", "robots", "ants", "automatic indexers", etc.
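To make the idea concrete, the following is a minimal sketch in C of the breadth-first
frontier at the heart of such a crawler. The seed URL, the example outlinks, and the
fetch_and_extract() stub are placeholders introduced here for illustration only; a real
crawler would issue HTTP requests (e.g. via libcurl) and parse the downloaded HTML to
discover links.

/* Minimal sketch of a breadth-first crawl frontier (illustrative only). */
#include <stdio.h>
#include <string.h>

#define MAX_URLS 100   /* frontier capacity for this sketch */
#define MAX_LEN  256   /* maximum URL length handled        */

static char frontier[MAX_URLS][MAX_LEN]; /* FIFO queue of URLs to visit */
static int  head = 0, tail = 0;          /* queue indices               */
static char seen[MAX_URLS][MAX_LEN];     /* URLs already enqueued       */
static int  seen_count = 0;

/* Return 1 if the URL was enqueued before, to avoid re-crawling it. */
static int already_seen(const char *url)
{
    for (int i = 0; i < seen_count; i++)
        if (strcmp(seen[i], url) == 0)
            return 1;
    return 0;
}

/* Add a URL to the frontier unless it is a duplicate or the queue is full. */
static void enqueue(const char *url)
{
    if (tail >= MAX_URLS || seen_count >= MAX_URLS || already_seen(url))
        return;
    strncpy(frontier[tail++], url, MAX_LEN - 1);
    strncpy(seen[seen_count++], url, MAX_LEN - 1);
}

/* Placeholder for downloading a page and extracting its outlinks;
 * the URLs enqueued here are hypothetical examples. */
static void fetch_and_extract(const char *url)
{
    printf("crawling: %s\n", url);
    enqueue("http://example.com/a.html");
    enqueue("http://example.com/b.html");
}

int main(void)
{
    enqueue("http://example.com/");      /* assumed seed URL   */
    while (head < tail)                  /* breadth-first order */
        fetch_and_extract(frontier[head++]);
    return 0;
}

In this sketch the frontier is a simple in-memory queue; a production crawler would persist
the frontier, respect robots.txt, and prioritize URLs, which is where focused-crawling
strategies come in.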
In this thesis, crawler basics, the commonly used web crawling techniques, and the pseudo code
of various basic crawling algorithms are discussed, along with their implementations in the
C language and simplified flowcharts.
Description
M.E. (CSED)
