Enrichment in Performance of Focused Web Crawlers
Abstract
The World Wide Web (WWW) is an interlinked collection of billions of documents
formatted using HTML. Since its inception in 1990, the WWW has grown exponentially in size.
Today it is estimated to contain approximately 50 billion publicly accessible/indexable web
documents distributed across thousands of web servers around the world. Searching for
information in such a huge collection is very difficult: web pages are not organized like
books on shelves in a library, nor are they completely catalogued at one central location.
Even a user who knows where to look, i.e. who knows the relevant URLs, is not guaranteed to
retrieve the information, because the web is constantly changing. The search engine is a tool
that solves these problems by finding specific information on the WWW.
The Internet would not have become so popular if search engines had not been developed;
without them it would be almost impossible to locate anything on the web unless a specific
URL address were already known. Most search engines save a copy of web pages in a central
repository and then build appropriate indexes over them for later search and retrieval of
information. Because the storage of these databases/repositories is limited, a search engine
cannot accommodate each and every page available on the WWW. The databases of search engines
are therefore maintained with the help of software that stores the most relevant pages from
the WWW. The software that traverses the web and downloads web pages is called a "crawler".
Web crawlers are also known as "spiders", "robots", "ants", "automatic indexers", etc.
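To make the idea concrete, the following is a minimal sketch in C of the breadth-first
frontier at the heart of such a crawler. The seed URL, the example outlinks, and the
fetch_and_extract() stub are placeholders introduced here for illustration only; a real
crawler would issue HTTP requests (e.g. via libcurl) and parse the downloaded HTML to
discover links.

/* Minimal sketch of a breadth-first crawl frontier (illustrative only). */
#include <stdio.h>
#include <string.h>

#define MAX_URLS 100   /* frontier capacity for this sketch */
#define MAX_LEN  256   /* maximum URL length handled        */

static char frontier[MAX_URLS][MAX_LEN]; /* FIFO queue of URLs to visit */
static int  head = 0, tail = 0;          /* queue indices               */
static char seen[MAX_URLS][MAX_LEN];     /* URLs already enqueued       */
static int  seen_count = 0;

/* Return 1 if the URL was enqueued before, to avoid re-crawling it. */
static int already_seen(const char *url)
{
    for (int i = 0; i < seen_count; i++)
        if (strcmp(seen[i], url) == 0)
            return 1;
    return 0;
}

/* Add a URL to the frontier unless it is a duplicate or the queue is full. */
static void enqueue(const char *url)
{
    if (tail >= MAX_URLS || seen_count >= MAX_URLS || already_seen(url))
        return;
    strncpy(frontier[tail++], url, MAX_LEN - 1);
    strncpy(seen[seen_count++], url, MAX_LEN - 1);
}

/* Placeholder for downloading a page and extracting its outlinks;
 * the URLs enqueued here are hypothetical examples. */
static void fetch_and_extract(const char *url)
{
    printf("crawling: %s\n", url);
    enqueue("http://example.com/a.html");
    enqueue("http://example.com/b.html");
}

int main(void)
{
    enqueue("http://example.com/");      /* assumed seed URL   */
    while (head < tail)                  /* breadth-first order */
        fetch_and_extract(frontier[head++]);
    return 0;
}

In this sketch the frontier is a simple in-memory queue; a production crawler would persist
the frontier, respect robots.txt, and prioritize URLs, which is where focused-crawling
strategies come in.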
In this thesis, crawler basics, the commonly used web crawling techniques, and the pseudo code
of various basic crawling algorithms are discussed, along with their implementations in the
C language and simplified flowcharts.
Description
M.E. (CSED)
