Web Page Content Block Partitioning for Focussed Crawling

Abstract

The World Wide Web (WWW) is a collection of billions of documents formatted using HTML. Web search engines are used to find the desired information on the World Wide Web. When a user submits a query, the search is performed against the search engine's database of stored pages. The search engine's repository is not large enough to accommodate every page available on the web, so only the most relevant pages should be stored in the database, and a better approach is needed to select them. The software that traverses the web to fetch pages is called a "crawler" or "spider". A specialized crawler, called a focused crawler, traverses the web and selects the pages relevant to a defined topic rather than exploring every region of the web. It does not collect all web pages, but retrieves only the relevant ones. The major problem is therefore how to retrieve relevant, high-quality web pages. To address this problem, in this thesis we have designed an algorithm that partitions a web page into blocks on the basis of headings and then calculates the relevancy of each block. The page relevancy is then calculated as the sum of the relevancy scores of all blocks in the page. The algorithm also calculates a URL score and identifies whether the URL is relevant to the topic. Compared with previous partitioning methods, our heading-based method is more appropriate: other methods treat the sub-tables of a table as separate blocks, whereas they should be part of the block in which the table resides. Partitioning on the basis of headings divides a page into appropriate blocks, because a complete block comprises the heading, content, images, links, tables and sub-tables of that block only.
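
The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the relevancy or URL-score formulas, so the term-frequency block score, the URL score, and all names here (`HeadingBlockParser`, `partition`, `block_relevance`, `page_relevance`, `url_score`) are assumptions made for illustration. Only the overall shape follows the abstract: a new block starts at each heading, tables and sub-tables stay inside the block that contains them, and the page score is the sum of the block scores.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingBlockParser(HTMLParser):
    """Collect page content into blocks; only a heading tag opens a new block."""

    def __init__(self):
        super().__init__()
        # Block 0 holds any content that appears before the first heading.
        self.blocks = [{"heading": "", "text": [], "links": []}]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.blocks.append({"heading": "", "text": [], "links": []})
            self._in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.blocks[-1]["links"].append(href)
        # Tables and nested tables open no block, so their text stays
        # in the block of the heading they appear under.

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        block = self.blocks[-1]
        if self._in_heading:
            block["heading"] = (block["heading"] + " " + text).strip()
        else:
            block["text"].append(text)


def partition(html):
    """Return the list of heading-delimited blocks for an HTML string."""
    parser = HeadingBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def tokenize(s):
    return re.findall(r"[a-z0-9]+", s.lower())


def block_relevance(block, topic_terms):
    """Assumed score: fraction of the block's words that are topic terms."""
    words = tokenize(block["heading"] + " " + " ".join(block["text"]))
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def page_relevance(html, topic_terms):
    """Page score as the sum of block scores, as stated in the abstract."""
    return sum(block_relevance(b, topic_terms) for b in partition(html))


def url_score(url, topic_terms):
    """Assumed score: fraction of topic terms that occur as tokens in the URL."""
    if not topic_terms:
        return 0.0
    tokens = set(tokenize(url))
    return sum(1 for t in topic_terms if t in tokens) / len(topic_terms)
```

A crawler built along these lines would compare `page_relevance` and `url_score` against thresholds to decide whether to store a page or follow a link; any such threshold values would have to be chosen empirically and are not given in the abstract.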

Description

M.E. (Software Engineering)
