Comparative Analysis of Approaches for Detecting Near-Duplicate URLs for Search Engines
Abstract
Web content is growing rapidly, making search engines vital for information retrieval. A search engine uses a web crawler to traverse the content available on the internet; the crawler is an internet robot that visits websites and fetches their content in order to create entries for the search engine's index. Because of the enormous number of pages on the web, search engines face the problem of removing duplicate and near-duplicate URLs. Duplicate detection techniques fall into two main categories: conventional approaches and modernistic approaches. Conventional approaches include the fingerprinting, shingling, cluster-based, URL-based, and keyword-based approaches; the modernistic approach is locality-sensitive hashing. Among the conventional approaches, only fingerprinting, i.e. the MD5 signature, has been implemented in practice in web crawlers; the other approaches remain conceptual or are candidates for future use.
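As an illustration of the fingerprinting approach, here is a minimal Python sketch (the function names and the in-memory set are illustrative assumptions, not taken from the thesis): each crawled page is reduced to an MD5 signature, and a page whose signature has already been seen is flagged as an exact duplicate.

```python
import hashlib

def fingerprint(page_content: str) -> str:
    """Reduce a page body to a fixed-length MD5 signature."""
    return hashlib.md5(page_content.encode("utf-8")).hexdigest()

# Signatures of pages the crawler has already fetched (illustrative store).
seen_fingerprints = set()

def is_duplicate(page_content: str) -> bool:
    """Flag the page as a duplicate if an identical body was crawled before."""
    fp = fingerprint(page_content)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```

Note that an exact hash such as MD5 only catches byte-identical copies; detecting near-duplicates is what motivates the shingling and locality-sensitive-hashing approaches named above.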
The conventional and modernistic approaches can be compared on the basis of two parameters: first, the time the crawler takes to crawl the websites, and second, the number of duplicate documents the crawler detects using the different algorithms. The process of removing duplicate copies of a given piece of content is called deduplication.
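For near-duplicate detection, the shingling and MinHash ideas behind locality-sensitive hashing can be sketched in the same spirit. This is a hedged illustration under assumed parameters (the shingle width k, the number of hash functions, and the helper names are all assumptions): a document is split into overlapping word shingles, each shingle set is compressed into a MinHash signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two documents.

```python
import hashlib

def shingles(text: str, k: int = 4) -> set:
    """Split text into overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    """For each of num_hashes salted hash functions, keep the minimum hash
    value over all shingles; similar sets yield similar signatures."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode("utf-8")).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A full locality-sensitive-hashing index would additionally split each signature into bands so that only documents sharing a band are compared; the sketch above stops at pairwise signature comparison.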
