Comparative Analysis of Approaches for Detecting Near-Duplicate URLs for Search Engines
Abstract
Web content is growing rapidly, making search engines vital for information retrieval. A search engine uses a web crawler to traverse the content available on the internet; the crawler is an internet robot that visits websites and fetches their content in order to create entries for the search engine's index. Because of the enormous number of pages on the web, search engines face the problem of removing duplicate and near-duplicate URLs. Duplicate detection techniques fall into two main categories: conventional approaches and modernistic approaches. Conventional approaches include the fingerprinting, shingling, cluster-based, URL-based, and keyword-based approaches; the modernistic approach is locality-sensitive hashing. Among the conventional approaches, only fingerprinting, i.e. the MD5 signature, has been implemented in practice in web crawlers; the other approaches remain conceptual or are candidates for future use.
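As an illustration of the fingerprinting approach, here is a minimal Python sketch (the function names and the in-memory set are illustrative assumptions, not taken from the thesis): each crawled page is reduced to an MD5 signature, and a page whose signature has already been seen is flagged as an exact duplicate.

```python
import hashlib

def fingerprint(page_content: str) -> str:
    """Reduce a page body to a fixed-length MD5 signature."""
    return hashlib.md5(page_content.encode("utf-8")).hexdigest()

# Signatures of pages the crawler has already fetched (illustrative store).
seen_fingerprints = set()

def is_duplicate(page_content: str) -> bool:
    """Flag the page as a duplicate if an identical body was crawled before."""
    fp = fingerprint(page_content)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```

Note that an exact hash such as MD5 only catches byte-identical copies; detecting near-duplicates is what motivates the shingling and locality-sensitive-hashing approaches named above.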
The conventional and modernistic approaches can be compared on the basis of two parameters: first, the time the crawler takes to crawl the websites, and second, the number of duplicate documents the crawler detects using the different algorithms. The process of removing duplicate copies of a given piece of content is called deduplication.
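For near-duplicate detection, the shingling and MinHash ideas behind locality-sensitive hashing can be sketched in the same spirit. This is a hedged illustration under assumed parameters (the shingle width k, the number of hash functions, and the helper names are all assumptions): a document is split into overlapping word shingles, each shingle set is compressed into a MinHash signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two documents.

```python
import hashlib

def shingles(text: str, k: int = 4) -> set:
    """Split text into overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    """For each of num_hashes salted hash functions, keep the minimum hash
    value over all shingles; similar sets yield similar signatures."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode("utf-8")).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A full locality-sensitive-hashing index would additionally split each signature into bands so that only documents sharing a band are compared; the sketch above stops at pairwise signature comparison.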
