Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/4165
Title: Comparative Analysis of approaches for detecting near-duplicate URLs for search engine
Authors: Panwar, Shashank
Supervisor: Arora, Vinay
Keywords: Web crawling;Search engine;Apache Nutch
Issue Date: 26-Aug-2016
Abstract: Content on the web is growing rapidly, which makes search engines vital for information retrieval. A search engine uses a web crawler to traverse the content available on the internet. A web crawler is an internet robot that visits websites and fetches their content in order to create entries for the search engine’s index. Because of the huge number of pages on the web, search engines face the problem of removing duplicate and near-duplicate URLs. The process of removing duplicate copies of a particular piece of content is called deduplication. Duplicate detection techniques fall into two main categories: conventional approaches and modernistic approaches. Conventional approaches include the fingerprinting, shingling, cluster-based, URL-based, and keyword-based approaches. The modernistic approach is locality-sensitive hashing. Among the conventional approaches, only fingerprinting (i.e., the MD5 signature) is implemented in practice in a web crawler; the others remain conceptual or may be adopted in the near future. The conventional and modernistic approaches are compared on two parameters: the time the crawler takes to crawl the websites, and the number of duplicate documents the crawler detects using the different algorithms.
URI: http://hdl.handle.net/10266/4165
Appears in Collections: Masters Theses@CSED

Files in This Item:
4165.pdf (1.11 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.