Identify Similar Research Papers Using Locality Sensitive Hashing

Abstract

Identifying the research papers that belong to a particular domain is a tedious and time-consuming job for academicians and researchers. A lot of effort can be saved if all papers related to a particular domain are collected in a single group. Manually clustering similar papers on the basis of topics, keywords or abstracts is not feasible. This thesis presents an approach to clustering similar papers using Locality Sensitive Hashing (LSH), a probabilistic technique that places similar documents in the same bucket by splitting the input text into shingles and using min-hashing, a method for estimating Jaccard similarity, to generate a signature matrix. Our work explores how similar research papers can be clustered by considering the title, keywords and abstract of each paper. Experimental analysis shows that, using LSH, the majority of the papers from a similar domain are placed in one bucket in less time. In particular, we combine the locality sensitive hashing of the abstract with the authors, keywords and journal of the paper. The basic methodology we adopt is to turn each document into a vector model by shingling and to measure the similarity among the resulting sets using Jaccard similarity. From the shingles we build a characteristic matrix, and a technique called "minhashing" is used to generate a signature for each document, which reduces the size of the matrix.
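
The pipeline described in the abstract (shingling, min-hashing, then bucketing) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the shingle length `k`, the number of hash functions, the banding scheme, the use of CRC32 to map shingles to integers, and every function name below are assumptions chosen for illustration only.

```python
import random
import zlib
from collections import defaultdict

# Assumption: a large Mersenne prime used as the modulus of the hash family.
PRIME = (1 << 61) - 1


def shingles(text, k=5):
    """Return the set of character k-shingles of a lower-cased,
    whitespace-normalised text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """One column of the signature matrix: for each hash function,
    the minimum hash value over the document's shingles."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def lsh_buckets(signatures, bands):
    """Split each signature into `bands` equal slices and put documents
    whose slice matches exactly into the same bucket."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Pairs of documents that share at least one bucket."""
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

In this sketch each paper would be represented by one string (for example its title, keywords and abstract concatenated), turned into a signature with `minhash_signature(shingles(text), params)`, and passed to `lsh_buckets` in a dictionary keyed by paper identifier. Using more bands with fewer rows per band makes it more likely that moderately similar papers share a bucket, at the cost of more false candidates.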
