Identify Similar Research Papers Using Locality Sensitive Hashing

dc.contributor.author: Gupta, Divya
dc.contributor.supervisor: Batra, Shalini
dc.date.accessioned: 2016-08-08T11:07:03Z
dc.date.available: 2016-08-08T11:07:03Z
dc.date.issued: 2016-08-08
dc.description.abstract: Identifying the research papers that belong to a particular domain is a tedious and time-consuming job for academics and researchers. A lot of effort can be saved if all papers related to a particular domain are combined in a single group, but manually clustering similar papers on the basis of topics, keywords, or abstracts is not feasible. This thesis presents an approach to clustering similar papers using Locality Sensitive Hashing (LSH), a probabilistic technique that places similar documents in a single bucket by splitting the input text into shingles and using min-hashing, which estimates Jaccard similarity, to generate a signature matrix. Our work explores how similar research papers can be clustered by considering the title, keywords, and abstract of each paper. Experimental analysis shows that, using LSH, the majority of papers from a similar domain are categorized into one bucket in less time. In particular, we combine the locality sensitive hashing of the abstract with the authors, keywords, and journal of the paper. The basic methodology we adopt is to turn each document into a vector model by shingling and to measure the similarity between the resulting sets using Jaccard similarity. From the shingles we build a characteristic matrix, and a technique called "minhashing" generates a signature for each document, which reduces the size of the matrix. [en_US]
dc.identifier.uri: http://hdl.handle.net/10266/4043
dc.language.iso: en [en_US]
dc.subject: Hashing [en_US]
dc.subject: Locality [en_US]
dc.title: Identify Similar Research Papers Using Locality Sensitive Hashing [en_US]
dc.type: Thesis [en_US]
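
The pipeline named in the abstract (shingling, Jaccard similarity, min-hashing into signatures, and LSH bucketing) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names, the character shingle length k=5, the hash family, the banding scheme, and the sample texts are all assumptions made for the example.

```python
import random
import zlib
from collections import defaultdict

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus (an assumption)

def shingles(text, k=5):
    """Return the set of k-character shingles, each hashed to an integer."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    last_start = max(1, len(text) - k + 1)  # short texts yield one shingle
    return {zlib.crc32(text[i:i + k].encode("utf-8")) for i in range(last_start)}

def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(shingle_set, params):
    """One entry per hash function: the minimum hash over all shingles."""
    return [min((a * s + b) % PRIME for s in shingle_set) for a, b in params]

def lsh_buckets(signatures, bands, rows):
    """Cut each signature into bands; equal bands share a bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    return buckets

def candidate_pairs(buckets):
    """Collect every pair of documents that shares at least one bucket."""
    pairs = set()
    for docs in buckets.values():
        ordered = sorted(docs)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs

if __name__ == "__main__":
    # Hypothetical paper texts (for example title + keywords + abstract).
    papers = {
        "p1": "Clustering research papers with locality sensitive hashing",
        "p2": "Clustering of research papers using locality sensitive hashing",
        "p3": "Thermal analysis of reinforced concrete beams under load",
    }
    params = make_hash_params(num_hashes=20, seed=1)
    sigs = {pid: minhash_signature(shingles(t), params)
            for pid, t in papers.items()}
    buckets = lsh_buckets(sigs, bands=10, rows=2)
    print(sorted(candidate_pairs(buckets)))
    print(jaccard(shingles(papers["p1"]), shingles(papers["p2"])))
```

With 20 hash functions split into 10 bands of 2 rows, two papers become candidates when any one band of their signatures matches exactly. Using more rows per band makes bucketing stricter, while using more bands makes it more permissive; the values above are illustrative only.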

Files

Original bundle

Name: 4043.pdf
Size: 1.76 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.03 KB
Format: Item-specific license agreed upon to submission