Please use this identifier to cite or link to this item:
http://hdl.handle.net/10266/4233
Title: | Textual Similarity Analysis Using Locality Sensitive Hashing |
Authors: | Bansal, Reetika |
Supervisor: | Batra, Shalini |
Keywords: | Locality Sensitive Hashing;Similarity Search;Shingles;minhashing |
Issue Date: | 2-Sep-2016 |
Abstract: | The rapid growth in online information requires the efficient management and usability of data. Searching similar text documents in large set of documents is one of the most challenging tasks in the era of Big data. This thesis focuses on finding similarity between documents using Approximate K-nearest neighbor (KNN) search. Shingling technique has been applied which converts text documents into sets and further the set identical (similarity) value is calculated using Jaccard similarity. Later, hash functions and shingles in each document are used to create characteristic matrix. Shingles generation has been divided in three categories: with and without stopwords, with and without blank spacing and with and without stammers. Since shingle set is quite large, the size of matrix is significant, a technique called „minhashing‟ is used to reduce the size of the matrix. Minhashing creates a signature matrix which is very less in size compared to characteristic matrix with approximately same outcome. Further, finding the similarity among all pairs of column of signature matrix is still a big problem because comparing a column takes O (n2) time. The time of comparing columns for similarity is reduced using another technique, termed as Locality Sensitive Hashing which divided the entire matrix into number of bands and buckets and similar elements lie in one bucket. The entire technique has been successfully implemented with variations in categories of shingles and size of shingles. |
Description: | Master of Engineering- Software Engineering |
URI: | http://hdl.handle.net/10266/4233 |
Appears in Collections: | Masters Theses@CSED |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.