Similarity Search Using Locality Sensitive Hasing and Bloom Filter
Loading...
Files
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Similarity search of text documents can be reduced to Approximate Nearest Neighbor
Search by converting text documents into sets by using Shingling. In the proposed
scheme conversion of a document into set is done by using k-shingle and similarity
between sets, is calculated using Jaccard similarity. Characteristic matrix, created by
searching shingles in each document is generated using the shingles and hash functions.
Since the complexity of the matrix is significantly high when shingle set is very large, a
technique called ‘minhashing’ is used to reduce the size of the matrix. Search time can be
drastically reduced if characteristics matrix are created by using Bloom Filter. Bloom
Filter is a probabilistic data structure that will reduce O(n) search time to constant time.
Minhashing creates a signature matrix which is very less in size as compared with
characteristics matrix but gives almost same result. Further, finding the similarity among
all pairs of column of signature matrix is still a big problem because comparing n
columns take O(n2) time. The time of comparing columns for similarity is reduced by
using another technique, Locality Sensitive Hashing. The proposed scheme has been
successfully implemented and comparative analysis of existing methodology and
proposed methodology has been provided.
Description
Master of Engineering-Thesis
