Similarity Search Using Locality Sensitive Hasing and Bloom Filter

Loading...
Thumbnail Image

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Similarity search of text documents can be reduced to Approximate Nearest Neighbor Search by converting text documents into sets by using Shingling. In the proposed scheme conversion of a document into set is done by using k-shingle and similarity between sets, is calculated using Jaccard similarity. Characteristic matrix, created by searching shingles in each document is generated using the shingles and hash functions. Since the complexity of the matrix is significantly high when shingle set is very large, a technique called ‘minhashing’ is used to reduce the size of the matrix. Search time can be drastically reduced if characteristics matrix are created by using Bloom Filter. Bloom Filter is a probabilistic data structure that will reduce O(n) search time to constant time. Minhashing creates a signature matrix which is very less in size as compared with characteristics matrix but gives almost same result. Further, finding the similarity among all pairs of column of signature matrix is still a big problem because comparing n columns take O(n2) time. The time of comparing columns for similarity is reduced by using another technique, Locality Sensitive Hashing. The proposed scheme has been successfully implemented and comparative analysis of existing methodology and proposed methodology has been provided.

Description

Master of Engineering-Thesis

Citation

Endorsement

Review

Supplemented By

Referenced By