Efficient Similarity Search Techniques for Textual and Non-Textual Datasets

dc.contributor.author: Chauhan, Sachendra Singh
dc.contributor.supervisor: Batra, Shalini
dc.date.accessioned: 2020-11-23T07:28:44Z
dc.date.available: 2020-11-23T07:28:44Z
dc.date.issued: 2020-11-23
dc.description.abstract [en_US]: In today's information-overloaded world, data has become the epicentre of research. Textual data in the form of logs, newspapers, web documents, etc. is a key source for data analytics. Apart from textual content, images, videos, and audio generated by various handheld devices are shared and downloaded by millions of users across the globe every second. Finding similar items in such large and unstructured datasets (text and image) is a challenging task. An exact match rarely has meaning in these environments; proximity or distance among items is the preferred way to identify similar ones. In this work, three similarity search approaches are proposed: one for text documents and two for image datasets. For textual data, a parallel approximate similarity search approach is proposed that uses Bloom filters to represent document features and to compare them with the user's query, whose features are stored in an integer array. The approach has been implemented on a Graphics Processing Unit (GPU) with Compute Unified Device Architecture (CUDA) as the programming platform. For image datasets, two Content-Based Image Retrieval (CBIR) approaches are proposed. The first, the Bi-layer Content-Based Image Retrieval (BiCBIR) system, consists of two modules. The first module extracts image features in terms of color, texture, and shape. The second module consists of two layers: initially, all images are compared with the query image in the shape and texture feature space, and the indexes of the M images most similar to the query are retrieved; next, these M images are matched with the query image in the shape and color feature space, and finally F images similar to the query image are returned as output. The second approach, Feature-wise Incremental CBIR (FiCBIR), uses color, texture, and shape features.
The retrieval process is accomplished in three layers. The first layer searches the complete dataset using only one feature space, and the top 10% of images most similar to the query image are retained for the second layer. The second layer uses two features for similarity computation and passes only the 50% most similar images to the third layer. Finally, the third layer computes similarity using all three features. It has been experimentally shown that FiCBIR reduces the search space at each subsequent layer, applying multiple features only to a reduced dataset in the final layer. The proposed CBIR approaches are evaluated on publicly available image datasets, and the experimental results validate their effectiveness: both approaches outperform available state-of-the-art image retrieval systems in terms of precision, recall, and F-score.
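The thesis text gives no code, but the Bloom-filter representation described in the abstract can be sketched roughly as follows. The class name, filter size, and hash count are illustrative assumptions; the GPU/CUDA parallelism and the integer-array query encoding of the actual approach are not reproduced here.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over a bit array (illustrative sketch only)."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

# Index one document's features, then test query terms against the filter.
doc_filter = BloomFilter()
for feature in ["similarity", "search", "gpu"]:
    doc_filter.add(feature)

print(doc_filter.might_contain("similarity"))  # True
print(doc_filter.might_contain("unrelated"))   # almost certainly False
```

Membership tests are approximate by design, which matches the abstract's use of approximate similarity search: a query feature absent from the filter can occasionally report as present, but never the reverse.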
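The layered filtering idea behind FiCBIR can likewise be sketched in plain Python. The feature representation (fixed-length vectors per feature), the Euclidean distance, the feature order, and the `ficbir_search` function name are all assumptions for illustration; only the layer structure (full dataset with one feature, top 10% with two features, top 50% of those with all three) follows the abstract.

```python
import math

def distance(query, image, features):
    """Euclidean distance restricted to the chosen feature subspaces."""
    return math.sqrt(sum(
        (query[f][i] - image[f][i]) ** 2
        for f in features
        for i in range(len(query[f]))
    ))

def ficbir_search(query, dataset, top_k=3):
    # Layer 1: one feature over the full dataset; keep the top 10%.
    ranked = sorted(dataset, key=lambda img: distance(query, img, ["color"]))
    kept = ranked[:max(top_k, len(ranked) // 10)]
    # Layer 2: two features; keep the 50% most similar survivors.
    ranked = sorted(kept, key=lambda img: distance(query, img, ["color", "texture"]))
    kept = ranked[:max(top_k, len(ranked) // 2)]
    # Layer 3: all three features on the reduced set.
    ranked = sorted(kept, key=lambda img: distance(query, img,
                                                   ["color", "texture", "shape"]))
    return ranked[:top_k]

# Tiny demo with hand-made one-dimensional feature vectors.
query = {"color": [0.0], "texture": [0.0], "shape": [0.0]}
dataset = [
    {"name": "near",   "color": [0.1], "texture": [0.2], "shape": [0.1]},
    {"name": "middle", "color": [4.0], "texture": [4.0], "shape": [4.0]},
    {"name": "far",    "color": [9.0], "texture": [9.0], "shape": [9.0]},
]
best = ficbir_search(query, dataset, top_k=1)
print(best[0]["name"])  # near
```

Each layer pays for richer feature comparison only on the candidates surviving the cheaper previous layer, which is the search-space reduction the abstract reports.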
dc.identifier.uri: http://hdl.handle.net/10266/6047
dc.language.iso [en_US]: en
dc.subject [en_US]: Similarity Search
dc.subject [en_US]: Bloom Filter
dc.subject [en_US]: Feature Extraction
dc.subject [en_US]: Image similarity
dc.subject [en_US]: Sub-space features
dc.title [en_US]: Efficient Similarity Search Techniques for Textual and Non-Textual Datasets
dc.type [en_US]: Thesis

Files

Original bundle (showing 1 - 1 of 1)
Name: Efficient Similarity Search Techniques for Textual.pdf
Size: 4.18 MB
Format: Adobe Portable Document Format
License bundle (showing 1 - 1 of 1)
Name: license.txt
Size: 2.03 KB
Format: Item-specific license agreed upon to submission