A Novel Technique for Efficient Storage and Retrieval of Massive Data Sets

Singh, Amritpal

dc.contributor.author	Singh, Amritpal
dc.contributor.supervisor	Batra, Shalini
dc.date.accessioned	2018-10-12T08:01:27Z
dc.date.available	2018-10-12T08:01:27Z
dc.date.issued	2018-10-12
dc.description.abstract	In today’s world, data is considered one of the most valuable assets. With the emergence of a plethora of web applications and technologies such as sensors, IoT and cloud computing, the sources generating in-stream data have increased exponentially. Data originating from heterogeneous sources and real-world applications is highly susceptible to inconsistency, incompleteness and noise. To support data applications in different domains, data processing must be as efficient and automated as possible. Further, timely and accurate analysis of available data is an intrinsic requirement. Conventional databases and traditional data mining techniques are efficient for analytics on stored data, but for in-stream data, where data arrives continuously, it is not feasible to store the data in databases and then perform analysis, since all such applications demand time-bound query output. Moreover, traditional approaches demand that the entire data set be stored in a formatted manner. Massive data sets require architectures and tools for storing, handling, processing and mining bulk information in limited time and in a single pass. One available alternative is the use of Probabilistic Data Structures (PDS) in Big Data analytics, which use probability-based approaches, approximation principles and hashing methods to reduce the time and space trade-off in storage, retrieval and search of data. This thesis proposes three techniques for streamed data analysis. The first, a variant of the scalable Bloom Filter (BF) called AdapTable Bloom Filter (ATBF), performs peak-hour analysis and decides the size of the dynamic BF a priori using a Kalman filter and a Learning Array (LA). In the second approach, a variant of the stable BF, called FingerPrint Stable Bloom Filter (FPSBF), is proposed for duplicate detection in streamed data.
In the third approach, a semi-supervised technique for spam detection on Twitter is proposed, which employs an ensemble-based framework (Eb-SDF) comprising four classifiers. The framework is based on the use of PDS such as the Quotient Filter (QF) to query the URL, spam user and spam word databases, and Locality Sensitive Hashing (LSH) for similarity search. The performance of the proposed approaches has been evaluated through comparative analysis against similar data structures and through standard evaluation parameters. ATBF has been compared with the scalable BF for server utilization and hourly load analysis. FPSBF has been compared with the stable BF and the reservoir-sampling-based BF, and its accuracy in detecting duplicates in streaming data has been determined. Results are compared across different BF parameters, including counter size, Bloom filter size, number of hash functions, and false positive and false negative analysis. Eb-SDF has been tested on a Twitter dataset, and comparative analysis is performed on the basis of precision, recall and F1-score.	en_US
dc.identifier.uri	http://hdl.handle.net/10266/5418
dc.language.iso	en	en_US
dc.subject	Probabilistic data structures	en_US
dc.subject	Bloom Filter	en_US
dc.subject	Quotient Filter	en_US
dc.subject	Locality Sensitive Hashing	en_US
dc.subject	Ensemble based PDS	en_US
dc.title	A Novel Technique for Efficient Storage and Retrieval of Massive Data Sets	en_US

Files

Original bundle

Name: A Novel Technique for Efficient Storage and Retrieval of Massive data sets.pdf
Size: 9.78 MB
Format: Adobe Portable Document Format

Collections

Doctoral Theses@CSED