A Novel Technique for Efficient Storage and Retrieval of Massive Data Sets
Abstract
In today’s world, data is considered one of the most valuable assets. With the emergence
of a plethora of web applications and technologies such as sensors, IoT, and cloud computing,
the number of sources generating streaming data has increased exponentially. Data originating
from heterogeneous sources and real-world applications is highly susceptible to inconsistency,
incompleteness, and noise. To support data applications in different domains, data processing
must be as efficient and automated as possible. Further, timely and accurate analysis
of available data is an intrinsic requirement.
Conventional databases and traditional data mining techniques are efficient for analytics
on stored data. For streaming data, which arrives continuously, it is not feasible
to first store the data in a database and then analyze it, since such applications demand
time-bound query output. Moreover, traditional approaches require that the entire data set
be stored in a structured format. Massive data sets require architectures and tools for
storing, handling, processing, and mining bulk information in limited time and in a
single pass. One available alternative is the use of Probabilistic Data Structures (PDS) in
big data analytics, which use probability-based approaches, approximation principles,
and hashing methods to reduce the time and space required for storage, retrieval, and search
of data.
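
As an illustration of how a PDS trades a small error probability for space, the following is a minimal sketch of a plain Bloom filter. It is not one of the structures proposed in this thesis. The class name, method names, the SHA-256 based double hashing, and the sizing parameters are all assumptions chosen for the example.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: approximate set membership in fixed space."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing (an assumption of this sketch): derive k bit positions
        # from two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "probably added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Each item is processed once, in a single pass, and the memory footprint is fixed at construction time. A lookup can return a false positive but never a false negative.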
This thesis proposes three techniques for streamed data analysis. The first, a variant
of the scalable Bloom Filter (BF) called AdapTable Bloom Filter (ATBF), performs peak-hour
analysis and decides the size of the dynamic BF a priori using a Kalman filter and a Learning
Array (LA). In the second approach, a variant of the stable BF, called FingerPrint Stable Bloom
Filter (FPSBF), is proposed for duplicate detection in streamed data. In the third approach, a
semi-supervised technique for spam detection on Twitter is proposed, which employs an
ensemble-based framework (Eb-SDF) comprising four classifiers. The framework is based
on the use of PDS such as the Quotient Filter (QF) to query the URL, spam-user, and
spam-word databases, and Locality Sensitive Hashing (LSH) for similarity search.
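
For context on duplicate detection in streams, the sketch below follows the general idea of the classic stable Bloom filter, in which cells are aged on every arrival so that stale entries fade out. It is not the FPSBF proposed in this thesis, whose fingerprint mechanism is not described in this abstract. All names, default parameter values, and the hashing scheme are assumptions made for illustration.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative stable Bloom filter for approximate duplicate detection."""

    def __init__(self, num_cells=1 << 16, num_hashes=4, cell_max=3,
                 decrements=8, seed=0):
        self.cells = bytearray(num_cells)  # one small counter per cell
        self.num_hashes = num_hashes
        self.cell_max = cell_max           # value written when an item is seen
        self.decrements = decrements       # cells aged on every arrival
        self.rng = random.Random(seed)     # seeded so runs are repeatable

    def _positions(self, item: str):
        # Double hashing over one SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % len(self.cells) for i in range(self.num_hashes)]

    def seen_before(self, item: str) -> bool:
        """Report whether item looks like a duplicate, then record it."""
        positions = self._positions(item)
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Age the filter: decrement a run of cells from a random start index.
        start = self.rng.randrange(len(self.cells))
        for offset in range(self.decrements):
            idx = (start + offset) % len(self.cells)
            if self.cells[idx] > 0:
                self.cells[idx] -= 1
        # Record the current item by resetting its cells to the maximum.
        for p in positions:
            self.cells[p] = self.cell_max
        return duplicate
```

Because cells decay over time, this kind of filter can report both false positives and false negatives, which is why both error types appear among the evaluation parameters.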
The performance of the proposed approaches has been evaluated through comparative analysis
against similar data structures and using standard evaluation parameters. ATBF
has been compared with the scalable BF on server utilization and hourly load analysis. FPSBF
has been compared with the stable BF and the reservoir sampling based BF, and its accuracy
in detecting duplicates in streaming data has been determined. Results are compared across
different BF parameters, including counter size, Bloom filter size, number of hash functions,
and false positive and false negative rates. Eb-SDF has been tested on a Twitter data set, and
comparative analysis is performed on the basis of precision, recall, and F1-score.
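
The three classification metrics named above follow their standard definitions, which can be sketched as below. The function name, the label encoding (1 for spam, 0 for non-spam), and the choice to return 0.0 when a denominator is zero are assumptions of this example, not details taken from the thesis.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Precision measures how many items flagged as spam really are spam, recall measures how many spam items were caught, and F1 is their harmonic mean.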
