A Hybrid Approach for Efficient Clustering of Big Data
Abstract
Data generated by scientific applications and corporate environments has grown rapidly,
not only in size but also in variety. The sheer volume of this data makes it difficult to
collect and analyze. Data mining is the technique by which useful information and hidden
relationships among data are extracted, but traditional data mining approaches cannot be
used directly for big data due to their inherent complexity.
Data clustering is an important data mining technique that plays a crucial role in
numerous scientific applications. However, clustering big data is challenging because
real-world datasets have grown rapidly to extra-large scale. MapReduce is a parallel
programming platform that is widely applied in data processing. Hadoop provides the cloud
environment and is the most commonly used tool for analyzing big data. K-Means and DBSCAN
have been parallelized to analyze big data in a cloud environment. Parallelized K-Means is
sensitive to noisy data and to initial conditions, and it forms clusters of a fixed shape,
while DBSCAN suffers from long processing time and is more complex than K-Means.
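To illustrate how K-Means fits the MapReduce model, the following is a minimal single-machine sketch in Python of one iteration expressed as a map step and a reduce step. It is not the thesis's implementation: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, and squared Euclidean distance and the in-memory grouping that stands in for Hadoop's shuffle phase are assumptions made for illustration.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point


def kmeans_reduce(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid that received no points is kept unchanged (an assumption).
    return [
        kmeans_reduce(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(points, centroids)
```

Because each centroid is the mean of its assigned points, a single outlier can shift it noticeably, and the result depends on the starting centroids. These are the sensitivities to noise and to initial conditions noted above.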
This thesis presents a theoretical overview of some current clustering techniques used
for analyzing big data. A comprehensive analysis of these existing techniques is carried
out and an appropriate clustering algorithm is provided. A hybrid approach based on
parallel K-Means and parallel DBSCAN is proposed to overcome the drawbacks of both
algorithms while combining their benefits. The proposed technique is evaluated on the
MapReduce framework of the Hadoop platform. The results show that the proposed approach
improves on the parallel K-Means clustering algorithm. It also performs better than
parallel DBSCAN when handling clusters of circularly distributed data points and slightly
overlapping clusters. The proposed hybrid approach is more efficient than DBSCAN-MR, as it
takes less computation time, and it generates more accurate clusters than both the K-Means
MapReduce and DBSCAN MapReduce algorithms.
Description
ME, CSED
