A Hybrid Approach for Efficient Clustering of Big Data

Arora, Saurabh

Files

2830.pdf (8.05 MB)

Date

2014-08-06T07:26:14Z

Authors

Arora, Saurabh

Supervisors

Chana, Inderveer

Abstract

Data generated by scientific applications and corporate environments has grown rapidly, not only in size but also in variety. The volume collected is so large that gathering and analyzing such big data is difficult. Data mining extracts useful information and hidden relationships from data, but traditional data mining approaches cannot be applied directly to big data because of their inherent complexity. Data clustering is an important data mining technique that plays a crucial role in numerous scientific applications. However, clustering big data is challenging because real-world datasets keep growing to extra-large scale. Meanwhile, MapReduce is a desirable parallel programming platform that is widely applied in data processing. Hadoop provides the cloud environment and is the most commonly used tool for analyzing big data. K-Means and DBSCAN have both been parallelized to analyze big data in a cloud environment. Parallelized K-Means is sensitive to noisy data and to initial conditions, and it forms clusters of a fixed shape, while DBSCAN suffers from long processing time and is more complex than K-Means. This thesis presents a theoretical overview of some current clustering techniques used for analyzing big data. A comprehensive analysis of these existing techniques is carried out and an appropriate clustering algorithm is provided. A hybrid approach based on parallel K-Means and parallel DBSCAN is proposed to overcome the drawbacks of both algorithms while combining their benefits. The proposed technique is evaluated on the MapReduce framework of the Hadoop platform. The results show that the proposed approach improves on the parallel K-Means clustering algorithm. It also performs better than parallel DBSCAN when handling clusters of circularly distributed data points and slightly overlapping clusters. The proposed hybrid approach is more efficient than DBSCAN-MR, as it takes less computation time, and it generates more accurate clusters than both the K-Means MapReduce algorithm and the DBSCAN MapReduce algorithm.
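The abstract does not describe how the thesis implements its algorithms, so the following is only an illustrative sketch of how a single K-Means iteration is commonly expressed as a map step and a reduce step. It runs in plain Python with no Hadoop dependency, and the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) and the sample data are hypothetical, not taken from the thesis.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(points, centroids)
```

Because the updated centroids depend on the starting centroids, a sketch like this also makes the sensitivity to initial conditions noted in the abstract easy to observe by changing `centroids`.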

Description

ME, CSED

Keywords

Cloud Computing, Big data, Clustering

URI

http://hdl.handle.net/10266/2830

Collections

Masters Theses@CSED
