A Hybrid Approach for Efficient Clustering of Big Data

Arora, Saurabh

dc.contributor.author	Arora, Saurabh
dc.contributor.supervisor	Chana, Inderveer
dc.date.accessioned	2014-08-06T07:26:14Z
dc.date.available	2014-08-06T07:26:14Z
dc.date.issued	2014-08-06T07:26:14Z
dc.description	ME, CSED	en
dc.description.abstract	Data generated by scientific applications and corporate environments has grown rapidly, not only in size but also in variety. The sheer volume of this data makes it difficult to collect and analyze. Data mining extracts useful information and hidden relationships from data, but traditional data mining approaches cannot be applied directly to big data because of their inherent complexity. Data clustering is an important data mining technique that plays a crucial role in numerous scientific applications. However, clustering big data is challenging because real-world datasets keep growing to extra-large scale. Meanwhile, MapReduce is a parallel programming model that is widely applied in data processing. Hadoop provides the cloud environment and is the most commonly used tool for analyzing big data. K-Means and DBSCAN have been parallelized to analyze big data in cloud environments. Parallelized K-Means is sensitive to noisy data and to initial conditions, and it forms clusters of a fixed shape, while DBSCAN suffers from long processing times and is more complex than K-Means. This thesis presents a theoretical overview of some current clustering techniques used for analyzing big data. A comprehensive analysis of these existing techniques is carried out, and an appropriate clustering algorithm is provided. A hybrid approach based on parallel K-Means and parallel DBSCAN is proposed to overcome the drawbacks of both algorithms while combining their benefits. The proposed technique is evaluated on the MapReduce framework of the Hadoop platform. The results show that the proposed approach improves on the parallel K-Means clustering algorithm. It also performs better than parallel DBSCAN when handling clusters of circularly distributed data points and slightly overlapping clusters. The proposed hybrid approach is more efficient than DBSCAN-MR because it takes less computation time, and it generates more accurate clusters than both the K-Means MapReduce algorithm and the DBSCAN MapReduce algorithm.	en
dc.format.extent	8439539 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10266/2830
dc.language.iso	en	en
dc.subject	Cloud Computing	en
dc.subject	Big data	en
dc.subject	Clustering	en
dc.title	A Hybrid Approach for Efficient Clustering of Big Data	en
dc.type	Thesis	en

Files

Original bundle

Name: 2830.pdf
Size: 8.05 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.79 KB
Format: Item-specific license agreed upon to submission

Collections

Masters Theses@CSED