Hierarchical Clustering Algorithm for Big Data using Hadoop and Mapreduce
Abstract
Mining massive data sets is a pressing need in today's computing industry. The
exponential growth in the number of internet users and in the volume of available
data forces researchers to look for efficient ways to store data and extract useful
patterns from it. Extracting useful information from massive data and processing it
in a short span of time has become a crucial part of data mining. Many approaches
exist for clustering data objects based on similarity. CURE (Clustering Using
Representatives) is a useful hierarchical algorithm that can identify clusters of
arbitrary shape and detect outliers. However, the traditional CURE algorithm runs on
a single machine and therefore cannot cluster large amounts of data efficiently.
In this thesis, the CURE algorithm is proposed for a distributed environment using
Hadoop. A distributed environment is an efficient way to process huge amounts of
data and extract useful patterns from it, so the clustering of data objects is
performed with the MapReduce programming model. Another advantage of the CURE
algorithm is that it detects outlier points and removes them from the further
clustering process, which improves the quality of the clusters. The major focus of
this thesis is to explore a new approach to clustering data objects with the CURE
algorithm in the Hadoop distributed environment and to examine the effect of
different parameters on outlier detection.
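
As a rough illustration of the idea, the sketch below shows the representative-selection step commonly associated with CURE (pick a few well-scattered points, then shrink them toward the cluster centroid) and simulates a map and a reduce phase in plain Python. It is not the thesis implementation and does not use the Hadoop API: the function names `representatives`, `map_partition` and `reduce_representatives`, the parameters `c` (number of representatives) and `alpha` (shrink factor), and the single `"reps"` key are all assumptions made for this example.

```python
import math


def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def representatives(points, c=3, alpha=0.5):
    """Pick up to c well-scattered points and shrink them toward the centroid.

    c and alpha are illustrative defaults, not values taken from the thesis.
    """
    center = centroid(points)
    distinct = list(dict.fromkeys(points))  # drop duplicates, keep order
    # Start with the point farthest from the centroid.
    chosen = [max(distinct, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(distinct)):
        remaining = [p for p in distinct if p not in chosen]
        # Next pick: the point whose nearest chosen representative is farthest.
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, r) for r in chosen))
        )
    # Shrink each representative toward the centroid by the factor alpha.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in chosen]


def map_partition(points, c=3, alpha=0.5):
    """Hypothetical map step: emit (key, representative) pairs for one partition."""
    return [("reps", r) for r in representatives(points, c, alpha)]


def reduce_representatives(pairs):
    """Hypothetical reduce step: gather the representatives from all partitions."""
    return [value for key, value in pairs if key == "reps"]
```

For example, calling `map_partition` on each data partition and passing the concatenated pairs to `reduce_representatives` yields a much smaller point set on which a final clustering pass could run. Shrinking toward the centroid is what makes the representatives less sensitive to outlying points.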
Description
Master of Engineering-Software Engineering
