|Title:||Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce|
|Keywords:||DBSCAN, Big Data, MapReduce|
|Abstract:||Over the past 20 years, the volume of data generated from diverse sources has increased tremendously. With advances in technology, digital data is accumulating at a very high pace. Storing this much data raises a series of challenges and makes operations such as querying, retrieving, and analyzing the data difficult and tedious. Conventional database query methods and analytical technologies are becoming insufficient to deal with data at this scale. Cluster analysis has become an important data analysis method for unveiling unknown patterns in data. During the last few years, the mining of relational databases has become a popular research topic. Various clustering algorithms, such as partitioning and hierarchical methods, have been proposed for this type of database, but only a few methods have been proposed for spatial databases. Spatial clustering has become an active topic for researchers in spatial data mining. This research explores the use of one clustering method, density-based clustering, for spatial data mining. DBSCAN is the most popular algorithm of this type. The existing algorithm has quadratic time complexity, which can be reduced to O(n log n) by using an indexing structure. Even so, the algorithm cannot handle very large databases and takes a significant amount of time. The existing algorithms are implemented in the R tool and compared on different datasets. Modifications to the existing algorithm are proposed, and a new algorithm named MR-DBSCAN-KD is implemented using the MapReduce programming model. A traditional DBSCAN is also implemented on the Hadoop framework. Both algorithms are tested on a multi-node Hadoop setup formed by connecting three nodes. The two algorithms are compared on the basis of execution time and the number of clusters formed. 
It has been experimentally verified that the proposed algorithm works efficiently with large databases and reduces execution time.|
|Appears in Collections:||Masters Theses@CSED|
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.