Efficient Detection and Interpretation of Clusters in High Dimensional Databases
Abstract
The exponential growth of data resources has necessitated new techniques that can convert them into useful information. Clustering is a data mining technique that searches these data resources for hidden patterns. Many clustering algorithms are available in the literature. This research work focuses on partitioning-based methods and attempts to develop clustering algorithms that can efficiently detect clusters in high-dimensional databases. Among partitioning-based methods, k-means and single pass clustering (SPC) are popular algorithms, but they have several limitations. To overcome these limitations, a Modified Single Pass Clustering (MSPC) algorithm is proposed in this work. It is built around a threshold similarity value. This threshold is not a user-defined parameter; instead, it is a function of the data objects left to be clustered. In our experiments, the threshold is taken as the mean or median of the pairwise distances between all data objects left to be clustered. To assess the performance of the MSPC algorithm, five experiments with the k-means, SPC and MSPC algorithms were carried out on artificial and real datasets.
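The data-driven threshold described above can be illustrated with a minimal Python sketch. The abstract does not give the full MSPC procedure, so the seeding rule (first remaining object), the use of Euclidean distance, and the function names `adaptive_threshold` and `single_pass_adaptive` are assumptions made for illustration only, not the thesis's actual algorithm.

```python
from itertools import combinations
from math import dist
from statistics import mean, median


def adaptive_threshold(points, use_median=False):
    """Mean (or median) of all pairwise distances among the given points."""
    pair_distances = [dist(p, q) for p, q in combinations(points, 2)]
    if not pair_distances:  # fewer than two points left
        return 0.0
    return median(pair_distances) if use_median else mean(pair_distances)


def single_pass_adaptive(points, use_median=False):
    """Assumed procedure: seed a cluster with the first remaining point and
    absorb every remaining point within the current threshold, then repeat."""
    remaining = list(points)
    clusters = []
    while remaining:
        # The threshold is recomputed from the objects still unclustered.
        threshold = adaptive_threshold(remaining, use_median)
        seed = remaining[0]
        cluster = [p for p in remaining if dist(seed, p) <= threshold]
        clusters.append(cluster)
        remaining = [p for p in remaining if dist(seed, p) > threshold]
    return clusters
```

Because the threshold is recomputed from whatever objects remain, no similarity value has to be supplied by the user; only the choice between mean and median is left open.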
Further, a deterministic algorithm, Adaptive Threshold based Clustering (ATC), is proposed. It does not select data objects randomly; rather, it selects the farthest data objects. It uses a parameter, the neighborhood distance, to cluster the data objects. This parameter is again adaptive and is not specified by the user. Another parameter used in the ATC algorithm is the minimum support value, which prunes insignificant clusters. The performance of the ATC algorithm is assessed on ten artificial and eight real datasets and compared with the existing k-means algorithm.
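Two ingredients mentioned above, deterministic selection of the farthest data objects and pruning by minimum support, can be sketched as follows. This is not the ATC algorithm itself, whose steps the abstract does not specify. The helper names `farthest_pair` and `prune_by_support`, the Euclidean distance, and the reading of minimum support as a fraction of all objects are assumptions.

```python
from itertools import combinations
from math import dist


def farthest_pair(points):
    """Return the two points with the largest Euclidean distance between them.

    Ties are resolved by input order, so the result is deterministic.
    """
    return max(combinations(points, 2), key=lambda pair: dist(*pair))


def prune_by_support(clusters, total, min_support):
    """Keep clusters holding at least `min_support` (assumed to be a
    fraction between 0 and 1) of all `total` data objects."""
    return [c for c in clusters if len(c) / total >= min_support]
```

Unlike the random initialization used by k-means, this selection returns the same objects on every run for the same input.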
In this research work, new separation and compactness measures are also proposed. The proposed compactness measures are based on the arithmetic or geometric average of the maximum dispersion of data objects along each dimension. The proposed separation measure is the average pairwise distance between the data objects of clusters. Experiments have been carried out on artificial and real datasets to justify these measures. The work presented in this thesis can be extended by developing variants of the MSPC and ATC algorithms.
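One possible reading of these measures is sketched below. The abstract does not define "maximum dispersion", so it is taken here to be the range (maximum minus minimum) along each dimension. That interpretation, the Euclidean distance, and the function names are assumptions rather than the thesis's definitions.

```python
from itertools import product
from math import dist, prod
from statistics import mean


def dimension_ranges(cluster):
    """Range (max - min) of the cluster along each dimension."""
    return [max(column) - min(column) for column in zip(*cluster)]


def compactness(cluster, geometric=False):
    """Arithmetic or geometric average of the per-dimension ranges."""
    ranges = dimension_ranges(cluster)
    if geometric:
        return prod(ranges) ** (1 / len(ranges))
    return mean(ranges)


def separation(cluster_a, cluster_b):
    """Average distance over all pairs with one object from each cluster."""
    return mean(dist(p, q) for p, q in product(cluster_a, cluster_b))
```

Under this reading, a smaller compactness value indicates a tighter cluster and a larger separation value indicates clusters that lie farther apart.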
Description
PHD, CSED
