Big Data Clustering Based Recommendation System Model Through Correlations
Abstract
Technological advancement has enabled us to store and process huge amounts of data in
relatively short spans of time. Data is rapidly growing in dimensionality, becoming
multi- and high-dimensional. There is an immediate need to expand our focus to include
the analysis of high-dimensional and large datasets. Data analysis is becoming a mammoth
task as a result of the incremental increase in data volume and the complexity arising from
the heterogeneity of data. Because of this dynamic computing environment, existing techniques
need to be either modified or discarded to handle new high-dimensional data. Data
clustering is a tool that is used in many disciplines, including data mining, so that meaningful
knowledge can be extracted from seemingly unstructured data. Correlation clustering
possibly represents the most intuitive form of clustering construction. It gives solutions that
can be approximated while automatically selecting the number of clusters. This approach
handles scenarios where the focus is on relationships between the objects instead of on actual
representations of the objects. The method is also suitable for structured
objects for which feature vectors are not easy to obtain. Given the increasing scale of data
these days, correlation clustering has become a powerful addition to the fields of data mining
and agnostic learning. In this thesis, we start by proposing an algorithm that defines
an intuitive and accurate correlation coefficient metric, the General (rank-based)
correlation coefficient (G). Based on this algorithm, we then propose a framework
named G Based Agglomerative Clustering (GBAC). This approach has been found to
generate high-quality clusters for small, large, and high-dimensional data. This
framework combines the predictive power of correlation coefficients with the ability to find
patterns in data obtained from agglomerative hierarchical clustering.
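As a rough illustration of pairing a rank-based correlation with agglomerative hierarchical clustering, the sketch below may help. It is not the thesis's G coefficient or the GBAC framework: it assumes Spearman's rank correlation as a stand-in, uses 1 - correlation as the distance, and merges clusters by average linkage until a hypothetical `threshold` is exceeded. All function names are illustrative.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def rank_correlation(x, y):
    """Spearman rank correlation (assumed stand-in, not the thesis's G)."""
    return pearson(ranks(x), ranks(y))


def agglomerate(rows, threshold):
    """Average-linkage agglomerative clustering on distance 1 - correlation.

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`, so the number of clusters is not fixed in advance.
    """
    n = len(rows)
    dist = {
        (i, j): 1.0 - rank_correlation(rows[i], rows[j])
        for i, j in combinations(range(n), 2)
    }
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        (a, b), d = min(
            (
                ((p, q), linkage(clusters[p], clusters[q]))
                for p, q in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so b can be deleted safely
        del clusters[b]
    return [sorted(c) for c in clusters]


rows = [
    [1, 2, 3, 4, 5],
    [2, 4, 5, 8, 10],
    [5, 4, 3, 2, 1],
    [10, 8, 7, 3, 1],
]
clusters = agglomerate(rows, threshold=0.5)
```

Here the two increasing rows and the two decreasing rows end up in separate clusters, because rank correlation compares the ordering of values rather than their magnitudes.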
To explore complex relationships in data, there is a need to integrate dimensionality-reduction
techniques with data-mining approaches and graph theory. We propose another
approach called Local Graph based Correlation Clustering (LGBACC). This approach merges
hierarchical clustering with PCA to uncover complex hierarchical relationships and uses
graph models to visualize the results. Visualization of data is an important output and is
knitted into the fabric of the framework. LGBACC is found to produce high-quality clusters
across a wide spectrum of dimensionality. Finally, both of these algorithms have been
tested on real-life datasets using distributed and parallel computing (SparkR). We used
four large datasets (ranging from 260 GB to 1 TB) to prove the scalability of the proposed
approaches. They are found to be scalable and to perform better than existing
hierarchical clustering algorithms. These algorithms are then integrated into recommendation
algorithms, and we define a recommendation system model through correlations. This
model has been validated using a real-time, large dataset, and the results prove that combining
correlated points with the predictive power of recommendation algorithms produces
better-quality recommendations.
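
To make the idea of recommending through correlations concrete, here is a minimal sketch under stated assumptions. It is not the thesis's model: it assumes a plain user-based collaborative filter in which a missing rating is predicted as the correlation-weighted average of ratings from positively correlated users. The `ratings` data and the function names are hypothetical.

```python
def pearson_on_shared(u, v):
    """Pearson correlation over items both users rated; 0.0 if undefined."""
    shared = [item for item in u if item in v]
    if len(shared) < 2:
        return 0.0
    xs = [u[i] for i in shared]
    ys = [v[i] for i in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def predict(ratings, user, item):
    """Correlation-weighted average rating from positively correlated users.

    Returns None when no positively correlated user has rated the item.
    """
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = pearson_on_shared(ratings[user], other_ratings)
        if weight > 0:
            num += weight * other_ratings[item]
            den += weight
    return num / den if den else None


# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "ann": {"a": 5, "b": 3, "c": 1},
    "bob": {"a": 4, "b": 3, "c": 2, "d": 5},
    "cat": {"a": 1, "b": 3, "c": 5, "d": 1},
}
prediction = predict(ratings, "ann", "d")
```

In this toy data only "bob" correlates positively with "ann", so the prediction for item "d" follows his rating, while the negatively correlated "cat" is ignored.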
