Please use this identifier to cite or link to this item:
http://hdl.handle.net/10266/4964
Title: | Big Data Clustering Based Recommendation System Model Through Correlations |
Authors: | Pandove, Divya |
Supervisor: | Rani, Rinkle; Goel, Shivani |
Keywords: | Big Data, Correlation Clustering, Spark R, Recommendation System |
Issue Date: | 1-Nov-2017 |
Abstract: | Technological advancement has enabled us to store and process huge amounts of data in relatively short spans of time. Data is rapidly growing in dimensionality, becoming multi- and high-dimensional. There is an immediate need to expand our focus to include analysis of high-dimensional and large datasets. Data analysis is becoming a mammoth task as a result of the steady increase in data volume and in complexity, in terms of the heterogeneity of data. Because of this dynamic computing environment, existing techniques need to be either modified or discarded in order to handle new high-dimensional data. Data clustering is a tool used in many disciplines, including data mining, to extract meaningful knowledge from seemingly unstructured data. Correlation clustering possibly represents the most intuitive form of clustering construction. It gives solutions that can be approximated while automatically selecting the number of clusters. This approach handles scenarios where the focus is on relationships between the objects rather than on the actual representations of the objects. The suitability of this method extends to structured objects for which feature vectors are not easy to obtain. Given the increasing scale of data these days, correlation clustering has become a powerful addition to the fields of data mining and agnostic learning. In this thesis, we start by proposing an algorithm that defines an intuitive and accurate correlation coefficient metric, known as the General (rank based) correlation coefficient (G). Further, a framework based on this algorithm is proposed and named G Based Agglomerative Clustering (GBAC). Our approach has been found to be effective for small, large, and high-dimensional data, and to generate high-quality clusters. 
This framework combines the predictive power of correlation coefficients with the ability to find patterns in data obtained from agglomerative hierarchical clustering. To explore complex relationships in data, there is a need to integrate dimensionality-reduction techniques with data-mining approaches and graph theory. We propose another approach called Local Graph based Correlation Clustering (LGBACC). This approach merges hierarchical clustering with PCA to uncover complex hierarchical relationships and uses graph models to visualize the results. Visualization of data is an important output and is knitted into the fabric of the framework. LGBACC is found to produce high-quality clusters across a wide spectrum of dimensionality. Finally, both of these algorithms have been tested on real-life datasets using distributed and parallel computing (SparkR). We have used four large datasets (varying between 260 GB and 1 TB) to demonstrate the scalability of the proposed approaches. They are found to be scalable and to perform better than the existing hierarchical clustering algorithms. These algorithms are then integrated into recommendation algorithms, and we define a recommendation system model through correlations. This model has been validated using a real-time, large dataset, and the results show that combining correlated points with the predictive power of recommendation algorithms produces better-quality recommendations. |
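The general idea described in the abstract, grouping objects by a rank-based correlation and merging them agglomeratively, can be illustrated with a small sketch. This is not the thesis's GBAC algorithm: the G coefficient is not defined in this record, so Spearman's rank correlation is swapped in as a stand-in, and the function names and the `min_corr` stopping threshold are hypothetical choices made only for illustration.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based average ranks; tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (stand-in for G): Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant row carries no rank information
    return cov / (vx * vy) ** 0.5


def correlation_clusters(rows, min_corr=0.8):
    """Average-linkage agglomerative clustering on pairwise correlation.

    Merging stops when no two clusters have an average correlation of at
    least min_corr (an assumed threshold), so the number of clusters is
    not fixed in advance.
    """
    n = len(rows)
    corr = {(i, j): spearman(rows[i], rows[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]

    def avg_corr(a, b):
        total = sum(corr[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_corr(clusters[p[0]], clusters[p[1]]))
        if avg_corr(clusters[a], clusters[b]) < min_corr:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index b stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data: rows 0 and 1 rise together, rows 2 and 3 fall together.
rows = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 11],
    [5, 4, 3, 2, 1],
    [10, 8, 7, 3, 0],
]
print(correlation_clusters(rows))  # -> [[0, 1], [2, 3]]
```

The sketch computes all pairwise correlations up front, which is quadratic in the number of objects and suited only to small inputs; the distributed SparkR setting mentioned in the abstract is outside its scope.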
URI: | http://hdl.handle.net/10266/4964 |
Appears in Collections: | Doctoral Theses@CSED |