Title: Efficient Text Clustering Techniques for Big Datasets
Authors: Mehta, Vivek
Supervisor: Bawa, Seema; Singh, Jasmeet
Keywords: Document clustering; High dimensionality; Unsupervised learning; Text datasets; Big datasets; Word embeddings; Lexical chains
Issue Date: 2-Nov-2021
Abstract: Clustering is regarded as one of the most important tools for data analysis, especially when label information is not available. It partitions a collection of data points into groups such that the points within each group are as similar as possible. A big dataset is, in general, characterized by several complexities, including high dimensionality. In textual datasets specifically, high dimensionality poses a great challenge for clustering as well as other text mining tasks. In a textual dataset, the number of unique words across the whole corpus (set of documents) becomes the dimensionality of the dataset. Hence, the number of dimensions can range from tens of thousands to a few million for a dataset containing some thousands of documents. In addition, the matrix representation of such datasets becomes very sparse (containing a large number of zeros). These challenges make traditional clustering techniques, such as partitioning-based, hierarchical, and density-based methods, unsuitable for such high-dimensional and sparse data. In some cases, they fail to perform clustering at all. Another important challenge for textual datasets is to account for the semantics (meaning) of the text while forming clusters. Several semantic-based text clustering techniques in the literature consider semantics and, to some extent, attempt to reduce the high dimensionality problem. Still, there is a crucial need for text clustering techniques that can scale to the high dimensionality of large textual datasets. This thesis proposes text clustering techniques that attempt to address these challenges simultaneously. The first proposed technique, named “Stamantic Clustering”, is based on lexical chains (groups of semantically related words) and WordNet (a lexical database for English).
The second proposed technique, named “WEClustering”, is based on word embeddings (numerical vectors that represent words). Both techniques have been validated on sufficiently large text datasets with high dimensionality. Based on various performance metrics, both techniques have also been compared with some existing state-of-the-art text clustering techniques. The analysis shows that the proposed techniques are more efficient in clustering high-dimensional textual datasets. Additionally, the two proposed techniques are compared with each other on factors such as accuracy, execution, and scalability.
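As a rough illustration of the dimensionality and sparsity points made in the abstract, the following minimal sketch builds a bag-of-words document-term matrix for a tiny made-up corpus. The corpus and the helper names (`document_term_matrix`, `sparsity`) are hypothetical and are not taken from the thesis; the sketch only shows that the vocabulary size sets the number of dimensions and that most matrix entries are zero.

```python
# Toy illustration only: the corpus and helpers are assumptions, not the thesis's method.
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) for a simple bag-of-words representation."""
    tokenized = [doc.lower().split() for doc in docs]
    # Each unique word in the corpus becomes one dimension.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Return the fraction of matrix entries that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


docs = [
    "clustering groups similar documents",
    "word embeddings represent words as vectors",
    "lexical chains link related words",
]
vocabulary, matrix = document_term_matrix(docs)
print("dimensions:", len(vocabulary))
print("sparsity:", round(sparsity(matrix), 2))
```

Even with three short documents, most entries are zero; with thousands of documents the vocabulary grows much faster than the length of any single document, so the zero fraction rises further.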
Appears in Collections: Doctoral Theses@CSED

Files in This Item:
File: MergedPDF_thesis.pdf
Size: 6.04 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.