Efficient Text Clustering Techniques for Big Datasets

Mehta, Vivek

Files

MergedPDF_thesis.pdf (5.9 MB)

Date

2021-11-02

Authors

Mehta, Vivek

Supervisors

Bawa, Seema

Singh, Jasmeet

Abstract

Clustering is regarded as one of the most important tools for data analysis, especially when label information is not available. It partitions a collection of data points into groups such that the data points within each group are as similar as possible. A big dataset is, in general, characterized by several complexities, including high dimensionality. Specifically, in the case of textual datasets, high dimensionality poses a great challenge for clustering as well as for other text mining tasks. In a textual dataset, the number of unique words across the whole corpus (set of documents) becomes the dimensionality of the dataset. Hence, the number of dimensions can range from tens of thousands to a few million for a dataset containing a few thousand documents. In addition, the matrix representation of such datasets becomes very sparse (containing a large number of zeros). These major challenges make traditional clustering techniques, such as partitioning-based, hierarchical, and density-based methods, unsuitable for such high-dimensional and sparse data. In some cases, they even fail to perform clustering. Another important challenge in the case of textual datasets is to include the semantics (meaning) of the text while forming clusters. The literature defines several semantic-based text clustering techniques that consider semantics and, to some extent, attempt to reduce the high dimensionality problem. Still, there is a crucial need for text clustering techniques that can scale to the high dimensionality of large textual datasets. In this thesis, text clustering techniques are proposed that attempt to solve the aforementioned challenges simultaneously. The first proposed technique, named “Stamantic Clustering”, is based on lexical chains (groups of semantically related words) and WordNet (a lexical database for English).
The other proposed technique, named “WEClustering”, is based on word embeddings (numerical vectors that represent words). Both techniques have been validated on sufficiently large, high-dimensional text datasets. Based on various performance metrics, both techniques have also been compared with some of the existing state-of-the-art text clustering techniques. The analysis shows that the proposed techniques are more efficient in clustering high-dimensional textual datasets. Additionally, the two proposed text clustering techniques are compared with each other on factors such as accuracy, execution, and scalability.
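
The abstract's points about dimensionality and sparsity can be illustrated with a small sketch. It is not taken from the thesis: the function names `build_term_matrix` and `sparsity`, the whitespace tokenizer, and the three toy documents are all assumptions made for illustration. It builds a plain bag-of-words matrix, in which each unique word in the corpus becomes one dimension, and measures the fraction of zero entries.

```python
from collections import Counter


def build_term_matrix(documents):
    """Build a bag-of-words count matrix (hypothetical helper, not from the thesis).

    Returns the sorted vocabulary and one row of word counts per document.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    # Every unique word in the corpus becomes one dimension.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        rows.append(row)
    return vocabulary, rows


def sparsity(rows):
    """Return the fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


# Toy corpus, invented for this example.
docs = [
    "clustering groups similar documents",
    "word embeddings represent words as vectors",
    "sparse matrices contain many zeros",
]
vocabulary, matrix = build_term_matrix(docs)
print("dimensions:", len(vocabulary))
print("sparsity:", round(sparsity(matrix), 3))
```

Even in this tiny corpus most entries are zero, because each document uses only a small part of the vocabulary. With thousands of documents the vocabulary grows much faster than the length of any single document, which is the effect the abstract describes.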

Keywords

Document clustering, High dimensionality, Unsupervised learning, Text datasets, Big datasets, Word embeddings, Lexical chains

URI

http://hdl.handle.net/10266/6186

Collections

Doctoral Theses@CSED
