Efficient Text Clustering Techniques for Big Datasets
Abstract
Clustering is regarded as one of the most important tools for data analysis, especially
when label information is not available. It partitions a collection of data points
into groups such that the points within each group are as similar to one another as possible. A big dataset
is, in general, characterized by several complexities, including high dimensionality. For textual datasets specifically, high dimensionality poses a great challenge for
clustering as well as for other text mining tasks. In a textual dataset, the number of unique
words across the whole corpus (set of documents) becomes the dimensionality of the
dataset. Hence, the number of dimensions can range from tens of thousands to
a few million for a dataset containing a few thousand documents. In addition, the
matrix representation of such a dataset becomes very sparse (containing a large number of
zeros).
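As a minimal illustration of the two properties described above (dimensionality equal to the vocabulary size, and a mostly-zero matrix), the following sketch builds a term-count matrix for a tiny, made-up corpus using only the Python standard library. The corpus, the whitespace tokenization, and the helper names `build_term_matrix` and `sparsity` are assumptions introduced here for illustration; they are not taken from the thesis.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "clustering groups similar documents",
    "word embeddings represent words as vectors",
    "sparse matrices contain many zeros",
]


def build_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the set of unique words across the whole corpus.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of matrix entries that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


vocabulary, matrix = build_term_matrix(docs)
print("dimensions (unique words):", len(vocabulary))
print("fraction of zero entries:", round(sparsity(matrix), 3))
```

Even in this three-document example most entries are zero, because each document uses only a small part of the shared vocabulary. In a real corpus the vocabulary grows much faster than the length of any single document, which is why practical implementations typically store such matrices in a sparse format rather than as dense lists.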
These challenges make traditional clustering techniques, such as partitioning-based,
hierarchical, and density-based methods, unsuitable for such high-dimensional and
sparse data. In some cases, they fail to produce a clustering at all. Another important challenge with textual datasets is incorporating the semantics (meaning) of the text when
forming clusters. Several semantic-based text clustering techniques have been proposed in the literature;
they consider semantics and, to some extent, attempt to reduce the high
dimensionality problem. Still, there is a pressing need for text clustering techniques
that can scale to the high dimensionality of large textual datasets.
This thesis proposes two text clustering techniques that attempt to address the aforementioned challenges simultaneously. The first is named
“Stamantic Clustering” and is based on lexical chains (groups of semantically related
words) and WordNet (a lexical database for English). The second is
named “WEClustering” and is based on word embeddings (numerical vectors that
represent words). Both techniques have been validated on sufficiently large text
datasets with high dimensionality. Using various performance metrics, both techniques are also compared with existing
state-of-the-art text clustering techniques. The analysis shows that the proposed techniques are more efficient at clustering high-dimensional textual datasets. Additionally,
the two proposed techniques are compared with each other on factors
such as accuracy, execution, and scalability.
