Role of Feature Selection in Data Filtering: A Comparative Analysis
Abstract
The quality of the data is one of the most important factors influencing the performance
of any classification or clustering algorithm. The attributes defining the feature space of a
given data set can often be inadequate, which makes it difficult to discover interesting
knowledge or obtain the desired output. However, even when the original attributes are individually
inadequate, it is often possible to combine such attributes in order to construct new ones
with greater predictive power. Feature selection, as a preprocessing step to machine
learning, has been very effective in reducing dimensionality, removing irrelevant data
and noise, and improving result comprehensibility. This thesis addresses the task
of feature selection for clustering and classification.
The goal of this thesis is to find the best feature subset among the given features in
order to improve the performance of classification and clustering techniques on complex,
real-world data. Partitioning a document collection into clusters of similar
documents requires both a good choice of features and a good clustering algorithm.
Feature selection is therefore an important part of automatic text
categorization, since it can change the resulting text clusters entirely.
This thesis addresses the problem of feature selection for machine learning through
various methods. The central hypothesis is that good feature sets contain features that are
highly correlated with the class, yet uncorrelated with each other. A feature evaluation
formula, based on ideas from test theory, provides an operational definition of this
hypothesis. This thesis gives a comparative study of a variety of feature selection methods
for data mining, including Information Gain (IG) and the χ² statistic (CHI), using Weka,
an open source data mining tool.
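
As a rough illustration of the two scoring methods named above, the following sketch computes Information Gain and the χ² statistic for discrete features against a class label. It is not the thesis's implementation, and it does not use Weka. The function names (`entropy`, `information_gain`, `chi_square`) and the toy data are assumptions made only for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG = H(class) - H(class | feature) for one discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value, count in Counter(feature).items():
        # Class labels of the rows where the feature takes this value.
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (count / n) * entropy(subset)
    return entropy(labels) - conditional


def chi_square(feature, labels):
    """Pearson chi-square statistic of the feature-by-class contingency table."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    class_totals = Counter(labels)
    stat = 0.0
    for f, f_n in feature_totals.items():
        for c, c_n in class_totals.items():
            expected = f_n * c_n / n  # count expected under independence
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: 'informative' mirrors the class, 'noise' is independent of it.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "informative": [1, 1, 1, 1, 0, 0, 0, 0],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
}

# Rank features from most to least informative by Information Gain.
ranking = sorted(features, key=lambda name: information_gain(features[name], labels), reverse=True)
```

Both scores rate each feature on its own against the class. They therefore capture only the "correlated with the class" half of the central hypothesis. Redundancy between features has to be assessed separately.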
