An Efficient Approach for Outlier Detection in Big Data
Abstract
Outlier detection is an important aspect of data mining that discovers unusual events
occurring in data. Big data holds a large volume of undiscovered knowledge and insight,
which raises significant challenges for knowledge discovery. In certain kinds of data, the
associations among the attributes are far more significant than the individual values
themselves. Hence, in such datasets these associations need to be extracted before outliers
are detected. The associations can be mined by analyzing the correlation among attributes.
However, it is very challenging to extract the full benefit from a large amount of complex
data. To address these issues, various methods for analyzing correlation are studied, along
with existing approaches for outlier detection based on supervised and unsupervised
learning models. In recent times, these approaches have become an indispensable
tool for detecting anomalous events in various domains.
With advances in sensor technology, a large amount of data is being generated by wireless
sensors in various application domains. This study focuses on data generated by
wireless body sensor networks. Because a caretaker may not always be available to monitor
physiological parameters, sensors are attached to the patient's body to monitor the
patient's health remotely. Outlier detection in this domain identifies
anomalous activity from the sensor measurements and distinguishes a sensor fault
from a true medical condition.
This thesis presents research on outlier detection in wireless body area
sensor networks. The key objective of the research is to explore the benefits of using
the distributed MapReduce framework for outlier detection. An approach is proposed to detect
outliers based on the assumption that data attributes are linearly related to each other.
However, in real application scenarios none of the sensors exhibits a truly
linear relationship. Hence, to deal with the non-linear aspect of the data, the proposed
approach is further enhanced so that it can detect outliers in datasets whose attributes are
linearly or non-linearly correlated. The results show that both proposed approaches are
more effective than competing approaches in terms of processing time and accuracy of
outlier detection. The approaches are also tested for scalability on a multinode
Hadoop cluster of eight nodes.
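The abstract does not give the detection rule itself, so the following is only a minimal
sketch of the general idea under stated assumptions: two attributes are assumed to be
linearly related, each partition of the data emits sufficient statistics (a map step),
the statistics are summed (a reduce step), and readings whose residual from the fitted
line is unusually large are flagged. The function names, the least-squares fit, and the
median-absolute-deviation threshold are illustrative choices, not the thesis's method,
and plain Python stands in for a Hadoop job.

```python
from functools import reduce
from statistics import median

def map_partition(rows):
    """Map step: sufficient statistics (n, Sx, Sy, Sxx, Sxy) for one partition."""
    return (
        len(rows),
        sum(x for x, _ in rows),
        sum(y for _, y in rows),
        sum(x * x for x, _ in rows),
        sum(x * y for x, y in rows),
    )

def combine(a, b):
    """Reduce step: sufficient statistics add element-wise across partitions."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Ordinary least-squares slope and intercept from the combined statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def flag_outliers(rows, slope, intercept, k=3.0):
    """Flag rows whose residual lies more than k robust deviations from the median."""
    residuals = [y - (slope * x + intercept) for x, y in rows]
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad  # MAD scaled to match a standard deviation under normality
    return [row for row, r in zip(rows, residuals) if abs(r - center) > k * scale]

# Hypothetical readings: y is roughly 2x + 1, with one corrupted value at x = 10.
rows = [(float(i), 2.0 * i + 1.0 + (0.1 if i % 2 == 0 else -0.1)) for i in range(20)]
rows[10] = (10.0, 60.0)

# Split into four partitions to mimic four worker nodes.
partitions = [rows[i:i + 5] for i in range(0, len(rows), 5)]
stats = reduce(combine, (map_partition(p) for p in partitions))
slope, intercept = fit_line(stats)
flagged = flag_outliers(rows, slope, intercept)
print(flagged)
```

Because the sufficient statistics add across partitions, the fit computed from the
combined statistics is the same as a fit over the whole dataset, which is what makes this
kind of computation a natural match for MapReduce. Note that an ordinary least-squares fit
is itself pulled toward outliers, so a robust fit may be preferable in practice.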
Furthermore, an integrated framework for outlier detection is proposed that is based on data
compression, data clustering, and cluster refinement. The clustering algorithm in the
proposed framework works on the principle of the clonal selection algorithm and uses the
objective function of fuzzy clustering. The clusters formed by the proposed clustering
algorithm have more optimal structures than those of state-of-the-art clustering algorithms.
The formed clusters are further refined using a cluster refinement algorithm to increase the
accuracy of outlier detection. The results show that the proposed framework outperforms
competing algorithms in terms of processing time and detection rate.
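To illustrate how a clonal selection search can be driven by a fuzzy clustering objective,
the sketch below assumes the standard fuzzy c-means objective (the sum over points and
centers of membership to the power m times squared distance) and a deliberately simplified
clone, mutate, and select step. The abstract does not specify the objective's exact form
or the mutation scheme, so every name and parameter here is hypothetical.

```python
import random

def sq_dist(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def memberships(points, centers, m=2.0):
    """Fuzzy c-means memberships: each row sums to 1 across the centers."""
    power = 1.0 / (m - 1.0)
    table = []
    for p in points:
        d2 = [sq_dist(p, c) for c in centers]
        if 0.0 in d2:
            # The point coincides with a center: give it full membership there.
            hit = d2.index(0.0)
            table.append([1.0 if j == hit else 0.0 for j in range(len(centers))])
            continue
        inv = [(1.0 / d) ** power for d in d2]
        total = sum(inv)
        table.append([v / total for v in inv])
    return table

def objective(points, centers, m=2.0):
    """Fuzzy c-means objective: sum of u_ij^m * squared distance (lower is better)."""
    u = memberships(points, centers, m)
    return sum(
        u[i][j] ** m * sq_dist(p, c)
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )

def clonal_step(points, centers, rng, n_clones=20, sigma=0.5):
    """Clone the candidate, mutate each clone with Gaussian noise, keep the best.

    The parent stays in the pool, so the objective never gets worse.
    """
    candidates = [centers]
    for _ in range(n_clones):
        candidates.append(
            [tuple(x + rng.gauss(0.0, sigma) for x in c) for c in centers]
        )
    return min(candidates, key=lambda cs: objective(points, cs))

# Hypothetical data: two tight groups of points, and a poor initial guess.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(5.0, 0.0), (5.0, 2.0)]
rng = random.Random(0)  # fixed seed so the run is repeatable

start = objective(points, centers)
for _ in range(30):
    centers = clonal_step(points, centers, rng)
end = objective(points, centers)
print(start, end)
```

Keeping the parent among the candidates is one simple way to guarantee that the search
never moves to a worse clustering; a fuller clonal selection algorithm would typically
maintain a population of candidates and scale the mutation rate by fitness.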
The results suggest that utilizing the correlation between attributes helps detect distinctive
and significant events, which supports accurate classification of events and reduces false
alarms, which in turn aids better utilization of resources.
