A Novel Framework for Analysis of Big Data
Abstract
The world has already entered the information age. The rapid growth of digital data has
overwhelmed traditional systems and approaches. Big data touches almost all
aspects of our lives, and data-driven discovery is an emerging paradigm for
computing. Ever-growing data brings a tidal wave of opportunities and challenges
in terms of data capture, storage, manipulation, management, analysis, knowledge
extraction, security, privacy, and visualization. Although the promise of big data appears
genuine, a wide gap still exists between its potential and its realization.
In this era of digitization, the huge amount of data being generated has resulted in
exponential growth of widespread cyber threats. Moreover, the ever-evolving threat
landscape and rapidly growing network environments offer attackers additional ways
to break in. This scenario has overwhelmed existing traditional solutions and
rendered them inadequate for handling such attacks. To counter real-world cybersecurity
challenges, security researchers are investing considerable effort in technologies stemming
from areas such as big data and artificial intelligence to extract powerful insights. Malware
is one of the most critical and challenging security threats on the Internet. It is
growing exponentially in volume, variety, and velocity, and thus overwhelms the
traditional approaches employed for malware detection and classification. Moreover, with
the advent of the Internet of Things, the number of digital devices is growing rapidly,
and in such a scenario malicious binaries are bound to grow even faster, making malware
detection a big data problem.
The main aim of this research is to explore the various tools and techniques of big data
processing and analysis, and to propose a framework for analyzing big data to generate
actionable insights or intelligence. A case study of malware analysis and detection is
used in the research. The initial part of the research focuses on understanding the
basic concept of big data, its evolution, and popular open-source big data stream
processing frameworks. A bibliometric study of academic and industry publications
during the period 2000–2017 is conducted to understand the current state, evolving
disciplines, tools and techniques, and research trends of big data. The most widely
used open-source big data stream processing frameworks are compared, and the
major big data research challenges and directions are identified and discussed.
Further, the significance of big data analytics and machine
learning in cybersecurity is identified.
An enormous number of malware samples are available online, but only a few researchers
have attempted to analyze them thoroughly to obtain insights or threat intelligence
by extracting and analyzing behavioral trends using big data frameworks. This type of
trend analysis could be very useful for understanding the context and goals of security
breaches. In this research, we propose a scalable architecture built on top of
Apache Spark to perform a statistical analysis of malware behavioral trends during
the period 2010 to 2017. These trends can be further extrapolated by security experts to
generate cyber threat intelligence, which can help organizations improve their threat
protection systems and reduce the risks posed by malicious binaries.
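As a rough illustration of what a behavioral trend statistic might look like, the sketch below counts, per year, the samples that exhibit a given behavior. It is a single-machine stand-in written in plain Python, not the Spark architecture proposed in this research; the record layout, the behavior names, and the function `yearly_trend` are all hypothetical, and the sample records are toy values rather than data from the study.

```python
from collections import Counter

def yearly_trend(records, behavior):
    """Count, for each year, the samples that exhibit `behavior`.

    `records` is an iterable of (year, set_of_behaviors) pairs.
    This layout is an assumption made for illustration only.
    """
    counts = Counter(
        year for year, behaviors in records if behavior in behaviors
    )
    # Return the years in ascending order so the trend reads chronologically.
    return dict(sorted(counts.items()))

# Toy records with made-up behavior labels (not real analysis output).
records = [
    (2010, {"registry_write"}),
    (2010, {"registry_write", "network_beacon"}),
    (2011, {"network_beacon"}),
    (2011, {"file_drop"}),
]

trend = yearly_trend(records, "network_beacon")
print(trend)
```

In a distributed setting the same group-and-count step would be expressed with the cluster framework's own aggregation operators instead of an in-memory `Counter`.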
To analyze and detect unknown malware on a large scale, security analysts need
to use machine learning algorithms along with big data technologies. These
technologies help them deal with the current threat landscape, which consists of a
large and complex flux of malicious binaries. This research proposes the design of a scalable
architecture using Apache Spark and its scalable machine learning library for detecting
zero-day malware. Three machine learning algorithms, namely Naïve Bayes, support
vector machine, and random forest, are used, and the experimental results show that random
forest gives the best accuracy.
Although many machine learning models have been used for the detection and classification
of malicious binaries in the literature, the performance of ensemble learning
methods has not been investigated extensively on large malware data. We have designed
two methods based on ensemble learning and big data for improving the performance of
malware detection at large scale. The first method is based on the weighted voting strategy
of ensemble learning, and the second selects an optimal set of base
classifiers for stacking. The proposed methods are implemented using Apache
Spark, and their performance is tested and evaluated. The experiments demonstrate that
the proposed approach improves generalization performance in detecting new
malware compared to traditional ensemble methods.
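The general idea of weighted voting can be sketched in a few lines of plain Python: each base classifier casts a vote for a label, votes are scaled by a per-classifier weight, and the label with the largest total wins. This is a generic illustration only, not the weighting scheme designed in this research; the function `weighted_vote` and the weights shown are hypothetical placeholders, not experimental results.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the most weight.

    `predictions` holds one label per base classifier and `weights`
    holds one non-negative weight per classifier, in the same order.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight per prediction is required")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On a tie, max() returns the label that received a vote first.
    return max(totals, key=totals.get)

# Placeholder weights for three base classifiers (illustrative values only).
weights = [0.70, 0.80, 0.95]
votes = ["benign", "benign", "malware"]

# "benign" totals 1.50 against 0.95 for "malware", so "benign" wins.
print(weighted_vote(votes, weights))
```

A common choice, assumed here rather than taken from the source, is to derive each weight from the classifier's accuracy on a held-out validation set.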
The increasing complexity and sophistication of malware have led to many state-of-the-art
machine learning based solutions. However, many of these solutions suffer from high false
positive rates and low scalability, restricting their wider adoption and deployment. In recent
years, deep learning, a subfield of machine learning, has resurged and reported outstanding
performance in tackling many classification problems in a wide range of fields. A deep
learning model is proposed for malware detection that uses Apache Spark for
efficient data preprocessing and Keras with TensorFlow as its computational engine for
implementing the model. The findings demonstrate that the four-layer deep
learning model achieves the highest accuracy.
The present research provides evidence-based knowledge pertaining to the application of
big data tools in malware detection. It provides a comprehensive study of big data
evolution, including batch and stream processing tools, bibliometric analysis, and research
challenges. It includes a study of big data security analytics, which identifies malware
detection and classification as a big data problem. Finally, the research proposes a set of
scalable solutions for malware detection at large scale. These solutions are developed on
top of Apache Spark and Keras along with TensorFlow, and use machine learning,
ensemble learning, and deep learning techniques to identify malicious binaries.
