A Novel Framework for Analysis of Big Data

Gupta, Deepak

Files

Primary Deepak Gupta-Final-Thesis.pdf (4.49 MB)

Date

2020-10-29

Authors

Gupta, Deepak

Supervisors

Rani, Rinkle

Abstract

The world has already entered the information age. The huge growth of digital data has overwhelmed traditional systems and approaches. Big data touches almost all aspects of our lives, and data-driven discovery is an emerging paradigm for computing. The ever-growing data brings a tidal wave of opportunities and challenges in data capture, storage, manipulation, management, analysis, knowledge extraction, security, privacy, and visualization. Though the promise of big data seems genuine, a wide gap still exists between its potential and its realization. In this era of digitization, the huge amount of data being generated has resulted in an exponential growth of widespread cyber threats. Moreover, the ever-evolving threat landscape and rapidly growing network environments offer attackers additional ways to break in. This scenario has overwhelmed existing traditional solutions and rendered them too outdated to handle such attacks. To counter real-world cybersecurity challenges, security researchers are putting considerable effort into technologies stemming from areas such as big data and artificial intelligence to extract powerful insights. Malware is one of the most critical and challenging security threats on the Internet. It is growing exponentially in volume, variety, and velocity, and thus overwhelms the traditional approaches employed for malware detection and classification. Moreover, with the advent of the Internet of Things, the number of digital devices has grown hugely, and in such a scenario malicious binaries are bound to grow even faster, making malware analysis a big data problem. The main aim of this research is to explore the various tools and techniques of big data processing and analysis, and to propose a framework for analyzing big data to generate actionable insights or intelligence. A case study of malware analysis and detection is used in the research. 
The initial part of the research focuses on understanding the basic concept of big data, its evolution, and popular open-source big data stream processing frameworks. A bibliometric study of academic and industry publications from the period 2000–2017 is conducted to understand the current state, evolving disciplines, tools and techniques, and research trends of big data. The most widely used open-source big data stream processing frameworks are compared, and the major big data research challenges and directions are identified and discussed. Further, the significance of big data analytics and machine learning in cybersecurity is identified. An enormous number of malware samples is available online, but only a few researchers have attempted to analyze them thoroughly to obtain insights or threat intelligence by extracting and analyzing behavioral trends using big data frameworks. This type of trend analysis could be very useful for understanding the context and goals of security breaches. In this research, we propose a scalable architecture built on top of Apache Spark to perform a statistical analysis of malware behavioral trends during the period 2010 to 2017. These trends can be further extrapolated by security experts to generate cyber threat intelligence, which can help organizations improve their threat protection systems and reduce the risks posed by malicious binaries. To analyze and detect unknown malware on a large scale, security analysts need to use machine learning algorithms along with big data technologies. These technologies help them deal with the current threat landscape, which consists of a large and complex flux of malicious binaries. This research proposes the design of a scalable architecture using Apache Spark and its scalable machine learning library for detecting zero-day malware. 
Three machine learning algorithms, namely Naïve Bayes, support vector machine, and random forest, are used, and the experimental results show that random forest gives the best accuracy. Although many machine learning models have been used in the literature for the detection and classification of malicious binaries, the performance of ensemble learning methods has not been investigated extensively on large malware data. We have designed two methods based on ensemble learning and big data for improving the performance of malware detection at large scale. The first method is based on the weighted voting strategy of ensemble learning, and the second method selects an optimal set of base classifiers for stacking. The proposed methods are implemented using Apache Spark, and their performance is tested and evaluated. The experiments demonstrate that the proposed approach improves generalization performance in detecting new malware compared to traditional ensemble methods. The increasing complexity and sophistication of malware has led to many state-of-the-art machine learning based solutions. However, many of these solutions suffer from high false positive rates and low scalability, which restricts their wider adoption and deployment. In recent years, deep learning, a subfield of machine learning, has resurged and reported outstanding performance on many classification problems in a wide range of fields. A deep learning model is proposed for malware detection that uses Apache Spark for efficient data preprocessing and Keras, with TensorFlow as its computational engine, for implementing the model. The findings demonstrate that the four-layer deep learning model achieves the highest accuracy. The present research provides evidence-based knowledge on the application of big data tools in malware detection. 
It provides a comprehensive study of big data evolution, including batch and stream processing tools, bibliometric analysis, and research challenges. It includes a study of big data security analytics, which identifies malware detection and classification as a big data problem. Finally, the research proposes a set of scalable solutions for malware detection at large scale. These solutions are developed on top of Apache Spark and Keras with TensorFlow, and use machine learning, ensemble learning, and deep learning techniques to identify malicious binaries.
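
The abstract mentions a weighted voting strategy for ensemble learning without giving its details. As a rough illustration only, the general idea of weighted voting can be sketched in plain Python as below. This is not the thesis's method or its Apache Spark implementation: the function name `weighted_vote`, the use of per-classifier accuracy as the weight, and all numbers are assumptions made for the example.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per base classifier
    weights: one non-negative weight per base classifier (hypothetical
             choice here: each classifier's validation accuracy)
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties are broken by the label that sorts first, so results are deterministic.
    return min(totals, key=lambda label: (-totals[label], label))


# Hypothetical example: three base classifiers vote on one sample.
votes = ["malicious", "benign", "benign"]
accuracies = [0.95, 0.40, 0.45]  # made-up weights, not results from the thesis
result = weighted_vote(votes, accuracies)
```

In this made-up example a simple majority vote would choose "benign", while the weighted vote lets the single, more heavily weighted classifier outvote the two weaker ones.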

Keywords

Big Data, Malware Data Analysis, Spark, Keras, Hadoop

URI

http://hdl.handle.net/10266/6037

Collections

Doctoral Theses@CSED
