Title: Efficient Machine Learning Techniques for Big Data Analytics
Authors: Sharma, Gaurav
Supervisor: Bawa, Seema; Rana, Prashant Singh
Keywords: Big Data; Machine Learning; Ensemble Models; Hybrid Models; Heuristic Techniques; MCDM; Apache Spark
Issue Date: 21-Oct-2021
Abstract: The world's data is increasing at a tremendous rate, and many domains are becoming data-rich. New technological trends such as the Internet of Things, cloud computing and smart devices are responsible for this unprecedented data growth. Every domain is interested in gaining valuable insights by applying knowledge discovery methods to the generated data, whether to improve overall outcomes or to achieve a scientific breakthrough. However, gaining insight from big data comes with several challenges due to its inherent properties: heterogeneous formats (structured, semi-structured and unstructured), a high growth rate and huge volume. Traditional machine learning and predictive analytics techniques face significant limitations in efficiency and accuracy when applied to big data. These limitations have opened up vast opportunities for researchers worldwide to develop efficient machine learning techniques for big data problems. No single machine learning algorithm fits all scenarios, so a large body of research is devoted to developing efficient techniques for different big data problems. Researchers use approaches such as ensemble and hybrid methods to build more accurate, efficient and reliable machine learning systems for the problem at hand. Hybrid approaches usually integrate one machine learning technique with another machine learning, heuristic, meta-heuristic or soft computing technique. Ensemble machine learning techniques, on the other hand, are built by combining various machine learning algorithms using grouping techniques such as bagging, boosting and stacking. In this thesis, hybrid and ensemble machine learning techniques are developed for big data problems in the bioinformatics, material science and particle physics domains.
In the first case study, hybrid machine learning techniques are developed to predict different types of human T-cell lymphotropic virus (HTLV) from semi-structured data comprising protein sequences of different HTLVs and non-HTLV viruses. The hybrid techniques are built by combining supervised and unsupervised machine learning algorithms with greedy search and heuristic techniques. The machine learning system developed in this case study aims to assist the current diagnostic system in detecting the HTLV-1 virus and to provide better insights into the virus by exploring the physicochemical properties of the protein sequences extracted in this work. In the second case study, multi-criteria decision making (MCDM) based machine learning techniques are developed to predict the kinematic viscosity of three commercial grades of lubricants, namely gear oil, hydraulic oil and transmission oil, deployed in heavy earth-moving vehicles. The experimental data for each lubricant category was collected by adding two different types of nano-particles at varying temperatures and particle volume fractions. Four different machine learning techniques were trained on the experimental data for each category of nano-lubricant, and their predictive efficiency was evaluated using different model evaluation parameters. In the final step, to find the best predictive model in each category, the machine learning techniques are ranked on the basis of the model evaluation parameters using an MCDM technique called Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS). In the third case study, a multilevel ensemble classifier is developed for the binary classification problem in the massive volume of data generated by particle colliders such as the Large Hadron Collider (LHC). In this work, four different supervised machine learning techniques are stacked to build the ensemble classifier. Moreover, to deal with the massive volume of data, the ensemble classifier is implemented using the popular distributed big data platform Apache Spark on the AWS cloud. The multilevel ensemble classifier's efficiency is evaluated using different model evaluation parameters, and the results are compared with existing benchmark techniques.
The results obtained in all three case studies demonstrate the efficiency of the hybrid and ensemble machine learning techniques developed for the respective problems.
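The TOPSIS ranking step mentioned for the second case study can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state the normalization scheme, the criterion weights or the evaluation parameters used, so the sketch assumes the textbook form of TOPSIS (vector normalization, Euclidean distances) with equal weights. The model names and metric values below are hypothetical placeholders, not results from the thesis.

```python
import math

def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores (0..1, higher is better) for each row.

    matrix  : one row per alternative (e.g. a trained model),
              one column per criterion (e.g. an evaluation metric)
    weights : relative importance of each criterion (rescaled to sum to 1)
    benefit : True where a larger value is better, False where smaller is better
    """
    n_crit = len(weights)
    total = sum(weights)
    w = [x / total for x in weights]

    # Step 1: vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_crit)]
         for row in matrix]

    # Step 2: ideal and anti-ideal points, taking criterion direction into account.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    # Step 3: closeness = distance to anti-ideal / (distance to ideal + distance to anti-ideal).
    scores = []
    for row in v:
        d_pos = math.sqrt(sum((x - i) ** 2 for x, i in zip(row, ideal)))
        d_neg = math.sqrt(sum((x - a) ** 2 for x, a in zip(row, anti)))
        denom = d_pos + d_neg
        scores.append(d_neg / denom if denom else 0.5)  # all rows identical: tie
    return scores

def rank(names, scores):
    """Sort alternative names from best to worst closeness score."""
    return [n for n, _ in sorted(zip(names, scores), key=lambda p: p[1], reverse=True)]

# Hypothetical example: four models scored on R^2 (higher is better)
# and RMSE, MAE (lower is better). Values are made up for illustration.
models = ["model_A", "model_B", "model_C", "model_D"]
metrics = [
    [0.95, 0.80, 0.60],
    [0.90, 1.10, 0.85],
    [0.97, 0.70, 0.55],
    [0.85, 1.40, 1.00],
]
scores = topsis(metrics, weights=[1, 1, 1], benefit=[True, False, False])
print(rank(models, scores))
```

Equal weights are used here only for simplicity; in practice the weights encode how much each evaluation parameter should count toward the final ranking, and changing them can change the order.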
Appears in Collections: Doctoral Theses@CSED

Files in This Item:
File: Final Thesis with Signature.pdf
Size: 5.66 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.