Intelligent Framework for Omics Data Analysis using Machine Learning
Abstract
Omics data encompasses extensive genetic information, such as genomics, proteomics, transcriptomics, and metabolomics data, generated through advanced sequencing and mass spectrometry technologies. In computational bioinformatics, machine learning techniques
are harnessed to analyze omics data. Recent advancements in omics data analysis
present a breakthrough in healthcare, enabling researchers to predict a disease
before its onset. The combination of computational technologies and omics data in
healthcare has revolutionized the way large datasets are retrieved and analyzed. This
integration enables researchers to extract valuable insights and make significant advancements in prediction for the development of targeted therapies, which ultimately
leads to improvements in human health. The substantial volume of omics data generated necessitates advanced computational methods for effective survival
prediction and disease prediction.
The aim of this research is to employ computational technologies such as machine
learning and metaheuristic methods for effective disease prediction and survival prediction of patients using omics data. First, a comprehensive review was
undertaken to explore computationally intelligent approaches for omics data analysis. It involved investigating, comparing, and categorizing diverse technologies and
tools utilized in disease prediction, survival prediction, biomarker discovery, and disease recurrence using omics data. Through this critical analysis, it became evident that
there is a significant demand for an effective framework specifically
designed for survival prediction and disease prediction using omics data. Additionally, it was noted that existing tools in the field often lack the provisions
users need to make informed choices concerning data pre-processing, feature selection,
and prediction models for omics data. This limitation underscores the need
for an accessible solution that offers researchers a wide range of options for
conducting omics data analysis. To address these gaps, the present research proposes
the OmicsML framework for omics data analysis. Further, an application is developed
using the proposed framework.
The proposed OmicsML framework for omics data analysis consists of four
phases, i.e., data acquisition, data preparation, development of learning models, and
integration. In the data acquisition phase, omics data is collected from public repositories, i.e., The Cancer Genome Atlas (TCGA), Molecular Taxonomy of Breast Cancer International Consortium (METABRIC), and National Center for Biotechnology
Information-Gene Expression Omnibus (NCBI-GEO). The data preparation phase consists of pre-processing and feature selection techniques. Data pre-processing is
performed by removal and imputation of null values, data normalization, and removal of
duplicate samples. Feature selection is done using the Artificial Bee Colony
(ABC) and ANOVA-Firefly techniques. In the development of learning models phase, a
Bayesian optimized Stacked ensemble (BSense) model and a Bayesian optimized Deep
Neural Network (BDNN) model are proposed for survival prediction and disease prediction, respectively. In the integration phase, a web application is developed using the
previous three phases of the proposed OmicsML framework for validation.
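
The pre-processing steps named above (removal and imputation of null values, normalization, and removal of duplicate samples) can be illustrated with a minimal sketch. This is not the OmicsML implementation: the function name `preprocess`, the `max_missing` threshold, mean imputation, and min-max scaling are all assumptions chosen for illustration, since the abstract does not state which variants were used.

```python
from statistics import mean


def preprocess(rows, max_missing=0.5):
    """Clean a list of samples; each sample is a list of floats, None = missing.

    Hypothetical helper: the threshold, mean imputation and min-max scaling
    are illustrative assumptions, not choices confirmed by the source.
    """
    n_features = len(rows[0])

    # 1a. Removal: drop samples whose fraction of missing values is too high.
    kept = [list(r) for r in rows
            if sum(v is None for v in r) / n_features <= max_missing]

    # 1b. Imputation: fill the remaining gaps with the per-feature mean.
    for j in range(n_features):
        observed = [r[j] for r in kept if r[j] is not None]
        fill = mean(observed) if observed else 0.0
        for r in kept:
            if r[j] is None:
                r[j] = fill

    # 2. Normalization: min-max scale every feature to the range [0, 1].
    for j in range(n_features):
        column = [r[j] for r in kept]
        low, high = min(column), max(column)
        span = high - low
        for r in kept:
            r[j] = (r[j] - low) / span if span else 0.0

    # 3. Duplicate removal: keep the first occurrence of each sample.
    seen, unique = set(), []
    for r in kept:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


samples = [
    [1.0, None, 10.0],
    [3.0, 4.0, 30.0],
    [None, None, None],   # removed: every value is missing
    [3.0, 4.0, 30.0],     # removed: duplicate of the second sample
    [2.0, 2.0, 20.0],
]
cleaned = preprocess(samples)
```

The step order follows the order in which the abstract lists the operations; a real pipeline might deduplicate first so that repeated samples do not skew the imputed means.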
The BSense model is proposed for survival prediction using Multi-layer Perceptron,
Gradient Boosting Machine, and Random Forest models. The hyperparameters of these
models are tuned using parallel Bayesian optimization, leading to improved
performance in a shorter processing time. The survival prediction pipeline is built from the
data acquisition, data preparation, and learning model phases of the proposed framework.
In data preparation, the ABC technique is applied for feature selection, and the BSense
model is then used as the learning model for survival prediction. The BSense model is validated
using several breast cancer datasets, i.e., TCGA, METABRIC, Metabolomics, and
RNA-seq. The BSense model achieves an Area Under Curve (AUC) value of 83.9% on the
TCGA dataset, 87.3% on the METABRIC dataset, 91.1% on the Metabolomics dataset,
and 80.1% on the RNA-seq dataset. Accurate
survival prediction of breast cancer using omics data complements clinical data in
supporting informed decision making. The ability of the BSense model to accurately predict
breast cancer survival will help clinicians in guiding more suitable cancer treatment. Additionally, predicted short-term survivors could be prioritized and given
an appropriate line of treatment in good time.
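
The results above are reported as AUC values. As a reminder of what that metric measures, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration only and are unrelated to the reported experiments.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model outputs, where a higher score means "more likely positive".
    Quadratic in the number of samples, which is fine for a small illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Count the positive/negative pairs that are ranked correctly; ties earn 0.5.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

On this scale an AUC of 0.839 corresponds to the 83.9% quoted for the TCGA dataset, and 0.5 corresponds to random ranking.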
The BDNN model is proposed for disease prediction using a Deep Neural Network, i.e.,
a Multi-layer Perceptron model. The hyperparameters of this model are tuned using
Bayesian optimization. The disease prediction pipeline is built from the data acquisition, data
preparation, and learning model phases of the proposed framework. In data preparation,
the ANOVA-Firefly technique is applied for feature selection, and the BDNN model is then used
as the learning model for disease prediction. The BDNN model is validated using datasets for several
diseases, i.e., Alzheimer’s, breast cancer, and COVID-19. For the Alzheimer’s
datasets, i.e., GEO:GSE33000 and GEO:GSE44770, the BDNN model gives an AUC value
of 94.9%. For the breast cancer dataset, i.e., METABRIC, the BDNN model gives an AUC
value of 98.7%. For the COVID-19 dataset, i.e., GSE157103, the BDNN model gives an AUC
value of 98.9%. The accurate performance of the BDNN model for disease
prediction can help in recommending treatment to a patient diagnosed with a disease.
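
The abstract does not describe how ANOVA and the Firefly algorithm are combined. The sketch below covers only the ANOVA part, under the assumption that it acts as a filter: it scores each feature with the one-way ANOVA F-statistic and keeps the top-ranked ones. The function names `anova_f` and `top_k_features` and the toy data are hypothetical, and the Firefly search stage is deliberately left out.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels.

    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)

    n, k = len(values), len(groups)
    grand_mean = mean(values)

    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_features(X, y, k):
    """Indices of the k features with the largest F-statistic (filter step only)."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 is noise, feature 1 separates the two classes.
X = [[5.0, 1.0], [1.0, 2.0], [4.0, 3.0], [2.0, 7.0], [6.0, 8.0], [3.0, 9.0]]
y = [0, 0, 0, 1, 1, 1]
selected = top_k_features(X, y, 1)
```

A filter of this kind is typically used to shrink the candidate set before a metaheuristic search; whether the thesis does so in this order is not stated in the abstract.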
This work makes a significant contribution by developing an omics data analysis application for the validation of the proposed framework. The OmicsML application is developed
by integrating the data acquisition, data preparation, and learning model phases of the proposed framework. The OmicsML application is deployed on a cloud server and provides
a graphical user interface that lets users auto-pick the data preparation
techniques and learning models for omics data analysis.
