Ensemble Approach for Antigenic Epitopes Prediction using Physicochemical Properties
Abstract
Accurate and efficient prediction of antigenic epitopes is essential for medical applications and
immunological research. Predicting antigenic epitopes is challenging compared with other
bioinformatics problems because epitopes are highly variable, while a paratope (the part of an
antibody that binds to an epitope) binds a given epitope with high accuracy. Despite continuous
efforts to improve results in this field, the problem remains unsolved and continues to attract the
attention of researchers. To improve antigenic epitope prediction, an adaptive system needs to be
constructed using machine learning techniques.
A pathogen or invader that the adaptive immune system identifies as a foreign substance is known
as an antigen. Antigens are normally structural proteins, including portions of bacterial cell
membranes and the spike proteins of viruses. Epitopes are the parts of antigens that bind to
helper T-cells, cytotoxic T-lymphocytes, B-cells, antibodies and antigenic molecules, depending
on the type of antigen. Machine learning techniques can be used to predict antigenic epitopes, to
analyze and predict diseases, to group similar genetic elements, and to find relationships or
associations in biological data. Many studies have addressed antigenic epitope prediction, but they
have limitations, including the use of a single model, a fixed epitope length, a lack of data
preprocessing and a fixed data partitioning approach for training the models. Because of these
issues, the trained model may not produce reliable and efficient predictions. The single model can
be replaced with an ensemble model to predict antigenic epitopes.
Ensemble learning is the process of combining more than one model to solve a given computational
intelligence problem. It is generally used to enhance the predictive performance and the
robustness of a model. Identifying T-cell or B-cell epitopes in the target antigen is the
main goal in designing epitope-based vaccines, immunodiagnostic tests and antibody production.
Therefore, three ensemble models have been developed to predict epitopes inducing IgG and IgA
antibodies, Mycobacterium tuberculosis (M. tuberculosis) epitopes and B-cell epitopes.
A multilevel ensemble model has been proposed for the prediction of epitopes inducing IgG and
IgA antibodies. Epitope length is important when training the model, and it is more effective to
use epitopes of variable length. In this ensemble approach, seven different machine learning
models are combined to predict variable-length epitopes (4 to 50-mers).
To predict T-cell M. tuberculosis epitopes, an ensemble model has been developed. Existing servers
such as NetMHC 2.2, NetMHC 2.3, NetMHC 3.0 and NetMHC 4.0 estimate the binding capacity of a
peptide. It remains a challenge for these servers to predict whether a given peptide is an
M. tuberculosis epitope or a non-epitope. One server, CTLpred, works in this category but is
limited to peptides of 9-mers. Therefore, a direct method of predicting whether a peptide is an
M. tuberculosis epitope or a non-epitope has been proposed, which also overcomes the limitations of
the above servers. The proposed method works with variable-length epitopes, including those longer
than 9-mers. The proposed ensemble model combines three models and is used to predict
M. tuberculosis epitopes of variable length (7 to 40-mers).
The third, hybrid model has been designed using the stacked generalization ensemble technique for
the prediction of linear B-cell epitopes. The goal of the stacked generalization approach is to
refine the predictions of the base classifiers and to discard the worst predictions. In this
ensemble model, six machine learning models are fused to predict variable-length epitopes
(6 to 49-mers).
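Stacked generalization can be sketched in a few lines: base classifiers are trained on part of the data, their predictions on held-out samples become the inputs of a meta-learner, and the meta-learner decides how much to trust each base classifier. The example below is a minimal sketch, not the thesis implementation. The base learners (one-feature threshold rules), the logistic regression meta-learner, the toy data and all function names are assumptions chosen to keep the code dependency-free.

```python
import math


def fit_stump(X, y, feature):
    """Base learner: threshold one feature at the midpoint of the class means."""
    pos = [x[feature] for x, label in zip(X, y) if label == 1]
    neg = [x[feature] for x, label in zip(X, y) if label == 0]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mu_pos + mu_neg) / 2
    sign = 1.0 if mu_pos >= mu_neg else -1.0
    return lambda x: 1.0 if sign * (x[feature] - threshold) > 0 else 0.0


def fit_logistic(Z, y, epochs=500, lr=0.5):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, label in zip(Z, y):
            p = 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            for i, zi in enumerate(z):
                grad_w[i] += (p - label) * zi
            grad_b += p - label
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return lambda z: 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0


def fit_stack(X, y, n_features, k=3):
    """Train the stack; every training fold must contain both classes."""
    folds = [list(range(start, len(X), k)) for start in range(k)]
    Z = [None] * len(X)
    for held_out in folds:
        train = [i for i in range(len(X)) if i not in held_out]
        X_train, y_train = [X[i] for i in train], [y[i] for i in train]
        bases = [fit_stump(X_train, y_train, f) for f in range(n_features)]
        # Out-of-fold predictions: each sample is scored by base learners
        # that never saw it, so the meta-learner sees honest inputs.
        for i in held_out:
            Z[i] = [base(X[i]) for base in bases]
    meta = fit_logistic(Z, y)
    final_bases = [fit_stump(X, y, f) for f in range(n_features)]
    return lambda x: meta([base(x) for base in final_bases])


# Toy data (hypothetical): feature 0 is informative, feature 1 is noise.
TOY_X = [[0.1, 5], [0.2, 1], [0.3, 4], [0.9, 2], [0.8, 5],
         [0.7, 1], [0.15, 3], [0.85, 3], [0.25, 2]]
TOY_Y = [0, 0, 0, 1, 1, 1, 0, 1, 0]
predict = fit_stack(TOY_X, TOY_Y, n_features=2)
```

Training the meta-learner on out-of-fold predictions rather than on in-sample predictions is the design choice that lets it down-weight base classifiers that only look good on data they have already seen.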
The three proposed ensemble models contain different machine learning models. During training,
other models are also trained on these datasets. To meet the objective of ensembling,
i.e. combining weak models to improve their performance, we have selected both weak and
strong models. Models whose performance is poor are considered weak models, whereas
strong models are those that produce accurate predictions. We ensemble these models to
obtain improved and robust results. The ensembling process involves multiple trained weak and
strong models, which can be combined in various ways to form the proposed ensemble model. The
best-performing combination is selected as the final ensemble model for antigenic epitope prediction.
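The selection of the best-performing combination can be illustrated as a search over subsets of trained models. The sketch below assumes that each model's 0/1 predictions on a validation set are already available, that predictions are combined by majority vote and that combinations are ranked by validation accuracy. The combination rule, the ranking metric, the model names and the toy predictions are all assumptions made for illustration; the abstract does not state which were used.

```python
from itertools import combinations


def majority_vote(predictions):
    """Combine 0/1 predictions from several models; ties go to class 0."""
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*predictions)]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_combination(model_preds, y_val, min_size=2):
    """Try every subset of models and keep the best by validation accuracy.

    On equal scores the first combination found is kept, i.e. the smaller
    and alphabetically earlier one.
    """
    best_combo, best_score = None, -1.0
    names = sorted(model_preds)
    for size in range(min_size, len(names) + 1):
        for combo in combinations(names, size):
            voted = majority_vote([model_preds[name] for name in combo])
            score = accuracy(y_val, voted)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score


# Hypothetical validation labels and per-model predictions.
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
model_preds = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 1],
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],
    "model_c": [0, 0, 1, 1, 0, 1, 1, 0],
    "weak_d": [0, 1, 0, 1, 1, 0, 1, 1],
}
combo, score = best_combination(model_preds, y_val)
```

In this toy case each of the first three models makes two errors on its own, but their errors fall on different samples, so their majority vote is correct everywhere. The exhaustive search grows exponentially with the number of candidate models, so it is practical only for a small pool.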
A data division approach has been proposed in which data is provided to each model in such
a way that the model can learn from it properly. This approach enhances the predictive performance
of the proposed model. Different approaches are used for feature selection. All the proposed models
can predict epitopes of variable length. To check the consistency of the proposed ensemble models'
predictions, repeated k-fold cross-validation has been performed. Each proposed ensemble model
has been evaluated using metrics such as Gini, area under the curve, accuracy, sensitivity and
specificity. To assess the improvement in results, the proposed models are compared with existing
systems.
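The evaluation protocol described above can be sketched as follows. The code is a minimal illustration, not the thesis code: it assumes binary labels (1 for epitope, 0 for non-epitope), takes Gini to be the common definition 2 × AUC − 1, and uses hypothetical function names and default parameters.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity (both classes must be present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a
    positive sample is scored above a negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient, assuming the common definition 2 * AUC - 1."""
    return 2 * auc(y_true, scores) - 1


def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (train_indices, test_indices) for k folds, repeated with a
    fresh shuffle each time so the variability of the scores can be seen."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test
```

Reporting the mean and spread of each metric over all `k * repeats` splits, rather than over a single partition, is what makes the consistency check meaningful.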
Description
PhD Thesis
