Ensemble Approach for Antigenic Epitopes Prediction using Physicochemical Properties

Abstract

Accurate and efficient prediction of antigenic epitopes is essential for medical applications and immunological research. It is more challenging than many other bioinformatics problems because antigenic epitopes are highly variable, while a paratope, the part of an antibody that recognizes an epitope, binds to a given epitope with high accuracy. Although continuous effort has been invested in this field, the problem remains unsolved and continues to attract the attention of researchers. To improve antigenic epitope prediction, an adaptive system needs to be constructed using machine learning techniques. An antigen is a pathogen or invader that the adaptive immune system identifies as a foreign substance. Antigens are normally structural proteins, including portions of bacterial cell membranes and the spike proteins of viruses. Epitopes are the parts of antigens that bind to helper T-cells, cytotoxic T-lymphocytes, B-cells, antibodies and antigenic molecules, depending on the type of antigen. Machine learning techniques can be used to predict antigenic epitopes, to analyze and predict diseases, to group similar genetic elements, and to find relationships or associations in biological data. Many studies on antigenic epitope prediction exist, but they have limitations, including the use of a single model, a fixed epitope length, a lack of data preprocessing, and a fixed data partitioning approach for training the models. Because of these issues, a trained model may not produce reliable and efficient predictions. A single model can be replaced with an ensemble model to predict antigenic epitopes. Ensemble learning is the process of combining more than one model to solve a given computational intelligence problem. It is generally used to enhance the predictive performance and to improve the robustness of a model.
Identification of T-cell or B-cell epitopes in the targeted antigen is the main goal in designing epitope-based vaccines, immunodiagnostic tests and antibody production. Therefore, three ensemble models have been developed to predict epitopes inducing IgG and IgA antibodies, Mycobacterium tuberculosis (M. tuberculosis) epitopes, and B-cell epitopes. First, a multilevel ensemble model has been proposed for the prediction of epitopes inducing IgG and IgA antibodies. Epitope length is important when training the model, and it is more effective to support epitopes of variable length. In this ensemble approach, seven different machine learning models are combined to predict variable-length epitopes (4 to 50-mers). Second, an ensemble model has been developed to predict T-cell M. tuberculosis epitopes. Existing servers such as NetMHC 2.2, NetMHC 2.3, NetMHC 3.0 and NetMHC 4.0 estimate the binding capacity of a peptide. Predicting whether a given peptide is an M. tuberculosis epitope or a non-epitope remains a challenge for these servers. One server, CTLpred, works in this category but is limited to peptides of 9-mers. Therefore, a direct method for classifying a peptide as an M. tuberculosis epitope or non-epitope has been proposed, which also overcomes the limitations of the above servers. The proposed method works with variable-length epitopes, including those longer than 9-mers. The proposed ensemble model combines three models and is used to predict M. tuberculosis epitopes of variable length (7 to 40-mers). The third, hybrid model has been designed using the stacked generalization ensemble technique for the prediction of linear B-cell epitopes. The goal of the stacked generalization approach is to refine the predictions of the base classifiers and to discard the worst predictions. In this ensemble model, six machine learning models are fused to predict variable-length epitopes (6 to 49-mers). The three proposed ensemble models contain different machine learning models.
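The stacked generalization idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: the two toy base classifiers, the nearest-centroid meta-learner, the function names (`fit_stack`, `predict_stack`) and the six-sample dataset are all assumptions chosen so the example runs with the Python standard library alone. The point it shows is that the meta-learner is trained on out-of-fold predictions of the base classifiers, so it can learn which base classifier to trust.

```python
from statistics import mean


class CentroidOnFeature:
    """Toy base classifier (hypothetical): nearest class mean on one feature."""

    def __init__(self, index):
        self.index = index

    def fit(self, X, y):
        self.mu0 = mean(x[self.index] for x, t in zip(X, y) if t == 0)
        self.mu1 = mean(x[self.index] for x, t in zip(X, y) if t == 1)
        return self

    def predict(self, X):
        i = self.index
        return [1 if abs(x[i] - self.mu1) < abs(x[i] - self.mu0) else 0 for x in X]


class NearestCentroid:
    """Toy meta-learner (hypothetical): nearest class centroid over base outputs."""

    def fit(self, Z, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [z for z, t in zip(Z, y) if t == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, Z):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min((0, 1), key=lambda c: dist(z, self.centroids[c])) for z in Z]


def fit_stack(base_factories, X, y, k=3):
    """Train base models and a meta-learner on their out-of-fold predictions."""
    n = len(X)
    Z = [[0] * len(base_factories) for _ in range(n)]
    for fold in range(k):
        held = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        for j, make in enumerate(base_factories):
            model = make().fit(X_train, y_train)
            for i, p in zip(held, model.predict([X[i] for i in held])):
                Z[i][j] = p  # prediction for a sample the model never saw
    meta = NearestCentroid().fit(Z, y)
    bases = [make().fit(X, y) for make in base_factories]  # refit on all data
    return bases, meta


def predict_stack(bases, meta, X):
    Z = [list(row) for row in zip(*(b.predict(X) for b in bases))]
    return meta.predict(Z)


# Made-up data: feature 0 separates the classes, feature 1 is noisy.
X = [[0.1, 0.9], [0.9, 0.2], [0.2, 0.1], [0.8, 0.8], [0.3, 0.7], [0.7, 0.3]]
y = [0, 1, 0, 1, 0, 1]
bases, meta = fit_stack(
    [lambda: CentroidOnFeature(0), lambda: CentroidOnFeature(1)], X, y
)
predictions = predict_stack(bases, meta, [[0.15, 0.6], [0.85, 0.4]])
```

In this sketch the meta-learner ends up relying mostly on the first base classifier, because its out-of-fold predictions match the labels while those of the second do not. A real system would replace the toy classifiers with the trained machine learning models and physicochemical features.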
In the training process, other models are also trained on these datasets. The objective of ensembling is to combine weak models to improve their performance, so both weak and strong models have been selected. Models that perform poorly are considered weak models, whereas strong models are those that produce accurate predictions. These models are ensembled to obtain improved and robust results. The ensembling process involves multiple trained weak and strong models, which can be combined in various ways to form the proposed ensemble model. The best-performing combination is selected as the final ensemble model for antigenic epitope prediction. A data division approach has been proposed in which data is provided to each model in such a way that the model can learn it properly. This approach enhances the predictive performance of the proposed model. Different approaches are used for feature selection. All the proposed models can predict epitopes of variable length. To check the consistency of the proposed ensemble models' predictions, repeated k-fold cross-validation has been performed. Each proposed ensemble model has been evaluated using measures such as Gini, area under the curve, accuracy, sensitivity and specificity. To assess the improvement in results, the proposed models are compared with existing systems.
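The evaluation protocol named above can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: the function names (`repeated_kfold`, `auc`, `evaluate`), the 0.5 decision threshold and the example scores are made up, and Gini is computed with the common convention Gini = 2 * AUC - 1, which the abstract does not spell out.

```python
import random


def repeated_kfold(n, k, repeats, seed=0):
    """Yield (train, test) index lists; each repeat reshuffles before splitting."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]
            held = set(test)
            train = [i for i in idx if i not in held]
            yield train, test


def auc(y_true, scores):
    """Area under the ROC curve as the chance that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, scores, threshold=0.5):
    """Return accuracy, sensitivity, specificity, AUC and Gini for binary labels."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    area = auc(y_true, scores)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate (epitopes found)
        "specificity": tn / (tn + fp),  # true negative rate (non-epitopes rejected)
        "auc": area,
        "gini": 2 * area - 1,  # assumed convention, see lead-in
    }


# Made-up labels (1 = epitope) and model scores for illustration.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
report = evaluate(y_true, scores)
splits = list(repeated_kfold(n=10, k=5, repeats=3))
```

In practice, `evaluate` would be called on the held-out fold of every split and the resulting metrics averaged, so that the spread across repeats indicates how consistent the ensemble's predictions are.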

Description

PhD Thesis
