Multiple Decision Techniques for RMSD prediction of Protein Structure
Abstract
Protein sequences fold into three-dimensional tertiary structures in order to carry out various biological processes. The physicochemical properties of amino acid residues and their proportions give rise to different associated forces, which in turn drive the folding of a protein sequence into its distinct tertiary structure. A large amount of protein sequence data is accumulating as the outcome of various genomic and other sequencing projects. Given this flood of sequence data, there is a vital need to develop computational approaches for predicting protein structure from amino acid sequences. The work presented in this thesis focuses on a qualitative study of protein structure using multiple decision techniques and supervised learning with six physicochemical properties. The objective is to predict a qualitative measure, i.e. the Root Mean Square Deviation (RMSD), of a protein structure in the absence of its true native state. A performance study of machine learning classification models is carried out to classify protein structures using Multiple Decision Techniques. k-fold cross-validation is used to measure the robustness of the proposed method. Predicting the RMSD of a protein structure is a critical factor in differentiating the native or native-like protein structures from the other predicted structures. Principal Component Analysis (PCA) is applied to obtain independent, uncorrelated components and to reduce the dimensionality of the feature space. PCA is useful because it extracts the relevant information from the dataset, analyzes the structure of the observations, and represents them as a new set of principal components. Seventeen classification methods belonging to different families of machine learning are used, which makes for a rigorous and minimally biased ensemble.
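For reference, the RMSD that the models try to predict can be computed directly when a native structure is available. The sketch below is illustrative only and is not taken from the thesis: it assumes both structures list the same atoms in the same order and have already been superimposed, and the function name `rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root Mean Square Deviation between two equally sized lists of 3D points.

    Assumes the two structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b (an assumption of this sketch).
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared Euclidean distances between corresponding atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy coordinates (hypothetical, not real protein data).
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
# A "predicted" model displaced by 1 unit along x for every atom.
model = [(x + 1.0, y, z) for x, y, z in native]

print(rmsd(native, model))
```

The thesis addresses the harder case in which no native structure exists, so a value like this must be estimated from physicochemical features rather than computed from coordinates.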
Further, based on several performance parameters of each classifier, the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) is used to build a single performance score for ranking the classifiers. Based on the ranking given by TOPSIS, an ensemble model is developed to predict the RMSD of the structure of a protein sequence. Despite the simplicity of the technique, the results obtained by these ensembles are found to be better than those produced by other methods. The empirical study indicates that combining the performance scores of individual classification algorithms increases performance. For this reason, eight TOPSIS-based ensembles of classification algorithms have been generated. Through intensive experimentation, it is found that the ensemble of nine classification models performs best. There are several measures for evaluating performance, and choosing the best-performing classifier (or set of classifiers) is a critical undertaking. This work also introduces a rough-set-based ensemble approach, which builds rough sets of independent, uncorrelated, and well-performing models. The results show that the proposed novel rough-set-based ensemble achieves high accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), Positive Predictive Value (PPV), Negative Predictive Value (NPV), and detection rate. The proposed model has been compared with available models and validated on the CASP 10 benchmark dataset. k-fold cross-validation has been used to check the robustness of the proposed model.
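The TOPSIS ranking step described above can be sketched as follows. This is a minimal, generic TOPSIS implementation under stated assumptions, not the thesis's actual code: the vector normalisation, the equal weights, the classifier names, and all metric values are hypothetical and chosen purely for illustration.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each row (alternative).

    matrix  : rows are alternatives (classifiers), columns are criteria (metrics)
    weights : one weight per criterion
    benefit : True where higher is better, False where lower is better
    """
    n_crit = len(weights)
    # Vector-normalise each column; fall back to 1.0 for an all-zero column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    columns = list(zip(*weighted))
    # Ideal and anti-ideal points, criterion by criterion.
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)   # distance to the ideal solution
        d_neg = math.dist(row, anti)    # distance to the anti-ideal solution
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.0)
    return scores


# Hypothetical classifiers with made-up (accuracy, AUC) values.
names = ["clf_a", "clf_b", "clf_c"]
metrics = [[0.9, 0.9], [0.8, 0.8], [0.7, 0.7]]
scores = topsis(metrics, weights=[0.5, 0.5], benefit=[True, True])

# Rank classifiers from best to worst by closeness score.
ranking = [name for _, name in sorted(zip(scores, names), reverse=True)]
print(ranking)
```

A ranking like this could then be used to pick the top-scoring classifiers for an ensemble; how the thesis weights the criteria and selects ensemble members is not stated in the abstract.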
Description
Master of Engineering - CSE
