Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/6430
Title: Machine Learning based approaches for the Gene-Based Diagnosis of Parkinson’s Disease
Authors: Arora, Priya
Supervisor: Mishra, Ashutosh
Malhi, Avleen
Keywords: Machine Learning;Parkinson Disease;Gene Identification;Physicochemical Properties;Neural Network
Issue Date: 2-Mar-2023
Abstract: Identifying disease genes in the human genome is a significant and essential problem in biomedical research. Although several publications have used machine learning methods to find new disease genes, the task remains difficult because of factors such as the pleiotropy of genes, the limited number of confirmed disease genes in the entire genome, and the genetic heterogeneity of diseases. Recent approaches have applied the concept of ‘guilt by association’ to investigate the association between a disease phenotype and its causative genes: candidate genes with characteristics similar to those of known disease genes are more likely to be associated with disease. However, because only a small number of genes in the human genome have been experimentally proven to be linked to disease, semi-supervised approaches such as positive-unlabeled learning and label propagation are used to find candidate disease genes by also training on unknown genes. This is usually the case when there are a small number of confirmed disease genes (labeled data) and a large number of unknown genome regions (unlabeled data). The performance of disease gene identification models is limited by the potential bias of single learning models and by incomplete or noisy biological data sources; therefore, ensemble learning models are applied to protein sequences to obtain better predictive performance. In this work, various machine learning classifiers are analysed and a feature extraction method is proposed to choose a more relevant feature set for analysis. An ideal multilevel voting model is proposed, which integrates various ML models based on their false positive rates to obtain a new voting classifier for better prediction. The developed model helps to resolve the trade-off between accuracy and efficiency. Deep learning based methods have also been designed using the Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM) for Parkinson’s disease (PD) gene identification.
A comparative study with existing systems shows the effectiveness of the proposed approaches. Further, disease gene identification is a positive-unlabeled problem. A positive-unlabeled approach has recently been put forth to develop a classification model in which known genes are treated as the positive training set P and unknown genes are treated as the unlabeled set U (instead of a negative set N), because the unknown genes contain unidentified disease genes. Twelve physicochemical properties of amino acids are applied to generate features with the Geary autocorrelation, normalized Moreau-Broto autocorrelation and Moran autocorrelation representation methods. The protein sequences, based on previous knowledge, are adopted to extract features. t-SNE is then applied to extract relevant features. On the positive-unlabeled data, a novel n-semble method is proposed, which trains a neural network in a special way and integrates three classification methods based on their F-score to ensemble the predictions for more accurate predictive analysis. The physicochemical properties of amino acids are found to be highly beneficial in extracting features. Compared with previous methods on unbalanced datasets, the F-score is improved with the proposed n-semble method. The GA representation method achieves a higher success rate than the other representation methods. Experiments were conducted to identify novel disease genes from the entire unlabeled gene set using the n-semble algorithm. As a case study, we selected the Parkinson’s disease category and found that several of the identified genes are linked to Parkinson’s disease according to a literature survey.
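The three autocorrelation representations named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes the commonly used textbook definitions of the normalized Moreau-Broto, Moran and Geary descriptors, and it uses the Kyte-Doolittle hydropathy scale as a stand-in for one of the twelve physicochemical properties, which the abstract does not list. The function names, the example sequence and the maximum lag are hypothetical, and the usual step of standardizing the property scale is omitted for brevity.

```python
from statistics import fmean

# Kyte-Doolittle hydropathy values, used here only as an example property scale
# (assumption: the abstract does not name the twelve properties it uses).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalized_moreau_broto(values, lag):
    """Average product of property values at positions i and i + lag."""
    n = len(values)
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def moran(values, lag):
    """Lagged covariance around the sequence mean, divided by the variance."""
    n = len(values)
    mean = fmean(values)
    dev = [v - mean for v in values]
    numerator = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / (n - lag)
    denominator = sum(d * d for d in dev) / n
    return numerator / denominator


def geary(values, lag):
    """Mean squared difference at the given lag, scaled by the sample variance."""
    n = len(values)
    mean = fmean(values)
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator


def autocorrelation_features(sequence, scale=KYTE_DOOLITTLE, max_lag=3):
    """Return [NMB, Moran, Geary] for each lag 1..max_lag as one flat list."""
    values = [scale[residue] for residue in sequence]
    if len(values) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        features.append(normalized_moreau_broto(values, lag))
        features.append(moran(values, lag))
        features.append(geary(values, lag))
    return features


if __name__ == "__main__":
    # Hypothetical peptide, chosen only to exercise the functions.
    print(autocorrelation_features("MKVLAAGIVG"))
```

With one property and a maximum lag of 3, each sequence yields 9 features. Repeating the calculation for several properties concatenates into a longer vector, which is the kind of input a dimensionality reduction step such as t-SNE would then operate on. The Moran and Geary functions divide by the variance of the property along the sequence, so a sequence made of a single repeated residue would need separate handling.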
URI: http://hdl.handle.net/10266/6430
Appears in Collections: Doctoral Theses@CSED

Files in This Item:
File: Final Thesis.pdf
Size: 4.11 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.