Design of Algorithms for Gene Predictions

Maji, Srabanti

dc.contributor.author	Maji, Srabanti
dc.contributor.supervisor	Garg, Deepak
dc.date.accessioned	2013-05-03T11:22:10Z
dc.date.available	2013-05-03T11:22:10Z
dc.date.issued	2013-05-03T11:22:10Z
dc.description	Ph.D, CSED	en
dc.description.abstract	Identifying coding sequences in genomic DNA is a major step in gene identification. Splice sites mark the boundaries between exons and introns, and although the sequences adjacent to them are highly conserved, prediction accuracy remains below 90%. This work therefore pursues both a conventional approach and a Computational Intelligence (CI) approach to predicting splice sites in eukaryotic DNA sequences, and evaluates and compares their performance. In the conventional approach, the “Hidden Markov Model (HMM) System”, the model architecture includes probabilistic descriptions of the splicing, translational, and transcriptional signals. A splice site predictor based on a Unique Hidden Markov Model (HMM) is developed and trained using the Modified Expectation Maximization (MEM) algorithm. A 12-fold cross-validation technique is also applied to check the reproducibility of the results and to further increase prediction accuracy. The proposed system achieves an accuracy of 98% for true donor sites and 93% for true acceptor sites in standard DNA (nucleotide) sequences. The second proposed method, “Markov Model 2 Feature – Support Vector Machine (MM2F-SVM)”, combines conventional and computational intelligence techniques in three stages: feature extraction, using a second-order Markov Model (MM2); feature selection, using principal feature analysis (PFA); and classification, using a support vector machine (SVM) with a Gaussian kernel. The model is capable of indicating the reliability of each predicted splice site with high accuracy.
When tested on standardized sets of human genes, this method is shown to be significantly better than some existing methods: it correctly identified up to 98.31% of the true donor sites and 97.88% of the false donor sites, and 97.92% of the true acceptor sites and 96.34% of the false acceptor sites, in the test dataset. Applications of the program to identifying splice sites in newly sequenced genomic regions and to identifying alternative splice sites are also explained with appropriate examples.	en
dc.format.extent	77382419 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10266/2189
dc.language.iso	en	en
dc.subject	Bioinformatics	en
dc.subject	Gene Identification	en
dc.subject	Splice Site	en
dc.subject	Support Vector Machine	en
dc.title	Design of Algorithms for Gene Predictions	en
dc.type	Thesis	en
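
The feature-extraction stage named in the abstract (a second-order Markov model, MM2) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `train_mm2` and `mm2_features`, the position-independent model, the add-one pseudocount, the log-odds encoding, and the toy sequences are all assumptions made for illustration. The sketch estimates P(base | previous two bases) separately from true-site and false-site windows and encodes a candidate window as per-position log-odds, which a later feature-selection and SVM stage could consume.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def train_mm2(sequences, pseudocount=1.0):
    """Estimate P(x_i | x_{i-2} x_{i-1}) from training windows.

    Assumption: one position-independent model with add-one smoothing.
    Returns {two-base context: {next base: probability}}.
    """
    counts = {a + b: {c: pseudocount for c in ALPHABET}
              for a, b in product(ALPHABET, repeat=2)}
    for seq in sequences:
        seq = seq.upper()
        for i in range(2, len(seq)):
            context, base = seq[i - 2:i], seq[i]
            # Skip ambiguous symbols such as N.
            if context in counts and base in ALPHABET:
                counts[context][base] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: n / total for base, n in row.items()}
    return model


def mm2_features(seq, true_model, false_model):
    """Encode a window as per-position log-odds (true model vs. false model)."""
    seq = seq.upper()
    features = []
    for i in range(2, len(seq)):
        context, base = seq[i - 2:i], seq[i]
        if context in true_model and base in ALPHABET:
            ratio = true_model[context][base] / false_model[context][base]
            features.append(math.log(ratio))
        else:
            features.append(0.0)  # no evidence either way for ambiguous symbols
    return features


# Toy windows, invented for illustration only (not data from the thesis).
true_windows = ["AAGGTAAGT", "CAGGTGAGT"]
false_windows = ["ACGTACGTA", "TTTTTTTTT"]

true_model = train_mm2(true_windows)
false_model = train_mm2(false_windows)

site_like = mm2_features("AAGGTAAGT", true_model, false_model)
background_like = mm2_features("TTTTTTTTT", true_model, false_model)
```

A window of length n yields n - 2 features, because the first two bases serve only as context. In this toy setup the summed log-odds is positive for the window resembling the true-site examples and negative for the background-like window.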

Files

Original bundle

Name: 2189.pdf
Size: 2.99 MB
Format: Adobe Portable Document Format

Collections

Doctoral Theses@CSED