|Degraded Text Recognition of Gurmukhi Script
Lehal, G. S.
|OCR, Degraded Text, Touching Characters, Overlapping lines, Heavy Printed Characters, Character Segmentation
|Character recognition is an important subject in the field of Document Analysis and Recognition (DAR). It can be performed on printed or handwritten text, and printed text may come from good quality or degraded documents. Degradations occur in almost every script of the world. Those commonly found in printed scripts include touching characters, broken characters, heavily printed (self-touching) characters, faxed documents, typewritten documents and documents in which text from the reverse side shows through. Touching characters commonly occur in all documents with these kinds of degradation, so an Optical Character Recognition (OCR) system for degraded text must cope with them. Researchers working on recognition of good quality printed text in different scripts around the world have reported a drastic decrease in recognition accuracy when touching characters are present. Research and experiments have shown that the performance breakdown of commercial document recognition systems under real application conditions is caused mainly by the difficulty of dealing with touching characters, which are abundant in documents as a result of degradation. Touching characters make it difficult to segment character images correctly for individual classification and therefore pose severe difficulty for conventional document recognition systems, which depend critically on character segmentation. Heavily printed characters also decrease recognition accuracy, and a document containing touching characters generally contains heavily printed characters as well. The objective of this work is to seek new approaches to the recognition of degraded printed Gurmukhi script documents containing touching characters and heavily printed characters.
OCR algorithms can achieve good recognition rates (near 99%) on images with little degradation. However, recognition rates drop to 70% or even lower when image degradations are present. A typical page of text has more than 2000 characters, so an error rate of 30% results in more than 600 mistakes per page. Before the mistakes can be corrected they must be located, which makes the correction process even more tedious. Currently, no software is available for OCR of degraded printed Gurmukhi script in particular, or of other degraded printed Indian language scripts in general. There is a dire need for OCR systems for Indian language scripts, as people working with these scripts are denied the opportunity to convert scanned images of degraded machine-printed or handwritten text into a computer-processable format. This work is the first attempt towards the development of an OCR system for recognising degraded documents of printed Gurmukhi script. It can lead towards the development of OCR systems for other Indian language scripts, such as Devanagari and Bangla, that are structurally similar to Gurmukhi.
This thesis is divided into seven chapters. The first chapter introduces the process of OCR and its phases: pre-processing, segmentation, feature extraction, classification and post-processing. It discusses the problems that degraded text causes for text recognition in general and in Gurmukhi script in particular, and the need for an OCR system that recognises degraded printed documents of Indic scripts containing touching and heavily printed characters. The second chapter presents a comprehensive review of the literature on methods for segmenting machine-printed scripts and degraded printed scripts, and on methods for feature extraction and classification.
A detailed survey of Indian script recognition systems has also been carried out, and the work done by various researchers on recognising degraded text is discussed.
The third chapter starts with a study of the importance of degradation models used for recognising degraded data. It discusses the properties of Gurmukhi script and other Indian scripts and presents the various kinds of degradation found in degraded printed Gurmukhi script. The problems associated with recognising printed Gurmukhi documents containing touching characters, broken characters, heavily printed characters, faxed data, typewritten data and text showing through from the reverse side are also discussed. For each kind of degradation, the chapter discusses the reason for its occurrence, a comparison with the corresponding degradation in Roman script, and some possible solutions.
Chapter four consists of algorithms proposed for segmenting touching characters in degraded printed Gurmukhi script. The first algorithm segments horizontally overlapping lines and associates broken components of a line (small strips) with their respective lines. Various types of strips have been identified in good quality as well as degraded printed Gurmukhi documents, together with the percentage of occurrence of each type of strip in the document database. A modified version of the algorithm has been proposed for segmenting horizontally overlapping lines of text in multiple sizes. The proposed algorithm has also been tested successfully on other Indian scripts, such as Devanagari, Bangla, Gujarati, Kannada, Tamil, Telugu and Malayalam, for segmenting horizontally overlapping lines and associating broken components of a line with their respective lines. Segmentation accuracy of 95% to 99.7% has been achieved with these algorithms for the various scripts.
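The idea of associating small strips with their respective lines can be illustrated with a minimal sketch. It is not the algorithm proposed in the thesis: it assumes a binary page image stored as a list of rows (1 for a black pixel), treats every run of inked rows as a strip, and attaches any strip shorter than a chosen height to the nearest full-height strip. It does not split lines that overlap with no blank row between them. The names `find_strips`, `attach_small_strips` and `min_height` are hypothetical.

```python
from typing import List, Tuple

Strip = Tuple[int, int]  # (first_row, last_row), both inclusive


def find_strips(image: List[List[int]]) -> List[Strip]:
    """Return maximal runs of consecutive rows that contain at least one black pixel."""
    strips: List[Strip] = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new strip begins
        elif not has_ink and start is not None:
            strips.append((start, y - 1))  # the strip ended on the previous row
            start = None
    if start is not None:                  # strip that runs to the bottom edge
        strips.append((start, len(image) - 1))
    return strips


def gap(strip: Strip, line: List[int]) -> int:
    """Vertical distance in rows between a strip and a non-overlapping line."""
    top, bottom = strip
    if bottom < line[0]:
        return line[0] - bottom            # strip lies above the line
    return top - line[1]                   # strip lies below the line


def attach_small_strips(strips: List[Strip], min_height: int) -> List[Strip]:
    """Merge each strip shorter than min_height into the nearest taller strip.

    Assumption for illustration only: a short strip is a detached part of
    a text line (for example an upper-zone or lower-zone component) and
    belongs to whichever full-height strip is vertically closest.
    """
    lines = [list(s) for s in strips if s[1] - s[0] + 1 >= min_height]
    if not lines:
        return list(strips)
    for strip in strips:
        if strip[1] - strip[0] + 1 >= min_height:
            continue
        nearest = min(lines, key=lambda line: gap(strip, line))
        nearest[0] = min(nearest[0], strip[0])
        nearest[1] = max(nearest[1], strip[1])
    return [(line[0], line[1]) for line in lines]
```

In practice the height threshold would be derived from the document itself, for example from the typical strip height on the page; that choice is left open here.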
Further, different categories of touching characters in all three zones (upper, middle and lower) of degraded printed Gurmukhi script have been identified on the basis of the structural properties of the script. A second algorithm segments touching characters in the upper zone with an accuracy of 92%. It is based on structural properties, such as concavity and convexity, of the sub-symbols (connected components of a character) in the upper zone. This algorithm successfully segments highly touching characters and also separates small characters such as bindī from other characters. A further algorithm has been developed for segmenting touching characters in the middle zone; it segments touching sub-symbols with 91% accuracy. A solution has also been proposed for segmenting touching sub-symbols in the lower zone. These are new algorithms that we have proposed for segmenting degraded text in Gurmukhi script. It is also shown that they can be adopted for segmentation of degraded text in Devanagari, Bangla and other Indic scripts.
Chapter five discusses the structural and statistical features extracted from the segmented characters of degraded printed Gurmukhi script. The structural features used include the presence of a sidebar, presence of a half sidebar, presence of a headline, number of junctions with the headline, number of junctions with the baseline, aspect ratio, left and right profile direction codes, top and bottom profile direction codes, and transition features. Another useful structural feature, Directional Distance Distribution (DDD), has also been used; it is based on the distance to the nearest black (or white) pixel in each of eight directions for every white (or black) pixel in the input binary array. We also report the detection accuracy of each structural feature. Additional assumptions have been proposed to improve the detection accuracy.
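The directional distance idea behind DDD can be sketched as follows. This is a simplified illustration, not the feature extractor used in the thesis: it computes, for a single pixel of a binary array (1 for black, 0 for white), the distance to the nearest pixel of the opposite colour in each of eight directions. Recording 0 when no such pixel is found before the array edge is an assumption made here, and the name `directional_distances` is hypothetical.

```python
from typing import List

# (dy, dx) steps for the eight directions: E, NE, N, NW, W, SW, S, SE
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]


def directional_distances(image: List[List[int]], y: int, x: int) -> List[int]:
    """Distances from pixel (y, x) to the nearest opposite-colour pixel.

    Returns one value per entry in DIRECTIONS. A value of 0 means that no
    opposite-colour pixel was met before leaving the array (an assumption
    of this sketch).
    """
    height, width = len(image), len(image[0])
    target = 1 - image[y][x]  # black pixels look for white and vice versa
    distances = []
    for dy, dx in DIRECTIONS:
        found = 0
        cy, cx, steps = y + dy, x + dx, 1
        while 0 <= cy < height and 0 <= cx < width:
            if image[cy][cx] == target:
                found = steps
                break
            cy, cx, steps = cy + dy, cx + dx, steps + 1
        distances.append(found)
    return distances
```

A full feature vector would aggregate these per-pixel distances over the character image, for example by averaging them within a grid of sub-regions; the aggregation scheme is not specified in this abstract and is left out of the sketch.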
Statistical features, including zoning, Zernike moments and Orthogonal Fourier-Mellin (OFM) moments, have also been used. A detailed performance analysis of the various options for each structural and statistical feature has been carried out for all three zones.
Chapter six discusses the classifiers used for recognition of text. We have developed a corpus for degraded printed Gurmukhi script OCR: documents from various sources, such as newspapers, old and new books, magazines printed on low quality paper, computer printouts, faxed documents and typewritten documents, were collected and scanned and then used for training and testing. Commonly used classifiers such as k-Nearest Neighbor (k-NN), Support Vector Machines (SVM) and Artificial Neural Networks (ANN) have been used for recognition. The k-NN and SVM classifiers were implemented in MATLAB 7.2, and the ANN in NeuNet Pro 2.3. We have obtained an accuracy of 92.54% in recognising degraded printed Gurmukhi script characters using the SVM classifier.
Finally, chapter seven presents the inferences drawn from the results of the experiments conducted in this thesis and briefly discusses some pointers to future research on the topic.
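As an illustration of how a zoning feature can feed a k-NN classifier, the following minimal sketch may help. It is a Python sketch under stated assumptions, not the MATLAB implementation used in the thesis: zoning is taken to mean the fraction of black pixels in each cell of a fixed grid, and k-NN uses squared Euclidean distance with a majority vote. The function names, grid size and labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def zoning_features(image: List[List[int]], rows: int, cols: int) -> List[float]:
    """Fraction of black pixels (value 1) in each cell of a rows x cols grid."""
    height, width = len(image), len(image[0])
    features = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            area = (y1 - y0) * (x1 - x0)
            ink = sum(image[y][x] for y in range(y0, y1) for x in range(x0, x1))
            features.append(ink / area if area else 0.0)
    return features


def knn_classify(sample: Sequence[float],
                 training: Sequence[Tuple[Sequence[float], str]],
                 k: int = 3) -> str:
    """Label of the majority among the k training vectors closest to sample."""
    def squared_distance(item: Tuple[Sequence[float], str]) -> float:
        return sum((a - b) ** 2 for a, b in zip(sample, item[0]))

    nearest = sorted(training, key=squared_distance)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A character image would first be reduced to a feature vector with `zoning_features` and then passed to `knn_classify` together with labelled training vectors built the same way. The grid size and the value of k are tuning choices that this sketch does not take from the thesis.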