Please use this identifier to cite or link to this item:
http://hdl.handle.net/10266/5140
Title: | Optical Character Recognition of Machine Printed Dogri Language Documents |
Authors: | Jindal, Khushneet |
Supervisor: | Sharma, Rajiv Kumar |
Keywords: | Image Processing;Character Recognition;Dogri;Segmentation |
Issue Date: | 3-Aug-2018 |
Abstract: | Optical character recognition (OCR) is a technology used to digitize printed historical documents, books, magazines, manuscripts, etc., in order to preserve them from deterioration. India has a number of language groups, the major one being the Indo-Aryan languages, which are spoken by most Indians. One such language is Dogri, which is written in the Devanagari script and is one of the important Indian languages used in the border areas of northern India. The present research work is an exclusive attempt at the design and development of an OCR system for the recognition of machine-printed Dogri language documents. A new dataset of Dogri language characters has been prepared, as no standard dataset for Dogri language OCR existed. The new dataset consists of around 87,000 character images collected from old books, magazines, newspapers, and synthetically generated data. Novel shape-based algorithms have been proposed for the segmentation of lines, words, characters, and modifiers in printed Dogri language documents. The proposed algorithms focus mainly on the structure of the character by retaining the header line (Shirorekha) during segmentation, which is required to minimize the loss of structural information. This helps minimize the chances of under-segmentation or over-segmentation. A segmentation accuracy of around 99.46% at the character level has been achieved using the proposed algorithms. The results showed that the proposed algorithms not only resolve the identified shortcomings that occur due to structural loss, but are also more time efficient than the other methods. Moreover, when the proposed algorithms were applied to pre-detected words from natural scene images containing Devanagari script, accuracy at the character level was found to improve by 36.34%, with 56% less processing time than the method in vogue. 
The proposed algorithms successfully segment almost all the cases in which existing algorithms failed to segment, or under- or over-segmented, the text image. For the recognition of characters, shape-oriented features were first extracted using the Discrete Cosine Transformation (DCT), Gradient, and Zernike Moments feature extraction techniques. The performance of these techniques is evaluated in terms of the attributes and length of the features. The effectiveness of the shape-based features has been analysed in the recognition stage using various combinations of classification techniques. Around 200 features have been extracted in a zig-zag manner from each image of size 32x32. Character recognition has been performed using various combinations of the extracted features and the Multilayer Perceptron Neural Network (MPNN), Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN) classification techniques. For experimentation, the dataset was partitioned in the ratio 75:25, i.e., 75% of the data was used for training and the remaining 25% for testing the classifier. The proposed character recognition system has achieved an accuracy of 98.56% to 99.10% (depending upon the classifier used), the best reported to date. The maximum character recognition accuracy of 99.10% was achieved with the combination of Gradient features and the Support Vector Machine. A dictionary-based post-processing technique has then been applied to correct the errors left by the classification stage. A corpus containing around fifty lakh (five million) words of Dogri and Hindi language text has been compiled from online books, documents, magazines, newspapers, etc. The output of the post-processor has been manually checked at the character level on five Dogri language documents, and the results were matched against the actual printed documents. Finally, chapter six presents the inferences drawn from the results of the various experiments carried out in this work. 
Some future research directions along the lines of this work are also discussed briefly. |
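The abstract mentions extracting around 200 features in a zig-zag manner from each 32x32 image. The following is a minimal sketch of one way such a step could look, assuming the zig-zag scan is applied to 2-D DCT coefficients in the JPEG-style order; it is not the thesis's implementation. The function names (`dct2`, `zigzag_indices`, `dct_zigzag_features`) and the default `count=200` are illustrative assumptions, and only the Python standard library is used.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def dct2(image):
    """2-D DCT-II of a square image, computed as C * X * C^T."""
    c = dct_matrix(len(image))
    c_t = [list(r) for r in zip(*c)]
    return matmul(matmul(c, image), c_t)


def zigzag_indices(n):
    """(row, col) pairs of an n x n grid in JPEG-style zig-zag order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def dct_zigzag_features(image, count=200):
    """First `count` DCT coefficients of `image` in zig-zag order.

    Assumption: low-frequency coefficients (scanned first) carry most of
    the shape information, so the scan is simply truncated at `count`.
    """
    coeffs = dct2(image)
    return [coeffs[r][c] for r, c in zigzag_indices(len(image))[:count]]


# Usage: a hypothetical 32x32 character image with constant intensity.
flat_image = [[1.0] * 32 for _ in range(32)]
features = dct_zigzag_features(flat_image)
```

For a constant image, all of the energy falls into the first (DC) coefficient, which makes this a convenient sanity check before feeding real character images to a classifier.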
URI: | http://hdl.handle.net/10266/5140 |
Appears in Collections: | Doctoral Theses@CSED |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
PhD_Khushneet_Jindal_951211001.pdf | | 6.67 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.