Optical Character Recognition of Machine Printed Dogri Language Documents
Abstract
Optical character recognition (OCR) is a technology used for the digitization of
printed historical documents, books, magazines, manuscripts etc., in order to preserve
them from deterioration. In India, there are a number of language groups, the major one being the Indo-Aryan languages, spoken by a majority of Indians. One such language is Dogri, which is written in the Devanagari script and is an important Indian language spoken in the border areas of northern India. The present research work is an exclusive attempt at the design and development of an OCR system for the recognition of machine-printed Dogri language documents. A new dataset of Dogri language characters has been prepared, as no standard dataset for Dogri language OCR existed. The new dataset consists of around 87,000 character images collected from old books, magazines and newspapers, together with synthetically generated data. Novel shape-based algorithms have been proposed for the segmentation of lines, words, characters and modifiers in printed Dogri language documents. The proposed algorithms focus on the structure of the characters by retaining the header line (Shirorekha) during segmentation, which is essential to minimize the loss of structural information and reduces the chances of under-segmentation or over-segmentation.
A segmentation accuracy of around 99.46% at the character level has been achieved using the proposed algorithms. The results show that the proposed algorithms not only resolve the identified shortcomings arising from structural loss, but are also more time efficient than other methods. Moreover, when the proposed algorithms were applied to pre-detected words in Devanagari-script natural scene images, an accuracy enhancement of 36.34% at the character level was found, with 56% less processing time than the method in vogue. The proposed algorithms successfully segment almost all the cases where existing algorithms fail to segment, or under- or over-segment, the text image.
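As an illustration of the shape-based idea, a minimal projection-profile sketch for line and word segmentation of a binarized page is given below. This is only an illustration of the general approach; the thesis's actual algorithms, thresholds and Shirorekha handling are more elaborate and are not reproduced here.

import numpy as np

def segment_lines(page, min_gap=2):
    """Split a binary page image (foreground = 1) into text-line images
    using the horizontal projection profile (row-wise ink counts)."""
    profile = page.sum(axis=1)
    rows = np.where(profile > 0)[0]
    if rows.size == 0:
        return []
    lines, start = [], rows[0]
    for prev, cur in zip(rows, rows[1:]):
        if cur - prev > min_gap:               # blank band => line boundary
            lines.append(page[start:prev + 1])
            start = cur
    lines.append(page[start:rows[-1] + 1])
    return lines

def segment_words(line, min_gap=3):
    """Split a text-line image into word images using the vertical projection
    profile. Each word keeps its Shirorekha (header line) intact so that the
    character structure above and below it is not lost."""
    profile = line.sum(axis=0)
    cols = np.where(profile > 0)[0]
    if cols.size == 0:
        return []
    words, start = [], cols[0]
    for prev, cur in zip(cols, cols[1:]):
        if cur - prev > min_gap:               # wide blank column run => word gap
            words.append(line[:, start:prev + 1])
            start = cur
    words.append(line[:, start:cols[-1] + 1])
    return words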
For the recognition of characters, shape-oriented features have initially been extracted using Discrete Cosine Transform (DCT), Gradient and Zernike Moments feature extraction techniques. The performance of these techniques has been evaluated in terms of the attributes and length of the feature vectors, and the effectiveness of the shape-based features has been analysed at the recognition stage using various combinations of classification techniques. Around 200 features have been extracted in a zig-zag manner from each image of size 32x32.
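The zig-zag extraction of DCT coefficients from a 32x32 character image can be sketched as follows. The normalisation, the coefficient ordering within anti-diagonals and the exact feature length are assumptions made for illustration and may differ from the thesis's implementation.

import numpy as np
from scipy.fft import dct

def zigzag_indices(n):
    """Return (row, col) pairs of an n x n grid in zig-zag (JPEG-style) order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def dct_zigzag_features(char_img, num_features=200):
    """Extract the first num_features 2-D DCT coefficients of a character
    image, read in zig-zag order (low frequencies first)."""
    img = char_img.astype(np.float64)
    # Separable 2-D DCT: transform along columns, then along rows.
    coeffs = dct(dct(img, axis=0, norm='ortho'), axis=1, norm='ortho')
    order = zigzag_indices(img.shape[0])[:num_features]
    return np.array([coeffs[r, c] for r, c in order])

# Example: a 200-dimensional feature vector for a binarized 32x32 glyph.
glyph = np.random.randint(0, 2, (32, 32))
features = dct_zigzag_features(glyph)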
Character recognition has been performed using various combinations of the extracted features with Multilayer Perceptron Neural Network (MPNN), Support Vector Machine (SVM) and k-Nearest Neighbours (k-NN) classifiers. For experimentation, the dataset was partitioned in the ratio 75:25, i.e. 75% of the data was used for training and the remaining 25% for testing the classifier. The proposed character recognition system achieves an accuracy of 98.56% to 99.10% (depending upon the classifier used), the best reported to date. The maximum character recognition accuracy of 99.10% was achieved with the combination of Gradient features and the Support Vector Machine.
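A minimal sketch of the 75:25 experimental protocol with an SVM classifier is shown below, using scikit-learn for illustration. The toolkit, the kernel choice and the file names are assumptions, not details stated in this work.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = np.load('dogri_gradient_features.npy')   # hypothetical file: one feature row per character image
y = np.load('dogri_labels.npy')               # hypothetical file: one class label per image

# 75:25 train/test split, as described in the experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = SVC(kernel='rbf')                       # kernel choice is an assumption
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Character recognition accuracy:', accuracy_score(y_test, y_pred))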
A dictionary-based post-processing technique has then been applied for the correction of errors left by the classification stage. A corpus containing around five million (fifty lakh) words of Dogri and Hindi language text has been compiled from online books, documents, magazines, newspapers, etc. The output of the post-processor has been manually checked at the character level on five Dogri language documents, and the results were matched against the actual printed documents.
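A minimal sketch of such dictionary-based correction is given below, assuming the corpus has been reduced to one word per line in a UTF-8 file (the file name is hypothetical). The similarity measure used here, difflib close matching, is an assumption standing in for whatever matching strategy the post-processor actually employs.

import difflib

with open('dogri_hindi_corpus.txt', encoding='utf-8') as f:
    dictionary = set(line.strip() for line in f if line.strip())

def correct_word(word, dictionary, cutoff=0.8):
    """Return the word unchanged if it is in the dictionary, otherwise
    replace it with the closest dictionary word above the cutoff."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def post_process(ocr_words, dictionary):
    """Apply word-level correction to the OCR output token sequence."""
    return [correct_word(w, dictionary) for w in ocr_words]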
Finally, Chapter Six presents the inferences drawn from the results of the various experiments carried out in this work, and briefly discusses some future research directions along the lines of this work.
