Optical Character Recognition of Machine Printed Dogri Language Documents
Abstract
Optical character recognition (OCR) is a technology used for the digitization of
printed historical documents, books, magazines, manuscripts etc., in order to preserve
them from deterioration. In India, there are a number of language groups, the major one being the Indo-Aryan languages, spoken by a majority of Indians. One such language is Dogri, which is written in the Devanagari script and is an important Indian language spoken in the border areas of northern India. The present research work is an exclusive attempt at the design and development of an OCR system for the recognition of machine-printed Dogri language documents. A new dataset of Dogri language characters has been prepared, as no standard dataset for Dogri language OCR existed. The new dataset consists of around 87,000 character images collected from old books, magazines and newspapers, together with synthetically generated data. Novel shape-based algorithms have been proposed for the segmentation of lines, words, characters and modifiers in printed Dogri language documents. The proposed algorithms focus on the structure of the characters by retaining the header line (Shirorekha) during segmentation, which is essential to minimize the loss of structural information and reduces the chances of under-segmentation or over-segmentation.
A segmentation accuracy of around 99.46% at the character level has been achieved using the proposed algorithms. The results show that the proposed algorithms not only resolve the identified shortcomings arising from structural loss, but are also more time efficient than other methods. Moreover, when the proposed algorithms were applied to pre-detected words in Devanagari-script natural scene images, an accuracy enhancement of 36.34% at the character level was found, with 56% less processing time than the method in vogue. The proposed algorithms successfully segment almost all the cases where existing algorithms fail to segment, or under- or over-segment, the text image.
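As an illustration of the shape-based idea, a minimal projection-profile sketch for line and word segmentation of a binarized page is given below. This is only an illustration of the general approach; the thesis's actual algorithms, thresholds and Shirorekha handling are more elaborate and are not reproduced here.

import numpy as np

def segment_lines(page, min_gap=2):
    """Split a binary page image (foreground = 1) into text-line images
    using the horizontal projection profile (row-wise ink counts)."""
    profile = page.sum(axis=1)
    rows = np.where(profile > 0)[0]
    if rows.size == 0:
        return []
    lines, start = [], rows[0]
    for prev, cur in zip(rows, rows[1:]):
        if cur - prev > min_gap:               # blank band => line boundary
            lines.append(page[start:prev + 1])
            start = cur
    lines.append(page[start:rows[-1] + 1])
    return lines

def segment_words(line, min_gap=3):
    """Split a text-line image into word images using the vertical projection
    profile. Each word keeps its Shirorekha (header line) intact so that the
    character structure above and below it is not lost."""
    profile = line.sum(axis=0)
    cols = np.where(profile > 0)[0]
    if cols.size == 0:
        return []
    words, start = [], cols[0]
    for prev, cur in zip(cols, cols[1:]):
        if cur - prev > min_gap:               # wide blank column run => word gap
            words.append(line[:, start:prev + 1])
            start = cur
    words.append(line[:, start:cols[-1] + 1])
    return words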
For the recognition of characters, shape-oriented features have initially been extracted using Discrete Cosine Transform (DCT), Gradient and Zernike Moments feature extraction techniques. The performance of these techniques has been evaluated in terms of the attributes and length of the feature vectors, and the effectiveness of the shape-based features has been analysed at the recognition stage using various combinations of classification techniques. Around 200 features have been extracted in a zig-zag manner from each image of size 32x32.
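The zig-zag extraction of DCT coefficients from a 32x32 character image can be sketched as follows. The normalisation, the coefficient ordering within anti-diagonals and the exact feature length are assumptions made for illustration and may differ from the thesis's implementation.

import numpy as np
from scipy.fft import dct

def zigzag_indices(n):
    """Return (row, col) pairs of an n x n grid in zig-zag (JPEG-style) order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def dct_zigzag_features(char_img, num_features=200):
    """Extract the first num_features 2-D DCT coefficients of a character
    image, read in zig-zag order (low frequencies first)."""
    img = char_img.astype(np.float64)
    # Separable 2-D DCT: transform along columns, then along rows.
    coeffs = dct(dct(img, axis=0, norm='ortho'), axis=1, norm='ortho')
    order = zigzag_indices(img.shape[0])[:num_features]
    return np.array([coeffs[r, c] for r, c in order])

# Example: a 200-dimensional feature vector for a binarized 32x32 glyph.
glyph = np.random.randint(0, 2, (32, 32))
features = dct_zigzag_features(glyph)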
Character recognition has been performed using various combinations of the extracted features with Multilayer Perceptron Neural Network (MPNN), Support Vector Machine (SVM) and k-Nearest Neighbours (k-NN) classifiers. For experimentation, the dataset was partitioned in the ratio 75:25, i.e. 75% of the data was used for training and the remaining 25% for testing the classifier. The proposed character recognition system achieves an accuracy of 98.56% to 99.10% (depending upon the classifier used), the best reported to date. The maximum character recognition accuracy of 99.10% was achieved with the combination of Gradient features and the Support Vector Machine.
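A minimal sketch of the 75:25 experimental protocol with an SVM classifier is shown below, using scikit-learn for illustration. The toolkit, the kernel choice and the file names are assumptions, not details stated in this work.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = np.load('dogri_gradient_features.npy')   # hypothetical file: one feature row per character image
y = np.load('dogri_labels.npy')               # hypothetical file: one class label per image

# 75:25 train/test split, as described in the experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = SVC(kernel='rbf')                       # kernel choice is an assumption
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Character recognition accuracy:', accuracy_score(y_test, y_pred))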
A dictionary-based post-processing technique has then been applied for the correction of errors left by the classification stage. A corpus containing around five million (fifty lakh) words of Dogri and Hindi language text has been compiled from online books, documents, magazines, newspapers, etc. The output of the post-processor has been manually checked at the character level on five Dogri language documents, and the results were matched against the actual printed documents.
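A minimal sketch of such dictionary-based correction is given below, assuming the corpus has been reduced to one word per line in a UTF-8 file (the file name is hypothetical). The similarity measure used here, difflib close matching, is an assumption standing in for whatever matching strategy the post-processor actually employs.

import difflib

with open('dogri_hindi_corpus.txt', encoding='utf-8') as f:
    dictionary = set(line.strip() for line in f if line.strip())

def correct_word(word, dictionary, cutoff=0.8):
    """Return the word unchanged if it is in the dictionary, otherwise
    replace it with the closest dictionary word above the cutoff."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def post_process(ocr_words, dictionary):
    """Apply word-level correction to the OCR output token sequence."""
    return [correct_word(w, dictionary) for w in ocr_words]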
Finally, Chapter Six presents the inferences drawn from the results of the various experiments carried out in this work, and briefly discusses some future research directions along the lines of this work.
