Speech Recognition of Punjabi Numerals Using Convolutional Neural Networks (CNNs)
Abstract
Speech is one of the most natural ways for humans to interact and express themselves, and
it is among the most convenient forms of input to a system. As technology advances, more
and more of the objects that surround us are becoming automated, which suggests that in
the near future many of them will be controlled by voice or gestures. The number of
speech-enabled devices we encounter in daily life is steadily increasing, for example ATMs
for visually impaired people, and speech recognition systems can support various
applications that provide employment opportunities for differently abled people.
However, achieving good accuracy and making the system robust to noise have always been
among the main concerns of speech recognition research. The field has long been dominated
by the GMM-HMM (Gaussian mixture model, hidden Markov model) approach, but deep network
models have leveraged advances in big data and computing power to outperform it. Even so,
the race to minimize the error rate continues.
Achieving high accuracy in speech recognition has been a major obstacle in the domain of
Natural Language Processing. The model predominantly used for recognizing speech has been
GMM-HMM, but with the boom in deep learning, deep models have taken primacy over it. With
advances in parallel processing and the use of GPU power, deep learning has become
widespread and has produced results showing that it outperforms GMM-HMM.
In this research work we implemented a deep learning algorithm, the Convolutional Neural
Network (CNN), with the aim of achieving good accuracy on the data set. The data consists
of audio recordings (.wav files) of the numerals 0 to 100 recited in the Punjabi language,
collected with the aim of a good balance of male and female speakers. The CNN architecture
comprises four stacks, each consisting of a convolutional layer, a ReLU unit and a max
pooling unit; the output of these stacks is passed to two fully connected layers. The
first fully connected layer has a dropout of 25%. The results obtained in this work show
better performance than existing work.
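
The layer layout described above can be illustrated with a small shape-tracing sketch.
It is not the thesis implementation: the abstract gives only the number of stacks, the
two fully connected layers and the 25% dropout. The input size, channel counts, kernel
size, pooling size and hidden width below are all hypothetical values chosen for
illustration, and treating each numeral from 0 to 100 as its own class (101 classes) is
likewise an assumption.

```python
# Shape trace for a CNN with four conv/ReLU/max-pool stacks and two fully
# connected layers. All values marked "assumed" are NOT from the abstract.
IN_SHAPE = (1, 64, 64)             # assumed (channels, height, width) of a spectrogram-like input
CONV_CHANNELS = [16, 32, 64, 128]  # assumed output channels of the four stacks
KERNEL, PADDING, POOL = 3, 1, 2    # assumed 3x3 convolution, padding 1, 2x2 max pooling
FC_HIDDEN = 256                    # assumed width of the first fully connected layer
NUM_CLASSES = 101                  # numerals 0..100, assuming one class per numeral
DROPOUT = 0.25                     # stated in the abstract (first fully connected layer)


def conv_out(size, kernel=KERNEL, padding=PADDING, stride=1):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, pool=POOL):
    """Spatial size after non-overlapping max pooling along one dimension."""
    return size // pool


def trace_shapes(in_shape=IN_SHAPE, channels=CONV_CHANNELS):
    """Return the (channels, height, width) shape after each stack."""
    _, h, w = in_shape
    shapes = []
    for c in channels:
        h, w = conv_out(h), conv_out(w)  # convolution; ReLU leaves the shape unchanged
        h, w = pool_out(h), pool_out(w)  # max pooling shrinks each spatial dimension
        shapes.append((c, h, w))
    return shapes


def flattened_size(shapes):
    """Number of features fed into the first fully connected layer."""
    c, h, w = shapes[-1]
    return c * h * w


shapes = trace_shapes()
flat = flattened_size(shapes)
for i, shape in enumerate(shapes, start=1):
    print(f"stack {i}: {shape}")
print(f"fully connected: {flat} -> {FC_HIDDEN} (dropout {DROPOUT}) -> {NUM_CLASSES}")
```

Under these assumed values each stack halves the spatial size, so the flattened feature
vector stays small enough for the two fully connected layers; different input or kernel
choices would change the numbers but not the overall structure.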
Description
Master of Engineering -CSE
