Prosody Based Phonetic Engine and Speaker Classification for Punjabi Language
Abstract
Speech is the most natural means of communication between humans and one of the first skills that we learn. Babies quickly learn how to react to the voice of their mother, and they learn even more quickly to produce noise when they are in need. Speech has always been an important way of communicating: even before writing, spoken words were used to pass on knowledge.
Despite all our novel means of communication, such as e-mail and chat, speech is still considered to be the best means of communication. It is therefore only logical that machine-interface designers, in their quest for a natural man-machine interface, have turned to automatic speech recognition and speech production as one of the most promising interfaces. A system that converts a speech signal to text is termed an Automatic Speech Recognition (ASR) system. A Phonetic Engine (PE) is the first stage of ASR; it converts the speech signal to phonetic symbols. An ASR system performs this process by capturing the speech waveform, extracting the relevant features, capturing the message and reproducing it as text.
The main motivation behind this work is to develop a PE for the Punjabi language and to explore the possibility of improving its performance by incorporating prosody. Prosody refers to the collection of characteristics that lend naturalness to speech. A PE is a transformation tool that utilizes the acoustic-phonetic details present in an input speech signal to decompose it into a symbolic form. The PE produces a sequence of symbols without considering any language constraints in the form of lexical, syntactic or higher-level knowledge sources. The symbols should be chosen such that they capture all the phonetic variations in the speech.
In this research work, a PE is designed and implemented for continuous speech of Punjabi, an Indian language. Punjabi is a highly prosodic language, and not much work has been done in this direction for it. As a first step towards the development of the PE, 24.5 hours of data have been collected in three different modes, namely read speech, lecture speech and conversational speech. Ten hours of the collected data have then been manually transcribed using the International Phonetic Alphabet (IPA) chart. The architecture of the PE includes three phases: data preparation, system training and system testing. Initially, 49 symbols were selected by
carefully analysing the symbol frequency in IPA transcription and data files have
been prepared to train the system accordingly. The prepared data files and speech
files have then been used for the feature extraction and modeling processes. In the development of the PE, Mel-Frequency Cepstral Coefficients (MFCCs) have been used as the feature extraction technique and Hidden Markov Models (HMMs) as the classifier. The PE has been developed using the HMM ToolKit (HTK).
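The MFCC front end used here can be sketched as follows. This is a minimal numpy illustration of the standard pipeline (pre-emphasis, framing, windowing, mel filterbank, DCT); the sampling rate, frame length, hop size and filter count are illustrative defaults, not the exact configuration used in this work.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=12):
    """Return an (n_frames, n_ceps) array of MFCCs."""
    # Pre-emphasis boosts the high frequencies of the signal.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep coeffs 1..n_ceps.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mels), 2 * n + 1) / (2 * n_mels))
    return (log_energy @ basis.T)[:, 1:n_ceps + 1]
```

In an HTK-based setup these coefficients (often with delta and acceleration terms, which is how the 12-dimensional base vector grows to 36) feed the HMM training stage.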
The performance of the PE has been evaluated using three different approaches: (i) increasing the amount of data from 3 hours to 5 hours, (ii) decreasing the number of symbols from 49 to 29, and (iii) increasing the MFCC dimensions from 12 to 36. An accuracy of 72.3% has been achieved in this work when 5 hours of data with 29 symbols and 12 MFCCs were employed.
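Symbol-recognition accuracy of this HTK-style kind is computed from an edit-distance alignment of the recognized symbol sequence against the reference: Accuracy = 100 · (N − D − S − I)/N, where N is the reference length and D, S, I are deletions, substitutions and insertions. A small sketch of that metric:

```python
def phone_accuracy(ref, hyp):
    """HTK-style %Accuracy = 100 * (N - D - S - I) / N from the best
    alignment of hyp against ref (standard edit-distance DP)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_errors, deletions, substitutions, insertions)
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                       # ref prefix vs empty hyp
        e = dp[i - 1][0]
        dp[i][0] = (e[0] + 1, e[1] + 1, e[2], e[3])         # all deletions
    for j in range(1, m + 1):                       # empty ref vs hyp prefix
        e = dp[0][j - 1]
        dp[0][j] = (e[0] + 1, e[1], e[2], e[3] + 1)         # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cands = [dp[i - 1][j - 1]]                  # match: no cost
            else:
                a = dp[i - 1][j - 1]
                cands = [(a[0] + 1, a[1], a[2] + 1, a[3])]  # substitution
            a = dp[i - 1][j]
            cands.append((a[0] + 1, a[1] + 1, a[2], a[3]))  # deletion
            a = dp[i][j - 1]
            cands.append((a[0] + 1, a[1], a[2], a[3] + 1))  # insertion
            dp[i][j] = min(cands)
    _, D, S, I = dp[n][m]
    return 100.0 * (n - D - S - I) / n
```

For example, a reference of four symbols recognized with one symbol deleted scores 75%.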
The speech data collected in the read-speech mode has further been used to design and implement text-independent speaker classification, since it is one of the popular biometric identification techniques, establishing a speaker's identity from the speech of the person.
Many speaker classification techniques have been designed and implemented so far to recognize the speaker efficiently. From the literature review, it has been found that the existing speaker classification techniques suffer from over-fitting and parameter-tuning issues. Efficient tuning of machine learning techniques can improve the classification accuracy of speaker classification. Therefore, to overcome the over-fitting issue, a novel Ensemble-based Quantum Neural Network (EQNN) technique has first been designed in this thesis. It works by ensembling novel data-splitting strategies. A Quantum Neural Network (QNN) has been implemented in MATLAB for a dataset of 7 speakers with 30 samples of read speech from each speaker. The QNN has been trained and tested with different data-splitting strategies, and the results of each strategy have been ensembled with the training of the next. All the experiments have been repeated 30 times.
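Since the abstract does not spell out the QNN internals or the exact splitting strategies, the ensembling idea can only be sketched in outline. In the sketch below, a nearest-centroid classifier stands in for the QNN, three illustrative train fractions stand in for the data-splitting strategies, and the per-strategy predictions are combined by majority vote; all of these substitutions are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_fit(X, y):
    # Stand-in for the QNN: one centroid per speaker class.
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Toy "speaker" data: 3 well-separated speakers, 30 feature vectors each.
X = np.concatenate([rng.normal(loc=k, scale=0.3, size=(30, 4)) for k in range(3)])
y = np.repeat(np.arange(3), 30)

# Three data-splitting strategies (illustrative train fractions).
votes = []
for frac in (0.5, 0.7, 0.8):
    idx = rng.permutation(len(X))
    tr = idx[: int(frac * len(X))]
    model = nearest_centroid_fit(X[tr], y[tr])
    votes.append(nearest_centroid_predict(model, X))

# Ensemble: majority vote across the strategies' predictions.
votes = np.stack(votes)
ensemble = np.array([np.bincount(votes[:, i]).argmax() for i in range(len(X))])
accuracy = (ensemble == y).mean()
```

The point of the ensemble is that no single split decides the final label, which reduces the sensitivity to any one unlucky train/test partition.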
For comparison of results, we have implemented four base classifiers, namely Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Networks (ANNs), with the same dataset. Extensive experiments have been carried out with EQNN and the base classifiers. The performance of all the techniques has been evaluated using four performance metrics, namely accuracy, F-measure, specificity and sensitivity. It has been observed that EQNN outperforms the existing speaker classification techniques on all the performance metrics.
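The four evaluation metrics follow directly from one-vs-rest confusion-matrix counts; a small sketch:

```python
def metrics(tp, fp, fn, tn):
    """Per-class (one-vs-rest) metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)          # recall / true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    precision   = tp / (tp + fp)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f_measure, specificity, sensitivity
```

For a multi-class speaker task, these are typically computed per speaker and then averaged.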
However, the EQNN-based speaker classification technique suffers from the parameter-tuning issue, and there is still a chance of over-fitting. To overcome this, a Crossover-based Particle Swarm Optimization with Support Vector Machine (CPSOSVM) has finally been designed and implemented in this work using MATLAB. In CPSOSVM, Particle Swarm Optimization (PSO) has been used to tune the parameters of the SVM. The crossover operator has been applied to PSO because it helps overcome standard PSO's tendency to get stuck in local optima. Thereafter, CPSOSVM and the competitive machine learning techniques have been used to classify the speakers. Finally, CPSOSVM has been compared with the competitive machine learning models using the same performance metrics as for EQNN. It has been observed that CPSOSVM performs better on all the performance metrics when compared with EQNN and the other base classifiers.
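The abstract does not specify the exact crossover operator or the SVM training setup, so the following sketch only illustrates the optimization idea: standard PSO velocity/position updates, plus an arithmetic crossover that blends some particles with the global best each iteration, minimizing a toy quadratic that stands in for the SVM cross-validation error over two hyperparameters. The objective, bounds and crossover rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def validation_error(params):
    # Toy stand-in for SVM cross-validation error over two hyperparameters
    # (e.g. log C, log gamma); its minimum sits at (1.0, -2.0).
    x, y = params
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def cpso(obj, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
         bounds=(-5.0, 5.0), cx_rate=0.3):
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([obj(p) for p in pos])
    g = pbest[pbest_val.argmin()].copy()          # global best position
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        # Standard PSO update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lo, hi)
        # Crossover: blend a random subset of particles with the global
        # best (arithmetic crossover; an illustrative choice) to help
        # the swarm escape local optima.
        mask = rng.random(n_particles) < cx_rate
        alpha = rng.random((n_particles, 1))
        pos[mask] = alpha[mask] * pos[mask] + (1 - alpha[mask]) * g
        vals = np.array([obj(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

best, err = cpso(validation_error)
```

In the full CPSOSVM setting, `validation_error` would be replaced by an actual SVM cross-validation run at the candidate hyperparameter values.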
