Prosody Based Phonetic Engine and Speaker Classification for Punjabi Language
Abstract
Speech is the most natural means of communication between humans and one of the first skills that we learn. Babies quickly learn how to react to the voice of their mother, and they learn even more quickly to produce noise when they are in need. Speech has always been an important way of communicating: even before writing, spoken words were used to pass on knowledge.
Despite all our novel means of communication, such as e-mail and chat, speech is still considered to be the best means of communication. It is therefore only logical that machine-interface designers, in their quest for a natural man-machine interface, have turned to automatic speech recognition and speech production as one of the most promising interfaces. A system that converts a speech signal to text is termed an Automatic Speech Recognition (ASR) system. A Phonetic Engine (PE) is the first stage of ASR; it converts the speech signal to phonetic symbols. An ASR system performs this process by capturing the speech waveform, extracting the relevant features, capturing the message and reproducing it as text.
The main motivation behind this work is to develop a PE for the Punjabi language and to explore the possibility of improving its performance by incorporating prosody. Prosody refers to the collection of characteristics that lend naturalness to speech. A PE is a transformation tool that utilizes the acoustic-phonetic details present in an input speech signal to decompose it into a symbolic form. The PE produces a sequence of symbols without considering any language constraints in the form of lexical, syntactic or higher-level knowledge sources. The symbols should be chosen such that they capture all the phonetic variations in the speech.
In this research work, a PE is designed and implemented for continuous speech of Punjabi, an Indian language. Punjabi is a highly prosodic language, and not much work has been done in this direction for it. As a first step towards the development of the PE, 24.5 hours of data have been collected in three different modes, namely read speech, lecture speech and conversational speech. Ten hours of the collected data have then been manually transcribed using the International Phonetic Alphabet (IPA) chart. The architecture of the PE includes three phases: data preparation, system training and system testing. Initially, 49 symbols were selected by
carefully analysing the symbol frequency in IPA transcription and data files have
been prepared to train the system accordingly. The prepared data files and speech
files have then been used for the feature extraction and modeling processes. In the development of the PE, Mel-Frequency Cepstral Coefficients (MFCCs) have been used as the feature extraction technique and Hidden Markov Models (HMMs) as the classifier. The PE has been developed using the HMM ToolKit (HTK).
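The MFCC front end used here can be sketched as follows. This is a minimal numpy illustration of the standard pipeline (pre-emphasis, framing, windowing, mel filterbank, DCT); the sampling rate, frame length, hop size and filter count are illustrative defaults, not the exact configuration used in this work.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=12):
    """Return an (n_frames, n_ceps) array of MFCCs."""
    # Pre-emphasis boosts the high frequencies of the signal.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    frames = np.stack([sig[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep coeffs 1..n_ceps.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mels), 2 * n + 1) / (2 * n_mels))
    return (log_energy @ basis.T)[:, 1:n_ceps + 1]
```

In an HTK-based setup these coefficients (often with delta and acceleration terms, which is how the 12-dimensional base vector grows to 36) feed the HMM training stage.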
The performance of the PE has been evaluated using three different approaches: (i) increasing the amount of data from 3 hours to 5 hours, (ii) decreasing the number of symbols from 49 to 29, and (iii) increasing the MFCC dimensions from 12 to 36. An accuracy of 72.3% has been achieved in this work when 5 hours of data with 29 symbols and 12 MFCCs were employed.
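Symbol-recognition accuracy of this HTK-style kind is computed from an edit-distance alignment of the recognized symbol sequence against the reference: Accuracy = 100 · (N − D − S − I)/N, where N is the reference length and D, S, I are deletions, substitutions and insertions. A small sketch of that metric:

```python
def phone_accuracy(ref, hyp):
    """HTK-style %Accuracy = 100 * (N - D - S - I) / N from the best
    alignment of hyp against ref (standard edit-distance DP)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_errors, deletions, substitutions, insertions)
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                       # ref prefix vs empty hyp
        e = dp[i - 1][0]
        dp[i][0] = (e[0] + 1, e[1] + 1, e[2], e[3])         # all deletions
    for j in range(1, m + 1):                       # empty ref vs hyp prefix
        e = dp[0][j - 1]
        dp[0][j] = (e[0] + 1, e[1], e[2], e[3] + 1)         # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                cands = [dp[i - 1][j - 1]]                  # match: no cost
            else:
                a = dp[i - 1][j - 1]
                cands = [(a[0] + 1, a[1], a[2] + 1, a[3])]  # substitution
            a = dp[i - 1][j]
            cands.append((a[0] + 1, a[1] + 1, a[2], a[3]))  # deletion
            a = dp[i][j - 1]
            cands.append((a[0] + 1, a[1], a[2], a[3] + 1))  # insertion
            dp[i][j] = min(cands)
    _, D, S, I = dp[n][m]
    return 100.0 * (n - D - S - I) / n
```

For example, a reference of four symbols recognized with one symbol deleted scores 75%.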
The speech data collected in the read-speech mode has further been used to design and implement text-independent speaker classification, since it is one of the popular biometric identification techniques, establishing a speaker's identity from the speech of the person.
Many speaker classification techniques have been designed and implemented so far to recognize the speaker efficiently. From the literature review, it has been found that the existing speaker classification techniques suffer from over-fitting and parameter-tuning issues. Efficient tuning of machine learning techniques can improve the classification accuracy of speaker classification. Therefore, to overcome the over-fitting issue, a novel Ensemble-based Quantum Neural Network (EQNN) technique has first been designed in this thesis. It works by ensembling novel data-splitting strategies. A Quantum Neural Network (QNN) has been implemented in MATLAB for a dataset of 7 speakers with 30 samples of read speech from each speaker. The QNN has been trained and tested with different data-splitting strategies, and the results of each strategy have been ensembled with the training of the next. All the experiments have been repeated 30 times.
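Since the abstract does not spell out the QNN internals or the exact splitting strategies, the ensembling idea can only be sketched in outline. In the sketch below, a nearest-centroid classifier stands in for the QNN, three illustrative train fractions stand in for the data-splitting strategies, and the per-strategy predictions are combined by majority vote; all of these substitutions are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_fit(X, y):
    # Stand-in for the QNN: one centroid per speaker class.
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Toy "speaker" data: 3 well-separated speakers, 30 feature vectors each.
X = np.concatenate([rng.normal(loc=k, scale=0.3, size=(30, 4)) for k in range(3)])
y = np.repeat(np.arange(3), 30)

# Three data-splitting strategies (illustrative train fractions).
votes = []
for frac in (0.5, 0.7, 0.8):
    idx = rng.permutation(len(X))
    tr = idx[: int(frac * len(X))]
    model = nearest_centroid_fit(X[tr], y[tr])
    votes.append(nearest_centroid_predict(model, X))

# Ensemble: majority vote across the strategies' predictions.
votes = np.stack(votes)
ensemble = np.array([np.bincount(votes[:, i]).argmax() for i in range(len(X))])
accuracy = (ensemble == y).mean()
```

The point of the ensemble is that no single split decides the final label, which reduces the sensitivity to any one unlucky train/test partition.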
For comparison of results, we have implemented four base classifiers, namely Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM) and Artificial Neural Networks (ANNs), with the same dataset. Extensive experiments have been carried out with EQNN and the base classifiers. The performance of all the techniques has been evaluated using four performance metrics, namely accuracy, F-measure, specificity and sensitivity. It has been observed that EQNN outperforms the existing speaker classification techniques on all the performance metrics.
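The four evaluation metrics follow directly from one-vs-rest confusion-matrix counts; a small sketch:

```python
def metrics(tp, fp, fn, tn):
    """Per-class (one-vs-rest) metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)          # recall / true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    precision   = tp / (tp + fp)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f_measure, specificity, sensitivity
```

For a multi-class speaker task, these are typically computed per speaker and then averaged.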
However, the EQNN-based speaker classification technique suffers from the parameter-tuning issue, and there is still a chance of over-fitting. To overcome this, a Crossover-based Particle Swarm Optimization with Support Vector Machine (CPSOSVM) has finally been designed and implemented in this work using MATLAB. In CPSOSVM, Particle Swarm Optimization (PSO) has been used to tune the parameters of the SVM. The crossover operator has been applied to PSO because it helps overcome standard PSO's tendency to get stuck in local optima. Thereafter, CPSOSVM and the competitive machine learning techniques have been used to classify the speakers. Finally, CPSOSVM has been compared with the competitive machine learning models using the same performance metrics as for EQNN. It has been observed that CPSOSVM performs better on all the performance metrics when compared with EQNN and the other base classifiers.
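The abstract does not specify the exact crossover operator or the SVM training setup, so the following sketch only illustrates the optimization idea: standard PSO velocity/position updates, plus an arithmetic crossover that blends some particles with the global best each iteration, minimizing a toy quadratic that stands in for the SVM cross-validation error over two hyperparameters. The objective, bounds and crossover rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def validation_error(params):
    # Toy stand-in for SVM cross-validation error over two hyperparameters
    # (e.g. log C, log gamma); its minimum sits at (1.0, -2.0).
    x, y = params
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def cpso(obj, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
         bounds=(-5.0, 5.0), cx_rate=0.3):
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([obj(p) for p in pos])
    g = pbest[pbest_val.argmin()].copy()          # global best position
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        # Standard PSO update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lo, hi)
        # Crossover: blend a random subset of particles with the global
        # best (arithmetic crossover; an illustrative choice) to help
        # the swarm escape local optima.
        mask = rng.random(n_particles) < cx_rate
        alpha = rng.random((n_particles, 1))
        pos[mask] = alpha[mask] * pos[mask] + (1 - alpha[mask]) * g
        vals = np.array([obj(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

best, err = cpso(validation_error)
```

In the full CPSOSVM setting, `validation_error` would be replaced by an actual SVM cross-validation run at the candidate hyperparameter values.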
