Title: Development of Speaker Recognition Model for Forensic Application
Authors: Gaurav
Supervisor: Bhardwaj, Saurabh; Agarwal, Ravinder
Keywords: Speaker Recognition;Speech Processing;Speaker Identification;Speaker Diarization
Issue Date: 30-Apr-2024
Abstract: Voice is a natural communication tool that humans use to convey meanings, ideas, and opinions. Specifically, "voice" refers to any sound generated by the vibration of the vocal folds under air pressure from the lungs. It carries various characteristics of the speaker, such as ethnicity, age, gender, and emotions. The use of biometrics, particularly voice recognition, has gained popularity in the field of security. Beyond facial recognition, distinct features such as the retina, iris, and voice can be used to distinguish individuals. Biometrics can be broadly classified as either physiological or behavioural. Physiological biometrics include features such as the face, fingerprint, and iris, while behavioural biometrics include voice, keystroke, and signature. Among these, voice recognition is considered one of the most useful technologies because it is easy to use and implement, widely accepted by users, and cost-effective. Speaker recognition research has been ongoing for several decades and has advanced significantly alongside developments in signal processing, algorithms, architecture, and hardware.
Speech samples received for forensic examination and comparison normally originate from uncontrolled environments. Consequently, models were developed for speaker identification and verification in forensic scenarios. Existing methods do not provide sufficient accuracy or robustness for such speech signals. To overcome these issues, an efficient speaker identification framework is developed, based on a Mask region-based convolutional neural network (Mask R-CNN) classifier whose parameters are optimised using Hosted Cuckoo Optimization (HCO). The objective of the method is "to increase the accuracy and to improve the robustness of the signal". Robust feature extraction significantly enhances the efficacy of forensic speaker verification. Although the voice signal is a continuous one-dimensional time series, most contemporary approaches rely on recurrent neural network (RNN) or convolutional neural network (CNN) models. These models cannot comprehensively represent human speech, which leaves them susceptible to speech forgery. Therefore, a reliable technique is needed to represent the human voice accurately and ensure the genuineness of the original speaker. The thesis presents a Two-Tier Feature Extraction with Metaheuristics-Based Automated Forensic Speaker Verification (TTFEM-AFSV) model, which aims to overcome the limitations of previous models. The TTFEM-AFSV model focuses on verifying speakers in forensic applications and uses the average median filtering (AMF) technique to remove noise from speech signals. The performance of both models was validated in a series of experiments. A comparative study showed that both models perform significantly better than recent approaches.
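The abstract does not define the AMF technique in detail. As a rough illustration of the general idea of median-based noise removal on a one-dimensional signal, the following minimal sketch applies a plain sliding-window median filter. This is a standard technique substituted for illustration, not the thesis's AMF implementation; the function name `median_filter`, the `window` parameter, and the sample values are all assumptions.

```python
from statistics import median


def median_filter(signal, window=3):
    """Replace each sample with the median of a centred window.

    Hypothetical helper for illustration only. Near the edges the
    window is truncated to the samples that actually exist.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    filtered = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        filtered.append(median(signal[lo:hi]))
    return filtered


# Toy signal with a single impulsive noise spike at index 2 (made-up values).
noisy = [0.0, 0.1, 5.0, 0.2, 0.1, 0.0]
clean = median_filter(noisy, window=3)
```

In this toy input the isolated spike is replaced by a neighbouring value while the surrounding samples change very little, which is why median-type filters are commonly used against impulsive noise.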
Speaker diarization is a method of separating the individual speakers in an audio stream so that each speaker's speech can be identified in the automatic speech recognition (ASR) transcript. Speakers are distinguished by their unique audio features, and each speaker's utterances are grouped together. As mass gatherings and communication increase, the process of speaker diarization can add complexity to efforts to enhance the clarity of speech transcripts. To address these issues, an automated speaker diarization system using an arithmetic optimisation algorithm with a deep belief network (ASDS-AOADBN) technique is developed. The model's primary purpose is to identify and classify speaker signals from input audio signals. Experimental results showed that the ASDS-AOADBN technique performs better than recent state-of-the-art deep learning (DL) models.
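The grouping step described above (segments with similar audio features are bucketed under one speaker) can be sketched as a simple greedy clustering over per-segment feature vectors. This is only an illustration of the general diarization idea, not the ASDS-AOADBN method; the function names, the cosine-similarity threshold of 0.9, and the two-dimensional feature vectors are assumptions chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diarize(segments, threshold=0.9):
    """Assign a speaker label to each (start, end, features) segment.

    Hypothetical greedy scheme: a segment joins the most similar
    existing speaker if the similarity reaches `threshold`,
    otherwise it starts a new speaker.
    """
    centroids = []  # running sum of feature vectors per speaker
    labelled = []
    for start, end, features in segments:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(features, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(features))
            best = len(centroids) - 1
        else:
            centroids[best] = [c + f for c, f in zip(centroids[best], features)]
        labelled.append((start, end, f"speaker_{best}"))
    return labelled


# Made-up segments: (start seconds, end seconds, toy feature vector).
segments = [
    (0.0, 1.5, [1.0, 0.1]),
    (1.5, 3.0, [0.1, 1.0]),
    (3.0, 4.2, [0.9, 0.2]),
    (4.2, 5.0, [0.2, 0.9]),
]
result = diarize(segments)
speakers = [label for _, _, label in result]
```

With these toy vectors the first and third segments fall under one label and the second and fourth under another. A practical system would use learned speaker embeddings and a more robust clustering or classification stage.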
Appears in Collections: Doctoral Theses@EIED

Files in This Item:
File: Thesis Gaurav_Speaker Recognition.pdf
Size: 4.76 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.