Development of Speaker Recognition Model for Forensic Application

dc.contributor.authorGaurav
dc.contributor.supervisorBhardwaj, Saurabh
dc.contributor.supervisorAgarwal, Ravinder
dc.date.accessioned2024-04-30T05:07:32Z
dc.date.available2024-04-30T05:07:32Z
dc.date.issued2024-04-30
dc.description.abstractVoice is a natural communication tool that humans use to convey meanings, ideas, and opinions. Specifically, voice refers to any sound produced by the vibration of the vocal folds as pressurised air is expelled from the lungs, and it carries various traits of the speaker, including ethnicity, age, gender, and emotion. Biometrics, including voice recognition, have gained popularity in the field of security: in addition to the face, unique features such as the retina, iris, and voice can be used to distinguish individuals. Biometrics are broadly classified as physiological or behavioural. Physiological biometrics involve features such as the face, fingerprint, and iris, while behavioural biometrics encompass voice, keystroke, and signature. Among these, voice recognition is one of the most valuable technologies because it is easy to use and implement, widely accepted by users, and cost-effective. 
Research in speaker recognition has been conducted for several decades and has evolved significantly with advancements in signal processing, algorithms, architecture, and hardware. Speech samples received for forensic examination and comparison typically originate from uncontrolled environments, so models were developed for identification and verification in forensic scenarios. Existing methods do not provide sufficient accuracy or robustness to degradation of the speech signal. To overcome these issues, an efficient speaker identification framework is developed, based on a Mask region-based convolutional neural network (Mask R-CNN) classifier whose parameters are optimised using Hosted Cuckoo Optimization (HCO). The objective of the method is to increase accuracy and to improve robustness to variability in the signal. Robust feature extraction significantly enhances the efficacy of forensic speaker verification. Although the voice signal is a continuous one-dimensional time series, most contemporary models use recurrent neural network (RNN) or convolutional neural network (CNN) architectures. These models cannot comprehensively represent human speech, rendering them susceptible to speech forgery. A reliable technique is therefore needed to model the human voice accurately and verify the genuineness of the original speaker. To overcome these limitations, a Two-Tier Feature Extraction with Metaheuristics-Based Automated Forensic Speaker Verification (TTFEM-AFSV) model is proposed. The TTFEM-AFSV model verifies speakers in forensic applications, exploiting the average median filtering (AMF) technique to discard noise in the speech signals. The performance of both models was validated in a series of experiments, and a comparative study revealed significantly improved performance over recent approaches. 
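The average median filtering step mentioned above can be illustrated with a minimal sketch. This is one plausible reading of AMF (a sliding median blended with the raw signal to suppress impulsive noise while limiting distortion of voiced segments); the thesis's exact formulation and kernel choice may differ.

```python
import numpy as np

def average_median_filter(signal, kernel=5):
    """Denoise a 1-D speech signal: take a sliding median to suppress
    impulsive noise, then average it with the original samples.
    Hypothetical illustration only, not the thesis's exact AMF."""
    pad = kernel // 2
    padded = np.pad(signal, pad, mode="edge")
    # All length-`kernel` windows over the padded signal, one per sample.
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel)
    med = np.median(windows, axis=-1)
    return 0.5 * (signal + med)
```

On an isolated click (a single large sample surrounded by silence), the median term is zero, so the blend halves the impulse while leaving clean regions unchanged.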
Speaker diarization is the process of separating individual speakers in an audio stream so that each speaker's utterances can be attributed correctly in the automatic speech recognition (ASR) transcript: segments with similar audio features are grouped together as belonging to the same speaker. As mass gatherings and communication increase, speaker diarization becomes more complex, complicating efforts to improve the clarity of speech transcripts. To address these issues, an automated speaker diarization system using an arithmetic optimisation algorithm with a deep belief network (ASDS-AOADBN) technique is developed. Its primary purpose is to identify and classify speaker signals in the input audio. Experimental analysis showed better performance of the ASDS-AOADBN technique than recent state-of-the-art DL models.en_US
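The grouping step at the heart of diarization (bucketing segments with similar audio features under one speaker label) can be sketched as follows. Plain k-means stands in here for the thesis's actual classifier, a deep belief network tuned by an arithmetic optimisation algorithm; the feature vectors are assumed to have been extracted per segment already.

```python
import numpy as np

def diarize(features, n_speakers=2, iters=20):
    """Toy diarization: cluster per-segment feature vectors into
    speaker groups with k-means. Stand-in for the ASDS-AOADBN
    classifier, for illustration only."""
    # Deterministic init: spread initial centres across the segments.
    idx = np.linspace(0, len(features) - 1, n_speakers).astype(int)
    centers = features[idx].astype(float)
    for _ in range(iters):
        # Assign each segment to its nearest speaker centre.
        d = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned segments.
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels
```

Given segments from two well-separated speakers, the returned labels split them into two consistent groups; real systems replace both the features and the clusterer with far stronger models.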
dc.identifier.urihttp://hdl.handle.net/10266/6708
dc.language.isoenen_US
dc.subjectSpeaker Recognitionen_US
dc.subjectSpeech Processingen_US
dc.subjectSpeaker Identificationen_US
dc.subjectSpeaker Diarizationen_US
dc.titleDevelopment of Speaker Recognition Model for Forensic Applicationen_US
dc.typeThesisen_US

Files

Original bundle

Name: Thesis Gaurav_Speaker Recognition.pdf
Size: 4.65 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.03 KB
Format: Item-specific license agreed upon to submission