Development of an Intelligent Model for Spoken Language Identification

Abstract

Speech is one of the most natural ways of communicating information. In a speech signal, language is the medium that conveys the message, and it consists of sounds, words, and grammar. As society becomes increasingly globalized, there is a growing need to handle a variety of languages. Spoken language identification (SLID) is the automatic process of recognizing the language spoken in a speech sample. SLID is an enabling technology that plays an important role in many multilingual speech processing applications, such as spoken language translation, multilingual speech recognition, and spoken document retrieval. It is also a topic of great interest in intelligence and security, where it supports information distillation. Despite advances in the field, SLID continues to face several challenges. These include similarities in phonetic structure between languages, variations in speaking style, background noise, and speaker differences in pitch, accent, and pronunciation. The present work set out to design and validate improved SLID models by combining bio-inspired optimization with deep learning. The study examined widely used feature-extraction and classification methods, identified acoustic features that best support language discrimination, developed an optimization-driven deep learning framework, and evaluated the proposed models on a benchmark multilingual dataset. The approach progressed from model design to empirical testing. The first part of the study introduced the DBODL-MSLIS framework, which integrates Dung Beetle Optimization (DBO) with Long Short-Term Memory (LSTM) networks. Speech samples from the IIIT Spoken Language Dataset were processed to extract four key acoustic features: pitch, energy, zero-crossing rate (ZCR), and discrete wavelet transform (DWT) coefficients.
The DBO algorithm was used to tune hyperparameters, improve convergence, and reduce the likelihood of the model settling in poor local minima. Experiments showed that DBODL-MSLIS provided strong classification performance and generalization, including under noisy conditions. Compared with existing classifiers, the model recorded notable gains in accuracy, sensitivity, and F-score and maintained stable learning behaviour. In the second phase, a complementary framework named ASLID-GJODL (Automatic Spoken Language Identification using Golden Jackal Optimization with Deep Learning) was developed to further improve reliability and computational efficiency. Here, speech signals were converted into spectrograms, which were then processed by a Squeeze-and-Excitation DenseNet (SE-DenseNet). This network adapts its feature emphasis to highlight language-specific cues while reducing the impact of background noise. Model parameters were optimized using the Golden Jackal Optimization (GJO) method, inspired by the cooperative hunting patterns of golden jackals, allowing an effective balance between exploration and convergence. Both proposed frameworks were tested on two benchmark datasets. Their performance was evaluated using confusion matrices, accuracy, precision, sensitivity, specificity, and F1-score. Results were compared with those of conventional machine-learning approaches. Across evaluations, the proposed models consistently demonstrated higher accuracy, better robustness, and improved computational performance relative to existing methods. This research introduces two hybrid architectures and provides insight into how metaheuristic optimization can improve the performance of deep learning models in speech analysis.
The study represents a significant advance in the development of more efficient SLID systems through the integration of biologically inspired algorithms: Golden Jackal Optimization (GJO) for hyperparameter tuning and Dung Beetle Optimization (DBO) for feature selection. The results indicate that integrating computational intelligence with deep neural models can effectively address the diversity and variability inherent in real-world spoken languages.
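The evaluation measures named in the abstract (accuracy, precision, sensitivity, specificity, and F1-score) can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, assuming a one-vs-rest treatment of each language class; the function names and the 3x3 matrix are hypothetical and chosen only for illustration, and the counts are not results from the study.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total if total else 0.0


def per_class_metrics(cm):
    """One-vs-rest metrics per class.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j (an assumed convention, not taken from the study).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]                                # class k predicted as k
        fn = sum(cm[k]) - tp                         # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp    # other classes predicted as k
        tn = total - tp - fn - fp                    # everything else
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # also called recall
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results.append({
            "precision": precision,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "f1": f1,
        })
    return results


# Illustrative counts for three hypothetical language classes (made-up values).
example_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

if __name__ == "__main__":
    print("accuracy:", overall_accuracy(example_cm))
    for label, metrics in enumerate(per_class_metrics(example_cm)):
        print("class", label, metrics)
```

Per-class values computed this way can be averaged across languages (macro-averaging) to give a single figure per metric; whether the study used macro or weighted averaging is not stated in the abstract.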
