Multimodal Machine Learning for an Efficient Information Retrieval: Step into Next-Generation Computing
Abstract
Living creatures perceive the external environment, including their own bodies, through sensory information, or modalities, such as vision, touch, and hearing. Because the environment is so rich, a single modality rarely provides complete knowledge about a phenomenon of interest; when several senses take part in processing information, a better understanding is obtained. The increasing availability of multiple modalities describing the same phenomenon provides new degrees of freedom for fusing them. Fusion of modalities is the process of combining features from different sources to obtain complementary information from each. This dissertation focuses on information fusion of multimodal data to provide high accuracy, scalability, and enhanced performance across various tasks. In this research work we integrate the visual and linguistic modalities to build machine learning models with improved decision making, and we propose three different frameworks for multimodal classification. The primary focus is to develop robust frameworks that use deep learning architectures to enhance multimodal classification accuracy and efficiency.
The first proposed work addresses the challenge of effectively fusing features to improve food classification accuracy. The model is evaluated on the UPMC Food-101 dataset and a newly created Bharatiya Food dataset. It extracts features with a fine-tuned Inception-v4 for the visual input and RoBERTa for the associated text, followed by early-stage fusion to integrate these features effectively.
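The sketch below illustrates the general idea of early-stage (feature-level) fusion described above, assuming the visual and textual backbones have already produced fixed-size feature vectors. The feature dimensions (1536 for Inception-v4, 768 for RoBERTa-base), the hidden layer size, and the 101-way output are illustrative assumptions, not the exact configuration used in the dissertation.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Concatenates modality features before classification (early-stage fusion)."""

    def __init__(self, visual_dim=1536, text_dim=768, hidden_dim=512, num_classes=101):
        super().__init__()
        # Joint representation learned on the concatenated visual + text features.
        self.fusion = nn.Sequential(
            nn.Linear(visual_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_feat, text_feat):
        fused = torch.cat([visual_feat, text_feat], dim=-1)
        return self.classifier(self.fusion(fused))

# Usage with dummy precomputed features for a batch of 4 samples.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 1536), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 101])
```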
The second proposed work introduces the Deep Attentive Multimodal Fusion Network (DAMFN), an improvement on the previous multimodal food classification model. Two significant changes are made in this model: the feature extraction backbone of the visual component is updated, and the size of the newly developed dataset is increased. The model employs a three-stage process: Functional Feature Extraction, Early-Stage Fusion, and Feature Classification. Experimental results on the UPMC Food-101 dataset and the newly developed food dataset demonstrate that DAMFN outperforms state-of-the-art techniques, highlighting its ability to leverage deep correlations between the modalities for improved classification outcomes.
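As a rough sketch of how the three named stages compose, the pipeline below wires modality-specific extractors into early-stage fusion and a classification head. The extractor modules, feature dimensions, and layer sizes are placeholders standing in for DAMFN's actual components, not a reproduction of the thesis architecture.

```python
import torch
import torch.nn as nn

class DAMFNPipeline(nn.Module):
    """Three stages: feature extraction, early-stage fusion, feature classification."""

    def __init__(self, visual_extractor, text_extractor, visual_dim, text_dim, num_classes):
        super().__init__()
        # Stage 1: modality-specific feature extraction (e.g. a fine-tuned CNN
        # for the image and a transformer encoder for the text).
        self.visual_extractor = visual_extractor
        self.text_extractor = text_extractor
        # Stage 2: early-stage fusion of the extracted features.
        self.fusion = nn.Sequential(nn.Linear(visual_dim + text_dim, 512), nn.ReLU())
        # Stage 3: feature classification.
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, image, text):
        v = self.visual_extractor(image)
        t = self.text_extractor(text)
        return self.classifier(self.fusion(torch.cat([v, t], dim=-1)))

# Usage with identity extractors standing in for the real backbones.
model = DAMFNPipeline(nn.Identity(), nn.Identity(),
                      visual_dim=1536, text_dim=768, num_classes=101)
print(model(torch.randn(2, 1536), torch.randn(2, 768)).shape)  # torch.Size([2, 101])
```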
The third proposed approach introduces the Vision Language Fused Attention (ViLFAt) classification network, which addresses the challenge of effectively fusing the modalities for improved meme detection accuracy. For intrinsic meme detection, both the global and salient features from the meme image are combined with the textual features. The model further uses an attention mechanism to highlight and integrate the most relevant features from the modalities, leading to significant improvements in detecting intrinsic multimodal meme content, as demonstrated by the performance results.
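The sketch below conveys the spirit of attention-weighted fusion over global visual, salient visual, and textual features as described above. Treating each modality feature as one token, using multi-head self-attention, the pooling step, and the feature dimension are all assumptions for illustration rather than ViLFAt's exact design.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Self-attention over modality tokens highlights the most relevant features."""

    def __init__(self, dim=768, num_heads=4, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, global_feat, salient_feat, text_feat):
        # One token per modality feature: (batch, 3, dim).
        tokens = torch.stack([global_feat, salient_feat, text_feat], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)
        pooled = attended.mean(dim=1)  # aggregate the attended modality tokens
        return self.classifier(pooled)

# Usage with dummy features for a batch of 2 memes.
model = AttentiveFusion()
logits = model(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 2])
```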
Keywords: Multimodal machine learning, deep learning, feature fusion, multimodality,
convolutional neural network, image and text integration, multimodal food classification,
multimodal meme detection.
