Framework for Efficient Spam Detection in Online Social Network
Abstract
Online Social Networks (OSNs) are continually evolving and are used in numerous
applications such as communication, news, entertainment, business, gaming,
marketing and advertising, live-streaming, job search, dating, education, and
healthcare. At the same time, cybercriminals and botnets operating groups of fake or
bot accounts use OSNs to disseminate spam, misleading facts, fake news, hate speech,
and malicious links to targeted users or to the general public in order to commit
cybercrimes, earn money, polarize sentiment, and affect users’ online interaction time.
Moreover, prevalent spam degrades the quality of available information, network
bandwidth, computing power, and speed. Recently, AI-enabled Deepfakes have
significantly exacerbated these issues. Thus, to detect and eliminate social spam and
spammers from OSNs, it is necessary to review recent research on these topics.
This doctoral thesis thoroughly reviews existing techniques for social spam and spammer
detection. First, the background of social spam, the spamming process, and a social
spam taxonomy are discussed. The extensive review then identifies the essential
requirements and critical challenges in detecting and combating social spam. The thesis
surveys the features used, the dimensionality reduction techniques used for feature
selection/extraction, existing datasets, and the machine learning and deep learning
methodologies used for social spam and spammer detection, along with their strengths
and limitations. The thesis also explores recent AI-enabled Deepfake (text, image, and
video) spam and its countermeasures.
The doctoral thesis aims to advance the field of spam detection in OSNs by developing
Machine Learning (ML) and Deep Learning (DL) based approaches that address the most
pressing issues in social spam detection, thereby improving detection performance.
Most previous research relied on small datasets and suffered from class imbalance,
which biases outcomes towards the majority class. This study uses data-sampling
techniques such as NearMiss and SmoteTomek to address the class imbalance problem.
Traditional word representation techniques are inefficient and time-consuming when
generating contextual word vectors. In this study, the relevant features are therefore
extracted using recent word-representation/embedding techniques before being fed to
any ML/DL model.
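As an illustration of the under-sampling idea mentioned above, the following is a minimal sketch of one NearMiss variant (NearMiss-1), which keeps the majority-class samples whose mean distance to their k nearest minority-class samples is smallest. The abstract does not state which NearMiss version or which parameters the thesis uses, so the function name `near_miss_1`, the value of `k`, and the toy 2-D feature vectors are all assumptions made for illustration. SmoteTomek is not covered here.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k nearest minority points (NearMiss-1 selection rule)."""

    def mean_knn_distance(point):
        # Distances from this majority point to every minority point.
        distances = sorted(math.dist(point, m) for m in minority)
        nearest = distances[:k]
        return sum(nearest) / len(nearest)

    # Rank majority points from closest to farthest and truncate.
    ranked = sorted(majority, key=mean_knn_distance)
    return ranked[:n_keep]


# Hypothetical 2-D feature vectors: "ham" is the majority class.
ham = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0), (9.0, 9.0), (10.0, 10.0)]
spam = [(5.5, 5.5), (6.5, 6.5)]

# Under-sample ham down to the size of the spam class.
ham_kept = near_miss_1(ham, spam, n_keep=len(spam), k=2)
print(ham_kept)
```

In this toy example the two ham points nearest the spam cluster are retained, so both classes end up with two samples. In practice a maintained implementation such as the `NearMiss` and `SMOTETomek` classes in the imbalanced-learn library would normally be preferred over hand-written code.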
In this study, a framework for social spam detection is proposed that uses ML- and DL-based approaches to improve detection performance. For the ML-based
approaches, two voting ensemble models are proposed. First, nine baseline ML
models are trained and tested on imbalanced and balanced datasets. The models
are then combined into ensembles using hard and soft voting mechanisms. After the
models are evaluated on test data, the best parameter values are extracted, and the
best prediction model is finalized by adjusting the parameters. In the Proposed Hard
Voting Ensemble (PHVE), the final class prediction is made by majority vote over the
individual classifiers' predictions. By contrast, in the Proposed Soft Voting Ensemble
(PSVE), the target label with the highest sum of weighted probabilities is chosen as
the final prediction. The results show that the PHVE and PSVE models outperform the
baseline ML models on the balanced datasets.
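The difference between the two voting rules described above can be sketched in a few lines. This is a generic illustration, not the thesis's PHVE/PSVE implementation: the function names `hard_vote` and `soft_vote`, the equal default weights, and the three hypothetical classifier outputs are assumptions.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most classifiers (majority vote)."""
    counts = Counter(predicted_labels)
    # Ties are resolved in favour of the label encountered first.
    return counts.most_common(1)[0][0]


def soft_vote(class_probabilities, weights=None):
    """Return the label with the highest weighted sum of probabilities.

    class_probabilities: one dict per classifier, mapping label -> probability.
    weights: optional per-classifier weights (equal weights if omitted).
    """
    if weights is None:
        weights = [1.0] * len(class_probabilities)
    totals = {}
    for probs, weight in zip(class_probabilities, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for a single message.
labels = ["ham", "ham", "spam"]
probabilities = [
    {"spam": 0.45, "ham": 0.55},
    {"spam": 0.40, "ham": 0.60},
    {"spam": 0.95, "ham": 0.05},
]

hard_result = hard_vote(labels)          # two of three classifiers say "ham"
soft_result = soft_vote(probabilities)   # one very confident "spam" dominates
print(hard_result, soft_result)
```

Here the two rules disagree: hard voting returns "ham" because two classifiers narrowly favour it, while soft voting returns "spam" because the summed probabilities (1.80 against 1.20) reflect the third classifier's high confidence.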
For the DL-based approaches, two hybrid models are proposed that combine advanced pre-trained word embedding techniques, deep learning approaches, and a self-attention
mechanism on a balanced combined dataset obtained using the SmoteTomek data
sampling technique. The embeddings are generated using GloVe and FastText
pre-trained word embeddings and passed into a deep neural network composed of
Conv1D and bi-directional recurrent neural network layers with a self-attention
mechanism for improved contextual understanding and more effective results.
Experiments and comparisons show that the proposed hybrid framework with deep
learning-based approaches outperformed the other techniques.
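To make the self-attention step concrete, the following is a minimal sketch of scaled dot-product self-attention over a short sequence of feature vectors. It is a simplification under stated assumptions: the abstract does not specify the attention formulation used in the thesis, and this version omits the learned query, key, and value projections that trainable attention layers normally include (queries, keys, and values are all the input sequence itself). The function names and the toy 2-dimensional token features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(sequence[0])
    outputs, all_weights = [], []
    for query in sequence:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        # Context vector: attention-weighted average of all positions.
        context = [
            sum(w * value[i] for w, value in zip(weights, sequence))
            for i in range(dim)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical features for three tokens (e.g. outputs of a recurrent layer).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, attention = self_attention(tokens)
print(attention[0])
```

Each row of `attention` sums to 1 and indicates how strongly one position attends to every other position, which is what allows the network to weight informative tokens more heavily when forming its representation of a message.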
Furthermore, existing challenges and emerging issues, such as the robustness of
detection systems, scalability, real-time datasets, evasion strategies used by
spammers, coordinated inauthentic behavior, and adversarial attacks on machine
learning-based spam detectors, are discussed in detail, along with possible directions
for future research.
