Please use this identifier to cite or link to this item: http://hdl.handle.net/10266/6487
Title: Framework for Efficient Spam Detection in Online Social Network
Authors: Rao, Sanjeev
Supervisor: Verma, Anil Kumar
Bhatia, Tarunpreet
Keywords: Data sampling techniques;Deep Learning;Machine Learning;Online Social Network;Natural Language Processing;Social Spam
Issue Date: 26-Jun-2023
Abstract: Online Social Networks (OSNs) are continually evolving and are used in numerous applications such as communication, news, entertainment, business, gaming, marketing and advertisement, live-streaming, job search, dating, education, and healthcare. At the same time, cybercriminals and botnets operating groups of fake/bot accounts use OSNs to disseminate spam, misleading facts, fake news, hate speech, and malicious links to targeted users or to the masses in order to commit cyber-crimes, earn money, polarize sentiments, and affect users’ online interaction time. Moreover, prevalent spam degrades the quality of available information as well as network bandwidth, computing power, and speed. Recently, AI-enabled Deepfakes have exacerbated these issues. Detecting and eliminating social spam and spammers from OSNs therefore requires a review of recent research on these topics. This doctoral thesis thoroughly reviews existing techniques for social spam and spammer detection. First, the background of social spam, the spamming process, and a social spam taxonomy are discussed. The extensive review then reveals the essential requirements and critical challenges in detecting and combating social spam. The thesis reports on the features used, the dimensionality reduction techniques used for feature selection/extraction, the existing datasets, and the machine learning and deep learning methodologies used for social spam and spammer detection, along with their strengths and limitations. The thesis also explores recent AI-enabled Deepfake (text, image, and video) spam and its countermeasures. The doctoral thesis aims to advance the field of spam detection in OSNs by developing Machine Learning (ML) and Deep Learning (DL) based approaches that address the most pressing issues in social spam detection, thereby improving detection performance.
Most previous research relied on small datasets and suffered from class imbalance, resulting in outcomes biased towards the majority class. This study uses data-sampling techniques such as NearMiss and SmoteTomek to address the class imbalance problem. Traditional word representation techniques are inefficient and time-consuming when generating contextual word vectors. In this study, the relevant features are extracted using recent word-representation/embedding techniques before being fed to any ML/DL model. A framework for social spam detection is proposed using ML and DL based approaches to improve detection performance. For the ML-based approaches, two voting ensemble models are proposed. Initially, nine baseline ML models are trained and tested on imbalanced and balanced datasets. The models are then combined into ensembles using hard and soft voting mechanisms. After the models are evaluated on test data, the best parameter values are extracted, and the best prediction model is finalized by adjusting the parameters. In the Proposed Hard Voting Ensemble (PHVE), the final class prediction is the majority vote of the classifiers. In the Proposed Soft Voting Ensemble (PSVE), by contrast, the target label with the highest sum of weighted probabilities is chosen as the final prediction. The results show that the PHVE and PSVE models outperform the other baseline ML models on the balanced datasets. For the DL-based approaches, two hybrid approaches are proposed using advanced pre-trained word embedding techniques, deep learning models, and a self-attention mechanism on a balanced combined dataset obtained using the SmoteTomek data sampling technique.
The embeddings are generated using recent GloVe and FastText pre-trained word embeddings and passed into a deep neural network composed of Conv1D and bi-directional recurrent neural network layers with a self-attention mechanism for improved contextual understanding and more effective results. Experiments and comparisons show that the proposed hybrid framework with deep learning-based approaches outperformed the other techniques/approaches. Furthermore, existing challenges and emerging issues, such as the robustness of detection systems, scalability, real-time datasets, evasion strategies used by spammers, coordinated inauthentic behavior, and adversarial attacks on machine learning-based spam detectors, are discussed in detail along with possible directions for future research.
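The difference between the hard and soft voting described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `hard_vote` and `soft_vote`, the three classifiers, and all labels, probabilities, and weights below are hypothetical values chosen only to show how the two rules can disagree on the same message.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the most classifiers (majority voting).

    Ties are broken in favour of the label that appears first in the input.
    """
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(class_probabilities, weights=None):
    """Return the label with the highest weighted sum of predicted probabilities.

    class_probabilities: one dict per classifier, mapping label -> probability.
    weights: optional per-classifier weights (equal weights if omitted).
    """
    if weights is None:
        weights = [1.0] * len(class_probabilities)
    totals = {}
    for probs, weight in zip(class_probabilities, weights):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + weight * p
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for a single message.
probs = [
    {"spam": 0.95, "ham": 0.05},  # classifier 1 is very confident in "spam"
    {"spam": 0.40, "ham": 0.60},  # classifier 2 leans towards "ham"
    {"spam": 0.45, "ham": 0.55},  # classifier 3 leans towards "ham"
]
# Each classifier's hard label is its most probable class.
labels = [max(p, key=p.get) for p in probs]

print(hard_vote(labels))  # ham  (two of three classifiers vote "ham")
print(soft_vote(probs))   # spam (summed probabilities: spam 1.80, ham 1.20)
```

In this example the majority vote picks "ham", while the summed probabilities favour "spam" because one classifier is far more confident than the other two. Passing per-classifier weights to `soft_vote` changes the balance in the same way.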
URI: http://hdl.handle.net/10266/6487
Appears in Collections: Doctoral Theses@CSED

Files in This Item:
File | Description | Size | Format
951703012_Sanjeev Rao_PhD_Thesis.pdf | | 3.86 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.