Please use this identifier to cite or link to this item:
http://hdl.handle.net/10266/5976
Title: | An Efficient Spammer Classification for Ranking of Web Pages |
Authors: | Makkar, Aaisha |
Supervisor: | Kumar, Neeraj |
Keywords: | Web page ranking |
Issue Date: | 6-Jul-2020 |
Abstract: | Inaccurate search engine result pages (SERPs) are one of the significant drawbacks of search engine ranking algorithms, and web spam is one of the primary causes. Many techniques detect web spam by analyzing the content features and link features of a web page, but they focus primarily on revising the rank score of a page after it has been included in the SERP. None of them aims to prevent web spam before the ranking algorithm assigns a rank. For SERPs to be reliable, web pages should be completely spam-free before entering the ranking module. For this purpose, the ranking algorithm itself should demote spam pages by reducing their rank score, and it should do so in a way that promotes authoritative web sites. This work presents a complete analysis and study of Google's ranking methodology, PageRank, and investigates the measures that affect rank score computation in PageRank. The analysis finds that the primary cause of the injection of spam pages into the web is the presence of dangling web pages. Dangling pages are web pages that have no outgoing hyperlinks. The proportion of dangling pages is increasing because of documents such as PDFs and technical reports from research communities. Spammers create artificial in-links to boost the rank of web pages but pay no attention to outgoing links, which results in dangling pages. Although much work has been done on handling dangling pages and improving the ranking algorithm, none of it has handled dangling pages with respect to user behavior analysis. Only user surfing activity can reveal the real picture of a web page. Evaluating the importance of a dangling page can significantly help in refining SERPs. This research work accomplishes that task with two different approaches.
The first approach detects spam dangling web pages by considering two user behavior attributes: dwell time and click count. A web page importance score is computed from these attributes. The ranks are then revised by applying this score in a Learning Automata (LA) environment. A learning automaton is a stochastic system that learns from its environment, which responds to each action with either a reward or a penalty. With every response from the environment, the probability of visiting the web page is updated. The probabilities are computed using Normal and Gamma distribution functions. Only dangling pages are considered in the experiments. Inactive web pages are penalized and degraded from the system. The proposed approach is validated with the Microsoft Learning to Rank dataset. In the experiments, 3403 of 12211 dangling pages were degraded by the proposed scheme. The objective of the proposed system is achieved while saving web energy and computational cost. The scheme takes 100 iterations to converge, which execute in 21.88 ms. In addition, the user behavior analysis helped improve the PageRank scores of the web pages. The second approach presents an intelligent cognitive spammer framework, Cognitive spammer, which eliminates spam pages during the web page rank score calculation by search engines. The framework updates Google's ranking algorithm, PageRank, so that it automatically prevents link spam. It considers the link structure of the web for rank score computation. The updated PageRank algorithm provided a better ranking of web pages. The proposed framework is validated with the WEBSPAM-UK2007 dataset.
Before processing, the dataset is preprocessed with a new technique, called Split by Over-sampling and Train by Under-fitting, to remove the trade-off caused by the imbalanced instances of the target class. After data cleaning, machine learning techniques (Bagged model, Boosted linear model, etc.) are applied to the web page features to make accurate predictions. The detection classifiers consider only the link features of a web page, irrespective of the page content. Out of the fifteen classifiers, the best three are combined into an ensemble, which improves overall accuracy. Ten-fold cross-validation applied to the resulting ensemble model yields an accuracy of 99.6% for the proposed scheme. |
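The abstract's notion of a dangling page can be made concrete with a small sketch. The Python below is not the method of the thesis; it is the textbook PageRank power iteration, assuming a damping factor of 0.85 and uniform redistribution of the rank held by dangling pages. The toy graph, the page names, and the function names `find_dangling` and `pagerank` are hypothetical and chosen only for illustration.

```python
def find_dangling(graph):
    """Return the pages that have no outgoing hyperlinks."""
    return sorted(page for page, outs in graph.items() if not outs)


def pagerank(graph, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    Assumption: rank held by dangling pages is spread uniformly over
    all pages, which is one common convention among several.
    Every link target must also appear as a key of `graph`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank currently sitting on pages without out-links.
        dangling_mass = sum(rank[p] for p in pages if not graph[p])

        # Teleportation share plus the redistributed dangling share.
        base = (1.0 - damping) / n + damping * dangling_mass / n
        new_rank = {page: base for page in pages}

        # Each non-dangling page splits its rank among its out-links.
        for page in pages:
            outs = graph[page]
            for target in outs:
                new_rank[target] += damping * rank[page] / len(outs)

        rank = new_rank

    return rank


# Hypothetical toy web: two spam pages add artificial in-links to a PDF
# that has no outgoing links, so the PDF is a dangling page.
toy_web = {
    "home": ["about", "report.pdf"],
    "about": ["home"],
    "spam1": ["report.pdf"],
    "spam2": ["report.pdf"],
    "report.pdf": [],
}

ranks = pagerank(toy_web)
dangling = find_dangling(toy_web)
```

Here `dangling` lists the pages whose scores a behavior-based scheme, such as the dwell-time and click-count analysis described in the abstract, would re-examine. The sketch only identifies such pages and computes their baseline rank.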
URI: | http://hdl.handle.net/10266/5976 |
Appears in Collections: | Doctoral Theses@CSED |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Revised_thesis.pdf | | 11.46 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.