|Title:||Design and Implementation of an Efficient Framework for Web Page Classification|
|Keywords:||web page classification;categorization;feature set;binary classifier|
|Abstract:||With the evolution of the Internet and related technologies, there is a great need for efficient web page categorization to obtain fast responses when searching and classifying documents on the web. Because of the large number of user requests, searching and classification of web documents can hit a performance bottleneck with respect to QoS parameters such as response time and congestion. Classification helps in searching, sorting, retrieval, and querying of documents. The World Wide Web (WWW) contains a huge repository of information in the form of web pages, and the Internet is growing day by day, which calls for efficient classification of web pages with higher accuracy. This huge repository makes it challenging to collect and process the relevant information for a particular domain. Most solutions reported in the literature do not adequately address these issues and cannot deliver a fast response time for web page categorization. Moreover, most existing techniques are semi-automatic and cannot achieve a high level of accuracy. Traditional text classification techniques are therefore difficult to apply to rapidly growing web-based content, and manual categorization of billions of web pages with high accuracy is a cumbersome task. To address these issues, this thesis proposes novel techniques for web page categorization. In these techniques, personality features are collected and assigned weights, and the proposed classifiers are then trained on these features. The techniques are based on identifying specific and relevant features of web pages. In the proposed scheme, features are first extracted and evaluated, and the feature set is then filtered for categorization of domain web pages.
A feature extraction tool (FET) based on the HTML document object model (DOM) of the web page is developed in the proposed scheme. In the first technique, binary classification, feature extraction and weight assignment are based on a domain-specific keyword list compiled from domain pages such as course, student, and faculty pages. The keyword list is reduced on the basis of the ids of the keywords in the list, and keywords and tag text are stemmed to achieve higher accuracy. An extensive feature set is generated to develop a robust classification technique. The technique was evaluated using a machine learning method in combination with feature extraction and statistical analysis, with a support vector machine kernel as the classification tool. In the second technique, multiclass classification, on-page personality feature sets are extracted and weights are assigned based on feature frequency in the web document for each domain. A combined feature set is proposed. Algorithms are designed, tested, and validated on data sets collected from different domain categories such as E-Newspaper, Education, Research, Online shopping, and Resume. The results show that the proposed classifier successfully classified news, education, resume, online shopping, and research web pages from a large database repository. The accuracy of the proposed classifier is found to be satisfactory on a large data set of different categories. The proposed scheme also shows a 10-15% overall performance gain in comparison with other existing schemes. The results confirm the effectiveness of the proposed scheme in terms of its accuracy across different categories of web pages.|
|Appears in Collections:||Doctoral Theses@CSED|
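The abstract describes DOM-based feature extraction, stemming of keywords and tag text, and frequency-based weight assignment against domain keyword lists. The following is only a minimal sketch of that general idea, not the thesis's feature extraction tool: the tag weights, the keyword lists, the `naive_stem` helper, and the summed score are all hypothetical stand-ins chosen for illustration. In the thesis the weighted features are fed to a support vector machine, which this sketch does not include.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: text inside these tags counts more than body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}


def naive_stem(word):
    """Crude suffix stripper, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


# Hypothetical domain keyword lists, stemmed the same way as the page text.
DOMAIN_KEYWORDS = {
    domain: {naive_stem(w) for w in words}
    for domain, words in {
        "course": ["course", "syllabus", "lecture", "exam"],
        "faculty": ["professor", "research", "publication", "faculty"],
    }.items()
}


class TagTextExtractor(HTMLParser):
    """Collects stemmed tokens, weighted by the tag that encloses them."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.features = Counter()  # stem -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else ""
        if tag in SKIPPED_TAGS:
            return
        weight = TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        for token in re.findall(r"[a-z]+", data.lower()):
            self.features[naive_stem(token)] += weight


def extract_features(html):
    """Return a Counter mapping stemmed tokens to tag-weighted frequencies."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


def score_domains(features):
    """Sum the weights of each domain's keywords found on the page."""
    return {
        domain: sum(features[k] for k in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }


if __name__ == "__main__":
    page = (
        "<title>Course syllabus</title><h1>Lectures</h1>"
        "<p>Exams and lectures by the professor</p>"
    )
    print(score_domains(extract_features(page)))
```

Under these assumptions, the per-domain scores (or the full weighted feature vector) would serve as input to a trained classifier; picking the highest score directly is only a placeholder for that step.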