Modified Algorithm for Data Cleaning of Log File Using File Extensions in Web Usage Mining

Abstract

Web pages typically contain a large amount of information that is not part of their main content, e.g. banner ads, navigation bars, and copyright notices. Such noise usually leads to poor results in Web Mining, which depends mainly on the page content. This thesis focuses on the problem of web cleaning, i.e. the preprocessing of web pages to automatically detect and eliminate noise for Web Mining. Web Usage Mining is the subfield of Web Mining that deals with the discovery and analysis of usage patterns from web data, specifically web logs, in order to improve web-based applications. The Web Usage Mining process consists of three phases: data preprocessing, pattern discovery, and pattern analysis. Preprocessing cleans up the data (the server log file) by filtering out the automatic requests generated by a web page, which were not specifically requested by the user. In addition to eliminating these irrelevant automatic requests, it is necessary to remove non-human access behaviour (e.g. spiders, crawlers, and automatic web bots) from the web log file. Another type of inaccuracy that must be removed from the log file is the set of entries related to error requests. Preprocessing results also strongly influence the later phases of Web Usage Mining, which makes the preprocessing of server log files a significant step: the data is preprocessed to improve the efficiency and ease of the mining process. Three algorithms are discussed in this thesis. The first algorithm separates the data fields of the server log entries. The second algorithm stores these separated fields in a relational database. The third algorithm presents the cleaning technique: it extracts data from the relational database created by the second algorithm and filters it by eliminating extraneous and irrelevant entries. The output of this algorithm is a clean log file consisting of the data necessary from the Web Usage Mining perspective.
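The three-step pipeline described above — separating log fields, storing them in a relational database, and filtering out noise entries — can be sketched as follows. This is a minimal illustration rather than the thesis's actual algorithms: it assumes entries in the Common Log Format, and the lists of noise file extensions and robot signatures are illustrative assumptions, not the thesis's chosen sets.

```python
import re
import sqlite3

# Regex for a Common Log Format entry (assumed format; real server logs vary).
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+'
)

# Illustrative extensions of automatically requested embedded resources.
NOISE_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png', '.css', '.js', '.ico')

def parse_entry(line):
    """Algorithm 1 (sketch): separate the data fields of one log entry."""
    m = CLF_PATTERN.match(line)
    return m.groupdict() if m else None

def store_entries(entries, db_path=':memory:'):
    """Algorithm 2 (sketch): store the separated fields in a relational table."""
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS log '
                '(host TEXT, time TEXT, method TEXT, url TEXT, status TEXT)')
    con.executemany('INSERT INTO log VALUES (?, ?, ?, ?, ?)',
                    [(e['host'], e['time'], e['method'], e['url'], e['status'])
                     for e in entries])
    con.commit()
    return con

def is_relevant(entry):
    """Algorithm 3 (sketch): keep only successful, user-requested page hits."""
    if entry is None:
        return False                        # malformed line
    url = entry['url'].split('?')[0].lower()
    if url.endswith(NOISE_EXTENSIONS):      # automatic embedded requests
        return False
    if not entry['status'].startswith('2'): # error / failed requests
        return False
    if url == '/robots.txt':                # typical robot access signature
        return False
    return True

def clean_log(lines):
    """Parse every raw line and retain only the relevant entries."""
    return [e for e in (parse_entry(l) for l in lines) if is_relevant(e)]
```

Given a list of raw log lines, `clean_log` yields only the entries that matter for usage mining, and `store_entries` keeps them queryable via SQL — e.g. an image request or a 404 entry is dropped, while a successful page request survives.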
