A framework for association rule mining of distributed data

Abstract

The exponential growth in collected data has created an essential need for new techniques that can convert this huge amount of data into useful knowledge. Consequently, Data Mining (DM) has become a powerful technology that focuses on the most important information in massive data. DM extracts interesting patterns from large databases using computational techniques and tools. The classical approach to DM, based on a central data warehouse (DW), is ineffective or infeasible because of the heavy storage, computational and communication costs involved in managing data from ever-increasing and privacy-sensitive distributed resources. Distributed Data Mining (DDM) has emerged as an active sub-area of DM research. DDM is concerned with applying classical DM procedures in a distributed computing environment so as to utilize the available resources effectively. Association Rules (ARs) are used to discover associations among frequent itemsets in a database. Association Rule Mining (ARM) is today one of the most important DM tasks. In ARM, all the strong association rules are generated from the frequent itemsets. Distributed Association Rule Mining (DARM) generates the globally strong association rules from the global frequent itemsets in a distributed environment for global decision making. Agent mining, also known as agent-enriched DM, is an emerging interdisciplinary area that integrates agent technology, DM and machine learning. Most existing agent-based frameworks for the DARM task are only prototype models and lack an appropriate underlying Agent Execution Environment (AEE), scalability, privacy-preserving techniques, global knowledge and implementation using real datasets, especially in the bio-informatics domain. Bio-informatics, or computational molecular biology, aims at automated analysis and management of high-throughput biological data as well as modeling and simulation of complex biological systems.
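As a minimal illustration of the ARM step described above (strong rules generated from frequent itemsets), the following Python sketch uses a brute-force search over a toy dataset. It is not the thesis's implementation: the function names, the example transactions and the thresholds are all assumptions chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    Brute-force enumeration, adequate only for toy data (illustrative assumption).
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count / n >= min_support:
                result[cset] = count / n
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size exists.
            break
    return result


def strong_rules(freq, min_confidence):
    """Return (antecedent, consequent, support, confidence) for each strong rule."""
    rules = []
    for itemset, support in freq.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so freq[a] exists.
                confidence = support / freq[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, support, confidence))
    return rules


# Hypothetical retail-style transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
freq = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(freq, min_confidence=0.8)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

With these assumed thresholds, the only strong rule is butter -> bread, with support 0.5 and confidence 1.0.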
Mining ARs from frequent itemsets requires a transactional dataset, which can be a real transactional dataset from a retail business or a synthetic one generated by a tool. A software tool called Transactional Dataset Generator (TDSG) has been designed and implemented in Java for generating synthetic datasets. The traditional central DW based approach to ARM is investigated in practice with the help of a client-server framework, and the overall response time of the ARM task performed with this approach is formulated. The outcome suggested the use of agent technology for the DARM task to address scalability and global knowledge extraction. An AEE is designed and implemented that acts as a distributed server application for managing a multi-agent system (MAS) for the DARM task. It provides Mobile Agents (MAs) with the functionality they need to execute, communicate, migrate to other platforms, manage their itinerary and use system resources. A scalable MAS called Agent enriched Mining of Globally Strong Association Rules (AeMGSAR), which acts as a framework for the DARM task, is designed and implemented using two computing models. In the serial computing model, MAs visit n distributed sites one after another and perform their designated tasks. The global knowledge and performance of this system are compared with the traditional central DW based ARM approach. Because the serial itinerary of the MAs increases the overall cost of the DARM task, a parallel computing model is also designed. In the parallel computing model, clones of the MAs visit the n distributed sites in parallel, and the overall response time for the DARM task involving n distributed sites is found to be much lower than in the serial model of AeMGSAR. A comparative analysis on various parameters reveals that the proposed AeMGSAR framework has improved features and exhibits better performance than the existing agent-based DARM frameworks.
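The idea of deriving global frequent itemsets from distributed sites, and the difference between serial and parallel itineraries, can be sketched roughly as follows. This is a hedged illustration only: the abstract does not describe the actual AeMGSAR protocol or its response-time formula, so the count-merging scheme, the idealized timing model, and every name and number below are assumptions.

```python
from collections import Counter


def local_counts(transactions, candidates):
    """Count, at one site, how many transactions contain each candidate itemset."""
    return Counter({c: sum(1 for t in transactions if c <= t) for c in candidates})


def global_frequent(site_reports, min_support):
    """Merge (counts, n_transactions) reports from all sites into global supports."""
    total_counts = Counter()
    total_n = 0
    for counts, n in site_reports:
        total_counts.update(counts)  # adds the per-site counts together
        total_n += n
    return {
        c: count / total_n
        for c, count in total_counts.items()
        if count / total_n >= min_support
    }


def serial_response_time(site_times):
    """Assumed model: one agent visits the sites one after another."""
    return sum(site_times)


def parallel_response_time(site_times):
    """Assumed idealized model: clones run concurrently; the slowest site dominates."""
    return max(site_times)


# Two hypothetical sites and a fixed candidate list.
site_a = [{"bread", "milk"}, {"bread"}]
site_b = [{"milk"}, {"bread", "milk"}, {"butter"}]
candidates = [frozenset({"bread"}), frozenset({"milk"}), frozenset({"bread", "milk"})]

reports = [
    (local_counts(site_a, candidates), len(site_a)),
    (local_counts(site_b, candidates), len(site_b)),
]
global_sets = global_frequent(reports, min_support=0.5)

site_times = [4.0, 6.0, 5.0]  # made-up per-site processing times in seconds
serial_time = serial_response_time(site_times)
parallel_time = parallel_response_time(site_times)
```

Only itemset counts and transaction totals leave each site in this sketch, not the raw transactions, which is one common motivation for distributed mining. A real system would also account for agent migration and result-merging overheads, which the timing model here ignores.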
As mining biological data is an emerging area at the intersection of DDM and bio-informatics, we have also taken up the case of DARM in bio-informatics and designed another version of this framework, called Agent enriched Quantitative Association Rules Mining for Amino Acids in distributed Protein Data Banks (AeQARM-AAPDB), for mining quantitative ARs for amino acids in proteins. Experimental tests on real data have confirmed its effectiveness. A comparative analysis on various parameters shows that the proposed system outperforms the existing model. This thesis may be considered an approach that advocates the integration of MAS and DM, especially in bio-informatics. A scalable agent-based framework for ARM of distributed data has been designed and implemented, and further enhanced as a case study in bio-informatics.

Description

Doctor of Philosophy-Computer Science-Thesis
