A framework for association rule mining of distributed data
Abstract
The exponential rise in collected data has created an essential need for new
techniques that can convert this huge amount of data into useful knowledge. Consequently,
Data Mining (DM) has become a powerful technology that focuses on the most important
information in massive data. DM extracts interesting data patterns from large
databases using computational techniques and tools. The classical central data
warehouse (DW) based DM approach is ineffective or infeasible because of the heavy
storage, computational and communication costs involved in managing data from
ever-increasing and privacy-sensitive distributed resources. Distributed Data Mining
(DDM) has emerged as an active sub-area of DM research. DDM is concerned with the
application of classical DM procedures in a distributed computing environment to
effectively utilize the available resources.
Association Rules (ARs) are used to discover the associations among frequent itemsets
in a database. Association Rule Mining (ARM) is today one of the most important DM
tasks. In ARM, all the strong association rules are generated from the frequent
itemsets. Distributed Association Rule Mining (DARM) generates the globally strong
association rules from the global frequent itemsets in a distributed environment for
global decision making.
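As a minimal sketch of the idea in this paragraph, the following Python example merges itemset counts from several sites into global counts and keeps the rules that are strong globally, using the conventional definition of a strong rule as one that meets a minimum support and a minimum confidence. The site data, thresholds and function names are hypothetical and chosen only for illustration; the brute-force counting shown here is not the method used in the thesis.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, max_size=2):
    """Count every itemset of up to max_size items at one site."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[frozenset(itemset)] += 1
    return counts


def global_strong_rules(sites, min_support, min_confidence, max_size=2):
    """Merge per-site counts, then keep rules that are strong globally."""
    total = sum(len(s) for s in sites)
    merged = Counter()
    for s in sites:
        merged.update(local_counts(s, max_size))

    # Global frequent itemsets: global support meets the threshold.
    frequent = {i: c for i, c in merged.items() if c / total >= min_support}

    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                confidence = count / merged[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, count / total, confidence))
    return rules


# Hypothetical data held at two distributed sites.
site_a = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
site_b = [["milk"], ["bread", "milk"], ["butter"]]

for lhs, rhs, support, confidence in global_strong_rules(
    [site_a, site_b], min_support=0.5, min_confidence=0.7
):
    print(sorted(lhs), "->", sorted(rhs), support, confidence)
```

Only counts cross site boundaries in this sketch, never raw transactions. A practical system would prune candidate itemsets rather than count all of them.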
Agent mining, also known as Agent enriched DM, is an emerging interdisciplinary area
that integrates agent technology, DM and machine learning. Most of the existing agent
based frameworks for the DARM task are only prototype models and lack an appropriate
underlying Agent Execution Environment (AEE), scalability, privacy preserving
techniques, global knowledge, and implementation using real datasets, especially in the
bio-informatics domain. Bio-informatics, or computational molecular biology, aims at the
automated analysis and management of high-throughput biological data as well as the
modeling and simulation of complex biological systems.
Mining ARs from frequent itemsets requires a transactional dataset, which can be a
real transactional dataset from a retail industry or a synthetic one generated by a
tool. A software tool called Transactional Dataset Generator (TDSG) has been designed
and implemented in Java for generating synthetic datasets.
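To illustrate what a synthetic transactional dataset looks like, here is a minimal Python sketch of a generator. It is not TDSG, which the abstract says is written in Java. The function name, its parameters and the choice of a normal distribution for transaction length are all assumptions made for this example.

```python
import random


def generate_transactions(n_transactions, n_items, avg_length, seed=0):
    """Return a list of transactions, each a sorted list of distinct item IDs."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    items = list(range(1, n_items + 1))
    dataset = []
    for _ in range(n_transactions):
        # Draw a length around avg_length, clamped to the range [1, n_items].
        length = min(n_items, max(1, round(rng.gauss(avg_length, 1.0))))
        dataset.append(sorted(rng.sample(items, length)))
    return dataset


# Hypothetical usage: 5 transactions over 10 items, about 3 items each.
for transaction in generate_transactions(5, 10, 3, seed=42):
    print(" ".join(str(item) for item in transaction))
```

Fixing the seed makes repeated runs produce the same dataset, which helps when comparing mining approaches on identical input.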
The traditional central DW based approach for ARM is practically investigated with the
help of a client-server based framework. The overall response time for the ARM task
performed using this approach is also formulated. The outcome of this approach suggested
the use of agent technology for the DARM task to address the issues of scalability and
global knowledge extraction.
An AEE is designed and implemented that acts as a distributed server application
for managing a multi-agent system (MAS) for the DARM task. It provides the appropriate
functionality for Mobile Agents (MAs) to execute, communicate, migrate to other
platforms, manage their itineraries and use system resources.
A scalable MAS called Agent enriched Mining of Globally Strong Association Rules
(AeMGSAR), which acts as a framework for the DARM task, is designed and implemented
using two computing models. In the serial computing model, MAs visit n distributed sites
serially and perform their designated tasks. The global knowledge and performance of
this system are compared with the traditional central DW based ARM approach. The serial
itinerary used for MAs increases the overall cost of the DARM task, so a parallel
computing model is designed. Clones of MAs in the parallel computing model visit n
distributed sites in parallel, and it is found that the overall response time for the
DARM task involving n distributed sites is much lower in the parallel computing model of
AeMGSAR. The comparative analysis on various parameters reveals that the proposed
AeMGSAR framework has improved features and exhibits superior performance compared with
the existing agent based DARM frameworks.
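A rough way to see why the parallel itinerary can lower response time is the simplified cost model sketched below in Python. It is an assumption made for illustration, not the response time formulation derived in the thesis: every hop between platforms is given the same hypothetical cost `hop_time`, and the costs of cloning agents and aggregating results are ignored.

```python
def serial_response_time(site_times, hop_time):
    """One agent visits every site in turn, then returns home."""
    # n sites need n hops outward along the itinerary plus 1 hop back.
    return sum(site_times) + hop_time * (len(site_times) + 1)


def parallel_response_time(site_times, hop_time):
    """One clone per site; all clones travel and work at the same time."""
    # The slowest clone determines the total: hop out, work, hop back.
    return max(site_times) + 2 * hop_time


# Hypothetical per-site mining times (seconds) for n = 3 sites.
site_times = [4.0, 6.0, 5.0]
print(serial_response_time(site_times, hop_time=1.0))
print(parallel_response_time(site_times, hop_time=1.0))
```

Under these assumptions the serial cost grows with the sum of the per-site times, while the parallel cost is bounded by the slowest site.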
As mining biological data is an emerging area at the intersection of DDM and
bio-informatics, we have also taken the case of DARM in bio-informatics and designed
another version of this framework called Agent enriched Quantitative Association Rules
Mining for Amino Acids in distributed Protein Data Banks (AeQARM-AAPDB) for mining
the quantitative ARs for amino acids in proteins. Experimental tests on real data have
confirmed its effectiveness. A comparative analysis on various parameters shows that the
proposed system outperforms the existing model.
This thesis may be considered as an approach that advocates the integration of MAS
and DM, especially in bio-informatics. A scalable agent based framework for ARM of
distributed data has been designed and implemented, and further enhanced as a case
study in bio-informatics.
Description
Doctor of Philosophy-Computer Science-Thesis
