Fault Tolerance in Computational Grids
Abstract
The Grid is rapidly emerging as the means for coordinated resource sharing and
problem solving in multi-institutional virtual organizations, providing
dependable, consistent, pervasive access to global resources. The emergence of
computational Grids, and their potential for seamless aggregation of and interaction
between distributed services and resources, has led to the start of a new era of
computing. The very large number and heterogeneous nature of grid
computing resources make resource management a significant challenge.
Resource management scenarios often include resource discovery, resource
monitoring, resource inventories, resource provisioning, fault isolation, a variety of
autonomic capabilities, and service level management activities. Of these scenarios,
fault tolerance is one of the main research areas. The probability of a fault occurring
increases as the number of resources in the grid increases. To date, no
system is fully fault tolerant.
In this research, our main focus is on the development of a fault tolerance system for
computational grids. For this, we set up a computational grid based on the Alchemi
middleware. Alchemi is a .NET-based grid computing framework that provides the
runtime machinery and programming environment required to construct a
computational grid. After setting up the grid environment, we studied the existing fault
tolerance mechanisms in Alchemi in detail and ascertained the frequent causes of failure in
it.
To address some of the identified deficiencies, we propose a backup manager
concept. The backup manager uses heartbeating and a replication-based fault tolerance
technique to monitor the central manager. If the central manager fails, the
backup manager takes over its role and prevents the grid from failing.
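
The heartbeat-and-replication idea described above can be illustrated with a minimal Python sketch. It is not Alchemi's API and not the implementation developed in this work. The class name BackupManager, its methods, the timeout value, and the dictionary standing in for replicated manager state are all hypothetical, chosen only to show the control flow: each heartbeat refreshes a timestamp and carries a state snapshot, and the backup promotes itself once the central manager has been silent for longer than the timeout.

```python
import time
from typing import Callable, Dict


class BackupManager:
    """Hypothetical backup manager that watches a central manager's heartbeats."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout            # seconds of silence tolerated (assumed value)
        self.clock = clock                # injectable clock, so the logic is testable
        self.last_beat = clock()          # time of the most recent heartbeat
        self.replica: Dict[str, str] = {} # latest copy of the manager's state
        self.active = False               # True once this backup has taken over

    def on_heartbeat(self, state: Dict[str, str]) -> None:
        """Record a heartbeat and replicate the state snapshot it carries."""
        self.last_beat = self.clock()
        self.replica = dict(state)

    def check(self) -> bool:
        """Take over if no heartbeat has arrived within the timeout."""
        if not self.active and self.clock() - self.last_beat > self.timeout:
            self.active = True            # failover: resume from self.replica
        return self.active


if __name__ == "__main__":
    now = [0.0]                           # simulated clock for the demonstration
    backup = BackupManager(timeout=3.0, clock=lambda: now[0])
    backup.on_heartbeat({"job-1": "running"})
    now[0] = 2.0
    print(backup.check())                 # still within the timeout
    now[0] = 6.0
    print(backup.check())                 # manager silent too long, backup is active
```

In this sketch the timeout trades detection speed against false failovers: a short timeout reacts quickly but may promote the backup during a transient network delay, while a long timeout leaves the grid without a manager for longer.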
