Fault Tolerance in Computational Grids

Abstract

The Grid is rapidly emerging as the means for coordinated resource sharing and problem solving in multi-institutional virtual organizations, while providing dependable, consistent, and pervasive access to global resources. The emergence of computational Grids, and the potential for seamless aggregation of and interaction between distributed services and resources, has led to the start of a new era of computing. The very large number and heterogeneous nature of grid computing resources make resource management a significantly challenging job. Resource management scenarios often include resource discovery, resource monitoring, resource inventories, resource provisioning, fault isolation, a variety of autonomic capabilities, and service level management activities. Of these, fault tolerance is one of the main research areas. The probability of a fault occurring increases as the number of resources involved in the grid increases. To date, no system is fully fault tolerant. In this research, our main focus is the development of a fault tolerance system for computational grids. For this, we set up a computational grid based on the Alchemi middleware. Alchemi is a .NET-based grid computing framework that provides the runtime machinery and programming environment required to construct a computational grid. After setting up the grid environment, we studied the existing fault tolerance in Alchemi in detail and ascertained the frequent causes of failure in it. To deal with some of the identified deficiencies, we propose a backup manager concept. The backup manager uses heartbeat- and replication-based fault tolerance techniques to monitor the central manager. If the central manager fails, the backup manager takes over control and prevents the grid from failing.
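
The backup manager idea can be illustrated with a minimal sketch. It is not Alchemi's API and not the authors' implementation: the class `BackupManager`, its methods, the `timeout` parameter, and the dictionary used as replicated state are all hypothetical names chosen for illustration, and Python is used here only for brevity although Alchemi itself is .NET-based. The sketch assumes the central manager periodically sends a heartbeat carrying a copy of its state, and that the backup promotes itself once heartbeats stop for longer than a timeout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BackupManager:
    """Hypothetical backup manager that watches the central manager's heartbeats."""

    timeout: float                          # seconds of silence tolerated before failover
    last_heartbeat: Optional[float] = None  # time of the most recent heartbeat, if any
    replica: Dict[str, str] = field(default_factory=dict)  # replicated manager state
    active: bool = False                    # True once the backup has taken over

    def on_heartbeat(self, now: float, state: Dict[str, str]) -> None:
        """Record a heartbeat and replicate the state it carries."""
        if self.active:
            return  # already promoted; late heartbeats are ignored in this sketch
        self.last_heartbeat = now
        self.replica = dict(state)  # copy, so later changes by the sender do not leak in

    def check(self, now: float) -> bool:
        """Promote the backup if no heartbeat arrived within the timeout."""
        silent_too_long = (
            self.last_heartbeat is not None
            and now - self.last_heartbeat > self.timeout
        )
        if not self.active and silent_too_long:
            self.active = True  # take over, using the last replicated state
        return self.active


# Usage: timestamps are passed in explicitly so the behaviour is deterministic.
backup = BackupManager(timeout=3.0)
backup.on_heartbeat(now=0.0, state={"job-1": "running"})
still_standby = backup.check(now=2.0)  # within the timeout, backup stays passive
took_over = backup.check(now=5.0)      # heartbeat overdue, backup promotes itself
```

Passing the current time as an argument, rather than reading a clock inside the class, is a design choice made to keep the sketch testable; a real deployment would also need to handle network partitions and a recovering central manager, which this sketch deliberately leaves out.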
