Intelligent Interactive Distributed Systems


Project: Fault Tolerance

AgentScape Fault Tolerance Management Service

Overview

AgentScape is meant to be deployed in large-scale environments, where both the number of hosts and the number of agents involved are huge. This greatly increases the capacity of the supported application in terms of the quantity and diversity of available resources. One drawback, however, is that the more hardware and software entities take part in a common computation, the more likely it becomes that one of them will fail at some point. It is therefore reasonable to integrate fault tolerance mechanisms into the architecture of AgentScape.

The fault tolerance solutions considered for AgentScape are inspired by those implemented in the DARX framework. DARX (Dynamic Agent Replication eXtension) is based on the idea that fault tolerance mechanisms are costly, even more so in a large-scale environment; this may render the computation so time- and resource-consuming that merely running it would be pointless. To work around this problem, it is assumed that at any point of the computation, only a small part of the application should be considered critical. For instance, from the complete set of agents which form an agent application, a small subset may be identified as the agents whose failure would jeopardise the continuity and correct termination of the application. Given this fundamental assumption, DARX provides several services aimed at supporting fault tolerance, and its automation, for agent computing. These services comprise:
  • A failure detection service which maintains dynamic lists of all the running hosts involved as well as of the valid replicas which participate in the supported application, and notifies the latter of suspected failure occurrences.
  • A naming and localisation service which generates a unique identifier for every replica in the system, and returns the addresses of all the replicas of the same group in response to an agent localisation request.
  • A system observation service which monitors the behaviour of the underlying distributed system: it collects low-level data by means of OS-compliant probes and disseminates processed trace information so as to make it available to the decision processes which take place in DARX.
  • An analysis service which builds a global representation of the supported agent application in terms of fault tolerance requirements.
  • A replication service which provides all the necessary mechanisms for replicating agents, maintaining consistency between replicas of the same agent and adapting the replication scheme for every agent according to the data gathered through system monitoring and application analysis.
  • An adaptation service which offers wrapper-making solutions for Java-based agents, thus rendering the DARX middleware usable by various multi-agent systems and even making it possible to introduce interoperability amongst different systems.
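
To make the first of these services more concrete, the following is a minimal sketch of a timeout-based failure detector that keeps a dynamic list of running hosts and notifies a listener of suspected failures. It is an illustration only and not the DARX implementation: the class name `HeartbeatFailureDetector`, its methods, the fixed timeout and the host names are all assumptions made for this example. Time is passed in explicitly so that the behaviour is deterministic.

```python
from typing import Callable, Dict, List, Set


class HeartbeatFailureDetector:
    """Timeout-based failure detector (hypothetical sketch, not the DARX API)."""

    def __init__(self, timeout: float, on_suspect: Callable[[str], None]) -> None:
        self.timeout = timeout                  # seconds of silence before suspicion
        self.on_suspect = on_suspect            # callback notified of each new suspect
        self.last_seen: Dict[str, float] = {}   # host -> time of last heartbeat
        self.suspected: Set[str] = set()

    def heartbeat(self, host: str, now: float) -> None:
        """Record a heartbeat; a suspected host that is heard again is trusted again."""
        self.last_seen[host] = now
        self.suspected.discard(host)

    def check(self, now: float) -> List[str]:
        """Suspect every host silent for longer than the timeout; return new suspects."""
        newly_suspected = []
        for host, seen in self.last_seen.items():
            if host not in self.suspected and now - seen > self.timeout:
                self.suspected.add(host)
                newly_suspected.append(host)
                self.on_suspect(host)
        return newly_suspected

    def running_hosts(self) -> List[str]:
        """Hosts that are known and not currently suspected."""
        return sorted(h for h in self.last_seen if h not in self.suspected)


# Usage: host-b stops sending heartbeats and becomes suspected.
notified: List[str] = []
detector = HeartbeatFailureDetector(timeout=5.0, on_suspect=notified.append)
detector.heartbeat("host-a", now=0.0)
detector.heartbeat("host-b", now=0.0)
detector.heartbeat("host-a", now=4.0)
new_suspects = detector.check(now=6.0)
```

Failures are only ever suspected here, never confirmed, which matches the wording of the service description: a host that is merely slow is removed from the suspect set as soon as its next heartbeat arrives.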

DARX has been implemented: it currently provides support for adaptive fault tolerance based on replication. Agents can be replicated or unreplicated on the fly, the strategies applied to maintain consistency amongst replicas can be combined into hybrid replication schemes, and the dynamic modification of these schemes is fully automated with respect to both the application semantics and the monitored system characteristics.
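
As a rough illustration of what adapting a replication scheme to both application semantics and system characteristics might look like, the sketch below maps an agent's criticality and an observed network latency to a replication policy. Everything in it is assumed for the sake of the example: the `ReplicationPolicy` type, the `choose_policy` function, the strategy names "passive" and "active", and in particular the numeric thresholds, which are arbitrary and do not come from DARX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    replicas: int   # number of copies, including the original agent
    strategy: str   # "none", "passive" or "active" (assumed strategy names)


def choose_policy(criticality: float, network_latency_ms: float) -> ReplicationPolicy:
    """Pick a policy from application-level and system-level inputs.

    criticality: 0.0 (expendable) to 1.0 (failure would stop the application).
    network_latency_ms: observed latency between candidate replica hosts.
    All thresholds below are arbitrary placeholders.
    """
    if not 0.0 <= criticality <= 1.0:
        raise ValueError("criticality must lie in [0, 1]")
    if criticality < 0.3:
        # Non-critical agents are not replicated at all.
        return ReplicationPolicy(replicas=1, strategy="none")
    if criticality < 0.7 or network_latency_ms > 100.0:
        # Moderately critical agents, or a slow network, get a cheaper scheme.
        return ReplicationPolicy(replicas=2, strategy="passive")
    # Highly critical agents on a fast network get the most costly scheme.
    return ReplicationPolicy(replicas=3, strategy="active")


low = choose_policy(criticality=0.1, network_latency_ms=20.0)
high_fast = choose_policy(criticality=0.9, network_latency_ms=20.0)
high_slow = choose_policy(criticality=0.9, network_latency_ms=250.0)
```

Re-evaluating such a function whenever the monitored inputs change is one simple way to obtain on-the-fly adaptation; how criticality should actually map to a scheme is left open here.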

Challenges

Evidently, reusing DARX as is alongside AgentScape is far from an efficient solution. For a start, several services are bound to overlap, or even to conflict: both systems have their own naming service as well as their own observation service. The other main issue is performance: DARX is not an agent platform; it is meant as a generic middleware for supporting agent platforms. Both considerations call for a more complex approach: that of integrating DARX mechanisms within AgentScape.

Besides the obvious adaptation work that the above-mentioned integration will entail, the effort put into incorporating DARX within AgentScape raises two main research issues:
  1. DARX can be seen as a first step towards autonomic computing: fault tolerance schemes are adaptively applied to agents at runtime, and the adaptation itself is fully automated. Yet a good deal of participation from the user is still required: (i) only the agent application developer really knows which agent is to be considered critical and at what point of the computation it will become so, (ii) the mapping between how critical an agent is and which type of fault tolerance scheme is applied to it remains to be researched extensively. As an agent platform, AgentScape may be able to provide the missing paradigms for further automation of various services: agent criticality evaluation, system monitoring, failure detection, fault tolerance, and so on. Additionally, AgentScape comprises a resource management service based on asynchronous leasing. For instance, services such as fault tolerance could be seen as particular resources (shared components) leased on pre-determined conditions (secure usage authorisations, agent priority level, etc.).
  2. One way of making the integration effective could be to remodel DARX services as components which would then be plugged into the AgentScape middleware. This can lead to analysing the constraints which are specific to components explicitly designed for large-scale environments. Core methodologies for designing such components could then be derived from the experience gained.
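
The idea raised in the first issue, of treating fault tolerance as a resource leased under pre-determined conditions, can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions and does not reflect AgentScape's actual leasing service: the `Lease` and `FaultToleranceLeaseManager` names, the fixed capacity, the priority threshold and the lease duration are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Lease:
    agent_id: str
    expires_at: float   # time after which the agent is no longer protected


class FaultToleranceLeaseManager:
    """Grants time-limited access to a shared fault tolerance service (hypothetical)."""

    def __init__(self, capacity: int, min_priority: int, duration: float) -> None:
        self.capacity = capacity          # maximum number of simultaneous leases
        self.min_priority = min_priority  # lowest agent priority that may lease
        self.duration = duration          # lease lifetime in seconds
        self.leases: Dict[str, Lease] = {}

    def _drop_expired(self, now: float) -> None:
        self.leases = {a: l for a, l in self.leases.items() if l.expires_at > now}

    def request(self, agent_id: str, authorised: bool, priority: int,
                now: float) -> Optional[Lease]:
        """Return a lease if all conditions hold, otherwise None."""
        self._drop_expired(now)
        if not authorised or priority < self.min_priority:
            return None   # usage authorisation or priority condition not met
        if agent_id not in self.leases and len(self.leases) >= self.capacity:
            return None   # the shared service is fully booked
        lease = Lease(agent_id, now + self.duration)   # new lease or renewal
        self.leases[agent_id] = lease
        return lease

    def is_protected(self, agent_id: str, now: float) -> bool:
        lease = self.leases.get(agent_id)
        return lease is not None and lease.expires_at > now


manager = FaultToleranceLeaseManager(capacity=1, min_priority=5, duration=10.0)
granted = manager.request("agent-1", authorised=True, priority=7, now=0.0)
denied_low_priority = manager.request("agent-2", authorised=True, priority=2, now=1.0)
denied_full = manager.request("agent-3", authorised=True, priority=9, now=1.0)
granted_later = manager.request("agent-3", authorised=True, priority=9, now=11.0)
```

Because leases expire unless renewed, protection is withdrawn automatically from agents that stop asking for it, which frees the shared service for agents that have become critical in the meantime.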


For comments on this web-site, please contact: Michel Oey