AgentScape is meant to be deployed in
large-scale environments, where both the number of hosts and the number
of agents involved are huge. This greatly increases the capacity of the
supported application in terms of quantity and diversity of available
resources. One drawback, however, is that the more hardware and software
entities get involved in a common computation, the more likely it
becomes that one of these entities is going to fail at some point.
Therefore it appears reasonable to integrate fault tolerance mechanisms
in the architecture of AgentScape.
The fault tolerance solutions considered for AgentScape are inspired
by those implemented in the DARX framework. DARX (Dynamic Agent
Replication eXtension) is based on
the idea that fault tolerance mechanisms are costly, even more so in a
large-scale environment; this may render the computation so time- and
resource-consuming that merely running it would be pointless. To work
around this problem, it is assumed that at any point of the computation,
only a small part of the application should be considered as critical.
For instance, from the complete set of agents which form an agent
application, a small subset may be identified as the agents whose
failure would jeopardise the continuity and correct termination of the
application. Given this fundamental assumption, DARX provides several
services aimed at enabling fault tolerance and its automation with
respect to agent computing. These services comprise:
- A failure detection service which maintains dynamic lists of all the
running hosts involved as well as of the valid replicas which
participate in the supported application, and notifies the latter of
suspected failure occurrences.
- A naming and localisation
service which generates a unique identifier for every replica in
the system, and returns the addresses for all the replicas of the same
group in response to an agent localisation request.
- A system observation service
which monitors the behaviour of the underlying distributed system: it
collects low-level data by means of OS-compliant probes and disseminates
processed trace information so as to make it available for the decision
processes which take place in DARX.
- An analysis service which
builds a global representation of the supported agent application in
terms of fault tolerance requirements.
- A replication service
which brings all the necessary mechanisms for replicating agents,
maintaining the consistency between replicas of the same agent and
adapting the replication scheme for every agent according to the data
gathered through system monitoring and application analysis.
- An adaptation service
which offers wrapper-making solutions for Java-based agents, thus
rendering the DARX middleware usable by various multi-agent systems and
even making it possible to introduce interoperability amongst different
systems.
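To make the first two services more concrete, here is a minimal sketch in Python. It is not the DARX API: the class names NamingService and FailureDetector, the method names, the identifier format and the heartbeat-timeout rule are all assumptions chosen for illustration. The source only states that the naming service generates a unique identifier per replica and returns the addresses of a replica group, and that the failure detection service notifies participants of suspected failures.

```python
import itertools


class NamingService:
    """Hypothetical registry: one unique identifier per replica, grouped by agent."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._groups = {}  # agent name -> {replica id: address}

    def register(self, agent, address):
        # The "agent#n" identifier format is an assumption, not the DARX format.
        replica_id = f"{agent}#{next(self._counter)}"
        self._groups.setdefault(agent, {})[replica_id] = address
        return replica_id

    def locate(self, agent):
        """Return the addresses of all replicas in the agent's group."""
        return sorted(self._groups.get(agent, {}).values())


class FailureDetector:
    """Hypothetical heartbeat-timeout detector; a suspicion may be wrong."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}    # host -> time of the last heartbeat received
        self._suspected = set()
        self._listeners = []    # callbacks notified of each new suspicion

    def subscribe(self, callback):
        self._listeners.append(callback)

    def heartbeat(self, host, now):
        # A fresh heartbeat clears any earlier suspicion about the host.
        self._last_seen[host] = now
        self._suspected.discard(host)

    def check(self, now):
        """Suspect every host that has been silent for longer than the timeout."""
        for host, seen in self._last_seen.items():
            if now - seen > self.timeout and host not in self._suspected:
                self._suspected.add(host)
                for notify in self._listeners:
                    notify(host)
        return sorted(self._suspected)
```

Time is passed in explicitly as `now` so that the sketch stays deterministic; a real detector would read a clock and would have to cope with variable network latency.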
DARX has been implemented: it currently provides support for adaptive
fault tolerance based on replication. Agents can be
replicated/unreplicated on the fly, the strategies applied in order to
maintain consistency amongst replicas can be combined into hybrid
replication schemes, and the dynamic modification of these schemes is
fully automated with respect to both the application semantics and the
monitored system characteristics.
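The adaptation loop described above can be illustrated with a small Python sketch. Everything in it is hypothetical: the names ReplicationScheme, choose_scheme and ReplicatedAgent, the "passive" and "active" strategy labels, the risk formula and the thresholds are assumptions made for the example, not DARX internals. It only shows the general idea of recomputing an agent's replication scheme from application-level criticality and monitored system data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationScheme:
    replicas: int   # total number of copies, including the original agent
    strategy: str   # "none", "passive" or "active" (illustrative labels)


def choose_scheme(criticality, host_failure_rate):
    """Map a criticality in [0, 1] and an observed failure rate to a scheme.

    The risk formula and the thresholds are illustrative assumptions only.
    """
    if not 0.0 <= criticality <= 1.0:
        raise ValueError("criticality must lie in [0, 1]")
    risk = criticality * (1.0 + host_failure_rate)
    if risk < 0.3:
        return ReplicationScheme(1, "none")
    if risk < 0.7:
        return ReplicationScheme(2, "passive")
    return ReplicationScheme(3, "active")


class ReplicatedAgent:
    """Hypothetical agent wrapper whose scheme can change at runtime."""

    def __init__(self, name):
        self.name = name
        self.scheme = ReplicationScheme(1, "none")  # start unreplicated

    def adapt(self, criticality, host_failure_rate):
        """Recompute the scheme; return True if it changed."""
        new_scheme = choose_scheme(criticality, host_failure_rate)
        changed = new_scheme != self.scheme
        self.scheme = new_scheme
        return changed
```

Calling `adapt` again with a lower criticality brings the agent back to a single copy, which corresponds to unreplicating it on the fly.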
Challenges
Evidently, reusing DARX as is alongside AgentScape is far from an
efficient solution. For a start, several services are bound to overlap,
or even to conflict: both systems have their own naming services as
well as their own observation services. The other main issue is
performance: DARX is not an agent platform; instead, it is meant
as a generic middleware for supporting agent platforms. Both these
considerations seem to call for a more complex approach: that of
integrating DARX mechanisms within AgentScape.
Besides the obvious adaptation work that the above-mentioned
integration will entail, the effort put into incorporating DARX within
AgentScape raises two main research issues:
- DARX can be seen as a first step towards autonomic computing:
fault tolerance schemes are adaptively applied to agents at runtime, and
the adaptation itself is fully automated. Yet a good deal of
participation from the user is still required: (i) only the agent
application developer really knows which agent is to be considered
critical and at what point of the computation it will become so; (ii) the
mapping between how critical an agent is and which type of fault
tolerance scheme is applied to it remains to be researched extensively.
As an agent platform, AgentScape may be able to provide the missing
paradigms for further automation of various services: agent criticality
evaluation, system monitoring, failure detection, fault tolerance, ...
Additionally, AgentScape comprises a resource management service based
on asynchronous leasing. For instance, services such as fault tolerance
could be seen as particular resources (shared components) leased on
pre-determined conditions (secure usage authorisations, agent priority
level, ...).
- One way of making the integration effective could be to remodel
DARX services as components which would then be plugged into the
AgentScape middleware. This can lead to analysing the constraints which
are specific to components explicitly designed for large-scale
environments. It follows that core methodologies for designing such
components could then be derived from the experience gained.
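The idea raised in the first issue, that fault tolerance could be treated as a shared resource leased on pre-determined conditions, can be sketched as follows in Python. This is not the AgentScape resource management API: the names Lease and LeaseManager, the capacity limit, the priority threshold and the fixed lease duration are assumptions for illustration, and the sketch is synchronous for simplicity whereas the source describes asynchronous leasing.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    agent: str
    service: str
    expires_at: float

    def is_valid(self, now):
        return now < self.expires_at


class LeaseManager:
    """Hypothetical broker that leases one shared service to agents."""

    def __init__(self, service, capacity, min_priority, duration):
        self.service = service
        self.capacity = capacity          # maximum number of concurrent leases
        self.min_priority = min_priority  # assumed pre-determined condition
        self.duration = duration          # lease lifetime, in time units
        self._leases = {}                 # agent name -> Lease

    def _expire(self, now):
        # Drop leases that have run out so that their slots become free.
        self._leases = {
            agent: lease
            for agent, lease in self._leases.items()
            if lease.is_valid(now)
        }

    def request(self, agent, priority, authorised, now):
        """Grant or renew a lease, or return None if a condition fails."""
        self._expire(now)
        if not authorised or priority < self.min_priority:
            return None
        if agent in self._leases:
            # Renewal: extend the existing lease instead of using a new slot.
            self._leases[agent].expires_at = now + self.duration
            return self._leases[agent]
        if len(self._leases) >= self.capacity:
            return None
        lease = Lease(agent, self.service, now + self.duration)
        self._leases[agent] = lease
        return lease
```

Because a lease expires unless it is renewed, an agent that crashes or loses its priority releases the fault tolerance service without any explicit clean-up step, which is one reason leasing suits large-scale settings.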