In mission-critical computing, the availability of information
and computing resources is extremely important. The availability of a system
is affected by how long it is unavailable after a failure in any of its components.
Different degrees of availability are provided by different types of systems:
Fault-tolerant systems (continuous availability).
These systems use redundant components and specialized logic to ensure continuous
operation and to provide complete data integrity. The degree of availability
on these systems is extremely high, and some of them can also tolerate
outages due to hardware or software upgrades. This
solution is very expensive and requires specialized hardware and software.
Highly available systems. These systems survive single points
of failure by using redundant off-the-shelf components and specialized software.
They provide a lower degree of availability than fault-tolerant systems,
but at much lower cost. Typically these systems provide high availability
only for client/server applications, and base their redundancy on cluster
architectures with shared resources.
The Silicon Graphics® Linux FailSafe product provides a general
facility for highly available services on a cluster that contains
multiple nodes (N-node configuration). Using Linux FailSafe, you can configure
a highly available system in any of the following topologies:
Basic two-node configuration
Ring configuration
Star configuration, in which multiple applications running
on multiple nodes are backed up by one node
Symmetric pool configuration
These configurations provide redundancy of processors and I/O controllers.
Redundancy of storage can be obtained either through the use of multi-hosted
RAID disk devices and mirrored disks or through redundant disk systems that
are kept in synchronization.
If one of the nodes in the cluster or one of the nodes' components fails,
a different node in the cluster restarts the highly available services of
the failed node. To clients, the services on the replacement node are indistinguishable
from the original services before the failure occurred. It appears as if the original
node has crashed and rebooted quickly. The clients notice only a brief interruption
in the highly available service.
In a Linux FailSafe highly available system, nodes can serve as backup
for other nodes. Unlike the backup resources in a fault-tolerant system, which
serve purely as redundant hardware for backup in case of failure, the resources
of each node in a highly available system can be used during normal operation
to run other applications that are not necessarily highly available services.
Each highly available service is owned and accessed by only one node at a time.
Highly available services are monitored by the Linux FailSafe software.
During normal operation, if a failure is detected on any of these services,
a failover process is initiated. This process consists of resetting the failed
node (to ensure data consistency), performing any recovery required by the
failed-over services, and quickly restarting the services on the node that will take
them over. Using Linux FailSafe, you can define a failover policy to establish
which node will take over the services under what conditions.
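The failover sequence described above (reset the failed node, recover the services, restart them on the node chosen by the failover policy) can be sketched in a few lines of Python. This is an illustration only, not Linux FailSafe code: the Node class, the choose_target and fail_over functions, and the representation of a failover policy as an ordered list of node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory model of a cluster node.
    name: str
    healthy: bool = True
    services: list = field(default_factory=list)


def choose_target(policy, nodes, failed_name):
    """Return the first healthy node in the policy's ordered list (assumed policy form)."""
    for name in policy:
        if name != failed_name and nodes[name].healthy:
            return nodes[name]
    raise RuntimeError("no healthy node available for failover")


def fail_over(failed_name, nodes, policy, log):
    """Move every service from the failed node to the node chosen by the policy."""
    failed = nodes[failed_name]
    target = choose_target(policy, nodes, failed_name)
    # Step 1: reset the failed node so it can no longer touch shared data.
    failed.healthy = False
    log.append(f"reset {failed.name}")
    # Steps 2 and 3: recover each service, then restart it on the target node.
    for service in list(failed.services):
        log.append(f"recover {service}")
        failed.services.remove(service)
        target.services.append(service)
        log.append(f"start {service} on {target.name}")
    return target


# Example: node n1 fails; the policy prefers n1, then n2, then n3.
cluster = {
    "n1": Node("n1", services=["nfs", "web"]),
    "n2": Node("n2"),
    "n3": Node("n3"),
}
events = []
new_owner = fail_over("n1", cluster, ["n1", "n2", "n3"], events)
print(new_owner.name, events)
```

The ordered list is only one possible way to express "which node takes over under what conditions"; the actual policy mechanism is defined by the Linux FailSafe configuration, not by this sketch.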
Linux FailSafe supports selective failover, in
which an individual highly available application can be failed over to a backup
node independently of the other highly available applications on that node.
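As a rough illustration of selective failover, the sketch below models the cluster as a mapping from application name to owning node and moves a single application while leaving the others in place. The function name fail_over_application, the placement dictionary, and the node names are hypothetical and exist only for this example.

```python
def fail_over_application(placement, app, backup):
    """Return a new placement in which only `app` has moved to `backup`."""
    if app not in placement:
        raise KeyError(f"unknown application: {app}")
    updated = dict(placement)  # copy, so the original placement is untouched
    updated[app] = backup
    return updated


# Both applications start on node1; only apache is failed over to node2.
before = {"nfs": "node1", "apache": "node1"}
after = fail_over_application(before, "apache", "node2")
print(after)
```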
Linux FailSafe highly available services fall into two groups: highly
available resources and highly available applications. Highly available resources
include network interfaces, logical volumes, and filesystems such as ext2
or reiserfs that have been configured for Linux FailSafe. Silicon Graphics
has also developed Linux FailSafe NFS. Highly available applications can include
applications such as NFS and Apache.
Linux FailSafe provides
a framework for making additional applications into highly available services.
If you want to add highly available applications on a Linux FailSafe cluster,
you must write scripts to handle application monitoring functions. The
Linux FailSafe Programmer's Guide describes how to develop these
scripts. If you need assistance in this regard, contact
SGI Global Services, which offers custom Linux FailSafe agent development
and HA integration services.
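To give a feel for what an application monitoring function might do, here is a minimal, generic probe written in Python. It is not the Linux FailSafe script interface, which is defined in the Linux FailSafe Programmer's Guide. The function names, the TCP connection test, and the exit-status convention (0 for healthy, 1 for failed) are assumptions chosen for illustration.

```python
import socket


def service_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def monitor_exit_status(host, port):
    """Map the probe result to an exit status (assumed convention: 0 = healthy, 1 = failed)."""
    return 0 if service_is_up(host, port) else 1
```

A wrapper script could call monitor_exit_status("localhost", 80) for a web server and pass the result to sys.exit. A real monitor would usually go further than a connection test, for example by issuing a request and checking the response.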