Linux FailSafe Administrator's Guide
(document number: 007-4322-002 / published: 2001-02-28)
When a failure is detected on one node (the
node has crashed, hung, or been shut down, or a highly available service is
no longer operating), a different node performs a failover of the highly available
services that are being provided on the node with the failure (called the failed node). Failover allows all of the highly available services,
including those provided by the failed node, to remain available within the
cluster.

A failure in a highly available service can be detected by Linux FailSafe
processes running on the same node or on another node. Depending on which node
detects the failure, the sequence of actions following the failure is different.

If the failure is detected by the Linux FailSafe software running on
the same node, the failed node performs these operations:

- Stops the highly available resource group running on the node
- Moves the highly available resource group to a different node,
  according to the defined failover policy for the resource group
- Sends a message to the node that will take over the services
  to start providing all resource group services previously provided by the
  failed node

When it receives the message, the node that is taking over the resource
group starts providing the resource group services previously provided by the
failed node.

If the failure is detected by Linux FailSafe software running on a different
node, the node detecting the failure performs these operations:

- Using the serial connection between the nodes, reboots the
  failed node to prevent corruption of data
- Transfers ownership of the resource group from the failed
  node to the other nodes in the cluster, based on the resource group failover
  policy
- Starts offering the resource group services that were running
  on the failed node
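
The two detection paths described above can be sketched as a small model. This
is only an illustration in Python, not Linux FailSafe code: the `Cluster` class,
its `owners` and `policy` dictionaries, and the `fail_over` method are all
hypothetical names, and the failover policy is assumed to be a simple ordered
list of candidate nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical model: resource group name -> name of the owning node.
    owners: dict
    # Assumed policy shape: resource group name -> ordered candidate nodes.
    policy: dict
    # Human-readable record of the actions taken, in order.
    log: list = field(default_factory=list)

    def pick_target(self, group, failed_node):
        # Take the first candidate in policy order that is not the failed node.
        for node in self.policy[group]:
            if node != failed_node:
                return node
        raise RuntimeError(f"no failover target for {group}")

    def fail_over(self, group, failed_node, detected_by):
        target = self.pick_target(group, failed_node)
        if detected_by == failed_node:
            # Local detection: the failed node stops the group itself
            # and asks the target node to take over.
            self.log.append(f"{failed_node}: stop {group}")
            self.log.append(f"{failed_node}: notify {target} to start {group}")
        else:
            # Remote detection: reboot the failed node first so that it
            # cannot corrupt shared data.
            self.log.append(
                f"{detected_by}: reboot {failed_node} over serial connection"
            )
        # In both cases ownership moves to the target, which starts the group.
        self.owners[group] = target
        self.log.append(f"{target}: start {group}")
        return target
```

Calling `fail_over("web", failed_node="n1", detected_by="n2")` on a two-node
model records the reboot step first, whereas passing `detected_by="n1"` records
the stop and notify steps instead.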

When a failed node comes back up, whether the node automatically starts
to provide highly available services again depends on the failover policy
you define. For information on defining failover policies, see Section 5.5.12.

Normally, a node that experiences a failure automatically reboots and
resumes providing highly available services. This scenario works well for
transient errors (as well as for planned outages for equipment and software
upgrades). However, if there are persistent errors, an automatic reboot can lead
to recovery followed immediately by another failover. To prevent this, the
Linux FailSafe software checks how long the rebooted node has been up since the
last time it was started. If the interval is less than five minutes (by default),
the Linux FailSafe software automatically disables Linux FailSafe from starting
at boot on the failed node and does not start the Linux FailSafe software on this
node. It also writes error messages to /var/log/failsafe
and to the appropriate log file.
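
The uptime check can be illustrated with a short sketch. It is an assumption
about how such a check might look, not the Linux FailSafe implementation: the
function name `should_disable_autostart` and its parameters are hypothetical,
and only the five-minute default comes from the text above.

```python
from datetime import datetime, timedelta

# Default interval taken from the guide; the constant name is hypothetical.
DEFAULT_THRESHOLD = timedelta(minutes=5)


def should_disable_autostart(last_start, now, threshold=DEFAULT_THRESHOLD):
    """Return True if the node has been up for less than the threshold.

    A short interval since the last start suggests a persistent error, so
    restarting the HA software would likely trigger another failover.
    """
    return (now - last_start) < threshold
```

For example, a node started at 12:00 and checked at 12:02 falls below the
five-minute default, so the function returns `True`; checked at 12:30, it
returns `False`.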