When you examine how a FailSafe system responded to a failure, to determine
what went wrong and how processes were transferred, it is important to
consider the concept of node membership. When failover occurs, the runtime
failover domain can include only those nodes that are in the cluster membership.
Nodes can enter the cluster membership only when they are not disabled and
are in a known state. This ensures
that data integrity is maintained because only nodes within the cluster membership
can access the shared storage. If nodes outside the membership and not controlled
by FailSafe were able to access the shared storage, two nodes might try to
access the same data at the same time, a situation that would result in data
corruption. For this reason, disabled nodes do not participate in the membership
computation. Note that no attempt is made to reset nodes that are configured
as disabled before confirming the cluster membership.
Node membership in a cluster is based on a quorum majority.
For a cluster to be enabled, more than 50% of the nodes in the cluster must
be in a known state and able to communicate with each other over the
heartbeat control networks.
This quorum determines which nodes are part of the cluster membership that
is formed.
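The majority rule can be illustrated with a minimal sketch. The function
name and the use of node-name sets are assumptions made for illustration;
they are not part of FailSafe.

```python
def has_majority_quorum(reachable_known_nodes, cluster_nodes):
    """Return True if strictly more than half of the cluster's nodes are
    in a known state and can reach each other (hypothetical helper)."""
    cluster = set(cluster_nodes)
    # Ignore any names that are not actually part of the cluster.
    reachable = set(reachable_known_nodes) & cluster
    # "More than 50%" is a strict inequality: exactly half is not a majority.
    return 2 * len(reachable) > len(cluster)
```

With three nodes, two that can communicate form a majority; with four
nodes, two do not, which is the tied case described next.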
If there is an even number of nodes in the cluster, it is possible that there
will be no majority quorum; there could be two sets of nodes, each consisting
of 50% of the total number of nodes, unable to communicate with the other set
of nodes. In this case, FailSafe breaks the tie using the node that was
configured as the tie-breaker node in the FailSafe parameters. If no tie-breaker
node was configured, FailSafe uses the enabled node with the lowest node ID
number.
For information on setting tie-breaker nodes, see Section 5.4.4.
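The tie-breaker selection described above can be sketched as follows. This
is an illustration only: the function names are hypothetical, and node IDs
are assumed to be integers.

```python
def effective_tie_breaker(enabled_node_ids, configured_tie_breaker=None):
    """Return the node ID that acts as tie-breaker (hypothetical helper)."""
    if configured_tie_breaker is not None:
        return configured_tie_breaker
    # No tie-breaker configured: fall back to the enabled node with the
    # lowest node ID number.
    return min(enabled_node_ids)


def subset_has_tie_breaker(subset, enabled_node_ids, configured_tie_breaker=None):
    """Return True if this half of a tied cluster contains the tie-breaker."""
    tie_breaker = effective_tie_breaker(enabled_node_ids, configured_tie_breaker)
    return tie_breaker in subset
```

For a four-node cluster with node IDs 1 through 4 and no configured
tie-breaker, the subset {1, 2} contains the tie-breaker (node 1) and the
subset {3, 4} does not.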
The nodes in a quorum attempt to reset the nodes that are not in the quorum.
Nodes that are reset successfully are declared DOWN in the membership; nodes
that could not be reset are declared UNKNOWN. Nodes in the quorum are UP.
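The mapping from reset outcome to node state can be sketched as follows.
The enum, the function, and the reset_succeeded mapping are hypothetical
names introduced for illustration; only the state names UP, DOWN, and
UNKNOWN come from the text.

```python
from enum import Enum


class NodeState(Enum):
    UP = "UP"
    DOWN = "DOWN"
    UNKNOWN = "UNKNOWN"


def classify_nodes(cluster_nodes, quorum_nodes, reset_succeeded):
    """Assign a membership state to every node in the cluster.

    reset_succeeded maps each node outside the quorum to True if its
    reset succeeded; a missing entry is treated as a failed reset.
    """
    states = {}
    for node in cluster_nodes:
        if node in quorum_nodes:
            states[node] = NodeState.UP        # member of the quorum
        elif reset_succeeded.get(node, False):
            states[node] = NodeState.DOWN      # reset succeeded
        else:
            states[node] = NodeState.UNKNOWN   # reset failed
    return states
```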
If a new majority quorum is computed, a new membership is
declared whether any node could be reset or not.
If at least one node in the current quorum has a current membership,
the nodes will proceed to declare a new membership if they can reset at least
one node.
If all nodes in the new tied quorum are coming up for the
first time, they will try to reset and proceed with a new membership only
if the quorum includes the tie-breaker node.
If a tied subset of nodes in the cluster had no previous membership,
then the subset with the tie-breaker node attempts to reset the nodes in
the other subset. If at least one node reset succeeds, a new membership is
confirmed.
If a tied subset of nodes in the cluster had previous membership,
the nodes in one subset attempt to reset the nodes in the other subset.
If at least one node reset succeeds, a new membership is confirmed. The
subset with the tie-breaker node attempts its resets immediately; the other
subset attempts its resets after some time.
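The membership rules above can be summarized as a small decision function.
This is a sketch of one reading of the rules, not FailSafe code: the
function and parameter names are hypothetical, and it covers only majority
quorums and exactly tied quorums.

```python
def can_declare_membership(is_majority, has_previous_membership,
                           includes_tie_breaker, successful_resets):
    """Decide whether a quorum may declare a new membership.

    is_majority:             quorum holds more than 50% of the nodes
    has_previous_membership: at least one quorum node has a current membership
    includes_tie_breaker:    quorum contains the tie-breaker node
    successful_resets:       number of nodes outside the quorum that were reset
    """
    if is_majority:
        # A majority quorum declares membership whether or not any
        # node could be reset.
        return True
    if has_previous_membership:
        # Tied quorum with a previous membership: at least one reset
        # must succeed.
        return successful_resets >= 1
    if not includes_tie_breaker:
        # Tied quorum coming up for the first time without the
        # tie-breaker node: it does not proceed.
        return False
    # Tied, first-time quorum with the tie-breaker node: it proceeds
    # once at least one reset succeeds.
    return successful_resets >= 1
```

The delayed reset attempt by the subset without the tie-breaker node is a
timing behavior and is not modeled here.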
Resets are done through system controllers connected to tty
ports through serial lines. Periodic serial line monitoring never stops. If
the estimated serial line monitoring failure interval and the estimated heartbeat
loss interval overlap, a power failure is suspected at the node being reset.
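The overlap test can be sketched as a comparison of two time intervals.
This is an illustration under stated assumptions: each interval is
represented as a (start, end) pair of timestamps in seconds, and the
function names are hypothetical. How FailSafe estimates the intervals is
not described here.

```python
def intervals_overlap(first, second):
    """Return True if two closed (start, end) intervals share any point."""
    first_start, first_end = first
    second_start, second_end = second
    return first_start <= second_end and second_start <= first_end


def power_failure_suspected(serial_failure_interval, heartbeat_loss_interval):
    """Suspect a power failure when the estimated serial line monitoring
    failure interval overlaps the estimated heartbeat loss interval."""
    return intervals_overlap(serial_failure_interval, heartbeat_loss_interval)
```

For example, a serial line failure estimated between 10 and 15 seconds and
a heartbeat loss estimated between 14 and 20 seconds overlap, so a power
failure would be suspected; intervals of 10 to 12 and 14 to 20 do not
overlap.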