Linux FailSafe Administrator's Guide
(document number: 007-4322-002 / published: 2001-02-28)
The following sections describe recovery procedures that you can perform
when different FailSafe components fail.

Follow this procedure if the status of the cluster is UNKNOWN on all nodes
in the cluster.

1. Check whether any control networks have failed (see Section 9.6.5).

2. At least 50% of the nodes in the cluster must be able to communicate
   with each other for the cluster to be active (the quorum requirement).
   If too few nodes can communicate with each other over the control
   networks, stop HA services on some of the nodes so that the quorum
   requirement is satisfied.

3. If there are no hardware configuration problems, detach all resource
   groups that are online in the cluster (if any), stop HA services in the
   cluster, and restart HA services in the cluster.

The following cluster_mgr command detaches the resource group web-rg in
cluster web-cluster:

    cmgr> admin detach resource_group web-rg in cluster web-cluster

To stop HA services in the cluster web-cluster and ignore errors (force
option), use the following cluster_mgr command:

    cmgr> stop ha_services for cluster web-cluster force

To start HA services in the cluster web-cluster, use the following
cluster_mgr command:

    cmgr> start ha_services for cluster web-cluster

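The cluster-level procedure above can be sketched as a small helper. This is an illustrative sketch only: the function names has_quorum and cluster_restart_commands are hypothetical, the 50% threshold is the quorum requirement stated above, and the strings are the cmgr commands shown in this section. The helper only builds the command text in order; it does not invoke cluster_mgr.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """Return True if at least 50% of the nodes can communicate."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return 2 * reachable_nodes >= total_nodes


def cluster_restart_commands(cluster, online_groups):
    """Build the cmgr commands in order: detach, forced stop, start."""
    commands = [
        f"admin detach resource_group {group} in cluster {cluster}"
        for group in online_groups
    ]
    commands.append(f"stop ha_services for cluster {cluster} force")
    commands.append(f"start ha_services for cluster {cluster}")
    return commands


if __name__ == "__main__":
    print("quorum with 2 of 4 nodes:", has_quorum(2, 4))
    for line in cluster_restart_commands("web-cluster", ["web-rg"]):
        print("cmgr>", line)
```

Generating the commands as plain strings keeps the sketch independent of how cluster_mgr is actually driven at a given site.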
Follow this procedure if the status of a node is UNKNOWN in an active
cluster:

1. Check whether the control networks on the node are working (see
   Section 9.6.5).

2. Check whether the serial reset cables used to reset the node are
   working (see Section 9.6.6).

3. If there are no hardware configuration problems, stop HA services on
   the node and restart HA services.

To stop HA services on the node web-node3 in the cluster web-cluster,
ignoring errors (force option), use the following cluster_mgr command:

    cmgr> stop ha_services in node web-node3 for cluster web-cluster force

To start HA services on the node web-node3 in the cluster web-cluster,
use the following cluster_mgr command:

    cmgr> start ha_services in node web-node3 for cluster web-cluster

To do simple maintenance on an application that is part of a resource
group, use the following procedure. While maintenance mode is on, the
resources in the resource group are not monitored. You must turn
maintenance mode off when application maintenance is done.

Caution: If the node on which resource group maintenance is being
performed fails, the resource group is moved to another node in the
failover policy domain.

1. To put the resource group web-rg in maintenance mode, use the
   following cluster_mgr command:

       cmgr> admin maintenance_on resource_group web-rg in cluster web-cluster

   The resource group state changes to ONLINE_MAINTENANCE.

2. Do whatever application maintenance is required. (Rotating application
   logs is an example of simple application maintenance.)

3. To remove the resource group web-rg from maintenance mode, use the
   following cluster_mgr command:

       cmgr> admin maintenance_off resource_group web-rg in cluster web-cluster

   The resource group state changes back to ONLINE.

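Because maintenance mode must always be turned off again, the on/off pair lends itself to a wrapper that issues the second command even when the maintenance work fails. The following is a minimal sketch under stated assumptions: maintenance_mode is a hypothetical helper, and run is a caller-supplied callable that submits one cmgr command string to cluster_mgr (how it does so is left open here).

```python
from contextlib import contextmanager


@contextmanager
def maintenance_mode(group, cluster, run):
    """Turn maintenance mode on, yield, and always turn it off again.

    `run` is a hypothetical caller-supplied callable that submits one
    cmgr command string; this sketch does not talk to cluster_mgr itself.
    """
    run(f"admin maintenance_on resource_group {group} in cluster {cluster}")
    try:
        yield
    finally:
        # Executed even if the maintenance work raises an exception.
        run(f"admin maintenance_off resource_group {group} in cluster {cluster}")


if __name__ == "__main__":
    issued = []
    with maintenance_mode("web-rg", "web-cluster", issued.append):
        issued.append("(application maintenance happens here)")
    print("\n".join(issued))
```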
Perform the following procedure when a resource group is in an ONLINE
state and has an SRMD EXECUTABLE ERROR.

1. Look at the SRM logs (default location:
   /var/log/failsafe/srmd_nodename) to determine the cause of the failure
   and the resource that has failed.

2. Fix the cause of the failure. This might require changes to the
   resource configuration or to the resource type stop/start/failover
   action timeouts.

3. After fixing the problem, move the resource group offline with the
   force option and then move the resource group online.

The following cluster_mgr command moves the resource group web-rg in the
cluster web-cluster offline and ignores any errors:

    cmgr> admin offline resource_group web-rg in cluster web-cluster force

The following cluster_mgr command moves the resource group web-rg in the
cluster web-cluster online:

    cmgr> admin online resource_group web-rg in cluster web-cluster

The resource group web-rg should be in an ONLINE state with no error.

Use the following procedure when a resource group is not online but is in
an error state. Most of these errors occur as a result of the exclusivity
process. This process, which runs when a resource group is brought online,
determines whether any resources are already allocated somewhere in the
failure domain of the resource group. Note that an exclusivity script
reports a resource as allocated on a node if the script fails in any way.
In other words, unless the script can determine that a resource is not
present, it returns a value indicating that the resource is allocated.

Some possible error states include: SPLIT RESOURCE GROUP (EXCLUSIVITY),
NODE NOT AVAILABLE (EXCLUSIVITY), and NO AVAILABLE NODES in failure
domain. See Section 7.4.3 for explanations of resource group error codes.

1. Look at the failsafe and SRM logs (default directory:
   /var/log/failsafe, files: failsafe_nodename, srmd_nodename) to
   determine the cause of the failure and the resource that failed.

   For example, suppose that moving a resource group online leaves the
   resource group in the error state SPLIT RESOURCE GROUP (EXCLUSIVITY).
   This means that parts of the resource group are allocated on at least
   two different nodes. One of the failsafe logs describes which nodes are
   believed to have the resource group partially allocated.

   At this point, look at the srmd logs on each of these machines to see
   which resources are believed to be allocated. In some cases, a
   misconfigured resource shows up as a resource that is allocated. This
   is especially true for Netscape_web resources.

2. Fix the cause of the failure. This might require changes to the
   resource configuration or to the resource type start/stop/exclusivity
   timeouts.

3. After fixing the problem, move the resource group offline with the
   force option and then move the resource group online.

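When the logs of several nodes have to be inspected, it can help to generate the default file names rather than type them. The sketch below assumes that nodename in the file names above is a placeholder for the node's name; the helper log_paths is hypothetical, and only the default directory and file prefixes come from the text.

```python
from pathlib import PurePosixPath

# Default log directory given in the text above.
LOG_DIR = PurePosixPath("/var/log/failsafe")


def log_paths(nodename, daemons=("failsafe", "srmd")):
    """Return the default log file path for each daemon on one node.

    Assumes the pattern <daemon>_<nodename> inside LOG_DIR.
    """
    return {d: str(LOG_DIR / f"{d}_{nodename}") for d in daemons}


if __name__ == "__main__":
    for node in ("web-node1", "web-node3"):
        for daemon, path in log_paths(node).items():
            print(f"{node:10} {daemon:9} {path}")
```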
A few double failures can occur in the cluster that cause resource groups
to remain in a non-highly-available state. At times a resource group might
get stuck in an offline state. A resource group might also stay in an
error state on a node even when a new node joins the cluster and the
resource group could migrate to that node to clear the error.

When these circumstances arise, take the following actions:

1. If the resource group is offline, try to move it online.

2. If the resource group is stuck on a node, detach the resource group,
   then bring it online again. This should clear many errors.

3. If detaching the resource group does not work, force the resource
   group offline, then bring it back online.

4. If commands appear to be hanging or not working properly, detach all
   resource groups, then shut down the cluster and bring all resource
   groups back online.

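The escalation order above can be written down as data so that each attempt and its commands stay together. This is a sketch, not a definitive tool: escalation_steps is a hypothetical helper, the command strings reuse the cmgr syntax shown elsewhere in this chapter, and the last step is kept as a description only because it acts on the whole cluster rather than on one resource group.

```python
def escalation_steps(group, cluster):
    """Return the recovery attempts in the order listed above.

    Each entry is (description, [cmgr command strings]).
    """
    online = f"admin online resource_group {group} in cluster {cluster}"
    detach = f"admin detach resource_group {group} in cluster {cluster}"
    offline_force = (
        f"admin offline resource_group {group} in cluster {cluster} force"
    )
    return [
        ("move the group online", [online]),
        ("detach the group, then bring it online", [detach, online]),
        ("force the group offline, then bring it online", [offline_force, online]),
        # Cluster-wide step: described only, no per-group commands.
        ("detach all groups, shut down the cluster, bring all groups online", []),
    ]


if __name__ == "__main__":
    for number, (what, commands) in enumerate(
        escalation_steps("web-rg", "web-cluster"), start=1
    ):
        print(f"{number}. {what}")
        for command in commands:
            print(f"   cmgr> {command}")
```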
See Section 7.5.2 for information on detaching resource groups and forcing
resource groups offline.

Use this procedure when a resource that is not part of a resource group is
in an ONLINE state with error. This can happen when the addition or
removal of resources from a resource group fails.

1. Look at the SRM logs (default location:
   /var/log/failsafe/srmd_nodename) to determine the cause of the failure
   and the resource that has failed.

2. Fix the cause of the failure. This might require changes to the
   resource configuration or to the resource type stop/start/failover
   action timeouts.

3. After fixing the problem, move the resource offline with the force
   option of the Cluster Manager CLI admin offline command:

       cmgr> admin offline_force resource web-srvr of resource_type Netscape_Web in cluster web-cluster

   Executing this command removes the error state of the resource
   web-srvr of type Netscape_Web, making it available to be added to a
   resource group.

You can also use the Cluster Manager GUI to clear the error state for the
resource. To do this, select the “Recover a Resource” task from the
“Resources and Resource Types” category of the FailSafe Manager.

Control network failures are reported in the cmsd logs. The default
location of the cmsd log is /var/log/failsafe/cmsd_nodename. Follow this
procedure when the control network fails:

1. Use the ping command to check whether the control network IP address
   is configured on the node.

2. Check the node configuration to see whether the control network IP
   addresses are correctly specified. The following cluster_mgr command
   displays the node configuration for web-node3:

       cmgr> show node web-node3

3. If IP names are specified for control networks instead of IP addresses
   in XX.XX.XX.XX notation, check whether the IP names can be resolved
   using DNS. It is recommended that IP addresses be used instead of IP
   names.

4. Check whether the heartbeat interval and node timeouts are correctly
   set for the cluster. These HA parameters can be seen using the
   cluster_mgr show ha_parameters command.

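The check on IP names versus IP addresses in the procedure above can be sketched with the Python standard library. The function names are hypothetical, and the values passed in are assumed to be copied by hand from the node configuration. Note that socket.gethostbyname uses the system resolver, which may consult local files as well as DNS, so a successful lookup here is only an approximation of the DNS check described above.

```python
import ipaddress
import socket


def is_ipv4_literal(value: str) -> bool:
    """True if value is a dotted-quad IPv4 address rather than an IP name."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def check_control_network(value: str) -> str:
    """Classify one control network entry taken from the node configuration."""
    if is_ipv4_literal(value):
        return "address"
    try:
        socket.gethostbyname(value)  # system resolver, not DNS only
        return "name resolves"
    except OSError:
        return "name does not resolve"


if __name__ == "__main__":
    # Example values only; substitute the entries shown by "show node".
    for entry in ("192.168.1.10", "web-node3"):
        print(entry, "->", check_control_network(entry))
```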
Serial cables are used for resetting a node when there is a node failure.
Serial cable failures are reported in the crsd logs. The default location
for the crsd log is /var/log/failsafe/crsd_nodename.

1. Check the node configuration to see whether the serial cable
   connection is correctly configured. The following cluster_mgr command
   displays the node configuration for web-node3:

       cmgr> show node web-node3

2. Use the cluster_mgr admin ping command to verify the serial cables:

       cmgr> admin ping node web-node3

The above command reports serial cable problems in node web-node3.

When the entire configuration database (CDB) must be reinitialized,
execute the following command:

    # /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db

This command restarts all cluster processes. The contents of the
configuration database are automatically synchronized with other nodes if
other nodes in the pool are available. Otherwise, the CDB must be restored
from backup at this point. For instructions on backing up and restoring
the CDB, see Section 7.8.

If the FailSafe Cluster Manager GUI is displaying information that is
inconsistent with the FailSafe cluster_mgr command, restart the cad
process on the node to which the Cluster Manager GUI is connected by
executing the following command:

The cluster administration daemon is restarted automatically by the cmond
process.