Linux FailSafe Administrator's Guide
(document number: 007-4322-002 / published: 2001-02-28)
While the Linux FailSafe system is running, you can monitor the status of
its components to determine the state of each component.
Linux FailSafe allows you to view the system status in the following ways:

- You can keep continuous watch on the state of a cluster using
  the FailSafe Cluster View of the Cluster Manager GUI.
- You can query the status of an individual resource group,
  node, or cluster using either the Cluster Manager GUI or the Cluster Manager
  CLI.
- You can use the haStatus script provided
  with the Cluster Manager CLI to see the status of all clusters, nodes, resources,
  and resource groups in the configuration.

The following sections describe the procedures for performing each of
these tasks.

The easiest way to keep a continuous watch on the state of a cluster
is to use the FailSafe Cluster View of the Cluster Manager GUI.

In the FailSafe Cluster View window, components that are experiencing
problems appear as blinking red icons. Components in transitional states
also appear as blinking icons. If there is a problem in a resource group or
node, the FailSafe Cluster View icon for the cluster turns red and blinks,
as does the resource group or node icon.

The full color legend for component states in the FailSafe Cluster View
is as follows:

- grey
  healthy but not online or active
- green
  healthy and active or online
- blinking green
  transitioning to green
- blinking red
  problems with component
- black and white outline
  resource type
- grey with yellow wrench
  maintenance mode, may or may not be currently monitored by Linux FailSafe

If you minimize the FailSafe Cluster View window, the minimized icon
shows the current state of the cluster. When the cluster has Linux FailSafe
HA services active and there is no error, the icon shows a green cluster.
When the cluster goes into an error state, the icon shows a red cluster. When
the cluster has Linux FailSafe HA services inactive, the icon shows a grey
cluster.

You can use the CLI to query the
status of a resource or to ping the system controller at a node, as described
in the following subsections.

To query a resource status, use the following CLI command:

    cmgr> show status of resource A of resource_type B [in cluster C]

If you have specified a default cluster, you do not need to specify
a cluster when you use this command; it will show the status of the indicated
resource in the default cluster.

To perform a ping operation on a system controller by providing the device
name, use the following CLI command:

    cmgr> admin ping dev_name A of dev_type B with sysctrl_type C

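The two command forms above can be assembled programmatically, for example when generating input for the Cluster Manager CLI. The following is a minimal sketch in Python, not part of Linux FailSafe: the function names are hypothetical, and the sample values are borrowed from the haStatus output later in this section on the assumption that they map onto the placeholders A, B, and C as shown.

```python
from typing import Optional


def show_resource_status(resource: str, resource_type: str,
                         cluster: Optional[str] = None) -> str:
    """Build the 'show status of resource' command text (hypothetical helper)."""
    command = f"show status of resource {resource} of resource_type {resource_type}"
    if cluster is not None:
        # The cluster clause is optional when a default cluster has been set.
        command += f" in cluster {cluster}"
    return command


def ping_system_controller(dev_name: str, dev_type: str,
                           sysctrl_type: str) -> str:
    """Build the 'admin ping dev_name' command text (hypothetical helper)."""
    return (f"admin ping dev_name {dev_name} of dev_type {dev_type} "
            f"with sysctrl_type {sysctrl_type}")


# Sample values are assumptions taken from the haStatus examples.
print(show_resource_status("/hafs1", "NFS", "test-cluster"))
print(ping_system_controller("/dev/ttyd2", "tty", "msc"))
```

Leaving out the cluster argument produces the short form of the command, which relies on a default cluster having been specified.
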
To query the status of a resource group, you
provide the name of the resource group and the cluster which includes the
resource group. Resource group status includes the following components:
resource group state, resource group error status, and resource owner.
These components are described in the following subsections.

If a node that contains an online resource group has a status of UNKNOWN,
the status of the resource group will be reported either as not available
or as ONLINE-READY.

A resource group state can be one of the following:

- ONLINE
  Linux FailSafe is running on the local nodes. The resource group is
  allocated on a node in the cluster and is being monitored by Linux FailSafe.
  It is fully allocated if there is no error; otherwise, some resources may
  not be allocated or some resources may be in error state.
- ONLINE-PENDING
  Linux FailSafe is running on the local nodes and the resource group
  is in the process of being allocated. This is a transient state.
- OFFLINE
  The resource group is not running or the resource group has been detached,
  regardless of whether Linux FailSafe is running. When Linux FailSafe starts
  up, it will not allocate this resource group.
- OFFLINE-PENDING
  Linux FailSafe is running on the local nodes and the resource group
  is in the process of being released (becoming offline). This is a transient
  state.
- ONLINE-READY
  Linux FailSafe is not running on the local node. When Linux FailSafe
  starts up, it will attempt to bring this resource group online. No Linux FailSafe
  process is running on the current node if this state is returned.
- ONLINE-MAINTENANCE
  The resource group is allocated on a node in the cluster but it is not
  being monitored by Linux FailSafe. If a node failure occurs while a resource
  group in ONLINE-MAINTENANCE state resides on that node,
  the resource group will be moved to another node and monitoring will resume.
  An administrator may move a resource group to an ONLINE-MAINTENANCE state
  for upgrade or testing purposes, or if there is any reason
  that Linux FailSafe should not act on that resource for a period of time.
- INTERNAL ERROR
  An internal Linux FailSafe error has occurred and Linux FailSafe does
  not know the state of the resource group. Error recovery is required.
- DISCOVERY (EXCLUSIVITY)
  The resource group is in the process of going online if Linux FailSafe
  can correctly determine whether any resource in the resource group is already
  allocated on all nodes in the resource group's application failure domain.
  This is a transient state.
- INITIALIZING
  Linux FailSafe on the local node has yet to get any information about
  this resource group. This is a transient state.

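When polling resource group status from a script, it helps to know which states are expected to change on their own. The sketch below is an illustration only: the class and function names are hypothetical, and the set of transient states contains exactly those the list above labels as transient.

```python
from enum import Enum


class ResourceGroupState(Enum):
    """Resource group states as named in the list above."""
    ONLINE = "ONLINE"
    ONLINE_PENDING = "ONLINE-PENDING"
    OFFLINE = "OFFLINE"
    OFFLINE_PENDING = "OFFLINE-PENDING"
    ONLINE_READY = "ONLINE-READY"
    ONLINE_MAINTENANCE = "ONLINE-MAINTENANCE"
    INTERNAL_ERROR = "INTERNAL ERROR"
    DISCOVERY = "DISCOVERY (EXCLUSIVITY)"
    INITIALIZING = "INITIALIZING"


# States that the list above explicitly describes as transient.
TRANSIENT_STATES = frozenset({
    ResourceGroupState.ONLINE_PENDING,
    ResourceGroupState.OFFLINE_PENDING,
    ResourceGroupState.DISCOVERY,
    ResourceGroupState.INITIALIZING,
})


def parse_state(text: str) -> ResourceGroupState:
    """Match a reported state string, ignoring case and outer whitespace."""
    return ResourceGroupState(text.strip().upper())


def is_transient(state: ResourceGroupState) -> bool:
    """Return True if a later query may legitimately report another state."""
    return state in TRANSIENT_STATES


print(is_transient(parse_state("Online")))
print(is_transient(parse_state("OFFLINE-PENDING")))
```

A polling script could, for instance, query again after a short delay while `is_transient` returns true, and treat any other state as settled.
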
When a resource group is ONLINE, its error status is continually being
monitored. A resource group error status can be one of the following:

- NO ERROR
  Resource group has no error.
- INTERNAL ERROR - NOT RECOVERABLE
  Notify Silicon Graphics if this condition arises.
- NODE UNKNOWN
  Node that had the resource group online is in unknown state. This occurs
  when the node is not part of the cluster. The last known state of the resource
  group is ONLINE, but the system cannot talk to the node.
- SRMD EXECUTABLE ERROR
  The start or stop action has failed for a resource in the resource group.
- SPLIT RESOURCE GROUP (EXCLUSIVITY)
  Linux FailSafe has determined that part of the resource group was running
  on at least two different nodes in the cluster.
- NODE NOT AVAILABLE (EXCLUSIVITY)
  Linux FailSafe has determined that one of the nodes in the resource
  group's application failure domain was not in the membership. Linux FailSafe
  cannot bring the resource group online until that node is removed from the
  application failure domain or HA services are started on that node.
- MONITOR ACTIVITY UNKNOWN
  In the process of turning maintenance mode on or off, an error occurred.
  Linux FailSafe can no longer determine if monitoring is enabled or disabled.
  Retry the operation. If the error continues, report the error to Silicon Graphics.
- NO AVAILABLE NODES
  A monitoring error has occurred on the last valid node in the cluster's
  membership.

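Only some of the error statuses above come with a stated response. The following sketch, with hypothetical names, records those responses in a lookup table so that a monitoring script can print them. Entries are set to None where the list above names no specific action; that is a reflection of the text, not a claim that no action exists.

```python
from typing import Optional

# Error status -> action stated in the list above (None if none is stated).
ERROR_ACTIONS = {
    "NO ERROR": None,
    "INTERNAL ERROR - NOT RECOVERABLE": "Notify Silicon Graphics.",
    "NODE UNKNOWN": None,
    "SRMD EXECUTABLE ERROR": None,
    "SPLIT RESOURCE GROUP (EXCLUSIVITY)": None,
    "NODE NOT AVAILABLE (EXCLUSIVITY)": (
        "Remove the node from the application failure domain, "
        "or start HA services on that node."
    ),
    "MONITOR ACTIVITY UNKNOWN": (
        "Retry the operation; if the error continues, "
        "report it to Silicon Graphics."
    ),
    "NO AVAILABLE NODES": None,
}


def documented_action(error_status: str) -> Optional[str]:
    """Look up the stated action, ignoring case and repeated whitespace."""
    key = " ".join(error_status.upper().split())
    if key not in ERROR_ACTIONS:
        raise KeyError(f"unrecognized error status: {error_status!r}")
    return ERROR_ACTIONS[key]


print(documented_action("Monitor Activity Unknown"))
```
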
The resource owner is the logical node name of
the node that currently owns the resource.

You can use the FailSafe Cluster View to monitor the status of the resources
in a Linux FailSafe configuration. You can launch the FailSafe Cluster View
directly, or you can bring it up at any time by clicking on “FailSafe
Cluster View” at the bottom of the “FailSafe Manager” display.

From the View menu, select “Resources in Groups” to see
the resources organized by the groups they belong to, or select “Groups
owned by Nodes” to see where the online groups are running. This view
lets you observe failovers as they occur.

To query a resource group status, use the following CLI command:

    cmgr> show status of resource_group A [in cluster B]

If you have specified a default cluster, you do not need to specify
a cluster when you use this command; it will show the status of the indicated
resource group in the default cluster.

To query the status of a node, you provide the logical node name of the node.
The node status can be one of the following:

- UP
  This node is part of cluster membership.
- DOWN
  This node is not part of cluster membership (no heartbeats) and this
  node has been reset. This is a transient state.
- UNKNOWN
  This node is not part of cluster membership (no heartbeats) and this
  node has not been reset (reset attempt has failed).
- INACTIVE
  HA services have not been started on this node.

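The four node states differ in three facts: whether HA services have been started, whether the node is in the cluster membership, and whether a reset succeeded. The sketch below reads the list above as a decision procedure. The function name is hypothetical, and checking INACTIVE first is an assumption about precedence, not documented behavior.

```python
def node_state(ha_services_started: bool, in_membership: bool,
               reset_succeeded: bool) -> str:
    """Derive a node state name from the three facts in the list above."""
    if not ha_services_started:
        # Assumption: INACTIVE takes precedence over the other states.
        return "INACTIVE"
    if in_membership:
        return "UP"
    # Not in membership (no heartbeats): the reset outcome decides the state.
    return "DOWN" if reset_succeeded else "UNKNOWN"


print(node_state(True, False, False))
```
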
When you start HA services, node states transition from INACTIVE to UP. A node
state may also transition from INACTIVE to UNKNOWN to UP.

You can use the FailSafe Cluster View to monitor the status of the clusters
in a Linux FailSafe configuration. You can launch the FailSafe Cluster View
directly, or you can bring it up at any time by clicking on “FailSafe
Cluster View” at the bottom of the “FailSafe Manager” display.

From the View menu, select “Groups owned by Nodes” to monitor
the health of the default cluster, its resource groups, and each group's resources.

To query node status, use the following CLI command:

    cmgr> show status of node A

When Linux FailSafe is running, you can determine whether the system
controller on a node is responding by using the admin ping command of the
Cluster Manager CLI. Without the standalone option, this command uses the
Linux FailSafe daemons to test whether the system controller is responding.

You can verify reset connectivity on a node in a cluster even when the
Linux FailSafe daemons are not running by using the standalone
option of the admin ping command of the CLI:

    cmgr> admin ping standalone node A

This command does not go through the Linux FailSafe daemons, but calls
the ping command directly to test whether the system controller
on the indicated node is responding.

To query the status of a cluster, you provide the name of the cluster.
The cluster status is one of a fixed set of states; the haStatus output
later in this section shows a cluster in the ACTIVE state.

You can use the Cluster View of the Cluster Manager GUI to monitor the
status of the clusters in a Linux FailSafe system.

To query node and cluster status, use the following CLI command:

    cmgr> show status of cluster A

The haStatus script provides status and configuration information about
clusters, nodes, resources, and resource groups in the configuration. This
script is installed in the /var/cluster/cmgr-scripts
directory. You can modify this script to suit your needs. See the haStatus(1M)
man page for further information about this script.

The following examples show the output of the different options of the
haStatus script.

# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
-a prints detailed cluster configuration information and cluster
status.
-i prints detailed cluster configuration information only.
-c can be used to specify a cluster for which status is to be printed.
“clustername” is the name of the cluster for which status is to be
printed.
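
The usage message above can be mirrored by a small helper that builds an argument list for the script. This is a sketch under stated assumptions: the helper name is hypothetical, and the full path assumes the script file is named haStatus inside the /var/cluster/cmgr-scripts directory mentioned earlier.

```python
from typing import List, Optional

# Assumed location: the script name appended to the documented directory.
HASTATUS = "/var/cluster/cmgr-scripts/haStatus"


def hastatus_argv(detail: Optional[str] = None,
                  cluster: Optional[str] = None) -> List[str]:
    """Build an argument list following 'haStatus [-a|-i] [-c clustername]'."""
    if detail not in (None, "a", "i"):
        # -a and -i are alternatives, so at most one may be chosen.
        raise ValueError("detail must be None, 'a', or 'i'")
    argv = [HASTATUS]
    if detail is not None:
        argv.append(f"-{detail}")
    if cluster is not None:
        argv.extend(["-c", cluster])
    return argv


print(hastatus_argv("a", "test-cluster"))
```

The resulting list is suitable for passing to `subprocess.run` on a node where the script is installed.
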
# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Node hans1:
State of machine is UP.
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
Logical Machine Name: hans2
Hostname: hans2.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32418
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans1
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.15
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.61
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Node hans1:
Logical Machine Name: hans1
Hostname: hans1.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32645
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans2
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.14
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.60
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Resource_group nfs-group1:
Failover Policy: fp_h1_h2_ord_auto_auto
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: hans1 hans2
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
Resource /hafs1 (type NFS):
export-info: rw,wsync
filesystem: /hafs1
Resource dependencies
statd /hafs1/nfs/statmon
filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
InterfaceAddress: 150.166.41.95
Resource dependencies
IP_address 150.166.41.95
filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
NetworkMask: 0xffffff00
interfaces: ef1
BroadcastAddress: 150.166.41.255
No resource dependencies
Resource /hafs1 (type filesystem):
volume-name: havol1
mount-options: rw,noauto
monitoring-level: 2
Resource dependencies
volume havol1
Resource havol1 (type volume):
devname-group: sys
devname-owner: root
devname-mode: 666
No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: hans1 hans2
# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Logical Machine Name: hans2
Hostname: hans2.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32418
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans1
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.15
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.61
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Node hans1:
State of machine is UP.
Logical Machine Name: hans1
Hostname: hans1.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32645
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans2
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.14
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.60
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: hans1 hans2
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
Resource /hafs1 (type NFS):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
export-info: rw,wsync
filesystem: /hafs1
Resource dependencies
statd /hafs1/nfs/statmon
filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
InterfaceAddress: 150.166.41.95
Resource dependencies
IP_address 150.166.41.95
filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
NetworkMask: 0xffffff00
interfaces: ef1
BroadcastAddress: 150.166.41.255
No resource dependencies
Resource /hafs1 (type filesystem):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
volume-name: havol1
mount-options: rw,noauto
monitoring-level: 2
Resource dependencies
volume havol1
Resource havol1 (type volume):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
devname-group: sys
devname-owner: root
devname-mode: 666
No resource dependencies
# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Node hans1:
State of machine is UP.
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
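
Because the default haStatus output follows a regular line pattern, it can be summarized with a few regular expressions. The sketch below is an illustration, not part of Linux FailSafe: it assumes the line formats shown in the examples above, the function name is hypothetical, and it extracts only the cluster state, the node states, and the state, error, and owner of each resource group.

```python
import re

# Abbreviated copy of the default haStatus output shown above.
SAMPLE = """\
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Node hans1:
State of machine is UP.
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
"""


def parse_hastatus(text):
    """Summarize cluster, node, and resource group status lines."""
    summary = {"cluster": {}, "nodes": {}, "groups": {}}
    node = None
    group = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"Cluster state is (\S+)\.", line)
        if m:
            summary["cluster"]["state"] = m.group(1)
            continue
        m = re.fullmatch(r"Cluster (\S+):", line)
        if m:
            summary["cluster"]["name"] = m.group(1)
            continue
        m = re.fullmatch(r"Node (\S+):", line)
        if m:
            node, group = m.group(1), None
            continue
        m = re.fullmatch(r"State of machine is (\S+)\.", line)
        if m and node is not None:
            summary["nodes"][node] = m.group(1)
            continue
        m = re.fullmatch(r"Resource_group (\S+):", line)
        if m:
            node, group = None, m.group(1)
            summary["groups"][group] = {}
            continue
        if line.startswith("Resource "):
            # Per-resource sections (printed with -a or -i) end the group header.
            group = None
            continue
        m = re.fullmatch(r"(State|Error|Owner): (.+)", line)
        if m and group is not None:
            summary["groups"][group][m.group(1).lower()] = m.group(2)
    return summary


print(parse_hastatus(SAMPLE))
```

Since the script can be modified locally, any such parser should be checked against the output of the installed version before being relied on.
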