Linux FailSafe Administrator's Guide
(document number: 007-4322-002 / published: 2001-02-28)
This section describes the software layers, communication paths, and
cluster configuration database.

A Linux FailSafe system has the following software layers:

- Plug-ins, which create highly available services. If the application plug-in you want is not available, you can hire the Silicon Graphics Global Services group to develop the required software, or you can use the Linux FailSafe Programmer's Guide to write the software yourself.
- Linux FailSafe base, which includes the ability to define resource groups and failover policies.
- High-availability cluster infrastructure that lets you define clusters, resources, and resource types (this consists of the cluster_services installation package).
- Cluster software infrastructure, which lets you do the following:
  - Perform node logging
  - Administer the cluster
  - Define nodes

  (The cluster software infrastructure consists of the cluster_admin and cluster_control subsystems.)
Figure 1-3 shows a graphic representation of these layers. Table 1-2 describes the layers for Linux FailSafe, which are located in the /usr/lib/failsafe/bin directory.

Table 1-2. Contents of /usr/lib/failsafe/bin

| Layer | Subsystem | Process | Description |
|---|---|---|---|
| Linux FailSafe Base | failsafe2 | ha_fsd | Linux FailSafe daemon. Provides the basic component of the Linux FailSafe software. |
| High-availability cluster infrastructure | cluster_ha | ha_cmsd | Cluster membership daemon. Provides the list of nodes, called node membership, available to the cluster. |
| | | ha_gcd | Group membership daemon. Provides group membership and reliable communication services in the presence of failures to Linux FailSafe processes. |
| | | ha_srmd | System resource manager daemon. Manages resources, resource groups, and resource types. Executes action scripts for resources. |
| | | ha_ifd | Interface agent daemon. Monitors the local node's network interfaces. |
| Cluster software infrastructure | cluster_admin | cad | Cluster administration daemon. Provides administration services. |
| | cluster_control | crsd | Node control daemon. Monitors the serial connection to other nodes. Has the ability to reset other nodes. |
| | | cmond | Daemon that manages all other daemons. This process starts other processes in all nodes in the cluster and restarts them on failures. |
| | | cdbd | Manages the configuration database and keeps each copy in sync on all nodes in the pool. |
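
The layering in Table 1-2 can be captured as a small lookup structure, which may be convenient for a reporting tool that needs to state which layer and subsystem a given daemon belongs to. The Python sketch below is illustrative only. The names `DaemonInfo`, `DAEMONS`, `describe`, and `daemon_path` are hypothetical, and the assumption that each process name is a file directly under /usr/lib/failsafe/bin is inferred from the table title rather than stated in this guide.

```python
import posixpath
from typing import NamedTuple

# Directory named in the title of Table 1-2.
BIN_DIR = "/usr/lib/failsafe/bin"


class DaemonInfo(NamedTuple):
    layer: str
    subsystem: str


# Process name -> (layer, subsystem), transcribed from Table 1-2.
DAEMONS = {
    "ha_fsd": DaemonInfo("Linux FailSafe Base", "failsafe2"),
    "ha_cmsd": DaemonInfo("High-availability cluster infrastructure", "cluster_ha"),
    "ha_gcd": DaemonInfo("High-availability cluster infrastructure", "cluster_ha"),
    "ha_srmd": DaemonInfo("High-availability cluster infrastructure", "cluster_ha"),
    "ha_ifd": DaemonInfo("High-availability cluster infrastructure", "cluster_ha"),
    "cad": DaemonInfo("Cluster software infrastructure", "cluster_admin"),
    "crsd": DaemonInfo("Cluster software infrastructure", "cluster_control"),
    "cmond": DaemonInfo("Cluster software infrastructure", "cluster_control"),
    "cdbd": DaemonInfo("Cluster software infrastructure", "cluster_control"),
}


def describe(process: str) -> str:
    """Return a one-line description of where a daemon sits in the stack."""
    info = DAEMONS[process]
    return f"{process}: {info.layer} / {info.subsystem}"


def daemon_path(process: str) -> str:
    """Assumed location of the daemon binary (hypothetical, see lead-in)."""
    if process not in DAEMONS:
        raise KeyError(f"unknown FailSafe process: {process}")
    return posixpath.join(BIN_DIR, process)
```

For example, `describe("crsd")` reports the cluster software infrastructure layer and the cluster_control subsystem.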
The following figures show communication paths in Linux FailSafe. Note
that they do not represent cmond. Figure 1-5 shows the communication path for a node
that is in the pool but not in a cluster.

Action scripts are executed under the following conditions:

- exclusive: the resource group is made online by the user or HA processes are started
- start: the resource group is made online by the user, HA processes are started, or there is a resource group failover
- stop: the resource group is made offline, HA processes are stopped, the resource group fails over, or the node is shut down
- monitor: the resource group is online
- restart: the monitor script fails
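
The conditions above amount to a mapping from cluster events to action scripts. The following Python sketch restates that mapping as data. The event names (such as `group_failover`) and the function `scripts_for` are hypothetical labels chosen for illustration and are not identifiers used by Linux FailSafe.

```python
# Action script -> events that cause it to be executed, as listed above.
# Event names are illustrative labels, not FailSafe identifiers.
TRIGGERS = {
    "exclusive": {"group_made_online_by_user", "ha_processes_started"},
    "start": {"group_made_online_by_user", "ha_processes_started", "group_failover"},
    "stop": {"group_made_offline", "ha_processes_stopped", "group_failover", "node_shutdown"},
    "monitor": {"group_online"},
    "restart": {"monitor_script_failed"},
}


def scripts_for(event: str) -> list:
    """Return the action scripts associated with an event.

    The result follows the order of TRIGGERS, which is the order of the
    list in the text, not necessarily the order of execution.
    """
    return [script for script, events in TRIGGERS.items() if event in events]
```

A resource group failover, for instance, involves both the start and stop scripts, so `scripts_for("group_failover")` returns both names.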
The order of execution is as follows:

1. Linux FailSafe is started, usually at node boot or manually, and reads the resource group information from the cluster configuration database.
2. Linux FailSafe asks the system resource manager (SRM) to run exclusive scripts for all resource groups that are in the Online ready state.
3. SRM returns one of the following states for each resource group:
   - running
   - partially running
   - not running
4. If a resource group has a state of not running in a node where HA services have been started, the following occurs:
   a. Linux FailSafe runs the failover policy script associated with the resource group. The failover policy script takes the list of nodes that are capable of running the resource group (the failover domain) as a parameter.
   b. The failover policy script returns an ordered list of nodes in descending order of priority (the run-time failover domain) where the resource group can be placed.
   c. Linux FailSafe sends a request to SRM to move the resource group to the first node in the run-time failover domain.
   d. SRM executes the start action script for all resources in the resource group:
      - If the start script fails, the resource group is marked online on that node with an "srmd executable error" error.
      - If the start script is successful, SRM automatically starts monitoring those resources. After the specified start monitoring time passes, SRM executes the monitor action script for the resources in the resource group.
5. If the state of the resource group is running or partially running on only one node in the cluster, Linux FailSafe runs the associated failover policy script:
   - If the highest priority node is the same node where the resource group is partially running or running, the resource group is made online on the same node. In the partially running case, Linux FailSafe asks SRM to execute start scripts for resources in the resource group that are not running.
   - If the highest priority node is another node in the cluster, Linux FailSafe asks SRM to execute stop action scripts for resources in the resource group. Linux FailSafe then makes the resource group online on the highest priority node in the cluster.
6. If the state of the resource group is running or partially running on multiple nodes in the cluster, the resource group is marked with an "exclusivity error" error. These resource groups require operator intervention to become online in the cluster.
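
The placement rules above can be summarized as a small decision function. The Python sketch below is a simplified model, not the FailSafe implementation: `plan_placement` and `ordered_policy` are hypothetical names, the action strings are free-form descriptions, and the stand-in policy simply keeps the failover domain in its configured order. Start-script failures (the srmd executable error case) are not modeled.

```python
from typing import Callable, Dict, List, Tuple

RUNNING = "running"
PARTIAL = "partially running"
NOT_RUNNING = "not running"


def ordered_policy(failover_domain: List[str]) -> List[str]:
    """Hypothetical failover policy: priority equals configured order."""
    return list(failover_domain)


def plan_placement(
    states: Dict[str, str],
    failover_domain: List[str],
    policy: Callable[[List[str]], List[str]] = ordered_policy,
) -> Tuple[str, List[str]]:
    """Return (outcome, actions) for one resource group.

    states maps each node name to the state reported by the exclusive
    scripts on that node.
    """
    active = [node for node, state in states.items()
              if state in (RUNNING, PARTIAL)]

    # Running or partially running on several nodes: operator must step in.
    if len(active) > 1:
        return "exclusivity error", ["wait for operator intervention"]

    runtime_domain = policy(failover_domain)
    if not runtime_domain:
        raise ValueError("failover policy returned no candidate nodes")
    target = runtime_domain[0]  # highest priority node

    # Not running anywhere: start on the first node of the run-time domain.
    if not active:
        return f"online on {target}", [f"start all resources on {target}"]

    current = active[0]
    if current == target:
        actions = []
        if states[current] == PARTIAL:
            actions.append(f"start resources that are not running on {current}")
        return f"online on {current}", actions

    # Running on a lower priority node: stop there, bring online on target.
    return f"online on {target}", [
        f"stop resources on {current}",
        f"bring group online on {target}",
    ]
```

Checking for the multiple-node case first mirrors the text, which describes that case as an error state rather than as an input to the failover policy script.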
Figure 1-6 shows the message paths for action scripts and failover policy scripts.

The cluster configuration database is a key component of Linux FailSafe software. It contains all information about the following:

- Resources
- Resource types
- Resource groups
- Failover policies
- Nodes
- Clusters

The cluster configuration database daemon (cdbd) maintains identical databases on each node in the cluster.

The following are the contents of the failsafe directories under the /usr/lib and /var hierarchies:

- /var/run/failsafe/comm/: Directory that contains the files used for communication between the various daemons.
- /usr/lib/failsafe/common_scripts/: Directory that contains the script library (the common functions that may be used in action scripts).
- /var/log/failsafe/: Directory that contains the logs of all scripts and daemons executed by Linux FailSafe. The outputs and errors from the commands within the scripts are logged in the script_nodename file.
- /usr/lib/failsafe/policies/: Directory that contains the failover scripts used for resource groups.
- /usr/lib/failsafe/resource_types/template: Directory that contains the template action scripts.
- /usr/lib/failsafe/resource_types/rt_name: Directory that contains the action scripts for the rt_name resource type. For example, /usr/lib/failsafe/resource_types/filesystem.
- resource_types/rt_name/exclusive: Script that verifies that a resource of this resource type is not already running. For example, resource_types/filesystem/exclusive.
- resource_types/rt_name/monitor: Script that monitors a resource of this resource type.
- resource_types/rt_name/restart: Script that restarts a resource of this resource type on the same node after a monitoring failure.
- resource_types/rt_name/start: Script that starts a resource of this resource type.
- resource_types/rt_name/stop: Script that stops a resource of this resource type.
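
Given the layout above, the location of an action script follows from the resource type name and the action name. The Python sketch below builds that path. The function name `action_script_path` is hypothetical, and the check that rejects a resource type name containing a slash is a defensive assumption, not a rule stated in this guide.

```python
import posixpath

# Base directory for resource type action scripts, from the list above.
RESOURCE_TYPES_DIR = "/usr/lib/failsafe/resource_types"

# The five action scripts described for each resource type.
ACTIONS = ("exclusive", "monitor", "restart", "start", "stop")


def action_script_path(rt_name: str, action: str) -> str:
    """Return the expected path of an action script for a resource type."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action script: {action!r}")
    # Assumption: a resource type name is a single directory component.
    if not rt_name or "/" in rt_name:
        raise ValueError(f"invalid resource type name: {rt_name!r}")
    return posixpath.join(RESOURCE_TYPES_DIR, rt_name, action)
```

For the filesystem resource type used as the example in the text, `action_script_path("filesystem", "exclusive")` yields /usr/lib/failsafe/resource_types/filesystem/exclusive.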
Table 1-3 shows the administrative commands available for use in scripts.

Table 1-3. Administrative Commands for Use in Scripts

| Command | Purpose |
|---|---|
| ha_cilog | Logs messages to the script_nodename log files. |
| ha_execute_lock | Executes a command with a file lock. This allows command execution to be serialized. |
| ha_exec2 | Executes a command and retries the command on failure or timeout. |
| ha_filelock | Locks a file. |
| ha_fileunlock | Unlocks a file. |
| ha_ifdadmin | Communicates with the ha_ifd network interface agent daemon. |
| ha_http_ping2 | Checks if a web server is running. |
| ha_macconfig2 | Displays or modifies MAC addresses of a network interface. |
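
To make the "retries the command on failure or timeout" behavior concrete, the Python sketch below shows the general pattern using only the standard library. It does not call ha_exec2 and does not reproduce its options, which this section does not describe. The function `run_with_retries` and its parameters are hypothetical.

```python
import subprocess


def run_with_retries(cmd, retries=2, timeout=10.0, runner=subprocess.run):
    """Run cmd, retrying when it exits non-zero or exceeds the timeout.

    Makes one initial attempt plus up to `retries` further attempts.
    Returns True as soon as an attempt exits with status 0, otherwise
    False. `runner` is injectable so the logic can be exercised without
    launching real processes.
    """
    for _attempt in range(retries + 1):
        try:
            result = runner(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            # Treat a timeout like a failure and try again.
            continue
        if result.returncode == 0:
            return True
    return False
```

A call such as `run_with_retries(["true"], retries=3, timeout=5.0)` makes up to four attempts in total, each limited to five seconds.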