- action scripts
The set of scripts that determine how a resource is started, monitored,
and stopped. There must be a set of action scripts specified for each
resource type. The possible set of action scripts is: probe, exclusive, start, stop, monitor, and restart.
- cluster
A collection of one or more cluster nodes coupled to each other by networks or other similar interconnections.
A cluster is identified by a simple name; this name must be unique within
the pool. A particular node may be a member of
only one cluster.
- cluster administrator
The person responsible for managing and maintaining a Linux FailSafe
cluster.
- cluster configuration database
Contains configuration information about all resources, resource
types, resource groups, failover policies, nodes, and clusters.
- cluster node
A single Linux image. Usually, a cluster node is an individual computer.
The term node is also used in this guide for brevity.
- control messages
Messages that cluster software sends between the cluster nodes to
request operations on or distribute information about cluster nodes and
resource groups. Linux FailSafe sends control messages for the purpose
of ensuring nodes and groups remain highly available. Control messages
and heartbeat messages are sent through a node's network interfaces that
have been attached to a control network. A node can be attached to multiple
control networks.
A node's control networks should not be set to accept control messages
if the node is not a dedicated Linux FailSafe node. Otherwise, end users
who run non-Linux FailSafe jobs on the machine can have their jobs killed
unexpectedly when Linux FailSafe resets the node.
- control network
The network that connects nodes through their network interfaces
(typically Ethernet) such that Linux FailSafe can maintain a cluster's
high availability by sending heartbeat messages and control messages through
the network to the attached nodes. Linux FailSafe uses the highest priority
network interface on the control network; it uses a network interface
with lower priority when all higher-priority network interfaces on the
control network fail.
A node must have at least one control network interface for heartbeat
messages and one for control messages (both heartbeat and control messages
can be configured to use the same interface). A node can have no more
than eight control network interfaces.
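
The interface selection rule described above can be sketched as follows. This is only an illustration, not Linux FailSafe code: the `ControlInterface` class, the `pick_interface` function, and the convention that a smaller number means a higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CONTROL_INTERFACES = 8  # limit stated in the definition above


@dataclass(frozen=True)
class ControlInterface:
    name: str      # hypothetical interface name, e.g. "eth0"
    priority: int  # assumption: 1 is the highest priority
    up: bool       # whether the interface is currently working


def pick_interface(interfaces: List[ControlInterface]) -> Optional[ControlInterface]:
    """Return the highest-priority working interface, or None if all have failed."""
    if len(interfaces) > MAX_CONTROL_INTERFACES:
        raise ValueError("a node can have no more than eight control network interfaces")
    working = [i for i in interfaces if i.up]
    # Fall back to lower-priority interfaces only when higher-priority ones are down.
    return min(working, key=lambda i: i.priority, default=None)
```

With `eth0` (priority 1) down and `eth1` (priority 2) up, `pick_interface` returns `eth1`; when every interface is down it returns `None`.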
- dependency list
See resource dependency list or resource type dependency list.
- failover
The process of allocating a resource group
to another node, according to a failover policy. A failover may be triggered by the failure
of a resource, a change in the node membership (such as when a node fails
or starts), or a manual request by the administrator.
- failover attribute
A string that affects the allocation of a resource group in a cluster.
The administrator must specify system-defined attributes (such as AutoFailback or ControlledFailback),
and can optionally supply site-specific attributes.
- failover domain
The ordered list of nodes
on which a particular resource group can be allocated.
The nodes listed in the failover domain must be within the same cluster;
however, the failover domain does not have to include every node in the
cluster. The administrator defines the initial failover domain when creating a failover policy. This list is transformed
into the running failover domain by the failover script; the runtime
failover domain is what is actually used to select the failover node.
Linux FailSafe stores the runtime failover domain and uses it as input
to the next failover script invocation. The initial and runtime failover
domains may be identical, depending upon the contents of the failover
script. In general, Linux FailSafe allocates a given resource group to
the first node listed in the runtime failover domain that is also in the
node membership; the point at which this allocation takes place is affected
by the failover attributes.
- failover policy
The method used by Linux FailSafe to determine the destination node
of a failover. A failover policy consists of a failover domain, failover attributes,
and a failover script. A failover policy name must
be unique within the pool.
- failover script
A failover policy component that generates a runtime
failover domain and returns it to the Linux FailSafe process.
The Linux FailSafe process applies the failover attributes and then selects
the first node in the returned failover domain that is also in the current
node membership.
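
The final selection step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `select_failover_node` and the node names are hypothetical, and the effect of failover attributes (which influence when the allocation takes place) is deliberately not modeled.

```python
from typing import Iterable, Optional, Sequence


def select_failover_node(runtime_domain: Sequence[str],
                         node_membership: Iterable[str]) -> Optional[str]:
    """Return the first node of the runtime failover domain that is also
    in the current node membership, or None if there is no such node."""
    members = set(node_membership)
    for node in runtime_domain:  # the domain is an ordered list
        if node in members:
            return node
    return None
```

For a runtime failover domain of `["n1", "n2", "n3"]` and a node membership containing only `n2` and `n3`, the sketch selects `n2`.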
- heartbeat messages
Messages that cluster software sends between the nodes that indicate
a node is up and running. Heartbeat messages and control messages are sent through a node's network interfaces that have been
attached to a control network. A node can be attached to multiple control
networks.
- heartbeat interval
Interval between heartbeat messages. The node timeout value must
be at least 10 times the heartbeat interval for proper Linux FailSafe
operation (otherwise false failovers may be triggered). The higher the
number of heartbeats (smaller heartbeat interval), the greater the potential
for slowing down the network. Conversely, the fewer the number of heartbeats
(larger heartbeat interval), the greater the potential for reducing availability
of resources.
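
The relationship between the two timing values can be checked mechanically. The sketch below assumes both values are expressed in the same unit (milliseconds here, chosen only for illustration); the function name is hypothetical.

```python
def timeout_is_valid(node_timeout_ms: int, heartbeat_interval_ms: int) -> bool:
    """Check that the node timeout is at least 10 times the heartbeat interval."""
    if heartbeat_interval_ms <= 0:
        raise ValueError("heartbeat interval must be positive")
    return node_timeout_ms >= 10 * heartbeat_interval_ms
```

For example, a heartbeat interval of 1000 ms requires a node timeout of at least 10000 ms; a timeout of 5000 ms would fail the check.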
- initial failover domain
The ordered list of nodes, defined by the administrator when a failover
policy is first created, that is used the first time a cluster is booted. The
ordered list specified by the initial failover domain is transformed into
a runtime failover domain by the failover
script; the runtime failover domain is used along with failover
attributes to determine the node on which a resource group should reside.
With each failure, the failover script takes the current runtime failover
domain and potentially modifies it; the initial failover domain is never
used again. Depending on the runtime conditions and contents of the failover
script, the initial and runtime failover domains may be identical. See
also runtime failover domain.
- key/value attribute
A set of information that must be defined for a particular resource
type. For example, for the resource type filesystem,
one key/value pair might be mount_point=/fs1 where mount_point is the key and /fs1 is the value
specific to the particular resource being defined. Depending on the value,
you specify either a string or integer
data type. In the previous example, you would specify string as the data type for the value /fs1.
- log configuration
A log configuration has two parts: a log level
and a log file, both associated with a log group. The cluster administrator can customize the location
and amount of log output, and can specify a log configuration for all
nodes or for only one node. For example, the crsd log
group can be configured to log detailed level-10 messages to the /var/log/failsafe/crsd-foo log only on the node foo, and to write only minimal level-1 messages to the crsd log on all other nodes.
- log file
A file containing Linux FailSafe notifications for a particular log group. A log file is part of the log configuration for a log group. By default, log files reside in the /var/log/failsafe directory, but the cluster administrator
can customize this. Note: Linux FailSafe logs both normal operations and
critical errors to /var/log/messages,
as well as to individual logs for specific log groups.
- log group
A set of one or more Linux FailSafe processes that use the same
log configuration. A log group usually corresponds to one Linux FailSafe
daemon, such as gcd.
- log level
A number controlling how many log messages Linux FailSafe
will write into an associated log group's log file. A log level is part
of the log configuration for a log group.
- node
See cluster node.
- node ID
A 16-bit positive integer that uniquely defines a cluster node.
During node definition, Linux FailSafe will assign a node ID if one has
not been assigned by the cluster administrator. Once assigned, the node
ID cannot be modified.
- node membership
The list of nodes in a cluster on which Linux FailSafe can allocate
resource groups.
- node timeout
If no heartbeat is received from a node in this period of time,
the node is considered to be dead. The node timeout value must be at least
10 times the heartbeat interval for proper Linux FailSafe operation (otherwise
false failovers may be triggered).
- notification command
The command used to notify the cluster administrator of changes
or failures in the cluster, nodes, and resource groups. The command must
exist on every node in the cluster.
- offline resource group
A resource group that is not highly available in the cluster. To
put a resource group in offline state, Linux FailSafe stops the group
(if needed) and stops monitoring the group. An offline resource group
can be running on a node, yet not under Linux FailSafe control. If the
cluster administrator specifies the detach only option
while taking the group offline, then Linux FailSafe will not stop the
group but will stop monitoring the group.
- online resource group
A resource group that is highly available in the cluster. When Linux
FailSafe detects a failure that degrades the resource group availability,
it moves the resource group to another node in the cluster. To put a resource
group in online state, Linux FailSafe starts the group (if needed) and
begins monitoring the group. If the cluster administrator specifies the attach only option while bringing the group online, then Linux
FailSafe will not start the group but will begin monitoring the group.
- owner host
A system that can control a Linux FailSafe node remotely (such as
power-cycling the node). Serial cables must physically connect the two
systems through the node's system controller port. At run time, the owner
host must be defined as a node in the Linux FailSafe pool.
- owner TTY name
The device file name of the terminal port (TTY) on the owner host to which the system controller serial cable is
connected. The other end of the cable connects to the Linux FailSafe node
with the system controller port, so the node can be controlled remotely
by the owner host.
- pool
The entire set of nodes
involved with a group of clusters. The clusters in the group are usually close
together and should always serve a common purpose. A replicated database
is stored on each node in the pool.
- port password
The password for the system controller port, usually set once in
firmware or by setting jumper wires. (This is not the same as the node's
root password.)
- powerfail mode
When powerfail mode is turned on, Linux FailSafe
tracks the response from a node's system controller as it makes reset
requests to a cluster node. When these requests fail to reset the node
successfully, Linux FailSafe uses heuristics to try to estimate whether
the machine has been powered down. If the heuristic algorithm returns
with success, Linux FailSafe assumes the remote machine has been reset
successfully. When powerfail mode is turned off, the
heuristics are not used and Linux FailSafe may not be able to detect node
power failures.
- process membership
A list of process instances in a cluster that form a process group.
There can be one or more processes per node.
- resource
A single physical or logical entity that provides a service to clients
or other resources. For example, a resource can be a single disk volume,
a particular network address, or an application such as a web server.
A resource is generally available for use over time on two or more nodes in a cluster,
although it can be allocated to only one node at any given time. Resources
are identified by a resource name and a resource type. Dependent resources must be part of the same resource group and are identified in a resource
dependency list.
- resource dependency
The condition in which a resource requires the existence of other
resources.
- resource group
A collection of resources.
A resource group is identified by a simple name; this name must be unique
within a cluster. Resource groups cannot overlap; that is, two resource
groups cannot contain the same resource. All interdependent resources
must be part of the same resource group. If any individual resource in
a resource group becomes unavailable for its intended use, then the entire
resource group is considered unavailable. Therefore, a resource group
is the unit of failover for Linux FailSafe.
- resource keys
Variables that define a resource of a given resource type. The action
scripts use this information to start, stop, and monitor a resource of
this resource type.
- resource name
The simple name that identifies a specific instance of a resource type. A resource name must be unique within a cluster.
- resource type
A particular class of resource. All of the
resources in a particular resource type can be handled in the same way
for the purposes of failover. Every resource is
an instance of exactly one resource type. A resource type is identified
by a simple name; this name must be unique within a cluster. A resource
type can be defined for a specific node or for an entire cluster. A resource
type that is defined for a node overrides a cluster-wide resource type
definition with the same name; this allows an individual node to override
global settings from a cluster-wide resource type definition.
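
The override rule can be pictured as a two-level lookup. The sketch below is an assumption-laden illustration: the dictionaries, the function name `resolve_resource_type`, and the `monitor_interval` setting are hypothetical and do not reflect how the cluster configuration database actually stores definitions.

```python
from typing import Any, Dict


def resolve_resource_type(name: str,
                          node: str,
                          node_defs: Dict[str, Dict[str, Any]],
                          cluster_defs: Dict[str, Any]) -> Any:
    """Return the node-specific definition of a resource type if one exists,
    otherwise the cluster-wide definition with the same name."""
    per_node = node_defs.get(node, {})
    if name in per_node:
        return per_node[name]  # node definition overrides the cluster-wide one
    return cluster_defs[name]  # KeyError if the type is not defined at all


# Hypothetical definitions: node "n1" overrides a setting of "filesystem".
cluster_defs = {"filesystem": {"monitor_interval": 60}}
node_defs = {"n1": {"filesystem": {"monitor_interval": 30}}}
```

Here `resolve_resource_type("filesystem", "n1", node_defs, cluster_defs)` yields the node-specific definition, while the same lookup for any other node falls back to the cluster-wide one.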
- resource type dependency
A set of resource types upon which a resource type depends. For
example, the filesystem resource type depends upon
the volume resource type, and the Netscape_web resource type depends upon the filesystem and IP_address resource types.
- runtime failover domain
The ordered set of nodes on which the resource group can execute
upon failures, as modified by the failover script.
The runtime failover domain is used along with failover attributes to
determine the node on which a resource group should reside. See also initial failover domain.
- start/stop order
Each resource type has a start/stop order, which is a non-negative
integer. In a resource group, the start/stop orders of the resource types
determine the order in which the resources will be started when Linux
FailSafe brings the group online and will be stopped when Linux FailSafe
takes the group offline. The group's resources are started in increasing
order, and stopped in decreasing order; resources of the same type are
started and stopped in indeterminate order. For example, if resource type volume has order 10 and resource type filesystem
has order 20, then when Linux FailSafe brings a resource group online,
all volume resources in the group will be started before all filesystem
resources in the group.
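
The ordering rule can be illustrated with a short sketch. The data layout (resources as `(name, type)` pairs) and the function names are assumptions made for the example. Because Python's sort is stable, resources of the same type happen to keep their input order here, whereas Linux FailSafe makes no such guarantee.

```python
from typing import Dict, List, Tuple

Resource = Tuple[str, str]  # (resource name, resource type)


def start_sequence(resources: List[Resource], type_order: Dict[str, int]) -> List[Resource]:
    """Order resources for bringing a group online: increasing start/stop order."""
    return sorted(resources, key=lambda r: type_order[r[1]])


def stop_sequence(resources: List[Resource], type_order: Dict[str, int]) -> List[Resource]:
    """Order resources for taking a group offline: decreasing start/stop order."""
    return sorted(resources, key=lambda r: type_order[r[1]], reverse=True)


# Orders taken from the example above; the resource names are hypothetical.
type_order = {"volume": 10, "filesystem": 20}
group = [("fs1", "filesystem"), ("vol1", "volume"), ("fs2", "filesystem")]
```

With these inputs, `start_sequence` places `vol1` before both filesystem resources, and `stop_sequence` places it after them.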
- system controller port
A port sitting on a node that provides a way to power-cycle the
node remotely. Enabling or disabling a system controller port in the cluster
configuration database (CDB) tells Linux FailSafe whether it can perform
operations on the system controller port. (When the port is enabled, serial
cables must attach the port to another node, the owner host.) System controller
port information is optional for a node in the pool, but is required if
the node will be added to a cluster; otherwise resources running on that
node will never be highly available.
- tie-breaker node
A node identified as a tie-breaker for Linux FailSafe to use in
the process of computing node membership for the cluster, when exactly
half the nodes in the cluster are up and can communicate with each other.
If a tie-breaker node is not specified, Linux FailSafe will use the node
with the lowest node ID in the cluster as the tie-breaker node.
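
The default choice of tie-breaker and the condition under which it matters can be sketched as follows. The function names are hypothetical, and the sketch does not attempt to reproduce the full node membership computation.

```python
from typing import Iterable, Optional


def tie_breaker(node_ids: Iterable[int], configured: Optional[int] = None) -> int:
    """Return the configured tie-breaker node ID, or the lowest node ID
    in the cluster when none has been specified."""
    if configured is not None:
        return configured
    return min(node_ids)


def needs_tie_breaker(total_nodes: int, nodes_up: int) -> bool:
    """True when exactly half the nodes in the cluster are up."""
    return 2 * nodes_up == total_nodes
```

For a four-node cluster with node IDs 7, 3, 12, and 5 and no configured tie-breaker, node 3 is used; the tie-breaker comes into play only when exactly two of the four nodes are up and can communicate with each other.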
- type-specific attribute
Required information used to define a resource of a particular resource
type. For example, for a resource of type filesystem,
you must enter attributes for the resource's volume name (where the filesystem
is located) and specify options for how to mount the filesystem (for example,
as readable and writable).