Chapter 5. Performance Co-Pilot for FailSafe

This chapter tells you how to install and use Performance Co-Pilot (PCP) for FailSafe to monitor the availability of a FailSafe cluster.

PCP provides the following:

The visualization of statistics provides valuable information about the availability of nodes and resources monitored by FailSafe. For example, it can highlight a degradation in monitoring response times that may indicate problems with the availability of services provided by the cluster.

Because PCP for FailSafe is an extension to the PCP framework, you can use other PCP tools to analyze or present FailSafe monitoring statistics, and record PCP for FailSafe metrics as archives for deferred analysis. You can also use PCP to gather statistics about CPU and memory utilization, network and disk activity, and other performance metrics for each node in the cluster.

Installing Performance Co-Pilot Software

You can deploy PCP for FailSafe as a collector agent or as a monitor client:

  • Collector agents are installed on collector hosts, which are the nodes in the FailSafe cluster from which you want to gather statistics. Typically, each node in a FailSafe cluster is designated as a collector host.

  • A monitor client is installed on the monitor host, which is typically a workstation that has a display and is running an X Window System server and a window manager.

Installing the Collector Host

To install PCP for FailSafe on the designated collector hosts, the following software components must already be installed:

  • The RPM package failsafe-1.0.1-1 or a later version of it

  • The RPM package pcp-2.1.6-1 or a later version of it

  • The RPM package pcp-pro-2.1.6-7 or a later version of it

After this software is installed, install the PCP for FailSafe RPM package on each collector host by doing the following:

  1. Locate the binary RPM package pcp-fsafe-2.1.1-1.i386.rpm on the FailSafe CD.

  2. Log in as root.

  3. Issue the rpm(1) command to install PCP for FailSafe:

    # rpm -i <srcpath>/pcp-fsafe-2.1.1-1.i386.rpm

    where <srcpath> is the path (on the local file system, CD-ROM, or URL) to the PCP for FailSafe binary RPM package.

    If pcp-2.1.6-1 or pcp-pro-2.1.6-7 is not installed, rpm(1) reports an error saying that prerequisite packages are not installed. You must install them before installing pcp-fsafe-2.1.1-1.

    To do so, locate the pcp and pcp-pro RPM packages and install them the same way, in the following order:

    # rpm -i <srcpath>/pcp-2.1.6-1.i386.rpm
    # rpm -i <srcpath>/pcp-pro-2.1.6-7.i386.rpm

  4. Change to the /var/pcp/pmdas/fsafe directory:

    # cd /var/pcp/pmdas/fsafe

  5. Run the Install utility, which installs the FailSafe performance metrics into the PCP performance metrics namespace:

    # ./Install

  6. Choose the appropriate configuration for installation of the fsafe Performance Metrics Domain Agent (PMDA):

    For Linux FailSafe clusters, since the RPM contains both collector and monitor software, select both:

    collector   Collects performance statistics on this system
    monitor     Allows this system to monitor local and/or remote systems
    both        Allows collector and monitor configuration for this system

    Please enter c(ollector) or m(onitor) or b(oth) {b} b
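The installation order described above can be summarized in a short dry-run script. This is only a sketch: it prints the commands rather than running them, and the function name print_install_plan, the SRCPATH variable, and the /mnt/cdrom default are assumptions made for illustration, not part of the product.

```shell
#!/bin/sh
# Dry run: print (do not execute) the collector host installation commands
# in dependency order. Run the printed commands yourself as root.
print_install_plan() {
    srcpath=$1   # directory that holds the binary RPM packages
    for pkg in pcp-2.1.6-1 pcp-pro-2.1.6-7 pcp-fsafe-2.1.1-1; do
        echo "rpm -i $srcpath/$pkg.i386.rpm"
    done
    # Steps 4 and 5: add the FailSafe metrics to the PCP namespace.
    echo "cd /var/pcp/pmdas/fsafe && ./Install"
}

# Assumption: SRCPATH defaults to a hypothetical CD-ROM mount point.
print_install_plan "${SRCPATH:-/mnt/cdrom}"
```

Printing the plan first lets you confirm the package paths before making any change to the system. Packages that are already installed should be left out of the commands you actually run.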

Removing Performance Metrics from a Collector Host

If you wish to remove PCP for FailSafe from a collector host, you must first remove the PCP for FailSafe metrics from the performance metrics namespace of that host. Do this before removing the pcp_fsafe subsystem by entering the following commands:

  1. Change to the /var/pcp/pmdas/fsafe directory:

    # cd /var/pcp/pmdas/fsafe

  2. Run the Remove utility:

    # ./Remove

Installing the Monitor Host

To install PCP for FailSafe on a designated monitor host, the following software components must already be installed on it:

  • The pcp_eoe.sw subsystem of IRIX 6.5.6 or later, including the subsystem pcp_eoe.sw.monitor

  • PCP 2.1 or later, including the subsystem pcp.sw.monitor

The monitor license (PCPMON) must also be installed on the monitor host.

After this software is installed, install the following subsystems of PCP for FailSafe on each monitor host. Table 5-1 lists the subsystems required for a monitor host and their approximate sizes:

Table 5-1. PCP for FailSafe Monitor Subsystems

Subsystem                 Size in Kbytes
pcp_fsafe.man.pages       40
pcp_fsafe.man.relnotes    32
pcp_fsafe.sw.monitor      516

The instructions for installing on a monitor host are the same as those for a collector host, except that you do not need to install the FailSafe performance metrics into the PCP performance metrics namespace. Refer to the section “Installing the Collector Host”, disregarding steps 4 to 6.

Using the Visualization Tools

To view statistics about the FailSafe cluster, use the hbvis(1) and rmvis(1) commands.

The hbvis(1) command constructs a display showing the distribution of heartbeat response times for every node in the cluster. Figure 5-1 shows an example display.

Figure 5-1. Heartbeat Response Statistics

Key features of the display include the frequency of heartbeat responses that arrive at particular intervals within the timeout period, and the frequency of heartbeat responses that have been missed (determined not to have arrived). The bar representing the frequency of missed heartbeat responses changes color to indicate the urgency of problems with availability of a node.

The rmvis(1) command constructs a display of the resource monitoring response times for resources monitored on every node of the cluster. Figure 5-2 shows an example display.

Figure 5-2. Resource Monitoring Statistics

The display is similar in concept to that of hbvis(1), showing the frequency of resource monitoring responses that arrive within the timeout period, and the frequency of responses that have timed out. The bar representing the frequency of resource responses that have timed out also changes color to indicate the urgency of problems with the availability of particular resources.

If a node has failed or a resource has failed over, its statistics will disappear from the display.

To run a visualization tool on the monitor host, use the -h option to specify an available collector host in the cluster (host):

% hbvis -h host

or

% rmvis -h host

The collector host specified can be any collector host that is a member of the cluster for which you wish to view statistics.

There are various options available to alter the display provided by hbvis(1) and rmvis(1):

-H hostfile

Provides a file that lists the nodes that are to appear in the visualization. This is useful in limiting the number of nodes in the display, because it takes more time to construct the display for clusters with more nodes.

-t interval

Assigns the sampling time of the visualization. There may be circumstances where extending the period of the sampling time may provide better application responsiveness, particularly for clusters with many nodes. Because FailSafe maintains the statistics, hbvis(1) and rmvis(1) will always show the latest statistics available for the sampling time selected. For details about the interval option, see the pmview(1) and PCPIntro(1) man pages.

-r

Selects the FailSafe metrics that present a sampling of statistics taken from the time of the last statistical reset. This enables hbvis(1) and rmvis(1) to improve the sensitivity of the visualization when abrupt changes appear in the FailSafe monitoring statistics.

Without the -r option, the statistics presented are from a sampling of FailSafe metrics collected from the time ha_cmsd(1m) and/or ha_srmd(1m) was last restarted.

-R

Starts a new statistical sampling.

-v

(hbvis(1) only) Provides a visualization of heartbeat statistics for each node in the cluster, from the point of view of the selected collector host only. (The collector host is selected using the -h option.) Each node in the cluster has its own graphical representation of heartbeat statistics, as observed by the selected collector host.

-w

(hbvis(1) only) Provides a visualization of the aggregate of heartbeat statistics for all nodes in the cluster, from the point of view of the selected collector host only. (The collector host is selected using the -h option.) There is only one graphical representation of heartbeat statistics for the entire cluster, as observed by the selected collector host.

For a complete description of options, see the hbvis(1) and rmvis(1) man pages.
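As an illustration of combining these options, the following sketch writes a host file and prints an hbvis(1) command line that uses it. The node names node1 and node2, the helper name make_hostfile, and the temporary file location are hypothetical. The one-name-per-line host file layout and the plain-integer form of the -t interval are assumptions; check the hbvis(1) and PCPIntro(1) man pages for the exact formats.

```shell
#!/bin/sh
# Write a host file listing the nodes to display.
# Assumption: the file holds one node name per line.
make_hostfile() {
    hostfile=$1
    shift
    printf '%s\n' "$@" > "$hostfile"
}

# Hypothetical node names and file location.
hostfile=${TMPDIR:-/tmp}/fsafe-nodes.$$
make_hostfile "$hostfile" node1 node2

# Print (do not run) a command that limits the display to those nodes,
# requests a longer sampling time, and shows statistics since the last reset.
echo "hbvis -h node1 -H $hostfile -t 5 -r"
```

The same options apply to rmvis(1), except for -v and -w, which are specific to hbvis(1).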

hbvis(1) and rmvis(1) use the command pmview(1) to display the 3-D visualization of FailSafe performance metrics. For a description of the various menu commands and controls in the visualization window, consult the man pages for pmview(1).

PCP for FailSafe Performance Metrics

PCP tools such as pmlogger(1), pmchart(1), and pminfo(1) can use the metrics exported by PCP for FailSafe.

Appendix A, “Metrics Exported by PCP for FailSafe”, provides a description of PCP for FailSafe metrics. You can also display a description of metrics by using the following command:

% pminfo -tT -h host

(If you are logged in to a collector host, you can leave out the -h option.)

Troubleshooting

A grey display (that is, no colored rectangle bars appear on the node's grey baseplane) when using hbvis(1) or rmvis(1) may indicate one of the following:

  • The node is down.

    If you wish to see only the nodes that are up, create a file containing a list of nodes that are to be displayed and pass it as an option to hbvis(1)/rmvis(1) using the -H option (or the environment variable PCP_FSAFE_NODES) so that a new picture of the cluster can be generated. Please refer to the hbvis(1)/rmvis(1) man pages for more details on the -H option.

  • The collector daemons have been killed on that node.

    To solve this problem, restart pmdafsafe(1) in one of the following ways:

    • If pmcd(1) is still running, send pmcd(1) the SIGHUP signal by entering the following:

      # killall -HUP pmcd

    • If pmcd(1) is not running, restart PCP by entering the following:

      # /etc/init.d/pcp start

  • The timeout and sampling settings are too short.

    To change the sampling time, use the time controls available in the pmview(1) window. The default sampling time is 2 seconds; you may need to lengthen it if the display is unsatisfactory.

    Alternatively, there may be timeout issues between pmdafsafe(1) and pmcd(1), or between pmcd(1) and pmview(1). Refer to the man pages for pmcd(1) and PCPIntro(1) for information on how to change the timeout settings for the various PCP tools.

  • The resource has failed over (for rmvis(1)).

    In this case, restart rmvis(1) so that a new picture of the cluster can be generated.
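The choice between the two pmdafsafe(1) restart methods listed above can be sketched as a small helper. This is an illustration only: the function name pmcd_recovery_command is hypothetical, the script assumes that pgrep(1) is available on the collector host, and it prints the suggested command instead of running it.

```shell
#!/bin/sh
# Map the state of pmcd to the matching recovery command from the list above.
# Argument: "running" if pmcd is alive; any other value means it is not.
pmcd_recovery_command() {
    case "$1" in
        running) echo "killall -HUP pmcd" ;;
        *)       echo "/etc/init.d/pcp start" ;;
    esac
}

# Assumption: pgrep(1) exists; if it does not, pmcd is treated as stopped,
# so substitute a ps(1) and grep(1) pipeline on such systems.
if pgrep -x pmcd >/dev/null 2>&1; then
    state=running
else
    state=stopped
fi

echo "pmcd is $state; as root, run: $(pmcd_recovery_command "$state")"
```

Run the helper on the node whose baseplane is grey, then enter the printed command as root.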