SGI InfiniteStorage High Availability Using Linux-HA Heartbeat
(document number: 007-5451-005 / published: 2009-10-22)

Chapter 10. Troubleshooting

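This chapter discusses the following:

  • General Heartbeat Troubleshooting

  • Recovering from an Incomplete Failover

  • Error Messages in /var/log/messages

  • dmaudit Error Detection

  • DMF Logs are Incomplete

  • Using SGI Knowledgebase

  • Reporting Problems to SGI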

For details about troubleshooting Heartbeat, see the documentation referred to in “Sources for Detailed Heartbeat Documentation” in Chapter 1.

General Heartbeat Troubleshooting

If you notice problems with Heartbeat, do the following:

  1. Watch the output from crm_mon.

  2. If things do not seem to be responding correctly or if crm_mon lists an error, execute the following:

    # crm_verify -LV 


    Note: For more detail, repeat the -V option on the crm_verify command line. If you run crm_verify before STONITH is enabled, it reports errors. Errors similar to the following can be ignored at that stage and go away after STONITH is configured (line breaks added here for readability):
    crm_verify[182641]: 2008/07/11_16:26:54 ERROR: unpack_operation: 
    Specifying on_fail=fence and
    stonith-enabled=false makes no sense



  3. Any problems listed in the crm_verify output include the failed action and the host on which the action failed. To find the specific cause, match those events to messages in /var/log/messages or to other information you have about the cluster state.
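
    As a hedged sketch of steps 2 and 3, the filter below keeps only the ERROR lines from crm_verify output and drops the STONITH message that the Note says can be ignored. The function name filter_verify_errors is hypothetical (it is not part of Heartbeat). The sketch assumes that each message arrives on a single line (the sample above is wrapped only for readability) and that the string ERROR: marks every error line, as it does in that sample.

```shell
# filter_verify_errors: hypothetical helper, not part of Heartbeat.
# Reads crm_verify output on standard input and prints only ERROR lines,
# skipping the STONITH message that can be ignored before STONITH is
# configured. Always returns 0, even when nothing matches.
filter_verify_errors() {
    grep 'ERROR:' | grep -v 'stonith-enabled=false makes no sense' || true
}
```

    One possible use, assuming crm_verify may write its diagnostics to standard error (hence the 2>&1):

    # crm_verify -LV 2>&1 | filter_verify_errors

    The remaining lines are the ones to match against /var/log/messages.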

Recovering from an Incomplete Failover

After an incomplete failover, in which one or more of the resources are not started and the cluster can no longer provide high availability, you must do the following to restore resource functionality and high availability:

  1. Disable Heartbeat management of the resource group:

    # crm_resource -r resourcegroupname -p is_managed -v false

  2. Determine which resources have failcounts:

    # crm_failcount -G -U nodename -r resourcename

    Repeat for each resource on each node.

  3. Troubleshoot the failed resource operations. Examine the /var/log/messages system log and the application logs around the time of each failure to determine why the operation failed, then correct those causes.

  4. Ensure that all of the individual resources are working properly according to the information in Chapter 7, “Configuring and Testing SGI Products for High Availability”.

  5. Remove the failcounts found in step 2:

    # crm_failcount -D -U nodename -r failed_resourcename

    Repeat this for each failed resource on each node.

  6. Remove error messages:

    # crm_resource -C -H nodename -r failed_resourcename

    Repeat this for each failed resource on each node.

  7. Reenable Heartbeat management:

    # crm_resource -r resourcegroupname -d is_managed
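
Because steps 2, 5, and 6 must be repeated for each resource on each node, it can help to generate the command list first. The following is a minimal sketch, not an SGI-supplied tool: the function names are hypothetical, the helpers only print the commands shown in the steps above (they do not run them), and they assume that node and resource names contain no whitespace. The cleanup helper interleaves steps 5 and 6 per resource, which is an assumption about acceptable ordering rather than something this procedure states.

```shell
# Hypothetical helpers: they only PRINT the commands from steps 2, 5,
# and 6 so that the list can be reviewed before anything is run.
# $1 = space-separated node names, $2 = space-separated resource names.

# Step 2: one failcount query per node/resource pair.
print_failcount_queries() {
    for node in $1; do
        for res in $2; do
            echo "crm_failcount -G -U $node -r $res"
        done
    done
}

# Steps 5 and 6: remove the failcount, then clear the error messages,
# for each node/resource pair.
print_cleanup_commands() {
    for node in $1; do
        for res in $2; do
            echo "crm_failcount -D -U $node -r $res"
            echo "crm_resource -C -H $node -r $res"
        done
    done
}
```

For example, print_cleanup_commands "node1 node2" "rsc1" (placeholder names) prints four commands. Pass only the resources that actually failed, and review the list before running any of it as root.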

Error Messages in /var/log/messages

If you see errors in the /var/log/messages file, you should run the crm_verify command and look for corresponding messages.

dmaudit Error Detection

After a system crash, files in the DMF-managed user filesystems can be out of synchronization with the DMF database entries, which may cause dmaudit to report errors.

For example, when a user file is removed, the DMF daemon immediately soft-deletes the database entries corresponding to that file, marking them with the time at which the file was removed. If the machine crashes shortly thereafter, the removal of the file might not yet have been written to disk, so the file reappears when the filesystem is mounted on the alternate node. If dmaudit is then run, it reports that the filesystem and database are inconsistent because it found an existing file pointing at a soft-deleted database entry. You can minimize these errors by using the dirsync mount option if potentially lower filesystem performance is not a concern.

DMF-managed user filesystems may be mounted with either the sync or dirsync mount option, depending on your desire for filesystem completeness upon system failure. Alternatively, the user applications accessing these filesystems are responsible for verifying file synchronization.


Caution: The sync and dirsync mount options have serious performance implications that may outweigh filesystem synchronization benefits.
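
To confirm which of these options a mounted filesystem is actually using, one approach is to read the options column of the mount table. The following is a minimal sketch under stated assumptions: has_mount_option is a hypothetical helper, the table is assumed to be in /proc/mounts format (device, mount point, type, comma-separated options), and /dmfusr1 in the usage line is a placeholder mount point.

```shell
# has_mount_option: hypothetical helper.
# Succeeds if the given mount point is listed with the given option in a
# mount table in /proc/mounts format.
# $1 = mount point, $2 = option (for example sync or dirsync),
# $3 = mount table to read (defaults to /proc/mounts).
has_mount_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}
```

For example:

    # has_mount_option /dmfusr1 dirsync && echo "dirsync is set"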


DMF Logs are Incomplete

DMF logs may not be complete after a system failure. If you want to ensure that the DMF logs are complete upon system failure, regardless of the performance cost, you can mount the DMF spool filesystem with the sync mount option.

The DMF home, journals, and disk MSP filesystems do not require the sync mount option because DMF synchronizes data on those filesystems. The move and temporary filesystems do not require the sync mount option because DMF does not recover anything from them.


Caution: The sync mount option has serious performance implications that may outweigh filesystem synchronization benefits.


Using SGI Knowledgebase

If you encounter problems and have an SGI support contract, you can log on to Supportfolio and access the Knowledgebase tool to help find answers.

To log in to Supportfolio Online, see:

https://support.sgi.com/login

Then click on Search the SGI Knowledgebase and select the type of search you want to perform.

If you need further assistance, contact SGI Support.

Reporting Problems to SGI

If you need to report problems to SGI Support, do the following:

  • Run the following command as root on every node in the cluster in order to gather system configuration information:

    # /usr/sbin/system_info_gather -A -o nodename.out

  • Gather the information reported about Heartbeat in the logfiles; see “Reviewing Log Files” in Chapter 9.

  • Run the following command once on the node where the resource group currently resides to collect information for today and the specified number of additional days (extra-days must be a numerical value greater than or equal to 0):

    # dmcollect extra-days 

    See the dmcollect(8) man page for additional information.

When you contact SGI Support, you will be provided with information on how and where to upload the collected information files for SGI analysis.
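
The collection steps above can be sketched as follows. Both function names are hypothetical, and the sketch makes two assumptions that the text does not state: that extra-days is a whole number, and that node names contain no whitespace. The second helper only prints the system_info_gather command for each node; each command must still be run as root on the node it names.

```shell
# is_valid_extra_days: hypothetical helper.
# Succeeds only if $1 is a non-empty string of decimal digits, that is,
# a whole number greater than or equal to 0.
is_valid_extra_days() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# print_gather_commands: hypothetical helper that only PRINTS the
# system_info_gather command for each node named in $1 (space-separated),
# using the node name for the output file as in the example above.
print_gather_commands() {
    for node in $1; do
        echo "/usr/sbin/system_info_gather -A -o $node.out"
    done
}
```

For example, is_valid_extra_days 3 && dmcollect 3 runs dmcollect only when the argument passes the check.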
