Linux » Books » Developer »
Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-012 / published: 2009-10-22)
table of contents | additional info | download find in page
Chapter 7. Run-time Tuning
This chapter discusses ways in which the user can tune the run-time
environment to improve the performance of an MPI message passing application
on SGI computers. None of these ways involve application code changes.
This chapter covers the following topics:
Reducing Run-time Variability
One of the most common problems with
optimizing message passing codes on large shared memory computers is achieving
reproducible timings from run to run. To reduce run-time variability,
you can take the following precautions: Do not oversubscribe the system. In other words, do not
request more CPUs than are available and do not request more memory than
is available. Oversubscribing causes the system to wait unnecessarily
for resources to become available and leads to variations in the results
and less than optimal performance.
Avoid interference from other system activity. The Linux
kernel uses more memory on node 0 than on other nodes (node 0 is called
the kernel node in the following discussion). If your application uses
almost all of the available memory per processor, the memory for processes
assigned to the kernel node can unintentionally spill over to nonlocal
memory. By keeping user applications off the kernel node, you can avoid
this effect.
Additionally, by restricting system daemons to run on the kernel
node, you can also deliver an additional percentage of each application
CPU to the user.
Avoid interference with other applications. You can use
cpusets or cpumemsets to address this problem also. You can use cpusets
to effectively partition a large, distributed memory host in a fashion
that minimizes interactions between jobs running concurrently on the system.
See the Linux Resource Administration
Guide for information about cpusets and cpumemsets.
On a quiet, dedicated system, you can use dplace
or the MPI_DSM_CPULIST shell variable to
improve run-time performance repeatability. These approaches are not as
suitable for shared, nondedicated systems.
Use a batch scheduler; for example, LSF from Platform
Computing or PBSpro from Veridan. These batch schedulers use cpusets to
avoid oversubscribing the system and possible interference between applications.
Tuning MPI Buffer Resources
By default, the SGI MPI implementation buffers
messages whose lengths exceed 64 bytes. Longer messages are buffered in
a shared memory region to allow for exchange of data between MPI processes.
In the SGI MPI implementation, these buffers are divided into two basic
pools. For messages exchanged between MPI processes within the
same host or between partitioned systems when using the XPMEM driver,
buffers from the ”per process” pool (called the “per
proc” pool) are used. Each MPI process is allocated a fixed portion
of this pool when the application is launched. Each of these portions
is logically partitioned into 16-KB buffers.
For MPI jobs running across multiple hosts, a second pool
of shared memory is available. Messages exchanged between MPI processes
on different hosts use this pool of shared memory, called the “per
host” pool. The structure of this pool is somewhat more complex
than the “per proc” pool.
For an MPI job running on a single host, messages that exceed 64
bytes are handled as follows. For messages with a length of 16 KB or less,
the sender MPI process buffers the entire message. It then delivers a
message header (also called a control message) to a mailbox, which is
polled by the MPI receiver when an MPI call is made. Upon finding a matching
receive request for the sender's control message, the receiver copies
the data out of the shared memory buffer into the application buffer indicated
in the receive request. The receiver then sends a message header back
to the sender process, indicating that the shared memory buffer is available
for reuse. Messages whose length exceeds 16 KB are broken down into 16-KB
chunks, allowing the sender and receiver to overlap the copying of data
to and from shared memory in a pipeline fashion.
Because there is a finite number of these shared memory buffers,
this can be a constraint on the overall application performance for certain
communication patterns. You can use the MPI_BUFS_PER_PROC
shell variable to adjust the number of buffers available for the “per
proc” pool. Similarly, you can use the MPI_BUFS_PER_HOST
shell variable to adjust the “per host” pool. You
can use the MPI statistics counters to determine if retries for these
shared memory buffers are occurring.
For information on the use of these counters, see “MPI Internal Statistics” in Chapter 6.
In general, you can avoid excessive numbers of retries for buffers by
increasing the number of buffers for the “per proc” pool or “per
host” pool. However, you should keep in mind that increasing the
number of buffers does consume more memory. Also, increasing the number
of “per proc” buffers does potentially increase the probability
for cache pollution (that is, the excessive filling of the cache with
message buffers). Cache pollution can result in degraded performance during
the compute phase of a message passing application.
There are additional buffering considerations to take into account
when running an MPI job across multiple hosts. For further discussion
of multihost runs, see “Tuning for Running Applications Across Multiple Hosts”.
For further discussion on programming implications concerning message
buffering, see “Buffering” in Chapter 4.
Avoiding Message Buffering - Enabling Single Copy
For
message transfers between MPI processes within the same host or transfers
between partitions, it is possible under certain conditions to avoid the
need to buffer messages. Because many MPI applications are written assuming
infinite buffering, the use of this unbuffered approach is not enabled
by default for MPI_Send. This section describes how
to activate this mechanism by default for MPI_Send.
For MPI_Isend, MPI_Sendrecv,
MPI_Alltoall, MPI_Bcast, MPI_Allreduce
, and MPI_Reduce, this optimization is enabled
by default for large message sizes. To disable this default single copy
feature used for the collectives, use the MPI_DEFAULT_SINGLE_COPY_OFF
environment variable.
Using the XPMEM Driver for Single Copy Optimization
MPI takes advantage
of the XPMEM driver to support single copy message transfers between two
processes within the same host or across partitions.
Enabling single copy transfers may result in better performance,
since this technique improves MPI's bandwidth. However, single copy transfers
may introduce additional synchronization points, which can reduce application
performance in some cases.
The threshold for message lengths beyond which MPI attempts to use
this single copy method is specified by the MPI_BUFFER_MAX
shell variable. Its value should be set to the message length in bytes
beyond which the single copy method should be tried. In general, a value
of 2000 or higher is beneficial for many applications.
During job startup, MPI uses the XPMEM driver (via the xpmem kernel
module) to map memory from one MPI process to another. The mapped areas
include the static (BSS) region, the private heap, the stack region, and
optionally the symmetric heap region of each process.
Memory mapping allows each process to directly access memory from
the address space of another process. This technique allows MPI to support
single copy transfers for contiguous data types from any of these mapped
regions. For these transfers, whether between processes residing on the
same host or across partitions, the data is copied using a bcopy
process. A bcopy process is also used to
transfer data between two different executable files on the same host
or two different executable files across partitions. For data residing
outside of a mapped region (a /dev/zero region, for
example), MPI uses a buffering technique to transfer the data.
Memory mapping is enabled by default. To disable it, set the
MPI_MEMMAP_OFF environment variable. Memory mapping must be
enabled to allow single-copy transfers, MPI-2 one-sided communication,
support for the SHMEM model, and certain collective optimizations.
Memory Placement and Policies
The MPI library takes advantage of NUMA
placement functions that are available. Usually, the default placement
is adequate. Under certain circumstances, however, you might want to modify
this default behavior. The easiest way to do this is by setting one or
more MPI placement shell variables. Several of the most commonly used
of these variables are discribed in the following sections. For a complete
listing of memory placement related shell variables, see the MPI(1) man page.
The MPI_DSM_CPULIST
shell variable allows you to manually select processors to use for an
MPI application. At times, specifying a list of processors on which to
run a job can be the best means to insure highly reproducible timings,
particularly when running on a dedicated system.
This setting is treated as a comma and/or hyphen delineated ordered
list that specifies a mapping of MPI processes to CPUs. If running across
multiple hosts, the per host components of the CPU list are delineated
by colons. Within hyphen delineated lists CPU striding may be specified
by placing "/#" after the list where "#" is the stride distance.
 | Note: This feature should not be used with MPI applications that use either
of the MPI-2 spawn related functions.
|
Examples of settings are as follows: | Value | | CPU Assignment
| | 8,16,32 | | Place three MPI processes on CPUs 8, 16, and 32.
| | 32,16,8 | | Place the MPI process rank zero on CPU 32, one on 16,
and two on CPU 8.
| | 8-15/2 | | Place the MPI processes 0 through 3 strided on CPUs 8,
10, 12, and 14
| | 8-15,32-39 | | Place the MPI processes 0 through 7 on CPUs 8 to 15. Place
the MPI processes 8 through 15 on CPUs 32 to 39.
| | 39-32,8-15 | | Place the MPI processes 0 through 7 on CPUs 39 to 32.
Place the MPI processes 8 through 15 on CPUs 8 to 15.
| | 8-15:16-23 | | Place the MPI processes 0 through 7 on the first host
on CPUs 8 through 15. Place MPI processes 8 through 15 on CPUs 16 to 23
on the second host.
|
Note that the process rank is the MPI_COMM_WORLD
rank. The interpretation of the CPU values specified in the
MPI_DSM_CPULIST depends on whether the MPI job is being run
within a cpuset. If the job is run outside of a cpuset, the CPUs specify
cpunum values beginning with 0 and up to the number of CPUs
in the system minus one. When running within a cpuset, the default behavior
is to interpret the CPU values as relative processor numbers within the
cpuset.
The number of processors specified should equal the number of MPI
processes that will be used to run the application. The number of colon
delineated parts of the list must equal the number of hosts used for the
MPI job. If an error occurs in processing the CPU list, the default placement
policy is used.
Use the MPI_DSM_DISTRIBUTE
shell variable to ensure that each MPI process will get a physical
CPU and memory on the node to which it was assigned.If this environment
variable is used without specifying an MPI_DSM_CPULIST
variable, it will cause MPI to assign MPI ranks starting at logical CPU
0 and incrementing until all ranks have been placed. Therefore, it is
recommended that this variable be used only if running within a cpumemset
or on a dedicated system.
The MPI_DSM_PPM shell
variable allows you to specify the number of MPI processes to be placed
on a node. Memory bandwidth intensive applications can benefit from placing
fewer MPI processes on each node of a distributed memory host. On SGI
Altix 3000 systems, setting MPI_DSM_PPM to 1 places
one MPI process on each node.
Setting the MPI_DSM_VERBOSE
shell variable directs MPI to display a synopsis of the NUMA
and host placement options being used at run time.
Using dplace for Memory Placement
The dplace tool offers
another means of specifying the placement of MPI processes within a distributed
memory host. The dplace tool and MPI interoperate to
allow MPI to better manage placement of certain shared memory data structures
when dplace is used to place the MPI job.
For instructions on how to use dplace with MPI,
see the dplace(1) man page and the
Linux Application Tuning Guide.
Tuning MPI/OpenMP Hybrid Codes
A hybrid MPI/OpenMP application is one in which each MPI process
itself is a parallel threaded program. These programs often exploit the
OpenMP paralllelism at the loop level while also implementing a higher
level parallel algorithm using MPI.
Many parallel applications perform better if the MPI processes and
the threads within them are pinned to particular processors for the duration
of their execution. For ccNUMA systems, this ensures that all local, non-shared
memory is allocated on the same memory node as the processor referencing
it. For all systems, it can ensure that some or all of the OpenMP threads
stay on processors that share a bus or perhaps a processor cache, which
can speed up thread synchronization.
MPT provides the omplace(1) command to help with
the placement of OpenMP threads within an MPI program. The
omplace command causes the threads in a hybrid MPI/OpenMP job
to be placed on unique CPUs within the containing cpuset. For example,
the threads in a 2-process MPI program with 2 threads per process
would be placed as follows: rank 0 thread 0 on CPU 0
rank 0 thread 1 on CPU 1
rank 1 thread 0 on CPU 2
rank 1 thread 1 on CPU 3 |
The CPU placement is performed by dynamically generating a
dplace(1) placement file and invoking dplace.
For detailed syntax and a number of examples, see the
omplace(1) man page. For more information on dplace
, see the dplace(1) man page. For information
on using cpusets, see the Linux Resource Administration Guide
. For more information on using dplace,
see the Linux Application Tuning Guide.
Example 7-1. How to Run a Hybrid MPI/OpenMP Application
Here is an example of how to run a hybrid MPI/OpenMP application
with eight MPI processes that are two-way threaded on two hosts: mpirun host1,host2 -np 4 omplace -nt 2 ./a.out |
When using the PBS batch scheduler to schedule the a hybrid MPI/OpenMP
job as shown in Example 7-1, use the following resource
allocation specification:
And use the following mpiexec command with the
above example: mpiexec -n 8 omplace -nt 2 ./a.out |
For more information about running MPT programs with PBS, see“Running MPI Jobs with a Work Load Manager” in Chapter 3 .
Tuning for Running Applications Across Multiple Hosts
When you are running an MPI
application across a cluster of hosts, there are additional run-time environment
settings and configurations that you can consider when trying to improve
application performance.
Systems can use the XPMEM interconnect to cluster hosts as partitioned
systems, or use the Voltaire InfiniBand interconnect or TCP/IP as the
multihost interconnect.
When launched as a distributed application, MPI probes for these
interconnects at job startup. For details of launching a distributed application,
see “Launching a Distributed Application” in Chapter 3. When a high performance interconnect
is detected, MPI attempts to use this interconnect if it is available
on every host being used by the MPI job. If the interconnect is not available
for use on every host, the library attempts to use the next slower interconnect
until this connectivity requirement is met. Table 7-1
specifies the order in which MPI probes for available interconnects.
Table 7-1. Inquiry Order for Available Interconnects
Interconnect
| Default Order of Selection
| Environment Variable to Require
Use
|
|---|
XPMEM
| 1
| MPI_USE_XPMEM
| InfiniBand
| 2
| MPI_USE_IB
| TCP/IP
| 3
| MPI_USE_TCP
|
The third column of Table 7-1 also indicates
the environment variable you can set to pick a particular interconnect
other than the default.
In general, to insure the best performance of the application, you
should allow MPI to pick the fastest available interconnect.
In addition to the choice of interconnect, you should know that
multihost jobs may use different buffers from those used by jobs run on
a single host. In the SGI implementation of MPI, the XPMEM interconnect
uses the “per proc” buffers while the InfiniBand and TCP interconnects
use the “per host” buffers. The default setting for the number
of buffers per proc or per host might be too low for many applications.
You can determine whether this setting is too low by using the MPI statistics
described earlier in this section.
When using the TCP/IP interconnect, unless specified otherwise,
MPI uses the default IP adapter for each host. To use a nondefault adapter,
enter the adapter-specific host name on the mpirun
command line.
When using the InfiniBand interconnect, MPT applications may not
execute a fork() or system() call.
The InfiniBand driver produces undefined results when an MPT process using
InfiniBand forks.
Requires the MPI library to use the InfiniBand driver as the interconnect
when running across multiple hosts or running with multiple binaries.
MPT requires the ibhost software stack from Voltaire
when the InfiniBand interconnect is used. If InfiniBand is used, the
MPI_COREDUMP environment variable is forced to INHIBIT, to comply
with the InfiniBand driver restriction that no fork()s may occur after
InfiniBand resources have been allocated. Default: Not set
When this is set to 1 and the MPI library uses the InfiniBand driver
as the inter-host interconnect, MPT will send its InfiniBand traffic over
the first fabric that it detects. If this is set to 2, the library will
try to make use of multiple available separate InfiniBand fabrics and
split its traffic across them. If the separate InfiniBand fabrics do not
have unique subnet IDs, then the rail-config utility
is required. It must be run by the system administrator to enable the
library to correctly use the separate fabrics. Default: 1 on all SGI Altix
systems.
MPI_IB_SINGLE_COPY_BUFFER_MAX
When MPI transfers data over InfiniBand, if the size of the cumulative
data is greater than this value then MPI will attempt to send the data
directly between the processes's buffers and not through intermediate
buffers inside the MPI library. Default: 32767
For more information on these environment variables, see the “ENVIRONMENT
VARIABLES” section of the mpi(1)
man page.
Tuning for Running Applications over the InfiniBand Interconnect
When running an
MPI application across a cluster of hosts using the InfiniBand interconnect,
there are additional run-time environmental settings that you can consider
to improve application performance, as follows:
Controls the number of other ranks that a rank can receive from
over InfiniBand using a short message fast path. This is 8 by default
and can be any value between 0 and 32.
For zero-copy sends over the InfiniBand interconnect, MPT keeps
a cache of application data buffers registered for these transfers. This
environmental variable controls the size of the cache. It is 8 by default
and can be any value between 0 and 32. If the application rarely reuses
data buffers, it may make sense to set this value to 0 to avoid cache
trashing.
MPI_CONNECTIONS_THRESHOLD
For very large MPI jobs, the time and resource cost to create a
connection between every pair of ranks at job start time may be prodigious.
When the number of ranks is at least this value, the MPI library will
create InfiniBand connections lazily on a demand basis. The default is
2048 ranks.
When the MPI library is in lazy connection mode (see “MPI_CONNECTIONS_THRESHOLD”),
it normally sends connection control messages directly between the involved
ranks. This may create excessive load on the administration links between
the involved hosts. When the number of ranks is at least the value of
this variable, the MPI library will forward control messages through
mpirun to avoid this problem. The default is 8192 ranks.
When the MPI library uses the InfiniBand fabric, it allocates some
amount of memory for each message header that it uses for InfiniBand.
If the size of data to be sent is not greater than this amount minus
64 bytes for the actual header, the data is inlined with the header. If
the size is greater than this value, then the message is sent through
remote direct memory access (RDMA) operations. The default is 16384 bytes.
When an InfiniBand card sends a packet, it waits some amount of
time for an ACK packet to be returned by the receiving
InfiniBand card. If it does not receive one, it sends the
packet again. This variable controls that wait period. The time spent
is equal to 4 * 2 ^ MPI_IB_TIMEOUT microseconds. By
default, the variable is set to 18.
When the MPI library uses InfiniBand and this variable is set, and
an InfiniBand transmission error occurs, MPT will try to restart the connection
to the other rank. It will handle a number of errors of this type between
any pair of ranks equal to the value of this variable. By default, the
variable is set to 4.
MPI software from SGI can
internally use the XPMEM kernel module to provide direct access to data
on remote partitions and to provide single copy operations to local data.
Any pages used by these operations are prevented from paging by the XPMEM
kernel module. If an administrator needs to temporarily suspend a MPI
application to allow other applications to run, they can unpin these pages
so they can be swapped out and made available for other applications.
Each process of a MPI application which is using the XPMEM kernel
module will have a /proc/xpmem/pid
file associated with it. The number of pages owned by this process which
are prevented from paging by XPMEM can be displayed by concatenating the
/proc/xpmem/pid file, for example: # cat /proc/xpmem/5562
pages pinned by XPMEM: 17 |
To unpin the pages for use
by other processes, the administrator must first suspend all the processes
in the application. The pages can then be unpinned by echoing any value
into the /proc/xpmem/pid
file, for example: # echo 1 > /proc/xpmem/5562 |
The echo command will not return until that process's pages are
unpinned.
When the MPI application is resumed, the XPMEM
kernel module will prevent these pages from paging as they are referenced
by the application.
Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-012 / published: 2009-10-22)
table of contents | additional info | download
Front Matter
New Features in This Manual
About This Manual
Chapter 1. Introduction
Chapter 2. Administrating MPT
Chapter 3. Getting Started
Chapter 4. Programming with SGI MPI
Chapter 5. Debugging MPI Applications
Chapter 6. Profiling MPI Applications
Chapter 7. Run-time Tuning
Chapter 8. MPI Performance Profiling
Chapter 9. Troubleshooting and Frequently Asked Questions
Index
home/search |
what's new |
help
|