Linux » Books » Developer »
Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-019 / published: 2011-11-15)
table of contents | additional info | download find in page
Chapter 8. Run-time Tuning
This chapter discusses ways in which the user can tune the run-time
environment to improve the performance of an MPI message passing application
on SGI computers. None of these ways involve application code changes.
This chapter covers the following topics:
Reducing Run-time Variability
One of the most common problems with
optimizing message passing codes on large shared memory computers is achieving
reproducible timings from run to run. To reduce run-time variability,
you can take the following precautions: Do not oversubscribe the system. In other words, do not
request more CPUs than are available and do not request more memory than
is available. Oversubscribing causes the system to wait unnecessarily
for resources to become available and leads to variations in the results
and less than optimal performance.
Avoid interference from other system activity. The Linux
kernel uses more memory on node 0 than on other nodes (node 0 is called
the kernel node in the following discussion). If your application uses
almost all of the available memory per processor, the memory for processes
assigned to the kernel node can unintentionally spill over to nonlocal
memory. By keeping user applications off the kernel node, you can avoid
this effect.
Additionally, by restricting system daemons to run on the kernel
node, you can also deliver an additional percentage of each application
CPU to the user.
Avoid interference with other applications. You can use
cpusets to address this problem also. You can use cpusets to effectively
partition a large, distributed memory host in a fashion that minimizes
interactions between jobs running concurrently on the system. See the Linux Resource Administration Guide
for information about cpusets.
On a quiet, dedicated system, you can use dplace
or the MPI_DSM_CPULIST shell variable to
improve run-time performance repeatability. These approaches are not as
suitable for shared, nondedicated systems.
Use a batch scheduler; for example, Platform LSF from
Platform Computing Corporation or PBS Professional from Altair Engineering,
Inc. These batch schedulers use cpusets to avoid oversubscribing the system
and possible interference between applications.
Tuning MPI Buffer Resources
By default, the SGI MPI implementation buffers
messages whose lengths exceed 64 bytes. Longer messages are buffered in
a shared memory region to allow for exchange of data between MPI processes.
In the SGI MPI implementation, these buffers are divided into two basic
pools. For messages exchanged between MPI processes within the
same host or between partitioned systems when using the XPMEM driver or
when there are more than MPI_BUFS_THRESHOLD hosts,
buffers from the ”per process” pool (called the “per
proc” pool) are used. Each MPI process is allocated a fixed portion
of this pool when the application is launched. Each of these portions
is logically partitioned into 16-KB buffers.
For MPI jobs running across multiple hosts, a second pool
of shared memory is available. Messages exchanged between MPI processes
on different hosts use this pool of shared memory, called the “per
host” pool. The structure of this pool is somewhat more complex
than the “per proc” pool.
For an MPI job running on a single host, messages that exceed 64
bytes are handled as follows. For messages with a length of 128k or less,
the sender MPI process buffers the entire message. It then delivers a
message header (also called a control message) to a mailbox, which is
polled by the MPI receiver when an MPI call is made. Upon finding a matching
receive request for the sender's control message, the receiver copies
the data out of the shared memory buffer into the application buffer indicated
in the receive request. The receiver then sends a message header back
to the sender process, indicating that the shared memory buffer is available
for reuse. Messages whose length exceeds 128k are broken down into 128k
chunks, allowing the sender and receiver to overlap the copying of data
to and from shared memory in a pipeline fashion.
Because there is a finite number of these shared memory buffers,
this can be a constraint on the overall application performance for certain
communication patterns. You can use the MPI_BUFS_PER_PROC
shell variable to adjust the number of buffers available for the “per
proc” pool. Similarly, you can use the MPI_BUFS_PER_HOST
shell variable to adjust the “per host” pool. You
can use the MPI statistics counters to determine if retries for these
shared memory buffers are occurring.
For information on the use of these counters, see “MPI Internal Statistics” in Chapter 9.
In general, you can avoid excessive numbers of retries for buffers by
increasing the number of buffers for the “per proc” pool or “per
host” pool. However, you should keep in mind that increasing the
number of buffers does consume more memory. Also, increasing the number
of “per proc” buffers does potentially increase the probability
for cache pollution (that is, the excessive filling of the cache with
message buffers). Cache pollution can result in degraded performance during
the compute phase of a message passing application.
There are additional buffering considerations to take into account
when running an MPI job across multiple hosts. For further discussion
of multihost runs, see “Tuning for Running Applications Across Multiple Hosts”.
For further discussion on programming implications concerning message
buffering, see “Buffering” in Chapter 4.
Avoiding Message Buffering - Enabling Single Copy
For
message transfers between MPI processes within the same host or transfers
between partitions, it is possible under certain conditions to avoid the
need to buffer messages. Because many MPI applications are written assuming
infinite buffering, the use of this unbuffered approach is not enabled
by default for MPI_Send. This section describes how
to activate this mechanism by default for MPI_Send.
For MPI_Isend, MPI_Sendrecv,
MPI_Alltoall, MPI_Bcast, MPI_Allreduce
, and MPI_Reduce, this optimization is enabled
by default for large message sizes. To disable this default single copy
feature used for the collectives, use the MPI_DEFAULT_SINGLE_COPY_OFF
environment variable.
Using the XPMEM Driver for Single Copy Optimization
MPI takes advantage
of the XPMEM driver to support single copy message transfers between two
processes within the same host or across partitions.
Enabling single copy transfers may result in better performance,
since this technique improves MPI's bandwidth. However, single copy transfers
may introduce additional synchronization points, which can reduce application
performance in some cases.
The threshold for message lengths beyond which MPI attempts to use
this single copy method is specified by the MPI_BUFFER_MAX
shell variable. Its value should be set to the message length in bytes
beyond which the single copy method should be tried. In general, a value
of 2000 or higher is beneficial for many applications.
During job startup, MPI uses the XPMEM driver (via the xpmem kernel
module) to map memory from one MPI process to another. The mapped areas
include the static (BSS) region, the private heap, the stack region, and
optionally the symmetric heap region of each process.
Memory mapping allows each process to directly access memory from
the address space of another process. This technique allows MPI to support
single copy transfers for contiguous data types from any of these mapped
regions. For these transfers, whether between processes residing on the
same host or across partitions, the data is copied using a bcopy
process. A bcopy process is also used to
transfer data between two different executable files on the same host
or two different executable files across partitions. For data residing
outside of a mapped region (a /dev/zero region, for
example), MPI uses a buffering technique to transfer the data.
Memory mapping is enabled by default. To disable it, set the
MPI_MEMMAP_OFF environment variable. Memory mapping must be
enabled to allow single-copy transfers, MPI-2 one-sided communication,
support for the SHMEM model, and certain collective optimizations.
Memory Placement and Policies
The MPI library takes advantage of NUMA
placement functions that are available. Usually, the default placement
is adequate. Under certain circumstances, however, you might want to modify
this default behavior. The easiest way to do this is by setting one or
more MPI placement shell variables. Several of the most commonly used
of these variables are discribed in the following sections. For a complete
listing of memory placement related shell variables, see the MPI(1) man page.
The MPI_DSM_CPULIST
shell variable allows you to manually select processors to use for an
MPI application. At times, specifying a list of processors on which to
run a job can be the best means to insure highly reproducible timings,
particularly when running on a dedicated system.
This setting is treated as a comma and/or hyphen delineated ordered
list that specifies a mapping of MPI processes to CPUs. If running across
multiple hosts, the per host components of the CPU list are delineated
by colons. Within hyphen delineated lists CPU striding may be specified
by placing "/#" after the list where "#" is the stride distance.
 | Note: This feature should not be used with MPI applications that use either
of the MPI-2 spawn related functions.
|
Examples of settings are as follows: | Value | | CPU Assignment
| | 8,16,32 | | Place three MPI processes on CPUs 8, 16, and 32.
| | 32,16,8 | | Place the MPI process rank zero on CPU 32, one on 16,
and two on CPU 8.
| | 8-15/2 | | Place the MPI processes 0 through 3 strided on CPUs 8,
10, 12, and 14
| | 8-15,32-39 | | Place the MPI processes 0 through 7 on CPUs 8 to 15. Place
the MPI processes 8 through 15 on CPUs 32 to 39.
| | 39-32,8-15 | | Place the MPI processes 0 through 7 on CPUs 39 to 32.
Place the MPI processes 8 through 15 on CPUs 8 to 15.
| | 8-15:16-23 | | Place the MPI processes 0 through 7 on the first host
on CPUs 8 through 15. Place MPI processes 8 through 15 on CPUs 16 to 23
on the second host.
|
Note that the process rank is the MPI_COMM_WORLD
rank. The interpretation of the CPU values specified in the
MPI_DSM_CPULIST depends on whether the MPI job is being run
within a cpuset. If the job is run outside of a cpuset, the CPUs specify
cpunum values beginning with 0 and up to the number of CPUs
in the system minus one. When running within a cpuset, the default behavior
is to interpret the CPU values as relative processor numbers within the
cpuset.
The number of processors specified should equal the number of MPI
processes that will be used to run the application. The number of colon
delineated parts of the list must equal the number of hosts used for the
MPI job. If an error occurs in processing the CPU list, the default placement
policy is used.
Use the MPI_DSM_DISTRIBUTE
shell variable to ensure that each MPI process will get a physical
CPU and memory on the node to which it was assigned. If this environment
variable is used without specifying an MPI_DSM_CPULIST
variable, it will cause MPI to assign MPI ranks starting at logical CPU
0 and incrementing until all ranks have been placed. Therefore, it is
recommended that this variable be used only if running within a cpuset
on a dedicated system.
Setting the MPI_DSM_VERBOSE
shell variable directs MPI to display a synopsis of the NUMA
and host placement options being used at run time.
Using dplace for Memory Placement
The dplace tool offers
another means of specifying the placement of MPI processes within a distributed
memory host. The dplace tool and MPI interoperate to
allow MPI to better manage placement of certain shared memory data structures
when dplace is used to place the MPI job.
For instructions on how to use dplace with MPI,
see the dplace(1) man page and the
Linux Application Tuning Guide.
Tuning MPI/OpenMP Hybrid Codes
A hybrid MPI/OpenMP application is one in which each MPI process
itself is a parallel threaded program. These programs often exploit the
OpenMP paralllelism at the loop level while also implementing a higher
level parallel algorithm using MPI.
Many parallel applications perform better if the MPI processes and
the threads within them are pinned to particular processors for the duration
of their execution. For ccNUMA systems, this ensures that all local, non-shared
memory is allocated on the same memory node as the processor referencing
it. For all systems, it can ensure that some or all of the OpenMP threads
stay on processors that share a bus or perhaps a processor cache, which
can speed up thread synchronization.
MPT provides the omplace(1) command to help with
the placement of OpenMP threads within an MPI program. The
omplace command causes the threads in a hybrid MPI/OpenMP job
to be placed on unique CPUs within the containing cpuset. For example,
the threads in a 2-process MPI program with 2 threads per process would
be placed as follows: rank 0 thread 0 on CPU 0
rank 0 thread 1 on CPU 1
rank 1 thread 0 on CPU 2
rank 1 thread 1 on CPU 3 |
The CPU placement is performed by dynamically generating a
dplace(1) placement file and invoking dplace.
For detailed syntax and a number of examples, see the
omplace(1) man page. For more information on dplace
, see the dplace(1) man page. For information
on using cpusets, see the Linux Resource Administration Guide
. For more information on using dplace,
see the Linux Application Tuning Guide.
Example 8-1. How to Run a Hybrid MPI/OpenMP Application
Here is an example of how to run a hybrid MPI/OpenMP application
with eight MPI processes that are two-way threaded on two hosts: mpirun host1,host2 -np 4 omplace -nt 2 ./a.out |
When using the PBS batch scheduler to schedule the a hybrid MPI/OpenMP
job as shown in Example 8-1, use the following resource
allocation specification:
And use the following mpiexec command with the
above example: mpiexec -n 8 omplace -nt 2 ./a.out |
For more information about running MPT programs with PBS, see“Running MPI Jobs with a Work Load Manager” in Chapter 3 .
Tuning for Running Applications Across Multiple Hosts
When you are running an MPI
application across a cluster of hosts, there are additional run-time environment
settings and configurations that you can consider when trying to improve
application performance.
Systems can use the XPMEM interconnect to cluster hosts as partitioned
systems, or use the Voltaire InfiniBand interconnect or TCP/IP as the
multihost interconnect.
When launched as a distributed application, MPI probes for these
interconnects at job startup. For details of launching a distributed application,
see “Launching a Distributed Application” in Chapter 3. When a high performance interconnect
is detected, MPI attempts to use this interconnect if it is available
on every host being used by the MPI job. If the interconnect is not available
for use on every host, the library attempts to use the next slower interconnect
until this connectivity requirement is met. Table 8-1
specifies the order in which MPI probes for available interconnects.
Table 8-1. Inquiry Order for Available Interconnects
Interconnect
| Default Order of Selection
| Environment Variable to Require
Use
|
|---|
XPMEM
| 1
| MPI_USE_XPMEM
| InfiniBand
| 2
| MPI_USE_IB
| TCP/IP
| 3
| MPI_USE_TCP
|
The third column of Table 8-1 also indicates
the environment variable you can set to pick a particular interconnect
other than the default.
In general, to insure the best performance of the application, you
should allow MPI to pick the fastest available interconnect.
In addition to the choice of interconnect, you should know that
multihost jobs may use different buffers from those used by jobs run on
a single host. In the SGI implementation of MPI, the XPMEM interconnect
uses the “per proc” buffers while the InfiniBand and TCP interconnects
use the “per host” buffers. The default setting for the number
of buffers per proc or per host might be too low for many applications.
You can determine whether this setting is too low by using the MPI statistics
described earlier in this section.
When using the TCP/IP interconnect, unless specified otherwise,
MPI uses the default IP adapter for each host. To use a nondefault adapter,
enter the adapter-specific host name on the mpirun
command line.
When using the InfiniBand interconnect, MPT applications may not
execute a fork() or system() call.
The InfiniBand driver produces undefined results when an MPT process using
InfiniBand forks.
Requires the MPI library to use the InfiniBand driver as the interconnect
when running across multiple hosts or running with multiple binaries.
MPT requires the ibhost software stack from Voltaire
when the InfiniBand interconnect is used. If InfiniBand is used, the
MPI_COREDUMP environment variable is forced to INHIBIT, to comply
with the InfiniBand driver restriction that no fork()s may occur after
InfiniBand resources have been allocated. Default: Not set
When this is set to 1 and the MPI library uses the InfiniBand driver
as the inter-host interconnect, MPT will send its InfiniBand traffic over
the first fabric that it detects. If this is set to 2, the library will
try to make use of multiple available separate InfiniBand fabrics and
split its traffic across them. If the separate InfiniBand fabrics do not
have unique subnet IDs, then the rail-config utility
is required. It must be run by the system administrator to enable the
library to correctly use the separate fabrics. Default: 1 on all SGI Altix
systems.
MPI_IB_SINGLE_COPY_BUFFER_MAX
When MPI transfers data over InfiniBand, if the size of the cumulative
data is greater than this value then MPI will attempt to send the data
directly between the processes's buffers and not through intermediate
buffers inside the MPI library. Default: 32767
For more information on these environment variables, see the “ENVIRONMENT
VARIABLES” section of the mpi(1)
man page.
Tuning for Running Applications over the InfiniBand Interconnect
When running an
MPI application across a cluster of hosts using the InfiniBand interconnect,
there are additional run-time environmental settings that you can consider
to improve application performance, as follows:
Controls the number of other ranks that a rank can receive from
over InfiniBand using a short message fast path. This is 8 by default
and can be any value between 0 and 32.
For zero-copy sends over the InfiniBand interconnect, MPT keeps
a cache of application data buffers registered for these transfers. This
environmental variable controls the size of the cache. It is 8 by default
and can be any value between 0 and 32. If the application rarely reuses
data buffers, it may make sense to set this value to 0 to avoid cache
trashing.
MPI_CONNECTIONS_THRESHOLD
For very large MPI jobs, the time and resource cost to create a
connection between every pair of ranks at job start time may be prodigious.
When the number of ranks is at least this value, the MPI library will
create InfiniBand connections lazily on a demand basis. The default is
2048 ranks.
When the MPI library uses the InfiniBand fabric, it allocates some
amount of memory for each message header that it uses for InfiniBand.
If the size of data to be sent is not greater than this amount minus
64 bytes for the actual header, the data is inlined with the header. If
the size is greater than this value, then the message is sent through
remote direct memory access (RDMA) operations. The default is 16384 bytes.
When an InfiniBand card sends a packet, it waits some amount of
time for an ACK packet to be returned by the receiving
InfiniBand card. If it does not receive one, it sends the packet again.
This variable controls that wait period. The time spent is equal to 4
* 2 ^ MPI_IB_TIMEOUT microseconds. By default, the
variable is set to 18.
When the MPI library uses InfiniBand and this variable is set, and
an InfiniBand transmission error occurs, MPT will try to restart the connection
to the other rank. It will handle a number of errors of this type between
any pair of ranks equal to the value of this variable. By default, the
variable is set to 4.
MPI on Altix UV 100 and Altix UV 1000 Systems
The SGI Altix UV 100 and Altix
UV 1000 series systems are scalable nonuniform memory access (NUMA) systems
that support a single Linux image of thousands of processors distributed
over many sockets and SGI Altix UV Hub application-specific integrated
circuits (ASICs). The UV Hub is the heart of the SGI Altix UV 1000 or
Altix UV 100 system compute blade. Each "processor" is a hyperthread on
a particular core within a particular socket. Each Altix UV Hub normally
connects to two sockets. All communication between the sockets and the
UV Hub uses Intel QuickPath Interconnect (QPI) channels. The Altix UV
Hub has four NUMAlink 5 ports that connect with the NUMAlink 5 interconnect
fabric. The UV Hub acts as a crossbar between the processors, local SDRAM
memory, and the network interface. The Hub ASIC enables any processor
in the single-system image (SSI) to access the memory of all processors
in the SSI. For more information on the SGI Altix UV hub, Altix UV compute
blades, QPI, and NUMAlink 5, see the SGI Altix UV 1000 System
User's Guide or the SGI Altix UV 100 System User's
Guide, respectively.
When MPI communicates between processes, two transfer methods are
possible on an Altix UV system:
MPI chooses the method depending on internal heuristics, the type
of MPI communication that is involved, and some user-tunable variables.
When using the GRU to transfer data and messages, the MPI library uses
the GRU resources it allocates via the GRU resource allocator, which divides
up the available GRU resources. It fairly allocates buffer space and control
blocks between the logical processors being used by the MPI job.
Running MPI jobs optimally
on Altix UV systems is not very difficult. It is best to pin MPI processes
to CPUs and isolate multiple MPI jobs onto different sets of sockets and
Hubs, and this is usually achieved by configuring a batch scheduler to
create a cpuset for every MPI job. MPI pins its processes to the sequential
list of logical processors within the containing cpuset by default, but
you can control and alter the pinning pattern using MPI_DSM_CPULIST
(see “MPI_DSM_CPULIST”),
omplace(1), and dplace(1).
The MPI library chooses buffer
sizes and communication algorithms in an attempt to deliver the best performance
automatically to a wide variety of MPI applications. However, applications
have different performance profiles and bottlenecks, and so user tuning
may be of help in improving performance. Here are some application performance
types and ways that MPI performance may be improved for them: Odd HyperThreads are idle.
Most high performance computing MPI programs run best using only
one HyperThread per core. When an Altix UV system has multiple HyperThreads
per core, logical CPUs are numbered such that odd HyperThreads are the
high half of the logical CPU numbers. Therefore, the task of scheduling
only on the even HyperThreads may be accomplished by scheduling MPI jobs
as if only half the full number exist, leaving the high logical CPUs idle.You
can use the cpumap(1) command to determine if cores
have multiple HyperThreads on your Altix UV system. The output tells the
number of physical and logical processors and if Hyperthreading
is ON or OFF
and how shared processors are paired (towards the bottom of the command's
output).
If an MPI job uses only half of the available logical CPUs, set
GRU_RESOURCE_FACTOR to 2 so that the MPI processes can utilize
all the available GRU resources on a Hub rather than reserving some of
them for the idle HyperThreads. For more information about GRU resource
tuning, see the gru_resource(3) man page.
MPI large message bandwidth is important.
Some programs transfer large messages via the MPI_Send
function. To switch on the use of unbuffered, single copy transport
in these cases you can set MPI_BUFFER_MAX to 0. See
the MPI(1) man page for more details.
MPI small or near messages are very frequent.
For small fabric hop counts, shared memory message delivery is faster
than GRU messages. To deliver all messages within an Altix UV host
via shared memory, set MPI_SHARED_NEIGHBORHOOD to "
host". See the MPI(1) man page for more
details.
Other ccNUMA Performance Issues
MPI application processes normally perform best
if their local memory is allocated on the socket assigned to execute it.
This cannot happen if memory on that socket is exhausted by the application
or by other system consumption, for example, file buffer cache. Use the
nodeinfo(1) command to view memory consumption on the nodes
assigned to your job and use bcfree(1) to clear out
excessive file buffer cache. PBS Professional batch scheduler installations
can be configured to issue bcfreecommands in the job
prologue. For more information, see PBS Professional documentation and
the bcfree(1) man page.
MPI software from SGI can
internally use the XPMEM kernel module to provide direct access to data
on remote partitions and to provide single copy operations to local data.
Any pages used by these operations are prevented from paging by the XPMEM
kernel module. If an administrator needs to temporarily suspend a MPI
application to allow other applications to run, they can unpin these pages
so they can be swapped out and made available for other applications.
Each process of a MPI application which is using the XPMEM kernel
module will have a /proc/xpmem/pid
file associated with it. The number of pages owned by this process which
are prevented from paging by XPMEM can be displayed by concatenating the
/proc/xpmem/pid file, for example: # cat /proc/xpmem/5562
pages pinned by XPMEM: 17 |
To unpin the pages for use
by other processes, the administrator must first suspend all the processes
in the application. The pages can then be unpinned by echoing any value
into the /proc/xpmem/pid
file, for example: # echo 1 > /proc/xpmem/5562 |
The echo command will not return until that process's pages are
unpinned.
When the MPI application is resumed, the XPMEM
kernel module will prevent these pages from paging as they are referenced
by the application.
Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-019 / published: 2011-11-15)
table of contents | additional info | download
Front Matter
New Features in This Manual
About This Manual
Chapter 1. Introduction
Chapter 2. Administrating MPT
Chapter 3. Getting Started
Chapter 4. Programming with SGI MPI
Chapter 5. Debugging MPI Applications
Chapter 6. PerfBoost
Chapter 7. Checkpoint/Restart
Chapter 8. Run-time Tuning
Chapter 9. MPI Performance Profiling
Chapter 10. Troubleshooting and Frequently Asked Questions
Index
home/search |
what's new |
help
|