Linux » Books » Developer »
Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-012 / published: 2009-10-22)
table of contents | additional info | download find in page
Chapter 4. Programming with SGI MPI
Portability is one of the main advantages MPI has over vendor-specific
message passing software. Nonetheless, the MPI Standard offers sufficient
flexibility for general variations in vendor implementations. In addition,
there are often vendor specific programming recommendations for optimal
use of the MPI library. This chapter addresses topics that are of interest
to those developing or porting MPI applications to SGI systems. It covers
the following topics:
Job Termination and Error Handling
This section describes the behavior of the SGI MPI implementation
upon normal job termination. Error handling and characteristics of abnormal
job termination are also described.
In the SGI MPI implementation, a call to MPI_Abort
causes the termination of the entire MPI job, regardless of the communicator
argument used. The error code value is returned as the exit status of
the mpirun command. A stack traceback is displayed
that shows where the program called MPI_Abort.
Section 7.2 of the MPI Standard describes MPI error handling. Although
almost all MPI functions return an error status, an error handler is invoked
before returning from the function. If the function has an associated
communicator, the error handler associated with that communicator is invoked.
Otherwise, the error handler associated with MPI_COMM_WORLD
is invoked.
The SGI MPI implementation provides the following predefined error
handlers: MPI_ERRORS_ARE_FATAL. The handler,
when called, causes the program to abort on all executing processes. This
has the same effect as if MPI_Abort were called by
the process that invoked the handler.
MPI_ERRORS_RETURN. The handler has
no effect.
By default, the MPI_ERRORS_ARE_FATAL error handler
is associated with MPI_COMM_WORLD and any communicators
derived from it. Hence, to handle the error statuses returned from MPI
calls, it is necessary to associate either the MPI_ERRORS_RETURN
handler or another user defined handler with MPI_COMM_WORLD
near the beginning of the application.
MPI_Finalize and Connect Processes
In the SGI implementation of MPI, all pending communications involving
an MPI process must be complete before the process calls MPI_Finalize
. If there are any pending send or
recv requests that are unmatched or not completed, the application
will hang in MPI_Finalize. For more details, see section
7.5 of the MPI Standard.
If the application uses the MPI-2 spawn functionality described
in Chapter 5 of the MPI-2 Standard, there are additional considerations.
In the SGI implementation, all MPI processes are connected. Section 5.5.4
of the MPI-2 Standard defines what is meant by connected processes. When
the MPI-2 spawn functionality is used, MPI_Finalize
is collective over all connected processes. Thus all MPI processes, both
launched on the command line, or subsequently spawned, synchronize in
MPI_Finalize.
In the SGI implementation, MPI processes are UNIX processes. As
such, the general rule regarding handling of signals applies as it would
to ordinary UNIX processes.
In addition, the SIGURG and SIGUSR1
signals can be propagated from the mpirun
process to the other processes in the MPI job, whether they belong to
the same process group on a single host, or are running across multiple
hosts in a cluster. To make use of this feature, the MPI program must
have a signal handler that catches SIGURG or
SIGUSR1. When the SIGURG or SIGUSR1
signals are sent to the mpirun process ID,
the mpirun process catches the signal and propagates
it to all MPI processes.
Most MPI implementations use buffering for overall performance reasons
and some programs depend on it. However, you should not assume that there
is any message buffering between processes because the MPI Standard does
not mandate a buffering strategy. Table 4-1 illustrates
a simple sequence of MPI operations that cannot work unless messages
are buffered. If sent messages were not buffered, each process would hang
in the initial call, waiting for an MPI_Recv call
to take the message.
Because most MPI implementations do buffer messages to some degree,
a program like this does not usually hang. The MPI_Send
calls return after putting the messages into buffer space, and the
MPI_Recv calls get the messages. Nevertheless, program logic
like this is not valid by the MPI Standard. Programs that require this
sequence of MPI calls should employ one of the buffer MPI send calls,
MPI_Bsend or MPI_Ibsend. Table 4-1. Outline of Improper Dependence on Buffering
Process 1
| Process 2
|
|---|
MPI_Send(2,....)
| MPI_Send(1,....)
| MPI_Recv(2,....)
| MPI_Recv(1,....)
|
By default, the SGI implementation of MPI uses buffering under most
circumstances. Short messages (64 or fewer bytes) are always buffered.
Longer messages are also buffered, although under certain circumstances
buffering can be avoided. For performance reasons, it is sometimes desirable
to avoid buffering. For further information on unbuffered message delivery,
see “Programming Optimizations”.
Multithreaded Programming
SGI MPI supports hybrid programming models, in which MPI is used
to handle one level of parallelism in an application, while POSIX threads
or OpenMP processes are used to handle another level. When mixing OpenMP
with MPI, for performance reasons it is better to consider invoking MPI
functions only outside parallel regions, or only from within master regions.
When used in this manner, it is not necessary to initialize MPI for thread
safety. You can use MPI_Init to initialize MPI. However,
to safely invoke MPI functions from any OpenMP process or when using Posix
threads, MPI must be initialized with MPI_Init_thread.
When using MPI_Thread_init() with the threading
level MPI_THREAD_MULTIPLE, link your program with
-lmpi_mt instead of -lmpi. See the
mpi(1) man page for more information about compiling and linking
MPI programs.
Interoperability with the SHMEM programming model
You can mix SHMEM and MPI message passing in the same program. The
application must be linked with both the SHMEM and MPI libraries. Start
with an MPI program that calls MPI_Init and
MPI_Finalize.
When you add SHMEM calls, the PE numbers are equal to the MPI rank
numbers in MPI_COMM_WORLD. Do not call start_pes()
in a mixed MPI and SHMEM program.
When running the application across a cluster, some MPI processes
may not be able to communicate with certain other MPI processes when using
SHMEM functions. You can use the shmem_pe_accessible
and shmem_addr_accessible functions to determine whether
a SHMEM call can be used to access data residing in another MPI process.
Because the SHMEM model functions only with respect to MPI_COMM_WORLD
, these functions cannot be used to exchange data between MPI
processes that are connected via MPI intercommunicators returned from
MPI-2 spawn related functions.
SHMEM get and put functions are thread safe. SHMEM collective and
synchronization functions are not thread safe unless different threads
use different pSync and pWork arrays.
For more information about the SHMEM programming model, see the
intro_shmem man page.
Miscellaneous Features of SGI MPI
This section describes other characteristics of the SGI MPI implementation
that might be of interest to application developers.
In this implementation, stdin is enabled for
only those MPI processes with rank 0 in the first MPI_COMM_WORLD
(which does not need to be located on the same host as
mpirun). stdout and stderr
results are enabled for all MPI processes in the job, whether launched
via mpirun, or via one of the MPI-2 spawn functions.
The MPI_Get_processor_name function returns the
Internet host name of the computer on which the MPI process invoking this
subroutine is running.
Programming Optimizations
This section describes ways in which the MPI application developer
can best make use of optimized features of SGI's MPI implementation. Following
recommendations in this section might require modifications to your MPI
application.
Using MPI Point-to-Point Communication Routines
MPI provides for a number of different routines for point-to-point
communication. The most efficient ones in terms of latency and bandwidth
are the blocking and nonblocking send/receive
functions (MPI_Send, MPI_Isend
, MPI_Recv, and MPI_Irecv).
Unless required for application semantics, the synchronous send
calls (MPI_Ssend and MPI_Issend)
should be avoided. The buffered send calls (MPI_Bsend
and MPI_Ibsend) should also usually be avoided as these
double the amount of memory copying on the sender side. The ready send
routines (MPI_Rsend and MPI_Irsend)
are treated as standard MPI_Send and MPI_Isend
in this implementation. Persistent requests do not offer any
performance advantage over standard requests in this implementation.
Using MPI Collective Communication Routines
The MPI collective calls are frequently layered on top of the point-to-point
primitive calls. For small process counts, this can be reasonably effective.
However, for higher process counts (32 processes or more) or for clusters,
this approach can become less efficient. For this reason, a number of
the MPI library collective operations have been optimized to use more
complex algorithms.
Some collectives have been optimized for use with clusters. In these
cases, steps are taken to reduce the number of messages using the relatively
slower interconnect between hosts.
Two of the collective operations have been optimized for use with
shared memory. The barrier operation has also been optimized to use hardware
fetch operations (fetchops). The MPI_Alltoall
routines also use special techniques to avoid message buffering
when using shared memory. For more details, see “Avoiding Message Buffering -- Single Copy Methods”.
Table 4-2, lists the MPI collective routines optimized
in this implementation. Table 4-2. Optimized MPI Collectives
Routine
| Optimized for Clusters
| Optimized for Shared Memory
|
|---|
MPI_Alltoall
| Yes
| Yes
| MPI_Barrier
| Yes
| Yes
| MPI_Allreduce
| Yes
| No
| MPI_Bcast
| Yes
| No
|
 | Note: These collectives are optimized across partitions by using
the XPMEM driver which is explained in Chapter 7, “Run-time Tuning”.
These collectives (except MPI_Barrier) will try to
use single-copy by default for large transfers unless MPI_DEFAULT_SINGLE_COPY_OFF
is specified.
|
Using MPI_Pack/MPI_Unpack
While MPI_Pack and MPI_Unpack
are useful for porting PVM codes to MPI, they essentially double the amount
of data to be copied by both the sender and receiver. It is generally
best to avoid the use of these functions by either restructuring your
data or using derived data types. Note, however, that use of derived data
types may lead to decreased performance in certain cases.
Avoiding Derived Data Types
In general, you should avoid derived data types when possible. In
the SGI implementation, use of derived data types does not generally lead
to performance gains. Use of derived data types might disable certain
types of optimizations (for example, unbuffered or single copy data transfer).
The use of wild cards (MPI_ANY_SOURCE,
MPI_ANY_TAG) involves searching multiple queues for messages.
While this is not significant for small process counts, for large process
counts the cost increases quickly.
Avoiding Message Buffering -- Single Copy Methods
One of the most significant optimizations for bandwidth sensitive
applications in the MPI library is single copy optimization, avoiding
the use of shared memory buffers. However, as discussed in “Buffering”,
some incorrectly coded applications might hang because of buffering assumptions.
For this reason, this optimization is not enabled by default for
MPI_send, but can be turned on by the user at run time by using
the MPI_BUFFER_MAX environment variable. The following
steps can be taken by the application developer to increase the opportunities
for use of this unbuffered pathway: The MPI data type on the send side must be a contiguous
type.
The sender and receiver MPI processes must reside on the
same host or, in the case of a partitioned system, the processes may reside
on any of the partitions.
The sender data must be globally accessible by the receiver.
The SGI MPI implementation allows data allocated from the static region
(common blocks), the private heap, and the stack region to be globally
accessible. In addition, memory allocated via the MPI_Alloc_mem
function or the SHMEM symmetric heap accessed via the
shpalloc or shmalloc functions is globally
accessible.
Certain run-time environment variables must be set to enable the
unbuffered, single copy method. For more details on how to set the run-time
environment, see “Avoiding Message Buffering - Enabling Single Copy” in Chapter 7.
 | Note: With the Intel 7.1 compiler, ALLOCATABLE arrays are not eligible
for single copy, since they do not reside in a globally accessible memory
region. This restriction does not apply when using the Intel 8.0/8.1 compilers.
|
Managing Memory Placement
SGI systems have a ccNUMA memory architecture. For single process
and small multiprocess applications, this architecture behaves similarly
to flat memory architectures. For more highly parallel applications, memory
placement becomes important. MPI takes placement into consideration when
laying out shared memory data structures, and the individual MPI processes'
address spaces. In general, it is not recommended that the application
programmer try to manage memory placement explicitly. There are a number
of means to control the placement of the application at run time, however.
For more information, see Chapter 7, “Run-time Tuning”.
Using Global Shared Memory
The MPT software includes the Global Shared Memory (GSM) Feature.
This feature allows users to allocate globally accessible shared memory
from within an MPI or SHMEM program. The GSM feature can be used to provide
shared memory access across partitioned Altix systems and additional memory
placement options within a single host configuration.
User-callable functions are provided to allocate a global shared
memory segment, free that segment, and provide information about the segment.
Once allocated, the application can use this new global shared memory
segment via standard loads and stores, just as if it were a System V shared
memory segment. For more information, see the GSM_Intro
or GSM_Alloc man pages.
Additional Programming Model Considerations
A number of additional programming options might be worth consideration
when developing MPI applications for SGI systems. For example, the SHMEM
programming model can provide a means to improve the performance of latency-sensitive
sections of an application. Usually, this requires replacing MPI
send/recv calls with shmem_put/
shmem_get and shmem_barrier calls. The SHMEM
programming model can deliver significantly lower latencies for short
messages than traditional MPI calls. As an alternative to shmem_get
/shmem_put calls, you might consider the
MPI-2 MPI_Put/ MPI_Get functions.
These provide almost the same performance as the SHMEM calls, while providing
a greater degree of portability.
Alternately, you might consider exploiting the shared memory architecture
of SGI systems by handling one or more levels of parallelism with OpenMP,
with the coarser grained levels of parallelism being handled by MPI. Also,
there are special ccNUMA placement considerations to be aware of when
running hybrid MPI/OpenMP applications. For further information, see Chapter 7, “Run-time Tuning”.
Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-012 / published: 2009-10-22)
table of contents | additional info | download
Front Matter
New Features in This Manual
About This Manual
Chapter 1. Introduction
Chapter 2. Administrating MPT
Chapter 3. Getting Started
Chapter 4. Programming with SGI MPI
Chapter 5. Debugging MPI Applications
Chapter 6. Profiling MPI Applications
Chapter 7. Run-time Tuning
Chapter 8. MPI Performance Profiling
Chapter 9. Troubleshooting and Frequently Asked Questions
Index
home/search |
what's new |
help
|