Message Passing Toolkit (MPT) User's Guide
(document number: 007-3773-012 / published: 2009-10-22)
Chapter 9. Troubleshooting and Frequently Asked Questions
This chapter
addresses common problems that users encounter when starting
to use SGI MPI and answers other frequently asked questions.
It covers the following topics:
What are some things I can try to figure out why mpirun
is failing?
Here are some things to investigate:
Look in /var/log/messages for any
suspicious errors or warnings. For example, if your application tries
to pull in a library that it cannot find, a message should appear here.
Only the root user can view this file.
Be sure that you did not misspell the name of your application.
To find dynamic link errors, try to run your program without
mpirun. You will get the “mpirun must be used
to launch all MPI applications” message, along with any dynamic
link errors that might not be displayed when the program is started with
mpirun.
As a last resort, setting the environment variable LD_DEBUG
to all will display a set of messages for
each symbol that rld resolves. This produces a lot
of output, but should help you find the cause of the link error.
Be sure that you are setting your remote directory properly.
By default, mpirun attempts to place your processes
on all machines into the directory that has the same name as
$PWD. This should be the common case, but sometimes different
functionality is required. For more information, see the section on
$MPI_DIR and/or the -dir option in the
mpirun man page.
If you are using a relative pathname for your application,
be sure that it appears in $PATH. In particular,
mpirun will not look in '.' for your application unless '.'
appears in $PATH.
Run /usr/sbin/ascheck to verify that
your array is configured correctly.
Use the mpirun -verbose
option to verify that you are running the version of MPI that you think
you are running.
Be very careful when setting MPI environment variables
from within your .cshrc or .login
files. Because MPI creates the equivalent of a fresh login session for
every job, these settings override any that you later set from within
your shell. The safe way to set things up
is to test for the existence of $MPI_ENVIRONMENT in
your scripts and set the other MPI environment variables only if it is
undefined.
If you are running under a Kerberos environment, you may
experience unpredictable results because currently, mpirun
is unable to pass tokens. For example, in some cases, if you use
telnet to connect to a host and then try to run mpirun
on that host, it fails. But if you instead use rsh
to connect to the host, mpirun succeeds.
(This might be because telnet is kerberized but
rsh is not.) If you are running in such an environment,
ask your local administrators about the
proper way to launch MPI jobs.
Look in /tmp/.arraysvcs on all machines
you are using. In some cases, you might find an errlog
file that may be helpful.
You can increase the verbosity of the Array Services daemon
(arrayd) using the -v option to
generate more debugging information. For more information, see the
arrayd(8) man page.
Check error messages in /var/run/arraysvcs.
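The advice above about guarding MPI settings in shell startup files can be sketched as follows. This is a minimal sketch in Bourne shell syntax, assuming a startup file such as ~/.profile. The function name set_mpi_defaults and the value 32768 are illustrative only and do not come from this guide. In a .cshrc or .login file, the equivalent csh test is if (! $?MPI_ENVIRONMENT).

```shell
# Hypothetical helper for a Bourne-style startup file (for example ~/.profile).
# It sets MPI defaults only when MPI_ENVIRONMENT is undefined, that is, when
# the file is not being read as part of an MPI job's login session.
set_mpi_defaults() {
    # ${VAR+set} expands to "set" if VAR is defined (even if empty)
    # and to nothing if VAR is undefined.
    if [ -z "${MPI_ENVIRONMENT+set}" ]; then
        MPI_REQUEST_MAX=32768    # illustrative value, not a recommendation
        export MPI_REQUEST_MAX
    fi
}

set_mpi_defaults
```

With a guard like this, a value that you export in your interactive shell before calling mpirun should not be overwritten when the job's login session reads the startup file.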
My code runs correctly until it reaches MPI_Finalize()
and then it hangs.
This is almost
always caused by send or recv requests
that are either unmatched or not completed. An unmatched request is any
blocking send for which a corresponding recv
is never posted. An incomplete request is any nonblocking
send or recv request that was never freed
by a call to MPI_Test(), MPI_Wait(),
or MPI_Request_free().
Common examples are applications that call MPI_Isend()
and then use internal means to determine when it is safe to
reuse the send buffer. These applications never call MPI_Wait().
You can fix such codes easily by inserting a call to
MPI_Request_free() immediately after each such MPI_Isend()
call, or by adding a call to MPI_Wait()
at a later place in the code, prior to the point at which the send buffer
must be reused.
I keep getting error messages about MPI_REQUEST_MAX
being too small.
The MPI library reports an error concerning MPI_REQUEST_MAX
in two types of cases. The text of the error message
distinguishes them.
MPI has run out of unexpected request entries;
the current allocation level is: XXXXXX
The program is sending so many unexpected large messages (greater
than 64 bytes) to a process that internal limits in the MPI library have
been exceeded. The options here are to increase the number of allowable
requests via the MPI_REQUEST_MAX shell variable, or
to modify the application.
MPI has run out of request entries;
the current allocation level is: MPI_REQUEST_MAX = XXXXX
You might have an application problem. You almost certainly are
calling MPI_Isend() or MPI_Irecv()
and not completing or freeing your request objects. You need to use
MPI_Request_free(), as described in the previous section.
I am not seeing stdout and/or stderr
output from my MPI application.
All stdout
and stderr output is line-buffered, which means that
mpirun does not print any partial lines of output. This sometimes
causes problems for codes that prompt the user for input parameters but
do not end their prompts with a newline character. One solution is
to append a newline character to each prompt.
Alternatively, you can set the MPI_UNBUFFERED_STDIO environment
variable to disable line-buffering. For more information, see the MPI(1) and mpirun(1)
man pages.
How can I get the MPT software to install on my machine?
MPT
RPMs are included in ProPack releases. In addition, you can obtain MPT
RPMs from the SGI Support website, under "Downloads".
Where can I find more information about the SHMEM programming model?
See the intro_shmem(3) man page.
The ps(1) command says my memory
use (SIZE) is higher than expected.
At
MPI job start-up, MPI calls the SHMEM library to cross-map all user static
memory on all MPI processes to provide optimization opportunities. The
result is large virtual memory usage. The ps(1)
command's SIZE statistic is telling you the amount
of virtual address space being used, not the amount of memory being consumed.
Even if all of the pages that you could reference were faulted in, most
of the virtual address regions point to multiply-mapped (shared) data
regions, and even in that case, actual per-process memory usage would
be far lower than that indicated by SIZE.
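The difference between virtual address space and resident memory can be checked directly on a running process. The following is a minimal sketch, assuming a Linux system with /proc mounted. The function name vm_summary is hypothetical. It prints the VmSize (virtual size) and VmRSS (resident set size) fields for a process ID, defaulting to the current shell.

```shell
# Hypothetical helper: print virtual size and resident set size for a PID.
# Reads /proc/<pid>/status, which is Linux-specific.
vm_summary() {
    pid=${1:-$$}
    # Each line of the status file looks like "VmSize:   123456 kB".
    while read -r key value unit; do
        case $key in
            VmSize:|VmRSS:) echo "$key $value $unit" ;;
        esac
    done < "/proc/$pid/status"
}

# Example: report on the current shell. For an MPI job, pass the PID of
# one of the MPI processes instead.
vm_summary "$$"
```

For an MPI process, VmSize is expected to be much larger than VmRSS. Note that VmRSS still counts shared pages in every process that has touched them, so it can also overstate the memory that a single process consumes on its own.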
What does MPI: could not run executable mean?
This
message means that something happened while mpirun
was trying to launch your application, which caused it to fail before
all of the MPI processes were able to handshake with it.
The mpirun command directs arrayd
to launch a master process on each host and listens on a socket for those
masters to connect back to it. Since the masters are children of
arrayd, arrayd traps SIGCHLD
and passes that signal back to mpirun whenever one
of the masters terminates. If mpirun receives a signal
before it has established connections with every host in the job, it knows
that something has gone wrong.
How do I combine MPI with (insert favorite tool here)?
In
general, the rule to follow is to run mpirun on your
tool and then the tool on your application. Do not try to run the tool
on mpirun. Also, because of the way that
mpirun sets up stdio, seeing the output from
your tool might require a bit of effort. The ideal case is when the
tool directly supports an option to redirect its output to a file. In
general, this is the recommended way to mix tools with mpirun.
Of course, not all tools (for example, dplace) support such
an option. However, it is usually possible to make it work by wrapping
a shell script around the tool and having the script do the redirection,
as in the following example:
> cat myscript
#!/bin/sh
# setenv is csh syntax; under /bin/sh, assign and export instead
MPI_DSM_OFF=1; export MPI_DSM_OFF
dplace -verbose a.out 2> outfile
> mpirun -np 4 myscript
hello world from process 0
hello world from process 1
hello world from process 2
hello world from process 3
> cat outfile
there are now 1 threads
Setting up policies and initial thread.
Migration is off.
Data placement policy is PlacementDefault.
Creating data PM.
Data pagesize is 16k.
Setting data PM.
Creating stack PM.
Stack pagesize is 16k.
Stack placement policy is PlacementDefault.
Setting stack PM.
there are now 2 threads
there are now 3 threads
there are now 4 threads
there are now 5 threads
Why do I see “stack traceback” information when my
MPI job aborts?
More
information can be found in the MPI(1)
man page in descriptions of the MPI_COREDUMP and
MPI_COREDUMP_DEBUGGER environment variables.