dplace - a tool for controlling placement of processes onto cpus
dplace [-e] [-c cpu_numbers] [-s skip_count] [-n process_name] \
[-x skip_mask] [-r [l|L|b|B|A|t]] [-o log_file] [-v 1|2] \
command [command-args]
dplace [-p placement_file] [-o log_file] command [command-args]
dplace [-q] [-qq] [-qqq]
The given program is executed after scheduling and memory placement
policies are set up according to command line arguments.
By default, memory is allocated to a process on the node that the pro-
cess is executing on. If a process moves from node to node during its
lifetime, a higher percentage of memory references will be to remote
nodes. Remote accesses typically have higher access times. Process
performance may suffer.
Dplace is used to bind a related set of processes to specific cpus or
nodes to prevent process migrations. In some cases, this will improve
performance since a higher percentage of memory accesses will be to the
local node.
Processes always execute within a cpuset. The cpuset specifies the cpus
that are available for a process to execute on. By default, processes
usually execute in a cpuset that contains all the cpus in the system.
The cpu numbers specified on the command line or in the placement file
are always cpuset-relative.
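The relationship between cpuset-relative (logical) cpu numbers and physical cpu numbers can be illustrated with a small shell function. This is only a sketch: the function name logical_to_physical is hypothetical, and it assumes that logical numbers are assigned to the cpus of the cpuset in ascending physical order, starting at 0.

```shell
# Map a cpuset-relative (logical) cpu number to a physical cpu number.
# $1 = logical cpu number (0-based)
# remaining arguments = physical cpus of the cpuset, in ascending order
# Assumption: logical numbering follows ascending physical order.
logical_to_physical() {
    logical=$1
    shift
    i=0
    for phys in "$@"; do
        if [ "$i" -eq "$logical" ]; then
            echo "$phys"
            return 0
        fi
        i=$((i + 1))
    done
    echo "logical cpu $logical is outside the cpuset" >&2
    return 1
}

# In a cpuset made of physical cpus 8-15, logical cpu 2 is physical cpu 10.
logical_to_physical 2 8 9 10 11 12 13 14 15
```

A logical number beyond the end of the cpuset is reported as an error rather than wrapped.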
Dplace invokes a kernel module to create a job placement container con-
sisting of all (or a subset of) the cpus of the cpuset. In the current
version, version 2, a LD_PRELOAD library (libdplace.so) is used to
intercept calls to fork(), exec(), and pthread_create() to do placement
of tasks being created. Note that tasks created internal to glibc are
not intercepted by the preload library. These tasks will not be placed.
If no placement file is being used, then the dplace process is placed
in this container and (by default) is bound to the first cpu of the
cpuset associated with the container. Then dplace "execs" the <com-
mand>. The command is executing within this placement container and
continues to be bound to the first cpu of the container. As the command
forks child processes, they inherit the container and are bound to the
next available cpu of the container.
If a placement file is being used, then the dplace process is not
placed at the time the job placement container is created. Placement
occurs as processes are forked and exec'd. The placement file may con-
tain a directive for placing this first task. See dplace(5) for addi-
tional information.
dplace maintains a global count of the number of active processes that
have been placed (by dplace) on each cpu.
dplace supports 2 placement modes: load balanced and exact placement.
If load balanced placement (default) is selected, dplace will bind a
process to the cpu that has the lowest number of processes that were
placed by dplace AND is also in the user's cpu list. For example, if
the current cpuset consists of physical cpus 2, 5, 8, and 9, and the
following commands are run:
mpirun -np 2 dplace -s1 app1
mpirun -np 2 dplace -s1 app2
app1 will run on cpus 2 and 5. But app2 will run on cpus 8 and 9
because they have a lesser load (the -s1 above causes dplace to not
place an MPI helper process that is mostly inactive). This assumes
that no other processes placed by dplace are still running.
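The load balanced rule can be modeled in a few lines of shell. This is a sketch of the selection rule as described here, not dplace's actual implementation: the function names pick_least_loaded and bump_load and the "cpu:count" pair format are hypothetical, and breaking ties in favor of the earliest cpu in the list is an assumption.

```shell
# Pick the cpu with the fewest dplace-placed processes.
# Arguments: "cpu:count" pairs in cpu-list order.
# Assumption: ties go to the earliest entry in the list.
pick_least_loaded() {
    best_cpu=
    best_count=
    for pair in "$@"; do
        cpu=${pair%%:*}
        count=${pair##*:}
        if [ -z "$best_cpu" ] || [ "$count" -lt "$best_count" ]; then
            best_cpu=$cpu
            best_count=$count
        fi
    done
    echo "$best_cpu"
}

# Record one more placed process on cpu $1 and print the updated pairs.
bump_load() {
    target=$1
    shift
    out=
    for pair in "$@"; do
        cpu=${pair%%:*}
        count=${pair##*:}
        if [ "$cpu" = "$target" ]; then
            count=$((count + 1))
        fi
        out="$out $cpu:$count"
    done
    echo "${out# }"
}

# Four placements in a cpuset of physical cpus 2, 5, 8 and 9, all idle.
loads="2:0 5:0 8:0 9:0"
order=
for n in 1 2 3 4; do
    cpu=$(pick_least_loaded $loads)
    loads=$(bump_load "$cpu" $loads)
    order="$order $cpu"
done
echo "placement order:$order"
```

With all counts starting at zero, the first two placements land on cpus 2 and 5 and the next two on cpus 8 and 9, which matches the app1 and app2 example.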
If exact placement is selected (-e), processes are bound to cpus in the
exact order that the cpus are specified in the cpu list. Cpu numbers
may appear multiple times in the list. A cpu value of "x" indicates
that binding should not be done for that process. If the end of the
list is reached, binding starts over at the beginning of the list.
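The wrap-around behavior of exact placement can be sketched as follows. The function name exact_cpu_for is hypothetical, the cpu list is assumed to be already expanded into individual entries, and the code only models the ordering rule; it performs no binding.

```shell
# Print the cpu-list entry used for the Nth placed process (N starts at 1).
# $1 = N; remaining arguments = entries of the expanded cpu list.
# The list is reused from the beginning once its end is reached.
exact_cpu_for() {
    n=$1
    shift
    [ "$#" -gt 0 ] || return 1
    # Drop the entries that precede the one selected for process N.
    shift $(( (n - 1) % $# ))
    if [ "$1" = "x" ]; then
        echo "unbound"
    else
        echo "cpu $1"
    fi
}

# Models "-e -c 6,x,x,2" for five processes.
for n in 1 2 3 4 5; do
    echo "process $n: $(exact_cpu_for "$n" 6 x x 2)"
done
```

The fifth process wraps around to the first entry and is reported on cpu 6.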
-c Cpu numbers. Specified as a list of cpus, optionally strided cpu
ranges, or a striding pattern. Example: "-c 1", "-c 2-4", "-c
1,4-8,3", "-c 2-8:3", "-c CS", "-c BT". The specification "-c
2-4" is equivalent to "-c 2,3,4" and "-c 2-8:3" is equivalent to
2,5,8. Ranges may also be specified in reverse order: "-c 12-8"
is equivalent to 12,11,10,9,8. Cpu numbers are NOT physical cpu
numbers. They are logical cpu numbers that are relative to the
cpus that are in the set of allowed cpus as specified by the
current cpuset. A cpu value of "x" (or "*"), in the argument
list for -c option, indicates that binding should not be done
for that process. "x" should be used only if the -e option is
also used. Cpu numbers start at 0. For striding patterns any
subset of the characters (B)lade, (S)ocket, (C)ore, (T)hread may
be used and their ordering specifies the nesting of the itera-
tion. For example "SC" means to iterate all the cores in a
socket before moving to the next CPU socket, while "CB" means to
pin to the first core of each blade, then the second core of
every blade, etc. For best results, use the -e option when using
stride patterns. If the -c option is not specified, all cpus of
the current cpuset are available. The command itself (which is
exec'd by dplace) is the first process to be placed by the -c
option.
-e Exact placement. As processes are created, they are bound to
cpus in the exact order that the cpus are specified in the cpu
list. Cpu numbers may appear multiple times in the list. A cpu
value of "x" indicates that binding should not be done for that
process. If the end of the list is reached, binding starts over
at the beginning of the list.
-o Write a trace file to <log_file> that describes the placement
actions that were made for each fork, exec, etc. Each line con-
tains a timestamp, process id:thread number, cpu that task was
executing on, taskname | placement action. Works with version 2.
-s Skip the first <skip_count> processes before starting to place
processes onto cpus. This option is useful if the first
<skip_count> processes are "shepherd" processes that are used
only for launching the application. If <skip_count> is not spec-
ified, a default value of 0 is used.
-n Only processes named <process_name> are placed. Other processes
are ignored and are not explicitly bound to cpus. Note: pro-
cess_name is the basename of the executable.
-r Specifies that text and/or data should be replicated on the node
or nodes where the application is running. In some cases, repli-
cation will improve performance by reducing the need to make
offnode memory references. The replication option applies to all
programs placed by the dplace command. See dplace(5) for addi-
tional information on text replication. The replication options
are a string of one or more of the following characters:
l replicate library text
L replicate library RW data
b replicate binary (a.out) text
B replicate binary (a.out) RW data
A replicate all library and DSO text & RW data
t thread round-robin option
-x Provides the ability to skip placement of processes. <skip_mask>
is a bitmask. If bit N of <skip_mask> is set, then the (N+1)th
process that is forked is not placed. For example, setting the
mask to 6 will prevent the 2nd and 3rd processes from being
placed. The first process (the process named by the <command>)
will be assigned to the first cpu. The second and third pro-
cesses are not placed. The fourth process is assigned to the
second cpu, etc. This option is useful for certain classes of
threaded apps that spawn a few helper processes that typically
do not use much cpu time. (Hint: Intel OpenMP applications cur-
rently should be placed using -x 2. This could change in future
versions of OpenMP).
-v Provides the ability to run in version 1 or version 2 compati-
bility mode if the kernel support is available. If not speci-
fied, version 2 compatibility is selected. See COMPATIBILITY
section for more details. Note: version 1 requires kernel sup-
port for PAGG.
-p Specifies a placement file that contains additional directives
that are used to control process placement. See dplace(5) for
additional details and for a description of the placement file.
-q If specified once, lists the global count of the number of
active processes that have been placed (by dplace) on each cpu
in the current cpuset. Note that cpu numbers are logical cpu
numbers within the cpuset, NOT physical cpu numbers. If speci-
fied twice, lists the current dplace jobs that are running. If
specified 3 times, lists the current dplace jobs and the tasks
that are in each job.
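The list syntax accepted by the -c option can be modeled with a short shell function. This is only a sketch of the expansion rules stated for -c, not dplace's own parser: the name expand_cpu_list is hypothetical, striding patterns such as "CS" or "BT" are not handled, and applying a stride to a reverse range is an assumption.

```shell
# Expand a -c style cpu list into a comma-separated list of entries.
# Handles single numbers, "x", ranges (a-b), reverse ranges (b-a)
# and strided ranges (a-b:s). Topology patterns are not handled.
expand_cpu_list() {
    rest=$1
    result=
    while [ -n "$rest" ]; do
        # Split off the next comma-separated item.
        item=${rest%%,*}
        case $rest in
            *,*) rest=${rest#*,} ;;
            *)   rest= ;;
        esac
        case $item in
            *-*)
                range=${item%%:*}
                stride=1
                case $item in *:*) stride=${item##*:} ;; esac
                # A stride below 1 would never terminate; reject it.
                [ "$stride" -ge 1 ] || return 1
                first=${range%%-*}
                last=${range##*-}
                cpu=$first
                if [ "$first" -le "$last" ]; then
                    while [ "$cpu" -le "$last" ]; do
                        result="$result,$cpu"
                        cpu=$((cpu + stride))
                    done
                else
                    while [ "$cpu" -ge "$last" ]; do
                        result="$result,$cpu"
                        cpu=$((cpu - stride))
                    done
                fi
                ;;
            *)
                result="$result,$item"
                ;;
        esac
    done
    echo "${result#,}"
}

expand_cpu_list "2-8:3"     # strided range
expand_cpu_list "12-8"      # reverse range
expand_cpu_list "1,4-8,3"   # mixed list
```

The three calls reproduce the equivalences given for -c: 2,5,8 for the strided range and 12,11,10,9,8 for the reverse range, with the mixed list expanding in the order written.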
The following examples assume the command is executed from a shell run-
ning in a cpuset consisting of physical cpus 8-15.
To execute a process on a specific set of logical cpus:
dplace -c 2 date # date runs on physical cpu 10.
dplace make linux # gcc and related processes run on
# physical cpus 8-15.
dplace -c 0-4,6 make linux # make (gcc and related
# processes) run on physical
# cpus 8-12 or 14.
The following example assumes the application is NOT run in a cpuset;
in other words, the cpuset is the entire system.
dplace -e -c 6,x,x,2 app # app will run on cpu 6. The first two
# threads created by app (either by fork
# or pthread_create) will not be bound.
# The 3rd thread is bound to cpu 2. If a
# 4th thread is created, it is bound to
# cpu 6.
dplace -e -c 31,x,30-0 app # The app will run on cpu 31. The first
# app-created thread (either by fork or
# pthread_create) will not be bound.
# The second app-created thread is bound
# to cpu 30, the third app-created
# thread is bound to cpu 29, etc.
Most SGI Message Passing Toolkit (MPT) MPI jobs are launched by mpirun
and use N+1 threads. The first thread is mainly inactive and usually
does not need to be placed.
To launch an MPI application use the following syntax:
mpirun -np <process_count> dplace [dplace_args] app [args]
mpirun -np 8 dplace -s1 lu.8
Intel OpenMP jobs use an extra thread that is unknown to the user and
need not be placed. In addition, OpenMP jobs have Intel-driven cpu
placement functionality that must be disabled by setting KMP_AFFIN-
ITY=disabled when running OpenMP jobs with dplace.
To launch an OpenMP application while pinning the user threads to
unique logical cpus, use the following syntax:
env KMP_AFFINITY=disabled OMP_NUM_THREADS=<thread count> \
dplace -e -x 2 [dplace_args] app [args]
env KMP_AFFINITY=disabled OMP_NUM_THREADS=4 \
dplace -e -x 2 -c 0-3 lu-omp.4
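The -x 2 used above is a bitmask with only bit 1 set, so only the second process created is left unplaced. The following sketch decodes a skip mask according to the rule given for the -x option (bit N set means the (N+1)th process is not placed). The function name is_skipped is hypothetical and the code only models the rule.

```shell
# Succeed (exit status 0) if the Nth process (N starts at 1) is skipped,
# that is, if bit N-1 of the skip mask is set.
# $1 = skip mask (decimal), $2 = N
is_skipped() {
    mask=$1
    n=$2
    [ $(( (mask >> (n - 1)) & 1 )) -eq 1 ]
}

# Mask 6 is binary 110: bits 1 and 2 are set, so the 2nd and 3rd
# processes are skipped while the 1st and 4th are placed.
for n in 1 2 3 4; do
    if is_skipped 6 "$n"; then
        echo "process $n: not placed"
    else
        echo "process $n: placed"
    fi
done
```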
If you have SGI's MPT MPI package installed, you can conveniently use
omplace for a simpler syntax that hides some of the details associated
with launching OpenMP apps that were described in the previous example.
To launch an OpenMP application while pinning the user threads to
unique logical cpus starting at 0, use the following syntax:
omplace -nt <thread count> [omplace_args] app [args]
omplace -nt 4 lu-omp.4
Version 1 of numatools required kernel support for PAGG process place-
ment groups. This support is no longer available in all kernel vari-
ants.
Version 2 of numatools uses a preload library to intercept calls to
fork(), exec() (all variants), pthread_create() and pthread_exit(). The
intercept code performs placement as part of the library call. In most
cases, version 1 and version 2 are compatible. In some cases, however,
a user will notice differences:
preload libraries do not work with statically linked binaries
preload libraries do not intercept fork() or exec() calls that
come from glibc itself. Specifically, the system() call is not
intercepted and no placement of tasks that result from a system()
call will be done. In most cases, this is not an issue although
you may need to adjust the <skip_count> if you use this option
to skip tasks created by system().
In some cases, version 2 of numatools will give better performance than
version 1. Assuming first-touch placement policy, in version 1 all
thread-private data and a few stack pages will be located on the parent
node, not the node that the task is placed on. In version 2, this mem-
ory is usually allocated local to the task's node.
Dplace sets an environment variable to indicate if version 1 or version
2 placement is being done. This variable can be tested by applications:
__DPLACE_ = 1
# version 1 placement (requires PAGG kernel support)
__DPLACE_ = 2
# version 2 placement.
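A script started under dplace could test this variable along the following lines. This is a sketch only: the function name dplace_mode and the messages are hypothetical, and treating an unset variable as "not running under dplace" is an assumption.

```shell
# Report which placement version was announced through __DPLACE_.
dplace_mode() {
    case ${__DPLACE_:-} in
        1) echo "version 1 placement" ;;
        2) echo "version 2 placement" ;;
        # Assumption: an unset or unexpected value means no dplace placement.
        *) echo "no dplace placement detected" ;;
    esac
}

dplace_mode
```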
The <skip_mask> is a kludge. A better solution is needed.
The "-n <process_name>" option is only marginally useful. The <pro-
cess_name> is checked at fork time but not at "exec" time.
Tasks created internal to glibc are not intercepted by the preload
library. These tasks are not placed and will run on any cpu in the
cpuset. For example, tasks created by the system() call will not be
placed. THIS IS AN INCOMPATIBILITY with version 1.x of numatools.
Unless running in version 1 compatibility mode, dplace does not work
with statically linked binaries.
Because LD_PRELOAD is ignored for SUID programs, dplace will not do
correct placement of child processes of SUID programs.
Dplace depends on a loadable kernel module named "numatools". If this
module is not loaded, dplace will fail and print a message to remind
the user to load the numatools module.
cpuset(1), dplace(5), dlook(1), omplace(1)
2.0 26 June 2012 dplace(1)
© 2009 - 2015 Silicon Graphics International Corp. All Rights Reserved.