SpeedShop User's Guide
(document number: 007-3311-011 / published: 2003-08-23)
Chapter 1. Introduction to Performance Analysis
This chapter provides a brief introduction to performance
analysis techniques for SGI systems and describes how to use them with SpeedShop
to solve performance problems. It includes the following sections:
Sources of Performance Problems
To tune a program's performance, you must first determine where machine
resources are being used. At any point in a process, there is one limiting
resource controlling the speed of execution. Processes can be slowed down
by:
CPU speed and availability: a CPU-bound
process spends its time executing in the CPU and is limited by CPU speed and
availability. To improve the performance of CPU-bound processes, you may need
to streamline your code. This can entail modifying algorithms, reordering
code to avoid interlocks, removing nonessential steps, blocking to keep data
in cache and registers, or using alternative algorithms.
I/O processing: an I/O-bound process
has to wait for input/output (I/O) to complete. I/O may be limited by disk
access speeds or memory caching. To improve the performance of I/O-bound processes,
you can try one of the following techniques:
Memory size and availability: a program that continuously
needs to swap out pages of memory is called memory-bound.
Page thrashing is often due to accessing virtual memory on a haphazard rather
than strategic basis; cache misses result. Insufficient memory bandwidth could
also be the problem.
To fix a memory-bound process, you can try to improve the memory reference
patterns or, if possible, decrease the memory used by the program.
Bugs: you may find that a bug is causing the performance problem.
For example, you may find that you are reading in the same file twice in different
parts of the program, that floating-point exceptions are slowing down your
program, that old code has not been completely removed, or that you are leaking
memory (making malloc calls without the corresponding calls
to free).
Performance phases: because programs exhibit different behavior
during different phases of operation, you need to identify the limiting resource
during each phase. A program can be I/O-bound while it reads in data, CPU-bound
while it performs computation, and I/O-bound again in its final stage while
it writes out data. Once you've identified the limiting resource in a phase,
you can perform an in-depth analysis to find the problem. And after you have
solved that problem, you can check for other problems within the phase. Performance
analysis is an iterative process.
Cache thrashing: if an application does not access CPU caches
efficiently, it runs more slowly while the CPU and operating system
reload cache entries.
Fixing Performance Problems
The SpeedShop tools described in this manual can help you to identify
specific performance problems described later; these techniques are only a
part of performance tuning. You can also tune graphics, I/O, the kernel, system
parameters, memory, and real-time system calls. For a complete guide to all
performance tools and the documentation about those tools, see the Guide to SGI Compilers and Compiling Tools.
Although it may be possible to obtain short-term speed increases by
relying on unsupported or undocumented quirks of the compiler, it is a bad
idea to do so. Any such “features” may break in future compiler
releases. The best way to produce efficient code that will remain efficient
is to follow good programming practices. In particular, choose good algorithms
and leave the details to the compiler.
The SpeedShop tools allow you to run experiments and generate reports
that track down the sources of performance problems. SpeedShop consists of
a set of commands that can be run in a shell, an application programming interface
(API) to provide some control over data collection, and a number of libraries
to support the commands.
This section provides an overview of the tools by first discussing the
main commands, then providing more detail on additional commands, experiment
types, libraries, the SpeedShop API, and supported programs and languages.
SpeedShop
provides the following commands to help you analyze your programs:
ssusage: Collects information about your
program's use of machine resources. Output from ssusage
can be used to determine where most resources are being spent.
ssrun: Allows you to run experiments on
a program to collect performance data. It establishes the environment to capture
performance data for an executable, creates a process from the executable
(or from an instrumented version of the executable) and runs it. Input to
ssrun consists of an experiment type, control flags, the name of
the target, and the arguments to be used in executing the target.
prof: Analyzes the performance data you
have recorded using ssrun and provides formatted reports.
prof detects the type of experiment you have run and generates a
report specific to the experiment type. You can also use the cvperf
command to display the data through the WorkShop graphical user interface.
sscompare: Analyzes the performance data
in one or more experiment files generated by SpeedShop and produces comparison
reports.
SpeedShop provides the following additional commands:
squeeze: Allocates a region of virtual
memory and locks the virtual memory down into real memory, making it unavailable
to other processes.
thrash: Allows you to allocate a block
of memory, then access the allocated memory to explore system paging behavior.
The following are the most popular experiments using the ssrun
command. (For the complete list of experiments, see the ssrun(1) man page.)
pcsamp experiments provide information
on a program's CPU usage using statistical program counter sampling.
Data is measured by periodically sampling the program counter of the
target executable when it is executing in the CPU. The program counter shows
the address of the currently executing instruction in the program. The data
that is obtained from the samples is translated to a time that can be displayed
at the function, source line, and machine instruction levels. The actual
CPU time is calculated by multiplying the number of times a specific
address is found in the PC by the amount of time between samples. (For a definition
of CPU time, wall-clock time, and process virtual time, see the glossary.)
hwc experiments display information from
a variety of hardware counters using statistical sampling.
Hardware counter experiments are available on R10000, R12000, R14000,
and R16000 systems that have built-in hardware counters. Data is measured
by counting each time the specified hardware counter exceeds its maximum value,
or overflows. You can specify the hardware counter and the overflow interval
you want to use. For more information on the hardware counter experiments,
see “Hardware Counter Experiments (*_hwc, *_hwctime)” in Chapter 4.
usertime experiments display a program's
CPU time by statistical call-stack profiling.
Data is measured by periodically sampling the call stack. The program's
call stack data is used to attribute exclusive user
time to the function at the bottom of each call stack (that is, the function
being executed at the time of the sample), and to attribute inclusive
user time to all the functions above the one currently being
executed. Exclusive time is the execution time of a given function but not
any functions that function calls, while inclusive time is the execution time
both of a given function and of any functions called by that function.
The totaltime experiment returns wall-clock
time in a manner identical to that of the usertime experiment.
It uses statistical call-stack profiling, based on wall-clock time, with a
time sample interval of 30 milliseconds.
bbcounts experiments display an estimated
time based on linear basic block counting.
Data is measured by counting the number of executions of each
basic block and calculating an estimated time for each function.
This involves instrumenting the program to divide the code into basic blocks,
which are consecutive sequences of instructions with a single entry point,
a single exit point, and no branches into or out of the sequence. Instrumentation
also records a count of all dynamic (function-pointer) calls.
Because an exact count of every instruction in your program is recorded,
you can also use the bbcounts experiment to determine the
efficiency of your algorithm and identify any code that is not executed.
fpe experiments trace floating-point exceptions.
A floating-point exception trace collects each floating-point exception,
including the exception type and the call stack, at the time of the exception. prof(1) generates a report showing inclusive and exclusive
floating-point exception counts.
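The arithmetic behind the sampling experiments described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not SpeedShop code: the function names, the sample data, and the 10-millisecond interval used for the program counter samples are assumptions made for the example (the 30-millisecond interval is the one quoted above for totaltime).

```python
from collections import Counter


def pcsamp_times(pc_samples, interval_ms):
    """CPU time per function: number of samples times the sample interval."""
    counts = Counter(pc_samples)
    return {func: n * interval_ms for func, n in counts.items()}


def callstack_times(stacks, interval_ms):
    """Attribute exclusive and inclusive time from sampled call stacks.

    Each stack is ordered from the outermost caller to the function that
    was executing when the sample was taken (the last element).
    """
    exclusive = Counter()
    inclusive = Counter()
    for stack in stacks:
        # Exclusive time goes only to the function that was executing.
        exclusive[stack[-1]] += interval_ms
        # Inclusive time goes to every function on the stack, including the
        # executing one. set() charges a recursive function once per sample.
        for func in set(stack):
            inclusive[func] += interval_ms
    return dict(exclusive), dict(inclusive)


# Hypothetical samples, each already resolved to a function name.
pc_samples = ["solve", "solve", "solve", "read_input", "solve"]
print(pcsamp_times(pc_samples, interval_ms=10))

# Hypothetical call stacks, one per sample.
stacks = [
    ["main", "solve", "dot"],
    ["main", "solve"],
    ["main", "read_input"],
    ["main", "solve", "dot"],
]
exclusive, inclusive = callstack_times(stacks, interval_ms=30)
print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

In this sketch a function's inclusive time is never less than its exclusive time, which matches the definitions given above.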
Versions of the SpeedShop libraries libss.so and
libssrt.so are available for the old 32-bit, new 32-bit, and 64-bit
application binary interfaces (ABIs). They support only applications
built using shared libraries (called dynamic shared objects, or DSOs).
The following list describes the different SpeedShop libraries.
libss.so: A shared library (DSO) that supports
libssrt.so. The libss.so data normally appears
in experiment results generated with prof.
libssrt.so: A shared library (DSO) that
is linked in to the program you specify when you run an experiment. All
performance data collection with the SpeedShop system is done within the target
processes by exercising various pieces of functionality in libssrt.so.
Data from libssrt.so does not normally appear
in performance data reports generated with prof.
libfpe_ss.so: Supplements the standard
libfpe.so for the purposes of collecting floating-point exception
data. See the fpe_ss(3) man page for more information.
libmalloc_ss.so: Inserts versions of
malloc routines from libc.so.1 that allow tracing
all calls to malloc, free,
realloc, memalign, and valloc.
See the malloc_ss(3) man page for more information.
libpixrt.so: A
shared library (DSO) used by programs that have been instrumented for basic
block counting.
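To make the idea of tracing the malloc family of calls more concrete, the following Python sketch replays a list of allocation events and reports the blocks that were never freed. The event format, the addresses, and the function name are hypothetical and chosen only for illustration; they do not reflect the data that libmalloc_ss.so actually records.

```python
def find_leaks(trace):
    """Return {address: size} for blocks allocated but never freed.

    `trace` is a hypothetical list of events:
        ("malloc", address, size)
        ("free", address)
        ("realloc", old_address, new_address, size)
    """
    live = {}
    for event in trace:
        kind = event[0]
        if kind == "malloc":
            _, addr, size = event
            live[addr] = size
        elif kind == "free":
            live.pop(event[1], None)
        elif kind == "realloc":
            _, old, new, size = event
            live.pop(old, None)
            live[new] = size
        else:
            raise ValueError(f"unknown event: {kind}")
    return live


# Hypothetical trace: the 64-byte block at 0x1000 is never freed.
trace = [
    ("malloc", 0x1000, 64),
    ("malloc", 0x2000, 128),
    ("realloc", 0x2000, 0x3000, 256),
    ("free", 0x3000),
]
for addr, size in find_leaks(trace).items():
    print(f"leaked {size} bytes at {addr:#x}")
```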
The SpeedShop application programming interface (API) allows you to
use the ssrt_caliper_point function to set caliper points
in your source code. See “Using Calipers” in Chapter 6, for information
on using caliper points. For information on other API functions, see the ssapi(3) man page.
Supported Programming Models and Languages
The SpeedShop tools support programs with the following characteristics:
Shared libraries (DSOs).
Unstripped executables.
Executables that call fork(2), sproc(2), system(3F),
or exec(2).
Executables using supported techniques for opening, closing,
and delay-loading DSOs.
C, C++, Fortran (Fortran 77 and Fortran 90), or Ada (1.4.2
and older versions) source code.
Power Fortran and Power C source code. prof
understands the syntax and semantics of the multiprocessing run time and displays
the data accordingly.
pthreads, supported with data on a per-program
basis.
Message Passing Interface (MPI) or other message-passing paradigms.
Currently supported by providing data on the behavior of each process. The
behavior of the MPI library itself is monitored just like any other user-level
code. See the MPI Programmer's Manual
for details about the MPI library.
The OpenMP collection of compiler directives, library routines,
and environment variables that can be used to specify shared memory parallelism.
Using SpeedShop Tools for Performance Analysis
Performance tuning typically consists of:
Examining machine resource usage
Breaking down the process into phases
Identifying the resource bottleneck within each
phase
Correcting the cause of the bottleneck
Generally, you run the first experiment to break your program down into
phases and run subsequent experiments to examine each phase individually.
After you have solved a problem in a phase, you should re-examine machine
resource usage to see if there is further opportunity for performance improvement.
The general steps for a performance analysis cycle are as follows:
Build the application.
Run experiments on the application to collect performance
data.
Examine the performance data.
Generate an improved version
of the program.
Compare the performance of the improved version of the
program against the previous version. Use the sscompare
command to verify that improvements are being made.
Repeat steps 1 through 5 as needed.
To accomplish this using SpeedShop tools, do the following:
Use the ssusage command to capture information
on your program's use of machine resources.
Use the ssrun command to capture different
types of performance data over either your entire program or parts of the
program. ssrun can be used in conjunction with dbx(1) or cvd(1), the
WorkShop debugger.
Use the prof command to analyze the data
and generate reports.
Using ssusage to Evaluate Machine Resource Use
To determine overall resource usage by your program, run the program
with ssusage. The results of this command allow you to
identify high user CPU time, high system CPU time, high I/O time, and a high
degree of paging. The ssusage(1) command has
the following format:
ssusage executable_name executable_args
From the ssusage output, you can decide which experiments
to run to collect data for further study. For more information on
ssusage, see Chapter 5, “Collecting Data on Machine Resource Usage”, or see the ssusage(1) man page.
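As a rough illustration of how resource totals can point to an experiment, the Python sketch below maps a run's totals to the symptoms and experiment types listed later in this chapter. The input fields, the thresholds, and the rule that unaccounted elapsed time suggests I/O waiting are all assumptions made for the example; they are not the ssusage output format or SpeedShop defaults.

```python
# Experiments suggested for each symptom, as listed in this chapter.
EXPERIMENTS = {
    "high user CPU time": ["usertime", "pcsamp", "*_hwc / *_hwctime", "bbcounts"],
    "high system CPU time": ["fpe (if floating-point exceptions are suspected)"],
    "high I/O time": ["bbcounts, then examine counts of I/O routines"],
    "high paging rate": ["bbcounts, then prof -cordfb and cord"],
}


def classify(user_s, sys_s, elapsed_s, major_faults,
             share=0.5, fault_limit=1000):
    """Return the symptoms suggested by a run's resource totals.

    `share` and `fault_limit` are illustrative thresholds only.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    symptoms = []
    if user_s / elapsed_s >= share:
        symptoms.append("high user CPU time")
    if sys_s / elapsed_s >= share:
        symptoms.append("high system CPU time")
    # Assumption: elapsed time not spent on the CPU was spent waiting on I/O.
    waiting_s = elapsed_s - user_s - sys_s
    if waiting_s / elapsed_s >= share:
        symptoms.append("high I/O time")
    if major_faults >= fault_limit:
        symptoms.append("high paging rate")
    return symptoms


# Hypothetical totals for one run.
for symptom in classify(user_s=1.2, sys_s=0.4, elapsed_s=9.0, major_faults=4200):
    print(symptom, "->", "; ".join(EXPERIMENTS[symptom]))
```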
Gathering and Analyzing Performance Data
This section describes the steps involved in a performance analysis
cycle when using the command-line interface to the SpeedShop tools: the
ssrun and prof commands.
To perform a performance analysis, follow these general steps:
Build the executable.
You can usually build the executable as you would normally. See “Building Your Executable” in Chapter 6, for information on how to build the executable.
Specify caliper points if you want to analyze data
for only a portion of your program.
To collect performance data, issue the
ssrun command with the following parameters:
% ssrun ssrun_options -exp_type executable_name executable_args
The following options are available with the ssrun
command:
ssrun_options: zero or more valid
options. For a complete list of options, see the
ssrun(1) man page.
exp_type: experiment name.
executable_name: executable name.
executable_args: arguments to the
executable.
Use the information in the following list to determine which experiments
to run. Each performance problem is followed by one or more experiment types:
High user CPU time: usertime, pcsamp (four variants), *_hwc/*_hwctime (hardware counter experiments), or bbcounts.
High system CPU time: if floating-point exceptions are suspected, run an fpe trace.
High I/O time: bbcounts, then examine counts of I/O routines.
High paging rates: bbcounts, then prof -cordfb and cord to rearrange procedures.
For each process of the executable, the experiment data is stored in
a file with a name in the following form:
executable_name.exp_type.id
The experiment ID consists of one or two letters designating the process
type, followed by the process ID number. An example of a name is:
See the following table for letter codes and descriptions.
Table 1-1. Letter Codes in Process Experiment ID Numbers
| Letter Code | Description |
|---|---|
| m | Master process created by ssrun |
| p | Process created by a call to sproc() |
| f | Process created by a call to fork() |
| s | Process created by a call to system() |
| e | Process created by a call to exec() |
| fe | Process created by a call to fork() and exec() |
For more information on the ssrun command, see Chapter 6, “Setting Up and Running Experiments: ssrun”, or see the ssrun(1)
man page.
To generate a report from the experiment, issue
prof with the following parameters:
% prof options data_file
options: one or more valid options.
For a complete list of options, see the prof(1)
man page or “prof Options” in Chapter 7.
data_file: the name of the file
in which the experiment data was recorded.
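The data file naming scheme described in the steps above (executable_name.exp_type.id, with the letter codes from Table 1-1) can be decoded mechanically. The following Python sketch is one way to do it; the function name and the example file name are hypothetical, and the sketch assumes that the experiment type itself contains no period.

```python
import re

# Letter codes from Table 1-1.
PROCESS_TYPES = {
    "m": "master process created by ssrun",
    "p": "process created by a call to sproc()",
    "f": "process created by a call to fork()",
    "s": "process created by a call to system()",
    "e": "process created by a call to exec()",
    "fe": "process created by a call to fork() and exec()",
}

# One or two letters followed by the process ID; "fe" is tried before "f".
_ID_PATTERN = re.compile(r"^(fe|[mpfse])(\d+)$")


def parse_experiment_file(name):
    """Split executable_name.exp_type.id into its parts."""
    parts = name.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"not an experiment file name: {name!r}")
    executable, exp_type, exp_id = parts
    match = _ID_PATTERN.match(exp_id)
    if match is None:
        raise ValueError(f"unrecognized experiment ID: {exp_id!r}")
    code, pid = match.groups()
    return {
        "executable": executable,
        "exp_type": exp_type,
        "code": code,
        "process_type": PROCESS_TYPES[code],
        "pid": int(pid),
    }


# Hypothetical file name, used only to exercise the parser.
info = parse_experiment_file("a.out.usertime.m12345")
print(info["executable"], info["exp_type"], info["process_type"], info["pid"])
```

Splitting from the right keeps executable names that contain periods, such as a.out, intact.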
The
sscompare command can be used to analyze the performance data in
experiment files that were generated by SpeedShop tools such as
ssrun, and to produce a comparison report. When comparing application
performance, be sure to keep a copy of the original binary and a copy
of the original experiment file. You can then compare the original experiment
results with the newer (ideally improved) results.
The following are some useful comparisons:
application performance before and after optimization
multiple ranks in an MPI application
multiple threads in an OpenMP application
different experiments for the same application
The comparison report produced by sscompare contains
a legend and a table of performance data. Each input file and the type of
performance data it contains is listed in the legend with a numeric column
key. The table contains multiple columns of data; the type of data is dependent
on the options used to generate the report.
sscompare can be used with the following SpeedShop
experiment types:
See the sscompare(1) man page or “Comparing Experiment Results” in Chapter 7, for more details.
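The kind of before-and-after comparison described above can be illustrated with a small Python sketch that merges two per-function timing tables and prints a legend with numeric column keys followed by a table. It only mimics the general layout described in this section; it does not read SpeedShop experiment files or reproduce the actual sscompare output, and the function names and timings are made up.

```python
def compare_profiles(before, after):
    """Return (function, before, after, change) rows, largest saving first."""
    rows = []
    for func in sorted(set(before) | set(after)):
        b = before.get(func, 0.0)
        a = after.get(func, 0.0)
        rows.append((func, b, a, a - b))
    # Stable sort: functions with equal change stay in alphabetical order.
    rows.sort(key=lambda row: row[3])
    return rows


def format_report(rows, labels=("original run", "optimized run")):
    """Render a legend with numeric column keys, then the data table."""
    lines = [f"[1] {labels[0]}", f"[2] {labels[1]}", ""]
    lines.append(f"{'function':<12}{'[1]':>8}{'[2]':>8}{'change':>8}")
    for func, b, a, delta in rows:
        lines.append(f"{func:<12}{b:>8.2f}{a:>8.2f}{delta:>+8.2f}")
    return "\n".join(lines)


# Hypothetical per-function times in seconds.
before = {"solve": 6.0, "read_input": 2.0}
after = {"solve": 3.0, "read_input": 2.0, "log_stats": 0.5}
print(format_report(compare_profiles(before, after)))
```

Sorting by the change column puts the functions that improved most at the top, which makes regressions at the bottom of the table easy to spot.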
Collecting Data for Part of a Program
If you have a performance problem in only one part of your program,
consider collecting performance data for just that part. You can do this by
setting caliper points around the problem area when running an experiment,
then using the prof -calipers option to generate a report
for the problem area or using the calipers time line in the
cvperf(1) window of WorkShop to view the area through a graphical user
interface.
You can record caliper points using one of the following methods:
Direct calls to the SpeedShop API.
The caliper signal environment.
A debugger such as the ProDev WorkShop debugger.
Periodic caliper points with pollpoint caliper points.
For more information on using calipers, see “Using Calipers” in Chapter 6.