IRIX 6.5 » Books » Developer »
MIPSpro N32/64 Compiling and Performance Tuning Guide
(document number: 007-2360-010 / published: 2003-08-15)
table of contents | additional info | download find in page
Chapter 4. Optimizing Program Performance
This chapter describes the compiler optimization facilities and their
benefits, and explains the major optimizing techniques. This chapter includes
the following topics:
 | Note: Please see the Release Notes and man pages
for your compiler for a complete list of options that you can use to optimize
and tune your program.
|
See your compiler manual for information about the optional parallelizer,
apo, and the OpenMP directives and routines for a method of using
a portable method to code parallelization into your program. You can find
additional information about optimization in the
MIPSpro 64-Bit Porting and Transition Guide. For information
about writing code for 64-bit programs, see Chapter 5, “Coding for 64-Bit Programs”.
For information about porting code to -n32 and
-64, see Chapter 6, “Porting Code to N32 and 64-Bit SGI Systems”.
The primary benefits of optimization are faster running programs and
often smaller object code size. However, the optimizer can also speed up development
time. For example, you can reduce coding time by leaving it up to the optimizer
to relate programming details to execution time efficiency. You can focus
on the more crucial global structure of your program.
Optimize your program only when it is fully developed and debugged.
To debug a program, you can use the
-g option. Note that you can also use -DEBUG:
options to debug run-time code and generate compile, link, and
run-time warning messages.
Debug a program before optimizing it, because the optimizer may move
operations around so that the object code does not correspond in an obvious
way to the source code. These changed sequences of code can create confusion
when using a debugger. For information on the debugger, see
dbx User's Guide and
ProDev WorkShop: Debugger User's Guide.
The compiler optimization options -O0 through
-O3 determine the optimization level to be applied to your program.
The -O0 option applies no optimizations, and the
-O3 option performs the most aggressive optimizations. See your
compiler manual and man page for more information.
Performance Tuning with Interprocedural Analysis (IPA)
Interprocedural Analysis (IPA) performs program optimizations that can
only be done in the presence of the whole program. Some of the optimizations
it performs also allow downstream phases to perform better code transformations.
 | Note: If you are using the automatic parallelizer (-apo),
run it after IPA. If you apply parallelization to subroutines in separate
modules, and then apply inlining to those modules using -IPA,
you inline parallelized code into a main routine that is not compiled to initialize
parallel execution. Therefore, you must use the parallelizer when compiling
the main module as well as any submodules.
|
Currently IPA optimizes code by performing the following functions:
Procedure inlining
Interprocedural constant propagation
Dead function elimination
Identification of global constants
Dead variable elimination
PIC optimization
Automatic selection of candidates for the gp-relative area
(autognum)
Dead call elimination
Automatic internal padding of COMMON arrays
in Fortran
Interprocedural alias analysis
Figure 4-1, shows the interprocedural analysis
and interprocedural optimization phase of the compilation process.
Typically, you invoke IPA with the -IPA: option group
to f77, f90, cc,
CC, and ld. Its inlining decisions are also controlled
by the -INLINE: option group. Up-to-date information on
IPA and its options is in the ipa(5) man page.
This section covers some IPA options including:
IPA performs across and within file inlining. A default inlining heuristic
determines which calls to inline.
Your code may benefit from inlining for the following reasons:
Inlining exposes a larger context to the scalar and loop-nest
optimizers, thereby allowing more optimizations to occur.
Inlining eliminates overhead resulting from the call (for
example, register save and restore, the call and return instructions, and
so forth). Instances occur, however, when inlining may hurt run-time performance
due to increased demand for registers, or compile-time performance due to
code expansion. Hence extensive inlining is not always useful. You must select
callsites for inlining based on certain criteria such as frequency of execution
and size of the called procedure. Often it is not possible to get an accurate
determination of frequency information based on compile-time analysis. As
a result, inlining decisions may benefit from generating feedback and providing
the feedback file to IPA. The inlining heuristic will perform better since
it is able to take advantage of the available frequency information in its
inlining decision.
Inlining Options for Routines
You may wish to select certain procedures to be inlined or not to be
inlined by using the inlining options.
 | Note: You can use the inline keyword and pragmas in C++ or C to specifically
identify routines to call sites to inline. The inliner's heuristics decides
whether to inline any cases not covered by the -INLINE
options.
|
In all cases, once a call is selected for inlining, a number of tests
are applied to verify its suitability. These tests may prevent its inlining
regardless of user specification: for instance, if the callee is a C
varargs routine, or parameter types do not match.
The -INLINE:none
and -INLINE:all Options
Changes the default inlining heuristic.
The -INLINE:all option. Attempts to inline all routines
that are not excluded by a never option or a routine pragma
suppressing inlining, either for the routine or for a specific callsite.
The -INLINE:none option. Does not attempt to inline
any routines that are not specified by a must option or
a pragma requesting inlining, either for the routine or for a specific callsite.
If you specify both all and none,
none is ignored with a warning.
The -INLINE:must
and -INLINE:never Options
The -INLINE:must=routine_name<,routine_name>*
option. Attempts to inline the specified routines at call sites
not marked by inlining pragmas, but does not inline if varargs
or similar complications prevent it. It observes site inlining pragmas.
Equivalently, you can mark a routine definition with a pragma requesting
inlining.
The -INLINE:never=routine_name<,routine_name>*
option. Does not inline the specified routines at call sites
not marked by inlining pragmas; it observes site inlining pragmas.
 | Note: For C++, you must provide mangled routine names.
|
The -INLINE:file=<
filename> Option
This option invokes the standalone inliner, which provides cross-file
inlining. The option -INLINE:file=<
filename> searches for routines provided via the
-INLINE:must list option in the file specified by the
-INLINE:file option. The file provided in this option must be generated
using the -IPA -c options. The file
generated contains information used to perform the cross-file inlining.
For example, suppose two files exist: foo.f and
bar.f.
The file, foo.f, looks like this:
program main
...
call bar()
end |
The file, bar.f, looks like this:
To inline bar into main, using
the standalone inliner, compile with -IPA and
-c options:
This produces the file, bar.o. To inline
bar into foo.f, enter:
% f77 -n32 foo.f -INLINE:must=bar:file=bar.o |
Power of two arrays can lead to degenerate behavior on cache-based machines.
The IPA phases try, when possible, to pad the leading dimension of arrays
to avoid cache conflicts. Several restrictions exist that limit IPA padding
of common arrays. If the restrictions are not met, the arrays are not padded.
The current restrictions are as follows:
The shape of the common block to which the global
array belongs must be consistent across procedures. That is, the declaration
of the common block must be the same in every subroutine that declares it.
In the following example, IPA can not pad any of the arrays in the common
block because the shape is not consistent.
program main
common /a/ x(1024,1024), y(1024, 1024), z(1024,1024)
....
....
end
subroutine foo
common /a/ xx(100,100), yy(1000,1000), zz(1000,1000)
....
....
end |
The common block variables must not initialize
data associated with them. In this example, IPA can not pad any of the arrays
in common block /a/:
block data inidata
common /a/ x(1024,1024), y(1024,1024), z(1024,1024), b(2)
DATA b /0.0, 0.0/
end
program main
common /a/ x(1024,1024), y(1024,1024), z(1024,1024), b(2)
....
....
end |
The array to be padded may be passed as a parameter
to a routine only if it declared as a one dimensional array, since passing
multi-dimensional arrays that may be padded can cause the array to be reshaped
in the callee.
Restricted types of equivalences to arrays that
may be padded are allowed. Equivalences that do not intersect with any column
of the array are allowed. This implies an equivalencing that will not cause
the equivalenced array to access invalid locations. In the following example,
the arrays in common /a/ will not be padded since
z is equivalenced to x(2,1), and hence
z(1024) is equivalenced to x(1,2).
program main
real z(1024)
common /a/ x(1024,1024), y(1024,1024) equivalence (z, x(2,1))
....
....
end |
The common block symbol must have an
INTERNAL or HIDDEN attribute, which implies that
the symbol may not be referenced within a DSO that has been linked with this
program.
The common block symbol can not be referenced by
regular object files that have been linked with the program.
Alias and Address Taken Analysis
The optimizations that are performed later in the compiler are often
constrained by the possibility that two variable references may be
aliased, meaning they may be aliased to the same address. This
possibility is increased by calls to procedures that are not visible to the
optimizer, and by taking the addresses of variables and saving them for possible
use later (for example, in pointers).
Furthermore, the compiler must normally assume that a global (external)
datum may have its address taken in another file, or may be referenced or
modified by any procedure call. The IPA alias and address-taken analyses are
designed to identify the actual global variable addressing and reference behavior
so that such worst-case assumptions are not necessary.
This option performs IPA alias analysis. That is, it determines which
global variables and formal parameters are referenced or modified by each
call, and which global variables are passed to reference formal parameters.
This analysis is used for other IPA analyses, including constant propagation
and address-taken analysis. This option is ON by default.
The -IPA:addressing=ON Option
This option performs IPA address-taken analysis. That is, it determines
which global variables and formal parameters have their addresses taken in
ways that may produce aliases. This analysis is used for other IPA analyses,
especially constant propagation. Its effectiveness is very limited without -IPA:alias=ON
. This option is ON by default.
Controlling Loop Nest Optimizations (LNO)
Numerical programs often spend most of their time executing loops. The
loop nest optimizer (LNO) performs high-level loop optimizations that can
greatly improve program performance by better exploiting caches and instruction-level
parallelism.
This section covers the following topics:
For complete information on options, see the lno(5)
man page.
LNO is run by default when you use the -O3
option for all Fortran, C, and C++ programs. LNO is an integrated
part of the compiler back end and is not a preprocessor. Therefore, the same
optimizations (with the same control options) apply to Fortran, C, and C++
programs. Note that this does not imply that LNO will optimize numeric C++
programs as well as Fortran programs. C and C++ programs often include features
that make them inherently harder to optimize than Fortran programs.
After LNO performs high-level transformations, it may be desirable to
view the transformed code in the original source language. Two translators
that are integrated into the back end translate the compiler internal code
representation back into the original source language after the LNO transformation
(and IPA inlining). You can invoke either one of these translators by using
the Fortran option -FLIST:=on or the cc
option -CLIST:=on. For example, f77 -O3 -FLIST:=on
x.f creates an a.out as well as a Fortran file
x.w2f.f. The .w2f.f file is a readable file,
and it is usually a compilable SGI Fortran representation of the original
program after the LNO phase (see Figure 4-2).
LNO is not a preprocessor, which means that recompiling the .w2f.f
file directly may result in an executable that is different from
the original compilation of the .f file.
Use the -CLIST:=on option to cc
to translate compiler internal code to C. No translator exists to translate
compiler internal code to C++. When the original source language is C++, the
generated C code may not be compilable.
This section describes some important optimizations performed by LNO.
For a complete listing, see your compiler's man page. Optimizations include:
The order of loops in a nest can affect the number of cache misses,
the number of instructions in the inner loop, and the ability to schedule
an inner loop. Consider the following loop nest example.
do i
do j
do k
a(j,k) = a(j,k) + b(i,k) |
As written, the loop suffers from several possible performance problems.
First, each iteration of the k loop requires two loads
and one store. Second, if the loop bounds are sufficiently large, every memory
reference will result in a cache miss.
Interchanging the loops improves performance.
do k
do j
do i
a(j,k) = a(j,k) + b(i,k) |
Since a(j,k) is loop invariant, only one load is
needed in every iteration. Also, b(i,k) is “stride-1,”
successive loads of b(i,k) come from successive memory
locations. Since each cache miss brings in a contiguous cache line of data,
typically 4-16 elements, stride-1 references incur a cache miss every 4-16
iterations. In contrast, the references in the original loop are not in stride-1
order. Each iteration of the inner loop causes two cache misses; one for
a(j,k) and one for b(i,k).
In a real loop, different factors may affect different loop ordering.
For example, choosing i for the inner loop may improve
cache behavior while choosing j may eliminate a recurrence.
LNO uses a performance model that considers these factors. It then orders
the loops to minimize the overall execution time estimate.
Blocking and Outer Loop Unrolling
Cache blocking and outer loop unrolling are two closely related optimizations
used to improve cache reuse, register reuse, and minimize recurrences. Consider
matrix multiplication in the following example.
do i=1,10000
do j=1,10000
do k=1,10000
c(i,j) = c(i,j) + a(i,k)*b(k,j) |
Given the original loop ordering, each iteration of the inner loop requires
two loads. The compiler uses loop unrolling, that is, register blocking, to
minimize the number of loads.
do i=1,10000
do j=1,10000,2
do k=1,10000
c(i,j) = c(i,j) + a(i,k)*b(k,j)
c(i,j+1) = c(i,j+1) + a(i,k)*b(k,j+1) |
Storing the value of a(i,k) in a register avoids
the second load of a(i,k). Now the inner loop only requires
three loads for two iterations. Unrolling the j loop even
further, or unrolling the i loop as well, further decrease
the amount of loads required. How much is the ideal amount to unroll? Unrolling
more decreases the amount of loads but not the amount of floating point operations.
At some point, the execution time of each iteration is limited by the floating
point operations. There is no point in unrolling further. LNO uses its performance
model to choose a set of unrolling factors that minimizes the overall execution
time estimate.
Given the original matrix multiply loop, each iteration of the
i loop reuses the entire b matrix. However, with
sufficiently large loop limits, the matrix b will not remain
in the cache across iterations of the i loop. Thus in each
iteration, you have to bring the entire matrix into the cache. You can “cache
block” the loop to improve cache behavior.
do tilej=1,10000,Bj
do tilek=1,10000,Bk
do i=1,10000
do j=tilej,MIN(tilej+Bj-1,10000)
do k=tilek,MIN(tilek+Bk-1,10000)
c(i,j) = c(i,j) + a(i,k)*b(k,j) |
By appropriately choosing Bi and Bk,
b remains in the cache across iterations of i,
and the total number of cache misses is greatly reduced.
LNO automatically caches tile loops with block sizes appropriate for
the target machine. When compiling for an SGI R8000, LNO uses a single level
of blocking. When compiling for an SGI systems (such as R4000, R5000, R10000,
or R12000) that contain multi-level caches, LNO uses multiple levels of blocking
where appropriate.
LNO attempts to fuse multiple loop nests to improve cache behavior,
to lower the number of memory references, and to enable other optimizations.
Consider the following example.
do i=1,n
do j=1,n
a(i,j) = b(i,j) + b(i,j-1) + b(i,j+1)
do i=1,n
do j=1,n
b(i,j) = a(i,j) + a(i,j-1) + a(i,j+1) |
In each loop, you need to do one store and one load in every iteration
(the remaining loads are eliminated by the software pipeliner). If
n is sufficiently large, in each loop you need to bring the
entire a and b matrices into the cache.
LNO fuses the two nests and creates the following single nest:
do i=1,n
a(i,1) = b(i,0) + b(i,1) + b(i,2)
do j=2,n
a(i,j) = b(i,j) + b(i,j-1) + b(i,j+1)
b(i,j-1) = a(i,j-2) + a(i,j-1) + a(i,j)
end do
b(i,n) = a(i,n-1) + a(i,n) + a(i,n+1)
end do |
Fusing the loops eliminates half of the cache misses and half of the
memory references. Fusion can also enable other optimizations. Consider the
following example:
do i
do j1
S1
end do
do j2
S2
end do
end do |
By fusing the two inner loops, other transformations are enabled such
as loop interchange and cache blocking.
do j
do i
S1
S2
end do
end do |
As an enabling transformation, LNO always tries to use loop fusion (or
fission, discussed in the following subsection) to create larger perfectly
nested loops. In other cases, LNO decides whether or not to fuse two loops
by using a heuristic based on loop sizes and the number of variables common
to both loops.
To fuse aggressively, use -LNO:fusion=2.
Loop Fission/Distribution
The opposite of fusing loops is distributing loops into multiple pieces,
or loop fission. As with fusion, fission is useful as an enabling transformation.
Consider this example again:
do i
do j1
S1
end do
do j2
S2
end do
end do |
Using loop fission, as shown in the following example, also enables
loop interchange and blocking.
do i1
do j1
S1
end do
end do
do i2
do j2
S2
end do
end do |
Loop fission is also useful to reduce register pressure in large inner
loops. LNO uses a model to estimate whether or not an inner loop is suffering
from register pressure. If it decides that register pressure is a problem,
fission is attempted. LNO uses a heuristic to decide on how to divide the
statements among the resultant loops.
Loop fission can potentially lead to the introduction of temporary arrays.
Consider the following loop.
do i=1,n
s = ..
.. = s
end do |
If you want to split the loop so that the two statements are in different
loops, you need to scalar expand s.
do i=1,n
tmp_s(i) = ..
end do
do i=1,n
.. = tmp_s(i)
end do |
Space for tmp_s is allocated on the stack to minimize
allocation time. If n is very large, scalar expansion
can lead to increased memory usage, so the compiler blocks scalar expanded
loops. Consider the following example:
do se_tile=1,n,b
do i=se_tile,MIN(se_tile+b-1,n)
tmp_s(i) = ..
end do
do i=se_tile,MIN(se_tile+b-1,n)
.. = tmp_s(i)
end do
end do |
Related to loop fission is vectorization of intrinsics. The SGI math
libraries support vector versions of many intrinsic functions that are faster
than the regular versions. That is, it is faster, per element, to compute
n cosines than to compute a single cosine. LNO attempts to split
vectorizable intrinsics into their own loops. If successful, each such loop
is collapsed into a single call to the corresponding vector intrinsic.
The MIPS IV instruction set supports a data prefetch instruction that
initiates a fetch of the specified data item into the cache. By prefetching
a likely cache miss sufficiently ahead of the actual reference, you can increase
the tolerance for cache misses. In programs limited by memory latency, prefetching
can change the bottleneck from hardware latency time to the hardware bandwidth.
By default, prefetching is enabled at
-O3 for the R10000.
LNO runs a pass that estimates which references will be cache misses
and inserts prefetches for those misses. Based on the miss latency, the code
generator later schedules the prefetches ahead of their corresponding references.
By default, for misses in the primary cache, the code generator moves
loads early in the schedule ahead of their use, exploiting the out-of-order
execution feature of the R10000 to hide the latency of the primary miss. For
misses in the secondary cache, explicit prefetch instructions are generated.
Prefetching is limited to array references in well behaved loops. As
loop bounds are frequently unknown at compile time, it is usually not possible
to know for certain whether a reference will miss. The algorithm therefore
uses heuristics to guess.
Prefetching can improve performance in compute-intensive operations
where data is too large to fit in the cache. Conversely, prefetching will
not help performance in a memory-bound loop where data fits in the cache.
Gather-Scatter Optimization
Software pipelining attempts to improve performance by executing statements
from multiple iterations of a loop in parallel. This is difficult when loops
contain conditional statements. Consider the following example:
do i = 1,n
if (t(i) .gt. 0.0) then
a(i) = 2.0*b(i-1)
end do
end do |
Ignoring the if statement, software pipelining may
move up the load of b(i-1), effectively executing it in
parallel with earlier iterations of the multiply. Given the conditional, this
is not strictly possible. The code generator will often if
convert such loops, essentially executing the body of the if
on every iteration. The if conversion does not work well
when the if is frequently not taken. An alternative is
to gather-scatter the loop, so the loop is divided as follows:
inc = 0 do i = 1,n
tmp(inc) = i
if (t(i) .gt. 0.0) then
inc = inc + 1
end do
end do
do i = 1,inc
a(tmp(i)) = 2.0*b((tmp(i)-1)
end do |
The code generator will if convert the first loop;
however, no need exists to if convert the second one. The
second loop can be effectively software pipelined without having to execute
unnecessary multiplies.
The lno(5) man page describes the compiler options
for LNO. Specifically, topics include:
Controlling LNO Optimization Levels (described below)
Controlling Fission and Fusion (described below)
Controlling Gather-Scatter
Controlling Cache Parameters
Controlling Permutation Transformations and Cache Optimization
Controlling Prefetch
All of the LNO optimizations are on by default when you use the
-O3 compiler option. To turn off LNO at -O3,
use -LNO:opt=0. If you want direct control, you can specify
options and pragmas to turn on and off optimizations that you require.
The options -r5000,
-r8000, -r10000, and -r12000
set a series of default cache characteristics. To override a default setting,
use one or more of the options below.
To define a cache entirely, you must specify all options immediately
following the -LNO:cache_size option. For example, if the
processor is an R4000 (r4k), which has no secondary cache,
then specifying -LNO:cache_size2=4m is not valid unless
you supply the options necessary to specify the other characteristics of the
cache. (Setting -LNO:cache_size2=0 is adequate to turn
off the second level cache; you do not have to specify other second-level
parameters.) Options are available for third and fourth level caches. Currently
none of the SGI machines have such caches. However, you can also use those
options to block other levels of the memory hierarchy.
For example, on a machine with 128 Mbytes of main memory, you can block
for it by using parameters, for example, -LNO:cs3=128M:ls3=....
In this case, assoc3 is ignored and does not have to be
specified. Instead, you must specify is_mem3.., since virtual
memory is fully associative.
Pragmas and Directives for LNO
Fortran directives and C and C++ pragmas enable, disable,
or modify a feature of the compiler. This section uses the term pragma
when describing either a pragma or a directive.
Pragmas within a procedure apply only to that particular procedure,
and revert to the default values upon the end of the procedure. Pragmas that
occur outside of a procedure alter the default value, and therefore apply
to the rest of the file from that point on, until overridden by a subsequent
pragma.
By default, pragmas within a file override the command-line options.
Use the -LNO:ignore_pragmas option to allow command-line
options to override the pragmas in the file.
This section covers:
See the lno(5) man page, or your compiler manual,
for more information on these pragmas and directives.
The following pragmas/directives
control fission and fusion.
| C*$* AGGRESSIVE INNER LOOP FISSION , #pragma aggressive inner loop fission | | | Fission inner loops into as many loops as possible. It can
only be followed by a inner loop and has no effect if that loop is not inner
any more after loop interchange.
| | C*$* FISSION, #pragma fission, C*$* FISSIONABLE, #pragma fissionable | | | Fission the enclosing n level of
loops after this pragma. The default is 1. Performs validity test unless a
FISSIONABLE pragma is also specified. Does not reorder statements.
| | C*$* FUSE, #pragma fuse, C*$* FUSABLE , #pragma fusable | | | Fuse the following loops, which must be immediately adjacent.
The default is 2,level. Fusion is attempted on each pair
of adjacent loops and the level, by default, is the determined by the maximal
perfectly nested loop levels of the fused loops although partial fusion is
allowed. Iterations may be peeled as needed during fusion and the limit of
this peeling is 5 or the number specified by the -LNO:fusion_peeling_limit
option. No fusion is done for non-adjacent outer loops. When the
FUSABLE pragma is present, no validity test is done and the fusion
is done up to the maximal common levels.
| | C*$* NO FISSION, #pragma no fission | | | The loop following this pragma should not be fissioned. Its
innermost loops, however, are allowed to be fissioned.
| | C*$* NO FUSION , #pragma no fusion | | | The loop following this pragma should not be fused with other
loops.
|
Blocking and Permutation Transformations
The following pragmas/directives control blocking and permutation transformations.
| C*$* INTERCHANGE, #pragma interchange | | | Loops to be interchanged (in any order) must directly follow
this pragma and be perfectly nested one inside the other. If they are not
perfectly nested, the compiler may choose to perform loop distribution to
make them so, or may choose to ignore the annotation, or even apply imperfect
interchange. Attempts to reorder loops so that I is outermost, then J, then
K. The compiler may choose to ignore this pragma.
| | C*$* NO INTERCHANGE, #pragma no interchange | | | Prevent the compiler from involving the loop directly following
this pragma in an interchange, or any loop nested within this loop.
| | C*$* BLOCKING SIZE [(n1,n2)], #pragma blocking size (n1,n2) | | | The loop specified, if it is involved in a blocking for the
primary (secondary) cache, will have a block size of n1 {n2}. The compiler
tries to include this loop within such a block. If a 0 blocking size is specified,
then the loop is not stripped, but the entire loop is inside the block.
|
For example:
subroutine amat(x,y,z,n,m,mm)
real*8 x(100,100), y(100,100), z(100,100)
do k = 1, n
C*$* BLOCKING SIZE 20
do j = 1, m
C*$* BLOCKING SIZE 20
do i = 1, mm
z(i,k) = z(i,k) + x(i,j)*y(j,k)
enddo
enddo
enddo
end |
In this example, the compiler makes 20x20 blocks when blocking. However,
the compiler can block the loop nest such that loop k is
not included in the tile. If it did not, add the following pragma just before
the k loop.
This pragma suggests that the compiler generates a nest like:
subroutine amat(x,y,z,n,m,mm)
real*8 x(100,100), y(100,100), z(100,100)
do jj = 1, m, 20
do ii = 1, mm, 20
do k = 1, n
do j = jj, MIN(m, jj+19)
do i = ii, MIN(mm, ii+19)
z(i,k) = z(i,k) + x(i,j)*y(j,k)
enddo
enddo
enddo
enddo
enddo
end |
Finally, you can apply a INTERCHANGE pragma to the
same nest as a BLOCKING SIZE pragma. The BLOCKING
SIZE applies to the loop it directly precedes only, and moves with
that loop when an interchange is applied.
| C*$* NO BLOCKING, #pragma no blocking | | | Prevent the compiler from involving this loop in cache blocking.
| | C*$* UNROLL, #pragma unroll | | | This pragma suggests to the compiler that n-1
copies of the loop body be added to the loop. If the loop that this pragma
directly precedes is an inner loop, then it indicates standard unrolling.
If the loop that this pragma directly precedes is not innermost, then outer
loop unrolling (unroll and jam) is performed. The value of n
must be at least 1. If it is exactly 1, then no unrolling is
performed.
|
For example, the following code:
C*$* UNROLL (2)
DO i = 1, 10
DO j = 1, 10
a(i,j) = a(i,j) + b(i,j)
END DO
END DO |
becomes:
DO i = 1, 10, 2
DO j = 1, 10
a(i,j) = a(i,j) + b(i,j)
a(i+1,j) = a(i+1,j) + b(i+1,j)
END DO
END DO |
and not:
DO i = 1, 10, 2
DO j = 1, 10
a(i,j) = a(i,j) + b(i,j)
END DO
DO j = 1, 10
a(i+1,j) = a(i+1,j) + b(i+1,j)
END DO
END DO |
The UNROLL pragma again is attached to the given
loop, so that if an INTERCHANGE is performed, the corresponding
loop is still unrolled. That is, the example above is equivalent to:
C*$* INTERCHANGE i,j
DO j = 1, 10
C*$* UNROLL 2
DO i = 1, 10
a(i,j) = a(i,j) + b(i,j)
END DO
END DO |
| C*$* BLOCKABLE(I,J,K), #pragma blockable (i,j,k) | | | The loops I, J and K must be adjacent and nested within each
other, although not necessarily perfectly nested. This pragma informs the
compiler that these loops may validly be involved in a blocking with each
other, even if the compiler considers such a transformation invalid. The loops
are also interchangeable and unrollable. This pragma does not tell the compiler
which of these transformations to apply.
|
The following pragmas/directives control prefetch operations.
| C*$* PREFETCH, #pragma prefetch | | | Specify prefetching for each level of the cache. Scope: entire
function containing the pragma.
0 prefetching off (default for all processors except
R10000 and R12000)
1 prefetching on, but conservative (default at
-03 when prefetch is on)
2 prefetching on, and aggressive
| | C*$* PREFETCH_MANUAL, #pragma prefetch_manual | | | Specify whether manual prefetches (through pragmas) should
be respected or ignored. Scope: entire function containing the pragma.
0 ignore manual prefetches (default for all processors
except R10000 and R12000)
1 respect manual prefetches (default at
-03 for R10000 and beyond)
| | C*$* PREFETCH_REF, #pragma prefetch_ref | | | The object specified is the reference itself; for example,
an array element A(i, j). The object can also be a scalar,
for example, x, an integer. Must be specified.
A specified stride prefetches every stride iterations of this loop.
Optional; the default is 1.
The specified level specifies the level in memory hierarchy to prefetch.
Optional; the default is 2.
lev=1: prefetch from L2 to L1 cache.
lev=2: prefetch from memory to L1 cache.
kind specifies read/write. Optional; the default
is write.
size specifies the size (in Kbytes) of this object
referenced in this loop. Must be a constant. Optional.
|
The effect of this pragma is:
Generate a prefetch and connect to the specified reference
(if possible).
Search for array references that match the supplied reference
in the current loop-nest. If such a reference is found, then that reference
is connected to this prefetch node with the specified latency. If no such
reference is found, then this prefetch node stays free-floating and is scheduled “loosely.”
Ignore all references to this array in this loop-nest by the
automatic prefetcher (if enabled).
If the size is specified, then the auto-prefetcher (if enabled)
uses that number in its volume analysis for this array.
No scope, just generate a prefetch.
| C*$* PREFETCH_REF_DISABLE, #pragma prefetch_ref_disable | | | This explicitly disables prefetching all references to the
specified array in the current loop nest. The auto-prefetcher runs (if enabled),
ignoring the array.
|
The following pragmas and/or directives control fill and/or alignment
of symbols. This section uses the term pragma when describing either a pragma or a directive.
The align_symbol pragma aligns the start of the named
symbol at the specified alignment. The fill_symbol pragma
pads the named symbol.
| C*$* FILL_SYMBOL, #pragma fill_symbol, C*$* ALIGN_SYMBOL, #pragma align_symbol | | | The fill_symbol/align_symbol
pragmas take a symbol, that is, a variable that is a Fortran COMMON
, a C/C++ global, or an automatic variable (but not a formal and
not an element of a structured type like a struct or an array).
The second argument in the pragma may be one of the keywords: L1cacheline (machine specific first-level
cache line size, typically 32 bytes)
L2cacheline (machine specific second-level
cache line size, typically 128 bytes)
page (machine specific page size,
typically 16 Kbytes)
a user-specified power-of-two value
The align_symbol pragma aligns the start of the named
symbol at the specified alignment, that is, the symbol “s”
will start at the specified alignment boundary.
The fill_symbol pragma pads the named symbol with
additional storage so that the symbol is assured not to overlap with any other
data item within the storage of the specified size. The additional padding
required is heuristically divided between each end of the specified variable.
For instance, a fill_symbol pragma for the
L1cacheline guarantees that the specified symbol does not suffer
from false-sharing for the L1 cache line.
For global variables, these pragmas must be specified at the variable
definition, and are optional at the declarations of the variable.
For COMMON block variables, these pragmas are required
at each declaration of the COMMON block. Since the pragmas
modify the allocated storage and its alignment for the named symbol, inconsistent
pragmas can lead to undefined results.
The align_symbol pragma is ineffective for local
variables of fixed-size symbols, such as simple scalars or arrays of known
size. The pragma continues to be effective for stack-allocated arrays of dynamically
determined size.
A variable cannot have both fill_symbol and
align_symbol pragmas applied to it.
Examples:
int x; /* x is a global or a common block variable */
#pragma align_symbol (x, 32)
/* x will start at a 32-byte boundary */
#pragma align_symbol (x, 2)
/* Error: cannot request alignment lower than the natural
* alignment of the symbol.
*/
double y; /* y is a global, common, or local */
#pragma fill_symbol (y, L2cacheline)
/* allocate extra storage both before and after “y” so
* that “y” is within an L2cacheline (128 bytes) all by
* itself. Can be useful to avoid false-sharing between
* multipleprocessors for cacheline containing “y”.
*/ |
|
The following pragmas/directives control dependence analysis.
| CDIR$ IVDEP, #pragma ivdep | | | Liberalize dependence analysis. This applies only to inner
loops. Given two memory references, where at least one is loop variant, ignore
any loop-carried dependences between the two references.
For example: CDIR$ IVDEP
do i = 1,n
b(k) = b(k) + a(i)
enddo |
ivdep does not break the dependence since
b(k) is not loop variant. CDIR$ IVDEP
do i=1,n
a(i) = a(i-1) + 3.0
enddo |
ivdep does break the dependence, but the compiler
warns the user that it is breaking an obvious dependence. CDIR$ IVDEP
do i=1,n
a(b(i)) = a(b(i)) + 3.0
enddo |
ivdep does break the dependence. CDIR$ IVDEP
do i = 1,n
a(i) = b(i)
c(i) = a(i) + 3.0
enddo |
ivdep does not break the dependence on
a(i) since it is within an iteration.
If -OPT:cray_ivdep=TRUE, use Cray semantics. Break
all lexically backward dependences. For example: CDIR$ IVDEP
do i=1,n
a(i) = a(i-1) + 3.0
enddo |
ivdep does break the dependence but the compiler
warns the user that it is breaking an obvious dependence. CDIR$ IVDEP
do i=1,n
a(i) = a(i+1) + 3.0
enddo |
ivdep does not break the dependence since the dependence
is from the load to the store, and the load comes lexically before the store.
To break all dependencies, specify -OPT:liberal_ivdep=TRUE
.
-OPT:cray_ivdep and -OPT:liberal_ivdep
are OFF (FALSE) by default.
|
Controlling Floating-Point Optimization
Floating-point numbers (the Fortran REAL*
n, DOUBLE PRECISION, and COMPLEX*
n, and the C float,
double, and long double) are inexact representations
of ideal real numbers. The operations performed on them are also necessarily
inexact. However, the MIPS processors conform to the IEEE 754 floating-point
standard, producing results as precise as possible given the constraints of
the IEEE 754 representations, and the MIPSpro compilers generally preserve
this conformance. Note, however, that 128-bit floating point (that is, the
Fortran REAL*16 and the C long double)
is not precisely IEEE-compliant. In addition, the source language standards
imply rules about how expressions are evaluated.
Most code that has not been written with careful attention to floating-point
behavior does not require precise conformance to either the source language
expression evaluation standards or to IEEE 754 arithmetic standards. Therefore,
the MIPSpro compilers provide a number of options that trade off source language
expression evaluation rules and IEEE 754 conformance against better performance
of generated code. These options allow transformations of calculations specified
by the source code that may not produce precisely the same floating point
result, although they involve a mathematically equivalent calculation.
Two of these options are the preferred controls:
-OPT:roundoff=n
deals with the extent to which language expression evaluation rules are observed,
generally affecting the transformation of expressions involving multiple operations.
-OPT:IEEE_arithmetic=n
deals with the extent to which the generated code conforms to IEEE 754 standards
for discrete IEEE-specified operations (for example, a divide or a square
root).
The
-OPT:roundoff option provides control over floating point accuracy
and overflow/underflow exception behavior relative to the source language
rules.
The
roundoff option specifies the extent to that optimizations are allowed
to affect floating point results, in terms of both accuracy and overflow/underflow
behavior. The roundoff value, n, has a value in
the range 0-3. Roundoff values are described in the following list:
| roundoff=0 | | | Do no transformations that could affect floating-point results.
This is the default for optimization levels -O0 to
-O2.
| | roundoff=1 | | | Allow transformations with limited effects on floating point
results. For roundoff, limited means that only the last bit or two of the
mantissa is affected. For overflow (or underfow), it means that intermediate
results of the transformed calculation may overflow within a factor of two
of where the original expression may have overflowed (or underflowed). Note
that effects may be less limited when compounded by multiple transformations.
| | roundoff=2 | | | Allow transformations with more extensive effects on floating
point results. Allow associative rearrangement, even across loop iterations,
and distribution of multiplication over addition or subtraction. Disallow
only transformations known to cause cumulative roundoff errors, or overflow
or underflow, for operands in a large range of valid floating-point values.
Reassociation can have a substantial effect on the performance of software
pipelined loops by breaking recurrences. This is therefore the default for
optimization level -O3.
| | roundoff=3 | | | Allow any mathematically valid transformation of floating
point expressions. This allows floating point induction variables in loops,
even when they are known to cause cumulative roundoff errors, and fast algorithms
for complex absolute value and divide, which overflow (underflow) for operands
beyond the square root of the representable extremes.
|
The -OPT:IEEE_arithmetic option controls conformance
to IEEE 754 arithmetic standards for discrete operators.
The -OPT:IEEE_arithmetic
option specifies the extent to which optimizations must preserve
IEEE floating-point arithmetic. The value n must
be in the range of 1 through 3. Values are described in the following list:
| -OPT:IEEE_arithmetic=1 | | | No degradation: do no transformations that degrade floating-point
accuracy from IEEE requirements. The generated code may use instructions such
as madd, which provides greater accuracy than required
by IEEE 754. This is the default.
| | -OPT:IEEE_arithmetic=2 | | | Minor degradation: allow transformations with limited effects
on floating point results, as long as exact results remain exact. This option
allows use of the MIPS4 recip and rsqrt
operations.
| | -OPT:IEEE_arithmetic=3 | | | Conformance not required: allow any mathematically valid transformations.
For instance, this allows implementation of x/y as
x*recip(y), or sqrt(x) as x*rsqrt(x)
.
|
As an example, consider optimizing the following Fortran code fragment:
INTEGER i, n
REAL sum, divisor, a(n)
sum = 0.0
DO i = 1,n
sum = sum + a(i)/divisor
END DO |
At roundoff=0 and IEEE_arithmetic=1,
the generated code must do the n loop iterations
in order, with a divide and an add in each.
Using IEEE_arithmetic=3, the divide can be treated
like a(i)*(1.0/divisor). For example, on the MIPS R8000
and R10000, the reciprocal can be done with a recip instruction.
But more importantly, the reciprocal can be calculated once before the loop
is entered, reducing the loop body to a much faster multiply and add per iteration,
which can be a single madd instruction on the R8000 and
R10000.
Using roundoff=2, the loop may be reordered. For
example, the original loop takes at least 4 cycles per iteration on the R8000
(the latency of the add or madd instruction).
Reordering allows the calculation of several partial sums in parallel, adding
them together after loop exit. With software pipelining, a throughput of nearly
2 iterations per cycle is possible on the R8000, a factor of 8 improvement.
Consider another example:
INTEGER i,n
COMPLEX c(n)
REAL r
DO i = 1,n
r = 0.1 * i
c(i) = CABS ( CMPLX(r,r) )
END DO |
Mathematically, r can be calculated by initializing
it to 0.0 before entering the loop and adding 0.1 on each iteration. But doing
so causes significant cumulative errors because the representation of 0.1
is not exact. The complex absolute value is mathematically equal to
SQRT(r*r + r*r). However, calculating it this way causes an overflow
if 2*r*r is greater than the maximum REAL value, even though
a representable result can be calculated for a much wider range of values
of r (at greater cost). Both of these transformations
are forbidden for roundoff=2, but enabled for
roundoff=3.
Other Options to Control Floating Point Behavior
Other options exist that allow finer control of floating point behavior
than is provided by -OPT:roundoff. The options may be used
to obtain finer control, but they may disappear or change in future compiler
releases.
| -OPT:div_split | | | Enable/disable the calculation of x/y as x*(1.0/y)
, normally enabled by IEEE_arithmetic=3. Simplifies
expressions by determining if (A/B) should be turned into
(1/B)*A. This can be useful if B is a loop-invariant,
as it replaces the divide with a multiply. For example, X = A/B
becomes X = A*(1/B). See -OPT:recip.
| | -OPT:fast_complex | | | Enable/disable the fast algorithms for complex absolute value
and division, normally enabled by roundoff=3.
| | -OPT:fast_exp | | | Enable/disable the translation of exponentiation by integers
or halves to sequences of multiplies and square roots. This can change roundoff, and can make these functions
produce minor discontinuities at the exponents where it applies. Normally
enabled by roundoff>0 for Fortran, or for C if the function
exp() is labeled intrinsic in <math.h> (the
default in -xansi and -cckr modes).
| | -OPT:fast_io | | | Enable/disable inlining of printf(), fprintf(),
sprintf(), scanf(), fscanf(),
sscanf(), and printw() for more specialized lower-level
subroutines. This option applies only if the candidates for inlining are marked
as intrinsic (-D__INLINE_INTRINSICS) in the respective
header files (<stdio.h> and <curses.h>
); otherwise they are not inlined. Programs that use functions
such as printf() or scanf() heavily
generally have improved I/O performance when this switch is used. Since this
option may cause substantial code expansion, it is OFF
by default.
| | -OPT:fast_sqrt | | | Enable/disable the calculation of square root as x*rsqrt(x) for -mips4,
normally enabled by IEEE_arithmetic=3. This option is ignored
for the R10000.
| | -OPT:fold_reassociate | | | Enable/disable transformations that reassociate or distribute
floating point expressions. This option is on at
-O3, or if roundoff >= 2. For example,
X + 1. X can be turned into X - X +
1.0, which will then simplify to 1. This can
cause problems is X is large compared to 1,
so that X+1 is X due to roundoff.
| | -OPT:IEEE_comparisons | | | Force comparisons to yield results conforming to the IEEE
754 standard for NaN and Inf operands,
normally disabled. Setting this option disables certain optimizations like
assuming that a comparison x==x is always TRUE
(since it is FALSE if x is
a NaN). It also disables optimizations that reverse the
sense of a comparison, for example, turning “x < y”
into “! (x >= y)”, since both “x<y”
and “x>=y” may be FALSE
if one of the operands is a NaN.
| | -OPT:recip | | | Allow use of the MIPS IV reciprocal instruction for 1.0/y, normally enabled by
-O3 or IEEE_arithmetic>=2. See -OPT:div_split
. For example, X = 1./Y generates the
recip instruction instead of a divide instruction. This may change
the results slightly.
| | -OPT:rsqrt | | | Allow use of the MIPS IV reciprocal square root instruction
for 1.0/sqrt(y),
normally enabled by
-O3 or IEEE_arithmetic>=2. For example,
X = 1./SQRT(Y) generates the rsqrt instruction
instead of a divide and a square root. This may change the results slightly.
This option is ignored for the R10000.
| | -TARG:madd | | | The MIPS IV architecture supports fused multiply-add instructions,
which add the product of two operands to a third, with a single roundoff step
at the end. Because the product is not separately rounded, this can produce
slightly different (but more accurate) results than a separate multiply and
add pair of instructions. This is normally enabled for
-mips4.
|
Debugging Floating-Point Problems
The preceding options can change the results of floating point calculations,
causing less accuracy (especially -OPT:IEEE_arithmetic),
different results due to expression rearrangement (-OPT:roundoff
), or NaN/Inf results in new cases. Note that in some such cases,
the new results may not be worse (that is, less accurate) than the old, they
just may be different. For instance, doing a sum reduction by adding the terms
in a different order is likely to produce a different result. Typically, that
result is not less accurate, unless the original order was carefully chosen
to minimize roundoff.
If you encounter such effects when using these options (including
-O3, which enables -OPT:roundoff=2 by default),
first attempt to identify the cause by forcing the safe levels of the options:
-OPT:IEEE_arithmetic=1:roundoff=0. When you do this, do not have
the following options explicitly enabled:
| -OPT:div_split | | -OPT:fast_complex | | -OPT:fast_exp | | -OPT:fast_sqrt | | -OPT:fold_reassociate | | -OPT:recip | | -OPT:rsqrt |
If using the safe levels works, you can either use the safe levels or,
if you are dealing with performance-critical code, you can use the more specific
options (for example, div_split, fast_complex
, and so forth) to selectively disable optimizations. Then you can
identify the source code that is sensitive and eliminate the problem. Or,
you can avoid the problematic optimizations.
Controlling Other Optimizations with the -OPT Option
The following -OPT options allow control over a variety
of optimizations. These include:
Using the -OPT:Olimit Option
| -OPT:Olimit | | | This option controls the size of procedures to be optimized.
Procedures above the cutoff limit are not optimized. A value of 0 means “infinite
Olimit,” and causes all procedures to be optimized. If you
compile at -O2 or above, and a routine is so large that
the compile speed may be slow, then the compiler prints a message telling
you the Olimit value needed to optimize your program.
|
Using the -OPT:alias Option
| -OPT:alias=name | | | The compilers must typically be very conservative in optimization
of memory references involving pointers (especially in C), since aliases (that
is, different ways of accessing the same memory) may be very hard to detect.
This option may be used to specify that the program being compiled avoids
aliasing in various ways. The
-OPT:alias options are listed below.
| | -OPT:alias=any | | | The compiler assumes that any pair of memory references may
be aliased unless it can prove otherwise. This is the default.
| | -OPT:alias=typed | | | The compiler assumes that any pair of memory references that
reference distinct types in fact reference distinct data. For example, consider
the code: void dbl ( int *i, float *f ) {
*i = *i + *i;
*f = *f + *f;
} |
The compiler assumes that i and f
point to different memory, and produces an overlapped schedule for the two
calculations.
| | -OPT:alias=unnamed | | | The compiler assumes that pointers never point to named objects.
For example, consider the code: float g;
void dbl ( float *f ) {
g = g + g;
*f = *f + *f;
} |
The compiler assumes that f cannot point to
g, and produces an overlapped schedule for the two calculations.
This option also implies the alias=typed assumption.
Note that this is the default assumption for the pointers implicit in Fortran
dummy arguments according to the ANSI standard.
| | -OPT:alias=restrict and -OPT:alias=disjoint
| | | The compiler assumes a very restrictive (restrict
) model of aliasing: memory operations dereferencing different named
pointers in the program are assumed not to alias with each other, nor with
any named scalar in the program. For example, if p and
q are pointers, *p does not alias with
*q; *p does not alias with p;
and *p does not alias with any named scalar variable.
Use -OPT:alias=no_restrict when distinct pointer
variables may point to overlapping storage.
Use -OPT:alias=disjoint
for memory operations dereferencing different named pointers in
the program that are assumed not to alias with each other, or with any named
scalar in the program. For example, if p and
q are pointers, *p does not alias with
*q; *p does not alias with **p;
and *p does not alias with **q.
Use -OPT:alias=no_disjoint when distinct pointer
expressions may point to overlapping storage.
Although these options are very dangerous to use, they may produce significantly
better code when used for specific, well-controlled cases where they are known
to be valid.
|
Simplifying Code with the -OPT Option
The following -OPT options perform algebraic simplifications
of expressions, such as turning x + 0 into x.
| -OPT:fold_unsafe_relops | | | Controls folding of relational operators in the presence of
possible integer overflow. On by default. For example, X + Y <
0 may turn into X < Y. If X + Y
overflows, it is possible to get different answers.
| | -OPT:fold_unsigned_relops | | | Determines if simplifications are performed of unsigned relational
operations that may result in wrong answers in the event of integer overflow.
Off by default. The example is the same as above, only for unsigned integers.
|
Controlling Execution Frequency
The #pragma mips_frequency_hint
provides information about execution frequency for certain regions in the
code. The format of the pragma is:
#pragma mips_frequency_hint {NEVER|
INIT} [function_name]
You can provide the following indications:
NEVER or INIT.
NEVER: This region of code is never or rarely executed.
The compiler may move this region of the code away from the normal path. This
movement may either be to the end of the procedure or at some point to an
entirely separate section.
INIT: This region of code is executed only during
initialization or startup of the program. The compiler may try to put all
regions under INIT together to provide better locality
during the startup of a program.
You can specify this pragma in two ways:
 | Note: The pragma must appear after the function declaration.
|
This section describes the part of the compiler that generates code.
The code generator processes an input program unit (PU) in intermediate
representation form to produce an output object file (.o) or assembly
file (.s).
Program units are partitioned into basic blocks. A new basic block is
started at each branch target. Basic blocks are also ended by CALL
statements or branches. Large basic blocks are arbitrarily ended
after a certain number of operations, because some algorithms in the code
generator work on one basic block at a time (“local” algorithms)
and have a complexity that is nonlinear in the number of operations in the
basic block.
This section covers the following topics:
Code Generator and Optimization Levels
At optimization levels -O0 and -O1,
the code generator only uses local algorithms that operate individually on
each basic block. At -O0, no code generator optimization
is done. References to global objects are spilled and restored from memory
at basic block boundaries. At -O1, the code generator performs
standard local optimizations on each basic block (for example, copy propagation,
dead code elimination) as well as some elimination of useless memory operations.
At optimization levels -O2 and -O3,
the code generator includes global register allocation and a large number
of special optimizations for innermost loops, including software pipelining
at -O3.
An Example of Local Optimization for Fortran
Consider the Fortran statement, a(i) = b(i). At -O0,
the value of i is kept in memory and is loaded before each
use. This statement uses two loads of i. The code generator
local optimizer replaces the second load of i with a copy
of the first load, and then it uses copy-propagation and dead code removal
to eliminate the copy. Comparing .s files for the
-O0 and -O1 versions shows:
The .s file for -O0:
lw $3,20($sp) # load address of i
lw $3,0($3) # load i
addiu $3,$3,-1 # i - 1
sll $3,$3,3 # 8 * (i-1)
lw $4,12($sp) # load base address for b
addu $3,$3,$4 # address for b(i)
ldc1 $f0,0($3) # load b
lw $1,20($sp) # load address of i
lw $1,0($1) # load i
addiu $1,$1,-1 # i - 1
sll $1,$1,3 # 8 * (i-1)
lw $2,4($sp) # load base address for a
addu $1,$1,$2 # address for a(i)
sdc1 $f0,0($1) # store a |
The .s file for -O1:
lw $1,0($6) # load i
lw $4,12($sp) # load base address for b
addiu $3,$1,-1 # i - 1
sll $3,$3,3 # 8 * (i-1)
lw $2,4($sp) # load base address for a
addu $3,$3,$4 # address for b(i)
addiu $1,$1,-1 # i - 1
ldc1 $f0,0($3) # load b
sll $1,$1,3 # 8 * (i-1)
addu $1,$1,$2 # address for a(i)
sdc1 $f0,0($1) # store a |
The .s file for -O2 (using OPT
to perform scalar optimization) produces optimized code:
lw $1,0($6) # load i
sll $1,$1,3 # 8 * i
addu $2,$1,$5 # address of b(i+1)
ldc1 $f0,-8($2) # load b(i)
addu $1,$1,$4 # address of a(i+1)
sdc1 $f0,-8($1) # store a(i) |
Code Generator and Optimization Levels -O2 and
-O3
This section provides additional information about the -O2
and -O3 optimization levels.
The if conversion
transformation converts control-flow into conditional assignments. For example,
consider the following code before if conversion. Note
that expr1 and expr2 are arbitrary expressions
without calls or possible side effects. For example, if expr1
is i++, the following example would be wrong because `
i' would not be updated.
if ( cond )
a = expr1;
else
a = expr2; |
After if conversion, the code looks like this:
tmp1 = expr1;
tmp2 = expr2;
a = (cond) ? tmp1 : tmp2; |
Performing if conversion results in the following
benefits:
It exposes more instruction-level parallelism. This is almost
always valuable on hardware platforms such as R10000.
It eliminates branches. Some platforms (for example, the R10000)
have a penalty for taken branches. There can be substantial costs associated
with branches that are not correctly predicted by branch prediction hardware.
For example, a mispredicted branch on R10000 has an average cost of about
8 cycles.
It enables other compiler optimizations. Currently, cross-iteration
optimizations and software pipelining both require single basic block loops.
The if conversion changes multiple basic block innermost
loops into single basic block innermost loops.
In the preceding code that was if converted, the
expressions, expr1 and expr2, are unconditionally
evaluated. This can conceivably result in the generation of exceptions that
do not occur without if conversion. An operation that is
conditionalized in the source, but is unconditionally executed in the object,
is called a speculated operation. Even if the -TENV:X level
prohibits speculating an operation, it may be possible to if
convert. For information about the -TENV option, see the
appropriate compiler man page.
For example, suppose expr1 = x + y; is a floating
point add, and X=1. Speculating FLOPs is not allowed (to
avoid false overflow exceptions). Define x_safe and
y_safe by x_safe = (cond)? x : 1.0; y_safe = (cond) ? y
: 1.0;. Then unconditionally evaluating tmp1 = x_safe +
y_safe; cannot generate any spurious exception. Similarly, if
X < 4, and expr1 contains a load (for example,
expr1 = *p), it is illegal to speculate the dereference of
p. But, defining p_safe = (cond) ? p : known_safe_address;
and then tmp1 = *p_safe; cannot generate a spurious
memory exception.
Notice that with -TENV:X=2, it is legal to speculate
FLOPs, but not legal to speculate memory references. So expr1 = *p
+ y; can be speculated to tmp1 = *p_safe + y;.
If *known_safe_address is uninitialized, there can be spurious
floating point exceptions associated with this code. In particular, on some
MIPS platforms (for example, R10000) if the input to a FLOP is a denormalized
number, then a trap will occur. Therefore, by default, the code generator
initializes *known_safe_address to 1.0.
Cross-Iteration Optimizations
Four main types of cross-iteration optimizations include: Read-Read Elimination: Consider the following example:
DO i = 1,n
B(i) = A(i+1) - A(i)
END DO |
The load of A(i+1) in iteration i
can be reused (in the role of A(i)) in iteration
i+1. This reduces the memory requirements of the loop from 3 references
per iteration to 2 references per iteration.
Read-Write Elimination: Sometimes a value written in one iteration
is read in a later iteration. For example:
DO i = 1,n
B(i+1) = A(i+1) - A(i)
C(i) = B(i)
END DO |
In this example, the load of B(i) can be eliminated
by reusing the value that was stored to B in the previous
iteration.
Write-Write Elimination: Consider the following example:
DO i = 1,n
B(i+1) = A(i+1) - A(i)
B(i) = C(i) - B(i)
END DO |
Each element of B is written twice. Only the second
write is required, assuming read-write elimination is done.
Common Sub-Expression Elimination: Consider the following
example:
DO i = 1,n
B(i) = A(i+1) - A(i)
C(i) = A(i+2) - A(i+1)
END DO |
The value computed for C in iteration i
may be used for B in iteration i+1
. Thus only one subtract per iteration is required.
In this example, unrolling 4 times converts this code:
for( i = 0; i < n; i++ ) {
a[i] = b[i];
} |
to this code:
for ( i = 0; i < (n % 4); i++ ) {
a[i] = b[i];
}
for ( j = 0; j < (n / 4); j++ ) {
a[i+0] = b[i+0];
a[i+1] = b[i+1];
a[i+2] = b[i+2];
a[i+3] = b[i+3];
i += 4;
} |
Loop unrolling:
Exposes more instruction-level parallelism. This may be valuable
even on execution platforms such as R10000 or R12000 systems.
Eliminates branches.
Amortizes loop overhead. For example, unrolling replaces four
increments i+=1 with one increment i+=4.
Enables some cross-iteration optimizations such as read/write
elimination over the unrolled iterations.
Recurrence breaking offers multiple benefits. Before the recurrences
are broken (see both of the following examples), the compiler waits for the
prior iteration's add to complete (four cycles on the R8000) before starting
the next one, so four cycles per iteration occur.
When the compiler interleaves the reduction, each add must still wait
for the prior iteration's add to complete, but four of these are done at one
time, then partial sums are combined on exiting the loop. The four iterations
are done in four cycles, or one cycle per iteration, quadruple the speed.
With back substitution, each iteration depends on the result from two
iterations back (not the prior iteration), so four cycles per two iterations
occur, or two cycles per iteration (double the speed).
 | Note:
The compiler actually interleaves and back-substitutes
these examples even more than shown below, for even greater benefit (3 cycles/4
iterations for the R8000 in both cases). These examples are simple for purposes
of exposition.
|
Two types of recurrence breaking are reduction interleaving and back
substitution:
Reduction interleaving. For example, interleaving by 4 transforms
this code:
sum = 0
DO i = 1,n
sum = sum + A(i)
END DO |
After reduction interleaving, the code looks like this (omitting the
cleanup code):
sum1 = 0
sum2 = 0
sum3 = 0
sum4 = 0
DO i = 1,n,4
sum1 = sum1 + A(i+0)
sum2 = sum2 + A(i+1)
sum3 = sum3 + A(i+2)
sum4 = sum4 + A(i+3)
END DO
sum = sum1 + sum2 + sum3 + sum4 |
Back substitution. For example:
DO i = 1,n
B(i+1) = B(i) + k
END DO |
The code is converted to:
k2 = k + k
B(2) = B(1) + k
DO i = 2,n
B(i+1) = B(i-1) + k2
END DO |
Software pipelining schedules innermost loops to keep the hardware pipeline
full. For information about software pipelining, see the
MIPSpro 64-Bit Porting and Transition Guide. Also, for
a general discussion of instruction level parallelism, refer to B.R.Rau and
J.A.Fisher, “Instruction Level Parallelism,” Kluwer Academic Publishers,
1993 (reprinted from the Journal of Supercomputing, Volume 7, Number 1/2).
The global code motion phase performs various code motion transformations
in order to reduce the overall execution time of a program. The global code
motion phase is useful because it does the following:
It moves instructions in nonloop code. In the following code
example, assume expr1 and expr2 are
arbitrary expressions that cannot be if-converted. The
cond is a Boolean expression that evaluates to either true or false
and has no side effects with either expr1 or
expr2.
if (cond)
a = expr1;
else
a = expr2; |
After global code motion, the code looks like this:
a = expr1;
if (!cond)
a = expr2; |
Note, that expr1 is arbitrarily chosen to speculate
above the branch. The decision to select candidates for code movement are
based on several factors, including resource availability, critical length,
basic block characteristics, and so forth.
It moves instructions in loops with control-flow. In the following
code example, assume that p is a pointer and
expr1 and expr2 are arbitrary expressions involving
p. Also, assume that cond is a Boolean expression
that uses p and has no side effects with expr2
and expr1.
while (p != NULL) {
if (cond)
sum += expr1(p);
else
sum += expr2(p);
p = p->next;
} |
After global code motion, the code looks like this:
while (p != NULL) {
t1 = expr1(p);
t2 = expr2(p);
if (cond)
sum += t1;
else
sum += t2;
p = p->next;
} |
Note, that t1 and t2 temporaries
are created to evaluate the respective expressions and conditionally executed.
The increment of the pointer, p=p->next, cannot move above
the branch because of side effects with the condition.
It moves instructions across procedure calls. In the following
code example, assume that expr1 has no side effects with
the procedure call to foo (that is, procedure
foo does not use and/or modify the expression expr1).
After global code motion, the code looks like this:
Benefits of Global Code Motion
The benefits of global code motion include the following:
It exposes more instruction-level parallelism. Global code
motion identifies regions and/or blocks that have excessive and/or insufficient
parallelism than that provided by the target architecture. Global code motion
effectively redistributes (or load balances) the regions/blocks by selectively
performing code movements between them. This can effectively reduce their
respective schedule lengths and the overall execution time of the program.
It provides branch delay slot filling. Global code motion
fills branch delay slots and converts most frequently executed branches to
branch-likely form (for example, beql, rs,
rt, L1).
It enables other compiler optimizations. As a result of performing
global code motion, some branches are either removed or transformed to a more
effective form.
Steps Performed by the Code Generator at Levels -O2
and -O3
The steps performed by the code generator at -O2
and -O3 include:
Nonloop if conversion. This
also works in loops by performing any if-conversion that
produces faster code.
Find innermost loop candidates for further optimization.
Loops are rejected for any of the following reasons:
Marked UNIMPORTANT (for example, LNO cleanup
loop)
Strange control flow (for example, branch into the middle)
if convert (-O3
only). This transforms a multi-basic block loop into a single basic block
containing operations with “guards.” The if
conversion of loop bodies containing branches can fail for any of the following
reasons:
Cross-iteration read/write (read/read, and write/write) elimination
Cross-iteration CSE (common subexpression elimination)
Recurrence fixing
Software pipelining
Perform cross-iteration optimizations (except write/write
elimination on loops without trip counts; for example, most “while”
loops).
Unroll loops.
Fix recurrences.
If still a loop, and there is a trip count, and
-O3, invoke software pipelining.
If not software pipelined, reverse if
convert.
Reorder basic blocks to minimize (dynamically)
the number of taken branches. Also eliminate branches to branches when possible,
and remove unreachable basic blocks. This step also happens at -O1
.
Invoke global code motion phase.
At several points in this process local optimizations are performed,
since many of the transformations performed can expose additional opportunities.
It is also important to note that many transformations require legality checks
that depend on alias information. There are three sources of alias information:
At -O3, the loop nest optimizer, LNO, provides
a dependence graph for each innermost loop.
The scalar optimizer provides information on aliasing at the
level of symbols. That is, it can tell whether arrays A and B are independent,
but it does not have information about the relationship of different references
to a single array.
The code generator can sometimes tell that two memory references
are identical or distinct. For example, if two references use the same register,
and there are no definitions of that register between the two references,
then the two references are identical.
Modifying Code Generator Defaults
The code generator makes many choices, for example, what conditional
constructs to if convert, or how much to unroll a loop.
In most cases, the compiler makes reasonable decisions. Occasionally, however,
you can improve performance by modifying the default behavior.
You can control the code generator by:
Increasing or decreasing the unroll amount.
A heuristic is controlled by -OPT:unroll_analysis
(on by default), which generally tries to minimize unrolling. Less unrolling
leads to smaller code size and faster compilation. You can change the upper
bound for the amount of unrolling with -OPT:unroll_times
(default is 8) or -OPT:unroll_size (the number of instructions
in the unrolled body, current default is 80).
You can look at the .s file for notes (starting
with #<loop>) that indicate how the decision to limit
unrolling was made. For example, loops are not unrolled with recurrences that
cannot be broken (since unrolling cannot possibly help in these cases), so
the .s file now tells why unrolling was limited and how
to change it. For example:
#<loop> Loop body line 7, nesting depth:1, estimated iterations: 100
#<loop> Not unrolled: limited by recurrence of 4 cycles
#<loop> Not unrolled: disable analysis w/-CG:unroll_analysis=off |
Disabling software pipelining with -OPT:swp=off.
As far as the code generator is concerned, -O3 -OPT:swp=off
is the same as -O2. Since LNO does not run at
-O2, however, the input to the code generator can be very different,
and the available aliasing information can be very different. In particular,
cross-iteration loop optimizations are much more effective at -O3
even with -OPT:swp=off, due to the improved
alias information.
Other Code Generator Performance Topics
This section explains a few miscellaneous topics including:
Prefetch and Load Latency
At the -O3
level of optimization, with -r10000, LNO generates
prefetches for memory references that are likely to miss either the L1 (primary)
or the L2 (secondary) cache. The code generator generates prefetch operations
for L2 prefetches, and implements L1 prefetches as follows: makes sure that
loads that had associated L1 prefetches are issued at least 8 cycles before
their results are used.
It is often possible to reduce prefetch overhead by eliminating some
of the corresponding prefetches from different replications. For example,
suppose a prefetch is only required on every fourth iteration of a loop, because
four consecutive iterations will load from the same cache line. If the loop
is replicated four times by software pipelining, then there is no need for
a prefetch in each replication, so three of the four corresponding prefetches
are pruned away.
The original software pipelining schedule has room for a prefetch in
each replication, and the number of cycles for this schedule is what is described
in the software pipelining notes as “cycles per iteration.” The
number of memory references listed in the software pipelining notes (“mem
refs”) is the number of memory references including prefetches in Replication
0. If some of the prefetches have been pruned away from replication 0, the
notes will overstate the number of cycles per iteration while understating
the number of memory references per iteration.
Some choices that the code generator makes are decided based on information
about the frequency with which different basic blocks are executed. By default,
the code generator makes guesses about these frequencies based on the program
structure. This information is available in the
.s file. The frequency printed for each block is the predicted
number of times that block will be executed each time the PU is entered.
The frequency guesses are replaced with the measured frequencies. Currently
the information guides if-conversion, some loop unrolling
decisions (unrelated to the trip count estimate), global code motion, control
flow optimizations, global spill and restore placement, global register allocation,
instruction alignment, and delay slot filling. Average loop trip-counts can
be derived from feedback information. Trip count estimates are used to guide
decisions about how much to unroll and whether or not to software pipeline.
Cording
is an optimization technique for reordering parts of your program to achieve
better locality of reference and reduce instruction fetch overhead based on
dynamically collected data. The following areas are influenced by code region
reordering:
Both of these events contribute to instruction fetch overhead, which
can be alleviated with better locality of reference. Retrieving an instruction
from cache is always faster than retrieving it from memory; so the idea is
to keep instructions in cache as much as possible. The frequencies and costs
associated with each of those events differ significantly. The size of the
program in memory and text-resident set sizes can also be reduced as a result
of cording.
Programs can be reordered using either the
cord(1) command (see the following section) or the
ld(1) linker command (see “Reordering with ld”). The SpeedShop prof(1) command and the WorkShop
cvperf(1) user interface are alternative methods of provided feedback
files to cord and ld (see “Using prof or cvperf”).
Use the following procedure to optimize your application
by reordering its text areas with the cord(1)
command:
Run one
or more SpeedShop bbcounts experiments to collect performance
data, setting caliper points to better identify phases of execution.
Use
sswsextr to extract working set files related to the interval between
each pair of calipers in the experiment file.
% sswsextr a.out a.out.bbcounts.m20683 |
A working set list file (in this case, a.out.a.out.bbcounts.m20683.wslist
) is also generated. It assigns a number to the working set files
and the weight for each one (the default weight is 1).
Use ssorder(1)
or sscord(1) to generate a cord feedback file
combining the working set files for the binary.
% ssorder -wsl a.out.a.out.bbcounts.m20683.wslist -gray -o a.out.fb a.out |
% sscord -wsl a.out.a.out.bbcounts.m20683.wslist a.out |
Use the cord command to reorder the
procedures in the binary.
The following procedure uses the
ld linker to reorder routines in a source file. The procedure shows
how to reorder routines in an executable file (a.out),
but you can also reorder a DSO.
Compile the application as follows:
% f90 -OPT:procedure_reorder verge.f |
You can turn reordering on or off for a given compilation by setting
the procedure_reorder argument to an optional Boolean value.
Doing so is convenient if you are compiling the application in a makefile.
Setting a Boolean value of 1 enables reordering, while a value of 0 disables
reordering.
% f90 -OPT:procedure_reorder=1 verge.f |
Run a SpeedShop
bbcounts experiment to collect performance data, setting caliper
points to better identify phases of execution.
Use
sswsextr to extract working set files for the binary.
% sswsextr a.out a.out.bbcounts.m20683 |
Use ssorder or sscord
to generate a cord feedback file, combining multiple working set
files for the binary.
% ssorder -wsl a.out.a.out.bbcounts.m20683.wslist -gray -o a.out.fb a.out |
% sscord -wsl a.out.a.out.bbcounts.m20683.wslist a.out |
Use ld to reorder the procedures
in the binary as follows:
% ld -LD_LAYOUT:reorder_file=a.out.fb |
Once the experiment file is generated, you can generate a cord feedback
file using either the SpeedShop prof command or the WorkShop
cvperf user interface.
Enter the prof command as follows:
% prof -cordfb a.out.bbcounts.m20683 |
The -cordfb option generates cord feedback for the
executable and all DSOs. Along with its usual output, this command writes
a cord feedback file named a.out.fb.
While using prof will give you what you want, using
either the ssorder or sscord method
described in the previous subsections or the cvperf method
described in the following paragraphs produces more efficient results.
If you are using the WorkShop performance analyzer, first enter the
cvperf command with the experiment file as an argument:
% cvperf a.out.bbcounts.m20683 |
Select the Working Set View from
the Views menu. Once the new window appears,
choose Save Cord Map File from the
Admin menu. By default, the name of the cord feedback file will
be a.out.fb.
Specify the cord feedback file on the cord or
ld commands to reorder the procedures in the binary:
% cord a.out a.out.fb
% ld -LD_LAYOUT:reorder_file=a.out.fb |
Programming Hints for Improving Optimization
The global (scalar) optimizer is part of the compiler back end. It improves
the performance of object programs by transforming existing code into more
efficient coding sequences. The optimizer distinguishes between C, C++, and
Fortran programs to take advantage of the various language semantics.
This section describes the global optimizer and contains coding hints.
Specifically this section includes:
Hints for Writing Programs
Coding Hints for Improving Other Optimization
Using SpeedShop to optimize your code.
Hints for Writing Programs
Use the following hints when writing your programs:
Do not use indirect calls (that is, calls via function pointers,
including those passed as subprogram arguments). Indirect calls may cause
unknown side effects (for instance, changing global variables) that reduce
the amount of optimization possible.
Use functions that
return values instead of pointer parameters.
Avoid
unions that cause overlap between integer and floating point data types. The
optimizer cannot assign such fields to registers.
Use local variables and avoid global variables. In
C and C++ programs, declare any variable outside of a function as static,
unless that variable is referenced by another source file. Minimizing the
use of global variables increases optimization opportunities for the compiler.
Declare pointer parameters as const in
prototypes whenever possible, that is, when there is no path through the routine
that modifies the pointee. This allows the compiler to avoid some of the negative
assumptions normally required for pointer and reference parameters (see the
following).
Pass
parameters by value instead of passing by reference (pointers) or using global
variables. Reference parameters have the same performance-degrading effects
as the use of pointers.
Aliases occur when there are multiple ways to reference the
same data object. For instance, when the address of a global variable is passed
as a subprogram argument, it may be referenced either using its global name,
or via the pointer. The compiler must be conservative when dealing with objects
that may be aliased, for instance keeping them in memory instead of in registers,
and carefully retaining the original source program order for possibly aliased
references.
Pointers in particular tend to cause aliasing problems, since it is
often impossible for the compiler to identify their target objects. Therefore,
you can help the compiler avoid possible aliases by introducing local variables
to store the values obtained from dereferenced pointers. Indirect operations
and calls affect dereferenced values, but do not affect local variables. Therefore,
local variables can be kept in registers. The following example shows how
the proper placement of pointers and the elimination of aliasing produces
better code.
In the following example, the optimizer does not know if *p++
= 0 will eventually modify len. Therefore, the
compiler cannot place len in a register for optimal performance.
Instead, the compiler must load it from memory on each pass through the loop.
int len = 10;
void
zero(char *p)
{
int i;
for (i= 0; i!= len; i++) *p++ = 0;
} |
Increase the efficiency of this example by not using global or common
variables to store unchanging values.
Use local variables. Using local (automatic) variables or
formal arguments instead of static or global prevents aliasing and allows
the compiler to allocated them in registers.
For example, in the following code fragment, the variables
loc and form are likely to be more efficient
than ext* and stat*.
extern int ext1;
static int stat1;
void p ( int form )
{
extern int ext2;
static int stat2;
int loc;
...
} |
Write straightforward code. For example, do not use ++ and
- - operators within an expression. Using these operators produces side-effects
(requires the use of extra temporaries, which increases register pressure).
Avoid taking and passing
addresses (and values). Using addresses creates aliases, makes the optimizer
store variables from registers to their home storage locations, and significantly
reduces optimization opportunities that would otherwise be performed by the
compiler.
Avoid
functions that take a variable number of arguments. The optimizer saves all
parameter registers on entry to VARARG or STDARG
functions. If you must use these functions, use the ANSI standard
facilities of stdarg.h. These produce simpler
code than the older version of varargs.h
Coding Hints for Improving Other Optimization
The global optimizer processes programs only when you specify the
-O2 or -O3 option at compilation. The code generator
phase of the compiler performs certain optimizations. This section has coding
hints that increase optimization for other passes of the compiler.
Use Tables Rather Than if-then-else or
switch Statements
In your programs, use tables rather than
if-then-else or switch statements. For example,
consider this code:
typedef enum { BLUE, GREEN, RED, NCOLORS } COLOR; |
Instead of:
switch ( c ) {
case CASE0: x = 5; break;
case CASE1: x = 10; break;
case CASE2: x = 1; break;
} |
Use:
static int Mapping[NCOLORS] = { 5, 10, 1 };
...
x = Mapping[c]; |
Declare Variables Most Frequently Manipulated
As an optimizing technique, the compiler puts the first eight parameters
of a parameter list into registers where they may remain during execution
of the called routine. Therefore, always declare, as the first eight parameters,
those variables that are most frequently manipulated in the called routine.
Use 32-Bit or 64-Bit Scalar Variables
Use 32-bit or 64-bit scalar variables instead of smaller ones. This
practice can take more data space. However, it produces more efficient code
because the MIPS instruction set is optimized for 32-bit and 64-bit data.
Suggestions for C and C++ Programs
The following suggestions apply to C and C++ programs:
Rely on libc.so
functions (for example, strcpy, strlen,
strcmp, bcopy, bzero,
memset, and memcpy). These functions are carefully
coded for efficiency.
Use a signed data type, or cast to a signed data type, for
any variable that does not require the full unsigned range and must be converted
to floating-point. For example:
double d;
unsigned int u;
int i;
/* fast */ d = i;
/* fast */ d = (int)u;
/* slow */ d = u; |
Converting an unsigned type to floating-point takes significantly longer
than converting signed types to floating-point; additional software support
must be generated in the instruction stream for the former case.
Use signed int types in 64-bit code if
they may appear in mixed type expressions with long
ints (or with long long int types in either 32-bit
or 64-bit code). Since the hardware automatically sign-extends the results
of most 32-bit operations, this may avoid explicit zero-extension code. For
example:
unsigned int ui;
signed int i;
long int li;
/* fast */ li += i;
/* fast */ li += (int)ui;
/* slow */ li += ui; |
Use const and restrict
qualifiers. The __restrict keyword tells the compiler to
assume that dereferencing the qualified pointer is the only way the program
can access the memory pointed to by that pointer. Hence loads and stores through
such a pointer are assumed not to alias with any other loads and stores in
the program, except other loads and stores through the same pointer variable.
For example:
float x[ARRAY_SIZE];
float *c = x;
void f4_opt(int n, float * __restrict a, float * __restrict b)
{
int i;
/* No data dependence across iterations because of __restrict */
for (i = 0; i < n; i++)
a[i] = b[i] + c[i];
} |
Suggestions for C++ Programs Only
The following suggestions apply to C++ programs:
Use the inline keyword whenever possible.
Functions calls in loops that are not inlined prevent loop-nest optimizations
and software pipelining.
Use a direct calls rather than indiscriminate use of virtual
function calls. The penalty is in method lookup and the inability to inline
them.
For scalars only, avoid the creation of unnecessary temporaries,
that is, Aa = 1 is better than Aa = A(1).
For structs and class,
pass by const ref to avoid the overhead of copying.
If your code does not use exception handing, use
-LANG:exceptions=off when compiling.
SpeedShop is an integrated package of performance tools
that you can use to gather performance data and generate reports. To record
the experiments, use the ssrun(1) command,
which runs the experiment and captures data from an executable (or instrumented
version). You can compare experiment results with the
sscompare(1) command. To examine data you can use either
prof(1) or display the data in the WorkShop graphical user interface
with the cvperf(1) command. Speedshop also
lets you start a process and attach a debugger to it.
For detailed information about SpeedShop, ssrun,
sscompare and prof, see the
SpeedShop User's Guide.
MIPSpro N32/64 Compiling and Performance Tuning Guide
(document number: 007-2360-010 / published: 2003-08-15)
table of contents | additional info | download
Front Matter
New Features in This Manual
About This Guide
Chapter 1. About the MIPSpro Compiler System
Chapter 2. Using the MIPSpro Compiler System
Chapter 3. Using Dynamic Shared Objects
Chapter 4. Optimizing Program Performance
Chapter 5. Coding for 64-Bit Programs
Chapter 6. Porting Code to N32 and 64-Bit SGI Systems
Index
home/search |
what's new |
help
|