MIPSpro Assembly Language Programmer's Guide
(document number: 007-2418-006 / published: 2003-08-15)
Chapter 7. Writing Assembly Language Code
This chapter gives rules and examples to follow when designing
an assembly language program. The chapter includes a tutorial section that
contains information about how calling sequences work. This involves writing
a skeleton version of your prospective assembly routine using a high-level
language, and then compiling it with the -S option
to generate a human-readable assembly language file. The assembly language
file can then be used as the starting point for coding your routine. See “Using the .s Assembly Language File” for details about the assembly language file produced
with this option.
This assembler works in 32-bit, high-performance 32-bit (N32),
or 64-bit compilation modes. While these modes are very similar, due to the
difference in data, register and address sizes, the N32 and 64-bit assembler
linkage conventions are not always the same as those for 32-bit mode. For
details on some of these differences, see the
MIPSpro 64-Bit Porting and Transition Guide and the MIPSpro N32 ABI Handbook.
The procedures and examples in this chapter, for the most part, describe
32-bit compilation mode. In some cases, specific differences necessitated
by 64-bit mode are highlighted.
When you write assembly language routines, you should follow the same
calling conventions that the compilers observe, for two reasons:
Often your code must interact with compiler-generated code,
accepting and returning arguments or accessing shared global data.
The symbolic debugger gives better assistance in debugging
programs using standard calling conventions.
The conventions for the compiler system are a bit more complicated than
some, mostly to enhance the speed of each procedure call. Specifically:
The compilers use the full, general calling sequence only
when necessary; where possible, they omit unneeded portions of it. For example,
the compilers avoid using a register as a frame pointer whenever possible.
The compilers and debugger observe certain implicit rules
rather than communicating via instructions or data at execution time. For
example, the debugger looks at information placed in the symbol table by a “.frame”
directive at compilation time, so that it can tolerate the lack of a register
containing a frame pointer at execution time.
This section describes some general areas of concern
to the assembly language programmer:
Stack frame requirements on entering and exiting a routine.
The “shape” of data (scalars, arrays, records,
sets) laid out by the various high-level languages.
For information about register format, and general, special, and floating-point
registers, see Chapter 1.
This discussion
of the stack frame, particularly regarding the graphics, describes 32-bit
operations. In 32-bit mode, restrictions such as stack addressing are enforced
strictly. While these restrictions are not enforced rigidly for 64-bit stack
frame usage, their observance is probably still a good coding practice, especially
if you count on reliable debugging information.
The compilers classify each routine into one of the following categories:
Non-leaf routines, that is, routines that call other procedures.
Leaf routines, that is, routines that do not themselves execute
any procedure calls. Leaf routines are of two types: those that use the
stack and those that do not.
You must decide the routine category before determining the calling
sequence.
To write a program with proper stack frame usage and debugging capabilities,
use the following procedure:
Regardless of the type of routine, you should include a .ent pseudo-op
and an entry label for the procedure. The .ent pseudo-op is for use by the
debugger, and the entry label is the procedure name. The syntax is:
.ent procedure_name
procedure_name:
If you are writing a leaf procedure that does not use the stack,
skip to step 3. For a leaf procedure that uses the stack, or for any non-leaf procedure,
you must allocate all the stack space that the routine requires. The syntax
to adjust the stack size is:
subu $sp,framesize
where framesize is the size of frame required; framesize must be a
multiple of 16. Space must be allocated for:
Local variables.
Saved general registers. Space should be allocated only for
those registers saved. For non-leaf procedures, you must save $31,
which is used in the calls to other procedures from this routine.
If you use registers $16-$23, you must also save
them.
Saved floating-point registers. Space should be allocated
only for those registers saved. If you use registers $f20-$f30
(for 32-bit) or $f24-$f31 (for 64-bit),
you must also save them.
Procedure call argument area. You must allocate the maximum
number of bytes for arguments of any procedure that you call from this routine.
Note: Once you have modified $sp, you should not
modify it again for the rest of the routine.
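The space accounting in step 2 can be sketched as a small calculation. This is only an illustration: the function names, the 4-byte general register size, and the 8-byte floating-point save size are assumptions for 32-bit mode, and the sketch ignores any alignment padding between the individual areas.

```python
def round_up(value, multiple):
    """Round value up to the next multiple (multiple must be positive)."""
    return (value + multiple - 1) // multiple * multiple


def frame_size(local_bytes, saved_gprs, saved_fprs, max_arg_bytes,
               gpr_bytes=4, fpr_bytes=8):
    """Add up the four areas listed in step 2 and round to a multiple of 16.

    gpr_bytes and fpr_bytes are assumed sizes for 32-bit mode: 4-byte
    general registers and floating-point registers saved as doubles.
    """
    total = (local_bytes
             + saved_gprs * gpr_bytes
             + saved_fprs * fpr_bytes
             + max_arg_bytes)
    return round_up(total, 16)


# Hypothetical non-leaf routine: 4 bytes of locals, saves $31 and $16,
# saves no floating-point registers, and calls routines that take at
# most 16 bytes of arguments.
example = frame_size(local_bytes=4, saved_gprs=2, saved_fprs=0,
                     max_arg_bytes=16)
print(example)  # 4 + 8 + 0 + 16 = 28, rounded up to 32
```

The result is the value you would use as framesize in the stack adjustment and in the .frame pseudo-op of the next step.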
Now include a .frame pseudo-op:
.frame framereg,framesize,returnreg
The virtual frame pointer is a frame pointer as used in other compiler
systems but has no register allocated for it. It consists of the
framereg ($sp, in most cases) added to the
framesize (see step 2 above). The following figures show the stack components
for -32 and -n32 and -64.
The returnreg specifies the register containing
the return address (usually $31). These usual values
may change if you use a varying stack pointer or are specifying a kernel trap
routine.
If the procedure is a leaf procedure that does not use the stack,
skip to step 7. Otherwise you must save the registers you allocated space
for in step 2.
To save the general registers, use the following operations:
.mask bitmask,frameoffset
sw reg,framesize+frameoffset-N($sp)
The .mask directive specifies the registers to be
stored and where they are stored. A bit should be on in bitmask for each register
saved (for example, if register $31 is saved, bit 31
should be 1 in bitmask). Bits are set in bitmask in little-endian order,
even if the machine configuration is big-endian. The frameoffset
is the offset from the virtual frame pointer (this number
is usually negative). N should be 0 for the highest
numbered register saved and then incremented by four for each subsequently
lower numbered register saved. For example:
sw $31,framesize+frameoffset($sp)
sw $17,framesize+frameoffset-4($sp)
sw $16,framesize+frameoffset-8($sp)
Figure 7-3 illustrates this example.
Now save any floating-point registers that you allocated space for in
step 2 as follows:
.fmask bitmask,frameoffset
s.[sd] reg,framesize+frameoffset-N($sp)
Notice that saving floating-point registers is identical to saving general
registers except we use the .fmask pseudo-op instead of
.mask, and the stores are of floating-point singles or doubles. The
discussion regarding saving general registers applies here as well, but remember
that N should be incremented by 16 for doubles. The stack framesize must be
a multiple of 16.
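The .mask rules for general registers in this step can be checked with a short script. It is a sketch under stated assumptions: the function name is hypothetical, the registers are 4 bytes wide (32-bit mode), and the framesize of 32 and frameoffset of -4 are example values, not requirements.

```python
def mask_and_stores(saved_regs, framesize, frameoffset, reg_bytes=4):
    """Return the .mask bitmask and the $sp-relative store offsets.

    Bit r of the bitmask is set for each saved register $r. N starts at 0
    for the highest-numbered register and grows by reg_bytes for each
    lower-numbered one, as described for general registers above.
    """
    regs = sorted(set(saved_regs), reverse=True)
    for reg in regs:
        if not 0 <= reg <= 31:
            raise ValueError("register number out of range: %d" % reg)
    bitmask = 0
    for reg in regs:
        bitmask |= 1 << reg
    stores = []
    for index, reg in enumerate(regs):
        n = index * reg_bytes
        stores.append((reg, framesize + frameoffset - n))
    return bitmask, stores


# The three registers from the example above, with assumed frame values.
bitmask, stores = mask_and_stores([16, 17, 31], framesize=32, frameoffset=-4)
print(".mask 0x%08x, %d" % (bitmask, -4))   # .mask 0x80030000, -4
for reg, offset in stores:
    print("sw $%d, %d($sp)" % (reg, offset))  # offsets 28, 24, 20
```

The same arithmetic applies to .fmask if reg_bytes is changed to the increment used for floating-point saves.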
This step describes parameter passing: how to access arguments
passed into your routine and passing arguments correctly to other procedures.
For information on high-level language-specific constructs (call-by-name,
call-by-value, string or structure passing), refer to the
MIPSpro N32/64 Compiling and Performance Tuning Guide.
As specified in step 2, space must be allocated on the stack for all
arguments even though they may be passed in registers. This provides a saving
area if their registers are needed for other variables.
General registers must be used for passing arguments. For 32-bit compilations,
general registers $4-$7 and float
registers $f12, $f14 are used for
passing the first four arguments (if possible). You must allocate a pair of
registers (even if it's a single precision argument) that start with an even
register for floating-point arguments appearing in registers.
For 64-bit compilations, general registers $4-$11
and float registers $f12 through
$f19 are used for passing the first eight arguments (if possible).
In Table 7-1 and Table 7-2,
the “fN” arguments are considered single-
and double-precision floating-point arguments, and “nN”
arguments are everything else. The ellipses (...) mean
that the rest of the arguments do not go in registers regardless of their
type. The “stack” assignment means that you do not put this argument
in a register. The register assignments occur in the order shown in order
to satisfy optimizing compiler protocols:
Table 7-1. Parameter Passing (-32)
| Argument List | Register and Stack Assignments |
|---|---|
| f1, f2 | $f12, $f14 |
| f1, n1, f2 | $f12, $6, stack |
| f1, n1, n2 | $f12, $6, $7 |
| n1, n2, n3, n4 | $4, $5, $6, $7 |
| n1, n2, n3, f1 | $4, $5, $6, stack |
| n1, n2, f1 | $4, $5, ($6, $7) |
| n1, f1 | $4, ($6, $7) |
Table 7-2. Parameter Passing (-n32 and -64)
| Argument List | Register and Stack Assignments |
|---|---|
| d1, d2 | $f12, $f13 |
| s1, s2 | $f12, $f13 |
| s1, d1 | $f12, $f13 |
| d1, s1 | $f12, $f13 |
| n1, d1 | $4, $f13 |
| d1, n1, d1 | $f12, $5, $f14 |
| n1, n2, d1 | $4, $5, $f14 |
| d1, n1, n2 | $f12, $5, $6 |
| s1, n1, n2 | $f12, $5, $6 |
| d1, s1, s2 | $f12, $f13, $f14 |
| s1, s2, d1 | $f12, $f13, $f14 |
| n1, n2, n3, n4 | $4, $5, $6, $7 |
| n1, n2, n3, d1 | $4, $5, $6, $f15 |
| n1, n2, n3, s1 | $4, $5, $6, $f15 |
| s1, s2, s3, s4 | $f12, $f13, $f14, $f15 |
| s1, n1, s2, n2 | $f12, $5, $f14, $7 |
| n1, s1, n2, s2 | $4, $f13, $6, $f15 |
| n1, s1, n2, n3 | $4, $f13, $6, $7 |
| d1, d2, d3, d4, d5 | $f12, $f13, $f14, $f15, $f16 |
| d1, d2, d3, d4, d5, s1, s2, s3, s4 | $f12, $f13, $f14, $f15, $f16, $f17, $f18, $f19, stack |
| d1, d2, d3, s1, s2, s3, n1, n2, n3 | $f12, $f13, $f14, $f15, $f16, $f17, $10, $11, stack |
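The pattern in Table 7-2 can be reproduced with a simple positional rule: the argument in slot i (counting from 0) goes to $4+i if it is not floating-point, to $f12+i if it is, and to the stack once the eight slots are used up. The sketch below encodes only that rule. The function name is hypothetical, and it assumes every argument occupies exactly one doubleword slot, so it does not cover structures, variable argument lists, or other special cases.

```python
def assign_n32(arg_kinds):
    """Map a list of argument kinds to -n32/-64 registers or "stack".

    Each kind is "n" (integer or pointer), "s" (single-precision), or
    "d" (double-precision). Assumes one doubleword slot per argument.
    """
    assignments = []
    for slot, kind in enumerate(arg_kinds):
        if kind not in ("n", "s", "d"):
            raise ValueError("unknown argument kind: %r" % (kind,))
        if slot >= 8:
            assignments.append("stack")       # beyond the eight slots
        elif kind == "n":
            assignments.append("$%d" % (4 + slot))
        else:
            assignments.append("$f%d" % (12 + slot))
    return assignments


# Two rows of Table 7-2, recomputed from the positional rule.
print(assign_n32(["n", "d"]))            # ['$4', '$f13']
print(assign_n32(["s", "n", "s", "n"]))  # ['$f12', '$5', '$f14', '$7']
```

Note how the register number depends on the position of the argument and not on how many arguments of the same kind came before it, which is the main difference from Table 7-1.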
Next, you must restore registers that were saved in step 4. To
restore general purpose registers:
lw reg,framesize+frameoffset-N($sp)
To restore the floating-point registers:
l.[sd] reg,framesize+frameoffset-N($sp)
(Refer to step 4 for a discussion of the value of N.)
Get the return address:
lw $31,framesize+frameoffset($sp)
Clean up the stack:
addu $sp,framesize
Return:
j $31
To end the procedure:
.end procedure_name
The difference in stack frame usage for -n32 and
-64 operations can be summarized as follows:
The portion of the argument structure beyond the initial eight doublewords
is passed in memory on the stack, pointed to by the stack pointer at the time
of call. The caller does not reserve space for the register arguments; the
callee is responsible for reserving it if required (either adjacent to any
caller-saved stack arguments if required, or elsewhere as appropriate). No
requirement is placed on the callee either to allocate space and save the
register parameters, or to save them in any particular place.
In most cases,
high-level language routines and assembly routines communicate via simple variables:
pointers, integers, booleans, and single- and double-precision real numbers.
Describing the details of the various high-level data structures (arrays,
records, sets, and so on) is beyond the scope of this book. If you need to
access such a structure as an argument or as a shared global variable, refer
to the MIPSpro N32/64 Compiling and
Performance Tuning Guide.
This section contains the examples that illustrate program design rules.
Each example shows a procedure written in C and its equivalent written in
assembly language.
Example 7-1. Non-leaf procedure
The following example shows a non-leaf procedure. Notice that it creates
a stackframe, and also saves its return address since it must put a new return
address into register $31 when it invokes its callee:
float
nonleaf(int i, int *j)
{
double atof();
int temp;
temp = i - *j;
if (i < *j) temp = -temp;
return atof(temp);
}
.globl nonleaf
# 1 float
# 2 nonleaf(i, j)
# 3 int i, *j;
# 4 {
.ent nonleaf 2
nonleaf:
.cpload $25 ## Load $gp
subu $sp, 32 ## Create stackframe
sw $31, 20($sp) ## Save the return
## address
sw $gp, 24($sp) ## Save gp
.mask 0x80000000, -4
.frame $sp, 32, $31
# 5 double atof();
# 6 int temp;
# 7
# 8 temp = i - *j;
lw $2, 0($5) ## Arguments are in
## $4 and $5
subu $3, $4, $2
# 9 if (i < *j) temp = -temp;
bge $4, $2, $32 ## Note: $32 is a label,
## not a reg
negu $3, $3
$32:
# 10 return atof(temp);
move $4, $3
jal atof
cvt.s.d $f0, $f0 ## Return value goes in $f0
lw $gp, 24($sp) ## Restore gp
lw $31, 20($sp) ## Restore return address
addu $sp, 32 ## Delete stackframe
j $31 ## Return to caller
.end nonleaf
The -n32 code for the previous example is shown below.
Note that this code is under .set noreorder, so be aware
of delay slots.
.set noreorder
# Program Unit: nonleaf
.ent nonleaf
.globl nonleaf
nonleaf: # 0x0
.frame $sp, 32, $31
.mask 0x80000000, -32
lw $7,0($5) # load *j
addiu $sp,$sp,-32 #.frame.len.nonleaf
sd $gp,8($sp) # save $gp
sd $31,0($sp) # save $ra
lui $31,%hi(%neg(%gp_rel(nonleaf+0))) #load new $gp
addiu $31,$31,%lo(%neg(%gp_rel(nonleaf +0))) #
addu $gp,$25,$31 #
slt $1,$4,$7 # compare i to *j
beq $1,$0,.L.1.1.temp #
subu $7,$4,$7 # i-*j, in delay slot of branch
subu $7,$0,$7 # temp = -temp
.L.1.1.temp: # 0x2c
lw $25,%call16(atof)($gp)#
jalr $25 #atof
or $4,$7,$0 # delay slot of jalr loads arg
ld $31,0($sp) # restore $ra
cvt.s.d $f0,$f0 #
ld $gp,8($sp) # restore $gp
jr $31 #
addiu $sp,$sp,32 # .frame.len.nonleaf
.end nonleaf
Example 7-2. Leaf Procedure
This example shows a leaf procedure that does not require stack space
for local variables. Notice that it creates no stackframe, and saves no return
address.
int
leaf(p1, p2)
int p1, p2;
{
return (p1 > p2) ? p1 : p2;
}
.globl leaf
# 1 int
# 2 leaf(p1, p2)
# 3 int p1, p2;
# 4 {
.ent leaf 2
leaf:
.frame $sp, 0, $31
# 5 return (p1 > p2) ? p1 : p2;
ble $4, $5, $32 ## Arguments in
## $4 and $5
move $3, $4
b $33
$32:
move $3, $5
$33:
move $2, $3 ## Return value
## goes in $2
j $31 ## Return to
## caller
# 6 }
.end leaf
The -n32 code for the previous example looks like
this:
.set noreorder
.ent leaf
.globl leaf
leaf: #0x0
.frame $sp, 0, $31
slt $2,$5,$4 # compare p1 and p2
beq $2, $0,.L.1.2.temp #
or $9,$4,$0 # delay slot
b .L.1.1.temp #
or $2,$9,$0 # delay slot, return p1
.L.1.2.temp: # 0x14
or $2,$5,$0 # return p2
.L.1.1.temp: # 0x18
jr $31 #
nop # delay slot
.end leaf
Interfaces Between Assembly Routines and Other Languages
The rules and parameter requirements that exist between assembly language
and other languages are varied and complex. The simplest approach to coding
an interface between an assembly routine and a routine written in a high-level
language is to do the following:
Use the high-level language to write a skeletal version of
the routine that you plan to code in assembly language.
Compile the program using the -S
option, which creates an assembly language (.s) version
of the compiled source file (the -O option, though not
required, reduces the amount of code generated, making the listing easier
to read).
Study the assembly-language listing and then, imitating the
rules and conventions used by the compiler, write your assembly language code.
Using the .s Assembly Language File
The MIPSpro compilers can produce a .s file rather
than a .o file. The file is produced by specifying the
-S option on the command line instead of the -c
option.
The assembly language file that is produced contains exactly the same
set of instructions that would have been produced in the .o
object file, and inputting the .s file to the assembler
produces an object file with the same instructions that the compiler would
have produced. The .s file is a listing of the instructions,
but does not contain all the object information that a .o
file contains. Therefore, a .o file generated by a
.s file will not be exactly the same as one generated directly by
the compiler and they are not guaranteed to work identically (for example,
reorg_common information is lost).
In addition to the program's instructions, the .s
file contains comments indicating the effects of various optimization transformations
that were made by the compiler.
Most of these comments are self-explanatory or contain easily understood
information, while other comments require a detailed knowledge of the compiler's
internal workings. The following information is intended to describe the more
useful, non-obvious, features of the file without getting into the details
of optimization theory. For more detailed information about optimization see
the MIPSpro N32/64 Compiling and Performance
Tuning Guide.
The following subsections describe the different elements of the
.s file.
The file begins with comments that indicate the name of the source file
and the compiler that was used to produce the .s file.
The options that were used by the compiler are also listed. It is often important
to know the target machine that the instructions were intended for; this is
discussed in the following subsections. By default, only a select set of options
are included in the file. More detail can be obtained by including the
-LIST:options flag on the compiler's command line.
One of the first pseudo-instructions in the file is similar to the following
example:
.section .text, 1, 0x00000006, 4, 64
or
.section .text, 1, 0x00000006, 4, 16
This directive is used by the loader to align the start of the program's
instructions at particular byte-address boundaries. The rightmost field is
16 if quad word alignment is required, or is 64
if cache line alignment is needed. The proper number is determined by the
target processor type and the optimization level that was used because some
optimizations require an exact knowledge of the I-Cache placement of each
instruction while others do not benefit from this level of control.
A comment is attached to each label definition (recognized by the colon
(:) following the name). This comment provides the byte
offset of the label's location relative to the start of the .section
.text directive. The first label, which usually corresponds to the
first entry point of the first function, is 0x0.
The remaining labels have addresses that are increased by 4 bytes for
each instruction that is placed between successive labels. These offsets are
the same for both the .s file and the related
.o files, although the loader can choose to place the start of the
program (the 0x0 location) anywhere in the machine's address
space. The start is subject only to the alignment restriction placed on the
.section directive (see “Instruction Alignment”).
This is useful to note when you are using a debugger and trying to correlate
the assembly file to the executed instructions. The machine addresses are
sometimes difficult to translate to the file's relative offsets when only
quad word alignment was requested.
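The 4-bytes-per-instruction rule can be checked against the -n32 listing of the leaf procedure shown earlier in this chapter. The sketch below is a rough line classifier, not an assembler: the function name is hypothetical, and it assumes that every non-directive, non-label line is exactly one 4-byte machine instruction, which holds for compiler output under .set noreorder but not necessarily for hand-written code that uses assembler macros.

```python
def label_offsets(lines):
    """Return {label: byte offset} for a listing that starts at offset 0."""
    offsets = {}
    position = 0
    for line in lines:
        text = line.split("#", 1)[0].strip()   # drop the comment field
        if not text:
            continue
        if text.endswith(":"):
            offsets[text[:-1]] = position      # label definition
        elif text.startswith("."):
            continue                           # directive, emits no code
        else:
            position += 4                      # one machine instruction
    return offsets


LEAF_N32 = """
    .set noreorder
    .ent leaf
    .globl leaf
leaf: # 0x0
    .frame $sp, 0, $31
    slt $2,$5,$4
    beq $2,$0,.L.1.2.temp
    or $9,$4,$0
    b .L.1.1.temp
    or $2,$9,$0
.L.1.2.temp: # 0x14
    or $2,$5,$0
.L.1.1.temp: # 0x18
    jr $31
    nop
    .end leaf
"""

computed = label_offsets(LEAF_N32.splitlines())
for name, offset in computed.items():
    print("%s: 0x%x" % (name, offset))
```

The computed values agree with the 0x0, 0x14, and 0x18 comments that the compiler attached to the three labels.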
The following is an example of this comment:
To help associate the compiler-generated code to the original source
code, the line number and source code are inserted into comment lines that
are interspersed with the assembly instructions. The comments usually appear
ahead of the machine instructions that are generated for the source line. However, various
optimizations may cause instructions to be moved or reordered and it is sometimes
difficult to understand where they appear.
A further difficulty can arise if inline code expansion occurs. In these
cases, the line number (503 in the following example) may
refer to the line of the module that contained the inlined routine, and not
to the original source code module of the compiled program. This can be especially
confusing if the -ipa option was requested, and if several
source code files were intermixed.
To determine the original file that contains a particular source code
line, search for the immediately preceding .loc directive.
This directive contains the line number and an index to a previous
.file directive that identifies the file that the source code was
read from. See Chapter 8, “Pseudo Op-Codes (Directives)” for information about the
.loc directive.
The following is an example of this comment:
# 503 x[k] = q + y[k]*( r*z[k+10] + t*z[k+11] )
Relative Instruction Issue Times
When any level of optimization greater than -O0 is
requested on the command line, comments are added to the right of machine
instructions that indicate the compiler's knowledge of the relative issue
time for the particular instruction. These comments consist of an integer
between square brackets, as shown in the following example:
mul.d $f2,$f2,$f10 # [11]
In this example, the [11] indicates the clock cycle
(relative to the start of the block) in which this instruction will be issued
by the processor.
The assembly files targeted for processors that can only issue a single
instruction in a clock period have unique times for each instruction in the
block, while target processors that can issue multiple instructions may show
that several instructions have the same integer in the issue time comment.
The times for processors that support Out-Of-Order issue of instructions
may sometimes appear unusual because an instruction may be issued before other
instructions that precede it in the block. This is common processor behavior.
The compiler attempts to model the queuing mechanisms contained by the hardware
and it uses knowledge of the details to arrive at meaningful times to place
in these comment fields. The times are accurate to the limit that the machine
is modeled.
Several simplifying assumptions are made to calculate these times, which
make it difficult to estimate the performance of the code by using these comments
alone. The most important point to make is that program flow is not taken
into account. The actual performance of a program is influenced by the path
taken into a particular block of code, which often determines when the inputs
that an instruction needs will be ready. It would be difficult to model an
entire program and take into account all possible paths into a block, so it
is assumed that all inputs computed outside a block are available at the start
of the block, and that all functional units are initially free to accept new
operations.
Even with these restrictions, it is difficult to accurately model the
behavior of load and store instructions. The compiler attempts to recognize
accesses that will be satisfied from a data cache and use an appropriate latency.
Although performance data suggests that most data references are to a cache,
this can be very program-dependent. With the additional complexity introduced
when multiple levels of cache are available, the compiler can never be certain
that it is using the correct memory latency to produce the issue time comments.
Because of these uncertainties, the compiler uses times that match what happens
in the average program.
A limitation on the use of these times is illustrated with the following
program example run on a machine with an R10000 processor:
.Lt.0.224: # 0x508
.loc 1 589 17
# 589 temp -= x[j]*y[j];
ldc1 $f9,24024($3) # [0]
ldc1 $f10,32064($5) # [1]
addiu $2,$2,-1 # [0]
addiu $3,$3,8 # [0]
addiu $5,$5,8 # [1]
bne $2,$0,.Lt.0.224 # [1,1]
nmsub.d $f4,$f4,$f9,$f10 # [4]
With just a glance at the times, you might conclude that a
nmsub instruction will only be issued every 5 clock periods. However,
as long as execution stays within the loop, the processor will prefetch instructions
faster than it can execute them, resulting in an average issue of 1
nmsub instruction every 2 clock periods, limited by the 2 memory
accesses that take 2 clock periods to issue.
There are occasions when the first instruction is not considered to
be part of the block and no instruction issue time is computed for it. This
happens when the block is frequently branched to using a label+4
address specification. The following example code illustrates this:
.Lt.0.274: # 0x9ac
or $3,$9,$0 #
or $6,$10,$0 # [0]
The most frequent transfer to the label is the following instruction:
bne $8,$30,.Lt.0.274+4 # [0,1]
Relative Branch Prediction Times
If the target processor is an Out-Of-Order processor, the cycle when
the hardware will predict the direction of a conditional branch is estimated.
This happens at the time the instruction is first read into the instruction
decode buffer and is independent of the time that the instruction actually
issues.
This time is reported as the first of a pair of integers, in square
brackets, in the comment field of the instruction. The second field is the
issue time. In the preceding example, branch prediction happens in cycle 0,
but the instruction will not issue until cycle 1 (because it has to wait for
an input).
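When scanning a .s file with a script, the two comment forms can be told apart mechanically. The following is a minimal sketch with a hypothetical function name. It assumes the bracketed times always follow a # character and contain one or two non-negative integers, as in the examples in this chapter.

```python
import re

# "# [11]" carries an issue time only; "# [0,1]" carries a branch
# prediction time followed by an issue time.
_TIME_COMMENT = re.compile(r"#\s*\[(\d+)(?:\s*,\s*(\d+))?\]")


def parse_times(line):
    """Return (prediction_time, issue_time) for one assembly line.

    prediction_time is None when only an issue time is present. The
    whole result is None when the line has no bracketed time comment.
    """
    match = _TIME_COMMENT.search(line)
    if match is None:
        return None
    first = int(match.group(1))
    if match.group(2) is None:
        return (None, first)
    return (first, int(match.group(2)))


print(parse_times("bne $8,$30,.Lt.0.274+4 # [0,1]"))  # (0, 1)
print(parse_times("mul.d $f2,$f2,$f10 # [11]"))       # (None, 11)
print(parse_times("or $3,$9,$0 #"))                   # None
```

The last call shows the case described earlier, where the first instruction of a block carries no issue time at all.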
The compiler attempts to compute the inputs to conditional branches as
early as possible so that the branch prediction and the issue times
are identical, although some conditions prevent the compiler from
doing so. The goal is to minimize the number of instructions that are speculatively
executed after the prediction and before the direction of the branch can be
determined with certainty. It is only when the instruction completes execution
that the hardware is certain which branch direction is correct.
If the wrong direction was predicted, all speculatively executed instructions
will need to be aborted, wasting time that could have been devoted to completing
the program.
A nop instruction is a real operation that does not
change the contents of any registers. There are several that could be used,
but the preferred one is sll $0,$0,0, which means “shift
left by 0 bits the contents of register $0 and store the
result into register $0”.
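One reason this form is convenient is that it assembles to a word of all zero bits. The sketch below encodes sll using the standard MIPS register-format field layout (opcode, rs, rt, rd, shift amount, function). The function name is hypothetical, and the field layout comes from the MIPS instruction set definition, not from this chapter.

```python
def encode_sll(rd, rt, sa):
    """Encode "sll rd,rt,sa" as a 32-bit MIPS instruction word.

    Field layout: opcode[31:26] rs[25:21] rt[20:16] rd[15:11]
    sa[10:6] funct[5:0]. For sll, opcode, rs, and funct are all 0.
    """
    for name, value in (("rd", rd), ("rt", rt), ("sa", sa)):
        if not 0 <= value <= 31:
            raise ValueError("%s out of range: %d" % (name, value))
    opcode = 0   # SPECIAL
    funct = 0    # SLL
    return (opcode << 26) | (rt << 16) | (rd << 11) | (sa << 6) | funct


print("0x%08x" % encode_sll(0, 0, 0))   # 0x00000000, the preferred nop
print("0x%08x" % encode_sll(2, 3, 4))   # 0x00031100, sll $2,$3,4
```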
nop instructions usually waste space and should be
deleted by the compiler, but there are situations where they are necessary
for the correctness of the executed code and cases where they can improve
the performance of the executed code. They are most often encountered as a
placeholder for the delay slot of a branch instruction, when no other instruction
can be found. The following code sequence illustrates this:
addiu $5,$5,1 # [0]
bne $5,$30,.Lt.0.460 # [0,1]
nop # [0]
Other than their use in the delay slot of conditional branches,
nop instructions are used to optimize the fetch and decode performance
of processor types that can read, decode and execute multiple instructions
in each clock period. These processors cannot group together instructions
when a cache line boundary occurs between them, resulting in a delay that
can be avoided by inserting one or more nop instructions
ahead of a label.
The optimization that attempts this alignment depends on the processor
type and the optimization levels selected. In the common case, the first block
of each loop is forced to start on a quad word boundary. This is simple and
fast although it sometimes causes nops to be added in the
middle of a cache line, where they are not useful.
For the highest level of optimization, and only for Out-Of-Order issue
processors, closer track is kept of cache line boundaries. This requires that
the start of the module (that is, the address of the first text label) be
aligned on a cache line boundary, increasing the size of the generated executable
but allowing the compiler to avoid unnecessary instructions.
Along with optimally aligning instructions on Out-Of-Order processors,
attention is paid to a timing "hiccup" that can occur if a branch instruction
is separated from its delay slot instruction by a cache line break. The insertion
of a nop before the branch can improve performance slightly.
The following is an example of this. The nop forces the
bne instruction to start in the next cache line, as can be determined
by the address comment in the label field of the next block.
nop # [1]
bne $0,$1,.Lt.0.550 # [3,5]
xori $1,$1,1 # [5]
.BB307.kernel_: # 0x2408
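The address comment on the label makes it possible to work backward and confirm where the cache line boundary falls. The sketch below assumes a 64-byte cache line (the value used for cache line alignment in the .section directive discussed earlier), 4-byte instructions, and that the three instructions shown sit immediately before the label. The names are hypothetical.

```python
CACHE_LINE_BYTES = 64   # assumed cache line size
INSTRUCTION_BYTES = 4


def offset_before(label_offset, instructions_back):
    """Offset of the instruction that many slots before a label."""
    return label_offset - INSTRUCTION_BYTES * instructions_back


def cache_line(offset, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that contains the given offset."""
    return offset // line_bytes


label = 0x2408                    # .BB307.kernel_
xori = offset_before(label, 1)    # delay slot of the branch
bne = offset_before(label, 2)     # the branch itself
nop = offset_before(label, 3)     # the inserted nop

for name, offset in (("nop", nop), ("bne", bne), ("xori", xori)):
    print("%-4s 0x%04x line %d" % (name, offset, cache_line(offset)))
```

Under these assumptions the bne lands at 0x2400, the first word of a cache line, so it and its delay slot instruction share a line, while the nop fills the last word of the previous line.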
Loop Information Comments
Comments are added at the start of loops to indicate the transformations
that were applied to the loop. The meanings of most of these are obvious,
but some need explanation. Comments that start with <swpf>
or <loop>Not unrolled: indicate that software
pipelining failed to optimize the loop. There is usually a reason given, although
the meaning can be obscure and refer to details of software theory.
Comments that look like <swps> xx cycles per iteration
may not contain an accurate count of the number of cycles for Out-Of-Order
processors. This is because the exact cycle times are determined much later
in the compiler process than when this cycle count is estimated and the comment
is constructed. These inaccuracies also affect the numbers that precede
% of peak comments.
Similarly, for Out-Of-Order processors, the cycle count in
comments similar to <sched> Loop schedule length: xxx cycles (ignoring
nested loops) is sometimes wrong.
A block is a sequence of instructions between 2 labels. Blocks are usually
identified in the assembly file by a comment, placed between the starting label and
the first instruction, that contains BB:xxx.
The block number that follows the BB: is used to identify
each unique block of the program. Comments that start with <freq>
BB:xxx frequency = yyy.yy indicate how often the compiler believes
the block is executed for each invocation of the function where the block
is located.
The comment is followed by (heuristic) or
(feedback) to indicate how that average was arrived at. Because
many optimizations utilize this information, incorrect information can result
in sub-optimal compiler output. It is important that the feedback data be
generated by tests that truly represent the expected behavior of the final
program so that accurate decisions can be made by the compiler.
Blocks that end with conditional branches also contain comments similar
to <freq> BB:xxx => BB:yyy probability = 0.zzzzz. These
indicate the compiler's estimation for the direction of each possible branch.
Again, it is important for optimal performance that feedback be generated
by test cases that are representative of the actual workload.