Origin 2000 and Onyx2 Performance Tuning and Optimization Guide
(document number: 007-3430-003 / published: 2001-08-02)
- 64-bit address space
- Selecting an ABI and ISA
- adi2 example program
- Program adi2
- aliasing models
- Understanding Aliasing Models
- Amdahl's law
- Understanding Parallel Speedup and Amdahl's Law
- awk script for
- Awk Script for Amdahl's Law Estimation
- execution time given n and p
- Predicting Execution Time with n CPUs
- parallel fraction p
- Understanding Amdahl's Law
- parallel fraction p given speedup( n )
- Calculating the Parallel Fraction of a Program
- speedup(n ) given p
- Understanding Amdahl's Law
- superlinear speedup
- Understanding Superlinear Speedup
- application binary interface (ABI)
- Selecting an ABI and ISA
- 64-bit
- 64-Bit ABI
- new 32-bit
- New 32-Bit ABI
- old 32-bit
- Old 32-Bit ABI
- arithmetic error
- Understanding Arithmetic Standards
- array padding
- Using Array Padding to Prevent Thrashing
- Diagnosing and Eliminating Cache Thrashing
- Using Array Padding
- auto-parallelizing
- Compiling Serial Code for Parallel Execution
- Bentley, Jon
- Bentley's Rules Updated
- cache
- and hardware event counter
- Primary Cache Use
- blocking
- Understanding Cache Blocking
- Controlling Cache Blocking
- cache miss
- Understanding Level-One and Level-Two Cache Use
- coherent
- Understanding Cache Coherency
- Cache Coherency Events
- compiler's model of
- Adjusting the Optimizer's Cache Model
- contention in
- Diagnosing Cache Problems
- correcting
- Correcting Cache Contention in General
- event 31 reveals
- Diagnosing Cache Problems
- Identifying False Sharing
- diagnosing problems in
- Identifying Cache Problems with Perfex and SpeedShop
- Diagnosing Cache Problems
- directory-based
- Memory Overhead Bits
- Understanding Directory-Based Coherency
- false sharing of
- Identifying False Sharing
- L1
- Level-1 Cache
- Understanding Level-One and Level-Two Cache Use
- Primary Cache Use
- L2
- Level-Two Cache
- Understanding Level-One and Level-Two Cache Use
- Secondary Cache Use
- line size
- Understanding Level-One and Level-Two Cache Use
- data structure blocking for
- Data Structure Augmentation
- on-chip
- Cache Architecture
- operation of
- Understanding Cache Coherency
- Understanding Directory-Based Coherency
- Understanding Level-One and Level-Two Cache Use
- principles of use
- Principles of Good Cache Use
- proper use of
- Principles of Good Cache Use
- Using Other Cache Techniques
- array padding
- Using Array Padding
- blocking data for
- Understanding Cache Blocking
- Controlling Cache Blocking
- grouping related data for
- Grouping Data Used at the Same Time
- loop fusion for
- Understanding Loop Fusion
- parallel execution issues
- Diagnosing Cache Problems
- stride-one access for
- Using Stride-One Access
- transposition for
- Understanding Transpositions
- set-associative
- Understanding Level-One and Level-Two Cache Use
- thrashing in
- Understanding Cache Thrashing
- snoopy
- Coherency Methods
- thrashing
- Understanding Cache Thrashing
- Diagnosing and Eliminating Cache Thrashing
- cache coherence
- and hardware event counter
- Cache Coherency Events
- cache coherency
- Understanding Cache Coherency
- cache line
- Understanding Level-One and Level-Two Cache Use
- call hierarchy profile
- Profiling the Call Hierarchy
- compiler directive
- See directive
- compiler feedback file
- Creating a Compiler Feedback File
- compiler flag
- See compiler option
- compiler option
- -32
- Old 32-Bit ABI
- -64
- 64-Bit ABI
- recommended
- Understanding Compiler Options
- -apo
- Compiling an Auto-Parallel Version of a Program
- -check_bounds
- Computational Differences
- Using Array Padding
- -clist
- Reading the Transformation File
- default
- Understanding Compiler Options
- -fb
- Creating a Compiler Feedback File
- Passing a Feedback File
- -flist
- Reading the Transformation File
- for cache model
- Adjusting the Optimizer's Cache Model
- IEEE_arithmetic
- Exploit Algebraic Identities
- -INLINE
- Using Manual Inlining
- Using Automatic Inlining
- -IPA
- Requesting IPA
- forcedepth
- Using Automatic Inlining
- inline
- Using Automatic Inlining
- space
- Using Automatic Inlining
- -LNO
- Using Loop Nest Optimization
- blocking
- Adjusting Cache Blocking Block Sizes
- fission
- Controlling Fission and Fusion
- gather_scatter
- Understanding Gather-Scatter
- ignore_pragmas
- Requesting LNO
- interchange=off
- Using Loop Interchange
- outer_unroll
- Controlling Loop Unrolling
- prefetch
- Controlling Prefetching
- vintr
- Vector Intrinsics
- -mips3
- New 32-Bit ABI
- -mips4
- New 32-Bit ABI
- Recommended Starting Options
- -n32
- New 32-Bit ABI
- Recommended Starting Options
- -On
- Setting Optimization Level with -On
- -O2
- Recommended Starting Options
- -O3
- for SWP
- Enabling Software Pipelining with -O3
- -Ofast
- versus -O3
- Compile -O3 or -Ofast for Critical Modules
- -Olimit
- Using Automatic Inlining
- -OPT
- alias
- Understanding Aliasing Models
- cray_ivdep
- Breaking Other Dependencies
- IEEE_arithmetic
- Recommended Starting Options
- IEEE Conformance
- IEEE_NaN_inf
- IEEE Conformance
- liberal_ivdep
- Breaking Other Dependencies
- reorg_common
- Using Array Padding
- roundoff
- Roundoff Control
- -r10000
- Standard Math Library
- Setting Target System with -TARG
- -r5000
- Standard Math Library
- Setting Target System with -TARG
- -r8000
- Standard Math Library
- Setting Target System with -TARG
- roundoff
- Exploit Algebraic Identities
- -S
- Reading Software Pipelining Messages
- -static
- Uninitialized Variables
- -TARG
- Setting Target System with -TARG
- -TENV
- Profiling Exception Frequency
- X
- Controlling the Level of Speculation
- copying
- to reduce TLB thrashing
- Using Copying to Circumvent TLB Thrashing
- correctness
- Getting the Right Answers
- CPU
- See MIPS CPU
- CrayLink
- Hub and NUMAlink
- data distribution
- Using Data Distribution Directives
- and dplace
- Using _DSM_VERBOSE
- directives for
- Understanding Directive Syntax
- Distribute directive
- Using Distribute for Loop Parallelization
- mapping types
- Understanding Distribution Mapping Options
- ONTO clause
- Understanding the ONTO Clause
- page placement
- Using the Page_Place Directive for Custom Mappings
- redistribution
- Understanding the Redistribution Directives
- reshaped
- Using Reshaped Distribution Directives
- restrictions
- Restrictions of Reshaped Distribution
- data placement
- Scalability and Data Placement
- for libmp programs
- Tuning Data Placement for MP Library Programs
- modifying code for
- Modifying the Code to Tune Data Placement
- DAXPY
- Understanding Software Pipelining
- and alias model
- Understanding Aliasing Models
- loop fusion of
- Understanding Loop Fusion
- with indirection
- Breaking Other Dependencies
- debugging
- possible with -O2
- Start with -O2 for All Modules
- use -O0 for
- Use -O0 for Debugging
- dependency
- Breaking Other Dependencies
- directive
- blocking size
- Adjusting Cache Blocking Block Sizes
- for data distribution
- Fortran Source with Directives
- Using Data Distribution Directives
- Distribute
- Using Distribute for Loop Parallelization
- page place
- Using the Page_Place Directive for Custom Mappings
- syntax
- Understanding Directive Syntax
- for loop interchange
- Using Loop Interchange
- for loop nest optimizer
- Requesting LNO
- for loop unrolling
- Controlling Loop Unrolling
- for parallel execution
- Fortran Source with Directives
- affinity clause
- Using Parallel Do with Distributed Data
- Understanding the AFFINITY Clause for Data
- Understanding the AFFINITY Clause for Threads
- nest clause
- Understanding the NEST Clause
- for prefetching
- Controlling Prefetching
- ivdep
- Breaking Other Dependencies
- OpenMP
- Fortran Source with Directives
- dlook
- Applying dlook
- dplace
- Non-MP Library Programs and Dplace
- disables data distribution directives
- Using _DSM_VERBOSE
- enable migration with
- Enabling Page Migration
- library interface to
- Using the dplace Library for Dynamic Placement
- not for use with libmp
- Non-MP Library Programs and Dplace
- placement file
- Placement File Syntax
- distribute statement
- Assigning Threads to Memories
- memories statement
- Using the memories Statement
- threads statement
- Using the threads Statement
- set page size with
- Changing the Page Size
- Using Larger Page Sizes to Reduce TLB Misses
- specify topology with
- Specifying the Topology
- with MPI
- Using dplace with MPI 3.1
- dprof
- Applying dprof
- dynamic page migration
- Dynamic Page Migration
- Enabling Page Migration
- Trying Dynamic Page Migration
- administration
- Trying Dynamic Page Migration
- enabling
- Trying Dynamic Page Migration
- environment variable
- _DSM_MIGRATION
- Trying Dynamic Page Migration
- Experimenting with Migration Levels
- _DSM_PPM
- Advanced Options
- _DSM_ROUND_ROBIN
- Trying Round-Robin Placement
- _DSM_VERBOSE
- Using _DSM_VERBOSE
- for SpeedShop
- Identifying False Sharing
- in dplace placement file
- Using Environment Variables in Placement Files
- MP_SET_NUMTHREADS
- Controlling a Parallelized Program at Run Time
- MPI_DSM_OFF
- Using dplace with MPI 3.1
- PAGESIZE_*
- Using Larger Page Sizes to Reduce TLB Misses
- SGI_ABI
- Specifying the ABI
- SpeedShop use of
- Sampling Through Other Hardware Counters
- TRAP_FPE
- Understanding Treatment of Underflow Exceptions
- event counter
- See hardware event counter
- exception
- event counter overflow
- R10000 Counter Event Types
- from speculative execution
- Permitting Speculative Execution
- handling
- Using Exception Profiling
- profiling occurrence of
- Using Exception Profiling
- TLB miss
- Understanding TLB and Virtual Memory Use
- underflow
- Understanding Treatment of Underflow Exceptions
- exception profile
- Using Exception Profiling
- false sharing
- Memory Contention
- Identifying False Sharing
- fast Fourier transform (FFT)
- Understanding Transpositions
- data placement for
- First-Touch Placement with Multiple Data Distributions
- feedback file
- Creating a Compiler Feedback File
- use of
- Passing a Feedback File
- FFT
- See fast Fourier transform (FFT)
- first-touch placement
- Using First-Touch Placement
- Programming For First-Touch Placement
- floating-point exception
- See exception
- floating-point status register (FSR)
- Understanding Treatment of Underflow Exceptions
- graduated instruction
- Graduated Instructions
- hardware event counter
- R10000 Counter Event Types
- branch instructions
- Branching Instructions
- cache coherency
- Cache Coherency Events
- cache use
- Primary Cache Use
- clock cycles
- Clock Cycles
- event 21
- Displaying Operation Counts
- Finding and Removing Memory Access Problems
- event 31
- Sampling Through Other Hardware Counters
- Finding and Removing Memory Access Problems
- Diagnosing Cache Problems
- Identifying False Sharing
- event 4
- Finding and Removing Memory Access Problems
- instruction counts
- Instructions Issued and Done
- lock instructions
- Lock-Handling Instructions
- profiling from
- Sampling through Hardware Event Counters
- Sampling Through Other Hardware Counters
- TLB miss
- Virtual Memory Use
- hardware graph
- Indicating Resource Affinity
- hardware trap
- See exception, page fault, TLB
- hub
- SN0 Organization
- Hub and NUMAlink
- cache coherency support
- Understanding Directory-Based Coherency
- hypercube
- SN0 Organization
- SN0 Memory Distribution
- ideal time profile
- Using Ideal Time Profiling
- IEEE 754
- Understanding Arithmetic Standards
- versus optimization
- IEEE Conformance
- IEEE arithmetic
- Understanding Arithmetic Standards
- inlining
- Understanding Inlining
- automatic versus manual
- Understanding Inlining
- manual with -INLINE
- Using Manual Inlining
- instruction scheduling
- Setting Target System with -TARG
- Understanding Software Pipelining
- instruction set architecture (ISA)
- MIPS I
- Old 32-Bit ABI
- MIPS II
- Old 32-Bit ABI
- MIPS III
- Old 32-Bit ABI
- New 32-Bit ABI
- MIPS IV
- MIPS IV Instruction Set Architecture
- New 32-Bit ABI
- interprocedural analysis (IPA)
- Exploiting Interprocedural Analysis
- applied during link step
- Compiling and Linking with IPA
- features of
- Exploiting Interprocedural Analysis
- requesting
- Requesting IPA
- -IPA
- See compiler option, -IPA
- IRIX
- memory management in
- SN0 Memory Management
- porting to
- Dealing with Porting Issues
- lazy evaluation
- Lazy Evaluation
- ld
- performs IPA
- Compiling and Linking with IPA
- library
- BLAS
- CHALLENGEcomplib Library
- SCSL Library
- CHALLENGEcomplib
- Exploiting Existing Tuned Code
- CHALLENGEcomplib Library
- EISPACK
- CHALLENGEcomplib Library
- LAPACK
- CHALLENGEcomplib Library
- SCSL Library
- libc
- Standard Math Library
- libfastm
- Exploiting Existing Tuned Code
- libfastm Library
- Recommended Starting Options
- libfpe
- Using Exception Profiling
- Understanding Treatment of Underflow Exceptions
- libmp
- Controlling a Parallelized Program at Run Time
- conflicts with dplace
- Non-MP Library Programs and Dplace
- data placement with
- Tuning Data Placement for MP Library Programs
- page migration with
- Trying Dynamic Page Migration
- Experimenting with Migration Levels
- page size control
- Using Larger Page Sizes to Reduce TLB Misses
- round-robin placement with
- Trying Round-Robin Placement
- LINPACK
- CHALLENGEcomplib Library
- SCSL
- Exploiting Existing Tuned Code
- SCSL Library
- library routine
- bzero
- Initializing to Zero
- calloc
- Initializing to Zero
- dplace_file
- Using the dplace Library for Dynamic Placement
- dplace_line
- Using the dplace Library for Dynamic Placement
- dsm_home_threadnum
- Using Dynamic Placement Information
- handle_sigfpes
- Using Exception Profiling
- sasum
- Using Reshaped Distribution Directives
- sscal
- Using Reshaped Distribution Directives
- -LNO
- See loop nest optimizer (LNO) and compiler option -LNO
- loop fission
- Using Loop Fission
- loop fusion
- by LNO
- Using Loop Fusion
- manual
- Understanding Loop Fusion
- loop interchange
- Using Loop Interchange
- disabling
- Using Loop Interchange
- loop nest optimizer (LNO)
- Using Loop Nest Optimization
- cache blocking by
- Controlling Cache Blocking
- controlling
- Adjusting Cache Blocking Block Sizes
- disable loop transformation
- Requesting LNO
- gather-scatter by
- Understanding Gather-Scatter
- loop fission by
- Using Loop Fission
- loop fusion by
- Using Loop Fusion
- loop interchange
- Using Loop Interchange
- loop unrolling
- Using Outer Loop Unrolling
- prefetching by
- Prefetch Overhead and Unrolling
- requesting
- Requesting LNO
- transformed source file
- Reading the Transformation File
- vector intrinsic transformation
- Vector Intrinsics
- loop peeling
- Using Loop Fusion
- loop unrolling
- and roundoff
- Roundoff Control
- and SWP
- Using Outer Loop Unrolling
- by loop nest optimizer (LNO)
- Using Outer Loop Unrolling
- with loop interchange
- Combining Loop Interchange and Loop Unrolling
- makefile
- example
- Basic Makefile
- use of
- Using a Makefile
- math libraries
- Exploiting Existing Tuned Code
- vector intrinsics
- Standard Math Library
- matrix multiply
- loop unrolling of
- Using Outer Loop Unrolling
- memory use in
- Understanding Cache Blocking
- performance of
- Understanding Cache Blocking
- matrix multiply
- cache blocking of
- Controlling Cache Blocking
- memory
- 64-bit addressing
- Selecting an ABI and ISA
- administrator setup
- Using Larger Page Sizes to Reduce TLB Misses
- Trying Dynamic Page Migration
- bus-based
- Memory for Multiprocessors
- Scalability in Multiprocessors
- cache directory bits
- Memory Overhead Bits
- contention for
- Memory Contention
- distributed versus shared
- Shared Memory Multiprocessing
- error correction bits
- Memory Overhead Bits
- hierarchy
- Understanding the Levels of the Memory Hierarchy
- latency of
- SN0 Latencies and Bandwidths
- Degrees of Latency
- locality management
- Memory Locality Management
- management by IRIX
- SN0 Memory Management
- page fault
- Understanding TLB and Virtual Memory Use
- paged virtual
- Understanding TLB and Virtual Memory Use
- parallel execution tuning
- Finding and Removing Memory Access Problems
- physical address display
- Page Address Routine va2pa()
- placement
- first-touch
- Using First-Touch Placement
- Programming For First-Touch Placement
- round-robin
- Using Round-Robin Placement
- Trying Round-Robin Placement
- prefetching
- Understanding Prefetching
- Using Prefetching
- stride
- Using Stride-One Access
- virtual
- Understanding Level-One and Level-Two Cache Use
- See also page
- memory locality domain (MLD)
- Memory Locality Management
- Memory Locality Domain Use
- memory locality domain set (MLDS)
- Memory Locality Domain Use
- Message-Passing Interface (MPI)
- Message-Passing Models MPI and PVM
- dplace with
- Using dplace with MPI 3.1
- perfex with
- Using perfex with MPI
- MIPS CPU
- architecture of
- Understanding MIPS R10000 Architecture
- Understanding Prefetching
- event counters in
- R10000 Counter Event Types
- issued versus graduated instruction
- Graduated Instructions
- off-chip cache
- Level-Two Cache
- on-chip cache
- Cache Architecture
- out-of-order execution
- Executing Out of Order
- R10000
- speculative execution
- Hardware Speculative Execution
- underflow control
- Understanding Treatment of Underflow Exceptions
- R4000
- Specifying the ABI
- R8000
- Specifying the ABI
- Dealing with Software Pipelining Failures
- Software Speculative Execution
- underflow ignored on
- Understanding Treatment of Underflow Exceptions
- specify to compiler
- Standard Math Library
- speculative execution
- Speculative Execution
- superscalar features
- Superscalar CPU Features
- See also hardware event counter
- MIPS IV ISA
- MIPS IV Instruction Set Architecture
- and IEEE 754
- IEEE Conformance
- prefetch in
- Understanding Prefetching
- MP library
- See library, libmp
- MPI
- See Message-Passing Interface (MPI)
- mpirun
- with perfex
- Using perfex with MPI
- node
- SN0 Organization
- SN0 Node Board
- CPU in
- CPUs and Memory
- nonuniform memory access (NUMA)
- SN0 Memory Distribution
- Dealing With Nonuniform Access Time
- and parallel program
- Parallel Programs under NUMA
- and single-threaded program
- Single-Threaded Programs under NUMA
- numeric error
- Understanding Arithmetic Standards
- OpenMP directives
- Fortran Source with Directives
- C pragmas for
- C and C++ Source with Pragmas
- -OPT
- See compiler option, -OPT
- optimization level
- Setting Optimization Level with -On
- out of order execution
- Executing Out of Order
- packing
- Packing
- page
- Understanding TLB and Virtual Memory Use
- migration of
- Dynamic Page Migration
- Enabling Page Migration
- Trying Dynamic Page Migration
- size of
- Dynamic Page Migration
- Policy Modules
- Single-Threaded Programs under NUMA
- Using Larger Page Sizes to Reduce TLB Misses
- set with dplace
- Changing the Page Size
- valid sizes
- Using Larger Page Sizes to Reduce TLB Misses
- page fault
- Understanding TLB and Virtual Memory Use
- parallel execution
- affinity clause
- Using Parallel Do with Distributed Data
- Understanding the AFFINITY Clause for Data
- Understanding the AFFINITY Clause for Threads
- Amdahl's law
- Understanding Parallel Speedup and Amdahl's Law
- auto-parallelizing
- Compiling Serial Code for Parallel Execution
- data placement for
- Scalability and Data Placement
- memory access tuning for
- Finding and Removing Memory Access Problems
- nest clause
- Understanding the NEST Clause
- parallel fraction p
- Understanding Amdahl's Law
- Ensuring That the Program Is Properly Parallelized
- programming models for
- Explicit Models of Parallel Computation
- scalability of
- Scalability in Multiprocessors
- Scalability and Data Placement
- topology
- Specifying the Topology
- tuning SN0 for
- Tuning Parallel Code for SN0
- perfex
- Analyzing Performance with perfex
- absolute event counts
- Taking Absolute Counts of One or Two Events
- analytic output
- Getting Analytic Output with the -y Option
- awk script to parse
- Awk Script for Perfex Output
- cache use analysis
- Identifying Cache Problems with Perfex and SpeedShop
- library interface
- Collecting Data over Part of a Run
- statistical counts
- Taking Statistical Counts of All Events
- performance
- aphorisms about
- Bentley's Rules Updated
- of matrix multiply
- Understanding Cache Blocking
- of parallel program
- Parallel Programs under NUMA
- of single-threaded program
- Single-Threaded Programs under NUMA
- performance techniques
- algebraic identities
- Exploit Algebraic Identities
- array padding
- Using Array Padding to Prevent Thrashing
- Using Array Padding
- avoiding tests
- Combining Tests
- cache blocking
- Understanding Cache Blocking
- Controlling Cache Blocking
- caching
- Principles of Good Cache Use
- code motion
- Code Motion Out of Loops
- combining related functions
- Combine Paired Computation
- common block padding
- Exploiting Interprocedural Analysis
- common subexpressions
- Eliminate Common Subexpressions
- constant propagation
- Exploiting Interprocedural Analysis
- copying
- Using Copying to Circumvent TLB Thrashing
- coroutines
- Use Coroutines
- data structure augmentation
- Data Structure Augmentation
- dead function elimination
- Exploiting Interprocedural Analysis
- dead variable elimination
- Exploiting Interprocedural Analysis
- gather-scatter
- Understanding Gather-Scatter
- inlining
- Exploiting Interprocedural Analysis
- Collapse Procedure Hierarchies
- interpreters
- Interpreters
- lazy evaluation
- Lazy Evaluation
- loop fission
- Using Loop Fission
- loop fusion
- Understanding Loop Fusion
- Using Loop Fusion
- Loop Fusion
- loop interchange
- Using Loop Interchange
- loop unrolling
- Using Outer Loop Unrolling
- Loop Unrolling
- packing
- Packing
- precomputation
- Store Precomputed Results
- Precompute Logical Functions
- prefetching
- Using Prefetching
- recursion elimination
- Transform Recursive Procedures
- short-circuiting
- Short-Circuit Monotone Functions
- software pipelining
- Understanding Software Pipelining
- speculative execution
- Permitting Speculative Execution
- transposition
- Understanding Transpositions
- policy module (PM)
- Memory Locality Management
- Policy Modules
- Parallel Virtual Machine (PVM)
- Message-Passing Models MPI and PVM
- POSIX threads
- C Source Using POSIX Threads
- pragma
- See directive
- precomputation
- Store Precomputed Results
- prefetching
- Understanding Prefetching
- Using Prefetching
- controlling
- Controlling Prefetching
- manual
- Using Manual Prefetching
- overhead of
- Prefetch Overhead and Unrolling
- pseudo
- Using Pseudo-Prefetching
- prof
- default report
- Displaying Profile Reports from Sampling
- feedback file
- Creating a Compiler Feedback File
- ideal time report
- Default Ideal Time Profile
- line numbers off with opt
- Including Line-Level Detail
- option -archinfo
- Displaying Operation Counts
- option -butterfly
- Displaying Ideal Time Call Hierarchy
- option -feedback
- Creating a Compiler Feedback File
- Passing a Feedback File
- option -heavy
- Displaying Profile Reports from Sampling
- Including Line-Level Detail
- option -lines
- Including Line-Level Detail
- simplifying report
- Removing Clutter from the Report
- profiling
- address space usage
- Using Address Space Profiling
- cache usage
- Identifying Cache Problems with Perfex and SpeedShop
- call hierarchy
- Profiling the Call Hierarchy
- ideal time for
- Using Ideal Time Profiling
- Identifying Cache Problems with Perfex and SpeedShop
- opcode counts
- Displaying Operation Counts
- sampling for
- Understanding Sample Time Bases
- Identifying Cache Problems with Perfex and SpeedShop
- tools for
- Profiling Tools
- program correctness
- Getting the Right Answers
- R4000
- See MIPS CPU
- R8000
- See MIPS CPU
- R10000
- See MIPS CPU
- roundoff
- Roundoff Control
- round-robin placement
- Using Round-Robin Placement
- Trying Round-Robin Placement
- scalability
- Scalability in Multiprocessors
- and bus architecture
- Scalability in Multiprocessors
- and data placement
- Scalability and Data Placement
- and shared memory
- Scalability and Shared, Distributed Memory
- smake
- Using a Makefile
- SN0
- CrayLink
- Hub and NUMAlink
- hub
- SN0 Organization
- Hub and NUMAlink
- Input/Output
- SN0 Input/Output
- latencies
- SN0 Latencies and Bandwidths
- node
- SN0 Organization
- SN0 Node Board
- router
- SN0 Organization
- XIO
- SN0 Organization
- XIO Connection
- SN0 architecture
- Understanding SN0 Architecture
- building blocks of
- SN0 Organization
- hypercube
- SN0 Organization
- SN0 Memory Distribution
- nonuniform memory access (NUMA)
- SN0 Memory Distribution
- snoopy cache
- Coherency Methods
- software pipelining (SWP)
- Exploiting Software Pipelining
- compiler report in
- script to extract
- Software Pipeline Script swplist
- compiler report in .s
- Reading Software Pipelining Messages
- Using Outer Loop Unrolling
- dereferenced pointer defeats
- Improving C Loops
- effect of alias model
- Understanding Aliasing Models
- enable with -O3
- Enabling Software Pipelining with -O3
- failure cause
- Dealing with Software Pipelining Failures
- global variables defeat
- Improving C Loops
- loop unrolling with
- Using Outer Loop Unrolling
- of DAXPY loop
- Pipelining the DAXPY Loop
- speculative execution
- Speculative Execution
- Permitting Speculative Execution
- hardware driven
- Hardware Speculative Execution
- software-driven
- Software Speculative Execution
- speedshop
- Using SpeedShop
- sample time bases
- Understanding Sample Time Bases
- See also prof, ssrun
- ssrun
- exception trace
- Profiling Exception Frequency
- experiment types
- Understanding Sample Time Bases
- ideal time trace
- Capturing an Ideal Time Trace
- Passing a Feedback File
- output filename format
- Performing ssrun Experiments
- shell script to run
- Shell Script ssruno
- usertime experiment
- Displaying Usertime Call Hierarchy
- using
- Performing ssrun Experiments
- stride
- Using Stride-One Access
- superlinear speedup
- Understanding Superlinear Speedup
- superscalar
- Superscalar CPU Features
- -SWP
- See compiler option, -SWP
- swplist shell script
- Reading Software Pipelining Messages
- system routine
- mmap
- C and C++ Source Using UNIX Processes
- Initializing to Zero
- sproc
- C and C++ Source Using UNIX Processes
- sysmp
- Advanced Options
- syssgi
- Using Dynamic Placement Information
- thread
- C Source Using POSIX Threads
- TLB
- See translate lookaside buffer (TLB)
- translate lookaside buffer (TLB)
- Understanding TLB and Virtual Memory Use
- miss
- Understanding TLB and Virtual Memory Use
- hardware counter
- Virtual Memory Use
- thrashing elimination
- Diagnosing and Eliminating TLB Thrashing
- copying
- Using Copying to Circumvent TLB Thrashing
- larger page size
- Using Larger Page Sizes to Reduce TLB Misses
- transposition
- Understanding Transpositions
- trap
- See exception
- uninitialized variable, avoiding
- Uninitialized Variables
- vector intrinsic function
- Standard Math Library
- and LNO
- Vector Intrinsics
- virtual memory
- Understanding TLB and Virtual Memory Use
- XIO
- SN0 Organization
- XIO Connection
- zero-fill
- Initializing to Zero
Copyright © 1993-2007 SGI, Inc. All rights reserved.