SCSL User's Guide
(document number: 007-4325-001 / published: 2003-12-30)
This manual describes the SGI Scientific Computer Software Library,
which runs on SGI IRIX and Linux systems. The information in this manual
supplements the man pages provided with SCSL and provides details about
the implementation and usage of these library routines.
SCSL contains the following groups of routines:
Vector-vector linear algebra subprograms (Level 1 BLAS
routines).
Matrix-vector linear algebra subprograms (Level 2 BLAS
routines).
Matrix-matrix linear algebra subprograms (Level 3 BLAS
routines).
LAPACK routines for the solution of dense linear systems
of equations, linear least-squares problems, eigenvalue problems and singular
value decomposition.
Direct linear solvers for real and complex sparse systems
with symmetric non-zero structure, and iterative solvers for real sparse
systems with arbitrary structure.
Signal processing routines, which include Fast Fourier
Transform (FFT) routines, convolution routines, and correlation routines.
64-bit thread-safe parallel random number generators.
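To make the three BLAS levels in the list above concrete, the following is a minimal sketch of one representative operation per level, written as plain reference loops. The names ref_axpy, ref_gemv, and ref_gemm are hypothetical and are not SCSL entry points; the column-major storage is an assumption borrowed from Fortran conventions. An optimized library would compute the same results with tuned, possibly parallel, code.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y := alpha*x + y, O(n) work. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y := A*x + y, where A is m-by-n, O(m*n) work.
   A is assumed column-major with leading dimension m. */
void ref_gemv(size_t m, size_t n, const double *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += a[i + j * m] * x[j];
}

/* Level 3 (matrix-matrix): C := A*B + C, where A is m-by-k, B is k-by-n,
   and C is m-by-n, all column-major. O(m*n*k) work. */
void ref_gemm(size_t m, size_t n, size_t k,
              const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < n; j++)
        for (size_t p = 0; p < k; p++)
            for (size_t i = 0; i < m; i++)
                c[i + j * m] += a[i + p * m] * b[p + j * k];
}
```

The work per data element grows from Level 1 to Level 3, which is one reason the higher levels tend to benefit most from optimization and multiprocessing.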
The SCSL routines are loaded by using the -lscs
option or the -lscs_mp option on the compiler command
line. The -lscs_mp option directs the linker to use
the multiprocessor version of the library.
When linking with SCSL, the default integer size is 4 bytes (32
bits). Another version of SCSL is available in which integers are 8 bytes
(64 bits). This version gives users access to larger memory sizes.
It can be loaded by using the -lscs_i8 option or the
-lscs_i8_mp option. A program can use only one of the two versions;
4-byte integer and 8-byte integer library calls cannot be
mixed.
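The arithmetic behind the choice of integer size can be sketched as follows, assuming that element counts or offsets are held in the library's integer type (the guide does not spell this out). The helper name needs_8_byte_integers is hypothetical. It reports whether an n-by-n matrix has more elements than a signed 4-byte integer can count.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an n-by-n matrix has more elements than a signed 4-byte
   integer can represent (2^31 - 1), and 0 otherwise. */
int needs_8_byte_integers(int64_t n)
{
    assert(n >= 0 && n <= INT32_MAX);
    int64_t elements = n * n;   /* cannot overflow: n*n < 2^62 */
    return elements > INT32_MAX;
}
```

Under this assumption, a square matrix of order 46340 still fits within 4-byte counts, while order 46341 does not.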
Many SCSL routines are multitasked or multithreaded; this means
that a program that calls a multitasked routine will run in parallel mode
and take advantage of multiple processors whenever possible, even if the
program has not specifically requested multitasking. If a significant
percentage of time is spent in the routine, this feature can significantly
reduce wall-clock time.
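How much the wall-clock time drops depends on the fraction of time spent in the parallel routine. A common estimate is Amdahl's law, which is not stated in this guide and is used here as an idealized assumption (perfect scaling, no overhead). The function name amdahl_speedup is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: predicted speedup when a fraction f of the run time
   (0 <= f <= 1) is spread perfectly over p processors and the remaining
   1 - f stays serial. */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return 1.0 / ((1.0 - f) + f / (double)p);
}
```

For example, if 90% of the time is spent in a multitasked routine and 8 processors are available, the estimate is a speedup of about 4.7; if only 50% is spent there, 2 processors give about 1.33.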
Note that most LAPACK routines do not perform multiprocessing themselves,
but almost all LAPACK routines call Level 2 BLAS and Level 3 BLAS routines
that do.
This manual includes the following sections:
Parallel Processing Issues
Parallel processing is a method of splitting a computational task
into subtasks, and then simultaneously performing the subtasks. In many
cases, the use of specialized libraries, such as SCSL, is a key component
of parallel processing.
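As an illustration of splitting a task into subtasks, the sketch below computes a dot product with an OpenMP reduction. OpenMP is an assumption made for this example; the guide does not say how SCSL itself is parallelized. The function name dot is hypothetical. When the code is compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product split into subtasks: each thread sums a slice of the index
   range, and the reduction clause combines the partial sums. */
double dot(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The caller sees an ordinary function call; whether the work runs on one processor or several is decided inside the routine, which is the same model the multitasked library routines follow.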
Parallel processing can eliminate idle CPU time
because the workload is divided among all CPUs; therefore, the amount
of work performed per unit time (the throughput)
increases. However, parallel processing also introduces some overhead
into program execution. In some cases, you may be able to reduce wall-clock
time, but at the cost of extra CPU time which increases because more machine
resources are used.
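The trade-off between wall-clock time and CPU time can be sketched with a toy cost model. The model is an assumption for illustration, not measured SCSL behavior: serial work t is divided evenly over p processors, and each processor adds a fixed overhead h. The names run_cost and parallel_cost are hypothetical.

```c
#include <assert.h>

typedef struct {
    double wall;  /* elapsed (wall-clock) time */
    double cpu;   /* total CPU time summed over all processors */
} run_cost;

/* Toy model: each of p processors does t/p useful work plus overhead h. */
run_cost parallel_cost(double t, int p, double h)
{
    assert(t >= 0.0 && h >= 0.0 && p >= 1);
    run_cost c;
    c.wall = t / p + h;   /* processors run side by side */
    c.cpu  = t + p * h;   /* every processor pays the overhead */
    return c;
}
```

With 100 seconds of work, 4 processors, and 1 second of overhead per processor, the model gives 26 seconds of wall-clock time but 104 seconds of CPU time, compared with 100 seconds of each for the serial run.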
By using parallel processing,
you can alleviate some of the following common problems:
Maximum-memory jobs: if the memory is occupied by a few
large-memory jobs, one or more of the CPUs might be idle even though there
are other jobs to run.
Dedicated machine: if the computer is running a single
job, then all other CPUs are idle.
Light workload: if the number of jobs waiting for a CPU
is less than the total number of CPUs, then one or more of the CPUs become
idle.
With parallel processing, the additional CPUs reduce the wall-clock
time instead of sitting idle. Even when very little idle time exists,
using additional CPUs can still lead to benefits.
Parallel processing introduces some overhead into
program execution. This subsection discusses some of the common types
of overhead introduced by parallel processing:
Multitasked programs require more memory than unitasked
programs: they can contain more code and more temporary variables, and
they can require additional stack space.
Multitasked jobs can be swapped more often, and remain
swapped longer, on a heavily loaded production system.
Processors are forced to wait on semaphores during the
process of synchronization.
Overhead is incurred when slave processors are acquired
(on entry to a parallel region) and at synchronization points within parallel
regions. Tests show that the overhead of executing extra autotasking code
adds a nominal 0% to 5% to the overall execution time.
If inner-loop autotasking is used, vector performance
can decrease because of shorter vector lengths and more vector loop startups.
Processors are sometimes held for the next parallel region
to improve efficiency. Although holding a processor can save time,
acquiring and holding it also costs time.
Because overhead is associated with work distribution, jobs with
large granularity incur less partitioning overhead than jobs with small
granularity. Jobs with large granularity, however, may have problems with
load balancing.
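The load-balancing side of this trade-off can be sketched with a small simulation. It is an illustrative model only: tasks are dealt to workers in consecutive chunks, round-robin, and distribution overhead is not modeled. The function name max_worker_load and the 64-worker limit are hypothetical choices for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Deals consecutive chunks of `chunk` tasks to `p` workers in round-robin
   order and returns the largest total cost that any one worker receives.
   With concurrent workers and no overhead, this is the wall-clock time. */
double max_worker_load(const double *cost, size_t n, size_t chunk, size_t p)
{
    assert(cost != NULL && chunk > 0 && p > 0 && p <= 64);
    double load[64] = {0.0};
    for (size_t i = 0; i < n; i++)
        load[(i / chunk) % p] += cost[i];
    double worst = 0.0;
    for (size_t w = 0; w < p; w++)
        if (load[w] > worst)
            worst = load[w];
    return worst;
}
```

For eight tasks with costs 1 through 8 on two workers, a chunk size of 4 (large granularity, two chunks to distribute) leaves one worker with a load of 26, while a chunk size of 1 (small granularity, eight chunks to distribute) brings the worst load down to 20. The finer split balances better but requires more distribution steps, each of which carries overhead on a real system.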
Parallel processing implementation
strategies are discussed in detail in the following books:
In addition to these books, other documents in the MIPSpro compiler
documentation set discuss parallel processing issues that are specific
to compiler use. See the Guide to
SGI Compilers and Compiling Tools for information
about those books.
Front Matter
About This Guide
Chapter 1. Introduction
Chapter 2. Basic Linear Algebra Subprogram (BLAS) Routines
Chapter 3. LAPACK
Chapter 4. Using Sparse Linear Equation Solvers
Chapter 5. Signal Processing Routines
Appendix A. Supported SCSL Routines
Glossary
Index