Linux Application Tuning Guide
(document number: 007-4639-010 / published: 2009-01-30)
Chapter 1. System Overview
Tuning an application involves making your program run its
fastest on the available hardware. The first step is to make your program
run as efficiently as possible on a single processor system and then consider
ways to use parallel processing.
Application tuning is different from system tuning, which involves
topics such as disk partitioning, optimizing memory management, and configuration
of the system. The Linux Configuration
and Operations Guide discusses those topics in detail.
This chapter provides an overview of concepts involved in working
in parallel computing environments.
Scalability is computational power that can
grow over a large number of CPUs. Scalability depends on the communication
time between nodes on the system. Latency is the time to send
the first byte between nodes.
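As a rough illustration of why latency limits scalability, the sketch below models the time to move a message between two nodes as a fixed latency plus the payload size divided by the bandwidth. The linear model, the function name transfer_time, and the sample figures are assumptions made for illustration; they are not measurements of any SGI system.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated seconds to send nbytes between two nodes.

    Assumed model: fixed startup latency plus payload / bandwidth.
    """
    if nbytes < 0 or latency_s < 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("size and latency must be >= 0, bandwidth must be > 0")
    return latency_s + nbytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical figures: 1 microsecond latency, 1 GB/s bandwidth.
    for size in (8, 1024, 1024 * 1024):
        t = transfer_time(size, 1e-6, 1e9)
        print(f"{size:>8} bytes: {t * 1e6:.3f} us")
```

Under this model, small messages are dominated by the latency term, so a program that exchanges many small messages between nodes is more sensitive to latency than to bandwidth.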
A Symmetric Multiprocessor (SMP) is a parallel programming environment
in which all processors have equally fast (symmetric) access to memory.
These types of systems are easy to assemble but have limited scalability
due to memory access times.
Another parallel environment is that of arrays, or clusters. Any
networked computer can participate in a cluster. Clusters are highly scalable
and easy to assemble, but are often hard to use. There is no shared memory,
and latency is frequently long.
Massively Parallel Processors (MPPs) have a distributed memory and
can scale to thousands of processors; they have large memories and large
local memory bandwidth.
Scalable Symmetric Multiprocessors (S2MPs),
as in the ccNUMA environment, combine qualities of SMPs and MPPs. They
are logically programmable like an SMP and have MPP-like scalability.
An Overview of Altix Architecture
This section provides a brief overview of the SGI Altix 3000 and
4000 series systems.
Altix 3000 Series Systems
In order to optimize
your application code, some understanding of the SGI Altix architecture
is needed. This section provides a broad overview of the system architecture.
The SGI Altix 3000 family of servers and superclusters can have
as many as 256 processors and 2048 gigabytes of memory. It uses Intel's
Itanium 2 processors and uses nonuniform memory access (NUMA) in SGI's
NUMAflex global shared-memory architecture. An SGI Altix 350 system can
have as many as 16 processors and 96 gigabytes of memory.
The NUMAflex design permits modular packaging of CPU, memory, I/O,
graphics, and storage into components known as bricks.
The bricks can then be combined and configured into different systems,
based on customer needs.
On Altix 3700 systems, two Itanium processors share a common frontside
bus and memory. This constitutes a node in the NUMA architecture. Access
to other memory (on another node) by these processors has a higher latency,
and slightly different bandwidth characteristics. Two such nodes are packaged
together in each computer brick. For a detailed overview, see the
SGI Altix 3000 User's Guide.
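On a running Linux system, the grouping of processors into NUMA nodes can usually be inspected through sysfs. The following is a minimal sketch that assumes the kernel exposes the standard /sys/devices/system/node/nodeN/cpulist files. The function name numa_nodes and its base parameter are illustrative choices, and whether a particular Altix kernel provides this layout is an assumption.

```python
import glob
import os


def numa_nodes(base="/sys/devices/system/node"):
    """Return {node_id: cpulist_string} for each NUMA node found under base.

    Returns an empty dict if the kernel does not expose NUMA information.
    """
    nodes = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*")):
        node_id = int(os.path.basename(path)[len("node"):])
        try:
            with open(os.path.join(path, "cpulist")) as f:
                nodes[node_id] = f.read().strip()
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: CPUs {cpus}")
```

Processors listed under the same node share local memory. An access by one of them to memory on a different node is the higher-latency remote case described above.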
On an SGI Altix 3700 Bx2 system, the CR-brick contains the processors
(8 processors per CR-brick) and two internal high-speed routers. The routers
connect to other system bricks via NUMAlink cables and expand the compute
or memory capacity of the Altix 3700 Bx2. For a detailed overview, see
the SGI Altix 3700 Bx2 User's Guide.
All Altix 350 systems contain at least one base compute module that
contains the following components:
One or two Intel Itanium 2 processors; each processor has integrated L1, L2, and L3 caches
Up to 24 GB of local memory
Four PCI/PCI-X slots
One IO9 PCI card that comes factory-installed in the lowermost PCI/PCI-X slot
For a detailed overview, see the
SGI Altix 350 System User's Guide.
The system software consists of a standard Linux distribution (Red
Hat) and SGI ProPack, which is
an overlay providing additional features such as optimized libraries and
enhanced kernel support. See Chapter 2, “The SGI Compiling Environment”, for details
about the compilers and libraries included with the distribution.
Altix 4000 Series Systems
In the new SGI Altix
4000 series systems, functional blades (interchangeable compute, memory,
I/O, and special purpose blades in an innovative blade-to-NUMAlink architecture)
are the basic building blocks for the system. Compute blades with a bandwidth
configuration have one processor socket per blade. Compute blades with
a density configuration have two processor sockets per blade. Cost-effective
compute density is one advantage of this compact blade packaging.
The Altix 4000 series is a family of multiprocessor distributed
shared memory (DSM) computer systems that currently scales from 8 to 512
CPU sockets (up to 1,024 processor cores) and can accommodate up to 6 TB
of globally shared memory in a single system while delivering a teraflop
of performance in a small-footprint rack. The SGI Altix 450 currently
scales from 2 to 76 cores as a cache-coherent single system image (SSI).
In a DSM system, each processor board contains memory that it shares with
the other processors in the system. Because the DSM system is modular,
it combines the advantages of low entry-level cost with global scalability
in processors, memory, and I/O.
You can install and operate the Altix
4700 series system in a rack in your lab or server room. Each 42U SGI
rack holds from one to four 10U-high enclosures that support up to ten
processor and I/O submodules known as "blades." These blades are single
printed circuit boards (PCBs) with ASICs, processors, and memory components
mounted on a mechanical carrier. The blades slide directly in and out
of the Altix 4700 IRU enclosures. Each individual rack unit (IRU) is 10U
in height (see Figure 1-1).
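The packaging figures above imply a simple upper bound on processor sockets per rack. The sketch below works through that arithmetic. It assumes that every blade slot holds a compute blade, which real configurations will not do because some slots are needed for I/O and other blades. The constant and function names are illustrative.

```python
IRUS_PER_RACK_MAX = 4    # "one to four" enclosures per 42U rack
BLADES_PER_IRU_MAX = 10  # "up to ten" blades per enclosure
SOCKETS_PER_BLADE = {"bandwidth": 1, "density": 2}
CORES_PER_SOCKET = 1024 // 512  # 1,024 cores over 512 sockets


def max_sockets_per_rack(config, irus=IRUS_PER_RACK_MAX):
    """Upper bound on sockets in one rack if every slot holds a compute blade."""
    if not 1 <= irus <= IRUS_PER_RACK_MAX:
        raise ValueError("a rack holds from one to four enclosures")
    return irus * BLADES_PER_IRU_MAX * SOCKETS_PER_BLADE[config]


if __name__ == "__main__":
    for config in ("bandwidth", "density"):
        sockets = max_sockets_per_rack(config)
        print(config, sockets, "sockets,", sockets * CORES_PER_SOCKET, "cores")
```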
For more information on this system, see the SGI Altix
4700 System User's Guide available on the SGI Technical Publications
Library. It provides a detailed overview of the SGI Altix 4700 system
components and it describes how to set up and operate the system. For
an overview of the new SGI Altix 450 system, see Chapter 3, "System Overview"
in the SGI Altix 450 System User's Guide.
The Basics of Memory Management
Virtual memory (VM), also known as virtual
addressing, is used to divide a system's relatively small amount of physical
memory among the potentially larger number of logical processes in a program.
It does this by dividing physical memory into pages,
and then allocating pages to processes as the pages are needed.
A page is the smallest unit of system memory allocation. Pages are
added to a process when either a validity fault occurs or an allocation
request is issued. Process size is measured in pages and two sizes are
associated with every process: the total size and the resident set size
(RSS). The number of pages being used in a process and the process size
can be determined by using either the ps(1)
or the top(1) command.
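The same two sizes can also be read programmatically. As a minimal sketch, assuming a Linux /proc filesystem, the first two fields of /proc/<pid>/statm give the total program size and the resident set size in pages, as documented in proc(5). The function names parse_statm and process_size are illustrative.

```python
import os


def parse_statm(line):
    """Return (total_pages, resident_pages) from a /proc/<pid>/statm line."""
    fields = line.split()
    return int(fields[0]), int(fields[1])


def process_size(pid="self"):
    """Return total and resident size of a process, in pages and in bytes."""
    with open(f"/proc/{pid}/statm") as f:
        total, resident = parse_statm(f.read())
    page = os.sysconf("SC_PAGE_SIZE")  # page size in bytes on this system
    return {
        "total_pages": total,
        "resident_pages": resident,
        "total_bytes": total * page,
        "resident_bytes": resident * page,
    }


if __name__ == "__main__":
    print(process_size())
```

The resident figure is normally smaller than the total, because pages are added to a process only as they are needed.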
Swap space is used for temporarily saving
parts of a program when there is not enough physical memory. The swap
space may be on the system drive, on an optional drive, or allocated to
a particular file in a filesystem. To avoid swapping, try
not to overburden memory. Lack of adequate swap space limits the number
and the size of applications that can run simultaneously on the system,
and it can limit system performance.
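To check whether a system is actually using swap, one option is to read the SwapTotal and SwapFree lines of /proc/meminfo. The sketch below assumes the usual Linux format for that file, with values reported in kB. The function names are illustrative.

```python
def parse_swap(meminfo_text):
    """Return (swap_total_kb, swap_free_kb) from /proc/meminfo contents."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])  # value is reported in kB
    return values.get("SwapTotal", 0), values.get("SwapFree", 0)


def swap_in_use_kb(path="/proc/meminfo"):
    """Return the amount of swap currently in use, in kB."""
    with open(path) as f:
        total, free = parse_swap(f.read())
    return total - free


if __name__ == "__main__":
    print("swap in use:", swap_in_use_kb(), "kB")
```

A value that keeps growing while an application runs suggests that its working set does not fit in physical memory.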
Linux is a demand paging operating system, using a least-recently-used
paging algorithm. On a validity fault, a page is mapped into physical
memory if it is being referenced for the first time, or brought back into
memory if it was swapped out.
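Demand paging can be observed from user space. The following sketch maps a region of anonymous memory, writes one byte to each page, and reports how many minor page faults the process recorded while doing so. It assumes a Unix system where the resource module reports ru_minflt. The function name is illustrative, and the exact count depends on the kernel, the page size, and any use of huge pages.

```python
import mmap
import resource


def faults_when_touching(npages):
    """Map npages of anonymous memory, write one byte per page, and
    return the number of minor page faults recorded during the writes."""
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, npages * page)  # anonymous mapping, not yet resident
    try:
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        for i in range(npages):
            buf[i * page] = 1  # first reference to each page
        after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    finally:
        buf.close()
    return after - before


if __name__ == "__main__":
    print("minor faults:", faults_when_touching(256))
```

Creating the mapping does not by itself make its pages resident. The faults are counted as each page is referenced for the first time.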