Linux Device Driver Programmer's Guide, Porting to SGI Altix Systems
(document number: 007-4520-007 / published: 2008-09-24)
This chapter gives an overview of system components and the management
of physical and virtual memory in SGI Altix series systems, which are
based on the Itanium Processor Family (IPF) of processors. This chapter
also provides background information to help you understand the limitations
and special conventions used by some kernel functions.
The following main topics are covered in this chapter:
The SGI Altix servers are a family of multiprocessor
distributed shared memory (DSM) computer systems. The SGI Altix systems
use a global-address-space cache-coherent multiprocessor that can scale
up to 512 processors in a cache-coherent domain. The processors are
housed in a 3U-high brick called the SC-brick. The SC-brick contains
two processor nodes. A processor node consists of two processors, each
with 1.5- or 3-MB on-chip, private tertiary (L3) cache, connected to the
scalable hub (SHub) ASIC via the front side bus (FSB). The SHub ASIC acts
as a crossbar between the processors, local SDRAM memory, the network
interface, and the I/O interface. The processor nodes are interconnected
by NUMAlink 4 channels. The modularity of the DSM approach combines
the advantages of low entry-level cost with global scalability in processors,
memory, and I/O. The SGI Altix systems are based on the Intel Itanium
2 processor. The Intel Itanium 2 processor is a 64-bit processor that
is initially offered at 900 MHz clock speed with a 1.5 MB L3 cache size.
The SGI Altix has a PCI-X-based I/O system. (For more details on
PCI-X devices, see Chapter 3, “PCI-X Device Attachment”.) The I/O components
are housed in an I/O brick. Following are the two types of I/O bricks:
| IX-brick | An IX-brick consists of six PCI-X buses. One slot is preloaded with the BaseIO card, plus a drive module containing a DVD-ROM and one or two system disks. |
| PX-brick | A PX-brick consists of six PCI-X buses, each with two PCI-X slots. |
Figure 2-1 shows the links between the various
bricks of the SGI Altix system.
The following sections provide additional information about the various
system bricks. These sections describe the following system components:
Compute/processor node (SC-brick)
PCI-X with BaseIO (IX-brick)
PCI-X with expansion (PX-brick)
Compute/Processor Node (SC-brick)
The SC-brick is a 3U (5.25-inch; 1U = 1.75 inches) rackmountable enclosure
that contains the following components:
Two processor nodes, each containing two 64-bit processors with 1.5- or 3-MB tertiary (L3) caches.
Two SHub chipsets.
Sixteen DIMM slots per SHub; one or two memory banks per four DIMMs.
Node electronics.
One L1 controller.
The node electronics, L1 controller, and power regulators are contained
on a single half-panel power board (PCB). The two SHubs, four processors,
and processor power pods are housed on separate half-panel boards. Four
memory daughtercards house the memory DIMMs. Each daughtercard supports
eight memory DIMMs. Figure 2-2 shows the block diagram
of an SC-brick.
The SC-brick has the following features:
Two 64-bit processors
One 1.5- or 3-MB tertiary (L3) cache per processor (integrated within the processor)
Configurable from 2.0 GB to 16 GB of main memory (minimum 8 DIMMs)
Two 6.4-GB/s (each direction) NUMAlink channels
Two 2.4-GB/s (each direction) Xtown2 channels
One connection port to the L2 controller
One DB9 console port
PCI-X with BaseIO (IX-brick)
The IX-brick is actually a PX-brick with a BaseIO card in PCI-X bus Q, slot Q, plus
a drive module. The BaseIO card consists of the following components:
IOC4 components:
  ATA bus connected to DVD-ROM
  NVRAM
  Real-time clock
  Real-time input/output ports
  Serial ports
  PS/2 keyboard and mouse ports
Ethernet network chipset
SCSI controller
Figure 2-3 shows an IX-brick.
PCI-X with Expansion (PX-brick)
The PX-brick contains six PCI-X buses with two slots per bus to make a total of 12
PCI-X slots. PX-bricks can be connected to the system via two Xtown2 links.
The PX-brick PCI-X expansion is shown in Figure 2-4.
System Memory Address Space
SGI Altix systems support 64-bit mode addressing. This section refers to the
64-bit address spaces provided by the SGI Altix system microprocessor
(see Figure 2-8). This architecture uses addresses
that are 64-bit unsigned integers from 0x0000 0000 0000 0000
to 0xFFFF FFFF FFFF FFFF. This is an immense span of numbers: if
it were drawn to a scale of 1 millimeter per terabyte, the drawing would
be 16.8 kilometers long (just over 10 miles).
The following types of space are described in this section:
Physical address
Global memory mapped register (MMR)
Atomic memory operation (AMO)
Cacheable memory
SHub physical address map
Physical Address Space
This section provides physical address space information that is normally used
by device drivers. SGI Altix systems support 50-bit physical addressing,
as shown in Figure 2-5.
Fields in Figure 2-5 are defined as follows:
| Bits | Description |
| 63:50 | Unused and reserved for future use. The value of these bits should always be zero. Because bit 49 is also always 0, this leaves 512 terabytes (2^49 bytes) of addressing for SGI Altix systems implemented with SHub. |
| 49:38 | Node ID bits. SGI Altix systems implemented with SHub support up to 1024 processor nodes (2048 CPUs per system). Bit 38 indicates the node type. A value of 0 indicates a processor node. Bit 49 is always 0. |
| 37:36 | Address space (AS). Each SHub is allocated 256 GB of physical address space. Bits 37:36 divide the 256 GB into four 64-GB spaces, listed in the AS table that follows this one. The AS bits are analogous to the uncached attribute bits of the SGI Origin series systems; however, since Itanium 2 processors do not support uncached attribute bits in the translation lookaside buffer (TLB), physical address bits are used to perform the equivalent function. |
| 35:0 | Node offset. These bits point to a specific byte location within one of the four 64-GB spaces of the SHub. When the value of bits 37:36 is 0b00, the 64-GB local resource space and global MMR space is split into two 32-GB regions: 32 GB of local resource space and 32 GB of global MMR space. Bit 35 selects between these two regions. When the value of bits 37:35 is 0b000, the request targets the local resource space. When the value of bits 37:35 is 0b001, the request targets the global MMR space. |
The AS bits (37:36) select one of the four 64-GB spaces, as follows:
| Bits [37:36] | Description |
| 00 | Local resource space and global MMR space |
| 01 | GET space |
| 10 | AMO space |
| 11 | Cacheable memory space |
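The field layout above reduces to a few shifts and masks. The following is a minimal sketch in C, not code from the SGI kernel sources: the names `shub_paddr`, `shub_decode`, and `shub_encode` are hypothetical, and only the bit positions are taken from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the field table above; all names are illustrative. */
#define SHUB_NODE_SHIFT  38
#define SHUB_NODE_MASK   0x7FFULL             /* bits 48:38 (bit 49 is always 0) */
#define SHUB_AS_SHIFT    36
#define SHUB_AS_MASK     0x3ULL               /* bits 37:36 */
#define SHUB_OFFSET_MASK ((1ULL << 36) - 1)   /* bits 35:0 */

enum shub_as {
    SHUB_AS_LOCAL_MMR = 0,  /* 00: local resource space and global MMR space */
    SHUB_AS_GET       = 1,  /* 01: GET space */
    SHUB_AS_AMO       = 2,  /* 10: AMO space */
    SHUB_AS_CACHEABLE = 3   /* 11: cacheable memory space */
};

struct shub_paddr {
    unsigned     node_id;   /* bits 48:38 */
    enum shub_as as;        /* bits 37:36 */
    uint64_t     offset;    /* bits 35:0 */
};

/* Split a physical address into its fields. Bits 63:49 must be zero. */
static struct shub_paddr shub_decode(uint64_t pa)
{
    struct shub_paddr f;

    assert((pa >> 49) == 0);
    f.node_id = (unsigned)((pa >> SHUB_NODE_SHIFT) & SHUB_NODE_MASK);
    f.as      = (enum shub_as)((pa >> SHUB_AS_SHIFT) & SHUB_AS_MASK);
    f.offset  = pa & SHUB_OFFSET_MASK;
    return f;
}

/* Reassemble a physical address from its fields. */
static uint64_t shub_encode(struct shub_paddr f)
{
    assert(f.node_id <= SHUB_NODE_MASK);
    assert(f.offset <= SHUB_OFFSET_MASK);
    return ((uint64_t)f.node_id << SHUB_NODE_SHIFT) |
           ((uint64_t)f.as << SHUB_AS_SHIFT) |
           f.offset;
}
```

For example, under this layout the address 0xB000001000 decodes to node ID 2, AS bits 11 (cacheable memory space), and node offset 0x1000.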
The following sections describe global MMR space, AMO space, and
cacheable memory space.
Global MMR Space
A node's global memory mapped register (MMR) space provides all processor
nodes in the system with access to a node's MMRs (see Figure 2-6).
Notice the position of the global MMR space in the physical address map
shown in Figure 2-8. Following are the values of
the bits for global MMR space:
| Bit | Value |
| 49 | 0 |
| 48:38 | Node ID (remember, SHubs are even nodes) |
| 37:36 | 00 (AS bits) |
| 35 | 1 |
Note: Programmable I/O addresses reside in this space (for
example, the SHub system register set, PCI configuration space, PCI I/O
and memory space, I/O brick registers, and so on).
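As an illustration of the bit values for global MMR space, the sketch below composes a physical address in a given node's global MMR region and tests whether an address falls in that region. It is a sketch under stated assumptions: `global_mmr_paddr` and `is_global_mmr` are hypothetical helper names, and real drivers would normally obtain such addresses from the kernel rather than build them by hand.

```c
#include <assert.h>
#include <stdint.h>

#define MMR_NODE_SHIFT  38              /* node ID occupies bits 48:38 */
#define MMR_NODE_MAX    0x7FFULL
#define MMR_GLOBAL_BIT  (1ULL << 35)    /* bit 35 = 1 selects global MMR space */
#define MMR_REGION_SIZE (1ULL << 35)    /* each of the two regions is 32 GB */

/* Hypothetical helper: physical address of a byte in a node's global MMR space. */
static uint64_t global_mmr_paddr(unsigned node_id, uint64_t mmr_offset)
{
    assert(node_id <= MMR_NODE_MAX);
    assert(mmr_offset < MMR_REGION_SIZE);
    /* AS bits 37:36 are 00, so nothing is ORed in for them. */
    return ((uint64_t)node_id << MMR_NODE_SHIFT) | MMR_GLOBAL_BIT | mmr_offset;
}

/* True if bits 37:35 are 0b001, that is, the address targets global MMR space. */
static int is_global_mmr(uint64_t pa)
{
    return ((pa >> 35) & 0x7) == 0x1;
}
```

With node ID 2 and an offset of 0x100, this yields 0x8800000100; the same offset with bit 35 cleared (0x8000000100) would target the local resource space instead.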
AMO Space
When the address space (AS) bits are set to 10, the reference is to atomic memory operation
(AMO) space. An AMO read operation (AMOR) or AMO write
operation (AMOW) request is issued to the SHub that
is identified by the number in the node ID (see Figure 2-7).
Notice the position of the AMO space in the physical address map shown
in Figure 2-8. The node offset bits specify a 36-bit
offset within the SHub address space, as follows:
| Bit | Value |
| 49 | 0 |
| 48:38 | Node ID (remember, SHubs are even nodes) |
| 37:36 | 10 |
| 35:0 | Node offset |
A number of fetch-and-op style AMOs are supported to optimize common
synchronization primitives such as locks, tickets, and barriers. These
AMOs operate on a read-modify-write basis. AMOs are defined only for word
and doubleword data sizes and are performed using uncached loads and stores
to the AMO address space. In addition, operations are allowed only on
the first doubleword of each 64-byte block (half cache line) in memory.
The AMO variable can be accessed either as one 64-bit AMO variable or
as two 32-bit AMO variables.
In the AMO address space, bits 5:3 of the node offset (the three
address bits above the doubleword offset) determine the type of AMO to
perform.
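The addressing rules above can be sketched as follows. This is an illustrative helper only: `amo_paddr` is a hypothetical name, it covers the 64-bit form of an AMO variable, and because this chapter does not list the numeric selector for each operation, the 3-bit selector is left as a caller-supplied parameter rather than a named constant.

```c
#include <assert.h>
#include <stdint.h>

#define AMO_NODE_SHIFT 38
#define AMO_NODE_MAX   0x7FFULL
#define AMO_AS_BITS    (2ULL << 36)         /* AS = 10 selects AMO space */
#define AMO_OFFSET_MAX ((1ULL << 36) - 1)
#define AMO_BLOCK_SIZE 64ULL                /* one AMO variable per 64-byte block */
#define AMO_OP_SHIFT   3                    /* operation selector lives in bits 5:3 */

/*
 * Hypothetical helper: build the physical address used to perform an AMO.
 * var_offset is the node offset of the AMO variable and must name the first
 * doubleword of a 64-byte block; op_sel is the 3-bit operation selector.
 */
static uint64_t amo_paddr(unsigned node_id, uint64_t var_offset, unsigned op_sel)
{
    assert(node_id <= AMO_NODE_MAX);
    assert(var_offset <= AMO_OFFSET_MAX);
    assert(var_offset % AMO_BLOCK_SIZE == 0);
    assert(op_sel < 8);
    return ((uint64_t)node_id << AMO_NODE_SHIFT) | AMO_AS_BITS |
           var_offset | ((uint64_t)op_sel << AMO_OP_SHIFT);
}
```

Because the variable always sits at a 64-byte boundary, bits 5:3 of its own offset are zero, which is what frees them to carry the operation selector.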
The following AMO read operations are supported:
| Fetch | Simple uncached read of the location. |
| Fetch and Increment | The location's current value is returned and then the location's value is incremented. This operation is followed by a write operation. |
| Fetch and Decrement | The location's current value is returned and then the location's value is decremented. This operation is followed by a write operation. |
| Fetch and Clear | The location's current value is returned and then the location's value is cleared. This operation is followed by a write operation. |
The following AMO write operations are supported:
| Initialize | Simple uncached write of the location. |
| Increment | The location's value is incremented. |
| Decrement | The location's value is decremented. |
| Logical AND | Stored data is logically AND'd with the location's current value. |
| Logical OR | Stored data is logically OR'd with the location's current value. |
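The semantics of the read and write operations listed above can be modeled in ordinary C. This is a software model for illustration only: on real hardware the read-modify-write is carried out by the SHub in response to an uncached load or store, whereas the functions below operate on a plain variable and are not atomic. The names `amo_read` and `amo_write` and the enumerators are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum amo_read_op  { AMO_FETCH, AMO_FETCH_INC, AMO_FETCH_DEC, AMO_FETCH_CLEAR };
enum amo_write_op { AMO_INIT, AMO_INC, AMO_DEC, AMO_AND, AMO_OR };

/* Model of an AMO read: return the current value, then update the location. */
static uint64_t amo_read(uint64_t *loc, enum amo_read_op op)
{
    uint64_t old;

    assert(loc);
    old = *loc;
    switch (op) {
    case AMO_FETCH:                       break;  /* plain read, no update */
    case AMO_FETCH_INC:   *loc = old + 1; break;
    case AMO_FETCH_DEC:   *loc = old - 1; break;
    case AMO_FETCH_CLEAR: *loc = 0;       break;
    }
    return old;
}

/* Model of an AMO write: combine the stored data with the location. */
static void amo_write(uint64_t *loc, enum amo_write_op op, uint64_t data)
{
    assert(loc);
    switch (op) {
    case AMO_INIT: *loc = data;  break;   /* plain write */
    case AMO_INC:  *loc += 1;    break;   /* data is ignored in this model */
    case AMO_DEC:  *loc -= 1;    break;   /* data is ignored in this model */
    case AMO_AND:  *loc &= data; break;
    case AMO_OR:   *loc |= data; break;
    }
}
```

A ticket-style primitive, for instance, would have each waiter perform a Fetch and Increment on a shared AMO variable and treat the returned value as its ticket number.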
Cacheable Memory Space
When the AS bits are set to 11, the reference is
to cacheable memory space. A memory request is issued to the SHub that
is identified by the number in the node ID. The node offset bits specify
a 36-bit offset within the SHub address space. The uncacheable (UC), write-back (WB),
and write-coalescing (WC) memory attributes are supported for cacheable memory space.
Notice the position of the cacheable
memory space in the physical address map shown in Figure 2-8.
The 50-bit physical address has a 36-bit offset within the SHub address
space, as follows:
| Bit | Value |
| 49 | 0 |
| 48:38 | Node ID (remember, SHubs are even nodes) |
| 37:36 | 11 (AS bits) |
| 35:0 | Node offset |
Note: Direct memory access (DMA) addresses reside
in cacheable memory space.
Cache Use
The primary, secondary, and tertiary caches shown in Figure 2-10
are essential to CPU performance. There is an order of magnitude difference
in the speed of access between cache memory and main memory. Execution
speed remains high only as long as a very high proportion of memory accesses
are satisfied from the primary, secondary, or tertiary cache.
The use of caches means that there are often multiple copies of
data: a copy in main memory, a copy in the secondary cache (when one is
used), and a copy in the primary cache. Moreover, a multiprocessor system
has multiple CPU modules like the one shown in Figure 2-10,
and there can be copies of the same data in the cache of each CPU.
The problem of cache coherency is to ensure
that all cache copies of data are true reflections of the data in main
memory. Different SGI systems use different hardware designs to achieve
cache coherency.
Multiprocessor systems have more complex cache coherency protection
because it is possible to have data in multiple caches. In an SGI Altix
multiprocessor system, the hardware ensures that cache coherency is maintained
under all conditions, including DMA input and output, without action by
the software.
SHub Physical Address Map
Figure 2-8 shows the SHub physical
address map. On SHub, AMO space and global MMR space must be accessed
uncached, and GET space must be accessed cached. Cacheable memory space
can be accessed cached or uncached, subject to operating system constraints.
Note: Linux drivers run in virtual mode (TLBs enabled for
all addresses) all the time. Therefore, the address space they see depends
not only on the behavior of the SHub, but also on the TLB mapping conventions
of the operating system.
PIO Addresses and DMA Addresses
Figure 2-12 is too simple for some devices that are attached through a bus adapter.
A bus adapter connects a bus of a different type to the system bus, as
shown in Figure 2-9.
For example, the PCI/PCI-X bus adapter connects a PCI/PCI-X bus
to the Xtalk I/O interface of SHub. Multiple PCI/PCI-X devices can be
plugged into the PCI/PCI-X bus and use the bus to read and write. The
bus adapter translates the PCI/PCI-X bus protocol into the system Xtalk
protocol.
Each PCI/PCI-X bus has address lines that carry the address
values used by devices on that PCI/PCI-X bus. These bus addresses are
not related to the physical addresses used on the system front side bus
(FSB). The issue of bus addressing is made complicated by three facts:
Bus-master devices independently generate memory-read and memory-write commands that are intended to access system memory.
The bus adapter can translate between addresses on the bus it manages and different addresses on the system bus it uses.
The translation done by the bus adapter can be programmed dynamically (mapped), and can change from one I/O operation to another.
This subject can be simplified by dividing it into two distinct
subjects: PIO addressing, used by the CPU to access a device, and DMA
addressing, used by a bus master to access memory. These addressing modes
need to be treated differently.
Programmable I/O (PIO) is the term for a load or store instruction
executed by the CPU that names an I/O device space as its operand. The
CPU places a physical address on the system bus. The bus adapter repeats
the read or write command on its bus, but not necessarily using the same
address bits as the CPU put on the system bus.
One task of a bus adapter is to translate between the physical addresses
used on the system bus and the addressing scheme used within the proprietary
bus. The address placed on the target bus is not necessarily the same
as the address generated by the CPU. The translation is done differently
with different bus adapters and in different system models.
With the more sophisticated PCI and PCI-X buses, the translation
is dynamic. Both of these buses support bus address spaces that are as
large as or larger than the physical address space of the system bus, so it
is impossible to hard-wire a translation of the entire bus address space.
Furthermore, SGI Altix architecture provides multiple system buses. For
more details, see “Address Spaces Supported” in Chapter 3.
The PCI/PCI-X resource addresses in the pci_dev
structure are PIO-mapped addresses that the device driver can use as they are.
To use a dynamic PIO address, a device driver
can create a software object called a PIO map that represents that portion
of bus address space that contains the device registers the driver uses.
When the driver wants to use the PIO map, the kernel dynamically sets
up a translation from an unused part of physical address space to the
needed part of the bus address space. The driver extracts an address from
the PIO map and uses it as the base for accessing the device registers.
This is an extension that SGI provides.
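Once a driver holds a PIO-mapped base address, register access is a matter of loads and stores at offsets from that base. The sketch below models this in user space: an ordinary array stands in for the device's register block, and `reg_read32` and `reg_write32` are hypothetical helpers. In an actual Linux driver the base would come from the mapped PCI resource and the kernel's own accessors (such as `readl()` and `writel()`) would be used instead of direct pointer dereferences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read a 32-bit register at a byte offset from a mapped base address.
 * volatile keeps the compiler from caching or reordering the access,
 * which matters because device registers can change on their own.
 */
static uint32_t reg_read32(volatile uint32_t *base, size_t byte_off)
{
    assert(base);
    assert(byte_off % sizeof(uint32_t) == 0);   /* registers are word aligned */
    return base[byte_off / sizeof(uint32_t)];
}

/* Write a 32-bit register at a byte offset from a mapped base address. */
static void reg_write32(volatile uint32_t *base, size_t byte_off, uint32_t val)
{
    assert(base);
    assert(byte_off % sizeof(uint32_t) == 0);
    base[byte_off / sizeof(uint32_t)] = val;
}
```

The driver treats the address it extracts from the PIO map purely as a base; every register is then named by a fixed offset defined by the device, not by a physical address.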
A bus-master device on the PCI bus can be programmed to perform
transfers to or from memory independently and asynchronously. A bus master
is programmed using PIOs with a starting bus address and a length. The
bus master generates a series of memory-read or memory-write operations
to successive addresses. But what bus addresses should it use in order
to store into the proper memory addresses?
The bus adapter translates the addresses used on the proprietary
bus to corresponding addresses on the system bus. As shown in Figure 2-9,
the operation of a DMA device is as follows:
1. The device places a bus address and data on the PCI or PCI-X bus.
2. The bus adapter translates the address to a meaningful physical address, and places that address and the data on the system Xtalk I/O link.
3. The memory modules store the data.
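The translation step can be pictured as a lookup in a small table of programmed windows. The following is a simplified software model, assuming a hypothetical table layout (`struct dma_window`) and function name (`adapter_translate`); it does not describe the actual mapping hardware of any SGI bus adapter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One programmed window: bus addresses [bus_base, bus_base + len) map to
 * physical addresses starting at phys_base. The layout is hypothetical.
 */
struct dma_window {
    uint64_t bus_base;
    uint64_t phys_base;
    uint64_t len;
};

/*
 * Translate a bus address the way a bus adapter would. Returns true and
 * stores the physical address if some window covers bus_addr; returns
 * false if the address is unmapped.
 */
static bool adapter_translate(const struct dma_window *win, size_t nwin,
                              uint64_t bus_addr, uint64_t *phys_out)
{
    assert(win != NULL && phys_out != NULL);
    for (size_t i = 0; i < nwin; i++) {
        if (bus_addr >= win[i].bus_base &&
            bus_addr - win[i].bus_base < win[i].len) {
            *phys_out = win[i].phys_base + (bus_addr - win[i].bus_base);
            return true;
        }
    }
    return false;
}
```

In this model the kernel's role is to fill in the window table before the transfer starts, and the device never sees anything but bus addresses.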
The translation of bus virtual addresses to physical addresses is done by
the bus adapter and programmed by the kernel. A device driver requests
the kernel to set up a dynamic mapping from a designated memory buffer
to bus addresses. Linux device drivers on SGI Altix systems must use the standard
Linux pci_dma map routines. For more information, see Chapter 9, “PCI-X Direct Memory Access (DMA)”.
The driver calls kernel functions to establish
the range of memory addresses that the bus master device will need to
access--typically the address of an I/O buffer. When the driver calls
one of the pci_dma map routines, the kernel sets up
the bus adapter hardware to translate between some range of bus addresses
and the desired range of memory space. The driver uses PIO to program
this bus address into the bus master device registers. SGI software supports
64- and 32-bit DMA addresses. For more information on 64- and 32-bit DMA
map addresses, see Chapter 9, “PCI-X Direct Memory Access (DMA)”.
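Whether a device needs 32-bit or 64-bit DMA addresses comes down to how many address bits it can drive on the bus. The check below is a minimal sketch of that idea; `dma_range_fits` is a hypothetical helper written for this illustration and is not part of the Linux DMA interface, which performs the equivalent check internally against the mask a driver declares.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if every byte of [bus_addr, bus_addr + len) can be expressed
 * with the given number of address bits (for example, 32 or 64).
 */
static bool dma_range_fits(uint64_t bus_addr, uint64_t len, unsigned bits)
{
    uint64_t last, mask;

    assert(bits >= 1 && bits <= 64);
    if (len == 0)
        return false;                  /* an empty transfer is not meaningful */
    last = bus_addr + (len - 1);
    if (last < bus_addr)
        return false;                  /* range wraps past the end of 64 bits */
    mask = (bits == 64) ? UINT64_MAX : ((1ULL << bits) - 1);
    return last <= mask;
}
```

A buffer whose bus address lies above 4 GB passes the 64-bit check but fails the 32-bit one, which is the situation in which a 32-bit device depends on the bus adapter's mapping to reach the buffer.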
Linux Kernel and User Virtual Address Management
The SGI Altix system uses the same virtual memory manager as any IA-64 Linux
system with ccNUMA and discontiguous memory support. For more information
on Linux kernel and user virtual address management, see IA-64
Linux Kernel Design and Implementation.
The following sections describe CPU and device access to memory.
CPU Access to Memory or I/O Address Space
Each SGI computer system has one or more
CPU modules and one or more I/O modules. A CPU reads data from memory
or a device by placing an address on a system bus and receiving data back
from the addressed memory or device. An address can be translated more
than once as it passes through multiple layers of I/O chipsets and bus
adapters. Access to memory can also pass through multiple levels of cache.
The CPU generates the address of data that it needs--the address
of an instruction to fetch, or the address of an operand of an instruction.
It requests the data through a mechanism that is depicted in simplified
form in Figure 2-10.
The process is as follows:
1. The address of the needed data is formed in the processor execution or instruction-fetch unit. Most addresses are then mapped from virtual to real through the translation lookaside buffer (TLB). On Itanium 2 processors, all addresses go through the TLBs if TLBs are enabled. With some very small exceptions, TLBs are always enabled.
2. Most addresses are presented to the L1 cache, a cache in the processor chip. If a copy of the data with that address is found, it is returned immediately. Certain address ranges are never cached; these addresses pass directly to the bus.
3. If the L1 cache does not contain the data, the address is presented to the L2 cache. If it contains a copy of the data, the data is returned immediately. The size and the architecture of the secondary cache differ from one CPU model to another.
4. If L2 does not contain the data, the address is presented to the L3 cache. If L3 does not contain the data either, the address is placed on the system bus. The memory module that recognizes the address places the data on the bus.
The process in Figure 2-10 is correct for an
SGI Altix system when the addressed data is in the local node.
Note: When the address applies to memory in another node, the address
passes out through the connection fabric to a memory module in another
node, from which the data is returned.
CPU Access to I/O Address Space - Programmable I/O (PIO)
The CPU accesses a device
register using programmable I/O (PIO), a process
illustrated in Figure 2-11. Access to device registers
is always uncached. It is not affected by considerations of memory cache
coherency in any system (see “Cache Use”).
The process is as follows:
1. The address of the device is formed in the execution unit. It is not usually an address that is mapped by the TLB.
2. A device address, after mapping if necessary, always falls in one of the ranges that is not cached, so it passes directly to the system bus.
3. The device or system component (such as SHub) recognizes its physical address and responds with data.
The PIO process shown in Figure 2-11 is correct
for an SGI Altix system when the addressed device is attached to the same
node. When the device is attached to a different node, the address passes
through the connection fabric to that node, and the data returns the same
way.
Device Access to System Physical Memory Space - Direct Memory Access
Some devices can perform direct memory access (DMA),
in which the device itself, not the CPU, reads or writes data into memory.
A device that can perform DMA is called a bus master
because it independently generates a sequence of bus accesses without
help from the CPU.
To read or write a sequence of memory addresses, the bus master
has to be told the proper physical address (bus address) range to use.
This is done by using PIO to store a bus address and length into the device's
registers from the CPU. When the device has the DMA information, it can
access memory through the system bus as shown in Figure 2-12.
The process is as follows:
1. The device makes a request on the PCI/PCI-X bus.
2. The PCI/PCI-X bus adapter translates the PCI/PCI-X bus request and generates a request to the I/O chipset (SHub).
3. The local SHub forwards the request to the requested memory controllers (local or remote).
4. The memory module stores the data.
In an SGI Altix system, the device and the memory module can be
in different nodes, with address and data passing through the connection
fabric (NUMAlink) between nodes.
When a device is programmed with an invalid physical address, the
result is a bus error interrupt. The interrupt occurs on some CPU that
is enabled for bus error interrupts. These interrupts are not simple to
process for two reasons. First, the CPU that receives the interrupt is
not necessarily the CPU from which the DMA operation was programmed. Second,
the bus error can occur a long time after the operation was initiated.