Linux Configuration and Operations Guide
(document number: 007-4633-017 / published: 2009-10-22)
Chapter 2. Configuring Your System
This chapter provides information on configuring your system and
covers the following topics:
Note: For information on system configuration and operation on SGI
ProPack 4 for Linux or SGI ProPack 5 for Linux systems, see the
007-4633-009 or 007-4633-0012 version of this manual, respectively.
From the current 007-4633-016 version of this manual on the SGI Technical
Publications Library, select the additional info link, and then click on
007-4633-009 under Other Versions.
PCI or PCI-X Card Hot-Plug Software
The Linux PCI/X hot-plug feature supports inserting a PCI
or PCI-X card into an empty slot and preparing that card for use or deactivating
a PCI or PCI-X card and then removing it from its slot, while the system
is running. Hot-plug operations can be initiated using a series of shell
commands.
Note: PCI hot-plug is only supported on SGI Altix 3000 series systems
on the IX-Brick, PX-Brick, IA-Brick, and PA-Brick. It is not supported
on SGI Altix XE systems or Altix 350 systems. For SGI ProPack 6 systems
running RHEL5, the I3X blade (3 slot double-wide PCI-X I/O) on SGI Altix
450 or SGI Altix 4700 systems supports PCI hot-plug. PCI-X cards can be
added, removed, or replaced in the I3X blade while the I3X blade is
installed in the 1955 chassis and the system is operating.
This section describes hot-swap operations and covers the following topics:
Introduction to PCI or PCI-X Card Hot-plug Operations
A hot-swap operation
is the combination of a remove and insert operation targeting the same
slot. Single function cards, multi-function cards, and PCI/X-to-PCI/X
bridges are supported.
A hot-plug insert operation consists of attaching a card to an SGI
card carrier, inserting the carrier in an empty slot, and using software
commands to initiate the software controlled power-up and initialization
of the card.
A hot-plug remove operation consists of manually terminating any
users of the card, and then using software commands to initiate the remove
operation to deactivate and power-down the card.
The Altix system L1 hardware controller has these hot-plug restrictions,
as follows:
If these restrictions are detected by the Linux kernel and reported
to the user, the requested hot-plug operation fails.
For detailed instructions on how to install or remove a PCI or PCI-X
card on the SGI Altix 3000 series systems, see “Adding or Replacing
a PCI or PCI-X Card” in Chapter 12, “Maintenance and Upgrade
Procedures” in SGI Altix 3000 User's Guide.
For more information on the SGI L1 and L2 controller software, see
the SGI L1 and L2 Controller Software User's Guide.
Loading Hot-plug Software
The hot-plug feature is not enabled by default. To load the
sgi_hotplug module, perform the following steps:
Load the sgi_hotplug module.
Make sure the module is loaded, as follows:
% lsmod | grep sgi_hotplug
sgi_hotplug 145168 0
pci_hotplug 189124 2 sgi_hotplug,shpchp
Change directory (cd) to the /sys/bus/pci/slots directory and verify
its contents, as follows:
% ls -l
total 0
drwxr-xr-x 2 root root 0 Aug 22 10:54 0021:00:01
drwxr-xr-x 2 root root 0 Aug 22 10:54 0022:00:01
drwxr-xr-x 2 root root 0 Aug 22 10:54 0022:00:02
drwxr-xr-x 2 root root 0 Aug 22 10:54 0023:00:01
drwxr-xr-x 2 root root 0 Aug 22 10:54 0024:00:01
drwxr-xr-x 2 root root 0 Aug 22 10:54 0024:00:02
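The verification step above can be scripted. The following is a minimal sketch, not part of the SGI software: the helper name module_listed is hypothetical, and the use of modprobe to load the module is an assumption, because the load command itself is not shown in this section. The helper reads lsmod-style output on standard input and matches the module name against the first column only, so a module that merely appears in another module's "used by" list is not counted.

```shell
#!/bin/sh
# Sketch (hypothetical helper): succeed if the named module appears in the
# first column of lsmod-style output read from standard input.
module_listed() {
    # $1 = module name to look for
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Possible use on a live system (assumes modprobe is how the module is loaded):
#   modprobe sgi_hotplug
#   lsmod | module_listed sgi_hotplug && ls -l /sys/bus/pci/slots
```

Matching on the first column avoids a false positive from a line such as the pci_hotplug entry, which names sgi_hotplug only as a dependent module.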
Controlling Hot-plug Operations
This section describes hot-plug operations and the format of a slot name.
It covers the following topics:
Hot-plug operations target a particular slot using the name of the
slot. All slots that are eligible for a hot-plug operation have a directory
in the hot-plug file system that is mounted at /sys/bus/pci/slots.
The name of the target slot is based on the hardware location
of the slot in the system. For the SGI ProPack 6 release, slot directories
are in the form that the lspci(8) command uses, as follows:
segment:bus:slot
Change directory (cd) to the /sys/bus/pci/slots directory and use the
ls or ls -l command to view the contents of the directory, as follows:
pci/slots> ls
0001:00:01 0002:00:01 0002:00:02 0003:00:01 0004:00:01 0004:00:02
Each slot is part of a PCI domain. On an SGI Altix system, a
PCI domain is a functional entity that includes a root bridge,
subordinate buses under the root bridge, and the peripheral devices it
controls. For more information, see “PCI Domain Support for SGI Altix Systems”.
Each slot directory contains two files called path and power. For
example, change directory to /sys/bus/pci/slots/0001:00:01 and run the
ls command, as follows:
slots/0001:00:01> ls
path power
The power file provides the current hot-plug
status of the slot. A value of 0 indicates that the
slot is powered-down, and a value of 1 indicates that
the slot is powered-up.
The path file contains the module ID of the brick in which the slot
resides.
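Reading the two files described above can be wrapped in a small helper. This is an illustrative sketch only: the function name slot_status and its output format are hypothetical, and it assumes nothing beyond the power and path files documented here.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print "<slot> <state> <module>" for one
# slot directory, based on its power and path files.
slot_status() {
    dir=$1
    power=$(cat "$dir/power") || return 1
    case $power in
        1) state=powered-up ;;
        0) state=powered-down ;;
        *) state=unknown ;;
    esac
    printf '%s %s %s\n' "$(basename "$dir")" "$state" "$(cat "$dir/path")"
}

# Possible use on a live system:
#   for d in /sys/bus/pci/slots/*; do slot_status "$d"; done
```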
Hot-plug Insert Operation
A hot-plug insert operation first instructs the L1 hardware controller to
power-up the slot and reset the card. The L1 controller then checks that
the card to be inserted is compatible with the running bus. Compatible is
defined, as follows:
The card must support the same mode as the running bus, for example, PCI or PCI-X
The card must be able to run at the current bus speed
A 33 MHz card must not be inserted into an empty bus
Any incompatibilities or errors detected by the L1 controller are reported
to the user, and the insert operation fails.
Once the slot has been successfully powered-up by the L1 controller,
the Linux hot-plug infrastructure notifies the driver of the card that
the card is available and needs to be initialized. After the driver has
initialized the card, the hot-plug insert operation is complete and the
card is ready for use.
Hot-plug Remove Operation
Before initiating a hot-plug remove operation, the system administrator
must manually terminate any processes using the target card.
Warning: Failure to properly terminate any outstanding accesses
to the target card may result in a system failure or data corruption when
the hot-plug operation is initiated.
For a hot-plug remove operation, the hot-plug infrastructure verifies
that the target slot is eligible to be powered-down. The L1 hardware controller
restrictions do not permit the last card to be removed
from a bus running at 33 MHz and an attempt to remove the last card fails.
The hot-plug infrastructure then notifies the driver of the card of a
pending hot-plug remove operation and the driver deactivates the card.
The L1 hardware controller is then instructed to power-down the slot.
Attempts to power-down a slot that is already powered-down, or power-up
a slot that is already powered-up are ignored.
Using Shell Commands To Control a Hot-plug Operation
A hot-plug operation can be initiated by writing to the power file
of the target slot. After composing the name of the slot based on its
location, change into the directory of the slot in the hot-plug virtual
file system.
Change directory (cd) to the /sys/bus/pci/slots directory and use the
ls or ls -l command to view the contents of the directory, as follows:
pci/slots> ls
0001:00:01 0002:00:01 0002:00:02 0003:00:01 0004:00:01 0004:00:02
For example, to target hot-plug operations to slot 1, segment 1 in
module 001i03 (IA-Brick in position 3 of rack 1), change to directory
0001:00:01 and then run the ls command, as follows:
slots/0001:00:01> ls
path power
To query the current hot-plug status of the slot, read the
power file of the slot, as follows:
slots/0001:00:01> cat power
1
A value of 0 indicates that the slot is powered-down,
and a value of 1 indicates that the slot is powered-up.
The path file contains the module ID of the brick in which the slot
resides, as follows:
slots/0001:00:01> cat path
module_001i03
To initiate an insert operation to a slot that is powered-down,
write the character 1 to the power file of the slot, as follows:
slots/0001:00:01> echo 1 > power
Detailed status messages for the insert operation are written to
the syslog by the hot-plug infrastructure. These messages
can be displayed using the Linux dmesg command.
To initiate a remove operation to a slot that is powered-up, write the
character 0 to the power file of the slot, as follows:
slots/0001:00:01> echo 0 > power
Detailed status messages for the remove operation are written to
the syslog by the hot-plug infrastructure. These messages
can be displayed using the Linux dmesg command.
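The insert and remove commands in this section can be combined into one guarded helper. The sketch below is illustrative and makes several assumptions: the function name slot_set_power is hypothetical, it checks the current state first only to give clearer feedback (the infrastructure already ignores redundant requests), and it does not terminate users of the card, which must still be done manually before a remove.

```shell
#!/bin/sh
# Sketch (hypothetical helper): write 0 or 1 to DIR/power unless the slot
# is already in the requested state.
slot_set_power() {
    dir=$1 want=$2
    case $want in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    current=$(cat "$dir/power") || return 1
    if [ "$current" = "$want" ]; then
        echo "slot $(basename "$dir") already in state $want; nothing to do"
        return 0
    fi
    # Triggers the insert (1) or remove (0) operation on a real slot
    echo "$want" > "$dir/power"
}

# Possible use on a live system, followed by a look at the kernel log:
#   slot_set_power /sys/bus/pci/slots/0001:00:01 1 && dmesg | tail
```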
Faster SCSI Device Booting
On systems that have many logical unit numbers (LUNs) for attached SCSI
devices (1000 LUNs or more), three files in the /etc/udev/rules.d
directory can be modified to make the system boot faster, as follows:
50-udev-default.rules
58-xscsi.rules
60-persistent-storage.rules
This section describes these rules.
50-udev-default.rules
Note: This section only applies to SGI Altix systems with
SGI ProPack 6 running SLES10 or SLES11.
The rules in the 50-udev-default.rules file cause the SCSI drivers
sd_mod, osst, st, sr_mod, and sg to be loaded automatically when
appropriate devices are found. These rules are, as follows:
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="0|7|14", RUN+="/sbin/modprobe sd_mod"
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", SYSFS{vendor}=="On[sS]tream", RUN+="/sbin/modprobe osst"
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", RUN+="/sbin/modprobe st"
SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="[45]", RUN+="/sbin/modprobe sr_mod"
SUBSYSTEM=="scsi_device", ACTION=="add", RUN+="/sbin/modprobe sg"
To save boot time, you can comment out all of these rules, which avoids a
call to the modprobe(8) command for each SCSI device, as follows:
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="0|7|14", RUN+="/sbin/modprobe sd_mod"
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", SYSFS{vendor}=="On[sS]tream", RUN+="/sbin/modprobe osst"
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="1", RUN+="/sbin/modprobe st"
#SUBSYSTEM=="scsi_device", ACTION=="add", SYSFS{type}=="[45]", RUN+="/sbin/modprobe sr_mod"
#SUBSYSTEM=="scsi_device", ACTION=="add", RUN+="/sbin/modprobe sg"
Make sure the drivers are still loaded by adding them to the INITRD_MODULES
variable in the /etc/sysconfig/kernel file, and then run the mkinitrd(8)
command so that the changes are picked up.
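Commenting out the modprobe rules described above can be done with a text filter rather than by hand. The following sketch is one possible approach, not an SGI-supplied tool: the function name comment_modprobe_rules is hypothetical, and it writes the modified rules to standard output so the result can be reviewed before it replaces the original file.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print FILE with every scsi_device rule that
# runs /sbin/modprobe commented out; all other lines pass through unchanged.
comment_modprobe_rules() {
    sed '/^SUBSYSTEM=="scsi_device".*RUN+="\/sbin\/modprobe /s/^/#/' "$1"
}

# Possible use (review the output before installing it):
#   comment_modprobe_rules /etc/udev/rules.d/50-udev-default.rules > /tmp/50-udev-default.rules
```

Lines that are already commented out do not match the pattern, so running the filter twice does not add a second comment character.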
58-xscsi.rules
Note: This section only applies to SGI Altix systems with
SGI ProPack 6 running SLES10 or SLES11.
The rules in the 58-xscsi.rules file create the /dev/xscsi/pci...
persistent device symbolic links. The rules are, as follows:
# This rule creates sg symlinks for all SCSI devices (/dev/xscsi/pci..../sg)
KERNEL=="sg*", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"
# This rule creates symlinks for entire disks/luns (/dev/xscsi/pci..../disc)
KERNEL=="sd*[a-z]", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"
# This rule creates symlinks for disk partitions (/dev/xscsi/pci..../partX)
KERNEL=="sd*[a-z]*[0-9]", PROGRAM="/sbin/udev_xscsi %p", SYMLINK+="%c"
XVM requires the rule that creates symbolic links for entire disks/LUNs
(the middle rule), so leave that rule enabled. You can comment out the
rule that creates sg symbolic links and the rule for
disk partition symbolic links (top and bottom rules, respectively).
60-persistent-storage.rules
Note: This section only applies to SGI Altix 4000 series systems
with SGI ProPack 6 running SLES10 or SLES11.
The rules in the 60-persistent-storage.rules
file are for persistent storage links. They are not necessary and can
all be commented out or you can add GOTO="persistent_storage_end"
at the top of the file to accomplish the same.
The tmpfs filesystem memory allocations have
changed for the SGI ProPack 6 release. Prior to the SGI ProPack 6 release,
allocations were always done round-robin on all nodes. With SGI ProPack
6, this is now a tmpfs mount option. This is actually
a Linux kernel change that applies to the SLES10, SLES11, and RHEL5 base
releases.
Note: To maintain SGI ProPack 4 tmpfs filesystem
memory allocation default behavior, use the tmpfs filesystem
mpol=interleave mount option.
The tmpfs filesystem has a mount option to set the NUMA memory allocation
policy for all files in that filesystem if the kernel was built with the
CONFIG_NUMA option. You can adjust this policy on a running system by
remounting the filesystem. The following mount options apply:
mpol=default               Prefers to allocate memory from the local node
mpol=prefer:Node           Prefers to allocate memory from the given node
mpol=bind:NodeList         Allocates memory only from nodes in NodeList
mpol=interleave            Prefers to allocate from each node in turn
mpol=interleave:NodeList   Allocates from each node of NodeList in turn
The NodeList format is a comma-separated list of decimal numbers and
ranges. A range is two hyphen-separated decimal numbers, the smallest and
largest node numbers in the range. For example,
mpol=bind:0-3,5,7,9-15.
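To make the NodeList format concrete, the following sketch expands a NodeList into the individual node numbers it names. The function name expand_nodelist is hypothetical and is offered only as an illustration of the format; it performs no validation against the nodes actually present in a system.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print the node numbers named by a NodeList,
# separated by spaces. Items are single numbers or lo-hi ranges.
expand_nodelist() {
    out=
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case $item in
            *-*) lo=${item%-*}; hi=${item#*-} ;;
            *)   lo=$item; hi=$item ;;
        esac
        n=$lo
        while [ "$n" -le "$hi" ]; do
            out="$out${out:+ }$n"
            n=$((n + 1))
        done
    done
    printf '%s\n' "$out"
}

expand_nodelist 0-3,5,7,9-15
# prints: 0 1 2 3 5 7 9 10 11 12 13 14 15
```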
Trying to mount a tmpfs filesystem with an
mpol option will fail if the running kernel does not support
NUMA architecture. It will also fail if its NodeList
argument specifies a node greater than or equal to MAX_NUMNODES.
If your system relies on that tmpfs file system being mounted, but
from time to time runs a kernel built without NUMA capability (such as
a safe recovery kernel), or configured to support fewer nodes, it is advisable
to omit the mpol option from automatic mount options.
It can be added later, when the tmpfs is already mounted
on MountPoint, using the following:
mount -o remount,mpol=Policy:NodeList MountPoint
For more information on the tmpfs filesystem,
see /usr/src/linux-2.6.x.x-x/Documentation/filesystems
file on your system.
Although
some HPC workloads might be mostly CPU bound, others involve processing
large amounts of data and require an I/O subsystem capable of moving data
between memory and storage quickly, as well as having the ability to manage
large storage farms effectively. The XSCSI subsystem, XFS filesystem,
XVM volume manager, and data migration facilities were leveraged from
the IRIX operating system and ported to provide a robust, high-performance,
and stable storage I/O subsystem on Linux.
The following sections describe persistent PCI-X bus numbering,
persistent naming of Ethernet devices, the XSCSI subsystem, the XSCSI-SCSI
subsystem, the XFS filesystem, and the XVM Volume Manager.
This section covers the following topics:
XSCSI Naming Systems on SGI ProPack Systems
This section describes XSCSI naming systems on SGI ProPack systems.
Note: The XSCSI subsystem on SGI ProPack 3 systems is an I/O infrastructure
that leverages technology from the IRIX operating system to provide more
robust error handling, failover, and storage area network (SAN) infrastructure
support, as well as long-term, large system performance tuning. This subsystem
was not necessary for SGI ProPack 4 systems or later. However, the XSCSI
naming convention is still used on SGI ProPack 3, SGI ProPack 4,
SGI ProPack 5, and SGI ProPack 6 systems. XSCSI naming is used to provide
persistent naming for devices, by using the persistent PCI bus numbering.
For SGI Altix 450 and 4700 systems, see “PCI Domain Support for SGI Altix Systems”.
This section covers the following topics:
XSCSI Names on Non-blade Systems
For SGI ProPack 3 and SGI ProPack 4 non-blade systems, the XSCSI
name has the following forms.
For direct attached devices, the form of the XSCSI name
is, as follows:
/dev/xscsi/pciBB.SS.F[-C]/targetX/lunL/partT
For SAN attached devices, it is, as follows:
/dev/xscsi/pciBB.SS.F[-P]/nodeWWN/portP/lunL/partT
Where:
BB is the PCI bus number
SS is the Slot
F is the Function
C is the Channel ("-C" for QLA12160 HBA cards
only)
For direct attach devices:
X is the target number
For SAN attached devices:
WWN is the world wide node number of the device
P is the port number of the device
For either direct attach or SAN attach devices:
L is the logical unit number (LUN) in
lunL
T is the partition number
There are two ways of handling multiple port host bus adapter (HBA)
cards in PCI. One way is to have each port be a different function. The
other way is to have one function but have multiple channels off of the
function. Most HBA cards use multiple functions. Therefore, most HBA cards
will have differing F (function) numbers and the
-C will be absent. The QLA12160 (Qlogic Parallel SCSI) uses
multiple channels. Therefore, it will have one function, "0" and multiple
channels, that is, "0-1" and "0-2".
An example of a direct attached device with partition 1 is, as follows:
/dev/xscsi/pci01.02.0/target1/lun0/part1
The same device attached off of a Qlogic SCSI HBA card (or
IO9 base I/O) would be, as follows:
/dev/xscsi/pci01.02.0/target1/lun0/part1
An example of a SAN attached device is, as follows:
/dev/xscsi/pci22.01.1/node20000004cf2c84de/port1/lun0/part1
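The direct-attach name form can be taken apart mechanically. The sketch below is an illustration only: the function name xscsi_fields and its output format are hypothetical, it accepts an optional channel suffix but does not report it, and it prints nothing for names that do not match the direct-attach form (for example, SAN names).

```shell
#!/bin/sh
# Sketch (hypothetical helper): split a direct-attach XSCSI name of the form
# /dev/xscsi/pciBB.SS.F[-C]/targetX/lunL/partT into labelled fields.
xscsi_fields() {
    printf '%s\n' "$1" | sed -n \
        's|^/dev/xscsi/pci\([0-9a-f]*\)\.\([0-9a-f]*\)\.\([0-9]\)\(-[0-9]*\)\{0,1\}/target\([0-9]*\)/lun\([0-9]*\)/part\([0-9]*\)$|bus=\1 slot=\2 function=\3 target=\5 lun=\6 part=\7|p'
}

xscsi_fields /dev/xscsi/pci01.02.0/target1/lun0/part1
# prints: bus=01 slot=02 function=0 target=1 lun=0 part=1
```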
All SGI Altix systems running SGI ProPack 6 for Linux use domain-based
XSCSI names. The XSCSI names change to accommodate PCI Express. They
have the same form except that the PCI numbering takes the following form:
/dev/xscsi/pciDDDD:BB:SS.F[-C]/...
Where:
DDDD is the domain number
BB is the Bridge number
SS is the slot number
F is the function
C is the Channel
An example of a direct attach device with partition 1 is, as follows:
/dev/xscsi/pci0001:00:03.0-1/target1/lun0/part1
An example of a SAN attached device with partition 1 is, as follows:
/dev/xscsi/pci0002:00:01.0/node20000004cf2c8d0c/port2/lun0/part1
Note that the device number (slot number), function, WWN,
logical unit, and port number are fixed. These will never
change. However, the system bus number could change because of a hardware
problem (such as the I/O brick not booting) or the system being reconfigured.
For non-PCI Express host bus adapter (HBA) cards in an SGI Altix
4700 system, the domain number is equivalent to the old bus number and
the bridge number is 0. For more information, see “PCI Domain Support for SGI Altix Systems”.
Persistent Network Interface Names
Note: This section only applies to SGI Altix systems with
SGI ProPack 6 running SLES10 or SLES11.
Persistent Ethernet naming behavior in SGI ProPack 6 for Linux differs
from prior releases. This functionality is provided by the base operating
system.
The basic change is that the first time an SGI ProPack 6 system is
booted after installation, a udev rule defined in
/etc/udev/rules.d/31-net_create_names.rules is invoked that
enumerates all of the Ethernet devices on the system. It then writes another
rule file called /etc/udev/rules.d/30-net_persistent_names.rules.
This file contains a mapping of media access control (MAC)
addresses to Ethernet interface names. A specific physical interface is
always mapped to the same Ethernet interface name.
When a system is rebooted, the same interface names are always
mapped back to the same MAC addresses, even if some of the interfaces
have been removed in the interim. For example, if Ethernet 1 and Ethernet
3 devices were removed, a sparsely populated Ethernet space of Ethernet
0, Ethernet 2, and Ethernet 4 would result.
To re-enumerate the devices attached to your system, delete the
/etc/udev/rules.d/30-net_persistent_names.rules file. When the
system is rebooted, a new rules file is created for the current complement
of network devices.
For more information, see the SGI ProPack 6 /usr/share/doc/packages/sysconfig/README.Persistent_Interface_Names
file.
PCI Domain Support for SGI Altix Systems
Note: This section does not apply to SGI Altix XE systems.
On an SGI Altix system, a PCI domain is a
functional entity that includes a root bridge, subordinate buses under
the root bridge, and the peripheral devices it controls. Separation,
management, and protection of PCI domains is implemented and controlled
by system software.
Previously, a PCI device was identified by bus:slot:function. A PCI
device is now identified by domain:bus:slot:function.
Domains (sometimes referred to as PCI segments) are numbered from 0 to
ffff, buses from 0 to ff, slots from 0 to 1f, and functions from 0 to 7.
A domain is a root bridge (with the bus being numbered zero); subordinate
buses under that root bridge are buses 1-255.
In the past, a PA-brick was numbered bus 01, 02, 03, 04 (a bus for
each root bridge, that is, each TIO ASIC). With PCI domain support, each
root bridge is numbered with a domain/bus number 0001:00, 0002:00, 0003:00,
0004:00. If a subordinate bus is plugged into the root bridge bus, it
has the same domain number as the root bridge, but a different bus number
(for example, 0001:01).
For PCI Express, the PCIE root complex is its own domain. Each port
is a subordinate bus under that domain.
Domain numbers are allocated starting from the lowest module ID
(that is, rack/slot/blade location).
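The numbering ranges given in this section can be checked with a simple pattern. The sketch below is illustrative: the function name pci_addr_ok is hypothetical, and it assumes the fixed-width hexadecimal DDDD:BB:SS.F form used in the domain-based XSCSI names earlier in this chapter (four digits for the domain, two each for bus and slot, one for the function).

```shell
#!/bin/sh
# Sketch (hypothetical helper): succeed if ADDR has the form DDDD:BB:SS.F
# with domain 0000-ffff, bus 00-ff, slot 00-1f, and function 0-7.
pci_addr_ok() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{4}:[0-9a-f]{2}:[01][0-9a-f]\.[0-7]$'
}

# Example: a root bridge bus and a subordinate bus in domain 0001
pci_addr_ok 0001:00:03.0 && echo "0001:00:03.0 is well formed"
pci_addr_ok 0001:01:03.0 && echo "0001:01:03.0 is well formed"
```

The check is purely syntactic; it does not confirm that a device exists at the address.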
This section describes how to partition an SGI ProPack server
and contains the following topics:
Note: This section does not apply to SGI Altix XE systems.
A single SGI ProPack for Linux server can be divided
into multiple distinct systems, each with its own console, root filesystem,
and IP network address. Each of these software-defined groups of processors
is a distinct system referred to as a partition.
Each partition can be rebooted, loaded with software, powered down, and
upgraded independently. The partitions communicate with each other over
an SGI NUMAlink connection. Collectively, all of these partitions compose
a single, shared-memory cluster.
Direct memory access between partitions, sometimes referred to as
global shared memory, is made available by the XPC and XPMEM kernel modules.
This allows processes in one partition to access physical memory located
on another partition. The benefits of global shared memory are currently
available via SGI's Message Passing Toolkit (MPT) software.
It is relatively easy to configure a large SGI Altix system into
partitions and reconfigure the machine for specific needs. No cable changes
are needed to partition or repartition an SGI Altix machine. Partitioning
is accomplished by commands sent to the system controller. For details
on system controller commands, see the
SGI L1 and L2 Controller Software User's Guide.
Advantages of Partitioning
This section describes the advantages of partitioning
an SGI ProPack server as follows:
Create a Large, Shared-memory Cluster
You can use SGI's NUMAlink technology and the XPC and XPMEM
kernel modules to create a very low latency, very large, shared-memory
cluster for optimized use of Message Passing Interface (MPI) software
and logically shared, distributed memory access (SHMEM) routines. The
globally addressable, cache coherent, shared memory is exploited by MPI
and SHMEM to deliver high performance.
Provides Fault Containment
Another
reason for partitioning a system is fault containment. In most cases,
a single partition can be brought down (because of a hardware or software
failure, or as part of a controlled shutdown) without affecting the rest
of the system. Hardware memory protections prevent any unintentional
accesses to physical memory on a different partition from reaching and
corrupting that physical memory. For current fault containment caveats,
see “Limitations of Partitioning”.
You can power off and “warm swap” a failing C-brick
in a down partition while other partitions are powered up and booted.
For information see “Adding or Replacing a PCI or PCI-X Card”
in chapter 12, “Maintenance and Upgrade Procedures” in the
SGI Altix 3000 User's Guide or see “PCI and PCI-X Cards”
in Chapter 6, “Installing and Removing Customer-replaceable Units”
in the SGI Altix 350 User's Guide or see the appropriate chapter
in the new SGI Altix 4700 User's Guide.
Allows Variable Partition Sizes
Partitions
can be of different sizes, and a particular system can be configured in
more than one way. For example, a 128-processor system could be configured
into four partitions of 32 processors each or configured into two partitions
of 64 processors each. (See "Supported Configurations" for a list of supported
configurations for system partitioning.)
Your choice of partition size and number of partitions affects both
fault containment and scalability. For example, you may want to dedicate
all 64 processors of a system to a single large application during the
night, but then partition the system in two 32 processor systems for separate
and isolated use during the day.
Provide High Performance Clusters
One of the fundamental factors that determines the performance of a high-end
computer is the bandwidth and latency of the memory. The SGI NUMAflex
technology gives an SGI ProPack partitioned, shared-memory cluster a huge
performance advantage over a cluster of commodity Linux machines (white
boxes). Compared with a cluster of N white boxes, each with M CPUs, connected
via Ethernet, Myrinet, or InfiniBand, an SGI ProPack system with N partitions
of M CPUs provides superior performance because of the significantly lower
latency of the NUMAlink interconnect, which is exploited by the XPNET
kernel module.
Limitations of Partitioning
Partitioning can increase the reliability of a
system because power failures and other hardware errors can be contained
within a particular partition. There are still cases where the whole shared
memory cluster is affected; for example, during upgrades of hardware that
is shared by multiple partitions.
If a partition is sharing its memory with other partitions, the
loss of that partition may take down all other partitions that were accessing
its memory. This is currently possible when an MPI or SHMEM job is running
across partitions using the XPMEM kernel module.
Failures can usually be contained within a partition even when memory
is being shared with other partitions. XPC is invoked using normal shutdown
commands such as reboot(8) and halt(8) to ensure that all memory shared between
partitions is revoked before the partition resets. This is also done if
you remove the XPC kernel modules using the rmmod
(8) command. Unexpected failures such as kernel panics or hardware
failures almost always force the affected partition into the KDB kernel
debugger or the LKCD crash dump utility. These tools also invoke XPC to
revoke all memory shared between partitions before the partition resets.
XPC cannot be invoked for unexpected failures such as power failures and
spontaneous resets (not generated by the operating system), and thus
all partitions sharing memory with the partition may also reset.
See the SGI Altix 3000 User's Guide,
SGI Altix 350 User's Guide, SGI Altix 450 User's
Guide, or the SGI Altix 4700 User's Guide
for information on configurations that are supported for system partitioning.
Currently, the following guidelines are valid for the SGI ProPack 6 for
Linux release:
Maximum number of partitions supported is 48
Maximum partition size is 1024 cores
Maximum system size is 9726 cores
For additional information about configurations that are supported
for system partitioning, see your sales representative.
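The three limits listed above can be applied to a proposed layout before any controller commands are issued. This is a minimal sketch with a hypothetical function name, check_layout; it checks only the numeric limits stated in this section and knows nothing about the hardware-specific configurations covered in the user's guides.

```shell
#!/bin/sh
# Sketch (hypothetical helper): each argument is the core count of one
# proposed partition. Checks the limits listed above: at most 48
# partitions, 1024 cores per partition, and 9726 cores in total.
check_layout() {
    total=0
    [ "$#" -le 48 ] || { echo "too many partitions: $#"; return 1; }
    for cores in "$@"; do
        [ "$cores" -le 1024 ] || { echo "partition too large: $cores cores"; return 1; }
        total=$((total + cores))
    done
    [ "$total" -le 9726 ] || { echo "system too large: $total cores"; return 1; }
    echo "ok: $# partitions, $total cores"
}

check_layout 32 32 32 32
# prints: ok: 4 partitions, 128 cores
```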
Installing Partitioning Software and Configuring Partitions
To enable or disable partitioning software, see “Partitioning Software”.
To use the system partitioning capabilities, see “Partitioning Guidelines
for SGI Altix 3000 Series Systems” and “Partitioning a System”.
This section covers the following topics:
SGI ProPack for Linux servers
have XP, XPC, XPNET, and XPMEM kernel modules installed by default to
provide partitioning support. XPC and XPNET are configured off by default
in the /etc/sysconfig/sgi-xpc and /etc/sysconfig/sgi-xpnet
files, respectively. XPMEM is configured on by default in
the /etc/sysconfig/sgi-xpmem file. To enable or disable
any of these features, edit the appropriate /etc/sysconfig/
file and execute the /etc/init.d/sgi-xp script.
On SGI ProPack systems running SLES10 or SLES11, if you intend to
use the cross-partition functionality of XPMEM, you will need to add
xpc to the line in the /etc/sysconfig/kernel
file that begins with MODULES_LOADED_ON_BOOT. Once
that is added, you may either reboot the system or issue a
modprobe xpc command to get the cross-partition functionality
to start working. For more information on using modprobe,
see the modprobe(8) man page.
To activate xpc on future boots on SGI ProPack
6 systems running RHEL5, you need to add a modprobe xpc
line to the /etc/sysconfig/modules/sgi-propack.modules
file instead of adding the module name to the MODULES_LOADED_ON_BOOT
line.
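For the SLES case described above, the edit to the MODULES_LOADED_ON_BOOT line can be sketched as a text filter. The function name add_boot_module is hypothetical, and the sketch assumes the line has the form MODULES_LOADED_ON_BOOT="..." with a double-quoted, space-separated value. It prints the modified file to standard output so that the result can be reviewed before it replaces the original.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print FILE with MODULE appended to the
# MODULES_LOADED_ON_BOOT="..." line, unless MODULE is already listed.
add_boot_module() {
    if grep -Eq "^MODULES_LOADED_ON_BOOT=\"(.* )?$2( .*)?\"" "$1"; then
        cat "$1"    # already present: pass the file through unchanged
    else
        # First expression handles an empty value; "t" then skips the second
        sed -e "s/^MODULES_LOADED_ON_BOOT=\"\"/MODULES_LOADED_ON_BOOT=\"$2\"/" \
            -e t \
            -e "s/^MODULES_LOADED_ON_BOOT=\"\(.*\)\"/MODULES_LOADED_ON_BOOT=\"\1 $2\"/" "$1"
    fi
}

# Possible use (review the output before installing it):
#   add_boot_module /etc/sysconfig/kernel xpc > /tmp/kernel.new
```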
The XP kernel module is a simple module which coordinates activities
between XPC, XPMEM, and XPNET. All of the other cross-partition kernel
modules require XP to function.
The XPC kernel module provides fault-tolerant, cross-partition communication
channels over NUMAlink for use by the XPNET and XPMEM kernel modules.
The XPNET kernel module implements an Internet protocol (IP) interface
on top of XPC to provide high-speed network access via NUMAlink. XPNET
can be used by applications to communicate between partitions via NUMAlink,
to mount file systems across partitions, and so on. The XPNET driver is
configured using the ifconfig command. For more information,
see the ifconfig(8) man page. The procedure
for configuring the XPNET kernel module as a network driver is essentially
the same as the procedure used to configure the Ethernet driver. You can
configure the XPNET driver at boot time like the Ethernet interfaces by
using the configuration files in /etc/sysconfig/network-scripts.
To configure the XPNET driver as a network driver, see the
following procedure.
Procedure 2-1. Setting up Networking Between Partitions
The procedure for configuring the XPNET driver as a network driver is
essentially the same as the procedure used to configure the Ethernet
driver (eth0), as follows:
Log in as root.
On an SGI ProPack 3 system, configure the xp0 IP address.
For later SGI ProPack systems, configure the xp0 IP address using yast2.
For information on using yast2, see the SUSE LINUX Enterprise Server 9
Installation and Administration manual. The driver's full name inside
yast2 is SGI Cross Partition Network adapter.
Add the network address for the xp0 interface by editing the /etc/hosts file.
Reboot your system or restart networking.
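The /etc/hosts step in the procedure above can be made repeatable with a small helper. This is a sketch only: the function name add_host_entry is hypothetical, and the address 192.0.2.10 and host name part1-xp0 in the usage line are placeholders, not values from this manual.

```shell
#!/bin/sh
# Sketch (hypothetical helper): append "ADDRESS NAME" to FILE unless NAME
# already appears as a host name on a non-comment line.
add_host_entry() {
    if awk -v n="$3" '/^#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$1"; then
        return 0    # name already present: leave the file unchanged
    fi
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Possible use (placeholder address and name):
#   add_host_entry /etc/hosts 192.0.2.10 part1-xp0
```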
The XPMEM kernel module provides direct access to memory located
on other partitions. It uses XPC internally to communicate with XPMEM
kernel modules on other partitions to accomplish this. XPMEM is currently
used by SGI's Message Passing Toolkit (MPT) software (MPI and SHMEM).
Partitioning rules define the set of valid configurations for a
partitioned system. Fault isolation is one of the major reasons for partitioning
a system. A software or hardware failure in one partition should not cause
a failure in another partition. This section describes the restrictions that
are placed on partitions to accomplish this. This section covers these topics:
Partitioning Guidelines for SGI Altix 3000 Series Systems
Follow these guidelines when partitioning your system:
A partition must be made up of one or more C-bricks. The
number of C-bricks in your system determines the number of partitions
you can create. The number of partitions cannot exceed the number of C-bricks
your system contains. The first C-brick in each partition must have an
IX-brick attached to it via the XIO connection of the C-brick.
You need at least as many IX-bricks for base IO as partitions
you wish to use.
Each partition needs to have an IX-brick with a valid
system disk in it. Since each partition is a separate running system,
each system disk should be configured with a different IP address/system
name, and so on.
Each partition must have a unique partition ID number
between 1 and 63, inclusive.
All bricks in a partition must be physically contiguous.
The route between any two processors in the same partition must be contained
within that partition, and not through any other partition. If the bricks
in a partition are not contiguous, the system will not boot.
Each partition must contain the following components:
Partitioning Guidelines for SGI Altix 450 Series Systems
Partitioning guidelines on an SGI Altix 450 system are as follows:
The minimum granularity for a partition is one IRU (ideally
with its own power supply setup). On Altix 450 systems, this means four
compute blades is the minimum level of hardware isolation.
Each partition must have the infrastructure to run as
a standalone system. This infrastructure includes a system disk and console
connection.
An I/O blade belongs to the partition that the attached
IRU belongs to. I/O blades cannot be shared by two partitions.
Peripherals, such as dual-ported disks, can be shared
the same way two nodes in a cluster can share peripherals.
Partitions must be contiguous in the topology (that is,
the route between any two nodes in the same partition must be contained
within that partition and must not pass through any other partition). This
allows intra-partition communication to be independent of other partitions.
Partitions must be fully interconnected. That is to say,
for any two partitions, there is a direct route between those partitions
without passing through a third. This is required to fulfill true isolation
of a hardware or software fault to the partition in which it occurs.
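The contiguity and full-interconnection rules above can be checked on paper before any cabling is changed. The following is a minimal sketch, assuming the topology is modeled as a list of direct links between IRUs. The names is_contiguous, fully_interconnected, and the iru0 through iru3 labels are hypothetical, and shared routers are not modeled.

```python
from collections import deque
from itertools import combinations


def is_contiguous(nodes, links):
    """True if all of `nodes` can reach each other using only links
    whose two endpoints are both inside `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for a, b in links:
            if current not in (a, b):
                continue
            neighbor = b if current == a else a
            # Only follow links that stay inside the partition.
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == nodes


def fully_interconnected(partitions, links):
    """True if every pair of partitions shares at least one direct link."""
    owner = {n: pid for pid, members in partitions.items() for n in members}
    linked = set()
    for a, b in links:
        pa, pb = owner.get(a), owner.get(b)
        if pa is not None and pb is not None and pa != pb:
            linked.add(frozenset((pa, pb)))
    return all(frozenset(pair) in linked
               for pair in combinations(partitions, 2))


# Hypothetical four-IRU layout: a ring plus both diagonals.
links = [("iru0", "iru1"), ("iru1", "iru2"), ("iru2", "iru3"),
         ("iru3", "iru0"), ("iru0", "iru2"), ("iru1", "iru3")]
partitions = {1: {"iru0", "iru1"}, 2: {"iru2"}, 3: {"iru3"}}

print(all(is_contiguous(m, links) for m in partitions.values()))
print(fully_interconnected(partitions, links))
```

A proposed layout that fails either check would violate the guidelines above. A layout that passes still needs to be confirmed against the hardware configuration guide for the actual system.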
Partitioning Guidelines for SGI Altix 4000 Series Systems
Partitioning guidelines on an SGI Altix 4700 system are as follows:
The minimum granularity for a partition is one individual
rack unit (IRU) (ideally with its own power supply setup). On SGI Altix
4700 systems, this means eight compute blades is the minimum level of
hardware isolation.
Each partition must have the infrastructure to run as
a standalone system. This infrastructure includes a system disk and console
connection.
An I/O blade belongs to the partition to which the attached
IRU belongs. I/O blades cannot be shared by two partitions.
Peripherals, such as dual-ported disks, can be shared
the same way two nodes in a cluster can share peripherals.
Partitions must be contiguous in the topology (that is,
the route between any two nodes in the same partition must be contained
within that partition and must not pass through any other partition). This
allows intra-partition communication to be independent of other partitions.
Quad-dense meta-routers (32-port routers) are a shared
resource. A single quad-dense meta-router can connect to multiple partitions.
Partitions must be fully interconnected. That is, for
any two partitions, there is a direct route between those partitions without
passing through a third. This is required to fulfill true isolation of
a hardware or software fault to the partition in which it occurs.
When the total system is greater than 16 IRUs (128 SHubs),
it runs in coarse mode. In coarse mode, the minimum partition size is
two IRUs (16 SHubs).
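The coarse-mode rule above is simple arithmetic and can be sketched as follows. The figure of eight SHubs per IRU is derived from the text (16 IRUs correspond to 128 SHubs). The function name minimum_partition is hypothetical.

```python
SHUBS_PER_IRU = 8            # derived from the text: 16 IRUs = 128 SHubs
COARSE_MODE_ABOVE_IRUS = 16  # systems larger than this run in coarse mode


def minimum_partition(total_irus):
    """Return (coarse, min_irus, min_shubs) for an Altix 4700 system
    made up of `total_irus` IRUs."""
    if total_irus < 1:
        raise ValueError("a system needs at least one IRU")
    coarse = total_irus > COARSE_MODE_ABOVE_IRUS
    min_irus = 2 if coarse else 1
    return coarse, min_irus, min_irus * SHUBS_PER_IRU


print(minimum_partition(16))  # (False, 1, 8)
print(minimum_partition(20))  # (True, 2, 16)
```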
This
section describes how to partition your system.
Procedure 2-2. Partitioning a System
Into Four Partitions
To partition your system, perform the following steps:
Make sure your system can be partitioned. See “Partitioning Guidelines for SGI Altix 3000 Series Systems”.
You can use the Connect to System
Controller task of the SGIconsole Console Manager GUI to connect
to the L2 controller of the system you want to partition. The L2 controller
must appear as a node in the SGIconsole configuration. For information
on how to use SGIconsole, see the
Console Manager for SGIconsole Administrator's Guide.
Using the L2 terminal (l2term), connect to
the L2 controller of the system you wish to partition. After the connection
to the L2 controller is established, an L2> prompt appears, indicating that the L2 is
ready to accept commands, for example: cranberry-192.168.11.92-L2>: |
If the L2 prompt does not appear, you can type Ctrl-T.
To remain at the L2 command prompt, type l2 (lowercase
letter 'L') at the L2> prompt.
 | Note: Each partition has its own set of the following PROM environment
variables: ConsolePath, OSLoadPartition,
SystemPartition, netaddr, and root.
For more information on using the L2 controller, see the
SGI L1 and L2 Controller Software User's Guide.
You can partition a system from an SGIconsole Console Manager system
console connection; however, SGIconsole does not include any awareness
of partitions in its GUI (in the node tree view, for instance) or in its commands.
Its tasks provide no way to power down a group of partitions, to get all the
logs of a partitioned system, or to actually partition a system. If you
partition a node that is managed by SGIconsole, make sure to edit the
partition number of the node using the Modify a Node
task. For more information, see the Console Manager for SGIconsole
Administrator's Guide.
|
Use the L2 sel command to
list the available consoles, as follows: cranberry-192.168.11.92-L2>sel |
 | Note: If the Linux operating system
is currently executing, perform the proper shutdown procedures before
you partition your system.
|
This step shows an example of how to partition
a system into four separate partitions.
To see the current configuration of
your system, use the L2 cfg command to display the
available bricks, as follows: cranberry-192.168.11.92-L2>cfg
L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
L1 192.168.11.92:8:0 - 001c31
L1 192.168.11.92:8:1 - 001i34
L1 192.168.11.92:11:0 - 001c31
L1 192.168.11.92:11:1 - 001i34
L1 192.168.11.92:6:0 - 001r29
L1 192.168.11.92:9:0 - 001r27
L1 192.168.11.92:7:0 - 001c24
L1 192.168.11.92:7:1 - 101i25
L1 192.168.11.92:10:0 - 001c24
L1 192.168.11.92:10:1 - 101i25
L1 192.168.11.92:0:0 - 001r22
L1 192.168.11.92:3:0 - 001r20
L1 192.168.11.92:2:0 - 001c14
L1 192.168.11.92:2:1 - 101i21
L1 192.168.11.92:5:0 - 001c14
L1 192.168.11.92:5:1 - 101i21
L1 192.168.11.92:1:0 - 001c11
L1 192.168.11.92:1:1 - 001i07
L1 192.168.11.92:4:0 - 001c11
L1 192.168.11.92:4:1 - 001i07 |
In this step, you need to decide which bricks
to put into which partitions.
You can determine which C-bricks are directly attached to IX-bricks
by looking at the output from the cfg command. Consult
the hardware configuration guide for the partitioning layout for your
particular system. In the cfg output above, you can
check the number after the IP address. For example, 001c31
is attached to 001i34, which is indicated by the fact
that they both have 11 after their
respective IP addresses.
 | Note: On some systems, you will have a rack
ID in place of the IP address. 001c31 is a C-brick
(designated by the c in 001c31)
and 001i34 is an IX-brick (designated with an
i in 001i34).
|
Another pair is 101i25 and 001c24.
They both have 10 after the IP
address. The brick names containing an r designation
are routers. Routers do not need to be assigned to a specific partition
number.
In this example, the maximum number of partitions this system can
have is four. There are only four IX-bricks total: 001i07,
101i21, 101i25, and 001i34.
 | Note: Some IX-brick names appear twice. This occurs because some IX-bricks
have dual XIO connections.
You do not have to explicitly assign IX-bricks to a partition. The
IX-bricks assigned to a partition are inherited from the C-bricks.
|
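The pairing rule described above (a C-brick and an IX-brick are attached when they share the number after the IP address) can be sketched as a small parser for saved cfg output. This is only an illustration. It assumes the line layout shown in this example, with an IP address rather than a rack ID, and the names CFG_LINE, parse_cfg, attachments, and ix_bricks are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "L1 192.168.11.92:8:0 - 001c31" (optionally
# followed by ".N" once a partition number has been assigned).
CFG_LINE = re.compile(
    r"^L1\s+\S+:(?P<port>\d+):(?P<index>\d+)\s+-\s+"
    r"(?P<name>\d{3}[a-z]\d{2})(?:\.\d+)?$"
)


def parse_cfg(cfg_text):
    """Yield (port, brick_name) for every L1 line in the cfg output."""
    for line in cfg_text.splitlines():
        match = CFG_LINE.match(line.strip())
        if match:
            yield match["port"], match["name"]


def attachments(cfg_text):
    """Map each C-brick to the IX-bricks that share its port number."""
    by_port = defaultdict(set)
    for port, name in parse_cfg(cfg_text):
        by_port[port].add(name)
    pairs = defaultdict(set)
    for bricks in by_port.values():
        ix = {b for b in bricks if b[3] == "i"}
        for c in (b for b in bricks if b[3] == "c"):
            pairs[c] |= ix
    return dict(pairs)


def ix_bricks(cfg_text):
    """Distinct IX-bricks; names that appear twice are counted once."""
    return {name for _, name in parse_cfg(cfg_text) if name[3] == "i"}


SAMPLE = """\
L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
L1 192.168.11.92:8:0 - 001c31
L1 192.168.11.92:8:1 - 001i34
L1 192.168.11.92:11:0 - 001c31
L1 192.168.11.92:11:1 - 001i34
L1 192.168.11.92:6:0 - 001r29
L1 192.168.11.92:7:0 - 001c24
L1 192.168.11.92:7:1 - 101i25
"""

print(attachments(SAMPLE))
print(len(ix_bricks(SAMPLE)))  # 2
```

Because each partition needs its own IX-bricks, the number of distinct IX-bricks gives an upper bound on the number of partitions, which is how the limit of four is reached for the full output in this example.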
When you specify bricks to L2 commands, you
use a rack.slot naming convention. To configure
the system into four partitions, do not specify the whole brick name (
001c31) but rather use the designation 1.31 as follows: cranberry-192.168.11.92-L2>1.31 brick part 1
001c31:
brick partition set to 1.
cranberry-192.168.11.92-L2>1.34 brick part 1
001#34:
brick partition set to 1.
cranberry-192.168.11.92-L2>1.24 brick part 2
001c24:
brick partition set to 2.
cranberry-192.168.11.92-L2>101.25 brick part 2
101#25:
brick partition set to 2.
cranberry-192.168.11.92-L2>1.14 brick part 3
001c14:
brick partition set to 3.
cranberry-192.168.11.92-L2>101.21 brick part 3
101i21:
brick partition set to 3.
cranberry-192.168.11.92-L2>1.11 brick part 4
001c11:
brick partition set to 4.
cranberry-192.168.11.92-L2>1.07 brick part 4
001#07:
brick partition set to 4. |
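The rack.slot convention used in the commands above can be derived mechanically from the brick names. The following is a minimal sketch that only builds the command strings; it does not talk to the L2 controller. The function names rack_slot and brick_part_commands are hypothetical, and the 1 to 63 range check comes from the partitioning guidelines.

```python
import re

BRICK_NAME = re.compile(r"^(\d{3})([a-z])(\d{2})$")


def rack_slot(brick_name):
    """Convert a brick name such as '001c31' to the '1.31' form."""
    match = BRICK_NAME.match(brick_name)
    if not match:
        raise ValueError(f"unrecognized brick name: {brick_name!r}")
    rack, _kind, slot = match.groups()
    # Leading zeros are dropped from the rack number; the slot is kept
    # as written, matching the '1.07' form used in this procedure.
    return f"{int(rack)}.{slot}"


def brick_part_commands(assignment):
    """Build 'brick part' command strings from {partition_id: [bricks]}."""
    commands = []
    for partition_id in sorted(assignment):
        if not 1 <= partition_id <= 63:
            raise ValueError(f"partition ID {partition_id} is not in 1..63")
        for brick in assignment[partition_id]:
            commands.append(f"{rack_slot(brick)} brick part {partition_id}")
    return commands


layout = {1: ["001c31", "001i34"], 2: ["001c24", "101i25"]}
for command in brick_part_commands(layout):
    print(command)
```

For the layout shown, the sketch prints 1.31 brick part 1, 1.34 brick part 1, 1.24 brick part 2, and 101.25 brick part 2, which are the first four commands entered in this step.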
To confirm your settings, enter the
cfg command again, as follows:
 | Note: This may take up to
30 seconds.
|
cranberry-192.168.11.92-L2>cfg
L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
L1 192.168.11.92:8:0 - 001c31.1
L1 192.168.11.92:8:1 - 001i34.1
L1 192.168.11.92:11:0 - 001c31.1
L1 192.168.11.92:11:1 - 001i34.1
L1 192.168.11.92:6:0 - 001r29
L1 192.168.11.92:9:0 - 001r27
L1 192.168.11.92:7:0 - 001c24.2
L1 192.168.11.92:7:1 - 101i25.2
L1 192.168.11.92:10:0 - 001c24.2
L1 192.168.11.92:10:1 - 101i25.2
L1 192.168.11.92:0:0 - 001r22
L1 192.168.11.92:3:0 - 001r20
L1 192.168.11.92:2:0 - 001c14.3
L1 192.168.11.92:2:1 - 101i21.3
L1 192.168.11.92:5:0 - 001c14.3
L1 192.168.11.92:5:1 - 101i21.3
L1 192.168.11.92:1:0 - 001c11.4
L1 192.168.11.92:1:1 - 001i07.4
L1 192.168.11.92:4:0 - 001c11.4
L1 192.168.11.92:4:1 - 001i07.4 |
The system is now partitioned. However, you
need to reset each partition to complete the configuration, as follows:
cranberry-192.168.11.92-L2>p 1,2,3,4 rst |
 | Note: You can use a shortcut to reset every partition, as follows: cranberry-192.168.11.92-L2>p * rst |
|
To get to the individual console of a partition,
such as partition 2, enter the following: cranberry-192.168.11.92-L2>sel p 2 |
For more information on accessing the
console of a partition, see “Accessing the Console on a Partitioned System”.
Procedure 2-3. Partitioning a System into Two Partitions
To partition your system, perform the following steps: Perform steps 1 through 5 in Procedure 2-2.
To configure the system into two partitions,
enter the following commands: cranberry-192.168.11.92-L2>1.31 brick part 1
001c31:
brick partition set to 1.
cranberry-192.168.11.92-L2>1.34 brick part 1
001#34:
brick partition set to 1.
cranberry-192.168.11.92-L2>1.24 brick part 1
001c24:
brick partition set to 1.
cranberry-192.168.11.92-L2>101.25 brick part 1
101#25:
brick partition set to 1.
cranberry-192.168.11.92-L2>1.14 brick part 2
001c14:
brick partition set to 2.
cranberry-192.168.11.92-L2>101.21 brick part 2
101i21:
brick partition set to 2.
cranberry-192.168.11.92-L2>1.11 brick part 2
001c11:
brick partition set to 2.
cranberry-192.168.11.92-L2>1.7 brick part 2
001#07:
brick partition set to 2. |
To confirm your settings, issue the
cfg command again, as follows:
 | Note: This may take up to
30 seconds.
|
cranberry-192.168.11.92-L2>cfg
L2 192.168.11.92: - --- (no rack ID set) (LOCAL)
L1 192.168.11.92:8:0 - 001c31.1
L1 192.168.11.92:8:1 - 001i34.1
L1 192.168.11.92:11:0 - 001c31.1
L1 192.168.11.92:11:1 - 001i34.1
L1 192.168.11.92:6:0 - 001r29
L1 192.168.11.92:9:0 - 001r27
L1 192.168.11.92:7:0 - 001c24.1
L1 192.168.11.92:7:1 - 101i25.1
L1 192.168.11.92:10:0 - 001c24.1
L1 192.168.11.92:10:1 - 101i25.1
L1 192.168.11.92:0:0 - 001r22
L1 192.168.11.92:3:0 - 001r20
L1 192.168.11.92:2:0 - 001c14.2
L1 192.168.11.92:2:1 - 101i21.2
L1 192.168.11.92:5:0 - 001c14.2
L1 192.168.11.92:5:1 - 101i21.2
L1 192.168.11.92:1:0 - 001c11.2
L1 192.168.11.92:1:1 - 001i07.2
L1 192.168.11.92:4:0 - 001c11.2
L1 192.168.11.92:4:1 - 001i07.2
|
Now the system has two partitions. To complete
the configuration, reset the two partitions as follows:
cranberry-192.168.11.92-L2>p 1,2 rst |
Determining If a System is Partitioned
Procedure 2-4. Determining If a System Is Partitioned
To determine whether
a system is partitioned or not, perform the following steps: Use the L2term to connect to the L2 controller of the
system.
 | Note: If you are connected to the L2 controller, but do not
have the L2 prompt, try typing the following: CTRL-t.
|
Use the cfg command to determine
if the system is partitioned, as follows: cranberry-192.168.11.92-L2>cfg
L2 192.168.11.92: -(no rack ID set) (LOCAL)
L1 192.168.11.92:8:0 - 001c31.1
L1 192.168.11.92:8:1 - 001i34.1
L1 192.168.11.92:11:0 - 001c31.1
L1 192.168.11.92:11:1 - 001i34.1
L1 192.168.11.92:6:0 - 001r29
L1 192.168.11.92:9:0 - 001r27
L1 192.168.11.92:7:0 - 001c24.2
L1 192.168.11.92:7:1 - 101i25.2
L1 192.168.11.92:10:0 - 001c24.2
L1 192.168.11.92:10:1 - 101i25.2
L1 192.168.11.92:0:0 - 001r22
L1 192.168.11.92:3:0 - 001r20
L1 192.168.11.92:2:0 - 001c14.3
L1 192.168.11.92:2:1 - 101i21.3
L1 192.168.11.92:5:0 - 001c14.3
L1 192.168.11.92:5:1 - 101i21.3
L1 192.168.11.92:1:0 - 001c11.4
L1 192.168.11.92:1:1 - 001i07.4
L1 192.168.11.92:4:0 - 001c11.4
L1 192.168.11.92:4:1 - 001i07.4 |
See the explanation of the output from the
cfg command in Procedure 2-2.
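If the cfg output has been captured to a file or log, the same check can be scripted. The following is a minimal sketch, assuming that a partitioned brick is shown with a dot and a partition number after its name, as in the output above. The names partitions_from_cfg and is_partitioned are hypothetical.

```python
import re
from collections import defaultdict

# Brick name at the end of an L1 line, with an optional ".N" suffix.
BRICK = re.compile(r"-\s+(\d{3}[a-z]\d{2})(?:\.(\d+))?\s*$")


def partitions_from_cfg(cfg_text):
    """Group brick names by partition number; bricks with no suffix
    (for example, routers) are left out."""
    groups = defaultdict(set)
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line.startswith("L1"):
            continue
        match = BRICK.search(line)
        if match and match.group(2):
            groups[int(match.group(2))].add(match.group(1))
    return dict(groups)


def is_partitioned(cfg_text):
    """True if any brick carries a partition suffix."""
    return bool(partitions_from_cfg(cfg_text))


SAMPLE = """\
L1 192.168.11.92:8:0 - 001c31.1
L1 192.168.11.92:8:1 - 001i34.1
L1 192.168.11.92:6:0 - 001r29
L1 192.168.11.92:7:0 - 001c24.2
L1 192.168.11.92:7:1 - 101i25.2
"""

print(is_partitioned(SAMPLE))            # True
print(sorted(partitions_from_cfg(SAMPLE)))  # [1, 2]
```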
Accessing the Console on a Partitioned System
Procedure 2-5. Accessing the Console on a Partitioned System
To access the
console on a partition, perform the following steps: Use the L2term to connect to the L2 controller of the system.
 | Note: If you are connected to the L2 controller, but do not
have the L2 prompt, try typing the following: CTRL-t.
|
To see output that shows which C-bricks have
system consoles, enter the sel command without options
on a partitioned system as follows: cranberry-192.168.11.92-L2>sel
known system consoles (partitioned)
partition 1: 001c31 - L2 detected
partition 2: 001c24 - L2 detected
partition 3: 001c14 - L2 detected
partition 4: 001c11 - L2 detected
current system console
console input: not defined
console output: not filtered |
The output from the sel command shows that there
are four partitions defined.
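When the sel output is saved as text, the partition-to-console mapping can be extracted with a short script. This is a sketch that assumes the line layout shown above. The names partition_consoles and select_command are hypothetical, and the script only builds the sel p command string rather than sending it to the L2 controller.

```python
import re

# Matches lines such as "partition 2: 001c24 - L2 detected".
CONSOLE_LINE = re.compile(r"^partition\s+(\d+):\s+(\S+)\s+-")


def partition_consoles(sel_text):
    """Return {partition_number: console_brick} from sel output."""
    consoles = {}
    for line in sel_text.splitlines():
        match = CONSOLE_LINE.match(line.strip())
        if match:
            consoles[int(match.group(1))] = match.group(2)
    return consoles


def select_command(partition_id, consoles):
    """Build the command that selects the console of one partition."""
    if partition_id not in consoles:
        raise KeyError(f"no console known for partition {partition_id}")
    return f"sel p {partition_id}"


SEL_OUTPUT = """\
known system consoles (partitioned)
partition 1: 001c31 - L2 detected
partition 2: 001c24 - L2 detected
partition 3: 001c14 - L2 detected
partition 4: 001c11 - L2 detected
current system console
console input: not defined
console output: not filtered
"""

consoles = partition_consoles(SEL_OUTPUT)
print(len(consoles))                 # 4
print(select_command(2, consoles))   # sel p 2
```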
To get to the console of partition 2, for example,
enter the following: cranberry-192.168.11.92-L2>sel p 2 |
To connect to the console of partition 2, enter
Ctrl-d.
When a system is partitioned, the L2 prompt shows the partition
number of the partition you selected, as follows: cranberry-001-L2>sel p 2
console input: partition 2, 001c24 console0
console output: any brick partition 2
cranberry-001-L2:p2> |
Procedure 2-6. Unpartitioning a System
To
remove the partitions from a system, perform the following steps: Use the L2term to connect to the L2 controller of the
system.
 | Note: If you are connected to the L2 controller, but do not
have the L2 prompt, try typing the following: CTRL-t.
|
Shut down the Linux operating system running
on each partition before unpartitioning a system.
To set the partition ID on all bricks to zero,
enter the following command: cranberry-192.168.11.92-L2>r * brick part 0 |
To confirm that all the partitions on your
system have been removed, enter the following command: cranberry-192.168.11.92-L2>cfg |
The bricks in the list no longer have a dot followed by a number in
their names (see “Determining If a System is Partitioned”).
To reset all of the bricks, enter the following
command: cranberry-192.168.11.92-L2>r * rst |
To get to the system console for the newly
unpartitioned system, you need to reset the select setting as follows: cranberry-192.168.11.92-L2>sel reset |
To get the console (assuming you still have
the L2 prompt), enter Ctrl-d.
Connecting the System Console to the Controller
System partitioning is an administrative function. The system console
is connected to the controller as required by the configuration selected
when an SGI ProPack system is installed. For additional information or
recabling, contact your service representative.
Making Array Services Operational
This section describes how
to get Array Services operational on your system. For detailed information
on Array Services, see chapter 3, “Array Services”, in the
Linux Resource Administration Guide.
 | Note: This section does not apply to SGI Altix XE systems.
|
Standard Array Services is installed by default on an SGI ProPack
6 system. To install Secure Array Services, use YaST Software Management
and its Filter->search function to search for
Secure Array Services by name (sarraysvcs).
Procedure 2-7. Making Array Services Operational
To make Array Services operational on your system, perform the following
steps:
 | Note: Most of the steps to install Array Services are now performed
automatically when the Array Services RPM is installed. To complete installation,
perform the steps that follow.
|
Make sure that the setting in the
/usr/lib/array/arrayd.auth file is appropriate for your site.
 | Caution: Changing the AUTHENTICATION parameter from
NOREMOTE to NONE may have a negative
security impact on your site.
|
Make sure that the list of machines in your
cluster is included in one or more array definitions in the /usr/lib/array/arrayd.conf
file.
To determine if Array Services is correctly
installed, run the following command: You should see yourself listed.
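The check of the AUTHENTICATION parameter in the first step can be partly automated. The following is only a sketch. It assumes that the setting appears in arrayd.auth as a line of the form AUTHENTICATION value and that text after a # character is a comment; neither assumption is confirmed by this guide. The function names authentication_mode and is_risky are hypothetical.

```python
def authentication_mode(auth_text):
    """Return the last AUTHENTICATION value found (upper-cased),
    or None if no such line is present."""
    mode = None
    for raw in auth_text.splitlines():
        # Assumption: '#' starts a comment that runs to end of line.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0].upper() == "AUTHENTICATION":
            mode = fields[1].upper()
    return mode


def is_risky(auth_text):
    """True if authentication is disabled (AUTHENTICATION NONE)."""
    return authentication_mode(auth_text) == "NONE"


example = "# site policy\nAUTHENTICATION NOREMOTE\n"
print(authentication_mode(example))  # NOREMOTE
print(is_risky(example))             # False
```

In practice, the text would be read from /usr/lib/array/arrayd.auth. A result of NONE corresponds to the setting that the caution above warns about.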
Floating Point Assist Warnings from Applications
Some applications can generate an excessive number of kernel
KERN_WARN "floating point assist" warning messages. This section
describes how you can make these messages disappear for specific applications
or for specific users.
 | Note: This section does not apply to SGI Altix XE systems.
|
An application generates a "floating point assist" trap when a floating
point operation involves a corner case of operand value(s) that the Itanium
processor cannot handle in hardware and requires kernel emulation software
to complete.
Sometimes the application can be recompiled with specific options
that will avoid such problematic operand value(s). Using the
-ffast-math option with gcc or the
-ftz option with the Intel compiler might be effective. Other
instances can be avoided by recoding the application to avoid denormalized
or non-finite operands.
When recoding or recompiling is ineffective or infeasible, the user
running the application or the system administrator can avoid having these
"floating point assist" syslog messages appear by using the
prctl(1) command, as follows: % prctl --fpemu=silent command |
The command and every child process of
the command invoked with prctl
will produce no "floating point assist" messages. The command
can be a single application that is producing unwanted
syslog messages or it may be any shell script. For example, the "floating
point assist" messages for the user as a whole, that is, for all applications
that the user may execute, can be silenced by changing the /etc/passwd
entry of the user to invoke a custom script at login, instead
of executing (for instance) /bin/bash. That custom
script then launches the user's high level shell using the following: prctl --fpemu=silent /bin/bash |
Even if the syslog messages are silenced, the "assist" traps will
continue to occur and the kernel software will still handle the problem.
If these traps occur at a high enough frequency, the application performance
may suffer, and because the messages are silenced, these occurrences are not logged.
The syslog messages can be made to reappear by executing the
prctl command again, as follows: % prctl --fpemu=default command |
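A job launcher can apply the same prctl wrapper programmatically. The following is a minimal sketch that uses only the --fpemu=silent and --fpemu=default forms shown above and assumes that the prctl command is on the search path. The function names fpemu_argv and run_with_fpemu are hypothetical.

```python
import subprocess


def fpemu_argv(command, mode="silent"):
    """Prefix `command` (a list of arguments) with the prctl wrapper."""
    if mode not in ("silent", "default"):
        raise ValueError("mode must be 'silent' or 'default'")
    if not command:
        raise ValueError("command must not be empty")
    return ["prctl", f"--fpemu={mode}", *command]


def run_with_fpemu(command, mode="silent"):
    """Run `command` under prctl and return its exit status."""
    return subprocess.run(fpemu_argv(command, mode), check=False).returncode


# Only the argument list is printed here; nothing is executed.
print(fpemu_argv(["/bin/bash"]))
```

Calling run_with_fpemu(["./my_app"]) would start the hypothetical program my_app with the messages silenced for it and for its child processes, as described above.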
Unaligned Access Warnings
 | Note: This section does not apply to SGI Altix XE systems.
|
This section describes unaligned access warnings, as follows:
The kernel generates unaligned access warnings in syslog
and on the console when applications perform misaligned loads and stores.
This is normally a sign of badly written code and an indication that the
application should be fixed.
Use the prctl(1) command
to disable these messages on a per application basis.
SLES10 offers a way for system administrators
to disable these messages on a system-wide basis. This is generally discouraged,
but it is useful when a system is used to run third-party applications
that cannot be fixed by the system owner.
In order to disable these messages on a system wide level, do the
following as root: echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap |
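The same setting can be applied from an administration script. The following is a minimal sketch that writes to the file named above. The function name set_ignore_unaligned is hypothetical, and writing 0 to turn the warnings back on is an assumption that this guide does not state.

```python
from pathlib import Path

TRAP_CONTROL = Path("/proc/sys/kernel/ignore-unaligned-usertrap")


def set_ignore_unaligned(enabled, control=TRAP_CONTROL):
    """Write 1 (suppress warnings) or 0 (assumed: show warnings again)
    to the control file. Must be run as root on a real system."""
    control = Path(control)
    if not control.exists():
        raise FileNotFoundError(f"{control} is not available on this kernel")
    control.write_text("1\n" if enabled else "0\n")
```

Calling set_ignore_unaligned(True) as root has the same effect as the echo command above. The control argument exists so that the function can be exercised against an ordinary file.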