SGI Techpubs Library




     grio2 - Guaranteed Rate I/O Version 2

     The term reservation is used to refer to the set of Quality-of-Service
     (QoS) parameters (bandwidth, reservation interval) requested by an
     application. Reservation requests are forwarded to the GRIO bandwidth
     management daemon ggd2(1M). If the request is granted then the
     application is said to have received a guarantee from the system that its
     QoS requirements will be met. Within the kernel an object is instantiated
     that encodes the requested QoS parameters and maintains the necessary
     scheduling and monitoring state. This object is referred to as a GRIO
     stream. The stream ID is returned to the user application.  Stream IDs
     are unique across reservations and across the cluster.

     This manual page describes the second version of the GRIO product. Where
     it is necessary to distinguish between this release and the previous
     release, the terms GRIOv2 and GRIOv1 are used. Where the term GRIO is used
     without qualification it refers to the second version of the product.


     GRIOv1 was designed for use with tightly controlled, locally attached
     storage devices. It depends on detailed performance data for every piece
     of hardware in the I/O path including: the storage devices themselves,
     the SCSI and Fibre Channel buses, system interconnects and bridges. It
     only works with the XLV volume manager and does not support shared CXFS
     filesystems.

     Modern storage systems are moving towards large interconnected Storage
     Area Networks (SANs) in which heterogeneous systems and storage devices
     are connected via a dedicated high-speed network. In this model, large
     storage resources, such as multi-terabyte RAID devices, are shared
     amongst a number of clients using a shared filesystem such as CXFS.

     GRIOv2 has been created to broaden the GRIO QoS framework to this next
     generation of storage architectures.

     Its key features are as follows:

     1.   Support for shared filesystems and heterogeneous clusters.

          GRIOv2 has been designed from the outset to work with the XVM volume
          manager and fully supports guaranteed-rate I/O to both local XFS and
          shared CXFS filesystems. It is designed to manage I/O from multiple
          heterogeneous nodes and to ensure that a GRIO reservation on one
          node is not affected by I/O elsewhere in a cluster.

     2.   A new filesystem-level performance qualification model.

          GRIOv1 uses a complicated per-device qualification model, in which
          the maximum sustainable bandwidth for each component in the I/O
          path, from disk device to memory, is qualified separately. A
          synthetic benchmark grio_bandwidth(1M) is used to profile individual
          storage devices.

          GRIOv1 depends on this information being complete and accurate. This
          approach is appropriate for the tightly controlled environment of a
          locally attached filesystem. However, as storage networks become
          increasingly heterogeneous and topologies increasingly complex, this
          approach becomes impractical.

          As a result, GRIOv2 has moved to a filesystem-level qualification
          model in which the maximum sustainable bandwidth is measured across
          the entire filesystem under a realistic application workload.
          Empirical measurement of actual filesystem performance is used to
          determine the QoS parameters that can be delivered in practice by a
          particular configuration. This is referred to as the qualified
          bandwidth for the filesystem (and the XVM volume on which it
          resides).

          For local volumes the qualified bandwidth is stored in /etc/griotab;
          for shared volumes it is stored in the cluster configuration
          database (CDB). Refer to the GRIO Version 2 Guide, ggd2(1M) and
          griotab(4) for more information on measuring and setting the
          qualified bandwidth for a filesystem.

     3.   Comprehensive QoS Monitoring.

          GRIOv2 provides comprehensive tools for measuring and monitoring
          delivered QoS levels. This includes in-kernel collection of
          per-stream performance metrics. Refer to grioqos(1M) for further
          information.

          The information provided by the QoS facilities can be used to help
          choose the tradeoff between resource utilisation and delivered I/O
          performance that is most appropriate for a given application mix,
          workload, and production environment.

     4.   Cluster-wide encapsulation and control of non-GRIO I/O.

          When GRIOv2 begins managing an XVM volume, every node with access to
          that volume is notified. From that point on, all user and system I/O
          that doesn't have an explicit GRIO reservation is encapsulated. This
          means that all non-GRIO I/O is automatically associated with a
          system managed nongrio kernel stream.

          The central bandwidth management component of GRIOv2, ggd2,
          allocates otherwise unused filesystem bandwidth to these streams,
          allowing non-GRIO I/O to be processed even when there are active
          reservations in the system. ggd2 dynamically adjusts the amount of
          bandwidth allocated for this purpose based on monitoring of
          filesystem demand and utilisation. In addition to this Dynamic
          Bandwidth Allocation, an administrator can reserve bandwidth at the
          node level for use by all nongrio applications running on that
          node. This is referred to as a Static Bandwidth Allocation. Refer
          to ggd2(1M) and grioadmin(1M) for more information.
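
     The accounting behind item 4 can be pictured with a toy model. The
     sketch below is not ggd2's algorithm (ggd2 also reacts to monitored
     demand and utilisation); it only illustrates the idea that bandwidth
     not committed to reservations or static allocations remains available
     to non-GRIO I/O. The class name, method names, and MB/s figures are all
     hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VolumeBudget:
    """Toy bandwidth ledger for one managed volume (hypothetical model)."""

    qualified_bw: int  # qualified bandwidth, in arbitrary units such as MB/s
    reservations: Dict[str, int] = field(default_factory=dict)
    static_allocs: Dict[str, int] = field(default_factory=dict)

    def committed(self) -> int:
        """Bandwidth already promised to streams and static allocations."""
        return sum(self.reservations.values()) + sum(self.static_allocs.values())

    def reserve(self, stream_id: str, bw: int) -> bool:
        """Grant the request only if it fits within the qualified bandwidth."""
        if bw <= 0 or self.committed() + bw > self.qualified_bw:
            return False
        self.reservations[stream_id] = bw
        return True

    def release(self, stream_id: str) -> None:
        """Return a stream's bandwidth to the pool."""
        self.reservations.pop(stream_id, None)

    def uncommitted(self) -> int:
        """Bandwidth left over, which the toy model hands to non-GRIO I/O."""
        return self.qualified_bw - self.committed()


budget = VolumeBudget(qualified_bw=400)
first = budget.reserve("stream-1", 150)    # fits within 400
budget.static_allocs["node-a"] = 50        # node-level static allocation
second = budget.reserve("stream-2", 250)   # 150 + 50 + 250 exceeds 400
print(first, second, budget.uncommitted())  # True False 200
```

     In this model a request is refused outright when it does not fit; the
     real daemon's policy may differ.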


     In order to utilize a GRIO reservation a file must be read or written
     using direct I/O. The open(2) manual page describes the use and buffer
     alignment restrictions of the direct I/O interface. A GRIO reservation
     can be made for any file within an XFS or CXFS filesystem created on an
     XVM volume.
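
     Direct I/O restrictions generally constrain the buffer address, the
     file offset, and the transfer length. The following sketch checks a
     request against three such limits. The parameter names and the example
     values are assumptions made for illustration; the authoritative rules
     are those given in open(2).

```python
from typing import List


def check_direct_io(buf_addr: int, offset: int, length: int,
                    mem_align: int, min_io: int, max_io: int) -> List[str]:
    """Return a list of constraint violations (empty if the request is valid).

    mem_align, min_io and max_io are hypothetical limits supplied by the
    caller; they are not read from the system here.
    """
    problems = []
    if buf_addr % mem_align:
        problems.append("buffer address is not aligned")
    if offset % min_io:
        problems.append("file offset is not a multiple of the minimum I/O size")
    if length % min_io:
        problems.append("length is not a multiple of the minimum I/O size")
    if not (min_io <= length <= max_io):
        problems.append("length is outside the permitted I/O size range")
    return problems


# A well-formed request: aligned buffer, offset 0, 64 KB transfer.
print(check_direct_io(0x10000, 0, 65536, 4096, 512, 1 << 24))  # []
```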

     In some applications more deterministic performance can be achieved by
     creating files on a dedicated real-time subvolume. To allocate a file on
     the real-time subvolume of an XFS or CXFS filesystem the fcntl(2)
     F_FSSETXATTR command must be used to set the XFS_XFLAG_REALTIME flag.
     This can only be issued on a newly created file. It is not possible to
     mark a file as real-time once non-real-time data blocks have been
     allocated to it.


     GRIOv2 functionality is distributed between three main components:  the
     new guarantee-granting daemon ggd2; the userspace library libgrio2 and
     command line utilities; and the kernel.

     ggd2(1M) is a user level process started at system boot. It is
     responsible for activating and deactivating the GRIOv2 kernel scheduler,
     processing client requests to reserve and release bandwidth, tracking
     bandwidth utilisation, managing unallocated bandwidth, and enforcing the
     GRIOv2 software licenses.

     grioadmin(1M) is used to perform node-level administration tasks for XFS
     and CXFS filesystems including: querying available bandwidth, listing
     active GRIO reservations, and creating, modifying and releasing node-
     level static bandwidth allocations.

     grioqos(1M) is used to extract and report the QoS metrics that GRIO
     maintains for each stream.

     libgrio2 implements the GRIOv2 userspace API. User processes communicate
     with the daemon using the following core API calls:

          grio_avail()      - get available bandwidth for a filesystem
          grio_reserve()    - reserve bandwidth from a filesystem
          grio_reserve_fd() - reserve bandwidth and bind a file descriptor
          grio_bind()       - bind a file descriptor to a stream
          grio_unbind()     - unbind a file descriptor
          grio_modify()     - modify an existing stream
          grio_get_stream() - map a bound file descriptor to its stream ID
          grio_release()    - signal that a stream should be reclaimed

     The process that initially reserves bandwidth with a call to grio_reserve
     or grio_reserve_fd is referred to as the owning process. Any streams not
     already released when their owning process exits will be automatically
     released. Streams can be shared between processes. The ownership of a
     GRIOv2 stream is non-transferable.

     GRIOv2 functionality in the kernel includes stream management, the I/O
     scheduler, cluster integration and messaging.


     There are two important constraints that must be observed when setting up
     a GRIOv2 filesystem:

     1.   If any of the luns on a particular device will be managed as GRIO
          volumes, then all of the luns should be managed as GRIO volumes.
          Typically there will be hardware contention between separate luns,
          both in the SAN and within the storage device. If only a subset of
          the luns are managed, I/O to the unmanaged luns could still cause
          oversubscription of the device and in turn violate GRIO rate
          guarantees on the managed volumes.

     2.   For a similar reason, a storage device containing GRIO managed
          volumes should not be shared between clusters. The GRIO daemons
          running within different clusters are not coordinated, and unmanaged
          I/O from one cluster can cause GRIO rate guarantees in the other
          cluster to be violated.

     It may be appropriate to relax these constraints if a storage device can
     be configured such that there is no internal or external contention
     between independent luns.


     This section provides some tips on how to set up a filesystem on a RAID
     device to achieve correct filesystem device alignment and maximise I/O
     performance. There are three steps that are essential to ensuring correct
     filesystem alignment:

     1.   Ensuring that each data partition is correctly aligned with the
          internal disk layout of its lun.

     2.   Setting XVM stripe parameters correctly.

     3.   Passing correct volume geometry (stripe unit and width) to
          mkfs_xfs(1M).

     These three issues are demonstrated with an example.

     Consider a RAID device with 28 disks arranged as 4 volume groups, with 7
     disks per volume group, with each volume group configured as 6+1 RAID 5
     (6 data disks, 1 parity disk). These are mapped directly to 4 luns - 1
     lun per volume group.

     If the back end transfer size of the RAID device is 128 KB (i.e. the size
     of transfers between the RAID controllers and individual disks), then
     each lun will have an aligned transfer size of 6*128 KB which is 768 KB
     or 1536 filesystem blocks (512 bytes each).
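
     The arithmetic above can be reproduced with a few lines of Python. This
     is only a worked calculation; the function name is hypothetical.

```python
SECTOR_BYTES = 512  # size of one filesystem block in the example above


def aligned_transfer_blocks(data_disks: int, backend_kb: int) -> int:
    """Return the aligned transfer size of one lun in 512-byte blocks."""
    total_bytes = data_disks * backend_kb * 1024
    if total_bytes % SECTOR_BYTES:
        raise ValueError("transfer size is not a whole number of blocks")
    return total_bytes // SECTOR_BYTES


# 6+1 RAID 5 with a 128 KB back end transfer size, as in the example.
lun_blocks = aligned_transfer_blocks(data_disks=6, backend_kb=128)
print(lun_blocks)  # 1536
```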

     The first step is to ensure that the raw data partitions are correctly
     aligned with the start of their corresponding luns (i.e.  the first disk
     in the volume group). In this case luns are 1536 blocks wide, so the
     start of the data partition should be a multiple of this number. Because
     space is already required at the start of the lun for the volume header
     (e.g. 4096 blocks by default for XLV/XVM), a good choice is to move the
     start of the data partition to 4*1536 or 6144 blocks.

     GRIOv2 can only be used with XVM volumes, so xvm(1M) is used to partition
     each lun. The location of the data partition is controlled by adjusting
     the size of the volume header and the size of the XVM volume label. In
     this case, the following options are passed to the label command:

          xvm> label -volhdrblks 5120 -xvmlabelblks 1024 <devname>
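
     A quick check that these two values place the data partition on a lun
     boundary is sketched below. It assumes that the data partition begins
     immediately after the volume header and the XVM label, which is what
     the numbers in this example imply; the function names are hypothetical.

```python
def data_partition_start(volhdr_blocks: int, xvmlabel_blocks: int) -> int:
    """First block of the data partition, assuming it directly follows
    the volume header and the XVM volume label."""
    return volhdr_blocks + xvmlabel_blocks


def is_lun_aligned(start_block: int, lun_width_blocks: int) -> bool:
    """True if start_block is a whole multiple of the lun width."""
    return start_block % lun_width_blocks == 0


start = data_partition_start(volhdr_blocks=5120, xvmlabel_blocks=1024)
print(start, is_lun_aligned(start, 1536))  # 6144 True
```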

     The luns are then arranged into a stripe. The stripe unit must match the
     aligned transfer size of the luns (or a multiple thereof). This is
     specified in the stripe subcommand as follows:

          xvm> stripe -unit 1536 <slices ...>

     Now a filesystem is created on the XVM volume. If the stripe is used as
     the data subvolume the following command creates a filesystem with the
     correct alignment:

          mkfs_xfs -d sunit=1536,swidth=6144 <xvm_devname>

     As there are four luns in total, the stripe width swidth is four times
     the aligned transfer size of the individual luns. Specifying the stripe
     unit and width to mkfs_xfs allows it to ensure that key internal regions
     of the filesystem are correctly aligned with the underlying volume
     geometry.
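
     The relationship between the two values can be expressed as a small
     helper that builds the option string used above. The helper name is
     hypothetical; only the sunit and swidth options shown in this page are
     used.

```python
def xfs_stripe_options(lun_blocks: int, lun_count: int) -> str:
    """Build the -d option value for a stripe of lun_count equal luns.

    sunit is the aligned transfer size of one lun and swidth is that
    size multiplied by the number of luns, both in 512-byte blocks.
    """
    if lun_blocks <= 0 or lun_count <= 0:
        raise ValueError("lun_blocks and lun_count must be positive")
    sunit = lun_blocks
    swidth = lun_blocks * lun_count
    return f"sunit={sunit},swidth={swidth}"


print(xfs_stripe_options(lun_blocks=1536, lun_count=4))
# sunit=1536,swidth=6144
```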

     If the stripe is used as the realtime subvolume then the realtime extent
     size should be set to a multiple of the volume stripe width. This extent
     size also becomes the optimal I/O size that should be used by
     applications doing I/O to the filesystem. The following command sets the
     extent size to the stripe width (note that the 'b' suffix is required to
     specify filesystem blocks):

          mkfs_xfs -r extsize=6144b <xvm_devname>

     This will optimize the filesystem for I/Os spanning the entire disk
     stripe.
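
     For an application, the extent size translates into a byte count. The
     sketch below converts the example value and checks whether a given I/O
     size covers whole extents. Treating any whole multiple of the extent
     size as acceptable is an assumption made here, and the function names
     are hypothetical.

```python
BLOCK_BYTES = 512  # size of one filesystem block in the example above


def extent_bytes(extsize_blocks: int) -> int:
    """Convert a realtime extent size in blocks to bytes."""
    return extsize_blocks * BLOCK_BYTES


def is_whole_extents(io_bytes: int, extsize_blocks: int) -> bool:
    """True if io_bytes is a positive whole multiple of the extent size."""
    return io_bytes > 0 and io_bytes % extent_bytes(extsize_blocks) == 0


size = extent_bytes(6144)
print(size)                                  # 3145728 bytes, i.e. 3 MiB
print(is_whole_extents(2 * size, 6144))      # True
print(is_whole_extents(1024 * 1024, 6144))   # False
```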

     Note that if a non-GRIO XFS filesystem is created directly on one of
     these luns, the fx(1M) command is used to partition the disk and move the
     start of the data partition. For example, the following sequence of
     commands

          fx -x -d <devname>
          fx> repartition
          fx/repartition> optiondrive
          fx/repartition> expert -b

     will partition a drive as an option drive and then allow the layout of
     the partitions to be adjusted interactively (-b specifies input values
     are in filesystem blocks). The data partition should be selected and its
     first block moved to 6144 - placing the start of the data partition on
     the first disk of the lun.

     Remember, however, that an XFS filesystem must be made on an XVM volume
     if it is to be managed by GRIO.




     ggd2(1M), grioadmin(1M), griotab(4), grio_avail(3X), grio_bind(3X),
     grio_modify(3X), grio_release(3X), grio_reserve(3X), grio_reserve_fd(3X),
     grio_unbind(3X), grioqos(1M)


© 2009 - 2015 Silicon Graphics International Corp. All Rights Reserved.