Chapter 7. Flexible File I/O
Flexible File I/O (FFIO) provides
a mechanism for improving the file I/O performance of existing applications
without having to resort to source code changes, that is, the current
executable remains unchanged. Knowledge of source code is not required,
but some knowledge of how the source and the application software work
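can help you better interpret and optimize FFIO results. To take advantage
of FFIO, you only need to set some environment variables before running
your application. This chapter covers FFIO operation, the environment
variables, simple examples, multithreading considerations, application
examples, event tracing, and system information and issues.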
The FFIO subsystem allows you to define
one or more additional I/O buffer caches for specific files to augment
the Linux kernel I/O buffer cache. The FFIO subsystem then manages this
buffer cache for you. To accomplish this, FFIO intercepts standard
I/O calls such as open, read, and write, and replaces them with equivalent
FFIO routines. These routines route I/O requests through the FFIO subsystem,
which uses the user-defined FFIO buffer cache. FFIO can bypass the
Linux kernel I/O buffer cache by communicating with the disk subsystem
via direct I/O. This gives you precise control over cache I/O characteristics
and allows for more efficient I/O requests. For example, doing direct
I/O in large chunks (say 16 megabytes) allows the FFIO cache to amortize
disk access. All file buffering occurs in user space when FFIO is used
with direct I/O enabled. This differs from the Linux buffer cache mechanism,
which requires a context switch in order to buffer data in kernel memory.
Avoiding this kind of overhead helps FFIO to scale efficiently. Another
important distinction is that FFIO allows you to create an I/O buffer
cache dedicated to a specific application. The Linux kernel, on the other
hand, has to manage all the jobs on the entire system with a single I/O
buffer cache. As a result, FFIO typically outperforms the Linux kernel
buffer cache for I/O-intensive throughput jobs.
There are only two
environment variables that you need to set in order to use FFIO. They
are LD_PRELOAD and FF_IO_OPTS.
In order to enable FFIO to trap standard I/O calls, you must set
the LD_PRELOAD environment variable.
For SGI Altix 4000 series systems, perform the following:
setenv LD_PRELOAD /usr/lib/libFFIO.so
For SGI Altix XE systems, perform the following:
setenv LD_PRELOAD /usr/lib64/libFFIO.so
LD_PRELOAD is a Linux feature that instructs the dynamic linker to preload
the indicated shared libraries. In this case, libFFIO.so is preloaded and
provides the routines which replace the standard I/O calls. An application
that is not dynamically linked with the glibc library will not work with
FFIO, since the standard I/O calls will not be intercepted. To disable FFIO,
perform the following:
unsetenv LD_PRELOAD
The FFIO buffer cache is managed by the FF_IO_OPTS
environment variable. The syntax for setting this variable can be quite
complex. A simple form for defining this variable is as follows:
setenv FF_IO_OPTS '<string>(eie.direct.mbytes:<size>:<num>:<lead>:<share>:<stride>:0)'
You can use the following parameters with the FF_IO_OPTS
environment variable:
<string>    Matches the names of files that can use the buffer cache.
<size>      Number of 4k blocks in each page of the I/O buffer cache.
<num>       Number of pages in the I/O buffer cache.
<lead>      The maximum number of "read ahead" pages.
<share>     A value of 1 means a shared cache, 0 means private.
<stride>    Note that the number after the stride parameter is always 0.
The following example shows a command that creates a shared buffer
cache of 128 pages where each page is 16 megabytes (that is, 4096*4k).
The cache has a lead of six pages and uses a stride of one, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'
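As a quick check on the arithmetic, the page size and total cache size follow directly from the <size> and <num> fields. The sketch below assumes that "4k" means 4096 bytes. The helper name cache_geometry is illustrative and is not part of FFIO.

```python
# Illustrative helper (not part of FFIO): derive cache geometry from the
# <size> and <num> fields of FF_IO_OPTS. Assumes "4k" means 4096 bytes.
BLOCK_BYTES = 4096
MIB = 1024 * 1024


def cache_geometry(size_blocks, num_pages):
    """Return (bytes per page, bytes in the whole cache)."""
    page_bytes = size_blocks * BLOCK_BYTES
    return page_bytes, page_bytes * num_pages


if __name__ == "__main__":
    # Values taken from test*(eie.direct.mbytes:4096:128:6:1:1:0)
    page, total = cache_geometry(4096, 128)
    print(page // MIB, "MiB per page,", total // MIB, "MiB in total")
```

With the settings shown above, this describes 16 megabyte pages and a cache of up to 2 gigabytes.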
Each time the application opens a file, the FFIO code checks the
file name to see if it matches the string supplied by FF_IO_OPTS.
The file's path name is not considered when checking for a
match against the string. So in the example supplied above, file names
like /tmp/test16 and /var/tmp/testit
would both be a match.
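The matching rule can be modeled in a few lines. This is only a sketch: it assumes the pattern behaves like a shell-style wildcard (as implemented by Python's fnmatch module), which this chapter does not state explicitly, and the function name matches_ffio_pattern is hypothetical.

```python
import fnmatch
import os


def matches_ffio_pattern(path, pattern):
    """Model of the documented rule: only the file name is compared.

    Assumption: the pattern is a shell-style wildcard. The real FFIO
    matcher may differ in details.
    """
    # Drop the directory part, then compare the remaining file name.
    return fnmatch.fnmatchcase(os.path.basename(path), pattern)


if __name__ == "__main__":
    for candidate in ("/tmp/test16", "/var/tmp/testit", "/tmp/test/output1"):
        print(candidate, matches_ffio_pattern(candidate, "test*"))
```

The third path is included to show that a directory named test does not produce a match, because only the final component (output1) is compared.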
More complicated usages of FF_IO_OPTS are built
upon this simpler version. For example, multiple types of file names can
share the same cache, as follows:
setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0)'
Multiple caches may also be specified with FF_IO_OPTS.
In the example that follows, files of the form output*
and test* share a 128 page cache of 16 megabyte pages.
The file special42 has a 256 page private cache of
32 megabyte pages, as follows:
setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0) special42(eie.direct.mbytes:8192:256:6:0:1:0)'
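Because these strings are easy to mistype, it can help to generate them. The following sketch composes the value from its parts. The helpers cache_spec and ffio_opts are hypothetical names introduced here, and only the simple eie.direct.mbytes form described above is covered.

```python
# Hypothetical helpers (not part of FFIO) that assemble an FF_IO_OPTS value
# from the fields described in this chapter.

def cache_spec(patterns, size, num, lead, share, stride=1):
    """Build one '<names>(eie.direct.mbytes:...)' cache definition."""
    names = " ".join(patterns)
    # The field after <stride> is always 0, as noted in the parameter list.
    return f"{names}(eie.direct.mbytes:{size}:{num}:{lead}:{share}:{stride}:0)"


def ffio_opts(*specs):
    """Join several cache definitions into a single FF_IO_OPTS value."""
    return " ".join(specs)


if __name__ == "__main__":
    shared = cache_spec(["output*", "test*"], 4096, 128, 6, share=1)
    private = cache_spec(["special42"], 8192, 256, 6, share=0)
    print(ffio_opts(shared, private))
```

The printed value can then be passed to setenv inside single quotes, as in the examples above.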
Additional parameters can be added to FF_IO_OPTS
to generate diagnostic feedback that is sent to standard output. Examples
of this diagnostic output are presented in the following section.
This section walks you through
some simple examples using FFIO.
Assume that LD_PRELOAD is set for the correct
library and FF_IO_OPTS is defined, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'
This example uses a small C program called fio
that reads four megabyte chunks from a file for 100 iterations. When the
program runs it produces output, as follows:
./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time = 7.383761
Throughput = 56.804439 MB/sec
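The reported throughput can be reproduced from the other numbers in the output. The sketch below assumes that fio counts a megabyte as 10^6 bytes. This is an inference from the figures, since the fio source is not shown here.

```python
# Values copied from the fio output above.
CHUNK_BYTES = 4194304
ITERATIONS = 100
TOTAL_TIME_S = 7.383761


def throughput_mb_per_s(total_bytes, seconds):
    """Throughput in MB/sec, assuming 1 MB = 10**6 bytes (an inference)."""
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    rate = throughput_mb_per_s(CHUNK_BYTES * ITERATIONS, TOTAL_TIME_S)
    print(f"Throughput = {rate:.6f} MB/sec")
```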
It can be difficult to tell what FFIO may or may not be doing even
with a simple program such as shown above. A summary of the FFIO operations
that occurred can be directed to standard output by making a simple addition
to FF_IO_OPTS, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.notrace )'
This new setting for FF_IO_OPTS generates the
following summary on standard output when the program is run:
./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time = 7.383761
Throughput = 56.804439 MB/sec
event_close(testit) eie <-->syscall (496 mbytes)/( 8.72 s)= 56.85 mbytes/s
oflags=0x0000000000004042=RDWR+CREAT+DIRECT
sector size =4096(bytes)
cblks =0 cbits =0x0000000000000000
current file size =512 mbytes high water file size =512 mbytes
function  times   wall   all     mbytes     mbytes     min      max      avg
          called  time   hidden  requested  delivered  request  request  request
open           1  0.00
read           2  0.61                  32         32       16       16       16
reada         29  0.01        0        464        464       16       16       16
fcntl
  recall
  reada       29  8.11
other          5  0.00
flush          1  0.00
close          1  0.00
Two synchronous reads of 16 megabytes each were issued (for a total
of 32 megabytes) and 29 asynchronous reads (reada)
were also issued (for a total of 464 megabytes). Additional diagnostic
information can be generated by specifying the .diag
modifier, as follows:
setenv FF_IO_OPTS 'test*(eie.direct.diag.mbytes:4096:128:6:1:1:0 )'
The .diag modifier may also be used in conjunction
with .event.summary. The two operate independently
of one another, as follows:
setenv FF_IO_OPTS 'test*(eie.diag.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.notrace )'
The following is an example of the diagnostic output generated when only
the .diag modifier is used:
./fio -n 100 /build/testit
Reading 4194304 bytes 100 times to /build/testit
Total time = 7.383761
Throughput = 56.804439 MB/sec
eie_close EIE final stats for file /build/testit
eie_close Used shared eie cache 1
eie_close 128 mem pages of 4096 blocks (4096 sectors), max_lead = 6 pages
eie_close advance reads used/started : 23/29 79.31% (1.78 seconds wasted)
eie_close write hits/total : 0/0 0.00%
eie_close read hits/total : 98/100 98.00%
eie_close mbytes transferred parent --> eie --> child sync async
eie_close 0 0 0 0
eie_close 400 496 2 29 (0,0)
eie_close parent <-- eie <-- child
eie_close EIE stats for Shared cache 1
eie_close 128 mem pages of 4096 blocks
eie_close advance reads used/started : 23/29 79.31% (0.00 seconds wasted)
eie_close write hits/total : 0/0 0.00%
eie_close read hits/total : 98/100 98.00%
eie_close mbytes transferred parent --> eie --> child sync async
eie_close 0 0 0
eie_close 400 496 2 29 (0,0)
Information is listed for both the file and the cache. An
mbytes transferred example is shown below:
eie_close mbytes transferred parent --> eie --> child sync async
eie_close 0 0 0
eie_close 400 496 2 29 (0,0)
The last two lines are for write and read operations, respectively.
Only for very simple I/O patterns can the difference between the (parent -->
eie) and (eie --> child) read statistics be explained by the number
of read aheads. For random reads of a large file over a long period of
time, this is not the case. All write operations count as async.
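For the simple sequential pattern in this example, the read statistics can be reconciled by hand. The sketch below uses only the figures from the sample output. The interpretation of each field is inferred from that output rather than taken from an FFIO specification.

```python
# Figures copied from the sample summary and .diag output above.
PAGE_MB = 16            # 4096 blocks of 4k per cache page
SYNC_READS = 2          # "read" calls issued to the file system
ASYNC_READS = 29        # "reada" calls issued to the file system
ADVANCE_USED = 23       # advance reads used ...
ADVANCE_STARTED = 29    # ... out of advance reads started
APP_READS = 100         # reads issued by the application
APP_READ_MB = 4         # each application read is 4194304 bytes

# Data moved between the cache and the file system (eie <-- child).
child_mb = (SYNC_READS + ASYNC_READS) * PAGE_MB
# Data delivered to the application (parent <-- eie).
parent_mb = APP_READS * APP_READ_MB
# Pages that were read ahead but never consumed.
unused_mb = (ADVANCE_STARTED - ADVANCE_USED) * PAGE_MB

if __name__ == "__main__":
    print(child_mb, parent_mb, child_mb - parent_mb, unused_mb)
```

Here the 96 megabyte gap between the two columns equals the six unused read-ahead pages. As the text notes, this tidy relationship should not be expected for random access patterns.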
Multithreading Considerations
FFIO will work with applications that use MPI for parallel
processing. An MPI job assigns each thread a number or rank. The master
thread has rank 0, while the remaining threads (called slave threads)
have ranks from 1 to N-1, where N is the total number of threads in the
MPI job. It is important to consider that the threads comprising an MPI
job do not (necessarily) have access to each other's address space. As
a result, there is no way for the different MPI threads to share the same
FFIO cache. By default, each thread defines a separate FFIO cache based
on the parameters defined by FF_IO_OPTS.
Having each MPI thread define a separate FFIO cache based on a single
environment variable (FF_IO_OPTS) can waste a lot of
memory. Fortunately, FFIO provides a mechanism that allows the user to
specify a different FFIO cache for each MPI thread via the following environment
variables:
setenv FF_IO_OPTS_RANK0 'result*(eie.direct.mbytes:4096:512:6:1:1:0)'
setenv FF_IO_OPTS_RANK1 'output*(eie.direct.mbytes:1024:128:6:1:1:0)'
setenv FF_IO_OPTS_RANK2 'input*(eie.direct.mbytes:2048:64:6:1:1:0)'
.
.
.
setenv FF_IO_OPTS_RANKN-1 ... (N = number of threads).
Each rank environment variable is set using exactly the same syntax
as FF_IO_OPTS, and each defines a distinct cache for
the corresponding MPI rank. If the cache is designated as shared, all files
opened by the thread with that rank use the same cache. FFIO works with
SGI MPI, HP MPI, and LAM MPI. To work with MPI applications,
FFIO needs to determine the rank of callers by invoking the
mpi_comm_rank_() MPI library routine. Therefore, FFIO needs
to determine the location of the MPI library used by the application.
This is accomplished by having the user set one (and only one) of the
following environment variables:
setenv SGI_MPI /usr/lib # ia64 only
or
setenv LAM_MPI *see below
or
setenv HP_MPI *see below
*LAM and HP MPIs are usually distributed via a third party application. The precise
paths to the LAM and the HP MPI libraries are application dependent. Please refer to the
application installation guide to find the correct path.
To use the rank functionality, both the MPI library location variable
(SGI_MPI, LAM_MPI, or HP_MPI) and the FF_IO_OPTS_RANK0 environment
variable must be set. If either variable is not set, the MPI threads all
use FF_IO_OPTS. If both the MPI and the FF_IO_OPTS_RANK0 variables are
defined but, for example, FF_IO_OPTS_RANK2 is undefined, all rank 2 files
would generate a no match with FFIO. This means that none of the rank 2
files would be cached by FFIO (in this case, the settings do not fall back
to FF_IO_OPTS).
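The selection rules above can be summarized as a small decision function. This is a model of the documented behavior only, not FFIO's actual code. The function name effective_opts and the dictionary standing in for the environment are illustrative.

```python
# Illustrative model of the documented rank-selection rules (not FFIO code).
MPI_LOCATION_VARS = ("SGI_MPI", "LAM_MPI", "HP_MPI")


def effective_opts(env, rank):
    """Return the option string that applies to `rank`.

    None means that no pattern applies, so none of that rank's files
    would be cached.
    """
    mpi_set = any(name in env for name in MPI_LOCATION_VARS)
    if mpi_set and "FF_IO_OPTS_RANK0" in env:
        # Rank mode: a missing per-rank variable does NOT fall back
        # to FF_IO_OPTS.
        return env.get(f"FF_IO_OPTS_RANK{rank}")
    # Otherwise every rank uses FF_IO_OPTS.
    return env.get("FF_IO_OPTS")


if __name__ == "__main__":
    env = {
        "SGI_MPI": "/usr/lib",
        "FF_IO_OPTS": "test*(eie.direct.mbytes:4096:128:6:1:1:0)",
        "FF_IO_OPTS_RANK0": "result*(eie.direct.mbytes:4096:512:6:1:1:0)",
        "FF_IO_OPTS_RANK1": "output*(eie.direct.mbytes:1024:128:6:1:1:0)",
    }
    for rank in range(3):
        print(rank, effective_opts(env, rank))
```

In this example, rank 2 receives no option string, which corresponds to the "no match" case described above.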
Fortran and C/C++ applications that use the pthreads
interface will create threads that share the same address space. These
threads can all make use of the single FFIO cache defined by
FF_IO_OPTS.
FFIO has been deployed successfully with several HPC applications
such as Nastran and Abaqus. In a recent customer benchmark, an eight-way
Abaqus throughput job ran approximately twice as fast when FFIO was used.
The FFIO cache used 16 megabyte pages (that is, page_size
= 4096) and the cache size was 8.0 gigabytes. As a rule of thumb, it was
determined that setting the FFIO cache size to roughly 10-15% of the disk
space required by Abaqus yielded reasonable I/O performance. For this
benchmark, the FF_IO_OPTS environment variable was
defined by:
setenv FF_IO_OPTS '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
*.nck* *.sct *.lop *.ngr *.elm *.ptn* *.stp* *.eig *.lnz* *.mass *.inp* *.scn* *.ddm
*.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'
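The rule of thumb can be turned into a starting value for <num>. The sketch below is only an aid for picking an initial setting. The helper name pages_for_disk_usage and the 64 gigabyte disk footprint are hypothetical and are not taken from the benchmark.

```python
import math

BLOCK_BYTES = 4096          # assumes "4k" means 4096 bytes
GIB = 1024 ** 3


def pages_for_disk_usage(disk_bytes, fraction, size_blocks=4096):
    """Suggest <num> so the cache holds `fraction` of the job's disk usage."""
    page_bytes = size_blocks * BLOCK_BYTES
    return math.ceil(disk_bytes * fraction / page_bytes)


if __name__ == "__main__":
    # Hypothetical job that needs 64 GiB of disk, sized at 12.5%.
    print(pages_for_disk_usage(64 * GIB, 0.125))
```

For this hypothetical job, the result is 512 pages of 16 megabytes, the same geometry as the setting shown above. The right fraction for a given application still has to be found by measurement.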
For the MPI version of Abaqus, different caches were specified for
each MPI rank, as follows:
setenv FF_IO_OPTS_RANK0 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
*.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm
*.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'
setenv FF_IO_OPTS_RANK1 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
*.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm
*.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
setenv FF_IO_OPTS_RANK2 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
*.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm
*.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
setenv FF_IO_OPTS_RANK3 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
*.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm
*.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
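The memory effect of the per-rank settings can be estimated from the <size> and <num> fields. The sketch assumes 4096-byte blocks and that each rank's cache can grow to its full configured size. The names used are illustrative.

```python
BLOCK_BYTES = 4096
MIB = 1024 * 1024

# (<size>, <num>) taken from FF_IO_OPTS_RANK0 through FF_IO_OPTS_RANK3 above.
RANK_GEOMETRY = {0: (4096, 512), 1: (4096, 16), 2: (4096, 16), 3: (4096, 16)}


def cache_mib(size_blocks, num_pages):
    """Configured cache size in MiB for one rank."""
    return size_blocks * BLOCK_BYTES * num_pages // MIB


per_rank = {rank: cache_mib(*geom) for rank, geom in RANK_GEOMETRY.items()}
total_per_rank = sum(per_rank.values())
# What the job would configure if every rank reused the rank 0 settings.
total_uniform = cache_mib(*RANK_GEOMETRY[0]) * len(RANK_GEOMETRY)

if __name__ == "__main__":
    print(per_rank, total_per_rank, total_uniform)
```

Under these assumptions, the four ranks configure about 8.75 gigabytes in total, compared with 32 gigabytes if every rank used the rank 0 geometry.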
By specifying the .trace option as part of the
event parameter, the user can enable the event tracing feature in FFIO,
as follows:
setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.trace )'
This option generates files of the form ffio.events.pid
for each process that is part of the application. By default,
event files are placed in /tmp but this destination
can be changed by setting the FFIO_TMPDIR environment
variable. These files contain time stamped events for files using the
FFIO cache and can be used to trace I/O activity (for example, I/O sizes
and offsets).
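When collecting traces from a script, the expected location of an event file can be computed from the process ID. The sketch assumes that the pid suffix is the decimal process ID, which is an interpretation of the file name form given above. The function name event_file_path is hypothetical.

```python
import os


def event_file_path(pid, env=None):
    """Expected event file for `pid`, honoring FFIO_TMPDIR when it is set.

    Assumption: the suffix of ffio.events.pid is the decimal process ID.
    """
    env = os.environ if env is None else env
    directory = env.get("FFIO_TMPDIR", "/tmp")
    return os.path.join(directory, f"ffio.events.{pid}")


if __name__ == "__main__":
    print(event_file_path(os.getpid()))
```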
System Information and Issues
The SGI ProPack 5 Service Pack 1 release provided the first stable
version of FFIO. Applications written in C, C++, and Fortran are supported.
C and C++ applications can be built with either the Intel or gcc compiler.
Only Fortran codes built with the Intel compiler will work.
The following restrictions on FFIO must also be observed:
The FFIO implementation of pread/pwrite is not correct (the file offset advances).
Do not use FFIO to do I/O on a socket.
Do not link your application with the librt asynchronous I/O library.
Calls that operate on files in /proc, /etc, and /dev are not intercepted by FFIO.
Calls that operate on stdin, stdout, and stderr are not intercepted by FFIO.
FFIO is not intended for generic I/O applications such as vi, cp, mv, and so on.