File Subsystem
The file subsystem controls access to data files as
well as devices on the REAL/IX Operating System. It provides a hierarchical
organization for files and directories, buffering of data (and the
ability to bypass buffering for critical realtime I/O operations),
and control of file access.
See File Access Utilities
in Chapter 8 for a list of commands used to access and manipulate
files and directories.
File types
A file is the basic unit of the file subsystem. Files
can represent character or binary information, but all files are stored
identically: as a sequence of bytes. There are four basic file types.
- Regular - one-dimensional array of bytes used to store
data, text, or programs.
- Pipe - similar to regular files in how they represent
unformatted information, but unique in that reads from them are
destructive. Pipes are used to communicate between processes.
- Special - used to interface between a process and a device.
These files reside in the /dev directory.
- Directory - used to organize regular, pipe, special,
and other directory files. They hold file names and references
to information about the files. Only the operating system can
write to them. Users can modify directories by making a request
to the operating system.
REAL/IX file names consist of one to fourteen ASCII
characters, and should use only printable characters that are not
special to the shell.
The REAL/IX file subsystem uses three entities to maintain
user and system files and directories. These are:
- Directory Tree: all accessible directories and file systems,
used when giving path names.
- File Systems: independent collections of files and directories,
each of which is located on the same partition (or contiguous
partitions) of a disk device.
- Directories: reside within a file system and are used
to further organize user and system files. They contain the names
of the files and references to the remainder of the information
about the file. This reference is called the inumber and
is used to locate the inode for the file being accessed.
File Listings
The ls(1) command is used to list the files in
a directory. A number of options are available to provide additional
information and specify the format of the output; some of these are
explained later in this chapter, and all are listed on the ls
manual page. For now, it is useful to understand the basic information
given for the ls -l (long) command:
Table 1 - File Listings
This listing provides the following information:
- mode - the first character identifies the type of file: - is
a regular file; d is a directory; e is an extent-based file. Special
device files use c for a character-access file and b for a block-access
file. Special device files are discussed more in Chapter 6.
The remaining nine characters define the file access permissions
as discussed in the following section.
- links - the number of names linked to this file.
- owner and group - identify who can access the file as owner
and who may access the file as group. Use the chgrp(1)
command to change the group, and the chown(1) command to
change the owner of a file.
- size - logical size of the file in bytes. For special device
files, this field contains the major and minor device numbers.
- date and time - date and time of last file modification
- file name - the common name used to access this file
File Permissions
File permissions determine who may read, write, and
execute a file. Each file has an owner (changed with the chown(1)
command) and a group (changed with the chgrp(1) command). The
mode of a file is an octal mask that determines the access privileges
for the owner (u), group (g), and world or other (o).
It is displayed in the last nine characters of the leftmost column
of the output from the ls -l command.
The format of the output is three sets of three characters
each, defining read, write, and execute permissions for the owner,
group, and other, as illustrated below:
Table 2 - File Permissions
For permissions that are not granted to that class of
user, a dash (-) replaces the letter; for instance, r--
permissions indicate read privileges but no write or execute permissions.
Initial file permissions are determined by the umask value
set in the /etc/profile or $HOME/.profile file. The
file permissions are changed with the chmod(1) command. chmod
can use either the single-letter representation or a three-digit number,
where each digit corresponds to the owner, group, and other, respectively.
The value of each digit is determined by summing the numbers corresponding
to the permissions:
- 4 - read permission (r)
- 2 - write permission (w)
- 1 - execute permission (x)
So, for example, 777 represents the permissions
shown above (read, write, and execute permissions for the owner, group,
and other). 640 gives the owner read and write permissions,
the group read permissions, and no access permissions to other users.
chmod can also assign permissions symbolically, in a format such as chmod g+rw
(which adds read and write permissions for the group) or chmod
o-w (which removes write permission for other). Note that any
executable file must have execute permissions; directories must have
execute permissions for use in a path name.
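The digit arithmetic described above can be sketched in a few lines of Python. This is an illustration only and not part of the REAL/IX software; the function names digit_to_letters and mode_to_letters are hypothetical.

```python
# Permission values from the text: read = 4, write = 2, execute = 1.
PERMISSION_BITS = (("r", 4), ("w", 2), ("x", 1))


def digit_to_letters(digit):
    """Turn one octal digit (0-7) into its three-character rwx form."""
    return "".join(letter if digit & value else "-"
                   for letter, value in PERMISSION_BITS)


def mode_to_letters(mode):
    """Turn a three-digit octal mode such as 0o640 into nine characters."""
    owner = (mode >> 6) & 7   # first digit
    group = (mode >> 3) & 7   # second digit
    other = mode & 7          # third digit
    return "".join(digit_to_letters(d) for d in (owner, group, other))


if __name__ == "__main__":
    print(mode_to_letters(0o777))  # rwxrwxrwx
    print(mode_to_letters(0o640))  # rw-r-----
```

The two modes printed are the ones used as examples in the text: 777 grants everything to all three classes, and 640 yields rw- for the owner, r-- for the group, and --- for other.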
The user name shown in the ls output corresponds
to a user ID assigned in the /etc/passwd file; the group name
corresponds to a group ID assigned in the /etc/passwd file
and populated in the /etc/group file. The user and group shown
in the ls output correspond to the "real" user identification
number (UID) and group identification number (GID).
Normally, a process runs with the permissions of the
user who executes it. A program can use the setuid(2) and setgid(2)
system calls to set the "effective" UID and GID: when the
setuid bit is on for the program file, the process runs with the permissions of
the file's owner, and when the setgid bit is on, the process
runs with the permissions of the file's group. This is used, for instance,
so that a number of users may update a file but only through a specific
command (not through an editor or other utility). The program can
give execute permissions to anyone, then internally reset the effective
UID or GID so it can access a file that has restricted permissions.
User View of a File System
A file system is a combination of directories and files
descending from a common directory. Figure 5-1 shows the relationship
between directories and files in a REAL/IX file system. The directories
are represented by circles; files are represented with lines beneath
the directories.
Figure 1 - A REAL/IX File System
Path Names
The starting point of any UNIX file system is a directory
that serves as its root. One file system is itself called
root, and its root directory is the topmost directory on the system. The root
directory of the root file system is represented by a single
slash (/). The file system diagrammed in Figure 5-1 is a root
file system, with subdirectories /bin, /etc, and /usr.
A full path name for a file gives the location
of a file in relationship to root, for instance, /bin/cat.
A relative path name for a file gives the location of a file
in relationship to one's present working directory. So, if one's present
working directory is root, one could refer to bin/cat.
Relative path names use . to refer to the present
working directory and .. to refer to the directory level directly
above the present working directory. So, if one's present working directory
is /etc, one could use the relative path name ../bin/cat
to refer to this file. Typically, the .. directory is referred
to as the "parent directory".
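The way a relative path name combines with the present working directory can be sketched with Python's posixpath module. The helper name resolve is hypothetical, and normpath works on the text of the path only (it does not consult the file system), so this is an approximation of what the kernel does during path lookup.

```python
import posixpath


def resolve(working_directory, path_name):
    """Return the full path name that path_name refers to."""
    if posixpath.isabs(path_name):
        # A full path name is already relative to root.
        return posixpath.normpath(path_name)
    # A relative path name is interpreted from the working directory;
    # normpath then folds away the . and .. components.
    return posixpath.normpath(posixpath.join(working_directory, path_name))


if __name__ == "__main__":
    print(resolve("/", "bin/cat"))        # /bin/cat
    print(resolve("/etc", "../bin/cat"))  # /bin/cat
```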
Mounting Another File System
You may wish to mount other file systems under an existing
file system. A prime example is the usr file system mounted
under the root file system. Figure 5-2 shows such a file system
mounted as /usr.
Figure 2 - Adding Another File System to root
A directory such as /usr is often referred to
as a "leaf" or "mount point", because it forms
the connection between the root file system and another mountable
file system. You may also choose to refer to it as a child of /,
or the parent of the /usr file system. For a complete list
of all file systems mounted on your machine, execute the /etc/mount
command.
Internal Representation
Every file in a file system has a unique number, called
the inumber, associated with it. An inumber is used as an index
into the ilist, which is a collection of information nodes
(inodes) for that file system. This is described more fully
later in this chapter. Here we are concerned with the inumbers of
two file names that exist in every directory and hold the file system
together: . (dot) for the directory itself and .. (dot
dot), which points to the parent directory. The directory entry for
. contains the inumber of the directory itself, and the entry
for .. contains the inumber of the parent directory, which
is the same number as that given to the . file in that parent
directory. This interrelationship between the . and ..
files gives the file system structure its cohesion.
The file system structure illustrated in Figure 5-2
shows the inumbers of . and .. for the various directories.
The inumbers of the . and .. files in all mount-point
directories are 2. Notice that the .. inumber of all files
descending directly from root (/) or /usr is 2. Generally,
the inumber of the .. file in /etc/src is the same as
the inumber for the . file in /etc. The only time this
is not true is when a directory is the mount point of a file system,
in which case the inumbers of both . and .. are 2.
To look at the inumbers of all files in a directory,
use the ls -ai command.
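The relationship between . and .. can also be checked from a program. The sketch below uses Python's os.stat, which reports the inumber as st_ino on a POSIX host; the helper names are hypothetical, and the directories are scratch directories that are not mount points.

```python
import os
import tempfile


def dot_inumbers(directory):
    """Return the inumbers of the . and .. entries of a directory."""
    here = os.stat(os.path.join(directory, ".")).st_ino
    parent = os.stat(os.path.join(directory, "..")).st_ino
    return here, parent


def check_parent_link():
    """Create a parent and child directory and compare their inumbers."""
    with tempfile.TemporaryDirectory() as parent:
        child = os.path.join(parent, "child")
        os.mkdir(child)
        parent_dot, _ = dot_inumbers(parent)
        _, child_dotdot = dot_inumbers(child)
        # The child's .. entry should carry the inumber of the parent's . entry.
        return parent_dot == child_dotdot


if __name__ == "__main__":
    print(check_parent_link())
```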
Accessing Files in Programs
User-level programs access files through system calls
(accessible through library routines). These calls go through the
file system and (if so programmed) the system buffer cache. Below
the buffer cache, the I/O subsystem (described in Chapter 6) handles
the interaction with the disk device where the file data is stored.
Note that the REAL/IX Operating System supports two file system architectures
which are accessed through the file system switch; these two architectures
are discussed later in this chapter. Figure 5-3 illustrates the flow
of data and the flow of control for file access.
Figure 3 - REAL/IX File Subsystem
The two file system architectures are called S5 and F5. The S5 architecture is
that of UNIX System V; the F5 architecture provides faster file access
for realtime and time-share processes.
File Descriptors
When a process first accesses a file with the open(2)
or creat(2) system call, it is assigned a file descriptor.
All subsequent file I/O performed by the process uses this descriptor,
which serves as a sort of handle on the file. Each executing process
has a descriptor table that contains an entry for each open
file descriptor. Entries in the descriptor table are indexed from
0 to (n - 1), where n is the value of the NOFILES tunable parameter
that defines the maximum number of file descriptors an executing process
can have at one time; the default value is 80.
I/O system calls other than open and creat
use this index value to specify the target descriptor. Because the
descriptor table is part of a process's environment, a child process
inherits access to all of its parent's descriptors. However, because
the child's environment is a copy of the parent's, the child can alter
its descriptor table without affecting that of its parent.
New descriptors are assigned sequentially as is appropriate
for the type of file. For regular files, each open/create operation
creates one new file descriptor, but the pipe(2) system call
that creates a pipe creates two descriptors (one for the read end
and one for the write end of the pipe). In all cases, the system call
returns the index value(s) of the newly created descriptor(s), which
are used in subsequent I/O calls.
The close(2) system call notifies the system
that the process is finished with I/O operations on this file descriptor
and frees the applicable slot in the descriptor table for reuse. A
process may also manipulate table entries with the dup(2) system
call, which makes a duplicate of a descriptor in the first available
slot. You may then close the original descriptor and open a new descriptor
in its place. To restore the initial descriptor, close the new descriptor
and dup the saved copy back into place. A typical
use for dup is in assigning the standard input to file descriptor
0, standard output to file descriptor 1, and standard error to file
descriptor 2.
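The life cycle of a descriptor (create, duplicate, close) can be sketched as follows. Python's os.pipe, os.dup, os.close, os.read, and os.write pass through to the system calls of the same names on a POSIX host; the function name pipe_through_duplicate is hypothetical, and nothing here is specific to REAL/IX.

```python
import os


def pipe_through_duplicate(message):
    """Write message into a pipe through a duplicated descriptor and read it back."""
    read_end, write_end = os.pipe()   # pipe(2) returns two descriptors
    duplicate = os.dup(write_end)     # dup(2): a second slot for the same open file
    os.close(write_end)               # close(2) frees the original slot for reuse
    os.write(duplicate, message)      # the duplicate still reaches the pipe
    os.close(duplicate)               # no write end remains open
    data = os.read(read_end, len(message) + 1)
    os.close(read_end)
    return data


if __name__ == "__main__":
    print(pipe_through_duplicate(b"hello"))
```

Closing the original write descriptor does not close the pipe, because the duplicate refers to the same open file. The message is assumed to be small enough to fit in the pipe without blocking the writer.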
Standard I/O and Redirection
Standard I/O defines file descriptors 0, 1, and 2 as
referring to standard input (stdin), standard output (stdout),
and standard error (stderr), respectively. Except when there
is good reason not to, programs read their input from the standard
input, write their output to the standard output, and write diagnostic
information to standard error. A process inherits descriptors from
its parent, so the standard I/O files are already set up; the new
process merely needs to establish descriptors for the auxiliary files
and devices it needs. Using the standard I/O descriptors gives a program
a large measure of universality automatically. Files, devices, and
pipes all work essentially the same way, so a program will work
with the standard I/O descriptors hooked up to any of them. As an
example, a program that counts words in a text stream and writes the
total to the standard output works equally well whether the input
is taken from a terminal, from a file, or from another process. Likewise,
you may redirect the output to a terminal, file, or another process.
Unless otherwise defined, the standard input is associated
with the keyboard of the user who invokes the process; the standard
output and standard error are associated with the user's display.
Usually the standard error retains its association with the display,
but often a new child process redirects its standard input and/or
output before execing a program. This way, the execed
program does not have to concern itself as to the source of its input
or the destination of its output. The most common example of this
is the shell, which provides users with a very convenient notation
for redirecting the standard input, output, and error of a program
before it is executed. A detailed discussion of redirection
is provided in Chapter 7.
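What a shell does for output redirection can be sketched as a fork, a descriptor swap in the child, and an exec. The sketch assumes a POSIX host with an echo program on the search path, and it uses dup2 (a close relative of dup that names the target slot explicitly); the function name run_with_stdout_redirected is hypothetical.

```python
import os
import tempfile


def run_with_stdout_redirected(argv, path):
    """Fork, point the child's descriptor 1 at path, then exec argv."""
    pid = os.fork()
    if pid == 0:                                  # child process
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            os.dup2(fd, 1)                        # descriptor 1 now refers to the file
            os.close(fd)
            os.execvp(argv[0], argv)              # does not return on success
        finally:
            os._exit(127)                         # reached only if a step above failed
    _, status = os.waitpid(pid, 0)                # parent waits for the child
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "out.txt")
        print(run_with_stdout_redirected(["echo", "hello"], target))
        with open(target) as result:
            print(result.read(), end="")
```

The program that is execed (echo here) simply writes to descriptor 1 and never learns that its output goes to a file.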
Asynchronous File I/O
Most UNIX kernels support only synchronous I/O operations,
meaning that any I/O operation issued by a process causes that process
to block until the I/O operation is complete. A realtime application
needs the capability of overlapping I/O operations with process execution.
The REAL/IX Operating System supports asynchronous I/O operations
for files and devices, enabling the process that initiated the I/O
operation to continue the process execution stream once the I/O operation
is queued to the device. When the I/O operation completes (either
successfully or unsuccessfully), the initiating process learns of it
either through the common event notification mechanism or by polling
a control block; the polling option saves the overhead of a system
call.
The ability to overlap application processing and I/O
operations initiated by the application program and to allow one process
to simultaneously perform several separate I/O operations is required
by a number of realtime applications. For instance, journalizing functions
may require the ability to queue logging records for output without
blocking the initiating process. Data acquisition processes may have
two or more channels delivering intermittent data that require sampling
within a certain time. The process issues one asynchronous read on
each channel. When one of the channels needs data collection, the
process reads the data and posts it to secondary memory with an asynchronous
write; the process may defer actual processing of the data.
The REAL/IX Operating System provides facilities for
asynchronous read and write operations, and the ability to cancel
an asynchronous I/O request. There are also optional initialization
services that speed I/O throughput by preallocating and initializing
various data structures.
The implementation of asynchronous I/O provides the
following capabilities:
- You may issue asynchronous I/O requests for both regular files
and I/O devices.
- Simultaneous queueing of multiple asynchronous read and write
operations to one file descriptor.
- One process can queue asynchronous read and write operations
to several open file descriptors.
- Asynchronous I/O operations to the extended portion of extent-based
files can bypass the buffer cache, which further improves I/O
throughput. Unbuffered I/O functionality is implemented in the
inode associated with a file descriptor, using fcntl(2)
requests. You may emulate unbuffered I/O when required. Note that
asynchronous I/O without emulation to regular files is supported
only for extent-based files on F5 file systems.
- Cancellation of pending asynchronous I/O requests.
- Notification of asynchronous I/O completion is optional. If
used, notification is obtained through either polling or the common
event notification method.
- You may use asynchronous I/O operations with both sequential
and random access devices.
- One driver and its associated devices can support both synchronous
and asynchronous read and write operations.
Each asynchronous I/O operation establishes an asynchronous
I/O control block (aiocb(4)) structure, which contains information
to control the I/O operations, such as the number of bytes to transfer
and whether to post an event to the sending process when the I/O operation
completes. When the I/O operation completes, the aiocb structure is
updated to indicate either that the operation was successful or the
error code that resulted.
Asynchronous I/O operations to character devices are
implemented using the aio(D3X) entry point to cdevsw(D4X).
The operating system sets up an areq(D4X) kernel data structure, populated
with appropriate information from aiocb(4) and the requesting process.
This structure controls the data transfer, and is updated when the
I/O transfer is completed. Neither the user-level process nor the
driver blocks at any time, and since each I/O request generates a
separate aiocb-areq pair of structures, you may initiate additional
asynchronous I/O requests before the previous operation completes.
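The pattern of queueing several operations, continuing to run, and polling for completion can be imitated on any POSIX host. The sketch below is an emulation only: it uses a Python worker thread in place of the kernel's asynchronous I/O path, and a Future object stands in for the control block that the process polls. It does not use the REAL/IX aiocb interface, and the names queue_writes and log_without_blocking are hypothetical.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def queue_writes(executor, fd, records):
    """Queue one write per record and return at once with the pending requests."""
    return [executor.submit(os.write, fd, record) for record in records]


def log_without_blocking(path, records):
    """Append records to path, polling for completion instead of blocking in write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # A single worker performs the queued writes in submission order.
        with ThreadPoolExecutor(max_workers=1) as executor:
            pending = queue_writes(executor, fd, records)
            while not all(request.done() for request in pending):
                time.sleep(0)   # the caller is free to do other work here
            # Each result is the byte count returned by the completed write.
            return [request.result() for request in pending]
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        journal = os.path.join(scratch, "journal.log")
        print(log_without_blocking(journal, [b"start\n", b"stop\n"]))
```

This corresponds to the journalizing case described earlier: the logging records are queued, and the initiating code checks on them only when it chooses to.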
Buffer Cache
Unless otherwise specified, file I/O operations use
the system buffer cache as an intermediate storage area between user
address space and the device itself. For instance, when writing a
file, the data is actually written to the buffer cache; the operating
system periodically flushes the contents of the buffer cache to the
disk. Each buffer has a buffer header associated with it that holds
the control information about the buffer such as what block and file
system this data came from. This buffering scheme is defined in the
buf.h header file. The number and size of buffers and the number
of buffer headers and hash slots for the buffer cache are controlled
by a series of tunable kernel parameters.
From the view of the driver that transfers data between
the system buffer cache and the device, I/O operations that use the
buffer cache are called block I/O, because data is transferred
one block at a time, where a block can range from 512 bytes to 131072
bytes, depending on the logical block size of the file system. The
system buffering scheme allows drivers to transfer linked lists of
data, although actual physical I/O is done one block at a time. Without
this facility, an I/O operation would have to return after each buffer
was transferred. For instance, when writing a 6-block file, the system
would write one block, return, write one more block, return, and so
forth. By using a linked list, the system looks for the next buffer
when it finishes transferring the first block of data, and only returns
when the entire 6 blocks are transferred. The system still performs
six distinct operations, but it avoids the overhead of returning after
each operation. The driver interface for block
I/O is discussed more in Chapter 6.
Use of the buffer cache has several advantages:
- Data caching - The data remains in main memory as long
as possible. This allows a user process to access the same data
several times without performing physical I/O for each request.
The user process is blocked for a very short time while waiting
for the I/O, since it does not have to wait for the physical I/O
operation.
- Paging enabled - If no buffering of data were done, a
user process undergoing I/O would be locked in main memory until the
device transferred data into or out of the user data space. Because
there is a system buffer between the user data space and the device,
the process can be paged out until the transfer between the device
and the buffer is completed, then paged back in to transfer data
between the buffer and user data space.
- Consistency - The operating system uses the same buffer
cache as user processes when doing I/O with a file system, so
there is only one view of what a file contains. This allows any
process to access a file without worrying about timing. The buffer
algorithms also help ensure file system integrity, because they
maintain a common, single image of disk blocks contained in the
cache. If two processes simultaneously attempt to manipulate one
disk block, the buffer algorithms serialize their access, preventing
data corruption.
- Portability - Since I/O operations are done through the
buffer cache, user-level programs do not have to worry about data
alignment restrictions the hardware may impose. The kernel aligns
data internally, so you can port programs to new kernels without
rewriting them to conform to new data alignment properties.
- System Performance - Use of the buffer cache can reduce
the amount of disk traffic, thus increasing overall system throughput
and decreasing average response time. Especially when transmitting
small amounts of data, having the kernel buffer the data until
it is economical to transmit it to or from the disk improves general
system performance.
Bypassing the Buffer Cache
Use of the buffer cache has significant advantages for
general applications, but it also has disadvantages for some realtime
applications. Consequently, the REAL/IX Operating System provides
the capability of bypassing the buffer cache by setting a file control
(fcntl(2)) function on the file descriptor. I/O operations
that bypass the buffer cache transfer information directly between
user address space and the device.
Bypassing the buffer cache may speed up individual
I/O operations. Remember, however, that this gain in the speed
of the individual operation comes at the expense of slowing other I/O
operations running on the system.
Sharing Files
Two or more processes may share the same open file in
one of the following ways:
- A child process inherits its parent's open files. The child
and parent processes also share I/O pointers, so a read(2)
by one process changes the position in the file for both processes.
Normally the child takes over the files while it runs, and the
parent does nothing with the files until after the child terminates.
- Two or more unrelated processes may open the same file at the
same time, as long as they all have adequate permissions. In this
case, each process has a separate I/O pointer, but a change to
the contents of the file is seen by all processes that have the
file open.
- You may link two or more file names to the same file, as long
as all the file names are in the same file system. The linked
names refer to the same inode, so any update made through one name
is seen through the others.
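The difference between an inherited descriptor and a separate open of the same file can be observed directly. In the sketch below, which assumes a POSIX host, a child process reads through the inherited descriptor and the parent then inspects the I/O pointers. A second open in the same process stands in for an unrelated process, and the function name offsets_after_child_read is hypothetical.

```python
import os
import tempfile


def offsets_after_child_read(path, count):
    """Return (inherited_offset, independent_offset) after a child reads count bytes."""
    inherited = os.open(path, os.O_RDONLY)
    pid = os.fork()
    if pid == 0:                        # child: read through the inherited descriptor
        try:
            os.read(inherited, count)
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    independent = os.open(path, os.O_RDONLY)   # a separate open of the same file
    try:
        # lseek with an offset of 0 from the current position reports the I/O pointer.
        return (os.lseek(inherited, 0, os.SEEK_CUR),
                os.lseek(independent, 0, os.SEEK_CUR))
    finally:
        os.close(inherited)
        os.close(independent)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as scratch:
        scratch.write(b"0123456789")
        scratch.flush()
        print(offsets_after_child_read(scratch.name, 4))  # (4, 0)
```

The child's read moves the pointer seen by the parent, while the separately opened descriptor still starts at the beginning of the file.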
If two or more processes update the same file at the
same time, you may lose or corrupt data. For this reason, the REAL/IX
Operating System allows you to set read and write locks on a file
or record with fcntl(2) functions. A read lock prevents any
process from write locking the protected area; a write lock prevents
any process from read or write locking the protected area. You may
implement these locks so that the process either fails when it attempts
to set a lock against an already-locked area, or blocks until the
existing lock is released.
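The failing (non-blocking) form of a write lock can be sketched with Python's fcntl.lockf, which is a wrapper around the fcntl locking requests on a POSIX host. The request names that REAL/IX itself uses are not shown here, and the function name second_process_can_lock is hypothetical.

```python
import fcntl
import os
import tempfile


def second_process_can_lock(path):
    """Hold a write lock on path, then report whether a child process can lock it too."""
    with open(path, "r+b") as holder:
        fcntl.lockf(holder, fcntl.LOCK_EX)          # write lock on the whole file
        pid = os.fork()
        if pid == 0:                                # child: a second process
            status = 1
            try:
                with open(path, "r+b") as contender:
                    # LOCK_NB asks for failure at once instead of blocking.
                    fcntl.lockf(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    status = 0                      # the lock was granted
            except OSError:
                status = 1                          # the area is already locked
            finally:
                os._exit(status)
        _, wait_status = os.waitpid(pid, 0)
        fcntl.lockf(holder, fcntl.LOCK_UN)          # release the parent's lock
        return os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as scratch:
        scratch.write(b"record")
        scratch.flush()
        print(second_process_can_lock(scratch.name))
```

Omitting LOCK_NB in the child would select the other behavior described above: the child would block until the parent released its lock.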
File System Structure
The REAL/IX Operating System supports two file system
architectures, referred to as S5 and F5. The S5 architecture is the
same as the file system structure in UNIX System V. The F5 file system
architecture provides faster file access for both realtime and time-sharing
applications. Both file system architectures are accessed through
the same system calls and support the full range of logical block
sizes. The structure of the two file systems is similar, as illustrated
in Figure 5-4.
Figure 4 - Internal Structure of a File System
The major sections shown above are:
- Superblock - Contains global information about the file
system, such as the file system architecture being used, the size
of the ilist, how many inodes are used, the logical block size,
the total number of blocks in the file system, and the number
of blocks used. It also contains some status information about
the file system, as explained below.
- Ilist - A numbered collection of inodes for all files
in the file system. An inode holding control and status information
is associated with each file in the file system.
- Bitmap (F5) - A map of all data storage blocks allocated
to the file system. The operating system can scan this table to
determine which blocks are free and which are not.
- Data Storage Blocks - Storage for the data in the files
and directories.
Inodes
To the operating system, a file system is an arrangement
of logical blocks of disk space. The size of a logical block is determined
for each file system at the time it is created, and can range from
512 to 131072 bytes. The file system is held together by a system of inodes,
meaning information nodes. Each regular file, pipe, special file,
and directory has an inode associated with it; it contains all of
the information about the file except its name (kept in a directory)
and its contents (kept in the data storage blocks for the file system).
Note that the inodes for special files do not point to data storage
blocks, but store the major and minor number of the device instead.
Major and minor device numbers are
discussed in greater detail in Chapter 6.
All the inodes for a file system are stored contiguously,
starting at logical block number 2 of the file system, in a list called
the ilist. The number of inodes each block in the ilist can
hold is determined by the file system architecture being used (S5
inodes are 64 bytes each and F5 inodes are 128 bytes each) and the
logical block size of the file system. For instance, each block in
an S5 file system that uses 512-byte logical blocks can hold 8 inodes
(which are 64 bytes each), and each block in an F5 file system that
uses 4096-byte logical blocks can hold 32 inodes (which are 128 bytes
each). The number of inodes specified at the time the file system
is created determines the number of blocks dedicated to the inode
list.
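The arithmetic in this paragraph can be written out as follows. The inode sizes come from the text; rounding the ilist up to a whole number of blocks is an assumption of this sketch, and the function names are hypothetical.

```python
# Bytes per inode for each file system architecture, as given in the text.
INODE_SIZE = {"S5": 64, "F5": 128}


def inodes_per_block(architecture, block_size):
    """Number of inodes that fit in one logical block of the ilist."""
    return block_size // INODE_SIZE[architecture]


def ilist_blocks(architecture, block_size, inode_count):
    """Logical blocks needed to hold inode_count inodes (rounded up, by assumption)."""
    per_block = inodes_per_block(architecture, block_size)
    return -(-inode_count // per_block)   # ceiling division


if __name__ == "__main__":
    print(inodes_per_block("S5", 512))    # 8
    print(inodes_per_block("F5", 4096))   # 32
    print(ilist_blocks("S5", 512, 1000))  # 125
```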
The inumber gives the starting point for the
inode as an offset into the list. It is not kept as part of the inode
for a file, but is calculated when the inode is accessed. To access
the inumber, use the stat(2) or estat(2) system call
or the ls -i command.
An inode for a file contains:
- Mode - the file type (regular, directory, special, or pipe),
the execution bits (sticky bit, set user and group IDs), and access
permissions for the owner, group, and others. If this field is
null, the inode is not in use.
- Number of Links - number of times this file is referenced in
a directory, or the number of links to the file. Unnamed pipes
have a zero value in this field.
- Owner and Group IDs - indicate the owner and group identification
for the file and are used when checking access permissions for
the file. Both of these are numeric values; the translation to
character strings is found in the /etc/passwd and /etc/group
files.
- Size - the number of bytes in the file. If the current offset
in the file is the same as the size of the file, the user is at
the end of the file. For a pipe file, this field shows the number
of bytes written but not yet read. For a special device file,
this field shows the major and minor numbers of the device.
- Access Times - three times are kept: the last time the file
was modified, the last time the file was read, and the last time
the inode was altered. A simple read of a file does not affect
the modification time but does affect the access time.
- Data Block Addresses - an array of 13 data block numbers used
to access the (non-contiguous) data blocks for the file.
- Extent List (F5) - a list of up to 4 extents of contiguous data
blocks allocated for the file.
- Last Written (F5) - the last block written to the file.
- Flags (F5) - determine how space is automatically allocated
for writes beyond the extent-based portion of the file and whether
space should shrink on creat(2)/trunc(2) operations.
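Most of the fields common to both architectures are visible to a program through stat(2). The sketch below reads them with Python's os.stat on a POSIX host; the F5-specific fields (extent list, last written block, flags) are not available this way, and the names file_type and inode_summary are hypothetical.

```python
import os
import stat
import tempfile


def file_type(mode):
    """Classify a mode value using the four basic file types."""
    if stat.S_ISREG(mode):
        return "regular"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        return "special"
    return "other"


def inode_summary(path):
    """Collect the inode fields that stat(2) reports for path."""
    info = os.stat(path)
    return {
        "inumber": info.st_ino,
        "type": file_type(info.st_mode),
        "permissions": stat.S_IMODE(info.st_mode),  # includes setuid, setgid, sticky
        "links": info.st_nlink,
        "owner": info.st_uid,
        "group": info.st_gid,
        "size": info.st_size,
        "modified": info.st_mtime,
        "accessed": info.st_atime,
        "inode_changed": info.st_ctime,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        name = os.path.join(scratch, "data")
        with open(name, "wb") as handle:
            handle.write(b"12345")
        os.link(name, os.path.join(scratch, "alias"))  # a second name, same inode
        print(inode_summary(name))
```

After the os.link call, the link count reported for either name is 2 and both names report the same inumber, which matches the description of linked file names sharing one inode.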
Superblock
The first 512 bytes of the file system (the first physical
block) are not used in REAL/IX file systems. This is a vestige from
older file system architectures that needed this area for bootstrap
information.
The superblock occupies the second physical block of
the file system. Note that, for file systems using logical blocks
larger than the 512-byte physical block, this leaves the rest of the
first logical block unused before the ilist begins at logical block
2. The superblock is read into memory when the file system is mounted,
and is updated on the disk periodically by the bdflush daemon.
For synchronous file systems, the superblock is updated every time
a write operation is initiated.
The superblock contains the following status information.
To display any of the size or count information on a particular file
system, use the df -t command.
Sizes
- size of the inode list: the total number of inodes available
in the file system; it must be less than 65500.
- size of the file system: the total number of logical
blocks in the file system, including the super block, inode list,
and data blocks. You may derive the size of the data block collection
from this number.
Counts of Available File System Space
- free inodes: the total number of inodes available for
allocation.
- free blocks: the total number of storage blocks
available for allocation.
Memory-Resident Flags
- modification ("dirty") flag: This flag is set
when the super block in main memory is modified. When sync(2)
runs, it checks for this flag and, if set, updates the super block
on disk and clears the flag. If this flag is not set, sync
does not update the file system or the super block.
- read-only flag: This flag is set when the file system
is mounted as a read-only file system; if set, any attempt to
write to a file in this file system will fail. This flag is frequently
set on file systems that contain only source code, or when debugging
device drivers to ensure that no writes are accidentally done
to files in the file system. This flag is also set when backing
up a file system while the system is in multi-user state.
File System Description Information
- magic number: indicates the file system architecture
used. This is S5FsMAGIC for S5 file systems and F5FsMAGIC for
F5 file systems.
- logical block size (type): indicates the logical block
size of the file system. 1 indicates a 512-byte logical block,
2 indicates a 1024-byte logical block, and so forth up to 9, which
indicates a 131072-byte (128K) logical block.
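The type values listed above are consistent with a block size that doubles at each step. Assuming that pattern holds for the intermediate values (the text gives only 1, 2, and 9 explicitly), the mapping can be sketched as follows; the function name logical_block_size is hypothetical.

```python
def logical_block_size(block_type):
    """Return the logical block size in bytes for a superblock type value (1-9)."""
    if not 1 <= block_type <= 9:
        raise ValueError("block type must be between 1 and 9")
    # Type 1 is 512 bytes; each further step is assumed to double the size.
    return 512 << (block_type - 1)


if __name__ == "__main__":
    for block_type in (1, 2, 9):
        print(block_type, logical_block_size(block_type))
```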
The superblock also contains two array fields used
for allocating inodes and storage blocks, each of which has a corresponding
lock field used to lock the array while it is manipulated. This
is discussed under "Allocating Free Blocks and Inodes" below.
Accessing Data Storage Blocks
UNIX operating systems do not require creating files
with an initial allocation; in fact, on most UNIX systems, this is
not possible. Rather, a file is created when opened for writing for
the first time, and space for the data is allocated as writes are
done to the file, one block at a time. As a file expands, free blocks
are allocated for the file and associated with the file's inode, and
as the file shrinks, data blocks that are no longer needed are deallocated
from the file and made available to other files in the file system.
This conserves space in the file system, since the maximum waste for
each file is less than the size of one logical block.
The drawback of the standard scheme is that executing
processes must absorb the overhead of allocating and deallocating
data storage blocks for their files, and that the blocks associated
with a file are (most likely) not contiguous, which also increases
file access time. This overhead is not excessive for most standard
applications, but you may consider it unacceptable for critical realtime
processes. For this reason, the F5 file system architecture allows
you to allocate up to four extents for each file. An extent
is a chunk of contiguous data blocks that are preallocated when the
file is created.
The inode of each file has pointers to the data storage
blocks associated with that file. Inodes in both the S5 and F5 file
system architectures include 13 disk block addresses that point to
the non-contiguous data blocks allocated for that file. In addition,
the F5 file system has four extent addresses that point to the extents
allocated to the file. A file in an F5 file system may use only the
data block addresses, only the extents, or a combination of the two
to access its data blocks. The data stored in a storage block
that is part of an extent is referred to as the extent-based part
of the file; certain operations are supported only on extent-based
parts of files.
Data Block Address Array
The inode's data block address array points to the data
storage blocks associated with all S5 files and with the non-extent-based
parts of files in F5 file systems. It contains 13 block addresses:
- blocks 0 - 9 - point to the first 10 data storage blocks associated
with the file.
- block 10 - for files that are larger than 10 blocks, points
to an indirect block, which in turn points to the block addresses
of another set of data blocks associated with the file.
- block 11 - for files that contain more blocks than are accessible
through the single indirection of block 10; points to a double
indirect block that contains the addresses of a set of indirect
blocks, each of which contains the addresses of another set of
data blocks.
- block 12 - for even larger files, points to a triple indirect
block that contains the addresses of a set of double indirect
blocks that point to sets of indirect blocks that point to the
data blocks.
This scheme is illustrated in Figure 5-5. Data storage
blocks are shaded, and indirect blocks are shown unshaded.
Figure 5 - Accessing Non-Contiguous Data Blocks
The number of data blocks accessed by each indirect
block varies depending on the logical block size of the file system;
it is calculated by dividing the logical block size by 4, which is
the size of one pointer. The number of bytes of data that
you may store in each data block also increases for the larger logical
block sizes, so the file size reachable at each level of indirection
grows rapidly with the block size. Double and triple indirection is
therefore seldom needed but, if it is used, it can add a fair amount of
overhead to file access.
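The arithmetic behind this can be sketched in a few lines of Python.
The 4-byte pointer size and the 10 direct addresses come from the text;
the function and constant names are hypothetical and chosen only for
this illustration.

```python
POINTER_SIZE = 4    # bytes per block address, as stated in the text
DIRECT_SLOTS = 10   # address array entries 0 through 9

def addressable_blocks(block_size: int) -> dict:
    """Number of data blocks reachable at each level of the address array."""
    per_indirect = block_size // POINTER_SIZE
    return {
        "direct": DIRECT_SLOTS,
        "single": per_indirect,
        "double": per_indirect ** 2,
        "triple": per_indirect ** 3,
    }

def max_bytes_without_double(block_size: int) -> int:
    """Largest file, in bytes, that needs no double indirect block."""
    levels = addressable_blocks(block_size)
    return (levels["direct"] + levels["single"]) * block_size
```

With a 1024-byte logical block, each indirect block holds 256
addresses, so a file can reach 266 blocks (272384 bytes) before the
double indirect block is needed.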
Extent List
Inodes for files in F5 file systems have an extent list
in addition to the data block address array discussed above. An extent
is a set of preallocated, contiguous data blocks. The operating system
does not impose a limit on the number of data blocks that are allocatable
to one extent beyond the limit of the size of the file system. In
other words, the operating system will allow you to create a file
system that contains one file with one extent that contains all the
data storage blocks allocated for the file system. Each extent is
identified by the physical offset into the file system of the first
block and a running sum of the number of data blocks allocated to
all extents for the file, as illustrated in Figure 5-6.
Figure 6 - Accessing Extents (Contiguous Data Blocks)
In this figure, the first extent begins at block 300
and contains 10 data storage blocks, so the sum is 10. The second
extent begins at block 100, and contains 50 data storage blocks, so
the running sum is 60, which is the number of data blocks allocated
to the first and second extents. The sum for the fourth extent represents
the total number of data storage blocks allocated to all extents for
the file (162 in this example). Summing the extents makes loops self-terminating,
and thus more efficient. If no extents are used, the offset and sum
for all extents are 0. You may view the offset and sum for the extents
by using the ls -E command string.
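One way to picture how the running sums are used is the lookup below,
which maps a logical block number within the extent-based part of a
file to a physical block. This is a sketch under stated assumptions,
not the REAL/IX implementation: the first two extents match the
example above, while the third and fourth are invented so that the
final sum is 162, and all names are hypothetical.

```python
# Each entry is (physical offset of first block, running sum of blocks).
# Entries 1 and 2 follow the text; entries 3 and 4 are made-up values.
EXTENTS = [(300, 10), (100, 60), (500, 100), (700, 162)]

def physical_block(extents, logical):
    """Return the physical block for a logical block, or None if out of range."""
    if logical < 0:
        raise ValueError("logical block number must not be negative")
    previous_sum = 0
    for start, running_sum in extents:
        if running_sum == 0:
            # An offset and sum of 0 mark an unused extent slot.
            break
        if logical < running_sum:
            # The block lies in this extent; subtract the blocks that
            # belong to earlier extents to get the offset within it.
            return start + (logical - previous_sum)
        previous_sum = running_sum
    return None
```

Because each entry stores a cumulative count, the loop needs only one
comparison per extent and stops as soon as the logical block number
falls below a running sum.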
Inodes for files in F5 file systems also contain additional
flags that describe the allocation mechanism. These are viewed with
either the ls -e or ls -E command:
- c - file has contiguous extents.
- s - physical space will shrink (in non-contiguous fashion)
on truncates and creates; default is to not shrink.
- g - grow file for writes beyond the end of physical space;
if not set, writes beyond last active extent will fail.
Use the prealloc(1R) command or system call to
allocate contiguous file space to a file, specify the physical disk
location of the extent, and set the flags that control allocation.
The trunc(2) system call truncates a file to a specified size.
The estat(2) and efstat(2) system calls provide statistical
information (such as file size and number of links) for files in F5
file systems similar to that returned by the stat(2) and fstat(2)
system calls for S5 file systems, as well as information on the extents.
Accessing Free Inodes and Data Blocks
Each time a file is created, the operating system must
allocate an inode and an appropriate number of data blocks to the
file. Inodes are allocated the same way in both file system architectures,
using an array of free inodes stored in the super block. Data blocks
are identified and allocated using different schemes for each architecture,
with S5 file systems using a linked list with a cache kept in the
super block and F5 file systems using a bitmap of free blocks stored
outside the super block.
Allocating and Deallocating Inodes
The super block's free inode array contains the inumbers
of 100 free inodes, although the file system may have many more available.
An index points to the next available slot in the array. When a file
is created, the system must allocate an inode; if no inodes are available
for allocation, the open(2) or creat(2) system call
will fail.
The operating system takes the following steps to allocate
an inode:
- Lock the free inode array.
- Decrement the index into the free inode array and use it to
find the inumber of the next available inode. Free inodes are
identified by a zero value for the mode field.
- If the free inode array empties, go to the ilist on disk and
look for free inodes. As they are found, put their numbers in
the array of free inumbers until the array is filled (100 inumbers)
or, if the file system has fewer than 100 available inodes, until
all available inodes are listed in the free inode array.
- Read the inode into main memory.
- Assign appropriate header information to the inode.
- Unlock the free inumber array and unblock any processes that
were blocked while waiting for the array.
When an unlink(2) system call removes the last
link to a file, the inode for that file is removed and its inumber
is added back to the free inode array.
If the array is already full (has 100 inumbers in it),
the operating system compares the number of the newly deallocated
inode against the number in the first slot of the free inumber array.
The lower of the two numbers is put (remains) in the first slot; the
higher of the two numbers is added to the ilist on disk. This eliminates
unnecessary searches when refilling the free inumber array, by ensuring
that the lowest numbered inode available is in the first slot of the
free inumber array.
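The allocation and deallocation rules above can be modeled in a short
Python sketch. It is a simplified illustration, not the kernel's code:
locking is omitted, the ilist is a dictionary of mode values, the
refill scan always starts at inode 1, and the class and method names
are hypothetical.

```python
ARRAY_SIZE = 100  # capacity of the super block's free inode array

class InodeAllocator:
    def __init__(self, total_inodes: int):
        # A mode of 0 marks an inode as free in the on-disk ilist.
        self.mode = {n: 0 for n in range(1, total_inodes + 1)}
        self.free = []
        self._refill()

    def _refill(self):
        """Scan the ilist and load up to ARRAY_SIZE free inumbers."""
        for inumber in sorted(self.mode):
            if len(self.free) == ARRAY_SIZE:
                break
            if self.mode[inumber] == 0:
                self.free.append(inumber)

    def allocate(self, mode: int = 0o100644) -> int:
        if not self.free:
            self._refill()
        if not self.free:
            raise OSError("no free inodes")
        inumber = self.free.pop()   # take the entry at the top of the array
        self.mode[inumber] = mode   # a non-zero mode marks it in use
        return inumber

    def release(self, inumber: int):
        self.mode[inumber] = 0      # the inode is free on disk again
        if len(self.free) < ARRAY_SIZE:
            self.free.append(inumber)
        elif inumber < self.free[0]:
            # Array is full: keep the lower inumber in the first slot.
            self.free[0] = inumber
```

When the array is full and the released inumber is the higher of the
two, it is recorded only by its zero mode in the ilist and is found
again by a later refill.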
Allocating and Deallocating Free Blocks
Free blocks are data storage blocks that do not currently
contain inodes, indirect address blocks, file extents, or data and
are not part of the directory tree structure. Free blocks are each
one logical block long.
For S5 file systems, the super block's free block array
contains the data block numbers of 50 free data blocks, forming the
beginning of a free list of data block numbers. This array is used
as a stack whenever the file system needs to allocate another data
block. To repopulate the free block array, the operating system uses
address 0 as a pointer to a free block that contains 50 more addresses.
This algorithm is extremely fast but tends to scatter data over time.
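A chained free list of this kind can be sketched as follows. The model
assumes the conventional System V arrangement, in which a full array
is saved into the block being freed and that block becomes the next
link; the text does not spell out these details, so treat them, and
all names used here, as assumptions.

```python
FREE_ARRAY_SIZE = 50  # entries in the super block's free block array

class S5FreeList:
    def __init__(self):
        self.free = [0]   # a 0 in slot 0 marks the end of the chain
        self.disk = {}    # link block number -> saved array of addresses

    def release(self, block: int):
        if len(self.free) == FREE_ARRAY_SIZE:
            # Array is full: save it in the block being freed, which
            # then becomes the link block held in slot 0.
            self.disk[block] = self.free
            self.free = []
        self.free.append(block)

    def allocate(self) -> int:
        block = self.free.pop()          # the array is used as a stack
        if not self.free:                # slot 0 was just consumed
            if block == 0:
                self.free.append(0)
                raise OSError("no free blocks")
            # Slot 0 pointed at a link block: reload the array from it,
            # then hand the link block itself to the caller.
            self.free = self.disk.pop(block)
        return block
```

Blocks come back in last-in, first-out order, which is why the scheme
is fast but pays no attention to where blocks sit on the disk.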
For F5 file systems, a bitmap of all free data blocks
in the file system is stored outside the super block. To allocate
data blocks for a file, the operating system searches this
bitmap to identify free data blocks. This is slower than the S5 free
list, but tends to localize disk accesses to a particular file, and
also enables the operating system to quickly identify contiguous
blocks available for allocation to extents.
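Finding a run of contiguous free blocks in a bitmap amounts to a single
pass over the map. The sketch below uses a Python list of booleans
(True meaning free) in place of a packed bitmap; the representation
and the function name are assumptions made for illustration.

```python
def find_contiguous(bitmap, count):
    """Return the index of the first run of `count` free blocks, or None."""
    if count <= 0:
        raise ValueError("count must be positive")
    run_start = 0
    run_length = 0
    for index, is_free in enumerate(bitmap):
        if is_free:
            if run_length == 0:
                run_start = index      # a new run of free blocks begins
            run_length += 1
            if run_length == count:
                return run_start
        else:
            run_length = 0             # an allocated block ends the run
    return None
```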
When a process allocates a data block dynamically
in an F5 file system, the system attempts to cache several contiguous
data blocks (the default is 8, but you may modify this number through
a kernel tunable parameter), which are used if the application needs
to allocate additional blocks. The extra data blocks in the cluster
are freed when the process closes the file or exits. This results
in files that are less fragmented on disk, and thus accessed more
efficiently.
Summary Comparison of S5 and F5 File Systems
The previous sections have discussed the internal differences
between the S5 and F5 file system architectures. Table 5-1 summarizes
these differences.
Table 1 - File System Comparison
File System Access Tables
When a file system is identified to the operating system
with a mount(1M) command, the operating system makes an entry
in the mount table and reads the super block into an internal buffer
maintained by the kernel. Parts of the super block that are needed
in memory are the lists of free inodes and storage blocks and the
flags and time fields that are constantly being modified.
Three system tables maintain information about all files
that are opened or referenced. These are:
- System inode Table: contains information from the inodes
for each open or referenced file. The operating system maintains
one system inode table for all processes; it is part of the operating
system's address space.
- System File Table: contains information about opens of
files. Each time a file is opened, an entry is allocated in this
table that identifies the way the file was opened and the user's
current offset in the file. Like the system inode table, this
table is part of the operating system's address space.
- User Area File Descriptor Table: a table of pointers
to entries in the system file table. Each process has one such
table, which resides in that process's user area.
Figure 5-7 shows how these tables string together. The
following sections describe these tables in more detail.
Figure 7 - File System Tables and Their Pointers
The System Inode Table
The system inode table holds most information from the
disk version of the inode, as well as the following:
- device: identifies the device on which the file system
resides. This indicates the file system of which the file is a
part.
- inumber: along with device, locates the file in the current
file hierarchy. The location of the inode is calculated each time
the file is referenced using the inumber as an offset, and is
not stored in the disk version of the inode.
- reference count: number of times the file was opened
or referenced; in other words, how many pointers from the system
file table point at this entry. When a table entry is shared,
the reference count shows how many processes are sharing the entry.
When this number drops to zero, the file is no longer being referenced
and the entry may be deallocated.
- last logical block read: used by the system to determine
if the system is reading the file sequentially. If it is, the
operating system invokes the read-ahead feature each time a read
is done. This queues block n+1 for reading when
block n is read, on the assumption that the system will eventually
require block n+1.
- hash list pointers: used for locating used inode table
entries and maintaining a free list of inode table entries.
- flags: indicate whether the file represents a shared-text
a.out file, a locked inode, a modified file, or a mount-point
directory. Other flags specify if the file is open for synchronous
writes and if someone is waiting for the system to unblock this
inode.
Notice that the system inode table does not contain
any time stamps, which are part of the disk version of the inode.
Several linked lists are used to provide easy access
to system inode table entries currently in use and to keep a list
of free inode table entries for use when allocating entries while
opening and referencing files. These lists are present in all systems
based on UNIX System V to decrease the time required to manipulate
new files:
- System Inode Table Free List: starts at an operating
system variable, which points to an unused system inode table
entry, which points to another, and so on until all free table
entries are on the free list. The link pointer of the last entry
has a null pointer as its value.
- System Inode Table Hash Lists: consist of pointers to
inode table entries, each of which starts a hash list of inode
table entries. This reduces the average number of entries inspected
when looking for a file in the table by breaking the total number
of used entries into a series of lists specified by the NINODE
kernel parameter. These lists are maintained through an inode
hash table.
The inumber and device are hashed to find an entry in
the hash table. This entry points to an entry in the system inode
table which is the head of the hash list. If the file is represented
by an entry in the inode table, it is on this hash list.
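The hash lookup can be pictured with the following sketch. The hash
function (device plus inumber, modulo the number of lists) and the
dictionary layout of an entry are assumptions for illustration; the
text does not specify how REAL/IX computes the hash.

```python
class InodeTable:
    def __init__(self, hash_lists: int = 8):
        # Each hash list holds the in-use entries that hash to it.
        self.lists = [[] for _ in range(hash_lists)]

    def _hash_list(self, device: int, inumber: int):
        return self.lists[(device + inumber) % len(self.lists)]

    def lookup(self, device: int, inumber: int):
        """Walk one hash list only; return the entry or None."""
        for entry in self._hash_list(device, inumber):
            if entry["device"] == device and entry["inumber"] == inumber:
                return entry
        return None

    def reference(self, device: int, inumber: int):
        """Find or create the entry and count one more reference."""
        entry = self.lookup(device, inumber)
        if entry is None:
            entry = {"device": device, "inumber": inumber, "refcount": 0}
            self._hash_list(device, inumber).append(entry)
        entry["refcount"] += 1
        return entry
```

Only the entries on one hash list are inspected per lookup, rather
than every in-use entry in the table.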
The System File Table
The system file table contains information about the
opening of files on the system. Each time a file is opened, an entry
is allocated and populated in the system file table. A system file
table entry contains the following information:
- reference count: indication of how many file descriptors
are pointing to this entry. One or more processes may have descriptors
pointing to this entry. If two processes are pointing to the same
system file table entry, they must be related through a fork and
the file must have been opened before the fork.
- current offset: indication, in bytes, of the user's position
within the file.
- flags: used to record how the file was opened. They include
the following values:
- read: the file is available for reading
- write: the file is available for writing
- append: before each write to the file, the current offset
is set equal to the size of the file. Therefore, if two or
more users are appending to the same file, they will always
append to the end of the file, not just starting at the end.
- no delay: if the opened file is a named pipe, reads to an
empty pipe will not cause the process to sleep, nor will writes
to a full pipe. Instead, an error is returned.
- pointer to system inode table entry: link to the remainder
of the information about the file; an access pathway to the data
blocks for the file.
The User Area File Descriptor Table
The user area file descriptor table is part of the user
area and contains pointers to entries in the system file table, which
in turn point to entries in the system inode table. Each entry is
referenced by a file descriptor (described above) and corresponds
to an open file. The size of this table is determined by the kernel
parameter NOFILES. The default value of NOFILES is 80, which allows
each process to have up to 80 files open concurrently, including stdin,
stdout, and stderr. The dup(2) system call copies
one entry in the file descriptor table to another.
Using the File Access Tables
To illustrate how the file access tables work, the following
sections describe the internal system activities related to the open(2),
creat(2), read(2), and write(2) system calls.
open
The open(2) system call opens an existing file
and returns a file descriptor. As a simple example of the open
system call, assume that the argument to open is the path /a/b.
- The operating system sees that the path name starts with a slash,
so the search starts at the root directory; the user area holds a
pointer to the inode table entry for the root directory's inode.
- Using the root inode, the system does a linear scan of the root
directory file looking for an entry "a". When "a"
is found, the operating system picks up the inumber associated
with "a".
- The inumber gives the offset into the ilist in which the inode
for "a" is stored. At that location, the system determines
that "a" is a directory by looking at the file type.
- That directory is searched linearly until an entry "b"
is found. When "b" is found, its inumber is picked up
and used as an index into the ilist to find the inode for "b".
- The inode for "b" is copied to the system inode table,
and the reference count is incremented.
- The system file table entry is allocated, the pointer to the
system inode table is set, the offset for the I/O pointer is set
to zero to indicate the beginning of the file, and the reference
count is initialized.
- The user area file descriptor table entry is allocated with
a pointer set to the entry in the system file table.
The algorithm for locating the inode of a file illustrates
why it is advisable to keep directories small. Search time is also
reduced by keeping subdirectory names near the beginning of a directory
file, which the dcopy command does.
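The directory search described in the steps above can be modeled as a
loop over path components, with a linear scan of each directory. In
this sketch the ilist is a Python dictionary, directories are lists of
(name, inumber) pairs, and the root inumber of 2 is an assumption; none
of these names comes from REAL/IX.

```python
ROOT_INUMBER = 2  # assumed inumber of the root directory

def lookup_path(ilist, path):
    """Resolve an absolute path name to an inumber by linear directory scans."""
    if not path.startswith("/"):
        raise ValueError("this sketch handles absolute path names only")
    inumber = ROOT_INUMBER
    for name in [part for part in path.split("/") if part]:
        inode = ilist[inumber]
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        # Linear scan: cost grows with the entry's position in the directory.
        for entry_name, entry_inumber in inode["entries"]:
            if entry_name == name:
                inumber = entry_inumber
                break
        else:
            raise FileNotFoundError(path)
    return inumber
```

The inner loop shows why entries near the beginning of a directory are
found sooner than entries near the end.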
creat
The creat(2) system call creates a new file and
returns a file descriptor. It functions like the open(2) system
call, with three additional steps at the beginning:
- The super block is referenced for a free inode.
- The mode of the file is established (by combining the system
defaults with the complement of a umask entry) and entered
in the inode.
- Using the inumber, the system goes through a directory search
similar to that used in the open system call, except that
here the last portion of the path name is written by the system
into the directory that is the next to last portion of the path
name, and the inumber of the newly-created file is stored with
it.
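The mode calculation in the second step follows the usual UNIX rule:
permission bits set in the umask are cleared from the requested mode.
A minimal sketch, with a hypothetical function name:

```python
def creation_mode(requested: int, umask: int) -> int:
    """Combine the requested permissions with the complement of the umask."""
    return requested & ~umask & 0o777
```

For example, a requested mode of 0666 with a umask of 022 yields 0644.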
read and write
The read(2), write(2), aread(2)
and awrite(2) system calls take a file descriptor as an argument,
and follow these steps:
- Using the file descriptor as an index, the file descriptor table
is read to get a pointer to the system file table.
- The user buffer address and number of bytes to read/write are
supplied as arguments to the call. The current offset into the
file is read from the system file table entry. For the aread
and awrite system calls, the offset is modified by settings
within the aiocb control block.
- For read operations, the inode is found by following the pointer
from the system file table entry to the system inode table. The
operating system copies the data from storage to the user's buffer.
- For write operations, the same pointer chain is followed, but
the system writes into the data blocks. If new blocks are needed,
they are obtained by the system from the file system's list of
free blocks.
- The read or write operation will take place, provided the file
is not locked by another process.
- Before the system call returns to the user, the number of bytes
read or written is added to the offset in the system file table.
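The pointer chain followed by these steps can be modeled with three
small Python classes, one per table. This is an illustrative sketch
only: file contents are held in memory, locking and block allocation
are not modeled, and every class and method name is hypothetical.

```python
class Inode:
    """Stands in for a system inode table entry and its data blocks."""
    def __init__(self, data: bytes = b""):
        self.data = bytearray(data)

class FileEntry:
    """Stands in for a system file table entry."""
    def __init__(self, inode: Inode):
        self.inode = inode
        self.offset = 0

class Process:
    def __init__(self):
        self.fd_table = []          # user area file descriptor table

    def open(self, inode: Inode) -> int:
        self.fd_table.append(FileEntry(inode))
        return len(self.fd_table) - 1

    def read(self, fd: int, nbytes: int) -> bytes:
        entry = self.fd_table[fd]   # descriptor -> file table entry
        data = entry.inode.data     # file table entry -> inode
        chunk = bytes(data[entry.offset:entry.offset + nbytes])
        entry.offset += len(chunk)  # advance by the bytes actually read
        return chunk

    def write(self, fd: int, payload: bytes) -> int:
        entry = self.fd_table[fd]
        end = entry.offset + len(payload)
        entry.inode.data[entry.offset:end] = payload
        entry.offset = end          # advance by the bytes written
        return len(payload)
```

Two processes that open the same inode get separate file table
entries, so each keeps its own offset while the inode data is shared.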
Files Shared by Related Processes
If related processes are sharing an open file (as is
the case after a fork(2)), they also share the same file descriptor
and entry in the system file table.
Unrelated processes that access the same file have separate
file descriptors and separate entries in the system file table. Because
they executed separate open(2) calls, they may read from or
write to different places in the file.
In both cases, the entry in the inode table is shared;
the correct offset at which the read or write operations should take
place is tracked by the offset entry in the system file table.
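The difference between a shared and a separate system file table entry
can be observed from user space. The sketch below uses Python's os
module on a POSIX system and uses dup(2) in place of fork(2), since
both leave two descriptors referring to one system file table entry;
the file name and function name are arbitrary.

```python
import os
import tempfile

def demo_shared_offsets():
    """Read through a duplicated descriptor and an independently opened one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.dat")
        with open(path, "wb") as f:
            f.write(b"0123456789")

        fd_a = os.open(path, os.O_RDONLY)
        fd_dup = os.dup(fd_a)              # shares fd_a's file table entry
        fd_b = os.open(path, os.O_RDONLY)  # separate entry, separate offset
        try:
            first = os.read(fd_a, 4)       # advances the shared offset to 4
            via_dup = os.read(fd_dup, 3)   # continues from offset 4
            independent = os.read(fd_b, 2) # starts from offset 0
        finally:
            for fd in (fd_a, fd_dup, fd_b):
                os.close(fd)
        return first, via_dup, independent
```

The duplicated descriptor continues where the first read stopped,
while the descriptor from the second open starts at the beginning of
the file.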
Path Name Conversion
The directory search and path name conversion take place
only when the file is opened. For subsequent access of the file, the
system supplies a file descriptor which is an index into the file
descriptor table in your user process area. The file descriptor table
points to the system file table entry where the pointer to the system
inode table is picked up. Given the inode, the system can find the
data blocks that make up the file.
Synchronizing Disk Files and the Buffer Cache
The file subsystem must handle several simultaneous
processes that access different files. To do this, the system keeps
a cache of free blocks and inodes in memory along with the super block.
When you write a file, you actually write to these blocks. Synchronization
is the process by which the contents of these blocks are written to
the actual device.
The operating system flushes the contents of these disk
buffers (along with super blocks and updated inodes) to the disk devices
periodically. Every five minutes (unless a high-priority realtime
process prevents it), the init(1M) process flushes all buffers
to disk. The bdflushr daemon runs more frequently to flush active
buffers. Tunable parameters determine how often the bdflushr
daemon runs, the priority at which it runs, and the number of seconds
after the last write access before bdflushr writes a buffer out.
Under normal circumstances, this scheme is adequate to keep the disk
files synchronized. If the system crashes before the buffers are written
to disk, data loss may occur. During testing or when AC power problems
increase the odds for a system crash, we recommend that you modify
the tunable parameters that control the bdflushr daemon to
reduce the interval between flush operations.
The REAL/IX Operating System also allows you to mount
some file systems as "synchronous file systems." For synchronous
file systems, each write operation writes to the cache, then immediately
writes the data blocks and the updated inode to the device. You should
use synchronous file systems only for applications in which the immediate
updating of the file is critical, such as a process control data acquisition
system that gathers statistics and has some realtime processes that
use that data to run/change the actions of processes. While an individual
write operation to a synchronous file system reaches the disk sooner
than a write operation to a non-synchronous file system, the use of synchronous
file systems can degrade overall system performance.
The fsync(2) system call allows you to immediately
flush the buffers associated with a particular file after a write(2)
operation. fsync should be used by programs requiring that
a file be in a known, up-to-the-minute state; for example, a transaction
log for a database. fsync allows you to do critical updates
without suffering the performance degradation that is sometimes associated
with file systems mounted synchronously.
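As an illustration of the transaction log case, the sketch below
appends a record and calls fsync before returning. It uses Python's os
module, whose os.fsync wraps the fsync system call on POSIX systems;
the function names and the log file name are hypothetical.

```python
import os
import tempfile

def append_durably(path: str, record: bytes) -> int:
    """Append a record and flush the file's buffers to disk before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = os.write(fd, record)
        os.fsync(fd)   # returns after the file's buffers have been flushed
    finally:
        os.close(fd)
    return written

def demo():
    """Write two log records to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "txn.log")
        append_durably(log, b"begin\n")
        append_durably(log, b"commit\n")
        with open(log, "rb") as f:
            return f.read()
```

Only the files that need this guarantee pay the cost of the flush; the
rest of the file system continues to use the buffer cache normally.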