


 

File Subsystem

The file subsystem controls access to data files as well as devices on the REAL/IX Operating System. It provides a hierarchical organization for files and directories, buffering of data (and the ability to bypass buffering for critical realtime I/O operations), and control of file access.

See File Access Utilities in Chapter 8 for a list of commands used to access and manipulate files and directories.

 

File types

A file is the basic unit of the file subsystem. Files can represent character or binary information, but all files are stored identically: as a sequence of bytes. There are four basic file types.

  1. Regular - one-dimensional array of bytes used to store data, text, or programs.
  2. Pipe - similar to regular files in how they represent unformatted information, but unique in that reads from them are destructive. Pipes are used to communicate between processes.
  3. Special - used to interface between the process and device. These files reside in the /dev directory.
  4. Directory - used to organize regular, pipe, special, and other directory files. They hold file names and references to information about the files. Only the operating system can write to them. Users can modify directories by making a request to the operating system.

REAL/IX file names consist of one to fourteen ASCII characters, and should use only printable characters that are not special to the shell.

The REAL/IX file subsystem uses three entities to maintain user and system files and directories. These are:

  1. Directory Tree: all accessible directories and file systems, used when giving path names.
  2. File Systems: independent collections of files and directories, each of which is located on the same partition (or contiguous partitions) of a disk device.
  3. Directories: reside within a file system and are used to further organize user and system files. They contain the names of the files and references to the remainder of the information about the file. This reference is called the inumber and is used to locate the inode for the file being accessed.

 

File Listings

The ls(1) command is used to list the files in a directory. A number of options are available to provide additional information and specify the format of the output; some of these are explained later in this chapter, and all are listed on the ls manual page. For now, it is useful to understand the basic information given for the ls -l (long) command:


Table 1 - File Listings

 

This listing provides the following information:

  • mode - the first character identifies the type of file: - is a regular file; d is a directory; e is an extent-based file. Special device files use c for a character-access file and b for a block-access file. Special device files are discussed more in Chapter 6. The remaining nine characters define the file access permissions as discussed in the following section.
  • links - the number of names linked to this file.
  • owner and group - identify who can access the file as owner and who may access the file as group. Use the chgrp(1) command to change the group, and the chown(1) command to change the owner of a file.
  • size - logical size of the file in bytes. For special device files, this field contains the major and minor device numbers.
  • date and time - date and time of last file modification
  • file name - the common name used to access this file
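As a rough illustration of how the mode column is put together, the sketch below uses Python's standard stat.filemode function to turn a numeric mode into the ten-character string and split it into the type character and the nine permission characters. Python is used here only for illustration and is not part of REAL/IX; the function name describe_mode is hypothetical, and the e type for extent-based files is specific to REAL/IX, so this function never produces it.

```python
import stat

def describe_mode(mode):
    """Split a numeric file mode into (type character, nine permission characters)."""
    text = stat.filemode(mode)      # for example '-rw-r--r--'
    return text[0], text[1:]
```

For example, describe_mode(0o100644) yields a regular file with rw-r--r-- permissions, and a mode built from stat.S_IFDIR yields the d type character.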

 

File Permissions

File permissions determine who may read, write, and execute a file. Each file has an owner (changed with the chown(1) command) and a group (changed with the chgrp(1) command). The mode of a file is an octal mask that determines the access privileges for the owner (u), group (g), and world or other (o). It is displayed in the last nine characters of the leftmost column of the output from the ls -l command.

The format of the output is three sets of three characters each, defining read, write, and execute permissions for the owner, group, and other, as illustrated below:


Table 2 - File Permissions

 

For permissions that are not granted to that class of user, a dash (-) replaces the letter; for instance, r-- permissions indicate read privileges but no write or execute permissions. Initial file permissions are determined by the umask value set in the /etc/profile or $HOME/.profile file. The file permissions are changed with the chmod(1) command. chmod can use either the single letter representation or a three digit number, where each digit corresponds to the owner, group, and other, respectively. The value of each digit is determined by summing the numbers corresponding to the permissions:

  • 4 -read permissions (r)
  • 2 -write permissions (w)
  • 1 -execute permissions (x)

So, for example, 777 represents the permissions shown above (read, write, and execute permissions for the owner, group, and other). 640 gives the owner read and write permissions, the group read permissions, and no access permissions to other users. chmod can also assign permissions in a symbolic format such as chmod g+rw (which adds read and write permissions for the group) or chmod o-w (which removes write permissions for other). Note that any executable file must have execute permissions; directories must have execute permissions for use in a path name.
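The digit arithmetic can be checked with a short sketch. The Python below is illustrative only (the function names are hypothetical): it sums 4, 2, and 1 for each class of user, and shows the usual way a umask clears bits from a requested mode.

```python
def digit_for(perms):
    """Sum 4 (r), 2 (w), and 1 (x) for one three-character class such as 'rw-'."""
    values = {"r": 4, "w": 2, "x": 1}
    return sum(values[c] for c in perms if c != "-")

def symbolic_to_octal(perm_string):
    """Convert nine characters such as 'rw-r-----' to a string such as '640'."""
    if len(perm_string) != 9:
        raise ValueError("expected nine permission characters")
    classes = (perm_string[0:3], perm_string[3:6], perm_string[6:9])
    return "".join(str(digit_for(c)) for c in classes)

def initial_mode(requested, umask):
    """Clear the bits named in umask from the requested mode."""
    return requested & ~umask
```

With these definitions, rwxrwxrwx maps to 777 and rw-r----- maps to 640. Assuming a program requests mode 666 for a new file, a umask of 022 leaves 644.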

The user name shown in the ls output corresponds to a user ID assigned in the /etc/passwd file; the group name corresponds to a group ID assigned in the /etc/passwd file and populated in the /etc/group file. The user and group shown in the ls output correspond to the "real" user identification number (UID) and group identification number (GID).

Normally, a process runs with the permissions of the user who executes it. A program can use the setuid(2) and setgid(2) system calls to set the "effective" UID and GID: when the setuid bit is on, the process runs with the permissions of the real owner, and when the setgid bit is on, the process runs with the permissions of the real group. This is used, for instance, so that a number of users may update a file but only through a specific command (not through an editor or other utility). The program can give execute permissions to anyone, then internally reset the effective UID or GID so it can access a file that has restricted permissions.

 

User View of a File System

A file system is a combination of directories and files descending from a common directory. Figure 5-1 shows the relationship between directories and files in a REAL/IX file system. The directories are represented by circles; files are represented with lines beneath the directories.


Figure 1 - A REAL/IX File System

 

Path Names

The starting point of any UNIX file system is a directory that serves as its root. One file system is referred to by that name, root, and is the topmost directory on the system. The root directory of the root file system is represented by a single slash (/). The file system diagrammed in Figure 5-1 is a root file system, with subdirectories /bin, /etc, and /usr.

A full path name for a file gives the location of a file in relationship to root, for instance, /bin/cat. A relative path name for a file gives the location of a file in relationship to one's present working directory. So, if one's present working directory is root, one could refer to bin/cat.

Relative path names use . to refer to the present working directory and .. to refer to the directory level directly above the present working directory. So, if one's present working directory is /etc, one could use the relative path name ../bin/cat to refer to this file. Typically, the .. directory is referred to as the "parent directory".
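The examples above can be reproduced with a small sketch. It uses Python's posixpath module purely for illustration; the function name resolve is hypothetical. Note that the resolution is lexical only: it works on the path strings and does not consult the file system.

```python
import posixpath

def resolve(working_directory, path_name):
    """Return the full path name that path_name refers to."""
    if posixpath.isabs(path_name):
        # Full path names are already relative to root.
        return posixpath.normpath(path_name)
    # Relative path names start from the present working directory.
    return posixpath.normpath(posixpath.join(working_directory, path_name))
```

Here resolve("/", "bin/cat") and resolve("/etc", "../bin/cat") both give /bin/cat.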

 

Mounting Another File System

You may wish to mount other file systems under an existing file system. A prime example is the usr file system mounted under the root file system. Figure 5-2 shows such a file system mounted as /usr.


Figure 2 - Adding Another File System to root

 

A directory such as /usr is often referred to as a "leaf" or "mount point", because it forms the connection between the root file system and another mountable file system. You may also choose to refer to it as a child of /, or the parent of the /usr file system. For a complete list of all file systems mounted on your machine, run the /etc/mount command.

 

Internal Representation

Every file in a file system has a unique number, called the inumber, associated with it. An inumber is used as an index into the ilist, which is a collection of information nodes (inodes) for that file system. This is described more fully later in this chapter. Here we are concerned with the inumbers of two file names that exist in every directory and hold the file system together: . (dot) for the directory itself and .. (dot dot), that points to the parent directory. The directory entry for . contains the inumber of the directory itself, and the entry for .. contains the inumber of the parent directory, which is the same number as that given to the . file in that parent directory. This interrelationship between the . and .. files gives the file system structure its cohesion.

The file system structure illustrated in Figure 5-2 shows the inumbers of . and .. for the various directories. The inumbers of the . and .. files in all mount-point directories are 2. Notice that the .. inumber of all files descending directly from root (/) or /usr is 2. Generally, the inumber of the .. file in /etc/src is the same as the inumber for the . file in /etc. The only time this is not true is when a directory is the mount point of a file system, in which case the inumbers of both . and .. are 2.

To look at the inumbers of all files in a directory, use the ls -ai command.
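The relationship between . and .. can also be checked from a program. The sketch below is illustrative Python, not a REAL/IX interface; it reads the inumbers through os.stat, and the function name dot_inumbers is hypothetical.

```python
import os

def dot_inumbers(directory):
    """Return the inumbers of the . and .. entries of directory."""
    dot = os.stat(os.path.join(directory, ".")).st_ino
    dotdot = os.stat(os.path.join(directory, "..")).st_ino
    return dot, dotdot
```

For a directory that is not a mount point, the second value returned for the directory should equal the first value returned for its parent.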

 

Accessing Files in Programs

User-level programs access files through system calls (accessible through library routines). These calls go through the file system and (if so programmed) the system buffer cache. Below the buffer cache, the I/O subsystem (described in Chapter 6) handles the interaction with the disk device where the file data is stored. Note that the REAL/IX Operating System supports two file system architectures which are accessed through the file system switch; these two architectures are discussed later in this chapter. Figure 5-3 illustrates the flow of data and the flow of control for file access.


Figure 3 - REAL/IX File Subsystem

 

Note that the REAL/IX Operating System supports two file system architectures, called S5 and F5. The S5 architecture is that of UNIX System V; the F5 architecture provides faster file access for realtime and time-share processes. The file system architectures are discussed in more detail later in this chapter.

 

File Descriptors

When a process first accesses a file with the open(2) or creat(2) system call, it is assigned a file descriptor. All subsequent file I/O performed by the process uses this descriptor, which serves as a sort of handle on the file. Each executing process has a descriptor table that contains an entry for each open file descriptor. Entries in the descriptor table are indexed from 0 to (n - 1), where n is the value of the NOFILES tunable parameter that defines the maximum number of file descriptors an executing process can have at one time; the default value is 80.

I/O system calls other than open and creat use this index value to specify the target descriptor. Because the descriptor table is part of a process's environment, a child process inherits access to all of its parent's descriptors. However, because the child's environment is a copy of the parent's, the child can alter its descriptor table without affecting that of its parent.

New descriptors are assigned sequentially as is appropriate for the type of file. For regular files, each open/create operation creates one new file descriptor, but the pipe(2) system call that creates a pipe creates two descriptors (one for the read end and one for the write end of the pipe). In all cases, the system call used returns the index value(s) of the newly created descriptor, which is used in subsequent I/O calls.
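A minimal sketch of the pipe case, written in Python for illustration only (os.pipe, os.write, and os.read wrap the corresponding system calls; the function name pipe_demo is hypothetical). It shows the two descriptors returned by one call and the destructive nature of reads from a pipe.

```python
import os

def pipe_demo():
    """Create a pipe, write three bytes, and read them back in two steps."""
    read_fd, write_fd = os.pipe()       # one call, two new descriptors
    os.write(write_fd, b"abc")
    first = os.read(read_fd, 2)         # consumes the first two bytes
    second = os.read(read_fd, 2)        # only the last byte is left to read
    os.close(read_fd)
    os.close(write_fd)
    return read_fd, write_fd, first, second
```

The second read returns only the byte that the first read left behind, because data read from a pipe is removed from it.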

The close(2) system call notifies the system that the process is finished with I/O operations on this file descriptor and frees the applicable slot in the descriptor table for reuse. A process may also manipulate table entries with the dup(2) system call, which makes a duplicate of a descriptor in the first available slot. You may then close the original descriptor and open a new descriptor in its place. To restore the initial descriptor, it is duped from its saved location, after closing the new descriptor. A typical use for dup is in assigning the standard input to file descriptor 0, standard output to file descriptor 1, and standard error to file descriptor 2.
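The save, replace, and restore sequence might look like the following sketch. It is illustrative Python, not REAL/IX code, and the function name with_fd_redirected is hypothetical. One deliberate substitution: instead of closing the original descriptor and relying on open to reuse the freed slot, the sketch uses os.dup2 to place the new file at a specific index, so the result does not depend on which slots happen to be free.

```python
import os

def with_fd_redirected(fd, path, action):
    """Point fd at the file named by path while action() runs, then restore it."""
    saved = os.dup(fd)                  # duplicate lands in the first free slot
    new_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.dup2(new_fd, fd)             # fd now refers to the new file
        os.close(new_fd)
        action()
    finally:
        os.dup2(saved, fd)              # restore the original from its saved slot
        os.close(saved)
```

A caller could pass descriptor 1 to capture standard output in a file for the duration of action and have it restored afterward.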

 

Standard I/O and Redirection

Standard I/O defines file descriptors 0, 1, and 2 as referring to standard input (stdin), standard output (stdout), and standard error (stderr), respectively. Except when there is good reason not to, programs read their input from the standard input, write their output to the standard output, and write diagnostic information to standard error. A process inherits descriptors from its parent, so the standard I/O files are already set up; the new process merely needs to establish descriptors for the auxiliary files and devices it needs. Using the standard I/O descriptors gives a program a large measure of universality automatically. Files, devices, and pipes all work essentially the same; therefore, a program will work with the standard I/O descriptors hooked up to any of them. As an example, a program that counts words in a text stream and writes the total to the standard output works equally well whether the input is taken from a terminal, from a file, or from another process. Likewise, you may redirect the output to a terminal, file, or another process.

Unless otherwise defined, the standard input is associated with the keyboard of the user who invokes the process; the standard output and standard error are associated with the user's display. Usually the standard error retains its association with the display, but often a new child process redirects its standard input and/or output before execing a program. This way, the execed program does not have to concern itself as to the source of its input or the destination of its output. The most common example of this is the shell, which provides users with a very convenient notation for redirecting the standard input, output, and error of a program before it is executed. A detailed discussion of redirection is provided in Chapter 7.
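The following sketch shows a parent choosing the standard input of a child before the child's program runs. It is illustrative Python, assuming a POSIX-like system; the word counter and the function name count_words are hypothetical stand-ins for the word-counting program mentioned earlier. The child program only reads standard input and writes standard output, and does not know whether the input is a file or a pipe.

```python
import subprocess
import sys

# Hypothetical word counter: reads standard input, writes the count to standard output.
WORD_COUNTER = "import sys; print(len(sys.stdin.read().split()))"

def count_words(stdin_fd):
    """Run the word counter in a child whose standard input is stdin_fd."""
    result = subprocess.run(
        [sys.executable, "-c", WORD_COUNTER],
        stdin=stdin_fd,                 # becomes descriptor 0 in the child
        stdout=subprocess.PIPE,         # parent collects the child's descriptor 1
        check=True,
    )
    return int(result.stdout)
```

The same call works whether stdin_fd refers to an open regular file or to the read end of a pipe.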

 

Asynchronous File I/O

Most UNIX kernels support only synchronous I/O operations, meaning that any I/O operation issued by a process causes that process to block until the I/O operation is complete. A realtime application needs the capability of overlapping I/O operations with process execution. The REAL/IX Operating System supports asynchronous I/O operations for files and devices, enabling the process that initiated the I/O operation to continue the process execution stream once the I/O operation is queued to the device. When the I/O operation completes (either successfully or unsuccessfully), the initiating process learns of it either through the common event notification mechanism or by polling a control block; the polling option saves the overhead of a system call.

The ability to overlap application processing and I/O operations initiated by the application program and to allow one process to simultaneously perform several separate I/O operations is required by a number of realtime applications. For instance, journalizing functions may require the ability to queue logging records for output without blocking the initiating process. Data acquisition processes may have two or more channels delivering intermittent data that require sampling within a certain time. The process issues one asynchronous read on each channel. When one of the channels needs data collection, the process reads the data and posts it to secondary memory with an asynchronous write; the process may defer actual processing of the data.

The REAL/IX Operating System provides facilities for asynchronous read and write operations, and the ability to cancel an asynchronous I/O request. There are also optional initialization services that speed I/O throughput by preallocating and initializing various data structures.

The implementation of asynchronous I/O provides the following capabilities:

  • You may issue asynchronous I/O requests for both regular files and I/O devices.
  • Simultaneous queueing of multiple asynchronous read and write operations to one file descriptor.
  • One process can queue asynchronous read and write operations to several open file descriptors.
  • Asynchronous I/O operations to the extended portion of extent-based files can bypass the buffer cache, which further improves I/O throughput. Unbuffered I/O functionality is implemented in the inode associated with a file descriptor, using fcntl(2) requests. You may emulate unbuffered I/O when required. Note that asynchronous I/O without emulation to regular files is supported only for extent based files on F5 file systems.
  • Cancellation of pending asynchronous I/O requests.
  • Notification of asynchronous I/O completion is optional. If used, notification is obtained through either polling or the common event notification method.
  • You may use asynchronous I/O operations with both sequential and random access devices.
  • One driver and its associated devices can support both synchronous and asynchronous read and write operations.

Each asynchronous I/O operation establishes an asynchronous I/O control block (aiocb(4)) structure, which contains information to control the I/O operations, such as the number of bytes to transfer and whether to post an event to the sending process when the I/O operation completes. When the I/O operation completes, the aiocb structure is updated to indicate either that the operation was successful or, if it failed, the error code.

Asynchronous I/O operations to character devices are implemented using the aio(D3X) entry point to cdevsw(D4X). The operating system sets up an areq(D4X) kernel data structure, populated with appropriate information from aiocb(4) and the requesting process. This structure controls the data transfer, and is updated when the I/O transfer is completed. Neither the user-level process nor the driver blocks at any time, and since each I/O request generates a separate aiocb-areq pair of structures, you may initiate additional asynchronous I/O requests before the previous operation completes.
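The queue, continue, and poll pattern can be imitated in a few lines, although not with the REAL/IX services themselves. The Python sketch below is only an analogy: it hands each write to a worker thread rather than to the kernel, and the returned handle stands in for the control block that the caller polls. The class name AsyncWriter and its methods are hypothetical and do not correspond to any REAL/IX interface.

```python
import os
from concurrent.futures import ThreadPoolExecutor

class AsyncWriter:
    """Queue writes to a single worker thread so the caller does not block."""

    def __init__(self):
        # One worker keeps queued writes in the order they were submitted.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def queue_write(self, fd, data):
        """Queue one write and return a handle that can be polled with done()."""
        return self._pool.submit(os.write, fd, data)

    def close(self):
        """Wait for all queued writes to finish and release the worker."""
        self._pool.shutdown(wait=True)
```

A caller can queue several writes back to back, continue with other work, and later call done() on each handle to poll, or result() to collect the byte count or the error.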

 

Buffer Cache

Unless otherwise specified, file I/O operations use the system buffer cache as an intermediate storage area between user address space and the device itself. For instance, when writing a file, the data is actually written to the buffer cache; the operating system periodically flushes the contents of the buffer cache to the disk. Each buffer has a buffer header associated with it that holds the control information about the buffer such as what block and file system this data came from. This buffering scheme is defined in the buf.h header file. The number and size of buffers and the number of buffer headers and hash slots for the buffer cache are controlled by a series of tunable kernel parameters.

From the view of the driver that transfers data between the system buffer cache and the device, I/O operations that use the buffer cache are called block I/O, because data is transferred one block at a time, where a block can range from 512 bytes to 131072 bytes, depending on the logical block size of the file system. The system buffering scheme allows drivers to transfer linked lists of data, although actual physical I/O is done one block at a time. Without this facility, an I/O operation would have to return after each buffer was transferred. For instance, when writing a 6 block file, the system would write one block, return, write 1 more block, return, and so forth. By using a linked list, the system looks for the next buffer when it finishes transferring the first block of data, and only returns when the entire 6 blocks are transferred. The system still performs six distinct operations, but it avoids the overhead of returning after each operation. The driver interface for block I/O is discussed more in Chapter 6.

Use of the buffer cache has several advantages:

  • Data caching - The data remains in main memory as long as possible. This allows a user process to access the same data several times without performing physical I/O for each request. The user process is blocked for a very short time while waiting for the I/O, since it does not have to wait for the physical I/O operation.
  • Paging enabled - If no buffering of data were done, a user process undergoing I/O would lock in main memory until the device transferred data into or out of the user data space. Because there is a system buffer between the user data space and the device, the process is paged out until the transfer between the device and the buffer is completed, then paged back in to transfer data between the buffer and user data space.
  • Consistency - The operating system uses the same buffer cache as user processes when doing I/O with a file system, so there is only one view of what a file contains. This allows any process to access a file without worrying about timing. The buffer algorithms also help ensure file system integrity, because they maintain a common, single image of disk blocks contained in the cache. If two processes simultaneously attempt to manipulate one disk block, the buffer algorithms serialize their access, preventing data corruption.
  • Portability - Since I/O operations are done through the buffer cache, user-level programs do not have to worry about data alignment restrictions the hardware may impose. The kernel aligns data internally, so you can port programs to new kernels without rewriting them to conform to new data alignment properties.
  • System Performance - Use of the buffer cache can reduce the amount of disk traffic, thus increasing overall system throughput and decreasing average response time. Especially when transmitting small amounts of data, having the kernel buffer the data until it is economical to transmit it to or from the disk improves general system performance.

 

Bypassing the Buffer Cache

Use of the buffer cache has significant advantages for general applications, but it also has disadvantages for some realtime applications. Consequently, the REAL/IX Operating System provides the capability of bypassing the buffer cache by setting a file control (fcntl(2)) function on the file descriptor. I/O operations that bypass the buffer cache transfer information directly between user address space and the device.

Bypassing the buffer cache may shorten the time taken by individual I/O operations. Remember, however, that this gain for the individual operation comes at the expense of slowing other I/O operations running on the system.

 

Sharing Files

Two or more processes may share the same open file in one of the following ways:

  • A child process inherits its parent's open files. The child and parent processes also share I/O pointers, so a read(2) by one process changes the position in the file for both processes. Normally the child takes over the files while it runs, and the parent does nothing with the files until after the child terminates.
  • Two or more unrelated processes may open the same file at the same time, as long as they all have adequate permissions. In this case, each process has a separate I/O pointer, but a change to the contents of the file is seen by all processes that have the file open.
  • You may link two or more file names to the same file, as long as both file names are in the same file system. The two linked files have the same inode, so any updates made to one file affect the contents seen by the other file.
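Two of the cases above can be observed directly. The sketch below is illustrative Python with hypothetical function names: open_twice opens one file through two separate open calls and shows that each descriptor keeps its own I/O pointer, and link_shares_inode links a second name to a file and checks that both names refer to the same inode.

```python
import os

def open_twice(path):
    """Open path twice and return the two I/O pointer positions after one read."""
    first = os.open(path, os.O_RDONLY)
    second = os.open(path, os.O_RDONLY)
    os.read(first, 4)                       # advances only the first pointer
    positions = (os.lseek(first, 0, os.SEEK_CUR),
                 os.lseek(second, 0, os.SEEK_CUR))
    os.close(first)
    os.close(second)
    return positions

def link_shares_inode(path, new_name):
    """Link new_name to path; return (same inode?, link count)."""
    os.link(path, new_name)                 # both names must be in one file system
    a, b = os.stat(path), os.stat(new_name)
    return a.st_ino == b.st_ino, a.st_nlink
```

After reading four bytes through the first descriptor, only its position moves; after the link, the link count of the file rises to 2.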

If two or more processes update the same file at the same time, you may lose or corrupt data. For this reason, the REAL/IX Operating System allows you to set read and write locks on a file or record with fcntl(2) functions. A read lock prevents any process from write locking the protected area; a write lock prevents any process from read or write locking the protected area. You may implement these locks so that the process either fails when it attempts to set a lock against an already-locked area, or blocks until the existing lock is released.
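A minimal sketch of the non-blocking form of a write lock, using Python's fcntl module on a POSIX-like system. This is an illustration, not REAL/IX code, and the function names are hypothetical. Because these locks are held per process, the sketch forks a child to show a second process being refused.

```python
import fcntl
import os

def try_write_lock(fd):
    """Try to write-lock the whole file without blocking; report success."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False                        # another process holds a lock

def child_can_lock(path):
    """Fork a child that opens path and tries to write-lock it."""
    pid = os.fork()
    if pid == 0:
        code = 2                            # 2 means the child failed unexpectedly
        try:
            fd = os.open(path, os.O_RDWR)
            code = 0 if try_write_lock(fd) else 1
        finally:
            os._exit(code)                  # the child's lock ends with the child
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

Dropping LOCK_NB from the flags gives the blocking form, in which the caller waits until the existing lock is released.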

 

File System Structure

The REAL/IX Operating System supports two file system architectures, referred to as S5 and F5. The S5 architecture is the same as the file system structure in UNIX System V. The F5 file system architecture provides faster file access for both realtime and time-sharing applications. Both file system architectures are accessed through the same system calls and support the full range of logical block sizes. The structure of the two file systems is similar, as illustrated in Figure 5-4.


Figure 4 - Internal Structure of a File System

 

The major sections shown above are:

  • Superblock - Contains global information about the file system, such as the file system architecture being used, the size of the ilist, how many inodes are used, the logical block size, the total number of blocks in the file system, and the number of blocks used. It also contains some status information about the file system, as explained below.
  • Ilist - A numbered collection of inodes for all files in the file system. An inode holding control and status information is associated with each file in the file system.
  • Bitmap (F5) - A map of all data storage blocks allocated to the file system. The operating system can scan this table to determine which blocks are free and which are not.
  • Data Storage Blocks - Storage for the data in the files and directories.

 

Inodes

To the operating system, a file system is an arrangement of logical blocks of disk space. The size of a logical block is determined for each file system at the time it is created, and can range from 512 to 131072 bytes. It is held together by a system of inodes, meaning information nodes. Each regular file, pipe, special file, and directory has an inode associated with it; it contains all of the information about the file except its name (kept in a directory) and its contents (kept in the data storage blocks for the file system). Note that the inodes for special files do not point to data storage blocks, but store the major and minor number of the device instead. Major and minor device numbers are discussed in greater detail in Chapter 6.

All the inodes for a file system are stored contiguously, starting at logical block number 2 of the file system, in a list called the ilist. The number of inodes each block in the ilist can hold is determined by the file system architecture being used (S5 inodes are 64 bytes each and F5 inodes are 128 bytes each) and the logical block size of the file system. For instance, each block in an S5 file system that uses 512-byte logical blocks can hold 8 inodes (which are 64 bytes each), and each block in an F5 file system that uses 4096-byte logical blocks can hold 32 inodes (which are 128 bytes each). The number of inodes specified at the time the file system is created determines the number of blocks dedicated to the inode list.

The inumber gives the starting point for the inode as an offset into the list. It is not kept as part of the inode for a file, but is calculated when the inode is accessed. To access the inumber, use the stat(2) or estat(2) system call or the ls -i command.
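The arithmetic in the two paragraphs above can be sketched as follows. The inode sizes and the starting block of the ilist come from the text; the function names are hypothetical, and inode_location additionally assumes that inumbers are counted from 1 and that inodes are packed into the ilist with no gaps, which the text does not state.

```python
INODE_SIZE = {"S5": 64, "F5": 128}      # bytes per inode, from the text
ILIST_START_BLOCK = 2                   # the ilist starts at logical block 2

def inodes_per_block(architecture, block_size):
    """Number of inodes that fit in one logical block."""
    return block_size // INODE_SIZE[architecture]

def ilist_blocks(architecture, block_size, inode_count):
    """Number of logical blocks needed to hold inode_count inodes."""
    per_block = inodes_per_block(architecture, block_size)
    return -(-inode_count // per_block)     # ceiling division

def inode_location(architecture, block_size, inumber):
    """Return (logical block, byte offset) of an inode.

    Assumption: inumbers start at 1 and inodes are packed without gaps.
    """
    per_block = inodes_per_block(architecture, block_size)
    index = inumber - 1
    block = ILIST_START_BLOCK + index // per_block
    offset = (index % per_block) * INODE_SIZE[architecture]
    return block, offset
```

This reproduces the figures in the text: 8 inodes per 512-byte S5 block and 32 inodes per 4096-byte F5 block.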

An inode for a file contains:

  • Mode - the file type (regular, directory, special, or pipe), the execution bits (sticky bit, set user and group IDs), and access permissions for the owner, group, and others. If this field is null, the inode is not in use.
  • Number of Links - number of times this file is referenced in a directory, or the number of links to the file. Unnamed pipes have a zero value in this field.
  • Owner and Group IDs - indicate the owner and group identification for the file and are used when checking access permissions for the file. Both of these are numeric values; the translation to character strings is found in the /etc/passwd and /etc/group files.
  • Size - the number of bytes in the file. If the current offset in the file is the same as the size of the file, the user is at the end of the file. For a pipe file, this field shows the number of bytes written but not yet read. For a special device file, this field shows the major and minor numbers of the device.
  • Access Times - three times are kept: the last time the file was modified, the last time the file was read, and the last time the inode was altered. A simple read of a file does not affect the modification time but does affect the access time.
  • Data Block Addresses - an array of 13 data block numbers used to access the (non-contiguous) data blocks for the file.
  • Extent List (F5) - a list of up to 4 extents of contiguous data blocks allocated for the file.
  • Last Written (F5) - the last block written to the file.
  • Flags (F5) - determine how space is automatically allocated for writes beyond the extent-based portion of the file and whether space should shrink on creat(2)/trunc(2) operations.
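Most of the common inode fields listed above are visible to a program through stat(2). The sketch below is illustrative Python using os.stat; the function name inode_summary and the dictionary keys are hypothetical, and the F5-specific fields (extent list, last written block, flags) are not available through this interface.

```python
import os
import stat

def inode_summary(path):
    """Collect the inode fields that os.stat exposes for path."""
    info = os.stat(path)
    return {
        "inumber": info.st_ino,
        "type_and_permissions": stat.filemode(info.st_mode),
        "links": info.st_nlink,
        "owner_id": info.st_uid,            # numeric; names come from /etc/passwd
        "group_id": info.st_gid,            # numeric; names come from /etc/group
        "size": info.st_size,
        "modified": info.st_mtime,          # last change to the file contents
        "accessed": info.st_atime,          # last read of the file
        "inode_changed": info.st_ctime,     # last change to the inode itself
    }
```

Note that the file name does not appear in the result, since the name is kept in a directory rather than in the inode.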

 

Superblock

The first 512 bytes of the file system (the first physical block) are not used in REAL/IX file systems. This is a vestige from older file system architectures that needed this area for bootstrap information.

The superblock occupies the second physical block of the file system. Note that, for file systems using logical blocks larger than the 512-byte physical block, this leaves the rest of the first logical block unused before the ilist begins at logical block 2. The superblock is read into memory when the file system is mounted, and is updated on the disk periodically by the bdflush daemon. For synchronous file systems, the super block is updated every time a write operation is initiated.

The super block contains the following status information. To display any of the size or count information on a particular file system, use the df -t command.

Sizes

  • size of the inode list: the total number of inodes available in the file system; it must be less than 65500.
  • size of the file system: the total number of logical blocks in the file system, including the super block, inode list, and data blocks. You may derive the size of the data block collection from this number.

Counts of Available File System Space

  • free inodes: the total number of inodes available for allocation.
  • free blocks: the total number of storage blocks available for allocation.

Memory-Resident Flags

  • modification ("dirty") flag: This flag is set when the super block in main memory is modified. When sync(2) runs, it checks for this flag and, if set, updates the super block on disk and clears the flag. If this flag is not set, sync does not update the file system or the super block.
  • read-only flag: This flag is set when the file system is mounted as a read-only file system; if set, any attempt to write to a file in this file system will fail. This flag is frequently set on file systems that contain only source code, or when debugging device drivers to ensure that no writes are accidentally done to files in the file system. This flag is also set when backing up a file system while the system is in multi-user state.

File System Description Information

  • magic number: indicates the file system architecture used. This is S5FsMAGIC for S5 file systems and F5FsMAGIC for F5 file systems.
  • logical block size (type): indicates the logical block size of the file system. 1 indicates a 512-byte logical block, 2 indicates a 1024-byte logical block, and so forth up to 9, which indicates a 131072-byte (128K) logical block.
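
The type codes above appear to follow a doubling pattern, which can be sketched as a shift. The function name is hypothetical, and the formula is inferred from the three values given in the text (1, 2, and 9); it is not taken from the REAL/IX sources.

```python
def logical_block_size(fs_type: int) -> int:
    """Return the logical block size in bytes for a super block type code."""
    if not 1 <= fs_type <= 9:
        raise ValueError("type code must be between 1 and 9")
    # Each step up in the type code doubles the size, starting at 512 bytes.
    return 512 << (fs_type - 1)
```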

The super block also contains two array fields used for allocating inodes and storage blocks, each of which has a corresponding lock field used to lock the array while it is manipulated. This is discussed under "Accessing Free Inodes and Data Blocks" below.

 

Accessing Data Storage Blocks

UNIX operating systems do not require creating files with an initial allocation; in fact, on most UNIX systems, this is not possible. Rather, a file is created when opened for writing for the first time, and space for the data is allocated as writes are done to the file, one block at a time. As a file expands, free blocks are allocated for the file and associated with the file's inode, and as the file shrinks, data blocks that are no longer needed are deallocated from the file and made available to other files in the file system. This conserves space in the file system, since the maximum waste for each file is less than the size of one logical block.

The drawback of the standard scheme is that executing processes must absorb the overhead of allocating and deallocating data storage blocks for their files, and that the blocks associated with a file are (most likely) not contiguous, which also increases file access time. This overhead is not excessive for most standard applications, but you may consider it unacceptable for critical realtime processes. For this reason, the F5 file system architecture allows you to allocate up to four extents for each file. An extent is a chunk of contiguous data blocks that are preallocated when the file is created.

The inode of each file has pointers to the data storage blocks associated with that file. Inodes in both the S5 and F5 file system architectures include 13 disk block addresses that point to the non-contiguous data blocks allocated for that file. In addition, the F5 file system has four extent addresses that point to the extents allocated to the file. A file in an F5 file system may use only the data block addresses, only the extents, or a combination of the two to access its data blocks. The data stored in a storage block that is part of an extent is referred to as the extent-based part of the file; certain operations are supported only on extent-based parts of files.

 

Data Block Address Array

The inode's data block address array points to the data storage blocks associated with all S5 files and the non-extent based files in F5 file systems. It contains 13 addresses, referred to here as blocks 0 through 12:

  • blocks 0 - 9 - point to the first 10 data storage blocks associated with the file.
  • block 10 - for files that are larger than 10 blocks, points to an indirect block, which in turn points to the block addresses of another set of data blocks associated with the file.
  • block 11 - for files that contain more blocks than are accessible through the single indirection of block 10; points to a double indirect block that contains the addresses of a set of indirect blocks, each of which contains the addresses of another set of data blocks.
  • block 12 - for even larger files, points to a triple indirect block that contains the addresses of a set of double indirect blocks that point to sets of indirect blocks that point to the data blocks.

This scheme is illustrated in Figure 5-5. Data storage blocks are shaded, and indirect blocks are shown unshaded.

Figure 5 - Accessing Non-Contiguous Data Blocks

 

The number of data blocks accessed by each indirect block varies depending on the logical block size of the file system; it is calculated by dividing the logical block size by 4, which is the size of one pointer. The number of bytes of data that you may store in each data block also increases for the larger logical block sizes, so the file size that can be reached at each level of indirection grows rapidly with the logical block size. As a result, double and triple indirection is seldom needed but, if it is used, it can incur a fair amount of excess overhead in file access.
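
The arithmetic described above can be sketched as follows. This is an illustration only: the function and constant names are hypothetical, and the only figures taken from the text are the 4-byte pointer size and the 10 direct addresses.

```python
POINTER_SIZE = 4      # bytes per block address, as stated in the text
DIRECT_ENTRIES = 10   # blocks 0 - 9 of the data block address array
LEVELS = ("direct", "single", "double", "triple")

def addressable_blocks(block_size: int) -> dict:
    """Number of data blocks reachable through each level of indirection."""
    per_indirect = block_size // POINTER_SIZE
    return {
        "direct": DIRECT_ENTRIES,
        "single": per_indirect,
        "double": per_indirect ** 2,
        "triple": per_indirect ** 3,
    }

def max_file_bytes(block_size: int, deepest_level: str) -> int:
    """Largest file, in bytes, that needs nothing deeper than deepest_level."""
    counts = addressable_blocks(block_size)
    used = LEVELS[: LEVELS.index(deepest_level) + 1]
    return sum(counts[level] for level in used) * block_size
```

For example, with 1024-byte logical blocks a file of up to 272384 bytes (266 blocks) needs only the direct addresses and the single indirect block.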

 

Extent List

Inodes for files in F5 file systems have an extent list in addition to the data block address array discussed above. An extent is a set of preallocated, contiguous data blocks. The operating system does not impose a limit on the number of data blocks that are allocatable to one extent beyond the limit of the size of the file system. In other words, the operating system will allow you to create a file system that contains one file with one extent that contains all the data storage blocks allocated for the file system. Each extent is identified by the physical offset into the file system of the first block and a running sum of the number of data blocks allocated to all extents for the file, as illustrated in Figure 5-6.

Figure 6 - Accessing Extents (Contiguous Data Blocks)

 

In this figure, the first extent begins at block 300 and contains 10 data storage blocks, so the sum is 10. The second extent begins at block 100 and contains 50 data storage blocks, so the running sum is 60, which is the number of data blocks allocated to the first and second extents. The sum for the fourth extent represents the total number of data storage blocks allocated to all extents for the file (162 in this example). Storing running sums makes loops over the extent list self-terminating, and thus more efficient. If no extents are used, the offset and sum for all extents are 0. You may view the offset and sum for the extents by using the ls -E command string.
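
A minimal sketch of how running sums could be used to translate a block offset within the extent-based part of a file into a physical block number. The first two extents and the final sum of 162 come from the example above; the start blocks of the third and fourth extents and the third running sum are hypothetical values, and the function name is an assumption.

```python
# (start block, running sum) pairs for the four extents of one file.
# Entries 3 and 4 use made-up start blocks; only the final sum (162)
# is taken from the text.
EXTENTS = [(300, 10), (100, 60), (500, 62), (700, 162)]

def extent_block(extents, n):
    """Map block n of the extent-based part of a file to a physical block."""
    previous_sum = 0
    for start, running_sum in extents:
        if n < running_sum:
            # n falls inside this extent; offset from the extent's first block.
            return start + (n - previous_sum)
        previous_sum = running_sum
    return None  # n lies beyond the space allocated to extents
```

Because each entry carries the cumulative count, a single comparison per extent decides whether the block is in that extent, and unused entries (offset 0, sum 0) never match.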

Inodes for files in F5 file systems also contain additional flags that describe the allocation mechanism. These are viewed with either the ls -e or ls -E command:

  • c - file has contiguous extents.
  • s - physical space will shrink (in non-contiguous fashion) on truncates and creates; default is to not shrink.
  • g - grow file for writes beyond the end of physical space; if not set, writes beyond last active extent will fail.

Use the prealloc(1R) command or system call to allocate contiguous file space to a file, specify the physical disk location of the extent, and set the flags that control allocation. The trunc(2) system call truncates a file to a specified size. The estat(2) and efstat(2) system calls provide statistical information (such as file size and number of links) for files in F5 file systems similar to that returned by the stat(2) and fstat(2) system calls for S5 file systems, as well as information on the extents.

 

Accessing Free Inodes and Data Blocks

Each time a file is created, the operating system must allocate an inode and an appropriate number of data blocks to the file. Inodes are allocated the same for both file system architectures, using an array of free inodes stored in the super block. Data blocks are identified and allocated using different schemes for each architecture, with S5 file systems using a linked list with a cache kept in the super block and F5 file systems using a bitmap of free blocks stored outside the superblock.

 

Allocating and Deallocating Inodes

The super block's free inode array contains the inumbers of 100 free inodes, although the file system may have many more available. An index points to the next available slot in the array. When a file is created, the system must allocate an inode; if no inodes are available for allocation, the open(2) or creat(2) system call will fail.

The operating system takes the following steps to allocate an inode:

  1. Lock the free inode array.
  2. Decrement the index into the free inode array and use it to find the inumber of the next available inode. Free inodes are identified by a zero value for the mode field.
  3. If the free inode array empties, go to the ilist on disk and look for free inodes. As they are found, put their numbers in the array of free inumbers until the array is filled (100 inumbers) or, if the file system has fewer than 100 available inodes, until all available inodes are listed in the free inode array.
  4. Read the inode into main memory.
  5. Assign appropriate header information to the inode.
  6. Unlock the free inumber array and unblock any processes that were blocked while waiting for the array.

When an unlink(2) system call removes the last link to a file, the inode for that file is removed and its inumber is added back to the free inode array.

If the array is already full (has 100 inumbers in it), the operating system compares the number of the newly deallocated inode against the number in the first slot of the free inumber array. The lower of the two numbers is put (remains) in the first slot; the inode with the higher number is not added to the array and is simply left marked as free in the ilist on disk. This eliminates unnecessary searches when refilling the free inumber array, by ensuring that the lowest numbered inode available is in the first slot of the free inumber array.
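
The allocation and deallocation steps above can be modeled roughly as follows. This is a simplified sketch, not the REAL/IX implementation: the class and constant names are hypothetical, the ilist is represented as a list of mode values (zero meaning free), locking is omitted, and the array is refilled lazily at the start of the next allocation rather than at the moment it empties.

```python
FREE_ARRAY_SIZE = 100  # capacity of the free inode array, per the text

class InodeAllocator:
    def __init__(self, modes):
        self.modes = modes  # stand-in for the ilist: modes[i] is inode i+1
        self.free = []      # free inode array; its length acts as the index

    def _refill(self):
        # Scan the ilist for free inodes (mode == 0), lowest numbers first.
        for inumber, mode in enumerate(self.modes, start=1):
            if mode == 0:
                self.free.append(inumber)
                if len(self.free) == FREE_ARRAY_SIZE:
                    break

    def alloc(self, mode):
        if not self.free:
            self._refill()
        if not self.free:
            raise OSError("no free inodes")  # open/creat would fail here
        inumber = self.free.pop()            # "decrement the index"
        self.modes[inumber - 1] = mode       # inode is now in use
        return inumber

    def release(self, inumber):
        self.modes[inumber - 1] = 0          # mark free in the ilist
        if len(self.free) < FREE_ARRAY_SIZE:
            self.free.append(inumber)
        elif inumber < self.free[0]:
            # Array is full: keep the lower number in the first slot. The
            # displaced inode stays free on disk and is found on a refill.
            self.free[0] = inumber
```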

 

Allocating and Deallocating Free Blocks

Free blocks are data storage blocks that do not currently contain inodes, indirect address blocks, file extents, or data and are not part of the directory tree structure. Free blocks are each one logical block long.

For S5 file systems, the super block's free block array contains the data block numbers of 50 free data blocks, forming the beginning of a free list of data block numbers. This array is used as a stack whenever the file system needs to allocate another data block. To repopulate the free block array, the operating system uses address 0 as a pointer to a free block that contains 50 more addresses. This algorithm is extremely fast but tends to scatter data over time.
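
The S5 free list can be pictured as a stack in the super block that is chained through free blocks on disk. The sketch below is an approximation under stated assumptions: the class name is hypothetical, a zero in slot 0 is assumed to mark the end of the chain, and the block that held the next batch of addresses is assumed to be handed out itself once its contents are copied, as in traditional System V implementations. The example data uses far fewer than 50 addresses for brevity.

```python
class S5FreeList:
    def __init__(self, cache, chain_blocks):
        self.cache = list(cache)        # free block array in the super block
        self.disk = dict(chain_blocks)  # block number -> addresses it holds

    def alloc(self):
        # Assumption: a lone 0 in slot 0 means there is no further batch.
        if not self.cache or (len(self.cache) == 1 and self.cache[0] == 0):
            raise OSError("file system out of space")
        block = self.cache.pop()        # use the array as a stack
        if not self.cache:
            # We just took address 0: it names a free block that holds the
            # next batch of addresses, so copy them in before using it.
            self.cache = list(self.disk.pop(block))
        return block

free_list = S5FreeList(cache=[40, 7, 9], chain_blocks={40: [0, 12, 13]})
allocated = [free_list.alloc() for _ in range(5)]
```

Note that the blocks come off the list in whatever order they were freed, which is why this scheme tends to scatter a file's data over time.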

For F5 file systems, a bitmap of all free data blocks in the file system is stored outside the super block. To allocate data blocks for a file, the operating system searches this bitmap to identify free data blocks. This is slower than the S5 free list, but tends to localize disk accesses to a particular file, and also enables the operating system to quickly identify contiguous blocks available for allocation to extents.
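
As an illustration of why a bitmap makes contiguous runs easy to find, the following sketch scans a list of booleans for the first run of a requested length. The representation (True meaning free) and the function name are assumptions; the on-disk layout of the F5 bitmap is not described here.

```python
def find_contiguous(bitmap, count):
    """Return the first block of a run of `count` free blocks, or None."""
    run_start, run_length = 0, 0
    for block, is_free in enumerate(bitmap):
        if is_free:
            if run_length == 0:
                run_start = block   # a new run of free blocks begins here
            run_length += 1
            if run_length == count:
                return run_start
        else:
            run_length = 0          # run broken by an allocated block
    return None

bitmap = [True, False, True, True, False, True, True, True]
```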

When a process allocates a data block dynamically in an F5 file system, the system attempts to cache several contiguous data blocks (the default is 8, but you may modify this number through a kernel tunable parameter), which are used if the application needs to allocate additional blocks. The extra data blocks in the cluster are freed when the process closes the file or exits. This results in files that are less fragmented on disk, and thus accessed more efficiently.

 

Summary Comparison of S5 and F5 File Systems

The previous sections have discussed the internal differences between the S5 and F5 file system architectures. Table 5-1 summarizes these differences.

Table 1 - File System Comparison

 

File System Access Tables

When a file system is identified to the operating system with a mount(1M) command, the operating system makes an entry in the mount table and reads the super block into an internal buffer maintained by the kernel. Parts of the super block that are needed in memory are the lists of free inodes and storage blocks and the flags and time fields that are constantly being modified.

Three system tables maintain information about all files that are opened or referenced. These are:

  1. System Inode Table: contains information from the inodes for each open or referenced file. The operating system maintains one system inode table for all processes; it is part of the operating system's address space.
  2. System File Table: contains information about opens of files. Each time a file is opened, an entry is allocated in this table that identifies the way the file was opened and the user's current offset in the file. Like the system inode table, this table is part of the operating system's address space.
  3. User Area File Descriptor Table: a table of pointers to entries in the system file table. Each process has one of these tables, which resides in its user area.

Figure 5-7 shows how these tables string together. The following sections describe these tables in more detail.

Figure 7 - File System Tables and Their Pointers

 

The System Inode Table

The system inode table holds most information from the disk version of the inode, as well as the following:

  1. device: identifies the device on which the file system resides. This indicates the file system of which the file is a part.
  2. inumber: along with device, locates the file in the current file hierarchy. The location of the inode is calculated each time the file is referenced using the inumber as an offset, and is not stored in the disk version of the inode.
  3. reference count: number of times the file was opened or referenced; in other words, how many pointers from the system file table point at this entry. When a table entry is shared, the reference count shows how many processes are sharing the entry. When this number drops to zero, the file is no longer being referenced and the entry may be deallocated.
  4. last logical block read: used by the system to determine if the system is reading the file sequentially. If it is, the operating system invokes the read-ahead feature each time a read is done. This queues block n+1 for reading when block n is read, on the assumption that the system will eventually require block n+1.
  5. hash list pointers: used for locating used inode table entries and maintaining a free list of inode table entries.
  6. flags: indicate whether the file represents a shared-text a.out file, a locked inode, a modified file, or a mount-point directory. Other flags specify if the file is open for synchronous writes and if someone is waiting for the system to unblock this inode.

Notice that the system inode table does not contain any time stamps, which are part of the disk version of the inode.

Several linked lists are used to provide easy access to system inode table entries currently in use and to keep a list of free inode table entries for use when allocating entries while opening and referencing files. These lists are present in all systems based on UNIX System V to decrease the time required to manipulate new files:

  • System Inode Table Free List: starts at an operating system variable, which points to an unused system inode table entry, which points to another, and so on until all free table entries are on the free list. The link pointer of the last entry has a null pointer as its value.
  • System Inode Table Hash Lists: consist of pointers to inode table entries, each of which starts a hash list of inode table entries. This reduces the average number of entries inspected when looking for a file in the table by breaking the total number of used entries into a series of lists specified by the NINODE kernel parameter. These lists are maintained through an inode hash table.

The inumber and device are hashed to find an entry in the hash table. This entry points to an entry in the system inode table which is the head of the hash list. If the file is represented by an entry in the inode table, it is on this hash list.
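
The lookup described above can be sketched as follows. The hash function, the number of hash lists, and all names are hypothetical; the text states only that the device and inumber are hashed to select a list.

```python
NHASH = 8  # hypothetical number of hash lists

def bucket(device: int, inumber: int) -> int:
    # Hypothetical hash; any function of (device, inumber) would do.
    return (device + inumber) % NHASH

class InodeTable:
    def __init__(self):
        self.hash_lists = [[] for _ in range(NHASH)]

    def lookup(self, device, inumber):
        """Find or create the table entry for (device, inumber)."""
        chain = self.hash_lists[bucket(device, inumber)]
        for entry in chain:              # only this one hash list is searched
            if entry["device"] == device and entry["inumber"] == inumber:
                entry["refcount"] += 1   # entry is shared by another reference
                return entry
        entry = {"device": device, "inumber": inumber, "refcount": 1}
        chain.append(entry)
        return entry
```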

 

The System File Table

The system file table contains information about the opening of files on the system. Each time a file is opened, an entry is allocated and populated in the system file table. A system file table entry contains the following information:

  • reference count: indication of how many file descriptors are pointing to this entry. One or more processes may have descriptors pointing to this entry. If two processes are pointing to the same system file table entry, they must be related through a fork and the file must have been opened before the fork.
  • current offset: indication, in bytes, of the user's position within the file.
  • flags: used to record how the file was opened. They include the following values:
    • read: the file is available for reading
    • write: the file is available for writing
    • append: before each write to the file, the current offset is set equal to the size of the file. Therefore, if two or more users are appending to the same file, every write goes to the current end of the file, not to the position that was the end when the file was opened.
    • no delay: if the opened file is a named pipe, reads to an empty pipe will not cause the process to sleep, nor will writes to a full pipe. Instead, an error is returned.
  • pointer to system inode table entry: link to the remainder of the information about the file; an access pathway to the data blocks for the file.
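
The effect of the append flag can be observed from user space on any POSIX-like system. The sketch below uses Python's os module rather than anything specific to REAL/IX; the function name and file name are arbitrary.

```python
import os
import tempfile

def append_demo() -> bytes:
    path = os.path.join(tempfile.mkdtemp(), "log")
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    fd1 = os.open(path, flags, 0o644)
    fd2 = os.open(path, flags, 0o644)  # second open: its own offset
    os.write(fd1, b"one\n")
    os.write(fd2, b"two\n")            # goes after "one", not at offset 0
    os.write(fd1, b"three\n")          # goes after "two"
    os.close(fd1)
    os.close(fd2)
    with open(path, "rb") as f:
        return f.read()
```

Without the append flag, the write through the second descriptor would start at offset 0 and overwrite the first record.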

 

The User Area File Descriptor Table

The user area file descriptor table is part of the user area and contains pointers to entries in the system file table, which in turn point to entries in the system inode table. Each entry is referenced by a file descriptor (described above) and corresponds to an open file. The size of this table is determined by the kernel parameter NOFILES. The default value of NOFILES is 80, which allows each process to have up to 80 files open concurrently, including stdin, stdout, and stderr. The dup(2) system call copies one entry in the file descriptor table to another.
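
Because dup(2) copies a pointer to the same system file table entry, the two descriptors share one current offset. A short sketch using Python's os module on a POSIX-like system (the function name is arbitrary):

```python
import os
import tempfile

def dup_shares_offset():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"abcdef")
        os.lseek(fd, 0, os.SEEK_SET)
        fd2 = os.dup(fd)           # second descriptor, same file table entry
        first = os.read(fd, 3)     # advances the shared offset to 3
        second = os.read(fd2, 3)   # continues from offset 3
        os.close(fd2)
        return first, second
    finally:
        os.close(fd)
        os.unlink(path)
```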

 

Using the File Access Tables

To illustrate how the file access tables work, the following sections describe the internal system activities related to the open(2), creat(2), read(2), and write(2) system calls.

 

open

The open(2) system call opens an existing file and returns a file descriptor. As a simple example of the open system call, assume that the argument to open is the path /a/b.

  1. The operating system sees that the path name starts with a slash, so it begins the search at the root directory, using the pointer in the user area to the inode table entry for the root directory's inode.
  2. Using the root inode, the system does a linear scan of the root directory file looking for an entry "a". When "a" is found, the operating system picks up the inumber associated with "a".
  3. The inumber gives the offset into the ilist in which the inode for "a" is stored. At that location, the system determines that "a" is a directory by looking at the file type.
  4. That directory is searched linearly until an entry "b" is found. When "b" is found, its inumber is picked up and used as an index into the ilist to find the inode for "b".
  5. The inode for "b" is copied to the system inode table, and the reference count is incremented.
  6. The system file table entry is allocated, the pointer to the system inode table is set, the offset for the I/O pointer is set to zero to indicate the beginning of the file, and the reference count is initialized.
  7. The user area file descriptor table entry is allocated with a pointer set to the entry in the system file table.

The algorithm for locating the inode of a file illustrates why it is advisable to keep directories small. Search time is also reduced by keeping subdirectory names near the beginning of a directory file, which the dcopy command does.
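
The directory search in steps 1 through 4 can be modeled with a small in-memory ilist. Everything in this sketch is hypothetical (the inumbers, the dictionary layout, and the function name); it handles absolute path names only and ignores permissions and mount points.

```python
ROOT_INUMBER = 2  # hypothetical inumber for the root directory

# inumber -> inode; directory inodes hold ordered (name, inumber) entries.
ILIST = {
    2: {"type": "dir", "entries": [("a", 5)]},
    5: {"type": "dir", "entries": [("x", 9), ("b", 7)]},
    7: {"type": "file", "entries": None},
    9: {"type": "file", "entries": None},
}

def lookup_path(path: str) -> int:
    """Resolve an absolute path name to an inumber."""
    inumber = ROOT_INUMBER
    for component in [c for c in path.split("/") if c]:
        inode = ILIST[inumber]
        if inode["type"] != "dir":
            raise NotADirectoryError(component)
        for name, child in inode["entries"]:  # linear scan of the directory
            if name == component:
                inumber = child
                break
        else:
            raise FileNotFoundError(component)
    return inumber
```

Each path component costs one linear scan, so the position of an entry within its directory directly affects lookup time.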

 

creat

The creat(2) system call creates a new file and returns a file descriptor. It functions like the open(2) system call, with three additional steps at the beginning:

  1. The super block is referenced for a free inode.
  2. The mode of the file is established (by combining the system defaults with the complement of a umask entry) and entered in the inode.
  3. Using the inumber, the system goes through a directory search similar to that used in the open system call, except that here the last portion of the path name is written by the system into the directory that is the next to last portion of the path name, and the inumber of the newly-created file is stored with it.
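
Step 2 amounts to masking the requested permission bits with the complement of the umask. A minimal sketch, with a hypothetical function name:

```python
def creation_mode(requested: int, umask: int) -> int:
    """Permission bits stored in the inode for a newly created file."""
    # Bits set in the umask are cleared from the requested mode.
    return requested & ~umask & 0o777
```

For example, a requested mode of 0666 with a umask of 022 yields 0644.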

 

read and write

The read(2), write(2), aread(2) and awrite(2) system calls take a file descriptor as an argument, and follow these steps:

  1. Using the file descriptor as an index, the file descriptor table is read to get a pointer to the system file table.
  2. The user buffer address and number of bytes to read/write are supplied as arguments to the call. The current offset into the file is read from the system file table entry. For the aread and awrite system calls, the offset is modified by settings within the aiocb control block.
  3. For read operations, the inode is found by following the pointer from the system file table entry to the system inode table. The operating system copies the data from storage to the user's buffer.
  4. For write operations, the same pointer chain is followed, but the system writes into the data blocks. If new blocks are needed, they are obtained by the system from the file system's list of free blocks.
  5. The read or write operation will take place, provided the file is not locked by another process.
  6. Before the system call returns to the user, the number of bytes read or written is added to the offset in the system file table.
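
The pointer chain followed by read and write can be modeled with three small structures. This is a toy model under stated assumptions: the class and function names are hypothetical, file data is held in a byte array instead of disk blocks, and locking and the asynchronous calls are left out.

```python
class Inode:                    # stands in for a system inode table entry
    def __init__(self, data=b""):
        self.data = bytearray(data)

class FileEntry:                # stands in for a system file table entry
    def __init__(self, inode):
        self.inode = inode
        self.offset = 0

fd_table = {}                   # user area table: descriptor -> FileEntry

def sim_read(fd, nbytes):
    entry = fd_table[fd]                          # step 1: descriptor lookup
    start = entry.offset                          # step 2: current offset
    chunk = bytes(entry.inode.data[start:start + nbytes])  # step 3: copy out
    entry.offset += len(chunk)                    # step 6: advance the offset
    return chunk

def sim_write(fd, buf):
    entry = fd_table[fd]
    start = entry.offset
    # Step 4: overwrite in place; the byte array grows when the write
    # extends past the current end of the data.
    entry.inode.data[start:start + len(buf)] = buf
    entry.offset += len(buf)
    return len(buf)
```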

 

Files Shared by Related Processes

If related processes are sharing an open file (as is the case after a fork(2)), they also share the same file descriptor and entry in the system file table.

Unrelated processes that access the same file have separate file descriptors and separate entries in the system file table. Because they executed separate open(2) calls, they may read from or write to different places in the file.

In both cases, the entry in the inode table is shared; the correct offset at which the read or write operations should take place is tracked by the offset entry in the system file table.
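
The case of separate open calls can be checked from user space on a POSIX-like system. In the sketch below (function name arbitrary), two descriptors obtained from separate opens read the same bytes because each has its own offset, while fstat reports the same inode for both.

```python
import os
import tempfile

def independent_offsets():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"abcdef")
        os.close(fd)
        r1 = os.open(path, os.O_RDONLY)
        r2 = os.open(path, os.O_RDONLY)  # separate system file table entry
        a = os.read(r1, 3)
        b = os.read(r2, 3)               # starts at offset 0 again
        same_inode = os.fstat(r1).st_ino == os.fstat(r2).st_ino
        os.close(r1)
        os.close(r2)
        return a, b, same_inode
    finally:
        os.unlink(path)
```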

 

Path Name Conversion

The directory search and path name conversion take place only when the file is opened. For subsequent access of the file, the system supplies a file descriptor, which is an index into the file descriptor table in your user process area. The file descriptor table points to the system file table entry where the pointer to the system inode table is picked up. Given the inode, the system can find the data blocks that make up the file.

 

Synchronizing Disk Files and the Buffer Cache

The file subsystem must handle several simultaneous processes that access different files. To do this, the system keeps a cache of free blocks and inodes in memory along with the super block. When you write a file, you actually write to these blocks. Synchronization is the process by which the contents of these blocks are written to the actual device.

The operating system flushes the contents of these disk buffers (along with super blocks and updated inodes) to the disk devices periodically. Every five minutes (unless a high-priority realtime process prevents it), the init(1M) process flushes all buffers to disk. The bdflushr daemon runs more frequently to flush active buffers. Tunable parameters determine how often the bdflushr daemon runs, the priority at which it runs, and the number of seconds after the last write access to a buffer before bdflushr writes that buffer out. Under normal circumstances, this scheme is adequate to keep the disk files synchronized. If the system crashes before the buffers are written to disk, data loss may occur. During testing or when AC power problems increase the odds for a system crash, we recommend that you modify the tunable parameters that control the bdflushr daemon to reduce the interval between flush operations.

The REAL/IX Operating System also allows you to mount some file systems as "synchronous file systems." For synchronous file systems, each write operation writes to the cache, then immediately writes the data blocks and the updated inode to the device. You should use synchronous file systems only for applications in which the immediate updating of the file is critical, such as a process control data acquisition system that gathers statistics and has some realtime processes that use that data to run or change the actions of other processes. While an individual write operation to a synchronous file system is committed to the device sooner than a write operation to a non-synchronous file system, the use of synchronous file systems can degrade overall system performance.

The fsync(2) system call allows you to immediately flush the buffers associated with a particular file after a write(2) operation. fsync should be used by programs requiring that a file be in a known, up-to-the-minute state; for example, a transaction log for a database. fsync allows you to do critical updates without suffering the performance degradation that is sometimes associated with file systems mounted synchronously.
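
A transaction log of the kind mentioned above might use fsync as in the following sketch. It relies on Python's os.fsync, which wraps the fsync system call on POSIX-like systems; the function name and record format are hypothetical.

```python
import os
import tempfile

def append_record(path: str, record: bytes) -> int:
    """Append one record and flush this file's buffers before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        written = os.write(fd, record)
        os.fsync(fd)  # request that this file's buffers go to the device now
        return written
    finally:
        os.close(fd)
```

Only the one file is flushed, so other files on the same file system keep the benefit of normal buffering.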


 


Copyright © 2001 MODCOMP, Inc. | Rendered Sept. 28, 2001

MODCOMP is a subsidiary of CSP Inc