Process Subsystem
The process subsystem is responsible for process scheduling,
process synchronization, memory management, and interprocess communications.
The REAL/IX Operating System supports all process subsystem facilities
of UNIX System V, plus enhancements to provide an appropriate execution
environment for realtime processes.
Processes
In UNIX terminology, a program is the set of
instructions and data coded and compiled by the programmer, and a
process is one execution of a program. Some other operating
systems use the term task for what we call a process. On UNIX
operating systems, many processes may execute the same program
concurrently.
Processes execute at either user level or kernel level.
- Each user-level process runs in its own address space, separated
from all other processes. Other processes may communicate with
it through one of the facilities discussed in the next chapter,
and a process executing at a higher priority may prevent it from
executing, but otherwise it is totally protected from other processes.
- All kernel-level processes execute as functions of the kernel
main() routine. While it is possible to synchronize kernel
processes to prevent concurrent access of kernel resources, any
kernel process can access the address space of any other executing
process. For this reason, it is important that kernel-level processes
(including user-installed system calls and drivers) use defined
functions as much as possible, access kernel data structures appropriately,
use kernel semaphores and spin locks as needed, and be tested
thoroughly before installation.
Memory Segments for Processes
Each process is represented by two memory segments called
the text (or code) segment and the data segment, and
a set of data structures that are referred to as the process environment.
A text segment contains code and constant data and is shared by all
processes running the same program.
The data segment of an executing process is composed
of two regions: the stack region and the data region. The data region
contains the process's static variables, and may also contain a dynamic
data structure known as a heap. The data region begins at the
low address of the space and grows upward; the stack region begins
at the high address and grows downward. The actual stack and heap
areas are always slightly smaller than the region that is allocated
for them, as illustrated in Figure 1. Note that any shared memory
segments are located between the stack and the heap.
Figure 1 - Data Segment of Executing Process
The process environment records the information the
kernel needs to manage the process, such as register contents, priority,
open files, and so forth. In order to maintain system integrity, a
process may not address its environment directly, but can use system
calls to modify it.
Creation and Termination
One of the fundamental differences between UNIX operating
systems and other operating systems is their reliance on processes that
are spawned from others, called child processes. A number of standard
system services exist as ordinary utility processes rather than embedded
in the operating system. When the operating system is first booted,
it creates the init process, which is the parent of all shell
processes created when users log in. If you issue a command from the
terminal (such as cat(1), to list the contents of a file),
that process is a child of your shell program. On UNIX operating systems,
processes often use other processes rather than subroutines. Of course,
processes also create processes to run portions of algorithms in parallel,
as they would on other operating systems, but the use of a process
as something similar to a subroutine is peculiar to UNIX operating
systems and the associated programming style.
Forking Child Processes
A new process is created with a fork(2) system
call. This call creates a child process, which is effectively a duplicate
of the parent process that created it. This is implemented using a
copy-on-write scheme, where data pages are copied only when a write
operation is requested, thus avoiding unnecessary copying. fork
copies the parent's data and stack segments (or regions) plus
its environment to the child. The child shares its parent's text segment,
which conserves memory space (by default, all REAL/IX Operating System
programs are reentrant). Child processes initialize more quickly than
the original process, since they only have to modify parts of the
inherited environment rather than recreate the entire environment.
fork returns values to both the parent and child,
but returns a different value to each. The parent receives the process
id (PID) of the child, which can never be 0, and the child receives
0. The program uses these return values to determine which is the
child process and which the parent process (since they are both executing
the same program) and takes different branches in the code for each.
After checking the value returned by fork, the
parent and child execute different branches of the same program in
parallel. If the child needs to execute a different program, it issues
an exec(2) call. exec replaces the text and data segments
of its caller with those of a new program read from a file specified
in the call. exec does not alter its caller's environment;
a child process may execute a different program, but still have access
to its parent's files, although they may have been modified by the
child between the fork and the exec.
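The fork and exec sequence described above can be sketched in portable C. This is a minimal illustration that assumes a POSIX system with /bin/sh available; capture_output is a hypothetical helper name, not a REAL/IX routine. The child rearranges a descriptor it inherited from the parent between the fork and the exec, so the new program writes into the parent's pipe.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `cmd` with /bin/sh in a child process and capture up to n-1 bytes of
 * its standard output in buf. Returns the child's exit code, or -1 on error.
 * (capture_output is a hypothetical helper name.) */
int capture_output(const char *cmd, char *buf, size_t n)
{
    int fds[2];

    assert(n > 0);
    buf[0] = '\0';
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: fork returned 0. Adjust the inherited descriptors
         * between fork and exec, then replace the program image. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                  /* reached only if exec failed */
    }

    /* Parent: fork returned the child's PID, which is never 0. */
    close(fds[1]);
    size_t used = 0;
    ssize_t got;
    while (used < n - 1 &&
           (got = read(fds[0], buf + used, n - 1 - used)) > 0)
        used += (size_t)got;
    buf[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, capture_output("echo hi", buf, sizeof buf) leaves the text written by the child program in buf and returns the exit code of the shell.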
A child issues an exit(2) system call to terminate
normally; this call takes a parameter whose value is returned to the
child's parent. A child may also terminate abnormally by a signal
issued by the kernel, a user, or another process. When a child terminates,
either normally or abnormally, the operating system sends a SIGCLD
signal to the parent process. Signals
are discussed in more detail in Chapter 4.
Waiting for a Child Process to Terminate
Meanwhile, the parent process is free to continue execution
in parallel with the child process. If the parent needs to wait for
the child to terminate, it issues the wait(2) system call.
wait returns the process id of the terminated child (which
allows one parent process to spawn several children and specify the
one for which it is waiting) and the status code passed by the child
when it exits.
Whether or not the parent waits for the child to terminate,
it receives a SIGCLD signal when the child terminates. The parent
process can catch this signal, then issue a wait to learn what
happened to the child process.
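As a minimal sketch of the exit and wait handshake, assuming a POSIX system: the parent below spawns several children, each of which exits with a distinct code, and then collects every child with wait. The name spawn_and_reap is hypothetical and is used only for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork `count` children; child i exits with code base + i. The parent
 * collects them with wait(2) and returns the sum of the exit codes,
 * or -1 on error. (spawn_and_reap is a hypothetical name.) */
int spawn_and_reap(int count, int base)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0)
            _exit(base + i);         /* child: terminate normally */
    }

    int sum = 0;
    for (int i = 0; i < count; i++) {
        int status;
        pid_t done = wait(&status);  /* blocks until some child terminates */
        if (done == -1)
            return -1;
        if (WIFEXITED(status))       /* normal exit: extract the exit code */
            sum += WEXITSTATUS(status);
    }
    return sum;
}
```

The value returned by wait identifies which child terminated, so a parent that needs one particular child can compare it against the PID that fork returned.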
Process States
On the REAL/IX Operating System, there are eight process
states, which can be viewed with the ps -efl command. These states, with
the ps descriptor shown in parentheses, are:
- The process is runnable: it is not currently running, but runs
when the kernel schedules it (R).
- The zombie state, where the process has issued the exit(2)
system call and no longer exists, but it leaves a record containing
an exit code and some timing statistics for its parent process
to collect. The zombie state is the final state of a process (Z).
- The process is stopped by a signal (T).
- The process is newly created and is in a transition state; the
process exists, but is neither blocked nor runnable. This state
is the start state for all processes except process 0 (I).
- The process is blocked awaiting memory availability (X).
- The process is executing in either kernel or user mode (N).
The "N" indicates the number of the processor on which
the process is executing.
- The process is blocked awaiting some event; a signal will not
unblock it (D).
- The process is blocked awaiting some event; a signal may unblock
it (S). This is usually the most common state.
Memory Management
The memory management module allocates memory resources
among all executing processes on the system.
Data Structures
Every executing process has three memory management
data structures (proc, user, and pregion)
associated with it:
- proc structure (defined in the proc.h header
file). All proc structures are listed in the kernel's
process table, whose size is determined at sysgen(1M) time.
The kernel's process table is always in memory (never paged out),
so the proc structures contain information the kernel
may need while a process is paged out, such as its priority, its
process group and parent process, address of the "u"
page, and addresses used for sleep functionality.
- user block (or u area, defined in the user.h
header file). The user block is never paged out on the
REAL/IX Operating System, but because some other varieties of
the UNIX operating system do page them out, the convention is
to keep out of the user block any information that the kernel
needs while the user block is paged out. The user
block for the currently executing process is always located at
a specific location in virtual memory; when the kernel does a
context switch, the u area for the currently running process
is mapped out of the fixed address, and the u area for
the process that is about to run is mapped into the fixed address.
Each user block has a pointer to the corresponding proc
structure.
- process region (or pregion) structure. The entries
in the process's pregion table point to entries in the
system's region table, each of which describes a logical
segment of memory. Each pregion table has a fixed number
of entries, usually three (for text, data, and stack) plus the
number of shared memory segments for the process. All processes
executing the same program have a pregion pointer to
the same text segment in the system's region table. When
sharing memory, the pregion structure for the process
that initialized the shared memory segment points to the entry
in the region table; the pregion structures for
all other processes that access that shared memory segment point
to the same region entry.
These structures are illustrated in Figure 4.
Figure 4 - Memory Management Data Structures
Note that the shared memory is accessed by the two processes
whose pregion tables both point to the same Kernel Region
Table entry.
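The relationship between pregion tables and the region table can be sketched with a few C structures. Every name below (proc_sketch, attach_region, the field names, and the table size) is hypothetical and chosen for illustration; the actual definitions are those in proc.h and the related kernel headers.

```c
#include <assert.h>
#include <stddef.h>

#define NPREGION 4   /* assumed table size; the real value is system-defined */

enum seg_type { SEG_TEXT, SEG_DATA, SEG_STACK, SEG_SHM };

struct region {              /* one entry in the system-wide region table */
    enum seg_type type;
    int           refcnt;    /* number of pregion entries pointing here */
};

struct pregion {             /* one entry in a per-process pregion table */
    struct region *reg;      /* NULL when the slot is unused */
};

struct proc_sketch {         /* stands in for the per-process bookkeeping */
    struct pregion pregions[NPREGION];
};

/* Attach region r to the first free pregion slot of p.
 * Returns the slot index, or -1 if the table is full. */
int attach_region(struct proc_sketch *p, struct region *r)
{
    assert(p != NULL && r != NULL);
    for (int i = 0; i < NPREGION; i++) {
        if (p->pregions[i].reg == NULL) {
            p->pregions[i].reg = r;
            r->refcnt++;
            return i;
        }
    }
    return -1;
}
```

Two processes that attach the same region end up with pregion entries that hold the same pointer, which is the sharing arrangement described above for text and shared memory segments.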
The fork(2) and exec(2) system calls are
intimately associated with the memory management data structures.
When a process forks, the kernel:
- makes a new entry in its process table for the child process,
copying most of its contents from the parent's proc structure.
- copies the parent's pregion table to the pregion
table it allocates for the child.
- allocates new pregion entries and page descriptors
for the child's data and stack segments.
- allocates new pregion entries for the child's text
and shared memory segments; these regions share the parent's corresponding
physical pages as long as both processes only read the pages.
- if either the child or the parent writes to one of the pages
for data or stack, the kernel makes a separate copy of that physical
page for the child.
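The effect of these steps is observable from user level: after a fork, a write by the child to a data page is not seen by the parent. The sketch below assumes a POSIX system; the function and variable names are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int data_word = 1;    /* lives in the data region */

/* Returns 1 if a write made by the child is invisible to the parent and
 * the child saw its own private copy, 0 if not, -1 on error. */
int child_write_is_private(void)
{
    data_word = 1;
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        data_word = 99;      /* the child now works on its own copy of the page */
        _exit(data_word == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return child_ok && data_word == 1;   /* the parent's copy is unchanged */
}
```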
Address Mapping
A running process refers to memory with virtual addresses,
which essentially consist of a virtual page number and a byte offset
into the page. The hardware's memory management unit (MMU) translates
the virtual page number into a physical page frame number, adds the
offset, and sends the resulting physical address to memory.
The key to performing the virtual-to-physical address
translation, or mapping, is the page map maintained in kernel memory
and defined in immu.h. As discussed earlier, each process has
its own page table, containing one entry for each of its pages; when
the process is running, its page table is loaded into the page map.
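The translation step can be modeled in a few lines of C. This is a simplified sketch: the 4 KB page size, the flat one-level table, and the names translate and PTE_INVALID are assumptions made for illustration, not the layout defined in immu.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                     /* assumed 4 KB pages */
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define PTE_INVALID UINT32_MAX              /* marks a page with no frame */

/* Translate vaddr through a per-process page table with npages entries.
 * Each entry holds a physical frame number, or PTE_INVALID.
 * Returns 0 and stores the physical address, or -1 for a page fault. */
int translate(const uint32_t *page_table, size_t npages,
              uint32_t vaddr, uint32_t *paddr)
{
    assert(page_table != NULL && paddr != NULL);
    uint32_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* byte offset in the page */

    if (vpn >= npages || page_table[vpn] == PTE_INVALID)
        return -1;
    *paddr = (page_table[vpn] << PAGE_SHIFT) | offset;
    return 0;
}
```

With a table that maps virtual page 2 to frame 9, the virtual address 0x2234 splits into page 2 and offset 0x234 and translates to physical address 0x9234.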
Paging
When the number of processes exceeds the permissible
number for residence in memory, the kernel moves less-active pages
(that are not locked in main memory) from main memory to disk.
This maintains good overall system performance, although it may slow
the response of an individual process.
The vhand daemon is responsible for paging operations.
When less than 10% of the available memory is free, vhand makes
"aging passes". When less than 1% of the available memory
is free, vhand pages out the least recently used pages on the
system until 5% of memory is free.
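The thresholds above amount to a small decision rule with hysteresis, sketched below. The names are hypothetical, and the sketch assumes that paging out takes precedence over aging while memory is being recovered; the text does not say how vhand combines the two activities.

```c
#include <stdbool.h>

enum vhand_action { VHAND_IDLE, VHAND_AGE, VHAND_PAGE_OUT };

/* Decide what the pager does for a given percentage of free memory.
 * *paging_out carries state between calls: once free memory drops
 * below 1%, paging out continues until 5% of memory is free. */
enum vhand_action vhand_decide(double free_pct, bool *paging_out)
{
    if (free_pct < 1.0)
        *paging_out = true;
    else if (free_pct >= 5.0)
        *paging_out = false;

    if (*paging_out)
        return VHAND_PAGE_OUT;
    return free_pct < 10.0 ? VHAND_AGE : VHAND_IDLE;
}
```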
The free list contains page frames that are eligible
for reuse. Page frames whose contents are no longer needed are added
to the head of the free list, and page frames whose contents may be
needed again are added to the tail. For example, when the last process executing a program
terminates, the process's stack, data, user page, and page table frames
are of no use to another process, and so are added to the head of
the free list. The frames containing the code pages are usable if
another process executes the program, so they are added to the tail
of the free list.
If another process executes the program while the code
pages are on the free list, the kernel reclaims them rather than bringing
them in from the object file or the swap device. Page frames are allocated
only from the head of the free list so that pages that may be needed
again stay associated with page frames as long as possible.
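The free list policy can be modeled with a small array-based list. This is a sketch of the policy only, with hypothetical names and a fixed capacity; it does not reflect the kernel's actual data structures.

```c
#include <assert.h>
#include <string.h>

#define MAX_FREE 16          /* arbitrary capacity for the sketch */

struct free_list {
    int frames[MAX_FREE];    /* frames[0] is the head of the list */
    int count;
};

/* Frame whose contents are of no further use: first to be reallocated. */
void free_at_head(struct free_list *fl, int frame)
{
    assert(fl->count < MAX_FREE);
    memmove(&fl->frames[1], &fl->frames[0],
            (size_t)fl->count * sizeof fl->frames[0]);
    fl->frames[0] = frame;
    fl->count++;
}

/* Frame whose contents may be wanted again: last to be reallocated. */
void free_at_tail(struct free_list *fl, int frame)
{
    assert(fl->count < MAX_FREE);
    fl->frames[fl->count++] = frame;
}

/* Allocate from the head. Returns the frame, or -1 if the list is empty. */
int alloc_frame(struct free_list *fl)
{
    if (fl->count == 0)
        return -1;
    int frame = fl->frames[0];
    fl->count--;
    memmove(&fl->frames[0], &fl->frames[1],
            (size_t)fl->count * sizeof fl->frames[0]);
    return frame;
}

/* Reclaim a specific frame that is still on the list.
 * Returns 1 if the frame was found and unlinked, 0 otherwise. */
int reclaim_frame(struct free_list *fl, int frame)
{
    for (int i = 0; i < fl->count; i++) {
        if (fl->frames[i] == frame) {
            fl->count--;
            memmove(&fl->frames[i], &fl->frames[i + 1],
                    (size_t)(fl->count - i) * sizeof fl->frames[0]);
            return 1;
        }
    }
    return 0;
}
```

If a code frame is freed at the tail and stack and data frames are freed at the head, the code frame can still be reclaimed while allocations consume the frames at the head first.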
Note that frames containing kernel code and data are
on neither the swap list nor the free list because no part of the
kernel is ever paged out.
Page Faults
A page fault is an attempt to access a page that the
pager has marked invalid, and invokes the kernel's page fault handler.
The page fault handler takes different actions depending on where
the invalid page is:
- If the page is on the free list, the page fault handler unlinks
the associated page frame from the free list, marks it valid,
and resumes the process. Reclaiming a page frame from the free
list in this manner does not block the faulting process and is
faster than a disk read, although the overhead may still be
unacceptable for critical realtime processes.
- If the page is paged out (in other words, the frame it formerly
occupied is allocated to another page), the page fault handler
blocks the faulting process and schedules a disk read to retrieve
the page from the swap device. Later, when the page is read, the
kernel allocates a frame from the head of the free list, updates
the frame address in the process's page table, marks the page
valid, and unblocks the process.
The combined actions of the pager and the page fault
handler tend to keep frequently-accessed pages associated with page
frames, while little-used pages tend to migrate to the swap device.
The pager moves page frames to the free list if they are not recently
accessed, so that they are eventually reallocated. Concurrently, page
faults taken on these pages nullify the pager's efforts. If a page
is accessed very frequently, the pager never sees it marked as not-accessed
and never adds it to the free list. If a page is accessed moderately often, it
may be added to the free list, but is rapidly reclaimed
by the page fault handler before it gets to the head of the list.
Only the least-used pages get to the head of the free list and require
reading from disk before they are used. The free list thus serves
both as a source of available page frames and a cache of recently
discarded pages that are quickly reclaimed.
Allocating Memory
REAL/IX memory allocation is similar to that on other
UNIX operating systems, with extensions to provide explicit control
over memory allocation in critical realtime application programs.
These extensions provide the realtime programmer with complete control
over the REAL/IX demand-paging subsystem. To guarantee response time,
users are allowed to pre-page and lock all pages (instructions, data,
shared memory, and stack) of a program into memory. At the programmer's
option, the operating system notifies a realtime process of any attempt
to grow the stack or data portions of the process's data segment.
The underlying philosophy of memory allocation is different
for realtime and time-sharing processes. The time-sharing philosophy
is to avoid consuming any more memory than is absolutely necessary
so that all processes have equal access to memory resources. For critical
realtime processes, the emphasis is on providing optimal performance
for the program.
Consequently, realtime programs typically preallocate
a generous amount of memory and lock all resources they might need
into memory. You may use the REAL/IX Operating System to implement
either philosophy.
Preallocating Memory
The REAL/IX Operating System uses demand paging, so
the operating system does not allocate any memory for a process when
it is initialized. Rather, the process is allowed to start executing
at its entry point, which causes the process to page fault text and
data pages as they are referenced. As the data segment (which is composed
of the stack and data regions) outgrows that memory, the system allocates
more physical pages. This scheme conserves memory and is appropriate
for many applications, but the overhead incurred is unacceptable for
critical realtime programs.
Most UNIX operating systems provide the brk(2)
and sbrk(2) system calls to preallocate virtual space for the
data region. The end of the data segment is called the break.
brk and sbrk allow you to specify a new location for
the break, with brk specifying an absolute address and sbrk
specifying an address relative to the current break.
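A minimal sketch of moving the break with sbrk follows, assuming a system whose C library still provides sbrk (it is a legacy interface on current systems). The function name grow_data_region is hypothetical. In keeping with the advice later in this section, the sketch does not call malloc.

```c
#define _DEFAULT_SOURCE      /* exposes sbrk() in glibc's <unistd.h> */
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Move the break up by `bytes` and return the number of bytes by which
 * the data region actually grew, or -1 if sbrk failed. */
long grow_data_region(long bytes)
{
    assert(bytes >= 0);
    char *old_brk = sbrk(0);                 /* sbrk(0) reports the current break */
    if (sbrk((intptr_t)bytes) == (void *)-1)
        return -1;
    char *new_brk = sbrk(0);
    return (long)(new_brk - old_brk);
}
```

A realtime program would typically make one such call during initialization, sized for its worst-case data requirement, rather than growing the region while it runs.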
The REAL/IX Operating System also provides the stkexp(2)
system call, to preallocate virtual space for the stack. You can specify
either the absolute size of the stack or the increment by which the
stack is to grow.
For programs that need memory allocated for data space,
the malloc(3C) mechanisms are a simple, general-purpose memory
allocation package. Realtime programs should call malloc only
during the initialization part of the program, or use brk or
sbrk to pre-allocate data space. Note that one program should
not call both malloc and brk/sbrk.
Locking Pages in Memory
Paging enables the operating system to provide good
performance to a number of programs executing at the same time. However,
the overhead associated with accessing processes or data that are
paged out is significantly more than the overhead involved in accessing
processes or data that are resident in memory.
The plock(2) system call allows you to lock text
and data segments into memory. The SHM_LOCK command of the shmctl(2)
system call allows you to lock shared memory segments into memory.
These calls lock segments into memory when they are first accessed.
They can be used with the system calls that preallocate memory, or can
allow the operating system to allocate memory as needed.
Critical realtime processes can lock segments into memory
during process initialization with the resident system call,
so that the first attempt to access a segment does not incur the overhead
of loading it into memory. The resident(2) call requires preallocation
of memory as discussed above. For critical realtime processes that
preallocate memory, expanding memory beyond the preallocated limits
is usually considered a fault, although the action taken for such
a fault is at the discretion of the programmer. If desired, the resident
call may post an event (as discussed in the following chapter) to
the process if the memory allocated is inadequate for the stack or
data region.
Scheduling
Scheduling determines how a CPU is allocated to executing
processes. Each executing process has a priority that determines its
position on the run queue.
The multiprocessing environment uses, effectively, two
run queues: the global run queue and the local run queue. The global
run queue is the list of processes that are executable by the first
available CPU. The local run queue is maintained by each individual
CPU and is a list of processes that have been targeted for execution
on that particular CPU. Specifying a targeted CPU for a process might
be used to improve the realtime performance of a particular process.
Targeting CPUs is accomplished with the targetcpu(2) system
call. The multiprocessor run queues are organized in the same manner
as the uniprocessor run queue. Note that if a local run queue is available, the
global run queue is not used. The run queue organization is discussed
in the following paragraphs.
On the REAL/IX Operating System, the run queue consists
of 256 process priority "buckets", divided into two major
parts as illustrated in Figure 3: processes executing at priorities
128 through 255 (time-slice priorities) use a time-sharing scheduler
implemented in the onesec process, and processes executing
at priorities 0 through 127 (realtime priorities) use a process
priority scheduler implemented internally. The two schedulers
use different scheduling algorithms, which are discussed below.
| 0 through 127 | realtime priorities are changed only by explicit request of program or administrator. |
| 128 through 253 | timesharing priorities are dynamically adjusted by the operating system. |
| 254, 255 | non-migrating priorities execute only when all other processes are idle. |
Figure 3 - REAL/IX Process Priorities
Processes are scheduled according to the following rules:
- A process runs only when no other process at a higher priority
is runnable.
- Once a process with a realtime priority (0 through 127) has
control of a CPU, it retains possession of that CPU until it is
preempted by a process running at a higher priority, relinquishes
the CPU by making a call that causes a context switch, blocks
to await some event (such as I/O completion), or its time slice
expires (by default, the time slice is 11 years, but this is changeable
with the setslice(2) system call).
- Because the REAL/IX Operating System has a preemptive kernel,
preempting a running process is possible at any time if a process
at a higher priority becomes runnable.
The process table entry for each executing process includes
scheduling parameters used by the kernel to determine the order in
which processes are executed. These parameters are determined differently
for time-sharing and fixed-priority scheduled processes.
Time-Sharing Scheduling
Processes executing at priorities 128 through 253 utilize
a time-sharing scheduler similar to that on other UNIX operating systems.
The operating system varies the priorities of executing processes
according to a specific algorithm. For instance, interactive processes
gravitate towards higher priorities (their actual run times are relatively
small compared to their wait times), and processes that recently consumed
a large amount of the CPU are relegated to lower priorities. A process's
only control over its priority is through the nice(2) command
and system call, which lowers the relative priority of a process or,
if issued by the superuser, can grant the process a more favorable
priority.
The kernel allocates a CPU to a process for a time quantum,
determined by the MAXSLICE tunable kernel parameter. The process will
retain possession of the CPU until preempted by a higher priority
process, until it is finished and relinquishes control of the CPU,
or until its time quantum expires. If the process has not released
the CPU at the end of the time quantum, it is preempted and fed back
into the queue at the same priority. When the kernel again allocates
a CPU to the process, the process resumes execution from the point
where it was suspended. Once a second, those time-sharing processes
that have consistently used their whole quantum are shuffled to lower
priorities.
This time-slice scheduler has the advantage of equitably
distributing the CPU among all executing processes. It is not adequate
for critical realtime processes, which need to execute in a determinate
(preferably fast) time frame. For this reason, the REAL/IX system
supplements the traditional time-sharing scheduler with the fixed-priority
scheduler.
Effect of nice
Each executing process has a nice value. This
value is displayed on the ps -efl output. The nice(2)
system call and user command change this nice value.
When the nice value is recalculated, the execute
priority of the process is recalculated if the process is executing
at a time-share priority (128 through 253). If the process is executing
at a fixed priority (0 through 127, 254, or 255), recalculating the
nice value has no effect on the execute priority.
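The priority ranges and the rule for nice can be written as a short classification routine. The enumerator and function names below are hypothetical; the ranges are the ones given in the text.

```c
#include <stdbool.h>

enum pri_class { CLASS_REALTIME, CLASS_TIMESHARE, CLASS_NONMIGRATING, CLASS_INVALID };

/* Map an execute priority (0 through 255) to its scheduling class. */
enum pri_class classify_priority(int pri)
{
    if (pri < 0 || pri > 255)
        return CLASS_INVALID;
    if (pri <= 127)
        return CLASS_REALTIME;       /* fixed priority, 0 through 127 */
    if (pri <= 253)
        return CLASS_TIMESHARE;      /* adjusted by the system, 128 through 253 */
    return CLASS_NONMIGRATING;       /* 254 and 255 */
}

/* A change to the nice value affects the execute priority only for
 * processes running at a time-share priority. */
bool nice_changes_priority(int pri)
{
    return classify_priority(pri) == CLASS_TIMESHARE;
}
```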
Fixed-Priority Scheduling
Processes executing at priorities 0 through 127 utilize
a fixed-priority scheduler. A process establishes its own priority
with the setpri(2) system call. The operating system never
automatically changes the priority of a process executing at a realtime
priority except during a semaphore boost operation; only the process
itself or a user with realtime or superuser privileges can change
the priority. A process executes only if there are no processes with
higher priorities that are runnable at this time.
Runnable processes at the same priority are arranged
in a circular, doubly-linked list. A round-robin scheduling scheme
is used for processes at the same priority, with the ability for a
process to relinquish its time slice to another process at the same
priority. A process can also use the setslice(2) system call
to establish its time quantum; the default is 6 ticks.
The fixed-priority scheduler allows a critical realtime
process to "hog" the CPU as long as necessary to finish.
If processes at high realtime priorities are using the CPU, low-priority
realtime processes and time-share processes may never get to execute.
Note that a process scheduled at priority 100 will run just as fast
as a process scheduled at priority 1 if no processes are scheduled
at a higher priority. Because a high-priority "runaway"
process may never surrender the CPU, we recommend setting the console
to run at priority 0 or 1 and running no other processes (other than critical
system processes) at that priority, to ensure that you can regain
control of the system.
Run Queue Organization
To enable the operating system to search the queue efficiently,
two bit mask schemes are implemented for each run queue. The rqmask2
scheme contains 8 bit masks, each of which is 32 bits long. Each bit
corresponds to one bucket in the queue; if a bit is set to 1, there
are one or more runnable processes at that priority. The rqmask
is an 8-bit mask, with one bit for each bit mask in rqmask2.
If there are any runnable processes at priorities 0 through 31, the
first bit is set; if there are any runnable processes at priorities
32 through 63, the second bit is set, and so forth. The run queue
organization is illustrated in Figure 6.
Figure 6 - REAL/IX Run Queue Organization
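The two-level bit mask search can be sketched as follows. The field names rqmask and rqmask2 come from the text; the structure name, the helper functions, and the choice of the least significant bit as the "first" bit are assumptions made for illustration. The sketch also assumes that a numerically lower priority is the more urgent one, as the recommendation to run the console at priority 0 or 1 suggests.

```c
#include <assert.h>
#include <stdint.h>

struct runq_masks {
    uint32_t rqmask2[8];   /* one bit per priority bucket: 8 x 32 = 256 */
    uint8_t  rqmask;       /* one bit per 32-bit word of rqmask2 */
};

/* Record that the bucket for priority pri holds a runnable process. */
void mark_runnable(struct runq_masks *m, int pri)
{
    assert(pri >= 0 && pri <= 255);
    m->rqmask2[pri / 32] |= (uint32_t)1 << (pri % 32);
    m->rqmask |= (uint8_t)(1u << (pri / 32));
}

/* Record that the bucket for priority pri is now empty. */
void mark_empty(struct runq_masks *m, int pri)
{
    assert(pri >= 0 && pri <= 255);
    m->rqmask2[pri / 32] &= ~((uint32_t)1 << (pri % 32));
    if (m->rqmask2[pri / 32] == 0)
        m->rqmask &= (uint8_t)~(1u << (pri / 32));
}

/* Return the numerically lowest priority that has a runnable process,
 * or -1 if every bucket is empty. The 8-bit summary mask lets the
 * search skip whole groups of 32 empty buckets. */
int best_priority(const struct runq_masks *m)
{
    for (int word = 0; word < 8; word++) {
        if (!(m->rqmask & (1u << word)))
            continue;
        for (int bit = 0; bit < 32; bit++)
            if (m->rqmask2[word] & ((uint32_t)1 << bit))
                return word * 32 + bit;
    }
    return -1;
}
```

With processes runnable at priorities 40 and 200, bits 1 and 6 of rqmask are set, and the search returns 40 after examining only the second word of rqmask2.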
Locks to Preserve Data Structure Integrity
The REAL/IX Operating System uses spin locks and suspend
locks (or kernel semaphores) to ensure data structure integrity in
the preemptive kernel. If two processes access the same global data
structure, it is important that the first process completes any update
of that structure before the second process accesses it. In other
UNIX kernels, this is handled by disabling interrupts to prevent an
interrupt handler from accessing a data structure that is being manipulated
by process-level kernel code. If the preemptive kernel were implemented
without locks, a higher-priority process could cause a context switch
away from a lower-priority process that is in the middle of
updating a data structure, and thus corrupt the structure.
In a non-preemptive uniprocessor configuration, data
structure integrity is preserved by manipulating processor execution
levels to prevent interrupts when updating a structure, but this is
inadequate for a multiprocessor configuration because the interrupt
handler or another process may execute on a different processor than
the process-level routines. The locking mechanism enables the REAL/IX
Operating System to run on a multiprocessor configuration, where all
processors operate on a symmetrical, peer-to-peer basis and each processor
can simultaneously execute user-level code, process-level kernel code,
and interrupt-level kernel code.
Synchronization in Compatibility Mode
Other UNIX operating systems provide kernel-level synchronization
with the sleep/wakeup functions to block and unblock a process,
and the spl (set priority level) function to disable interrupts.
The REAL/IX system provides three compatibility modes that allow drivers
to be ported from UNIX System V without rewriting the synchronization
facilities. The three modes are:
- Non-preemptible - kernel preemption is turned off when
the process is running.
- Major-device semaphoring - one semaphore is set for the
major device (that is, the driver itself). This is implemented
in the switch table, so that only one instance of the driver entry
point can execute at a time.
- Minor-device semaphoring - one semaphore is set for each
minor device (that is, for each actual device controlled by the
driver).
Drivers installed using these compatibility modes may
not realize the full performance enhancements provided by rewriting
the drivers to use kernel semaphores and spin locks, but should perform
similarly to how they do on UNIX System V.
Kernel Level Semaphores
Suspend locks, or kernel semaphores, are used when the
lock time is relatively long (implemented by switching to another
runnable process while the desired resource is busy). They are used
to limit the number of processes that access a kernel resource simultaneously
or to block a process until a specified event occurs (in lieu of the
sleep/wakeup functions of traditional UNIX operating systems).
The value of a semaphore is initialized at system initialization
time using the initsema function. initsema sets the
initial integer value of the semaphore, while the valuesema
function reads the current value of a semaphore. valuesema
neither locks nor alters the state of the semaphore.
The value of a semaphore is decremented with the psema
or cpsema (conditional psema) functions. The difference
between these two functions is that, if the resource is not available,
psema causes the process to block and wait until the resource
is available, whereas cpsema returns immediately without gaining
access to the resource, so that the caller can try again at a later
time. The decsema function is used by the operating system to unconditionally
decrement the value of the semaphore counters.
The value of a semaphore is incremented with the vsema
or cvsema (conditional vsema) functions. The difference
between these two functions is that vsema increments the value
of the semaphore unconditionally (thus unblocking a process that may
be blocked waiting for it), whereas cvsema increments the value
of the semaphore only if a process is blocked on that semaphore. The
incsema function is used by the operating system to unconditionally
increment the value of the semaphore counters.
The value to which a specific semaphore is initialized
determines its use. Semaphores used only to block processes, such
as while waiting for an I/O operation to complete, are initialized
to 0, so that the first process to issue a psema will block.
Semaphores used to control access to a kernel resource are initialized
to the number of resources available (for instance, the number of
buffers in the pool), so that processes do not block unless the resource
is exhausted.
Whatever the initial value of a semaphore, its effect
is determined by its value at any given time. Figure 5 summarizes
the meaning of kernel semaphore values.
Value of Kernel (Suspend Lock) Semaphore
| <0 | 0 | >0 |
| One or more processes are blocked awaiting access to this semaphore (and the resource it controls). The absolute value of the semaphore is the number of processes that are blocked. | The semaphore (and the resource it controls) may be in use, but no processes are blocked on it. | The resource controlled by the semaphore is available. The value of the semaphore indicates the number of resources available. |
| A process that issues a psema on the semaphore will block; a process that issues a cpsema on the semaphore will return without accessing the resource. | A process that issues a psema on the semaphore will block; a process that issues a cpsema on the semaphore will return without accessing the resource. | The value of the semaphore indicates the number of processes that can access the resource without blocking. |
Figure 5 - Values of Kernel Semaphores
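The value rules summarized above can be exercised with a single-threaded model of the semaphore counter. The model_ functions below are hypothetical stand-ins that only track the counter and report whether the caller would block or a waiter would be released; they are not the kernel's psema, cpsema, vsema, or cvsema, and they perform no actual blocking.

```c
#include <stdbool.h>

struct ksem_model { int value; };

void model_initsema(struct ksem_model *s, int initial) { s->value = initial; }
int  model_valuesema(const struct ksem_model *s)       { return s->value; }

/* psema: always decrement. Returns true if the caller may proceed,
 * false if the caller would block (the value has gone negative). */
bool model_psema(struct ksem_model *s)
{
    s->value--;
    return s->value >= 0;
}

/* cpsema: decrement only if the resource is available; never blocks. */
bool model_cpsema(struct ksem_model *s)
{
    if (s->value <= 0)
        return false;
    s->value--;
    return true;
}

/* vsema: always increment. Returns true if a blocked process is released. */
bool model_vsema(struct ksem_model *s)
{
    bool releases_waiter = s->value < 0;
    s->value++;
    return releases_waiter;
}

/* cvsema: increment only if a process is blocked on the semaphore. */
bool model_cvsema(struct ksem_model *s)
{
    if (s->value >= 0)
        return false;
    s->value++;
    return true;
}
```

Starting from a value of 1, a psema succeeds and leaves 0, a cpsema then fails without changing the value, and a second psema drives the value to -1, which records one blocked process.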
Spin Locks
Spin locks are used when the lock time is very small
(typically less than or equal to the time of two context switches).
The locks are initialized to zero by initlock when the system is booted.
When a process wants to gain exclusive use of a protected resource, it
executes an spsema operation, which tests and sets the lock.
If other processes try to use the resource, they must wait until
the process owning the lock issues an svsema operation to unlock
the resource.
For uniprocessors, spin locks are equivalent to disable/enable
interrupt instructions. On multiprocessors, spin loops are built on
special hardware instructions such as "test & set (TAS)
a bit in a memory location."
To better understand how a REAL/IX spin lock works with a TAS
spin loop, consider the example illustrated in Figure 6. Processes
MP1 and MP2 share a common interest in a piece of code. MP1 enters
the region and sets a spin lock on the common code area. When MP2
attempts to enter the same region, it finds the lock set, and so loops
on the lock operation until MP1 releases the protected region with
the unlock operation.
The operating system also makes use of a special semaphore
function, cspsema. If the semaphore is already spin locked when
cspsema executes, the operating system does not wait until
the semaphore is available, but returns immediately so that the
process can continue executing.
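The test-and-set behavior described above can be sketched with the C11 atomic_flag type. This is a user-space illustration under stated assumptions, not the REAL/IX implementation: the model_* names are hypothetical, and the mapping to initlock, spsema, svsema, and cspsema in the comments is only an analogy.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock built on the C11 test-and-set primitive. */
typedef struct {
    atomic_flag locked;
} model_spinlock;

/* Analogous to initlock: the lock starts out clear (zero). */
void model_initlock(model_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* Analogous to spsema: loop on test-and-set until the flag was clear. */
void model_spin_lock(model_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ; /* busy-wait: another holder has the lock */
}

/* Analogous to cspsema: one test-and-set attempt, never waits.
 * Returns true if the lock was acquired. */
bool model_spin_trylock(model_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Analogous to svsema: release the lock. */
void model_spin_unlock(model_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

While one caller holds the lock, model_spin_trylock returns false at once instead of looping, which is the behavior the text attributes to the conditional form.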
As illustrated in Figure 7, the spin lock process in
the semaphore processor differs in mechanism from the TAS spin loop
process, but is functionally equivalent. When the semaphore processor receives
a semaphore function, it tests the semaphore to ensure that the requested
semaphore is available for use. MP1 and MP2 share a common interest
in a piece of code. MP1 enters the region and the semaphore processor
sets a spin lock. The spin lock is actually a single read request
to semaphore memory. When MP2 attempts to enter the same region, the
semaphore processor indicates that the lock is set, and MP2
suspends execution (stalls) and remains stalled until the semaphore
is released by the unlock operation.
The advantage of the SSP method over the TAS spin loop
method is that a LOCK signal is not required in order to suspend CPU
execution. For more detailed information about the semaphore processor,
and its spin lock process, refer to the System Level Concepts Technical
Manual.
Figure 6 - REAL/IX Spin Locks, TAS Spin Loops
Figure 7 - REAL/IX Spin Locks, Semaphore Processor
Kernel Daemons
To meet realtime constraints, some of the traditional
interrupt handler functions have been moved to high-priority processes
(daemons) that are triggered from an interrupt level with a vsema
or cvsema operation. These daemons include:
- onesec - maintains free page counts, calculates new process
priorities for time-slice processes, and unblocks other daemons
- vhand - processes paging operations
- bdflush - flushes I/O buffers that have been around too
long
- hitimed - handles delays and timeouts for high priority
processes
- lotimed - handles delays and timeouts for low priority
processes
- ttyd - handles interrupts for the line discipline
- prfd - handles kernel printf functions
- pgrpsigd - handles process group signal delivery functions
- idle - ensures that the system always has a process to
execute, thereby maintaining scheduler algorithms
- streamsd - runs in support of the STREAMS toolkit
The executing priorities of these daemons are set with
tunable parameters, enabling users to choose which daemons run at
priorities higher than critical realtime processes.
Note that the idle daemon is fixed at the lowest
running priority of 255. There is a special case priority of 254 that
is reserved as an idle task priority for those customers who want
to execute a custom idle task that may run some background process
instead of the standard REAL/IX system idle daemon.
In addition to these kernel daemons, the TCP/IP networking
software uses a set of user-level daemons; these are discussed in
Chapter 6 under Networking Daemons.
Timers
The operating system uses timers to schedule "housekeeping"
operations that must be run periodically and to provide a mechanism
through which realtime processes can schedule events.
System Clock
The system clock is the primary source for system synchronization.
A clock interrupt is generated on every clock tick. The clock interrupt
handler is activated by this interrupt, and maintains the user and
system times and the current date. It also provides a triggering mechanism
for process interval timers, profiling, and driver timeout functions.
By default, the system clock will tick at the rate specified
by the constant HZ in the file sys/param.h, typically
60 times a second. It is possible to increase the frequency and thereby
gain resolution at the expense of additional interrupt processing
time.
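The trade-off between tick rate and resolution is simple arithmetic. The sketch below assumes the typical value of 60 quoted above; MODEL_HZ is a hypothetical stand-in for the HZ constant in sys/param.h, and the helper functions are illustrative only.

```c
/* Hypothetical stand-in for HZ from sys/param.h (typical value). */
#define MODEL_HZ 60LL

/* Length of one clock tick in microseconds (integer division truncates). */
long long tick_usec(long long hz)
{
    return 1000000LL / hz;
}

/* Number of whole ticks needed to cover at least `usec` microseconds. */
long long usec_to_ticks(long long usec, long long hz)
{
    return (usec * hz + 999999LL) / 1000000LL;
}
```

At 60 ticks per second, tick_usec gives 16666 microseconds per tick, so a 20-millisecond delay needs two ticks. Quadrupling the rate to 240 shortens the tick to 4166 microseconds at the cost of four times as many clock interrupts.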
Each CPU in a multiprocessor system has a clock. When
the REAL/IX Operating System is booted, only one of the CPUs is designated
to maintain system time and the current date, to process driver time-outs,
and to trigger process interval timer expirations. The other
CPUs maintain CPU-dependent information that requires updating at
each clock tick.
The REAL/IX clock interrupt handler differs from those
on most other UNIX operating systems in that its interrupt blocking
time is bounded: it never blocks interrupts for longer than 100 microseconds.
This has been achieved by moving some functionality to kernel daemons.
Timing Functions
The REAL/IX Operating System supports all timing functionality
supported by the UNIX System V operating system. These include the
following system calls that consider time as the number of seconds
since 00:00:00 GMT, January 1, 1970:
- time(2) returns the value of time in seconds
- stime(2) sets the system time measured in seconds
- alarm(2) sends SIGALRM to the calling process after
a specified number of seconds have elapsed
- pause(2) suspends the process until a signal is received
- sleep(2) library routine that suspends execution of a
process for a specified number of seconds or until a signal is
received
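A minimal sketch of how the System V calls listed above fit together, using the standard POSIX forms of time, alarm, and pause. The function name wait_with_alarm and the handler are hypothetical, and sigaction is used to install the handler as an assumption about the available interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_fired = 0;

/* SIGALRM handler: only records that the alarm went off. */
static void on_alarm(int signo)
{
    (void)signo;
    alarm_fired = 1;
}

/* Wait `seconds` using alarm and pause. Returns the elapsed time in
 * whole seconds as measured by time, or -1 on failure. */
long wait_with_alarm(unsigned int seconds)
{
    struct sigaction sa;
    time_t start;

    if (seconds == 0)
        return 0;            /* alarm(0) would cancel, not schedule */

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    start = time(NULL);
    alarm_fired = 0;
    alarm(seconds);          /* request SIGALRM after `seconds` */

    /* pause returns after any handled signal, so re-check the flag.
     * The gap between the check and pause is far shorter than one
     * second; code that must close it entirely can use sigsuspend. */
    while (!alarm_fired)
        pause();

    return (long)(time(NULL) - start);
}
```

Because time counts whole seconds, the returned value is only a coarse measure of the delay.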
In addition, the REAL/IX system offers some timing facilities
that originated with the Berkeley variants of the UNIX system.
- setitimer(2) sends SIGALRM to the calling process after
a specified number of seconds and microseconds, optionally allows
the SIGALRM to be repeated at fixed intervals, and may be used
to cancel a running timer
- getitimer(2) returns the length of time remaining before
a SIGALRM due to a previous setitimer is delivered
- adjtime(2) makes small changes in the system clock to
allow synchronization with other time sources
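The repeat and cancel behavior of setitimer can be illustrated with the standard interface. This is a sketch assuming the usual struct itimerval layout and the ITIMER_REAL timer; run_periodic_itimer and the handler are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stddef.h>
#include <sys/time.h>

static volatile sig_atomic_t itimer_expirations = 0;

/* SIGALRM handler: counts timer expirations. */
static void on_itimer(int signo)
{
    (void)signo;
    itimer_expirations++;
}

/* Arm a periodic timer, wait for `count` expirations, then cancel it.
 * Returns 0 on success, -1 on failure. */
int run_periodic_itimer(long interval_usec, int count)
{
    struct sigaction sa;
    struct itimerval tv;
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    sigset_t block, old, wait_mask;

    if (interval_usec <= 0 || count <= 0)
        return -1;

    sa.sa_handler = on_itimer;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    /* Keep SIGALRM blocked except while waiting in sigsuspend. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGALRM);

    /* First expiration and repeat interval are the same length. */
    tv.it_value.tv_sec = interval_usec / 1000000L;
    tv.it_value.tv_usec = interval_usec % 1000000L;
    tv.it_interval = tv.it_value;
    itimer_expirations = 0;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    while (itimer_expirations < count)
        sigsuspend(&wait_mask);

    setitimer(ITIMER_REAL, &off, NULL);   /* a zero value cancels */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return 0;
}
```

Setting it_interval to zero instead would make the timer fire once only.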
Additional timing facilities based on those in the POSIX
1003.4 (1003.1b) documents include:
- gettimer(2) returns the value of time in seconds and
nanoseconds
- settimer(2) sets the system clock to a value in seconds
and nanoseconds
- nanosleep(2) suspends execution of a process for a
specified number of seconds and nanoseconds
- nanosleep_getres(2) returns details of the resolution
of the nanosleep function
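As an illustration of the seconds-plus-nanoseconds interface, the sketch below uses nanosleep in the form standardized by POSIX. It assumes that the REAL/IX call takes the same arguments, which the text does not confirm. The helpers sleep_ns and timespec_diff_ns are hypothetical. The gettimer, settimer, and nanosleep_getres calls are not shown because their signatures are not given here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep for the full interval, resuming if a signal interrupts the
 * sleep. Returns 0 on success, -1 on any other error. */
int sleep_ns(time_t sec, long nsec)
{
    struct timespec req = { .tv_sec = sec, .tv_nsec = nsec };
    struct timespec rem;

    while (nanosleep(&req, &rem) != 0) {
        if (errno != EINTR)
            return -1;       /* for example, nsec out of range */
        req = rem;           /* continue with the unslept remainder */
    }
    return 0;
}

/* Nanoseconds elapsed from *a to *b. */
long long timespec_diff_ns(const struct timespec *a,
                           const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
           + (b->tv_nsec - a->tv_nsec);
}
```

A call such as sleep_ns(0, 10000000L) requests a 10-millisecond delay; the actual delay is at least that long, rounded up to the resolution of the underlying clock.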
The cron(1M) facilities build on these timer
services to allow users to schedule processes by way of the following commands:
- at(1), to execute at some specified time.
- batch(1), to execute when system load levels permit.
- crontab(1), to execute periodically.
Process Interval Timers
Process interval timers are designed for use by one
or more realtime processes to schedule system events within a very
fine time scale, from a few seconds down to 1/1920 of a second. The interval
timers are set by realtime processes to expire based on a time value
that is relative to the current system time, or a time value that
represents an absolute time in the future. They are set and used as
"one-shot" or periodic timers.
There is flexibility in specifying the action taken
when one of the timers expires. It is possible to cause an asynchronous
signal in the traditional manner of SIGALRM (although the timers
are not restricted to delivering this one signal). It is also possible
for an event to be delivered to a waiting synchronous process. Refer
to Common Event Notification in
Chapter 4 for more information about common events.
A list of free process interval timers is defined at
system generation (sysgen) time. Realtime processes can allocate
interval timers from this list during process initialization
or during normal execution. By using sysgen parameters, the timer
mechanism can be customized for a particular application, which may
require a different number of timers to be available for allocation.
The REAL/IX Operating System offers process interval
timers based on the POSIX 1003.1b document. A process may use as
many of these timers as it wishes, subject only to configuration limits.
The system calls used are:
- gettimerid(2) gets a process interval timer identifier
for subsequent use
- reltimerid(2) releases a process interval timer identifier
- incinterval(2) sets a process interval timer running to
expire relative to the current time; also used to cancel a timer
- absinterval(2) sets a process interval timer running with
an absolute expiration time specified; also used to cancel a timer
- resinc(2) returns the resolution details of the incinterval
function
- resabs(2) returns the resolution details of the absinterval
function
To use a process interval timer, a realtime process
does the following:
1. Use evget(2) to obtain an event identifier to be used
when a timer expires. Parameters to evget allow flexibility
in the method of notification.
2. Issue a gettimerid(2) system call, quoting the previously
obtained event identifier, to obtain access to a process interval
timer. gettimerid gets a unique timer identifier from the
free pool of process interval timers.
3. Set a timer expiration value and activate the process interval
timer.
   - To set the value to an absolute time, use the absinterval(2)
     system call.
   - To set the value to a time relative to the current system time,
     use the incinterval(2) system call.
   - Both incinterval(2) and absinterval(2) allow for
     an optional periodic repeat.
4. When the timer expires, the appropriate notification method
is used. The process takes whatever action is necessary.
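The difference between relative and absolute expiration, and between one-shot and periodic timers, can be modeled without any system calls. The sketch below is a hypothetical user-space model counted in ticks of 1/1920 second. It does not use the REAL/IX calls, whose signatures are not given here, and it assumes that a zero value cancels a timer.

```c
#include <stdbool.h>

#define MODEL_TICKS_PER_SEC 1920L   /* finest resolution quoted above */

/* Hypothetical model of one process interval timer. */
typedef struct {
    bool armed;
    long expires;   /* absolute tick at which the timer next fires */
    long period;    /* 0 for one-shot, otherwise reload interval */
    long fired;     /* expirations delivered so far */
} model_itimer;

/* Relative form (compare incinterval): fire `delta` ticks after `now`.
 * Assumption: a delta of 0 cancels the timer. */
void model_set_relative(model_itimer *t, long now, long delta, long period)
{
    t->armed = delta > 0;
    t->expires = now + delta;
    t->period = period;
}

/* Absolute form (compare absinterval): fire at tick `when`.
 * Assumption: a value of 0 cancels the timer. */
void model_set_absolute(model_itimer *t, long when, long period)
{
    t->armed = when > 0;
    t->expires = when;
    t->period = period;
}

/* Called once per tick; returns true if the timer expires on `now`. */
bool model_tick(model_itimer *t, long now)
{
    if (!t->armed || now < t->expires)
        return false;
    t->fired++;
    if (t->period > 0)
        t->expires += t->period;   /* periodic: schedule next expiry */
    else
        t->armed = false;          /* one-shot: disarm */
    return true;
}
```

In this model, setting a relative timer of MODEL_TICKS_PER_SEC / 2 ticks at tick 100 makes it fire at tick 1060, half a second later.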
Implementation Details
A common mechanism is used for all time services that
are driven from the system clock. These services include:
- process interval timers (refer to gettimerid(2)).
- Berkeley style interval timers (refer to setitimer(2)).
- high resolution sleeps (refer to nanosleep(2)).
- internal kernel services such as delay(D3X) and timeout(D3X).
In all cases, a control block describing the operation
is placed on an appropriate queue. At each clock interrupt, the interrupt
handler examines all queues that may contain a block that requires
servicing. If the clock interrupt handler determines that further
processing is necessary, it wakes up a system daemon to do the work.
This design limits the amount of processing required in the interrupt
handler.
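The division of labor between the interrupt handler and the daemon can be sketched as follows. This is a hypothetical model, not the REAL/IX data structures: it assumes a single queue kept sorted by expiration time, so that the interrupt-level check looks only at the head, and it represents "waking the daemon" with a flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical control block describing one timed operation. */
typedef struct model_tblock {
    long expires;                 /* tick at which service is due */
    struct model_tblock *next;
    bool done;                    /* set once the daemon has serviced it */
} model_tblock;

typedef struct {
    model_tblock *head;           /* sorted by ascending expires */
    bool daemon_wanted;           /* stands in for waking the daemon */
} model_tqueue;

/* Insert a block, keeping the queue sorted by expiration time. */
void model_enqueue(model_tqueue *q, model_tblock *b)
{
    model_tblock **pp = &q->head;
    while (*pp != NULL && (*pp)->expires <= b->expires)
        pp = &(*pp)->next;
    b->next = *pp;
    b->done = false;
    *pp = b;
}

/* Interrupt-level work: a constant-time look at the queue head. */
void model_clock_intr(model_tqueue *q, long now)
{
    if (q->head != NULL && q->head->expires <= now)
        q->daemon_wanted = true;
}

/* Daemon-level work: service every expired block; returns the count. */
int model_timer_daemon(model_tqueue *q, long now)
{
    int serviced = 0;
    while (q->head != NULL && q->head->expires <= now) {
        model_tblock *b = q->head;
        q->head = b->next;
        b->next = NULL;
        b->done = true;
        serviced++;
    }
    q->daemon_wanted = false;
    return serviced;
}
```

However many blocks are due, the interrupt-level function does a fixed amount of work; the loop over expired blocks runs only in the daemon.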
The clock interrupt handler also provides statistics
and profiling information. These operations are performed at a fixed
rate, so if the system clock is configured to deliver four times as
many interrupts as normal, statistics will only be collected on every
fourth tick. In a multiprocessor system, each clock tick will result
in an interrupt to a single CPU. On every tick where statistics gathering
is required, the clock interrupt handler will cause interrupt handling
code on the other CPUs to run. These "slave" clock interrupt
handlers will perform statistics gathering for the processes running
on their respective CPUs.
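The fixed statistics rate amounts to a tick divider. The following sketch is illustrative only; the structure and function names are hypothetical, and the rates 60 and 240 are example values.

```c
#include <stdbool.h>

/* Hypothetical divider that keeps statistics gathering at the normal
 * rate when the clock is configured to tick faster. */
typedef struct {
    int divider;   /* configured clock rate / normal clock rate */
    int count;     /* ticks seen since the last statistics sample */
} model_stats_clock;

void model_stats_init(model_stats_clock *c, int configured_hz, int normal_hz)
{
    c->divider = configured_hz / normal_hz;
    if (c->divider < 1)
        c->divider = 1;          /* never sample less than every tick */
    c->count = 0;
}

/* Call on every clock tick; returns true when statistics are due. */
bool model_stats_tick(model_stats_clock *c)
{
    if (++c->count < c->divider)
        return false;
    c->count = 0;
    return true;
}
```

With a configured rate of 240 and a normal rate of 60, the divider is 4, so one second of ticks still yields 60 statistics samples.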
There are two timer daemons, one each for high priority
and low priority processes. The dividing point between high and low
priority processes is a tunable parameter, as are the priorities of
the two daemons. Normally the high priority timeout daemon should
run at a priority greater than any other process that may require
timer services. To prevent this high priority daemon from having
a potentially large amount of work to do on behalf of low priority
processes, such work is relegated to the other timeout daemon, which
has a comparatively low priority.