diff --git a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml b/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
deleted file mode 100644
index c6c78e0feb..0000000000
--- a/en_US.ISO8859-1/books/arch-handbook/smp/chapter.sgml
+++ /dev/null
@@ -1,957 +0,0 @@
-<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
-<!ENTITY % man PUBLIC "-//FreeBSD//ENTITIES DocBook Manual Page Entities//EN">
-%man;
-
-<!ENTITY % authors PUBLIC "-//FreeBSD//ENTITIES DocBook Author Entities//EN">
-%authors;
-<!ENTITY % misc PUBLIC "-//FreeBSD//ENTITIES DocBook Miscellaneous FreeBSD Entities//EN">
-%misc;
-
-<!--ENTITY % mailing-lists PUBLIC "-//FreeBSD//ENTITIES DocBook Mailing List Entities//EN"-->
-<!--
-%mailing-lists;
--->
-
-]>
-
-<article>
- <articleinfo>
- <title>SMPng Design Document</title>
-
- <authorgroup>
- <author>
- <firstname>John</firstname>
- <surname>Baldwin</surname>
- </author>
- <author>
- <firstname>Robert</firstname>
- <surname>Watson</surname>
- </author>
- </authorgroup>
-
- <pubdate>$FreeBSD$</pubdate>
-
- <copyright>
- <year>2002</year>
- <year>2003</year>
- <holder>John Baldwin</holder>
- <holder>Robert Watson</holder>
- </copyright>
-
- <abstract>
- <para>This document presents the current design and implementation of
- the SMPng Architecture. First, the basic primitives and tools are
- introduced. Next, a general architecture for the FreeBSD kernel's
- synchronization and execution model is laid out. Then, locking
- strategies for specific subsystems are discussed, documenting the
- approaches taken to introduce fine-grained synchronization and
- parallelism for each subsystem. Finally, detailed implementation
- notes are provided to motivate design choices, and make the reader
- aware of important implications involving the use of specific
- primitives. </para>
- </abstract>
- </articleinfo>
-
- <sect1>
- <title>Introduction</title>
-
- <para>This document is a work-in-progress, and will be updated to
- reflect on-going design and implementation activities associated
- with the SMPng Project. Many sections currently exist only in
- outline form, but will be fleshed out as work proceeds. Updates or
- suggestions regarding the document may be directed to the document
- editors.</para>
-
- <para>The goal of SMPng is to allow concurrency in the kernel.
- The kernel is basically one rather large and complex program. To
- make the kernel multi-threaded we use some of the same tools used
- to make other programs multi-threaded. These include mutexes,
- shared/exclusive locks, semaphores, and condition variables. For
- the definitions of these and other SMP-related terms, please see
- the <xref linkend="glossary"> section of this article.</para>
- </sect1>
-
- <sect1>
- <title>Basic Tools and Locking Fundamentals</title>
-
- <sect2>
- <title>Atomic Instructions and Memory Barriers</title>
-
- <para>There are several existing treatments of memory barriers
- and atomic instructions, so this section will not include a
- lot of detail. To put it simply, one cannot safely read a
- variable without a lock if a lock is used to protect writes
- to that variable. This becomes obvious when you consider that
- memory barriers simply determine relative order of memory
- operations; they do not make any guarantee about timing of
- memory operations. That is, a memory barrier does not force
- the contents of a CPU's local cache or store buffer to flush.
- Instead, the memory barrier at lock release simply ensures
- that all writes to the protected data will be visible to other
- CPUs or devices if the write to release the lock is visible.
- The CPU is free to keep that data in its cache or store buffer
- as long as it wants. However, if another CPU performs an
- atomic instruction on the same datum, the first CPU must
- guarantee that the updated value is made visible to the second
- CPU along with any other operations that memory barriers may
- require.</para>
-
- <para>For example, assuming a simple model where data is
- considered visible when it is in main memory (or a global
- cache), when an atomic instruction is triggered on one CPU,
- other CPUs' store buffers and caches must flush any writes to
- that same cache line along with any pending operations behind
- a memory barrier.</para>
-
- <para>This requires one to take special care when using an item
- protected by atomic instructions. For example, in the sleep
- mutex implementation, we have to use an
- <function>atomic_cmpset</function> rather than an
- <function>atomic_set</function> to turn on the
- <constant>MTX_CONTESTED</constant> bit. The reason is that we
- read the value of <structfield>mtx_lock</structfield> into a
- variable and then make a decision based on that read.
- However, the value we read may be stale, or it may change
- while we are making our decision. Thus, when the
- <function>atomic_set</function> executes, it may end up
- setting the bit on a value other than the one we based the
- decision on. Therefore, we have to use an
- <function>atomic_cmpset</function> to set the value only if
- the value we made the decision on is still up-to-date and
- valid.</para>
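-
- <para>The retry pattern described above can be sketched in a few
- lines of C. The fragment below is illustrative only: it uses
- standard C11 atomics rather than the kernel's own atomic
- operations, and the flag value and function name are assumptions
- made for the example.</para>
-
- <programlisting><![CDATA[
-#include <stdatomic.h>
-#include <stdint.h>
-
-#define MTX_CONTESTED   0x02            /* illustrative flag value */
-
-static _Atomic uintptr_t mtx_lock;
-
-static void
-mark_contested(void)
-{
-        uintptr_t v;
-
-        for (;;) {
-                v = atomic_load(&mtx_lock);     /* snapshot; may go stale */
-                /* Succeed only if mtx_lock still equals our snapshot. */
-                if (atomic_compare_exchange_strong(&mtx_lock, &v,
-                    v | MTX_CONTESTED))
-                        break;
-                /* The value changed underneath us; decide again. */
-        }
-}
-]]></programlisting>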
-
- <para>Finally, atomic instructions only allow one item to be
- updated or read. If one needs to atomically update several
- items, then a lock must be used instead. For example, if two
- counters must be read and have values that are consistent
- relative to each other, then those counters must be protected
- by a lock rather than by separate atomic instructions.</para>
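-
- <para>The sketch below illustrates this with a pair of
- hypothetical statistics counters protected by a single mutex; the
- mutex is assumed to have been initialized elsewhere with
- <function>mtx_init</function>.</para>
-
- <programlisting><![CDATA[
-#include <sys/param.h>
-#include <sys/lock.h>
-#include <sys/mutex.h>
-
-static struct mtx stats_mtx;            /* initialized with mtx_init() */
-static u_long packets_in, bytes_in;
-
-static void
-stats_update(u_long nbytes)
-{
-        mtx_lock(&stats_mtx);
-        packets_in++;                   /* updated together under the lock */
-        bytes_in += nbytes;
-        mtx_unlock(&stats_mtx);
-}
-
-static void
-stats_snapshot(u_long *packets, u_long *bytes)
-{
-        mtx_lock(&stats_mtx);
-        *packets = packets_in;          /* a consistent pair of values */
-        *bytes = bytes_in;
-        mtx_unlock(&stats_mtx);
-}
-]]></programlisting>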
- </sect2>
-
- <sect2>
- <title>Read Locks versus Write Locks</title>
-
- <para>Read locks do not need to be as strong as write locks.
- Both types of locks need to ensure that the data they are
- accessing is not stale. However, only write access requires
- exclusive access. Multiple threads can safely read a value.
- Using different types of locks for reads and writes can be
- implemented in a number of ways.</para>
-
- <para>First, sx locks can be used in this manner by using an
- exclusive lock when writing and a shared lock when reading.
- This method is quite straightforward.</para>
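-
- <para>A minimal sketch of this first method using the sx lock
- interface follows; the protected variable is hypothetical.</para>
-
- <programlisting><![CDATA[
-#include <sys/param.h>
-#include <sys/lock.h>
-#include <sys/sx.h>
-
-static struct sx foo_lock;
-static int foo_value;
-
-static void
-foo_init(void)
-{
-        sx_init(&foo_lock, "foo value lock");
-}
-
-static int
-foo_read(void)
-{
-        int v;
-
-        sx_slock(&foo_lock);            /* shared: many readers at once */
-        v = foo_value;
-        sx_sunlock(&foo_lock);
-        return (v);
-}
-
-static void
-foo_write(int v)
-{
-        sx_xlock(&foo_lock);            /* exclusive: a single writer */
-        foo_value = v;
-        sx_xunlock(&foo_lock);
-}
-]]></programlisting>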
-
- <para>A second method is a bit more obscure. You can protect a
- datum with multiple locks. Then for reading that data you
- simply need to have a read lock of one of the locks. However,
- to write to the data, you need to have a write lock of all of
- the locks. This can make writing rather expensive but can be
- useful when data is accessed in various ways. For example,
- the parent process pointer is protected by both the
- <varname>proctree_lock</varname> sx lock and the per-process
- mutex. Sometimes the proc lock is easier to use, since we are
- just checking the parent of a process that we already have
- locked. However, other code, such as
- <function>inferior</function>, needs to walk the tree of
- processes via parent pointers; locking each process along the
- way would be prohibitively expensive, and it would be hard to
- guarantee that the condition being checked remains valid for
- both the check and the actions taken as a result of the
- check.</para>
- </sect2>
-
- <sect2>
- <title>Locking Conditions and Results</title>
-
- <para>If you need a lock to check the state of a variable so
- that you can take an action based on the state you read, you
- cannot just hold the lock while reading the variable and then
- drop the lock before you act on the value you read. Once you
- drop the lock, the variable can change, rendering your decision
- invalid. Thus, you must hold the lock both while reading the
- variable and while performing the action as a result of the
- test.</para>
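-
- <para>The sketch below contrasts the two approaches; all names are
- invented for the example, and the started routine is assumed not
- to sleep while the mutex is held.</para>
-
- <programlisting><![CDATA[
-#define FOO_IDLE        0
-#define FOO_BUSY        1
-
-struct foo {
-        struct mtx      foo_mtx;
-        int             foo_state;      /* FOO_IDLE or FOO_BUSY */
-};
-
-void foo_start(struct foo *foo);        /* hypothetical action */
-
-/* Correct: the state cannot change between the test and the action. */
-static void
-foo_maybe_start(struct foo *foo)
-{
-        mtx_lock(&foo->foo_mtx);
-        if (foo->foo_state == FOO_IDLE) {
-                foo->foo_state = FOO_BUSY;
-                foo_start(foo);
-        }
-        mtx_unlock(&foo->foo_mtx);
-}
-
-/*
- * Incorrect: another thread may change foo_state in the window
- * between dropping the lock and acting on the stale result.
- */
-static void
-foo_maybe_start_racy(struct foo *foo)
-{
-        int idle;
-
-        mtx_lock(&foo->foo_mtx);
-        idle = (foo->foo_state == FOO_IDLE);
-        mtx_unlock(&foo->foo_mtx);
-        if (idle)
-                foo_start(foo);         /* foo may no longer be idle */
-}
-]]></programlisting>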
- </sect2>
- </sect1>
-
- <sect1>
- <title>General Architecture and Design</title>
-
- <sect2>
- <title>Interrupt Handling</title>
-
- <para>Following the pattern of several other multi-threaded Unix
- kernels, FreeBSD deals with interrupt handlers by giving them
- their own thread context. Providing a context for interrupt
- handlers allows them to block on locks. To help avoid
- latency, however, interrupt threads run at real-time kernel
- priority. Thus, interrupt handlers should not execute for very
- long to avoid starving other kernel threads. In addition,
- since multiple handlers may share an interrupt thread,
- interrupt handlers should not sleep or use a sleepable lock to
- avoid starving another interrupt handler.</para>
-
- <para>The interrupt threads currently in FreeBSD are referred to
- as heavyweight interrupt threads. They are called this
- because switching to an interrupt thread involves a full
- context switch. In the initial implementation, the kernel was
- not preemptive and thus interrupts that interrupted a kernel
- thread would have to wait until the kernel thread blocked or
- returned to userland before they would have an opportunity to
- run.</para>
-
- <para>To deal with the latency problems, the kernel in FreeBSD
- has been made preemptive. Currently, we only preempt a kernel
- thread when we release a sleep mutex or when an interrupt
- comes in. However, the plan is to make the FreeBSD kernel
- fully preemptive as described below.</para>
-
- <para>Not all interrupt handlers execute in a thread context.
- Instead, some handlers execute directly in primary interrupt
- context. These interrupt handlers are currently misnamed
- <quote>fast</quote> interrupt handlers since the
- <constant>INTR_FAST</constant> flag used in earlier versions
- of the kernel is used to mark these handlers. The only
- interrupts which currently use these types of interrupt
- handlers are clock interrupts and serial I/O device
- interrupts. Since these handlers do not have their own
- context, they may not acquire blocking locks and thus may only
- use spin mutexes.</para>
-
- <para>Finally, there is one optional optimization that can be
- added in MD code called lightweight context switches. Since
- an interrupt thread executes in a kernel context, it can
- borrow the vmspace of any process. Thus, in a lightweight
- context switch, the switch to the interrupt thread does not
- switch vmspaces but borrows the vmspace of the interrupted
- thread. In order to ensure that the vmspace of the
- interrupted thread does not disappear out from under us, the
- interrupted thread is not allowed to execute until the
- interrupt thread is no longer borrowing its vmspace. This can
- happen when the interrupt thread either blocks or finishes.
- If an interrupt thread blocks, then it will use its own
- context when it is made runnable again. Thus, it can release
- the interrupted thread.</para>
-
- <para>The drawback of this optimization is that it is very
- machine specific and complex, and thus only worth the effort if
- there is a large performance improvement. At this point it is
- probably too early to tell, and in fact it will probably hurt
- performance, as almost all interrupt handlers will immediately
- block on Giant and require a thread fix-up when they block.
- Also, an alternative method of interrupt handling has been
- proposed by Mike Smith that works like so:</para>
-
- <orderedlist>
- <listitem>
- <para>Each interrupt handler has two parts: a predicate
- which runs in primary interrupt context and a handler
- which runs in its own thread context.</para>
- </listitem>
-
- <listitem>
- <para>If an interrupt handler has a predicate, then when an
- interrupt is triggered, the predicate is run. If the
- predicate returns true then the interrupt is assumed to be
- fully handled and the kernel returns from the interrupt.
- If the predicate returns false or there is no predicate,
- then the threaded handler is scheduled to run.</para>
- </listitem>
- </orderedlist>
-
- <para>Fitting lightweight context switches into this scheme
- might prove rather complicated. Since we may want to change
- to this scheme at some point in the future, it is probably
- best to defer work on lightweight context switches until we
- have settled on the final interrupt handling architecture and
- determined how lightweight context switches might or might
- not fit into it.</para>
- </sect2>
-
- <sect2>
- <title>Kernel Preemption and Critical Sections</title>
-
- <sect3>
- <title>Kernel Preemption in a Nutshell</title>
-
- <para>Kernel preemption is fairly simple. The basic idea is
- that a CPU should always be doing the highest priority work
- available. Well, that is the ideal at least. There are a
- couple of cases where the expense of achieving the ideal is
- not worth being perfect.</para>
-
- <para>Implementing full kernel preemption is very
- straightforward: when you schedule a thread to be executed
- by putting it on a runqueue, you check to see if its
- priority is higher than that of the currently executing thread.
- If so, you initiate a context switch to that thread.</para>
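-
- <para>Conceptually the check looks like the sketch below. The
- helper functions are placeholders rather than the real scheduler
- interface; recall that in FreeBSD a lower numeric priority value
- means a higher priority.</para>
-
- <programlisting><![CDATA[
-/* Conceptual sketch only; not the actual scheduler code. */
-static void
-make_runnable(struct thread *td)
-{
-        runqueue_insert(td);                    /* placeholder helper */
-        /* A lower numeric td_priority is a higher priority. */
-        if (td->td_priority < curthread->td_priority)
-                switch_to(td);                  /* immediate preemption */
-}
-]]></programlisting>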
-
- <para>While locks can protect most data in the case of a
- preemption, not all of the kernel is preemption safe. For
- example, if a thread holding a spin mutex is preempted and the
- new thread attempts to grab the same spin mutex, the new
- thread may spin forever, as the interrupted thread may never
- get a chance to execute. Also, some code, such as the code
- to assign an address space number for a process during
- exec() on the Alpha, must not be preempted as it supports
- the actual context switch code. Preemption is disabled for
- these code sections by using a critical section.</para>
- </sect3>
-
- <sect3>
- <title>Critical Sections</title>
-
- <para>The responsibility of the critical section API is to
- prevent context switches inside of a critical section. With
- a fully preemptive kernel, every
- <function>setrunqueue</function> of a thread other than the
- current thread is a preemption point. One implementation is
- for <function>critical_enter</function> to set a per-thread
- flag that is cleared by its counterpart. If
- <function>setrunqueue</function> is called with this flag
- set, it does not preempt regardless of the priority of the new
- thread relative to the current thread. However, since
- critical sections are used in spin mutexes to prevent
- context switches and multiple spin mutexes can be acquired,
- the critical section API must support nesting. For this
- reason the current implementation uses a nesting count
- instead of a single per-thread flag.</para>
-
- <para>In order to minimize latency, preemptions inside of a
- critical section are deferred rather than dropped. If a
- thread is made runnable that would normally be preempted to
- outside of a critical section, then a per-thread flag is set
- to indicate that there is a pending preemption. When the
- outermost critical section is exited, the flag is checked.
- If the flag is set, then the current thread is preempted to
- allow the higher priority thread to run.</para>
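-
- <para>A sketch of the nesting count and deferred preemption flag
- follows. The field names <structfield>td_critnest</structfield>
- and <structfield>td_owepreempt</structfield> and the switch helper
- are assumptions for illustration and need not match the actual
- implementation.</para>
-
- <programlisting><![CDATA[
-void
-critical_enter(void)
-{
-        curthread->td_critnest++;               /* a count, not a flag */
-}
-
-void
-critical_exit(void)
-{
-        struct thread *td;
-
-        td = curthread;
-        if (--td->td_critnest == 0 && td->td_owepreempt) {
-                td->td_owepreempt = 0;
-                /* Perform the preemption that was deferred above. */
-                deferred_preempt();             /* placeholder for the switch */
-        }
-}
-]]></programlisting>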
-
- <para>Interrupts pose a problem with regards to spin mutexes.
- If a low-level interrupt handler needs a lock, it must not
- interrupt any code holding that lock, in order to avoid possible
- data structure corruption. Currently, this
- mechanism is piggybacked onto the critical section API by means
- of the <function>cpu_critical_enter</function> and
- <function>cpu_critical_exit</function> functions. Currently
- this API disables and re-enables interrupts on all of
- FreeBSD's current platforms. This approach may not be
- purely optimal, but it is simple to understand and simple to
- get right. Theoretically, this second API need only be used
- for spin mutexes that are used in primary interrupt context.
- However, to make the code simpler, it is used for all spin
- mutexes and even all critical sections. It may be desirable
- to split out the MD API from the MI API and only use it in
- conjunction with the MI API in the spin mutex
- implementation. If this approach is taken, then the MD API
- likely would need a rename to show that it is a separate API
- now.</para>
- </sect3>
-
- <sect3>
- <title>Design Tradeoffs</title>
-
- <para>As mentioned earlier, a couple of trade-offs have been
- made, sacrificing perfect preemption in cases where it may not
- provide the best performance.</para>
-
- <para>The first trade-off is that the preemption code does not
- take other CPUs into account. Suppose we have two CPUs, A
- and B, with the priority of A's thread as 4 and the priority
- of B's thread as 2. If CPU B makes a thread with priority 1
- runnable, then in theory, we want CPU A to switch to the new
- thread so that we will be running the two highest priority
- runnable threads. However, the cost of determining which
- CPU to preempt, signaling that CPU via an IPI, and performing
- the synchronization that would be required would be enormous.
- Thus, the current code
- would instead force CPU B to switch to the higher priority
- thread. Note that this still puts the system in a better
- position as CPU B is executing a thread of priority 1 rather
- than a thread of priority 2.</para>
-
- <para>The second trade-off limits immediate kernel preemption
- to real-time priority kernel threads. In the simple case of
- preemption defined above, a thread is always preempted
- immediately (or as soon as a critical section is exited) if
- a higher priority thread is made runnable. However, many
- threads executing in the kernel only execute in a kernel
- context for a short time before either blocking or returning
- to userland. Thus, if the kernel preempts these threads to
- run another non-realtime kernel thread, the kernel may
- switch out the executing thread just before it is about to
- sleep or return to userland. The cache on the CPU must then
- adjust to the new thread. When the kernel resumes the
- preempted thread, it must refill all the cache information
- that was lost.
- In addition, two extra context switches are performed that
- could be avoided if the kernel deferred the preemption until
- the first thread blocked or returned to userland. Thus, by
- default, the preemption code will only preempt immediately
- if the higher priority thread is a real-time priority
- thread.</para>
-
- <para>Turning on full kernel preemption for all kernel threads
- has value as a debugging aid since it exposes more race
- conditions. It is especially useful on UP systems where many
- races are hard to simulate otherwise. Thus, there will be a
- kernel option to enable preemption for all kernel threads
- that can be used for debugging purposes.</para>
- </sect3>
- </sect2>
-
- <sect2>
- <title>Thread Migration</title>
-
- <para>Simply put, a thread migrates when it moves from one CPU
- to another. In a non-preemptive kernel this can only happen
- at well-defined points such as when calling
- <function>tsleep</function> or returning to userland.
- However, in the preemptive kernel, an interrupt can force a
- preemption and possible migration at any time. This can have
- negative effects on per-CPU data since, with the exception of
- <varname>curthread</varname> and <varname>curpcb</varname>, the
- data can change whenever you migrate. Since you can
- potentially migrate at any time this renders per-CPU data
- rather useless. Thus it is desirable to be able to disable
- migration for sections of code that need per-CPU data to be
- stable.</para>
-
- <para>Critical sections currently prevent migration since they
- do not allow context switches. However, this may be too strong
- of a requirement to enforce in some cases since a critical
- section also effectively blocks interrupt threads on the
- current processor. As a result, it may be desirable to
- provide an API whereby code may indicate that if the current
- thread is preempted it should not migrate to another
- CPU.</para>
-
- <para>One possible implementation is to use a per-thread nesting
- count <varname>td_pinnest</varname> along with a
- <varname>td_pincpu</varname> which is updated to the current
- CPU on each context switch. Each CPU has its own run queue
- that holds threads pinned to that CPU. A thread is pinned
- when its nesting count is greater than zero and a thread
- starts off unpinned with a nesting count of zero. When a
- thread is put on a runqueue, we check to see if it is pinned.
- If so, we put it on the per-CPU runqueue, otherwise we put it
- on the global runqueue. When
- <function>choosethread</function> is called to retrieve the
- next thread, it could either always prefer bound threads to
- unbound threads or use some sort of bias when comparing
- priorities. If the nesting count is only ever written to by
- the thread itself and is only read by other threads when the
- owning thread is not executing but while holding the
- <varname>sched_lock</varname>, then
- <varname>td_pinnest</varname> will not need any other locks.
- The <function>migrate_disable</function> function would
- increment the nesting count and
- <function>migrate_enable</function> would decrement the
- nesting count. Due to the locking requirements specified
- above, they will only operate on the current thread and thus
- would not need to handle the case of making a thread
- migrateable that currently resides on a per-CPU run
- queue.</para>
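-
- <para>Under those locking rules the two functions reduce to a
- sketch like the following; the fields are the proposed ones from
- above and do not exist in the tree.</para>
-
- <programlisting><![CDATA[
-void
-migrate_disable(void)
-{
-        curthread->td_pinnest++;        /* written only by the owning thread */
-}
-
-void
-migrate_enable(void)
-{
-        KASSERT(curthread->td_pinnest > 0, ("unbalanced migrate_enable"));
-        curthread->td_pinnest--;
-}
-]]></programlisting>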
-
- <para>It is still debatable if this API is needed or if the
- critical section API is sufficient by itself. Many of the
- places that need to prevent migration also need to prevent
- preemption, and in those places a critical section
- must be used regardless.</para>
- </sect2>
-
- <sect2>
- <title>Callouts</title>
-
- <para>The <function>timeout()</function> kernel facility permits
- kernel services to register functions for execution as part
- of the <function>softclock()</function> software interrupt.
- Events are scheduled based on a desired number of clock
- ticks, and callbacks to the consumer-provided function
- will occur at approximately the right time.</para>
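-
- <para>A minimal consumer of this facility looks roughly like the
- sketch below; the callback and the argument it receives are
- invented for the example.</para>
-
- <programlisting><![CDATA[
-#include <sys/param.h>
-#include <sys/systm.h>
-#include <sys/kernel.h>
-#include <sys/callout.h>
-
-static struct callout_handle foo_handle;
-
-static void
-foo_expire(void *arg)
-{
-        /* Runs from the softclock() software interrupt. */
-}
-
-static void
-foo_schedule(void *arg)
-{
-        /* Fire approximately one second (hz ticks) from now. */
-        foo_handle = timeout(foo_expire, arg, hz);
-}
-
-static void
-foo_cancel(void *arg)
-{
-        untimeout(foo_expire, arg, foo_handle);
-}
-]]></programlisting>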
-
- <para>The global list of pending timeout events is protected
- by a global spin mutex, <varname>callout_lock</varname>;
- all access to the timeout list must be performed with this
- mutex held. When <function>softclock()</function> is
- woken up, it scans the list of pending timeouts for those
- that should fire. In order to avoid lock order reversal,
- the <function>softclock</function> thread will release the
- <varname>callout_lock</varname> mutex when invoking the
- provided <function>timeout()</function> callback function.
- If the <constant>CALLOUT_MPSAFE</constant> flag was not set
- during registration, then Giant will be grabbed before
- invoking the callout, and then released afterwards. The
- <varname>callout_lock</varname> mutex will be re-grabbed
- before proceeding. The <function>softclock()</function>
- code is careful to leave the list in a consistent state
- while releasing the mutex. If <constant>DIAGNOSTIC</constant>
- is enabled, then the time taken to execute each function is
- measured, and a warning generated if it exceeds a
- threshold.</para>
- </sect2>
- </sect1>
-
- <sect1>
- <title>Specific Locking Strategies</title>
-
- <sect2>
- <title>Credentials</title>
-
- <para><structname>struct ucred</structname> is the kernel's
- internal credential structure, and is generally used as the
- basis for process-driven access control within the kernel.
- BSD-derived systems use a <quote>copy-on-write</quote> model for credential
- data: multiple references may exist for a credential structure,
- and when a change needs to be made, the structure is duplicated,
- modified, and then the reference replaced. Due to widespread
- caching of the credential to implement access control on open,
- this results in substantial memory savings. With a move to
- fine-grained SMP, this model also saves substantially on
- locking operations by requiring that modification only occur
- on an unshared credential, avoiding the need for explicit
- synchronization when consuming a known-shared
- credential.</para>
-
- <para>Credential structures with a single reference are
- considered mutable; shared credential structures must not be
- modified, or a race condition is risked. A mutex,
- <structfield>cr_mtxp</structfield>, protects the reference
- count of <structname>struct ucred</structname> so as to
- maintain consistency. Any use of the structure requires a
- valid reference for the duration of the use, or the structure
- may be released out from under the illegitimate
- consumer.</para>
-
- <para>The <structname>struct ucred</structname> mutex is a leaf
- mutex, and for performance reasons, is implemented via a mutex
- pool.</para>
-
- <para>Usually, credentials are used in a read-only manner for access
- control decisions, and in this case <structfield>td_ucred</structfield>
- is generally preferred because it requires no locking. When a
- process' credential is updated, the <literal>proc</literal> lock
- must be held across the check and update operations to avoid
- races. The process credential <structfield>p_ucred</structfield>
- must be used for check and update operations to prevent
- time-of-check, time-of-use races.</para>
-
- <para>If system call invocations will perform access control after
- an update to the process credential, the value of
- <structfield>td_ucred</structfield> must also be refreshed to
- the current process value. This will prevent use of a stale
- credential following a change. The kernel automatically
- refreshes the <structfield>td_ucred</structfield> pointer in
- the thread structure from the process
- <structfield>p_ucred</structfield> whenever a process enters
- the kernel, permitting use of a fresh credential for kernel
- access control.</para>
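-
- <para>Putting these rules together, a credential update follows a
- pattern roughly like the sketch below (error handling is omitted
- and the modification itself is left as a placeholder).</para>
-
- <programlisting><![CDATA[
-#include <sys/param.h>
-#include <sys/proc.h>
-#include <sys/ucred.h>
-
-static void
-change_credential(struct proc *p)
-{
-        struct ucred *newcred, *oldcred;
-
-        newcred = crget();              /* a fresh, unshared credential */
-        PROC_LOCK(p);
-        oldcred = p->p_ucred;
-        crcopy(newcred, oldcred);       /* duplicate under the proc lock */
-        /* ... check and modify newcred here ... */
-        p->p_ucred = newcred;           /* install the replacement */
-        PROC_UNLOCK(p);
-        crfree(oldcred);                /* drop the reference to the old one */
-}
-]]></programlisting>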
- </sect2>
-
- <sect2>
- <title>File Descriptors and File Descriptor Tables</title>
-
- <para>Details to follow.</para>
- </sect2>
-
- <sect2>
- <title>Jail Structures</title>
-
- <para><structname>struct prison</structname> stores
- administrative details pertinent to the maintenance of jails
- created using the &man.jail.2; API. This includes the
- per-jail hostname, IP address, and related settings. This
- structure is reference-counted since pointers to instances of
- the structure are shared by many credential structures. A
- single mutex, <structfield>pr_mtx</structfield>, protects read
- and write access to the reference count and all mutable
- variables inside the struct prison. Some variables are set only
- when the jail is created, and a valid reference to the
- <structname>struct prison</structname> is sufficient to read
- these values. The precise locking of each entry is documented
- via comments in <filename>sys/jail.h</filename>.</para>
- </sect2>
-
- <sect2>
- <title>MAC Framework</title>
-
- <para>The TrustedBSD MAC Framework maintains data in a variety
- of kernel objects, in the form of <structname>struct
- label</structname>. In general, labels in kernel objects
- are protected by the same lock as the remainder of the kernel
- object. For example, the <structfield>v_label</structfield>
- label in <structname>struct vnode</structname> is protected
- by the vnode lock on the vnode.</para>
-
- <para>In addition to labels maintained in standard kernel objects,
- the MAC Framework also maintains a list of registered and
- active policies. The policy list is protected by a global
- mutex (<varname>mac_policy_list_lock</varname>) and a busy
- count (also protected by the mutex). Since many access
- control checks may occur in parallel, entry to the framework
- for a read-only access to the policy list requires holding the
- mutex while incrementing (and later decrementing) the busy
- count. The mutex need not be held for the duration of the
- MAC entry operation; some operations, such as label operations
- on file system objects, are long-lived. To modify the policy
- list, such as during policy registration and de-registration,
- the mutex must be held and the reference count must be zero,
- to prevent modification of the list while it is in use.</para>
-
- <para>A condition variable,
- <varname>mac_policy_list_not_busy</varname>, is available to
- threads that need to wait for the list to become unbusy, but
- this condition variable must only be waited on if the caller is
- holding no other locks, or a lock order violation may be
- possible. The busy count, in effect, acts as a form of
- shared/exclusive lock over access to the framework: the difference
- is that, unlike with an sx lock, consumers waiting for the list
- to become unbusy may be starved, rather than permitting lock
- order problems with regards to the busy count and other locks
- that may be held on entry to (or inside) the MAC Framework.</para>
- </sect2>
-
- <sect2>
- <title>Modules</title>
-
- <para>For the module subsystem there exists a single lock that is
- used to protect the shared data. This lock is a shared/exclusive
- (SX) lock and has a good chance of needing to be acquired (shared
- or exclusively), so a few macros have been
- added to make access to the lock easier. These macros can be
- found in <filename>sys/module.h</filename> and are quite basic
- in terms of usage. The main structures protected under this lock
- are the <structname>module_t</structname> structures (when shared)
- and the global <structname>modulelist_t</structname> structure,
- modules. One should review the related source code in
- <filename>kern/kern_module.c</filename> to further understand the
- locking strategy.</para>
- </sect2>
-
- <sect2>
- <title>Newbus Device Tree</title>
-
- <para>The newbus system will have one sx lock. Readers will
- hold a shared (read) lock (&man.sx.slock.9;) and writers will hold
- an exclusive (write) lock (&man.sx.xlock.9;). Internal functions
- will not do locking at all. Externally visible ones will lock as
- needed.
- Items for which it does not matter whether the race is won or
- lost will not be locked, since they tend to be read all over the
- place (e.g. &man.device.get.softc.9;). There will be relatively few
- changes to the newbus data structures, so a single lock should
- be sufficient and not impose a performance penalty.</para>
- </sect2>
-
- <sect2>
- <title>Pipes</title>
-
- <para>...</para>
- </sect2>
-
- <sect2>
- <title>Processes and Threads</title>
-
- <para>- process hierarchy</para>
- <para>- proc locks, references</para>
- <para>- thread-specific copies of proc entries to freeze during system
- calls, including td_ucred</para>
- <para>- inter-process operations</para>
- <para>- process groups and sessions</para>
- </sect2>
-
- <sect2>
- <title>Scheduler</title>
-
- <para>Lots of references to <varname>sched_lock</varname> and notes
- pointing at specific primitives and related magic elsewhere in the
- document.</para>
- </sect2>
-
- <sect2>
- <title>Select and Poll</title>
-
- <para>The select() and poll() functions permit threads to block
- waiting on events on file descriptors--most frequently, whether
- or not the file descriptors are readable or writable.</para>
-
- <para>...</para>
- </sect2>
-
- <sect2>
- <title>SIGIO</title>
-
- <para>The SIGIO service permits processes to request the delivery
- of a SIGIO signal to a process or process group when the read/write
- status of specified file descriptors changes. At most one process or
- process group is permitted to register for SIGIO from any given
- kernel object, and that process or group is referred to as
- the owner. Each object supporting SIGIO registration contains a
- pointer field that is NULL if the object is not registered, or
- points to a <structname>struct sigio</structname> describing
- the registration. This field is protected by a global mutex,
- <varname>sigio_lock</varname>. Callers to SIGIO maintenance
- functions must pass in this field <quote>by reference</quote> so that local
- register copies of the field are not made when unprotected by
- the lock.</para>
-
- <para>One <structname>struct sigio</structname> is allocated for
- each registered object associated with any process or process
- group, and contains back-pointers to the object, owner, signal
- information, a credential, and the general disposition of the
- registration. Each process or process group contains a list of
- registered <structname>struct sigio</structname> structures,
- <structfield>p_sigiolst</structfield> for processes, and
- <structfield>pg_sigiolst</structfield> for process groups.
- These lists are protected by the process or process group
- locks respectively. Most fields in each <structname>struct
- sigio</structname> are constant for the duration of the
- registration, with the exception of the
- <structfield>sio_pgsigio</structfield> field which links the
- <structname>struct sigio</structname> into the process or
- process group list. Developers implementing new kernel
- objects supporting SIGIO will, in general, want to avoid
- holding structure locks while invoking SIGIO supporting
- functions, such as <function>fsetown()</function>
- or <function>funsetown()</function>, to avoid
- defining a lock order between structure locks and the global
- SIGIO lock. This is generally possible through use of an
- elevated reference count on the structure, such as reliance
- on a file descriptor reference to a pipe during a pipe
- operation.</para>
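-
- <para>For instance, a kernel object that supports SIGIO might wire
- up ownership roughly as in the sketch below; the softc layout and
- helper names are hypothetical.</para>
-
- <programlisting><![CDATA[
-#include <sys/param.h>
-#include <sys/sigio.h>
-
-struct foo_softc {
-        struct sigio    *sc_sigio;      /* protected by the global SIGIO lock */
-};
-
-/* FIOSETOWN handler: register the new owner. */
-static int
-foo_setown(struct foo_softc *sc, int owner)
-{
-        /* The field is passed by reference, as described above. */
-        return (fsetown(owner, &sc->sc_sigio));
-}
-
-/* Called on close or detach: drop any registration. */
-static void
-foo_clearown(struct foo_softc *sc)
-{
-        funsetown(&sc->sc_sigio);
-}
-]]></programlisting>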
- </sect2>
-
- <sect2>
- <title>Sysctl</title>
-
- <para>The <function>sysctl()</function> MIB service is invoked
- from both within the kernel and from userland applications
- using a system call. At least two issues are raised in locking:
- first, the protection of the structures maintaining the
- namespace, and second, interactions with kernel variables and
- functions that are accessed by the sysctl interface. Since
- sysctl permits the direct export (and modification) of
- kernel statistics and configuration parameters, the sysctl
- mechanism must become aware of appropriate locking semantics
- for those variables. Currently, sysctl makes use of a
- single global sx lock to serialize use of sysctl(); however, it
- is assumed to operate under Giant and other protections are not
- provided. The remainder of this section speculates on locking
- and semantic changes to sysctl.</para>
-
- <para>- Need to change the order of operations for sysctls that
- update values from <quote>read old; copyin and copyout; write
- new</quote> to <quote>copyin; lock; read old and write new;
- unlock; copyout</quote>. Normal
- sysctls that just copyout the old value and set a new value
- that they copyin may still be able to follow the old model.
- However, it may be cleaner to use the second model for all of
- the sysctl handlers to avoid lock operations.</para>
-
- <para>- To allow for the common case, a sysctl could embed a
- pointer to a mutex in the SYSCTL_FOO macros and in the struct.
- This would work for most sysctls. For values protected by sx
- locks, spin mutexes, or other locking strategies besides a
- single sleep mutex, SYSCTL_PROC nodes could be used to get the
- locking right.</para>
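-
- <para>For instance, a <literal>SYSCTL_PROC</literal> handler can
- take the protecting mutex around the read and the write of the
- kernel variable, roughly as in the sketch below; the variable and
- mutex are hypothetical.</para>
-
- <programlisting><![CDATA[
-#include <sys/param.h>
-#include <sys/kernel.h>
-#include <sys/lock.h>
-#include <sys/mutex.h>
-#include <sys/sysctl.h>
-
-static struct mtx foo_mtx;
-static int foo_value;
-
-static int
-sysctl_foo_value(SYSCTL_HANDLER_ARGS)
-{
-        int error, val;
-
-        mtx_lock(&foo_mtx);
-        val = foo_value;                /* read the old value under the lock */
-        mtx_unlock(&foo_mtx);
-        error = sysctl_handle_int(oidp, &val, 0, req);
-        if (error != 0 || req->newptr == NULL)
-                return (error);
-        mtx_lock(&foo_mtx);
-        foo_value = val;                /* write the new value under the lock */
-        mtx_unlock(&foo_mtx);
-        return (0);
-}
-SYSCTL_PROC(_kern, OID_AUTO, foo_value, CTLTYPE_INT | CTLFLAG_RW,
-    NULL, 0, sysctl_foo_value, "I", "Example value protected by foo_mtx");
-]]></programlisting>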
- </sect2>
-
- <sect2>
- <title>Taskqueue</title>
-
- <para>The taskqueue interface has two basic locks associated
- with it in order to protect the related shared data. The
- <varname>taskqueue_queues_mutex</varname> is meant to serve as a
- lock to protect the <varname>taskqueue_queues</varname> TAILQ.
- The other mutex lock associated with this system is the one in the
- <structname>struct taskqueue</structname> data structure. The
- use of the synchronization primitive here is to protect the
- integrity of the data in the <structname>struct
- taskqueue</structname>. It should be noted that there are no
- separate macros to assist the user in locking down his/her own work
- since these locks are most likely not going to be used outside of
- <filename>kern/subr_taskqueue.c</filename>.</para>
- </sect2>
- </sect1>
-
- <sect1>
- <title>Implementation Notes</title>
-
- <sect2>
- <title>Details of the Mutex Implementation</title>
-
- <para>- Should we require mutexes to be owned for mtx_destroy()
- since we cannot safely assert that they are unowned by anyone
- else otherwise?</para>
-
- <sect3>
- <title>Spin Mutexes</title>
-
- <para>- Use a critical section...</para>
- </sect3>
-
- <sect3>
- <title>Sleep Mutexes</title>
-
- <para>- Describe the races with contested mutexes</para>
-
- <para>- Why it is safe to read mtx_lock of a contested mutex
- when holding sched_lock.</para>
-
- <para>- Priority propagation</para>
- </sect3>
- </sect2>
-
- <sect2>
- <title>Witness</title>
-
- <para>- What does it do</para>
-
- <para>- How does it work</para>
- </sect2>
- </sect1>
-
- <sect1>
- <title>Miscellaneous Topics</title>
-
- <sect2>
- <title>Interrupt Source and ICU Abstractions</title>
-
- <para>- struct isrc</para>
-
- <para>- pic drivers</para>
- </sect2>
-
- <sect2>
- <title>Other Random Questions/Topics</title>
-
- <para>Should we pass an interlock into
- <function>sema_wait</function>?</para>
-
- <para>- Generic turnstiles for sleep mutexes and sx locks.</para>
-
- <para>- Should we have non-sleepable sx locks?</para>
- </sect2>
- </sect1>
-
- <glossary id="glossary">
- <title>Glossary</title>
-
- <glossentry id="atomic">
- <glossterm>atomic</glossterm>
- <glossdef>
- <para>An operation is atomic if all of its effects are visible
- to other CPUs together when the proper access protocol is
- followed. The degenerate case is an atomic instruction
- provided directly by the machine architecture. At a higher
- level, if several members of a structure are protected by a
- lock, then a set of operations are atomic if they are all
- performed while holding the lock without releasing the lock
- in between any of the operations.</para>
-
- <glossseealso>operation</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="block">
- <glossterm>block</glossterm>
- <glossdef>
- <para>A thread is blocked when it is waiting on a lock,
- resource, or condition. Unfortunately this term is a bit
- overloaded as a result.</para>
-
- <glossseealso>sleep</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="critical-section">
- <glossterm>critical section</glossterm>
- <glossdef>
- <para>A section of code that is not allowed to be preempted.
- A critical section is entered and exited using the
- &man.critical.enter.9; API.</para>
- </glossdef>
- </glossentry>
-
- <glossentry id="MD">
- <glossterm>MD</glossterm>
- <glossdef>
- <para>Machine dependent.</para>
-
- <glossseealso>MI</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="memory-operation">
- <glossterm>memory operation</glossterm>
- <glossdef>
- <para>A memory operation reads and/or writes to a memory
- location.</para>
- </glossdef>
- </glossentry>
-
- <glossentry id="MI">
- <glossterm>MI</glossterm>
- <glossdef>
- <para>Machine independent.</para>
-
- <glossseealso>MD</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="operation">
- <glossterm>operation</glossterm>
- <glosssee>memory operation</glosssee>
- </glossentry>
-
- <glossentry id="primary-interrupt-context">
- <glossterm>primary interrupt context</glossterm>
- <glossdef>
- <para>Primary interrupt context refers to the code that runs
- when an interrupt occurs. This code can either run an
- interrupt handler directly or schedule an asynchronous
- interrupt thread to execute the interrupt handlers for a
- given interrupt source.</para>
- </glossdef>
- </glossentry>
-
- <glossentry>
- <glossterm>realtime kernel thread</glossterm>
- <glossdef>
- <para>A high priority kernel thread. Currently, the only
- realtime priority kernel threads are interrupt threads.</para>
-
- <glossseealso>thread</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="sleep">
- <glossterm>sleep</glossterm>
- <glossdef>
- <para>A thread is asleep when it is blocked on a condition
- variable or a sleep queue via <function>msleep</function> or
- <function>tsleep</function>.</para>
-
- <glossseealso>block</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="sleepable-lock">
- <glossterm>sleepable lock</glossterm>
- <glossdef>
- <para>A sleepable lock is a lock that can be held by a thread
- which is asleep. Lockmgr locks and sx locks are currently
- the only sleepable locks in FreeBSD. Eventually, some sx
- locks such as the allproc and proctree locks may become
- non-sleepable locks.</para>
-
- <glossseealso>sleep</glossseealso>
- </glossdef>
- </glossentry>
-
- <glossentry id="thread">
- <glossterm>thread</glossterm>
- <glossdef>
- <para>A kernel thread represented by a struct thread. Threads own
- locks and hold a single execution context.</para>
- </glossdef>
- </glossentry>
- </glossary>
-</article>