System Mechanisms

Microsoft Windows provides several base mechanisms that kernel-mode components such as

the executive, the kernel, and device drivers use. This chapter explains the following system

mechanisms and describes how they are used:

- Trap dispatching, including interrupts, deferred procedure calls (DPCs), asynchronous procedure calls (APCs), exception dispatching, and system service dispatching

- The executive object manager

- Synchronization, including spinlocks, kernel dispatcher objects, and how waits are implemented

- System worker threads

- Miscellaneous mechanisms such as Windows global flags

- Local procedure calls (LPCs)

- Kernel Event Tracing


Trap Dispatching

Interrupts and exceptions are operating system conditions that divert the processor to code outside

the normal flow of control. Either hardware or software can detect them. The term trap

refers to a processor’s mechanism for capturing an executing thread when an exception or an

interrupt occurs and transferring control to a fixed location in the operating system. In Windows,

the processor transfers control to a trap handler, a function specific to a particular interrupt

or exception. Figure 3-1 illustrates some of the conditions that activate trap handlers.

The kernel distinguishes between interrupts and exceptions in the following way. An interrupt

is an asynchronous event (one that can occur at any time) that is unrelated to what the processor

is executing. Interrupts are generated primarily by I/O devices, processor clocks, or

timers, and they can be enabled (turned on) or disabled (turned off). An exception, in contrast,

is a synchronous condition that results from the execution of a particular instruction. Running

a program a second time with the same data under the same conditions can reproduce

exceptions. Examples of exceptions include memory access violations, certain debugger


instructions, and divide-by-zero errors. The kernel also regards system service calls as exceptions

(although technically they’re system traps).

Figure 3-1 Trap dispatching

Either hardware or software can generate exceptions and interrupts. For example, a bus error

exception is caused by a hardware problem, whereas a divide-by-zero exception is the result of

a software bug. Likewise, an I/O device can generate an interrupt, or the kernel itself can issue

a software interrupt (such as an APC or DPC, described later in this chapter).

When a hardware exception or interrupt is generated, the processor records enough machine

state on the kernel stack of the thread that’s interrupted so that it can return to that point in

the control flow and continue execution as if nothing had happened. If the thread was executing

in user mode, Windows switches to the thread’s kernel-mode stack. Windows then creates

a trap frame on the kernel stack of the interrupted thread into which it stores the

execution state of the thread. The trap frame is a subset of a thread’s complete context, and

you can view its definition by typing dt nt!_ktrap_frame in the kernel debugger. (Thread context

is described in Chapter 6.) The kernel handles software interrupts either as part of hardware

interrupt handling or synchronously when a thread invokes kernel functions related to

the software interrupt.

In most cases, the kernel installs front-end trap handling functions that perform general trap

handling tasks before and after transferring control to other functions that field the trap. For

example, if the condition was a device interrupt, a kernel hardware interrupt trap handler

transfers control to the interrupt service routine (ISR) that the device driver provided for the

interrupting device. If the condition was caused by a call to a system service, the general

system service trap handler transfers control to the specified system service function in the

executive. The kernel also installs trap handlers for traps that it doesn’t expect to see or

doesn’t handle. These trap handlers typically execute the system function KeBugCheckEx,

which halts the computer when the kernel detects problematic or incorrect behavior that, if

left unchecked, could result in data corruption. (For more information on bug checks, see

Chapter 14.) The following sections describe interrupt, exception, and system service dispatching

in greater detail.

Interrupt Dispatching

Hardware-generated interrupts typically originate from I/O devices that must notify the processor

when they need service. Interrupt-driven devices allow the operating system to get the

maximum use out of the processor by overlapping central processing with I/O operations. A

thread starts an I/O transfer to or from a device and then can execute other useful work while

the device completes the transfer. When the device is finished, it interrupts the processor for

service. Pointing devices, printers, keyboards, disk drives, and network cards are generally

interrupt driven.

System software can also generate interrupts. For example, the kernel can issue a software

interrupt to initiate thread dispatching and to asynchronously break into the execution of a

thread. The kernel can also disable interrupts so that the processor isn’t interrupted, but it

does so only infrequently—at critical moments while it’s processing an interrupt or dispatching

an exception, for example.

The kernel installs interrupt trap handlers to respond to device interrupts. Interrupt trap

handlers transfer control either to an external routine (the ISR) that handles the interrupt

or to an internal kernel routine that responds to the interrupt. Device drivers supply ISRs to

service device interrupts, and the kernel provides interrupt handling routines for other

types of interrupts.

In the following subsections, you’ll find out how the hardware notifies the processor of device

interrupts, the types of interrupts the kernel supports, the way device drivers interact with the

kernel (as a part of interrupt processing), and the software interrupts the kernel recognizes

(plus the kernel objects that are used to implement them).

Hardware Interrupt Processing

On the hardware platforms supported by Windows, external I/O interrupts come into one of

the lines on an interrupt controller. The controller in turn interrupts the processor on a single

line. Once the processor is interrupted, it queries the controller to get the interrupt request

(IRQ). The interrupt controller translates the IRQ to an interrupt number, uses this number

as an index into a structure called the interrupt dispatch table (IDT), and transfers control to

the appropriate interrupt dispatch routine. At system boot time, Windows fills in the IDT with

pointers to the kernel routines that handle each interrupt and exception.
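Conceptually, the IDT behaves like an array of handler pointers indexed by interrupt number, filled in at boot. The following user-mode C sketch models that dispatch step; the names, vector assignments, and handlers are hypothetical illustrations, not the real kernel structures:

```c
#include <assert.h>
#include <stddef.h>

#define IDT_ENTRIES 256  /* the supported architectures allow up to 256 entries */

typedef void (*trap_handler_t)(int vector);

static trap_handler_t idt[IDT_ENTRIES];  /* Windows keeps one IDT per processor */
static int last_vector_handled = -1;

/* Handler for vectors the kernel doesn't expect; the real kernel would
   eventually call KeBugCheckEx from such a handler. */
static void unexpected_trap(int vector) { last_vector_handled = -vector; }
static void clock_handler(int vector)   { last_vector_handled = vector; }

/* At boot, every slot gets a handler; unexpected vectors get a catch-all. */
static void idt_init(void)
{
    for (size_t i = 0; i < IDT_ENTRIES; i++)
        idt[i] = unexpected_trap;
    idt[0x30] = clock_handler;  /* e.g., vector 0x30 -> clock interrupt */
}

/* Dispatch: use the interrupt number as an index and call through the table. */
static void dispatch(int vector) { idt[vector](vector); }
```

The real IDT entries are architecture-defined descriptors rather than plain C function pointers, but the index-and-transfer-control pattern is the same one the `!idt` output below reflects.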


88 Microsoft Windows Internals, Fourth Edition


You can view the contents of the IDT, including information on what trap handlers Windows

has assigned to interrupts (including exceptions and IRQs), using the !idt kernel

debugger command. The !idt command with no flags shows vectors that map to

addresses in modules other than Ntoskrnl.exe.

The following example shows what the output of the !idt command looks like:

kd> !idt

Dumping IDT:

30: 806b14c0 hal!HalpClockInterrupt

31: 8a39dc3c i8042prt!I8042KeyboardInterruptService (KINTERRUPT 8a39dc00)

34: 8a436dd4 serial!SerialCIsrSw (KINTERRUPT 8a436d98)

35: 8a44ed74 NDIS!ndisMIsr (KINTERRUPT 8a44ed38)

portcls!CInterruptSync::Release+0x10 (KINTERRUPT 899c44a0)

38: 806abe80 hal!HalpProfileInterrupt

39: 8a4a8abc ACPI!ACPIInterruptServiceRoutine (KINTERRUPT 8a4a8a80)

3b: 8a48d8c4 pcmcia!PcmciaInterrupt (KINTERRUPT 8a48d888)

ohci1394!OhciIsr (KINTERRUPT 8a41da18)

VIDEOPRT!pVideoPortInterrupt (KINTERRUPT 8a1bc2c0)

USBPORT!USBPORT_InterruptService (KINTERRUPT 8a2302b8)

USBPORT!USBPORT_InterruptService (KINTERRUPT 8a0b8008)

USBPORT!USBPORT_InterruptService (KINTERRUPT 8a170008)

USBPORT!USBPORT_InterruptService (KINTERRUPT 8a258380)

NDIS!ndisMIsr (KINTERRUPT 8a0e0430)

3c: 8a39d3ec i8042prt!I8042MouseInterruptService (KINTERRUPT 8a39d3b0)

3e: 8a47264c atapi!IdePortInterrupt (KINTERRUPT 8a472610)

3f: 8a489b3c atapi!IdePortInterrupt (KINTERRUPT 8a489b00)

On the system used to provide the output for this experiment, the keyboard device

driver’s (I8042prt.sys) keyboard ISR is at interrupt number 0x31 and several devices—

including the video adapter, PCMCIA bus, USB and IEEE 1394 ports, and network

adapter—share interrupt 0x3B.

Windows maps hardware IRQs to interrupt numbers in the IDT, and the system also uses the

IDT to configure trap handlers for exceptions. For example, the x86 and x64 exception number

for a page fault (an exception that occurs when a thread attempts to access a page of virtual

memory that isn’t defined or present) is 0xe. Thus, entry 0xe in the IDT points to the

system’s page fault handler. Although the architectures supported by Windows allow up to

256 IDT entries, the number of IRQs a particular machine can support is determined by the

design of the interrupt controller the machine uses.


Each processor has a separate IDT so that different processors can run different ISRs, if appropriate.

For example, in a multiprocessor system, each processor receives the clock interrupt,

but only one processor updates the system clock in response to this interrupt. All the processors,

however, use the interrupt to measure thread quantum and to initiate rescheduling when

a thread’s quantum ends. Similarly, some system configurations might require that a particular

processor handle certain device interrupts.

x86 Interrupt Controllers

Most x86 systems rely on either the i8259A Programmable Interrupt Controller (PIC) or a

variant of the i82489 Advanced Programmable Interrupt Controller (APIC); the majority of

new computers include an APIC. The PIC standard originates with the original IBM PC. PICs

work only with uniprocessor systems and have 15 interrupt lines. APICs and SAPICs (discussed

shortly) work with multiprocessor systems and have 256 interrupt lines. Intel and

other companies have defined the Multiprocessor Specification (MP Specification), a design

standard for x86 multiprocessor systems that centers on the use of APIC. To provide compatibility

with uniprocessor operating systems and boot code that starts a multiprocessor system

in uniprocessor mode, APICs support a PIC compatibility mode with 15 interrupts and delivery

of interrupts to only the primary processor. Figure 3-2 depicts the APIC architecture. The

APIC actually consists of several components: an I/O APIC that receives interrupts from

devices, local APICs that receive interrupts from the I/O APIC on a private APIC bus and that

interrupt the CPU they are associated with, and an i8259A-compatible interrupt controller

that translates APIC input into PIC-equivalent signals. The I/O APIC is responsible for implementing

interrupt routing algorithms—which are software-selectable (the hardware abstraction

layer, or HAL, makes the selection on Windows)—that both balance the device interrupt

load across processors and attempt to take advantage of locality, delivering device interrupts

to the same processor that has just fielded a previous interrupt of the same type.

Figure 3-2 x86 APIC architecture














x64 Interrupt Controllers

Because the x64 architecture is compatible with x86 operating systems, x64 systems must

provide the same interrupt controllers as does the x86. A significant difference, however, is

that the x64 versions of Windows will not run on systems that lack an APIC; they

use the APIC for interrupt control.

IA64 Interrupt Controllers

The IA64 architecture relies on the Streamlined Advanced Programmable Interrupt Controller

(SAPIC), which is an evolution of the APIC. A major difference between the APIC and SAPIC

architectures is that the I/O APICs on an APIC system deliver interrupts to local APICs over a

private APIC bus, whereas on a SAPIC system interrupts traverse the I/O and system bus for

faster delivery. Another difference is that interrupt routing and load balancing is handled by

the APIC bus on an APIC system, but a SAPIC system, which doesn’t have a private APIC bus,

requires that the support be programmed into the firmware. Even if load balancing and routing

are present in the firmware, Windows does not take advantage of it; instead, it statically

assigns interrupts to processors in a round-robin manner.
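The static round-robin assignment described above can be modeled in a few lines of C. The function name and state are hypothetical, a sketch of the policy rather than the HAL's actual implementation:

```c
#include <assert.h>

/* Toy model: each new interrupt source is statically bound to the next
   processor in sequence, wrapping around when the last processor is reached. */
static int next_cpu = 0;

static int assign_interrupt_to_cpu(int num_cpus)
{
    int cpu = next_cpu;
    next_cpu = (next_cpu + 1) % num_cpus;  /* round-robin across processors */
    return cpu;
}
```

The key property is that the assignment is fixed at configuration time; unlike APIC-bus lowest-priority delivery, nothing rebalances interrupts dynamically at run time.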

EXPERIMENT: Viewing the PIC and APIC

You can view the configuration of the PIC on a uniprocessor and the APIC on a multiprocessor

by using the !pic and !apic kernel debugger commands, respectively. (You

can’t use LiveKd for this experiment because LiveKd can’t access hardware.) Here’s the

output of the !pic command on a uniprocessor. (Note that the !pic command doesn’t

work if your system is using an APIC HAL.)

lkd> !pic

----- IRQ Number ----- 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

Physically in service: . . . . . . . . . . . . . . . .

Physically masked: . . . Y . . Y Y . . Y . . Y . .

Physically requested: . . . . . . . . . . . . . . . .

Level Triggered: . . . . . Y . . . Y . Y . . . .

Here’s the output of the !apic command on a system running with the MPS HAL. The

“0:” prefix for the debugger prompt indicates that commands are running on processor

0, so this is the local APIC for processor 0:

lkd> !apic

Apic @ fffe0000 ID:0 (40010) LogDesc:01000000 DestFmt:ffffffff TPR 20

TimeCnt: 0bebc200clk SpurVec:3f FaultVec:e3 error:0

Ipi Cmd: 0004001f Vec:1F FixedDel Dest=Self edg high

Timer..: 000300fd Vec:FD FixedDel Dest=Self edg high masked

Linti0.: 0001003f Vec:3F FixedDel Dest=Self edg high masked

Linti1.: 000184ff Vec:FF NMI Dest=Self lvl high masked

TMR: 61, 82, 91-92, B1




The following output is for the !ioapic command, which displays the configuration of the

I/O APIC, the interrupt controller component connected to devices:

0: kd> !ioapic

IoApic @ ffd02000 ID:8 (11) Arb:0

Inti00.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti01.: 00000962 Vec:62 LowestDl Lg:03000000 edg

Inti02.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti03.: 00000971 Vec:71 LowestDl Lg:03000000 edg

Inti04.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti05.: 00000961 Vec:61 LowestDl Lg:03000000 edg

Inti06.: 00010982 Vec:82 LowestDl Lg:02000000 edg masked

Inti07.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti08.: 000008d1 Vec:D1 FixedDel Lg:01000000 edg

Inti09.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti0A.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti0B.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti0C.: 00000972 Vec:72 LowestDl Lg:03000000 edg

Inti0D.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti0E.: 00000992 Vec:92 LowestDl Lg:03000000 edg

Inti0F.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti10.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti11.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti12.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti13.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti14.: 0000a9a3 Vec:A3 LowestDl Lg:03000000 lvl

Inti15.: 0000a993 Vec:93 LowestDl Lg:03000000 lvl

Inti16.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Inti17.: 000100ff Vec:FF FixedDel PhysDest:00 edg masked

Software Interrupt Request Levels (IRQLs)

Although interrupt controllers perform a level of interrupt prioritization, Windows imposes

its own interrupt priority scheme known as interrupt request levels (IRQLs). The kernel represents

IRQLs internally as a number from 0 through 31 on x86 and from 0 to 15 on x64 and

IA64, with higher numbers representing higher-priority interrupts. Although the kernel

defines the standard set of IRQLs for software interrupts, the HAL maps hardware-interrupt

numbers to the IRQLs. Figure 3-3 shows IRQLs defined for the x86 architecture, and Figure

3-4 shows IRQLs for the x64 and IA64 architectures.

Note SYNCH_LEVEL, which multiprocessor versions of the kernel use to protect access to

each processor’s processor control block (PRCB), is not shown in the charts because its value varies

across different versions of Windows. See Chapter 6 for a description of SYNCH_LEVEL and

its possible values.


Figure 3-3 x86 interrupt request levels (IRQLs)

Figure 3-4 x64 and IA64 interrupt request levels (IRQLs)

Interrupts are serviced in priority order, and a higher-priority interrupt preempts the servicing

of a lower-priority interrupt. When a high-priority interrupt occurs, the processor saves the

interrupted thread’s state and invokes the trap dispatchers associated with the interrupt. The

trap dispatcher raises the IRQL and calls the interrupt’s service routine. After the service routine

executes, the interrupt dispatcher lowers the processor’s IRQL to where it was before the

interrupt occurred and then loads the saved machine state. The interrupted thread resumes

executing where it left off. When the kernel lowers the IRQL, lower-priority interrupts that

were masked might materialize. If this happens, the kernel repeats the process to handle the

new interrupts.

IRQL priority levels have a completely different meaning than thread-scheduling priorities

(which are described in Chapter 6). A scheduling priority is an attribute of a thread, whereas


an IRQL is an attribute of an interrupt source, such as a keyboard or a mouse. In addition,

each processor has an IRQL setting that changes as operating system code executes.

Each processor’s IRQL setting determines which interrupts that processor can receive. IRQLs

are also used to synchronize access to kernel-mode data structures. (You’ll find out more

about synchronization later in this chapter.) As a kernel-mode thread runs, it raises or lowers

the processor’s IRQL either directly by calling KeRaiseIrql and KeLowerIrql or, more commonly,

indirectly via calls to functions that acquire kernel synchronization objects. As Figure

3-5 illustrates, interrupts from a source with an IRQL above the current level interrupt the

processor, whereas interrupts from sources with IRQLs equal to or below the current level are

masked until an executing thread lowers the IRQL.
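That masking rule—only interrupts from sources with an IRQL above the processor's current IRQL get through—reduces to a one-line predicate. Here is a user-mode sketch; the variable names are hypothetical, and the real per-processor IRQL lives in the PCR:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-processor state; 0 models PASSIVE_LEVEL. */
static int current_irql = 0;

/* An interrupt is delivered only if its IRQL is strictly above the current
   level; interrupts at or below it stay masked until the IRQL is lowered. */
static bool interrupt_delivered(int interrupt_irql)
{
    return interrupt_irql > current_irql;
}
```

Note the strict inequality: an interrupt at exactly the current IRQL is masked, which is what keeps an ISR from being preempted by another interrupt at its own level.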

Figure 3-5 Masking interrupts

Because accessing a PIC is a relatively slow operation, HALs that use a PIC implement a performance

optimization, called lazy IRQL, that avoids PIC accesses. When the IRQL is raised,

the HAL notes the new IRQL internally instead of changing the interrupt mask. If a lower-priority

interrupt subsequently occurs, the HAL sets the interrupt mask to the settings appropriate

for the first interrupt and postpones the lower-priority interrupt until the IRQL is lowered.

Thus, if no lower-priority interrupts occur while the IRQL is raised, the HAL doesn’t need to

modify the PIC.
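A toy model makes the lazy IRQL optimization concrete: raising the IRQL is just a variable write, and the slow PIC access happens only if a lower-priority interrupt actually arrives while the IRQL is raised. This is an illustrative simulation under those assumptions, not HAL code:

```c
#include <assert.h>

static int noted_irql    = 0;   /* the IRQL the HAL has recorded internally   */
static int pic_mask_irql = 0;   /* the level the PIC hardware is masked to    */
static int pic_writes    = 0;   /* counts (slow) PIC accesses                 */
static int pending_irql  = -1;  /* a postponed lower-priority interrupt       */

/* Raising the IRQL touches only the internal note -- no PIC access. */
static void raise_irql(int new_irql)
{
    noted_irql = new_irql;
}

static void interrupt_arrives(int irql)
{
    if (irql > noted_irql) {
        /* Genuinely higher priority: delivered immediately (not modeled). */
    } else {
        /* Now the hardware mask must catch up with the noted IRQL, and the
           lower-priority interrupt is held until the IRQL is lowered. */
        pic_mask_irql = noted_irql;
        pic_writes++;
        pending_irql = irql;
    }
}
```

If no lower-priority interrupt occurs before the IRQL is lowered again, `pic_writes` never advances, which is exactly the saving the optimization is after.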

A kernel-mode thread raises and lowers the IRQL of the processor on which it’s running,

depending on what it’s trying to do. For example, when an interrupt occurs, the trap handler

(or perhaps the processor) raises the processor’s IRQL to the assigned IRQL of the interrupt

source. This elevation masks all interrupts at and below that IRQL (on that processor only),

which ensures that the processor servicing the interrupt isn’t waylaid by an interrupt at the

same or a lower level. The masked interrupts are either handled by another processor or held



back until the IRQL drops. Therefore, all components of the system, including the kernel and

device drivers, attempt to keep the IRQL at passive level (sometimes called low level). They do

this because device drivers can respond to hardware interrupts in a timelier manner if the

IRQL isn’t kept unnecessarily elevated for long periods.

Note An exception to the rule that raising the IRQL blocks interrupts of that level and lower

relates to APC_LEVEL interrupts. If a thread raises the IRQL to APC_LEVEL and then is rescheduled

because of a DISPATCH_LEVEL interrupt, the system might deliver an APC_LEVEL interrupt

to the newly scheduled thread. Thus, APC_LEVEL can be considered a thread-local rather than

processor-wide IRQL.


If you are running the kernel debugger on Windows Server 2003, you can view a processor’s

IRQL with the !irql debugger command:

kd> !irql

Debugger saved IRQL for processor 0x0 -- 0 (LOW_LEVEL)

Note that there is a field called IRQL in a data structure called the processor control region

(PCR) and its extension the processor control block (PRCB), which contain information

about the state of each processor in the system, such as the current IRQL, a pointer to

the hardware IDT, the currently running thread, and the next thread selected to run.

The kernel and the HAL use this information to perform architecture-specific and

machine-specific actions. Portions of the PCR and PRCB structures are defined publicly

in the Windows Device Driver Kit (DDK) header file Ntddk.h, so examine that file if you

want a complete definition of these structures.

You can view the contents of the PCR with the kernel debugger by using the !pcr command:

kd> !pcr

PCR Processor 0 @ffdff000

NtTib.ExceptionList: f8effc68

NtTib.StackBase: f8effdf0

NtTib.StackLimit: f8efd000

NtTib.SubSystemTib: 00000000

NtTib.Version: 00000000

NtTib.UserPointer: 00000000

NtTib.SelfTib: 7ffde000


SelfPcr: ffdff000

Prcb: ffdff120

Irql: 00000000

IRR: 00000000

IDR: ffff28e8

InterruptMode: 00000000

IDT: 80036400

GDT: 80036000

TSS: 802b5000

CurrentThread: 81638020

NextThread: 00000000

IdleThread: 8046bdf0

Unfortunately, Windows does not maintain the Irql field on systems that do not use lazy

IRQL, so on most systems the field will always be 0.

Because changing a processor’s IRQL has such a significant effect on system operation, the

change can be made only in kernel mode—user-mode threads can’t change the processor’s

IRQL. This means that a processor’s IRQL is always at passive level when it’s executing usermode

code. Only when the processor is executing kernel-mode code can the IRQL be higher.

Each interrupt level has a specific purpose. For example, the kernel issues an interprocessor

interrupt (IPI) to request that another processor perform an action, such as dispatching a particular

thread for execution or updating its translation look-aside buffer cache. The system

clock generates an interrupt at regular intervals, and the kernel responds by updating the

clock and measuring thread execution time. If a hardware platform supports two clocks, the

kernel adds another clock interrupt level to measure performance. The HAL provides a number

of interrupt levels for use by interrupt-driven devices; the exact number varies with the processor

and system configuration. The kernel uses software interrupts (described later in this chapter)

to initiate thread scheduling and to asynchronously break into a thread’s execution.

Mapping Interrupts to IRQLs IRQL levels aren’t the same as the interrupt requests (IRQs)

defined by interrupt controllers—the architectures on which Windows runs don’t implement

the concept of IRQLs in hardware. So how does Windows determine what IRQL to assign to

an interrupt? The answer lies in the HAL. In Windows, a type of device driver called a bus

driver determines the presence of devices on its bus (PCI, USB, and so on) and what interrupts

can be assigned to a device. The bus driver reports this information to the Plug and Play

manager, which decides, after taking into account the acceptable interrupt assignments for all

other devices, which interrupt will be assigned to each device. Then it calls the HAL function

HalpGetSystemInterruptVector, which maps interrupts to IRQLs.


The algorithm for assignment differs for the various HALs that Windows includes. On uniprocessor

x86 systems, the HAL performs a straightforward translation: the IRQL of a given interrupt

vector is calculated by subtracting the interrupt vector from 27. Thus, if a device uses

interrupt vector 5, its ISR executes at IRQL 22. On an x86 multiprocessor system, the mapping

isn’t as simple. APICs support over 200 interrupt vectors, so there aren’t enough IRQLs

for a one-to-one correspondence. The multiprocessor HAL therefore assigns IRQLs to interrupt

vectors in a round-robin manner, cycling through the device IRQL (DIRQL) range. As a

result, on an x86 multiprocessor system there’s no easy way for you to predict or to know

what IRQL Windows assigns to APIC IRQs. Finally, on x64 and IA64 systems, the HAL computes

the IRQL for a given IRQ by dividing the interrupt vector assigned to the IRQ by 16.
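The two straightforward translations are simple enough to state directly in C. The function names are hypothetical, but the arithmetic matches the text:

```c
#include <assert.h>

/* Uniprocessor x86 HAL: IRQL = 27 - interrupt vector. */
static int x86_up_irql(int vector)
{
    return 27 - vector;
}

/* x64 and IA64 HALs: IRQL = interrupt vector / 16. */
static int x64_irql(int vector)
{
    return vector / 16;
}
```

So a device on vector 5 under the uniprocessor x86 HAL runs its ISR at IRQL 22, and on x64/IA64 a vector such as 0xE3 maps to IRQL 14.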

Predefined IRQLs Let’s take a closer look at the use of the predefined IRQLs, starting from

the highest level shown in Figure 3-5:

- The kernel uses high level only when it’s halting the system in KeBugCheckEx and masking out all interrupts.

- Power fail level originated in the original Microsoft Windows NT design documents, which specified the behavior of system power failure code, but this IRQL has never been used.

- Inter-processor interrupt level is used to request another processor to perform an action, such as queuing a DISPATCH_LEVEL interrupt to schedule a particular thread for execution, updating the processor’s translation look-aside buffer (TLB) cache, system shutdown, or system crash.

- Clock level is used for the system’s clock, which the kernel uses to track the time of day as well as to measure and allot CPU time to threads.

- The system’s real-time clock uses profile level when kernel profiling, a performance measurement mechanism, is enabled. When kernel profiling is active, the kernel’s profiling trap handler records the address of the code that was executing when the interrupt occurred. A table of address samples is constructed over time that tools can extract and analyze. You can download Kernrate, a kernel profiling tool that you can use to configure and view profiling-generated statistics, from sysperf/krview.mspx. See the Kernrate experiment for more information on using this tool.

- The device IRQLs are used to prioritize device interrupts. (See the previous section for how hardware interrupt levels are mapped to IRQLs.)

- DPC/dispatch-level and APC-level interrupts are software interrupts that the kernel and device drivers generate. (DPCs and APCs are explained in more detail later in this chapter.)

- The lowest IRQL, passive level, isn’t really an interrupt level at all; it’s the setting at which normal thread execution takes place and all interrupts are allowed to occur.


EXPERIMENT: Using Kernel Profiler to Profile Execution

You can use the Kernel Profiler tool to enable the system profiling timer, collect samples

of the code that is executing when the timer fires, and display a summary showing the

frequency distribution across image files and functions. It can be used to track CPU

usage consumed by individual processes and/or time spent in kernel mode independent

of processes (for example, interrupt service routines). Kernel profiling is useful

when you want to obtain a breakdown of where the system is spending time.

In its simplest form, Kernrate samples where time has been spent in each kernel module

(for example, Ntoskrnl, drivers, and so on). For example, after installing the Krview

package referred to previously, try performing the following steps:

1. Open a command prompt.

2. Type cd c:\program files\krview\kernrates.

3. Type dir. (You will see kernrate images for each platform.)

4. Run the image that matches your platform (with no arguments or switches). For

example, Kernrate_i386_XP.exe is the image for Windows XP running on an x86 system.


5. While Kernrate is running, go perform some other activity on the system. For example,

run Windows Media Player and play some music, run a graphics-intensive game,

or perform network activity such as doing a directory of a remote network share.

6. Press Ctrl+C to stop Kernrate. This causes Kernrate to display the statistics from

the sampling period.

In the sample partial output from Kernrate, Windows Media Player was running, playing

a track from a CD.

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe




Date: 2004/05/13 Time: 9:48:28

Machine Name: BIGDAVID

Number of Processors: 1



Kernrate User-Specified Command Line:


***> Press ctrl-c to finish collecting profile data

===> Finished Collecting Data, Starting to Process Results

------------Overall Summary:--------------

P0 K 0:00:03.234 (11.7%) U 0:00:08.352 (30.2%) I 0:00:16.093 (58.1%)

DPC 0:00:01.772 ( 6.4%) Interrupt 0:00:00.350 ( 1.3%)


Interrupts= 52899, Interrupt Rate= 1911/sec.

Time 7315 hits, 19531 events per hit --------

Module Hits msec %Total Events/Sec

gv3 4735 27679 64 % 3341135

smwdm 872 27679 11 % 615305

win32k 764 27679 10 % 539097

ntoskrnl 739 27679 10 % 521457

hal 124 27679 1 % 87497

The overall summary shows that the system spent 11.7 percent of the time in kernel

mode, 30.2 percent in user mode, 58.1 percent idle, 6.4 percent at DPC level, and 1.3

percent at interrupt level. The module with the highest hit rate was GV3.SYS, the processor

driver for the Pentium M Geyserville family. It is used for performance collection,

which is why it is first. The module with the second highest hit rate was Smwdm.sys, the

audio driver for the sound card on the machine used for the test. This makes sense

because the major activity going on in the system was Windows Media Player sending

sound I/O to the sound driver.

If you have symbols available, you can zoom in on individual modules and see the time

spent by function name. For example, profiling the system while dragging a window

around the screen rapidly resulted in the following (partial) output:

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -z ntoskrnl -z win32k




Date: 2004/05/13 Time: 10:26:55

Time 4087 hits, 19531 events per hit --------

Module Hits msec %Total Events/Sec

win32k 1649 10424 40 % 3089660

ati2dvag 1269 10424 31 % 2377670

ntoskrnl 794 10424 19 % 1487683

gv3 162 10424 3 % 303532

----- Zoomed module win32k.sys (Bucket size = 16 bytes, Rounding Down) -------

Module Hits msec %Total Events/Sec

EngPaint 328 10424 19 % 614559

EngLpkInstalled 302 10424 18 % 565844

----- Zoomed module ntoskrnl.exe (Bucket size = 16 bytes, Rounding Down) -----

Module Hits msec %Total Events/Sec

KiDispatchInterrupt 243 10424 26 % 455298

ZwYieldExecution 50 10424 5 % 93682

InterlockedDecrement 39 10424 4 % 73072

The module with the highest hit rate was Win32k.sys, the windowing system driver.

Second on the list was the video driver. These results make sense because the main

activity in the system was drawing on the screen. Note in the zoomed display for

Win32k.sys, the function with the highest hit was EngPaint, the main GDI function to

paint on the screen.


One important restriction on code running at DPC/dispatch level or above is that it can’t wait

for an object if doing so would require the scheduler to select another thread to execute,

which is an illegal operation because the scheduler synchronizes its data structures at DPC/

dispatch level and cannot therefore be invoked to perform a reschedule. Another restriction is

that only nonpaged memory can be accessed at IRQL DPC/dispatch level or higher. This rule

is actually a side-effect of the first restriction because attempting to access memory that isn’t

resident results in a page fault. When a page fault occurs, the memory manager initiates a disk

I/O and then needs to wait for the file system driver to read the page in from disk. This wait

would in turn require the scheduler to perform a context switch (perhaps to the idle thread if

no user thread is waiting to run), thus violating the rule that the scheduler can’t be invoked

(because the IRQL is still DPC/dispatch level or higher at the time of the disk read). If either

of these two restrictions is violated, the system crashes with an

IRQL_NOT_LESS_OR_EQUAL crash code. (See Chapter 4 for a thorough discussion of system

crashes.) Violating these restrictions is a common bug in device drivers. The Windows

Driver Verifier, explained in the section “Driver Verifier” in Chapter 7, has an option you can

set to assist in finding this particular type of bug.

Interrupt Objects The kernel provides a portable mechanism—a kernel control object

called an interrupt object—that allows device drivers to register ISRs for their devices. An interrupt

object contains all the information the kernel needs to associate a device ISR with a particular

level of interrupt, including the address of the ISR, the IRQL at which the device

interrupts, and the entry in the kernel’s IDT with which the ISR should be associated. When

an interrupt object is initialized, a few instructions of assembly language code, called the dispatch

code, are copied from an interrupt handling template, KiInterruptTemplate, and stored

in the object. When an interrupt occurs, this code is executed.

This interrupt-object resident code calls the real interrupt dispatcher, which is typically either

the kernel’s KiInterruptDispatch or KiChainedDispatch routine, passing it a pointer to the interrupt

object. KiInterruptDispatch is the routine used for interrupt vectors for which only one

interrupt object is registered, and KiChainedDispatch is for vectors shared among multiple

interrupt objects. The interrupt object contains information this second dispatcher routine

needs to locate and properly call the ISR the device driver provides. The interrupt object also

stores the IRQL associated with the interrupt so that KiInterruptDispatch or KiChainedDispatch

can raise the IRQL to the correct level before calling the ISR and then lower the IRQL

after the ISR has returned. This two-step process is required because there’s no way to pass a

pointer to the interrupt object (or any other argument for that matter) on the initial dispatch

because the initial dispatch is done by hardware. On a multiprocessor system, the kernel allocates

and initializes an interrupt object for each CPU, enabling the local APIC on that CPU to

accept the particular interrupt. Figure 3-6 shows typical interrupt control flow for interrupts

associated with interrupt objects.
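To make the dispatch flow concrete, here is a toy model in C of an interrupt object and the two-step dispatch described above. It is a sketch only: the field names echo the KINTERRUPT layout shown in the experiment that follows, but the types, the IRQL manipulation, and the function bodies are simplified stand-ins, not actual kernel code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char KIRQL;
typedef int (*PKSERVICE_ROUTINE)(void *context);

/* A few of the fields an interrupt object carries (simplified). */
typedef struct _KINTERRUPT {
    PKSERVICE_ROUTINE ServiceRoutine; /* the driver's ISR */
    void *ServiceContext;             /* argument passed to the ISR */
    KIRQL Irql;                       /* IRQL the device interrupts at */
    unsigned Vector;                  /* IDT entry the object is wired to */
} KINTERRUPT;

static KIRQL CurrentIrql;             /* models the per-CPU IRQL */

/* Models KiInterruptDispatch: raise IRQL to the interrupt's level,
   call the driver ISR, then restore the previous IRQL. */
int KiInterruptDispatch(KINTERRUPT *interrupt)
{
    KIRQL old = CurrentIrql;
    CurrentIrql = interrupt->Irql;                     /* raise */
    int claimed = interrupt->ServiceRoutine(interrupt->ServiceContext);
    CurrentIrql = old;                                 /* lower */
    return claimed;
}

/* Sample ISR: records the IRQL it ran at and claims the interrupt. */
static int SampleIsr(void *context)
{
    *(int *)context = (int)CurrentIrql;
    return 1;
}
```

Because the hardware-initiated dispatch cannot pass arguments, it is the per-object dispatch code that supplies the interrupt-object pointer to a routine like this one.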


Figure 3-6 Typical interrupt control flow

EXPERIMENT: Examining Interrupt Internals

Using the kernel debugger, you can view details of an interrupt object, including its

IRQL, ISR address, and custom interrupt dispatching code. First, execute the !idt command

and locate the entry that includes a reference to I8042KeyboardInterruptService,

the ISR routine for the PS/2 keyboard device:

31: 8a39dc3c i8042prt!I8042KeyboardInterruptService (KINTERRUPT 8a39dc00)

To view the contents of the interrupt object associated with the interrupt, execute dt

nt!_kinterrupt with the address following KINTERRUPT:

kd> dt nt!_kinterrupt 8a39dc00


+0x000 Type : 22

+0x002 Size : 484

+0x004 InterruptListEntry : _LIST_ENTRY [ 0x8a39dc04 - 0x8a39dc04 ]

+0x00c ServiceRoutine : 0xba7e74a2 i8042prt!I8042KeyboardInterruptService+0

+0x010 ServiceContext : 0x8a067898

+0x014 SpinLock : 0

+0x018 TickCount : 0xffffffff

+0x01c ActualLock : 0x8a067958 -> 0

+0x020 DispatchAddress : 0x80531140 nt!KiInterruptDispatch+0

+0x024 Vector : 0x31

+0x028 Irql : 0x1a ''

+0x029 SynchronizeIrql : 0x1a ''

+0x02a FloatingSave : 0 ''




+0x02b Connected : 0x1 ''

+0x02c Number : 0 ''

+0x02d ShareVector : 0 ''

+0x030 Mode : 1 ( Latched )

+0x034 ServiceCount : 0

+0x038 DispatchCount : 0xffffffff

+0x03c DispatchCode : [106] 0x56535554

In this example, the IRQL Windows assigned to the interrupt is 0x1a (which is 26 in

decimal). Because this output is from a uniprocessor x86 system, we calculate that the

IRQ is 1, because IRQLs on x86 uniprocessors are calculated by subtracting the IRQ

from 27. We can verify this by opening the Device Manager (on the Hardware tab in the

System applet in the Control Panel), locating the PS/2 keyboard device, and viewing its

resource assignments, as shown in the following figure.

On a multiprocessor x86, the IRQ will be essentially randomly assigned, and on an x64

or IA64 system you will see that the IRQ is the interrupt vector number (0x31—49 decimal—

in this example) divided by 16.
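These two mappings can be written as one-line helpers. The function names are hypothetical; the formulas are exactly the rules stated above (27 minus the IRQL on uniprocessor x86, vector divided by 16 on x64/IA64):

```c
#include <assert.h>

/* On uniprocessor x86 HALs, IRQL = 27 - IRQ for device interrupts,
   so the IRQ can be recovered from the IRQL. */
static int IrqFromX86UpIrql(int irql) { return 27 - irql; }

/* On x64/IA64 systems, the reported IRQ is the interrupt
   vector number divided by 16 (integer division). */
static int IrqFromX64Vector(int vector) { return vector / 16; }
```

For the keyboard example, an IRQL of 0x1a (26) yields IRQ 1, matching the Device Manager resource assignment.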

The ISR’s address for the interrupt object is stored in the ServiceRoutine field (which is

what !idt displays in its output), and the interrupt code that actually executes when an

interrupt occurs is stored in the DispatchCode array at the end of the interrupt object.

The interrupt code stored there is programmed to build the trap frame on the stack and

then call the function stored in the DispatchAddress field (KiInterruptDispatch in the

example), passing it a pointer to the interrupt object.


Windows and Real-Time Processing

Deadline requirements, either hard or soft, characterize real-time environments. Hard

real-time systems (for example, a nuclear power plant control system) have deadlines

that the system must meet to avoid catastrophic failures such as loss of equipment or

life. Soft real-time systems (for example, a car’s fuel-economy optimization system) have

deadlines that the system can miss, but timeliness is still a desirable trait. In real-time

systems, computers have sensor input devices and control output devices. The designer

of a real-time computer system must know worst-case delays between the time an input

device generates an interrupt and the time the device’s driver can control the output

device to respond. This worst-case analysis must take into account the delays the operating

system introduces as well as the delays the application and device drivers impose.

Because Windows doesn’t prioritize device IRQs in any controllable way and user-level

applications execute only when a processor’s IRQL is at passive level, Windows isn’t

always suitable as a real-time operating system. The system’s devices and device drivers—

not Windows—ultimately determine the worst-case delay. This factor becomes a

problem when the real-time system’s designer uses off-the-shelf hardware. The designer

can have difficulty determining how long every off-the-shelf device’s ISR or DPC might

take in the worst case. Even after testing, the designer can’t guarantee that a special case

in a live system won’t cause the system to miss an important deadline. Furthermore, the

sum of all the delays a system’s DPCs and ISRs can introduce usually far exceeds the tolerance

of a time-sensitive system.

Although many types of embedded systems (for example, printers and automotive computers)

have real-time requirements, Windows XP Embedded doesn’t have real-time

characteristics. It is simply a version of Windows XP that makes it possible, using system

designer technology that Microsoft licensed from VenturCom, to produce small-footprint

versions of Windows XP suitable for running on devices with limited resources.

For example, a device that has no networking capability would omit all the Windows XP

components related to networking, including network management tools and adapter

and protocol stack device drivers.

Still, there are third-party vendors that supply real-time kernels for Windows. The

approach these vendors take is to embed their real-time kernel in a custom HAL and to

have Windows run as a task in the real-time operating system. The task running Windows

serves as the user interface to the system and has a lower priority than the tasks

responsible for managing the device. See VenturCom's Web site for

an example of a third-party real-time kernel extension for Windows.

Associating an ISR with a particular level of interrupt is called connecting an interrupt object,

and dissociating an ISR from an IDT entry is called disconnecting an interrupt object. These


operations, accomplished by calling the kernel functions IoConnectInterrupt and IoDisconnectInterrupt,

allow a device driver to “turn on” an ISR when the driver is loaded into the system

and to “turn off” the ISR if the driver is unloaded.

Using the interrupt object to register an ISR prevents device drivers from fiddling directly

with interrupt hardware (which differs among processor architectures) and from needing to

know any details about the IDT. This kernel feature aids in creating portable device drivers

because it eliminates the need to code in assembly language or to reflect processor differences

in device drivers.

Interrupt objects provide other benefits as well. By using the interrupt object, the kernel

can synchronize the execution of the ISR with other parts of a device driver that might share

data with the ISR. (See Chapter 9 for more information about how device drivers respond to interrupts.)


Furthermore, interrupt objects allow the kernel to easily call more than one ISR for any interrupt

level. If multiple device drivers create interrupt objects and connect them to the same

IDT entry, the interrupt dispatcher calls each routine when an interrupt occurs at the specified

interrupt line. This capability allows the kernel to easily support “daisy-chain” configurations,

in which several devices share the same interrupt line. The chain breaks when one of

the ISRs claims ownership for the interrupt by returning a status to the interrupt dispatcher.

If multiple devices sharing the same interrupt require service at the same time, devices not

acknowledged by their ISRs will interrupt the system again once the interrupt dispatcher has

lowered the IRQL. Chaining is permitted only if all the device drivers wanting to use the same

interrupt indicate to the kernel that they can share the interrupt; if they can’t, the Plug and

Play manager reorganizes their interrupt assignments to ensure that it honors the sharing

requirements of each. If the interrupt vector is shared, the interrupt object invokes KiChained-

Dispatch, which will invoke the ISRs of each registered interrupt object in turn until one of

them claims the interrupt or all have been executed. In the earlier sample !idt output, vector

0x3b is connected to several chained interrupt objects.
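The chained-dispatch walk can be modeled in a few lines of C. This is a sketch, not the kernel's implementation: the structure and routine names are illustrative, and real ISRs return a BOOLEAN indicating whether their device was the interrupt source.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*PKSERVICE_ROUTINE)(void *context);

/* One interrupt object on a shared vector's chain (simplified). */
typedef struct _CHAINED_INTERRUPT {
    struct _CHAINED_INTERRUPT *Next;  /* next object sharing the vector */
    PKSERVICE_ROUTINE ServiceRoutine;
    void *ServiceContext;
} CHAINED_INTERRUPT;

/* Models KiChainedDispatch: call each registered ISR in turn until
   one claims the interrupt (returns nonzero) or all have run. */
int KiChainedDispatch(CHAINED_INTERRUPT *head)
{
    for (CHAINED_INTERRUPT *i = head; i != NULL; i = i->Next)
        if (i->ServiceRoutine(i->ServiceContext))
            return 1;   /* an ISR claimed the interrupt; break the chain */
    return 0;           /* no device on the line needed service */
}

/* Sample ISR: claims the interrupt only if its device flag is set. */
static int FlagIsr(void *context) { return *(int *)context; }
```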

Software Interrupts

Although hardware generates most interrupts, the Windows kernel also generates software

interrupts for a variety of tasks, including these:

Initiating thread dispatching

Non-time-critical interrupt processing

Handling timer expiration

Asynchronously executing a procedure in the context of a particular thread

Supporting asynchronous I/O operations

These tasks are described in the following subsections.


Dispatch or Deferred Procedure Call (DPC) Interrupts When a thread can no longer continue

executing, perhaps because it has terminated or because it voluntarily enters a wait state,

the kernel calls the dispatcher directly to effect an immediate context switch. Sometimes, however,

the kernel detects that rescheduling should occur when it is deep within many layers of

code. In this situation, the kernel requests dispatching but defers its occurrence until it completes

its current activity. Using a DPC software interrupt is a convenient way to achieve this delay.


The kernel always raises the processor’s IRQL to DPC/dispatch level or above when it needs

to synchronize access to shared kernel structures. This disables additional software interrupts

and thread dispatching. When the kernel detects that dispatching should occur, it requests a

DPC/dispatch-level interrupt; but because the IRQL is at or above that level, the processor

holds the interrupt in check. When the kernel completes its current activity, it sees that it’s

going to lower the IRQL below DPC/dispatch level and checks to see whether any dispatch

interrupts are pending. If there are, the IRQL drops to DPC/dispatch level and the dispatch

interrupts are processed. Activating the thread dispatcher by using a software interrupt is a

way to defer dispatching until conditions are right. However, Windows uses software interrupts

to defer other types of processing as well.

In addition to thread dispatching, the kernel also processes deferred procedure calls (DPCs)

at this IRQL. A DPC is a function that performs a system task—a task that is less time-critical

than the current one. The functions are called deferred because they might not execute immediately.

DPCs provide the operating system with the capability to generate an interrupt and execute a

system function in kernel mode. The kernel uses DPCs to process timer expiration (and

release threads waiting for the timers) and to reschedule the processor after a thread’s quantum

expires. Device drivers use DPCs to complete I/O requests. To provide timely service for

hardware interrupts, Windows—with the cooperation of device drivers—attempts to keep the

IRQL below device IRQL levels. One way that this goal is achieved is for device driver ISRs to

perform the minimal work necessary to acknowledge their device, save volatile interrupt state,

and defer data transfer or other less time-critical interrupt processing activity for execution in

a DPC at DPC/dispatch IRQL. (See Chapter 9 for more information on DPCs and the I/O system.)

A DPC is represented by a DPC object, a kernel control object that is not visible to user-mode

programs but is visible to device drivers and other system code. The most important piece of

information the DPC object contains is the address of the system function that the kernel will

call when it processes the DPC interrupt. DPC routines that are waiting to execute are stored

in kernel-managed queues, one per processor, called DPC queues. To request a DPC, system

code calls the kernel to initialize a DPC object and then places it in a DPC queue.

By default, the kernel places DPC objects at the end of the DPC queue of the processor on

which the DPC was requested (typically the processor on which the ISR executed). A device


driver can override this behavior, however, by specifying a DPC priority (low, medium, or

high, where medium is the default) and by targeting the DPC at a particular processor. A DPC

aimed at a specific CPU is known as a targeted DPC. If the DPC has a low or medium priority,

the kernel places the DPC object at the end of the queue; if the DPC has a high priority, the

kernel inserts the DPC object at the front of the queue.
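The insertion rule (low and medium priority at the tail, high priority at the head) can be sketched as a toy per-processor queue in C. The names loosely follow KDPC and KeInsertQueueDpc, but this is an illustration of the queue discipline only, not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DpcLow, DpcMedium, DpcHigh } DPC_PRIORITY;

typedef struct _KDPC {
    struct _KDPC *Next;
    void (*DeferredRoutine)(void *);
    void *Context;
} KDPC;

/* Toy per-processor DPC queue. */
typedef struct { KDPC *Head, *Tail; int Depth; } DPC_QUEUE;

void InsertQueueDpc(DPC_QUEUE *q, KDPC *dpc, DPC_PRIORITY pri)
{
    dpc->Next = NULL;
    if (pri == DpcHigh) {               /* high priority: front of queue */
        dpc->Next = q->Head;
        q->Head = dpc;
        if (q->Tail == NULL) q->Tail = dpc;
    } else {                            /* low/medium: end of queue */
        if (q->Tail) q->Tail->Next = dpc; else q->Head = dpc;
        q->Tail = dpc;
    }
    q->Depth++;
}
```

When the IRQL later drops below DPC/dispatch level, the queue is drained from the head, so a high-priority DPC runs before earlier medium-priority ones.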

When the processor’s IRQL is about to drop from an IRQL of DPC/dispatch level or higher to

a lower IRQL (APC or passive level), the kernel processes DPCs. Windows ensures that the

IRQL remains at DPC/dispatch level and pulls DPC objects off the current processor’s queue

until the queue is empty (that is, the kernel “drains” the queue), calling each DPC function in

turn. Only when the queue is empty will the kernel let the IRQL drop below DPC/dispatch

level and let regular thread execution continue. DPC processing is depicted in Figure 3-7.

Figure 3-7 Delivering a DPC

DPC priorities can affect system behavior another way. The kernel usually initiates DPC queue

draining with a DPC/dispatch-level interrupt. The kernel generates such an interrupt only if

the DPC is directed at the processor the ISR is requested on and the DPC has a high or

medium priority. If the DPC has a low priority, the kernel requests the interrupt only if the

number of outstanding DPC requests for the processor rises above a threshold or if the number

of DPCs requested on the processor within a time window is low. If a DPC is targeted at a

CPU different from the one on which the ISR is running and the DPC’s priority is high, the

kernel immediately signals the target CPU (by sending it a dispatch IPI) to drain its DPC

queue. If the priority is medium or low, the number of DPCs queued on the target processor

must exceed a threshold for the kernel to trigger a DPC/dispatch interrupt. The system idle

thread also drains the DPC queue for the processor it runs on. Although DPC targeting and


priority levels are flexible, device drivers rarely need to change the default behavior of their

DPC objects. Table 3-1 summarizes the situations that initiate DPC queue draining.
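Those rules can be condensed into a single predicate. This is only a sketch of the decision in Table 3-1: the kernel's actual thresholds and rate bookkeeping are internal, so the queue-depth and request-rate parameters here are stand-ins.

```c
#include <assert.h>

typedef enum { DpcLow, DpcMedium, DpcHigh } DPC_PRIORITY;

/* Returns nonzero if queuing this DPC should trigger a DPC/dispatch
   interrupt (or a dispatch IPI to another CPU), per Table 3-1. */
int ShouldRequestDpcInterrupt(DPC_PRIORITY pri,
                              int targeted_at_isr_cpu,
                              int queue_depth, int max_depth,
                              int request_rate, int min_rate,
                              int target_idle)
{
    if (targeted_at_isr_cpu) {
        if (pri == DpcHigh || pri == DpcMedium)
            return 1;                               /* always */
        /* Low: only if the queue is deep or the request rate is low */
        return queue_depth > max_depth || request_rate < min_rate;
    }
    if (pri == DpcHigh)
        return 1;                                   /* signal target CPU now */
    /* Medium/low on another CPU: queue depth or idleness decides */
    return queue_depth > max_depth || target_idle;
}
```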

Because user-mode threads execute at low IRQL, the chances are good that a DPC will interrupt

the execution of an ordinary user’s thread. DPC routines execute without regard to what

thread is running, meaning that when a DPC routine runs, it can’t assume what process

address space is currently mapped. DPC routines can call kernel functions, but they can’t call

system services, generate page faults, or create or wait for dispatcher objects (explained later

in this chapter). They can, however, access nonpaged system memory addresses, because system

address space is always mapped regardless of what the current process is.

DPCs are provided primarily for device drivers, but the kernel uses them too. The kernel most

frequently uses a DPC to handle quantum expiration. At every tick of the system clock, an

interrupt occurs at clock IRQL. The clock interrupt handler (running at clock IRQL) updates

the system time and then decrements a counter that tracks how long the current thread has

run. When the counter reaches 0, the thread’s time quantum has expired and the kernel

might need to reschedule the processor, a lower-priority task that should be done at DPC/dispatch

IRQL. The clock interrupt handler queues a DPC to initiate thread dispatching and then

finishes its work and lowers the processor’s IRQL. Because the DPC interrupt has a lower priority

than do device interrupts, any pending device interrupts that surface before the clock

interrupt completes are handled before the DPC interrupt occurs.

EXPERIMENT: Monitoring Interrupt and DPC Activity

You can use Process Explorer to monitor interrupt and DPC activity by adding the

Context Switch Delta column and watching the Interrupt and DPC processes. These

are not real processes, but they are shown as processes for convenience and therefore

do not incur context switches. Process Explorer’s context switch count for these

pseudo processes reflects the number of occurrences of each within the previous

refresh interval. You can stimulate interrupt and DPC activity by moving the mouse

quickly around the screen.

Table 3-1 DPC Interrupt Generation Rules

DPC Priority  DPC Targeted at ISR's Processor     DPC Targeted at Another Processor
Low           DPC queue length exceeds maximum    DPC queue length exceeds maximum
              DPC queue length, or DPC request    DPC queue length, or system is idle
              rate is less than minimum DPC
              request rate
Medium        Always                              DPC queue length exceeds maximum
                                                  DPC queue length, or system is idle
High          Always                              Always


You can also trace the execution of specific interrupt service routines and deferred procedure

calls with the built-in event tracing support (described later in this chapter) in

Windows XP Service Pack 2 and Windows Server 2003 Service Pack 1 and later.

1. Start capturing events by typing the following command:

tracelog -start -f kernel.etl -b 64 -UsePerfCounter -eflag 8 0x307 0x4084 0 0 0 0 0 0

2. Stop capturing events by typing:

tracelog -stop

3. Generate reports for the event capture by typing:

tracerpt kernel.etl -df -o -report

This will generate two files: workload.txt and dumpfile.csv.

4. Open workload.txt, and you will see summaries of the time spent in ISRs and

DPCs by each driver type.

5. Open the file dumpfile.csv created in step 3; search for lines with "DPC" or "ISR"

in the second value. For example, the following three lines from a dumpfile.csv

generated using the above commands show a timer DPC, a DPC, and an ISR:

PerfInfo, TimerDPC, 0xFFFFFFFF, 127383953645422825, 0,

0, 127383953645421500, 0xFB03A385, 0, 0

PerfInfo, DPC, 0xFFFFFFFF, 127383953645424040, 0,

0, 127383953645421394, 0x804DC87D, 0, 0

PerfInfo, ISR, 0xFFFFFFFF, 127383953645470903, 0,

0, 127383953645468696, 0xFB48D5E0, 0, 0, 0

Doing an “ln” command in the kernel debugger on the start address in each event

record (the eighth value on each line) shows the name of the function that executed the DPC or ISR:


lkd> ln 0xFB03A385

(fb03a385) rdbss!RxTimerDispatch | (fb03a41e) rdbss!RxpWorkerThreadDispatcher

lkd> ln 0x804DC87D

(804dc87d) nt!KiTimerExpiration | (804dc93b) nt!KeSetTimerEx

lkd> ln 0xFB48D5E0

(fb48d5e0) atapi!IdePortInterrupt | (fb48d622) atapi!IdeCheckEmptyChannel

The first is a DPC for a timer expiration for a timer queued by the file system redirector

client driver. The second is a DPC for a generic timer expiration. The third address is the

address of the ISR for the ATAPI port driver. For more information, see http://


Asynchronous Procedure Call (APC) Interrupts Asynchronous procedure calls (APCs)

provide a way for user programs and system code to execute in the context of a particular user

thread (and hence a particular process address space). Because APCs are queued to execute in

the context of a particular thread and run at an IRQL less than DPC/dispatch level, they don’t

operate under the same restrictions as a DPC. An APC routine can acquire resources (objects),

wait for object handles, incur page faults, and call system services.

APCs are described by a kernel control object, called an APC object. APCs waiting to execute

reside in a kernel-managed APC queue. Unlike the DPC queue, which is systemwide, the APC

queue is thread-specific—each thread has its own APC queue. When asked to queue an APC,

the kernel inserts it into the queue belonging to the thread that will execute the APC routine.

The kernel, in turn, requests a software interrupt at APC level, and when the thread eventually

begins running, it executes the APC.

There are two kinds of APCs: kernel mode and user mode. Kernel-mode APCs don’t require

“permission” from a target thread to run in that thread’s context, while user-mode APCs do.

Kernel-mode APCs interrupt a thread and execute a procedure without the thread’s intervention

or consent. There are also two types of kernel-mode APCs: normal and special. A thread

can disable both types by raising the IRQL to APC_LEVEL or by calling KeEnterGuardedRegion,

which was introduced in Windows Server 2003. KeEnterGuardedRegion disables

APC delivery by setting the SpecialApcDisable field in the calling thread’s KTHREAD structure

(described further in Chapter 6). A thread can disable normal APCs only by calling KeEnterCriticalRegion,

which sets the KernelApcDisable field in the thread’s KTHREAD structure.
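The relationship between the two disable fields can be sketched in C. This is a toy model under stated assumptions: the real routines take no thread parameter (they act on the current thread), and the kernel maintains these fields as counters with its own conventions; only the enable/disable relationships shown here follow the text.

```c
#include <assert.h>

/* Toy KTHREAD carrying the two APC-disable fields described above. */
typedef struct {
    int SpecialApcDisable;   /* set by guarded regions: blocks all kernel APCs */
    int KernelApcDisable;    /* set by critical regions: blocks normal kernel APCs */
} KTHREAD;

void KeEnterGuardedRegion(KTHREAD *t)  { t->SpecialApcDisable++; }
void KeLeaveGuardedRegion(KTHREAD *t)  { t->SpecialApcDisable--; }
void KeEnterCriticalRegion(KTHREAD *t) { t->KernelApcDisable++; }
void KeLeaveCriticalRegion(KTHREAD *t) { t->KernelApcDisable--; }

/* Normal kernel APCs are blocked by either region... */
int NormalKernelApcDeliverable(const KTHREAD *t)
{ return t->KernelApcDisable == 0 && t->SpecialApcDisable == 0; }

/* ...but special kernel APCs are blocked only by a guarded region. */
int SpecialKernelApcDeliverable(const KTHREAD *t)
{ return t->SpecialApcDisable == 0; }
```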

The executive uses kernel-mode APCs to perform operating system work that must be completed

within the address space (in the context) of a particular thread. It can use special kernel-

mode APCs to direct a thread to stop executing an interruptible system service, for

example, or to record the results of an asynchronous I/O operation in a thread’s address

space. Environment subsystems use special kernel-mode APCs to make a thread suspend or

terminate itself or to get or set its user-mode execution context. The POSIX subsystem uses

kernel-mode APCs to emulate the delivery of POSIX signals to POSIX processes.

Device drivers also use kernel-mode APCs. For example, if an I/O operation is initiated and a

thread goes into a wait state, another thread in another process can be scheduled to run.

When the device finishes transferring data, the I/O system must somehow get back into the

context of the thread that initiated the I/O so that it can copy the results of the I/O operation

to the buffer in the address space of the process containing that thread. The I/O system uses

a special kernel-mode APC to perform this action. (The use of APCs in the I/O system is discussed

in more detail in Chapter 9.)


Several Windows APIs, such as ReadFileEx, WriteFileEx, and QueueUserAPC, use user-mode

APCs. For example, the ReadFileEx and WriteFileEx functions allow the caller to specify a completion

routine to be called when the I/O operation finishes. The I/O completion is implemented

by queueing an APC to the thread that issued the I/O. However, the callback to the

completion routine doesn’t necessarily take place when the APC is queued because user-mode

APCs are delivered to a thread only when it’s in an alertable wait state. A thread can enter a wait

state either by waiting for an object handle and specifying that its wait is alertable (with the

Windows WaitForMultipleObjectsEx function) or by testing directly whether it has a pending

APC (using SleepEx). In both cases, if a user-mode APC is pending, the kernel interrupts

(alerts) the thread, transfers control to the APC routine, and resumes the thread’s execution

when the APC routine completes. Unlike kernel-mode APCs, which execute at APC level, usermode

APCs execute at passive level.
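The "deliver only at an alertable wait" behavior can be illustrated with a single-threaded toy model in C. The names are illustrative, not the Windows API; AlertableWait plays the role of an alertable SleepEx, which on Windows reports that an APC ran by returning WAIT_IO_COMPLETION. Delivery order is not modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct _APC {
    struct _APC *Next;
    void (*Routine)(void *);
    void *Context;
} APC;

typedef struct { APC *ApcQueue; } THREAD;   /* per-thread APC queue */

/* Queue an APC to the thread; it does NOT run yet. */
void QueueApc(THREAD *t, APC *apc)
{ apc->Next = t->ApcQueue; t->ApcQueue = apc; }

/* Models entering an alertable wait: pending user-mode APCs are
   drained; returns nonzero if any APC routine was delivered. */
int AlertableWait(THREAD *t)
{
    int delivered = 0;
    while (t->ApcQueue) {
        APC *apc = t->ApcQueue;
        t->ApcQueue = apc->Next;
        apc->Routine(apc->Context);
        delivered = 1;
    }
    return delivered;
}

/* Sample APC routine: counts its invocations. */
static void CountApc(void *ctx) { (*(int *)ctx)++; }
```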

APC delivery can reorder the wait queues—the lists of which threads are waiting for what, and

in what order they are waiting. (Wait resolution is described in the section “Low-IRQL Synchronization”

later in this chapter.) If the thread is in a wait state when an APC is delivered,

after the APC routine completes, the wait is reissued or reexecuted. If the wait still isn’t

resolved, the thread returns to the wait state, but now it will be at the end of the list of objects

it’s waiting for. For example, because APCs are used to suspend a thread from execution, if the

thread is waiting for any objects, its wait will be removed until the thread is resumed, after

which that thread will be at the end of the list of threads waiting to access the objects it was

waiting for.

Exception Dispatching

In contrast to interrupts, which can occur at any time, exceptions are conditions that result

directly from the execution of the program that is running. Windows introduced a facility

known as structured exception handling, which allows applications to gain control when exceptions

occur. The application can then fix the condition and return to the place the exception

occurred, unwind the stack (thus terminating execution of the subroutine that raised the

exception), or declare back to the system that the exception isn’t recognized and the system

should continue searching for an exception handler that might process the exception. This

section assumes you’re familiar with the basic concepts behind Windows structured exception

handling—if you’re not, you should read the overview in the Windows API reference documentation

on the Platform SDK or chapters 23 through 25 in Jeffrey Richter’s book

Programming Applications for Microsoft Windows (Fourth Edition, Microsoft Press, 2000) before

proceeding. Keep in mind that although exception handling is made accessible through language

extensions (for example, the __try construct in Microsoft Visual C++), it is a system

mechanism and hence isn’t language-specific. Other examples of consumers of Windows

exception handling include C++ and Java exceptions.


On the x86, all exceptions have predefined interrupt numbers that directly correspond to the

entry in the IDT that points to the trap handler for a particular exception. Table 3-2 shows

x86-defined exceptions and their assigned interrupt numbers. Because the first entries of the

IDT are used for exceptions, hardware interrupts are assigned entries later in the table, as

mentioned earlier.

All exceptions, except those simple enough to be resolved by the trap handler, are serviced by

a kernel module called the exception dispatcher. The exception dispatcher’s job is to find an

exception handler that can “dispose of” the exception. Examples of architecture-independent

exceptions that the kernel defines include memory access violations, integer divide-by-zero,

integer overflow, floating-point exceptions, and debugger breakpoints. For a complete list of

architecture-independent exceptions, consult the Windows API reference documentation.

The kernel traps and handles some of these exceptions transparently to user programs. For

example, encountering a breakpoint while executing a program being debugged generates an

exception, which the kernel handles by calling the debugger. The kernel handles certain other

exceptions by returning an unsuccessful status code to the caller.

A few exceptions are allowed to filter back, untouched, to user mode. For example, a memory

access violation or an arithmetic overflow generates an exception that the operating system

doesn’t handle. An environment subsystem can establish frame-based exception handlers to

Table 3-2 x86 Exceptions and Their Interrupt Numbers

Interrupt Number   Exception
0                  Divide Error
1                  Debug Trap
2                  NMI/NPX Error
3                  Breakpoint
4                  Overflow
5                  BOUND/Print Screen
6                  Invalid Opcode
7                  NPX Not Available
8                  Double Exception
9                  NPX Segment Overrun
A                  Invalid Task State Segment (TSS)
B                  Segment Not Present
C                  Stack Fault
D                  General Protection
E                  Page Fault
F                  Intel Reserved
10                 Floating Point
11                 Alignment Check

deal with these exceptions. The term frame-based refers to an exception handler’s association

with a particular procedure activation. When a procedure is invoked, a stack frame representing

that activation of the procedure is pushed onto the stack. A stack frame can have one or

more exception handlers associated with it, each of which protects a particular block of code

in the source program. When an exception occurs, the kernel searches for an exception handler

associated with the current stack frame. If none exists, the kernel searches for an exception

handler associated with the previous stack frame, and so on, until it finds a frame-based

exception handler. If no exception handler is found, the kernel calls its own default exception handler.
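The search described above can be sketched as a walk along a chain of per-frame handler registrations. Everything below is an illustrative model, not the real implementation: on x86, Windows actually links EXCEPTION_REGISTRATION records from fs:[0], and all type and function names here are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Disposition a frame handler can return for an exception. */
typedef enum { DISP_CONTINUE_SEARCH, DISP_HANDLED } Disposition;

/* One stack frame's registration: a link to the next-outer frame and an
   optional handler protecting that frame's block of code. */
typedef struct Frame {
    struct Frame *prev;
    Disposition (*handler)(int code);
} Frame;

/* Demo handlers for the sketch. */
static Disposition pass_on(int code)    { (void)code; return DISP_CONTINUE_SEARCH; }
static Disposition handle_all(int code) { (void)code; return DISP_HANDLED; }

/* Walk from the current (innermost) frame outward until some handler
   disposes of the exception. Returns 1 if a frame handled it; 0 means
   the kernel would fall back to its default exception handler. */
static int dispatch_exception(Frame *innermost, int code) {
    for (Frame *f = innermost; f != NULL; f = f->prev)
        if (f->handler != NULL && f->handler(code) == DISP_HANDLED)
            return 1;
    return 0;
}
```

An inner frame with no interested handler simply defers to an outer frame's handler, which is exactly the searching order the text describes.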


When an exception occurs, whether it is explicitly raised by software or implicitly raised by

hardware, a chain of events begins in the kernel. The CPU hardware transfers control to the

kernel trap handler, which creates a trap frame (as it does when an interrupt occurs). The trap

frame allows the system to resume where it left off if the exception is resolved. The trap handler

also creates an exception record that contains the reason for the exception and other pertinent information.


If the exception occurred in kernel mode, the exception dispatcher simply calls a routine to

locate a frame-based exception handler that will handle the exception. Because unhandled

kernel-mode exceptions are considered fatal operating system errors, you can assume that the

dispatcher always finds an exception handler.

If the exception occurred in user mode, the exception dispatcher does something more elaborate.

As you’ll see in Chapter 6, the Windows subsystem has a debugger port and an exception

port to receive notification of user-mode exceptions in Windows processes. The kernel

uses these in its default exception handling, as illustrated in Figure 3-8.

Figure 3-8 Dispatching an exception










[Figure 3-8 shows the flow: the trap handler and exception dispatcher notify the debugger (first chance), search the frame-based handlers via a function call, notify the debugger again (second chance), then the environment subsystem's exception port, and finally invoke the kernel default handler.]



Debugger breakpoints are common sources of exceptions. Therefore, the first action the

exception dispatcher takes is to see whether the process that incurred the exception has an

associated debugger process. If it does and the system is Windows 2000, the exception dispatcher

sends the first-chance debug message via an LPC to the debugger port associated with

the process that incurred the exception. The LPC message is sent to the session manager process,

which then dispatches it to the appropriate debugger process. On Windows XP and Windows

Server 2003, the exception dispatcher sends a debugger object message to the debug

object associated with the process (which internally the system refers to as a port).

If the process has no debugger process attached, or if the debugger doesn’t handle the exception,

the exception dispatcher switches into user mode, copies the trap frame to the user stack

formatted as a CONTEXT data structure (documented in the Platform SDK), and calls a routine

to find a frame-based exception handler. If none is found, or if none handles the exception,

the exception dispatcher switches back into kernel mode and calls the debugger again to

allow the user to do more debugging. (This is called the second-chance notification.)

If the debugger isn’t running and no frame-based handlers are found, the kernel sends a message

to the exception port associated with the thread’s process. This exception port, if one

exists, was registered by the environment subsystem that controls this thread. The exception

port gives the environment subsystem, which presumably is listening at the port, the opportunity

to translate the exception into an environment-specific signal or exception. CSRSS (Client/

Server Run-Time Subsystem) simply presents a message box notifying the user of the fault

and terminates the process, and when POSIX gets a message from the kernel that one of its

threads generated an exception, the POSIX subsystem sends a POSIX-style signal to the

thread that caused the exception. However, if the kernel progresses this far in processing the

exception and the subsystem doesn’t handle the exception, the kernel executes a default

exception handler that simply terminates the process whose thread caused the exception.
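The user-mode dispatch order just described can be compressed into a small decision function. This is only a schematic of the text above; the names are invented, the real dispatcher round-trips through trap frames, CONTEXT records, and LPC messages, and a debugger that declines the second-chance notification would in reality still fall through to the later stages.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    HANDLED_BY_DEBUGGER_FIRST_CHANCE,
    HANDLED_BY_FRAME_HANDLER,
    HANDLED_BY_DEBUGGER_SECOND_CHANCE,
    HANDLED_BY_SUBSYSTEM_PORT,
    TERMINATED_BY_KERNEL_DEFAULT
} Outcome;

/* Simplified model of the user-mode exception dispatch sequence. */
static Outcome dispatch_user_exception(bool debugger_attached,
                                       bool debugger_handles,
                                       bool frame_handler_found,
                                       bool subsystem_handles) {
    if (debugger_attached && debugger_handles)
        return HANDLED_BY_DEBUGGER_FIRST_CHANCE;   /* first-chance message   */
    if (frame_handler_found)
        return HANDLED_BY_FRAME_HANDLER;           /* frame-based SEH search */
    if (debugger_attached)
        return HANDLED_BY_DEBUGGER_SECOND_CHANCE;  /* second-chance message  */
    if (subsystem_handles)
        return HANDLED_BY_SUBSYSTEM_PORT;          /* e.g., a POSIX signal   */
    return TERMINATED_BY_KERNEL_DEFAULT;           /* process is terminated  */
}
```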

Unhandled Exceptions

All Windows threads have an exception handler declared at the top of the stack that processes

unhandled exceptions. This exception handler is declared in the internal Windows start-of-process

or start-of-thread function. The start-of-process function runs when the first thread in a

process begins execution. It calls the main entry point in the image. The start-of-thread function

runs when a user creates additional threads. It calls the user-supplied thread start routine

specified in the CreateThread call.


EXPERIMENT: Viewing the Real User Start Address for Windows Threads


The fact that each Windows thread begins execution in a system-supplied function (and

not the user-supplied function) explains why the start address for thread 0 is the same

for every Windows process in the system (and why the start addresses for secondary

threads are also the same). The start address for thread 0 in Windows processes is the

Windows start-of-process function; the start address for any other threads would be the

Windows start-of-thread function. To see the user-supplied function address, use the

Tlist utility in the Windows Support Tools. Type tlist process-name or tlist process-id

to get the detailed process output that includes this information. For example, compare

the thread start addresses for the Windows Explorer process as reported by Pstat (in the

Platform SDK) and Tlist:

C:\> pstat


pid:3f8 pri: 8 Hnd: 329 Pf: 80043 Ws: 4620K explorer.exe

tid pri Ctx Swtch StrtAddr User Time Kernel Time State

7c 9 16442 77E878C1 0:00:01.241 0:00:01.251 Wait:UserRequest

42c 11 157888 77E92C50 0:00:07.110 0:00:34.309 Wait:UserRequest

44c 8 6357 77E92C50 0:00:00.070 0:00:00.140 Wait:UserRequest

1cc 8 3318 77E92C50 0:00:00.030 0:00:00.070 Wait:DelayExecution


C:\> tlist explorer

1016 explorer.exe Program Manager

CWD: C:\

CmdLine: Explorer.exe

VirtualSize: 25348 KB PeakVirtualSize: 31052 KB

WorkingSetSize: 1804 KB PeakWorkingSetSize: 3276 KB

NumberOfThreads: 4

149 Win32StartAddr:0x01009dbd LastErr:0x0000007e State:Waiting

86 Win32StartAddr:0x77c5d4a5 LastErr:0x00000000 State:Waiting

62 Win32StartAddr:0x00000977 LastErr:0x00000000 State:Waiting

179 Win32StartAddr:0x0100d8d4 LastErr:0x00000002 State:Waiting

The start address of thread 0 reported by Pstat is the internal Windows start-of-process

function; the start addresses for threads 1 through 3 are the internal Windows start-of-thread
functions. Tlist, on the other hand, shows the user-supplied Windows start

address (the user function called by the internal Windows start function).


Because most threads in Windows processes start at one of the system-supplied wrapper

functions, Process Explorer, when displaying the start address of threads in a process,

skips the initial call frame that represents the wrapper function and instead shows

the second frame on the stack. For example, notice the thread start address of a process

running Notepad.exe:

Process Explorer does display the complete call hierarchy when it displays the call stack.

Notice the following results when the Stack button is clicked:

Line 12 in the preceding figure is the first frame on the stack—the start of the process

wrapper. The second frame (line 11) is the main entry point into Notepad.exe.


The generic code for these internal start functions is shown here:

void Win32StartOfProcess(
    LPTHREAD_START_ROUTINE lpStartAddr,
    LPVOID lpvThreadParm){
    __try {
        DWORD dwThreadExitCode = lpStartAddr(lpvThreadParm);
        ExitThread(dwThreadExitCode);
    } __except(UnhandledExceptionFilter(
               GetExceptionInformation())) {
        ExitProcess(GetExceptionCode());
    }
}

Notice that the Windows unhandled exception filter is called if the thread has an exception

that it doesn’t handle. The purpose of this function is to provide the system-defined behavior

for what to do when an exception is not handled, which is based on the contents of the

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug registry key. There

are two important values: Auto and Debugger. Auto tells the unhandled exception filter

whether to automatically run the debugger or ask the user what to do. By default, it is set to 1,

which means that it will launch the debugger automatically. However, installing development

tools such as Visual Studio changes this to 0. The Debugger value is a string that points to the

path of the debugger executable to run in the case of an unhandled exception.
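The Auto value's effect can be modeled roughly as follows. The function below is hypothetical and exists only to make the policy concrete; the real UnhandledExceptionFilter reads the AeDebug key and does considerably more work.

```c
#include <assert.h>
#include <string.h>

typedef enum { ACTION_LAUNCH_DEBUGGER, ACTION_ASK_USER } FilterAction;

/* Sketch of the AeDebug Auto policy: "1" (the default) launches the
   program named by the Debugger value automatically; "0" (set by tools
   such as Visual Studio) asks the user what to do first. */
static FilterAction unhandled_filter_policy(const char *auto_value) {
    return strcmp(auto_value, "1") == 0 ? ACTION_LAUNCH_DEBUGGER
                                        : ACTION_ASK_USER;
}
```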

The default debugger is \Windows\System32\Drwtsn32.exe (Dr. Watson), which isn’t really

a debugger but rather a postmortem tool that captures the state of the application “crash” and

records it in a log file (Drwtsn32.log) and a process crash dump file (User.dmp), both found

by default in the \Documents And Settings\All Users\Documents\DrWatson folder. To see

(or modify) the configuration for Dr. Watson, run it interactively—it displays a window with

the current settings, as shown in Figure 3-9.

Figure 3-9 Windows 2000 Dr. Watson default settings


The log file contains basic information such as the exception code, the name of the image that

failed, a list of loaded DLLs, and a stack and instruction trace for the thread that incurred the

exception. For a detailed description of the contents of the log file, run Dr. Watson and click

the Help button shown in Figure 3-9.

The crash dump file contains the private pages in the process at the time of the exception.

(The file doesn’t include code pages from EXEs or DLLs.) This file can be opened by WinDbg,

the Windows debugger that comes with the Debugging Tools package, or by Visual Studio

2003 and later. Because the User.dmp file is overwritten each time a process crashes, unless

you rename or copy the file after each process crash, you’ll have only the latest one on your system.


On Windows 2000 Professional systems, visual notification is turned on by default. The message

box shown in Figure 3-10 is displayed by Dr. Watson after it generates the crash dump

and records information in its log file.

Figure 3-10 Windows 2000 Dr. Watson error message

The Dr. Watson process remains until the message box is dismissed, which is why on Windows

2000 Server systems visual notification is turned off by default. This default is used

because if a server application fails, there is usually nobody at the console to see it and dismiss

the message box. Instead, server applications should log errors to the Windows event log.

On Windows 2000, if the Auto value is set to zero, the message box shown in Figure 3-11 is displayed.


Figure 3-11 Windows 2000 Unhandled exception message

If the OK button is clicked, the process exits. If Cancel is clicked, the system-defined debugger
process (specified by the Debugger value in the registry path referred to earlier) is launched.


EXPERIMENT: Unhandled Exceptions

To see a sample Dr. Watson log file, run the program Accvio.exe, which
you can download from www.sysinternals.com. This program

generates a memory access violation by attempting to write to address 0, which is always

an invalid address in Windows processes. (See Table 7-6 in Chapter 7.)

1. Run the Registry Editor, and locate HKLM\SOFTWARE\Microsoft\Windows
NT\CurrentVersion\AeDebug.
2. If the Debugger value is “drwtsn32 -p %ld -e %ld -g”, your system is set up to run

Dr. Watson as the default debugger. Proceed to step 4.

3. If the value of Debugger was not set up to run Drwtsn32.exe, you can still test

Dr. Watson by temporarily installing it and then restoring your previous debugger settings:


a. Save the current value somewhere (for example, in a Notepad file or in the

current paste buffer).

b. Select Run from the taskbar Start menu, and then type drwtsn32 -i. (This

initializes the Debugger field to run Dr. Watson.)

4. Run the test program Accvio.exe.

5. You should see one of the message boxes described earlier (depending on which

version of Windows you are running).

6. If you have the default Dr. Watson settings, you should now be able to examine the

log file and dump file in the dump file directory. To see the configuration settings

for Dr. Watson, run drwtsn32 with no additional arguments. (Select Run from the

Start menu, and then type drwtsn32.)

7. Alternatively, in the list of Application Errors displayed by Dr. Watson, click on the

last entry and then click the View button—the portion of the Dr. Watson log file

containing the details of the access violation from Accvio.exe will be displayed.

(For details on the log file format, open the help in Dr. Watson and select Dr. Watson

Log File Overview.)

8. If the original value of Debugger wasn’t the default Dr. Watson settings, restore the
value you saved in step 3.

As another experiment, try changing the value of Debugger to another program, such as

Notepad.exe (Notepad editor) or Sol.exe (Solitaire). Rerun Accvio.exe, and notice that

whatever program is specified in the Debugger value is run—that is, there’s no validation

that the program defined in Debugger is actually a debugger. Make sure you restore your

registry settings. (As noted in step 3b, to reset to the system default Dr. Watson settings,

type drwtsn32 -i in the Run dialog box or at a command prompt.)


Windows Error Reporting

Windows XP and Windows Server 2003 have a new, more sophisticated error-reporting

mechanism called Windows Error Reporting that automates the submission of both user-mode

process crashes as well as kernel-mode system crashes. (For a description of how this

applies to system crashes, see Chapter 14).

Windows Error Reporting can be configured by going to My Computer, selecting Properties,

Advanced, and then Error Reporting (which brings up the dialog box shown in Figure 3-12)

or by local or domain group policy settings under System, Error Reporting. These settings are

stored in the registry under the key HKLM\Software\Microsoft\PCHealth\ErrorReporting.

Figure 3-12 Error Reporting Configuration dialog box

When an unhandled exception is caught by the unhandled exception filter (described in the

previous section), an initial check is made to see whether or not to initiate Windows Error

Reporting. If the registry value HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\

AeDebug\Auto is set to zero or the Debugger string contains the text “Drwtsn32”, the

unhandled exception filter loads \Windows\System32\Faultrep.dll into the failing process

and calls its ReportFault function. ReportFault then checks the error-reporting configuration

stored under HKLM\Software\Microsoft\PCHealth\ErrorReporting to see whether this process

crash should be reported, and if so, how. In the normal case, ReportFault creates a process

running \Windows\System32\Dwwin.exe, which displays a message box announcing the process

crash along with an option to submit the error report to Microsoft as seen in Figure 3-13.


Figure 3-13 Windows Error Reporting dialog box

If the Send Error Report button is clicked, the error report (a minidump and a text file with

details on the DLL version numbers loaded in the process) is sent to Microsoft’s online crash

analysis server, watson.microsoft.com. (Unlike kernel-mode system crashes, in this situation

there is no way to find out whether a solution is available at the time of the report submission.)

Then the unhandled exception filter creates a process to run the system-defined debugger (normally

Drwtsn32.exe), which by default creates its own dump file and log entry. Unlike Windows

2000, the dump file is a minidump, not a full dump. So, in the case where a full process memory

dump is needed to debug a failing application, you can change the configuration of Dr. Watson

by running it with no command-line arguments as described in the previous section.

In environments where systems are not connected to the Internet or where the administrator

wants to control which error reports are submitted to Microsoft, the destination for the error

report can be configured to be an internal file server. Microsoft provides to qualified customers

a tool set called Corporate Error Reporting that understands the directory structure created

by Windows Error Reporting and provides the administrator with the option to take

selective error reports and submit them to Microsoft. (For more information, see Microsoft’s Corporate Error Reporting documentation.)

System Service Dispatching

As Figure 3-1 illustrated, the kernel’s trap handlers dispatch interrupts, exceptions, and system

service calls. In the preceding sections, you’ve seen how interrupt and exception handling

work; in this section, you’ll learn about system services. A system service dispatch is triggered

as a result of executing an instruction assigned to system service dispatching. The instruction

that Windows uses for system service dispatching depends on the processor on which it’s executing.

32-Bit System Service Dispatching

On x86 processors prior to the Pentium II, Windows uses the int 0x2e instruction (46 decimal),

which results in a trap. Windows fills in entry 46 in the IDT to point to the system service


dispatcher. (Refer to Table 3-1.) The trap causes the executing thread to transition into kernel

mode and enter the system service dispatcher. A numeric argument passed in the EAX processor

register indicates the system service number being requested. The EBX register points to

the list of parameters the caller passes to the system service.

On x86 Pentium II processors and higher, Windows uses the special sysenter instruction,

which Intel defined specifically for fast system service dispatches. To support the instruction,

Windows stores at boot time the address of the kernel’s system service dispatcher routine in

a register associated with the instruction. The execution of the instruction causes the change

to kernel mode and execution of the system service dispatcher. The system service number is

passed in the EAX processor register, and the EDX register points to the list of caller arguments.

To return to user mode, the system service dispatcher usually executes the sysexit

instruction. (In some cases, like when the single-step flag is enabled on the processor, the system

service dispatcher uses the iretd instruction instead.)

On K6 and higher 32-bit AMD processors, Windows uses the special syscall instruction, which

functions similar to the x86 sysenter instruction, with Windows configuring a syscall-associated

processor register with the address of the kernel’s system service dispatcher. The system

call number is passed in the EAX register, and the stack stores the caller arguments. After completing

the dispatch, the kernel executes the sysret instruction.

At boot time, Windows detects the type of processor on which it’s executing and sets up the

appropriate system call code to be used. The system service code for NtReadFile in user mode

looks like this:

ntdll!NtReadFile:
77f5bfa8 b8b7000000 mov eax,0xb7

77f5bfad ba0003fe7f mov edx,0x7ffe0300

77f5bfb2 ffd2 call edx

77f5bfb4 c22400 ret 0x24

The system service number is 0xb7 (183 in decimal) and the call instruction executes the system

service dispatch code set up by the kernel, which in this example is at address

0x7ffe0300. Because this was taken from a Pentium M, it uses sysenter:

SharedUserData!SystemCallStub:
7ffe0300 8bd4 mov edx,esp

7ffe0302 0f34 sysenter

7ffe0304 c3 ret

64-Bit System Service Dispatching

On the x64 architecture, Windows uses the syscall instruction, which functions like the AMD

K6’s syscall instruction, for system service dispatching, passing the system call number in the

EAX register, the first four parameters in registers, and any parameters beyond those four on

the stack:


ntdll!NtReadFile:
00000000`77f9fc60 4c8bd1 mov r10,rcx

00000000`77f9fc63 b8bf000000 mov eax,0xbf

00000000`77f9fc68 0f05 syscall

00000000`77f9fc6a c3 ret

On the IA64 architecture, Windows uses the epc (Enter Privileged Mode) instruction. The first

eight system call arguments are passed in registers, and the rest are passed on the stack.

Kernel-Mode System Service Dispatching

As Figure 3-14 illustrates, the kernel uses this argument to locate the system service information

in the system service dispatch table. This table is similar to the interrupt dispatch table

described earlier in the chapter except that each entry contains a pointer to a system service

rather than to an interrupt handling routine.

Note System service numbers can change between service packs—Microsoft occasionally

adds or removes system services, and the system service numbers are generated automatically

as part of a kernel compile.

Figure 3-14 System service exceptions

The system service dispatcher, KiSystemService, copies the caller’s arguments from the thread’s

user-mode stack to its kernel-mode stack (so that the user can’t change the arguments as the

kernel is accessing them), and then executes the system service. If the arguments passed to a

system service point to buffers in user space, these buffers must be probed for accessibility

before kernel-mode code can copy data to or from them.
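The capture-then-probe discipline can be sketched as follows. This is illustrative only: the constant reflects the default 2-GB x86 user/kernel split, and the function and names are invented, not the real ProbeForRead/ProbeForWrite machinery.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Default x86 user-space boundary (everything at or above is kernel space). */
#define USER_SPACE_LIMIT 0x80000000u

/* Sketch: a system service must verify a caller-supplied buffer lies
   entirely in user space, then work on a private kernel-mode copy so the
   caller can't modify the data while the kernel is using it. */
static int probe_and_capture(uint32_t user_addr, uint32_t length,
                             const void *user_data, void *kernel_copy) {
    /* Reject buffers that start in kernel space or extend into it. */
    if (user_addr >= USER_SPACE_LIMIT ||
        length > USER_SPACE_LIMIT - user_addr)
        return 0;                    /* the real code raises an access violation */
    memcpy(kernel_copy, user_data, length);  /* capture to the kernel-mode stack */
    return 1;
}
```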

As you’ll see in Chapter 6, each thread has a pointer to its system service table. Windows has

two built-in system service tables, but up to four are supported. The system service dispatcher

determines which table contains the requested service by interpreting a 2-bit field in the 32-


service call



dispatcher System service 2

System service

dispatch table

Kernel mode

User mode






• • •

Copyrighted material.

122 Microsoft Windows Internals, Fourth Edition

bit system service number as a table index. The low 12 bits of the system service number serve

as the index into the table specified by the table index. The fields are shown in Figure 3-15.

Figure 3-15 System service number to system service translation
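The translation in Figure 3-15 amounts to two bit-field extractions. The helper names below are invented for illustration; only the bit positions come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 13-12 of the 32-bit system service number select one of up to
   four service tables; bits 11-0 index into the selected table. */
static unsigned table_index(uint32_t service_number) {
    return (service_number >> 12) & 0x3;
}

static unsigned service_index(uint32_t service_number) {
    return service_number & 0xFFF;
}
```

For example, NtReadFile’s number 0xB7 selects table 0 (the native table), entry 0xB7, while a number such as 0x1234 would select table 1, entry 0x234.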

Service Descriptor Tables

A primary default array table, KeServiceDescriptorTable, defines the core executive system services

implemented in Ntoskrnl.exe. The other table array, KeServiceDescriptorTableShadow,

includes the Windows USER and GDI services implemented in the kernel-mode part of the

Windows subsystem, Win32k.sys. The first time a Windows thread calls a Windows USER or

GDI service, the address of the thread’s system service table is changed to point to a table that

includes the Windows USER and GDI services. The KeAddSystemServiceTable function allows

Win32k.sys and other device drivers to add system service tables. If you install Internet Information

Services (IIS) on Windows 2000, its support driver (Spud.sys) upon loading defines

an additional service table, leaving only one left for definition by third parties. With the exception

of the Win32k.sys service table, a service table added with KeAddSystemServiceTable is

copied into both the KeServiceDescriptorTable array and the KeServiceDescriptorTableShadow

array. Windows supports the addition of only two system service tables beyond the core and

Win32 tables.
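The four-slot limit can be pictured with a toy model: the native and Win32k tables are built in, and registration in the style of KeAddSystemServiceTable can fill at most the two remaining slots. The structure and function here are invented for illustration, not the real kernel layout.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SERVICE_TABLES 4

typedef struct {
    const void *const *base;   /* array of service routine pointers */
    unsigned limit;            /* number of services in the table   */
} ServiceTable;

static const void *const builtin_stub[1] = { NULL };

/* Slots 0 and 1 stand in for the core executive and Win32k tables. */
static ServiceTable tables[MAX_SERVICE_TABLES] = {
    { builtin_stub, 1 },       /* slot 0: core executive (ntoskrnl)  */
    { builtin_stub, 1 },       /* slot 1: USER/GDI (Win32k.sys)      */
};

/* Returns 1 on success, 0 once both spare slots are taken. */
static int add_service_table(const void *const *base, unsigned limit) {
    for (int i = 2; i < MAX_SERVICE_TABLES; i++) {
        if (tables[i].base == NULL) {
            tables[i].base = base;
            tables[i].limit = limit;
            return 1;
        }
    }
    return 0;
}
```

A third registration fails, mirroring the text’s point that only two tables beyond the core and Win32 tables can ever be added.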

[Figure 3-15 shows the layout: bits 13-12 of the system service number are the table index, and bits 11-0 are the index into the selected table. An accompanying diagram shows KeServiceDescriptorTable (native API, plus added tables such as the IIS Spud driver's) and KeServiceDescriptorTableShadow (native API, Win32k.sys API, and added tables).]


Note Windows Server 2003 service pack 1 and higher does not support adding additional

system service tables beyond that added by Win32k.sys, so adding system service tables is not

a way to extend the functionality of those systems.

The system service dispatch instructions for Windows executive services exist in the system

library Ntdll.dll. Subsystem DLLs call functions in Ntdll to implement their documented

functions. The exception is Windows USER and GDI functions, in which the system service

dispatch instructions are implemented directly in User32.dll and Gdi32.dll—there is no

Ntdll.dll involved. These two cases are shown in Figure 3-16.

Figure 3-16 System service dispatching

As shown in Figure 3-16, the Windows WriteFile function in Kernel32.dll calls the NtWriteFile

function in Ntdll.dll, which in turn executes the appropriate instruction to cause a system service

trap, passing the system service number representing NtWriteFile. The system service dispatcher

(function KiSystemService in Ntoskrnl.exe) then calls the real NtWriteFile to process



the I/O request. For Windows USER and GDI functions, the system service dispatch calls

functions in the loadable kernel-mode part of the Windows subsystem, Win32k.sys.

EXPERIMENT: Viewing System Service Activity

You can monitor system service activity by watching the System Calls/Sec performance

counter in the System object. Run the Performance tool, and in chart view, click the Add

button to add a counter to the chart; select the System object, select the System Calls/

Sec counter, and then click the Add button to add the counter to the chart.

Object Manager

As mentioned in Chapter 2, Windows implements an object model to provide consistent and

secure access to the various internal services implemented in the executive. This section

describes the Windows object manager, the executive component responsible for creating,

deleting, protecting, and tracking objects. The object manager centralizes resource control

operations that otherwise would be scattered throughout the operating system. It was

designed to meet the goals listed later in this section.

EXPERIMENT: Exploring the Object Manager

Throughout this section, you’ll find experiments that show you how to peer into the

object manager database. These experiments use the following tools, which you should

become familiar with if you aren’t already:

Winobj (available from www.sysinternals.com) displays the internal object manager’s
namespace. There is also a version of Winobj in the Platform SDK (in \Program
Files\Microsoft Platform SDK\Bin\Winnt\Winobj.exe), but the Winobj
from www.sysinternals.com displays more accurate information about objects (such
as the reference count, the number of open handles, security descriptors, and so
on).
Process Explorer and Handle from www.sysinternals.com (introduced in
Chapter 1) display the open handles for a process.

Oh.exe (available in Windows resource kits) displays the open handles for a process,

but it requires a global flag to be set in order to operate.

The Openfiles /query command (in Windows XP and Windows Server 2003) displays

the open handles for a process, but it requires a global flag to be set in order

to operate.

The kernel debugger !handle command displays the open handles for a process.


The object viewer provides a way to traverse the namespace that the object manager

maintains. (As we’ll explain later, not all objects have names.) Try running the WinObj

object manager utility from www.sysinternals.com and examining the layout, shown here:

As noted previously, both the OH utility and the Openfiles /query command require that

a Windows global flag called maintain objects list be enabled. (See the “Windows Global

Flags” section later in this chapter for more details about global flags.) OH will set the

flag if it is not set. If you type Openfiles /Local, it will tell you whether the flag is

enabled. You can enable it with the Openfiles /Local ON command. In either case, you

must reboot the system for the setting to take effect. Neither Process Explorer nor Handle

from www.sysinternals.com requires object tracking to be turned on because they use

a device driver to obtain the information.

The object manager was designed to meet the following goals:

Provide a common, uniform mechanism for using system resources

Isolate object protection to one location in the operating system so that C2 security compliance

can be achieved

Provide a mechanism to charge processes for their use of objects so that limits can be

placed on the usage of system resources

Establish an object-naming scheme that can readily incorporate existing objects, such as

the devices, files, and directories of a file system, or other independent collections of objects



Support the requirements of various operating system environments, such as the ability

of a process to inherit resources from a parent process (needed by Windows and POSIX)

and the ability to create case-sensitive filenames (needed by POSIX)

Establish uniform rules for object retention (that is, for keeping an object available until

all processes have finished using it)

Internally, Windows has two kinds of objects: executive objects and kernel objects. Executive

objects are objects implemented by various components of the executive (such as the process

manager, memory manager, I/O subsystem, and so on). Kernel objects are a more primitive

set of objects implemented by the Windows kernel. These objects are not visible to user-mode

code but are created and used only within the executive. Kernel objects provide fundamental

capabilities, such as synchronization, on which executive objects are built. Thus, many executive

objects contain (encapsulate) one or more kernel objects, as shown in Figure 3-17.
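Encapsulation can be pictured with a pair of structures. These layouts are simplified stand-ins: a real executive event embeds a kernel KEVENT whose dispatcher header the kernel's wait logic manipulates, and the field names below are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified kernel object: just the state the kernel's synchronization
   primitives operate on. */
typedef struct {
    int type;                  /* which kind of kernel object this is */
    int signal_state;          /* nonzero means "signaled"            */
} KernelObject;

/* Simplified executive object: the kernel object is embedded first, so a
   pointer to the executive object can be handed directly to kernel code
   that expects a pointer to a kernel object. */
typedef struct {
    KernelObject hdr;          /* encapsulated kernel object          */
    const char *name;          /* executive-level data                */
} ExecutiveEvent;
```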

Details about the structure of kernel objects and how they are used to implement synchronization

are given later in this chapter. In the remainder of this section, we’ll focus on how the

object manager works and on the structure of executive objects, handles, and handle tables.

Here we’ll just briefly describe how objects are involved in implementing Windows security

access checking; we’ll cover this topic thoroughly in Chapter 8.

Figure 3-17 Executive objects that contain kernel objects

Executive Objects

Each Windows environment subsystem projects to its applications a different image of the

operating system. The executive objects and object services are primitives that the environment

subsystems use to construct their own versions of objects and other resources.



Executive objects are typically created either by an environment subsystem on behalf of a user

application or by various components of the operating system as part of their normal operation.

For example, to create a file, a Windows application calls the Windows CreateFile function, implemented

in the Windows subsystem DLL Kernel32.dll. After some validation and initialization,

CreateFile in turn calls the native Windows service NtCreateFile to create an executive file object.

The set of objects an environment subsystem supplies to its applications might be larger or

smaller than the set the executive provides. The Windows subsystem uses executive objects to

export its own set of objects, many of which correspond directly to executive objects. For

example, the Windows mutexes and semaphores are directly based on executive objects

(which are in turn based on corresponding kernel objects). In addition, the Windows subsystem

supplies named pipes and mailslots, resources that are based on executive file objects.

Some subsystems, such as POSIX, don’t support objects as objects at all. The POSIX subsystem

uses executive objects and services as the basis for presenting POSIX-style processes,

pipes, and other resources to its applications.

Table 3-3 lists the primary objects the executive provides and briefly describes what they represent.

You can find further details on executive objects in the chapters that describe the

related executive components (or in the case of executive objects directly exported to Windows,

in the Windows API reference documentation).

Note The executive implements a total of 27 object types in Windows 2000 and 29 on Windows

XP and Windows Server 2003. (These newer Windows versions add the DebugObject and

KeyedEvent objects.) Many of these objects are for use only by the executive component that

defines them and are not directly accessible by Windows APIs. Examples of these objects

include Driver, Device, and EventPair.

Table 3-3 Executive Objects Exposed to the Windows API

Object Type     Represents
Symbolic link   A mechanism for referring to an object name indirectly.
Process         The virtual address space and control information necessary for the execution of a set of thread objects.
Thread          An executable entity within a process.
Job             A collection of processes manageable as a single entity through the job.
Section         A region of shared memory (known as a file mapping object in Windows).
File            An instance of an opened file or an I/O device.
Access token    The security profile (security ID, user rights, and so on) of a process or a thread.
Event           An object with a persistent state (signaled or not signaled) that can be used for synchronization or notification.
Semaphore       A counter that provides a resource gate by allowing some maximum number of threads to access the resources protected by the semaphore.
Mutex*          A synchronization mechanism used to serialize access to a resource.
Timer           A mechanism to notify a thread when a fixed period of time elapses.


Note Externally in the Windows API, mutants are called mutexes. Internally, the kernel

object that underlies mutexes is called a mutant.

Object Structure

As shown in Figure 3-18, each object has an object header and an object body. The object manager

controls the object headers, and the owning executive components control the object

bodies of the object types they create. In addition, each object header points to the list of processes

that have the object open and to a special object called the type object that contains

information common to each instance of the object.

Table 3-3 Executive Objects Exposed to the Windows API (continued)

Object Type     Represents
IoCompletion    A method for threads to enqueue and dequeue notifications of the completion of I/O operations (known as an I/O completion port in the Windows API).
Key             A mechanism to refer to data in the registry. Although keys appear in the object manager namespace, they are managed by the configuration manager, in a way similar to that in which file objects are managed by file system drivers. Zero or more key values are associated with a key object; key values contain data about the key.
WindowStation   An object that contains a clipboard, a set of global atoms, and a group of desktop objects.
Desktop         An object contained within a window station. A desktop has a logical display surface and contains windows, menus, and hooks.

Figure 3-18 Structure of an object
[The figure shows the three parts of an object. The object header holds the object name, object directory, security descriptor, quota charges, open handle count, open handles list, object type pointer, and reference count. The object body holds the object-specific data. The type object the header points to holds the type name, pool type, default quota charges, access types, generic access rights mapping, whether the type is synchronizable, and the type's methods (open, close, delete, parse, security, query name).]


Object Headers and Bodies

The object manager uses the data stored in an object’s header to manage objects without

regard to their type. Table 3-4 briefly describes the object header attributes.

In addition to an object header, each object has an object body whose format and contents are

unique to its object type; all objects of the same type share the same object body format. By

creating an object type and supplying services for it, an executive component can control the

manipulation of data in all object bodies of that type.

The object manager provides a small set of generic services that operate on the attributes

stored in an object’s header and can be used on objects of any type (although some generic

services don’t make sense for certain objects). These generic services, some of which the Windows

subsystem makes available to Windows applications, are listed in Table 3-5.

Although these generic object services are supported for all object types, each object has its

own create, open, and query services. For example, the I/O system implements a create file

service for its file objects, and the process manager implements a create process service for its

process objects. Although a single create object service could have been implemented, such a

routine would have been quite complicated, because the set of parameters required to initialize

a file object, for example, differs markedly from that required to initialize a process object.

Also, the object manager would have incurred additional processing overhead each time a

thread called an object service to determine the type of object the handle referred to and to

call the appropriate version of the service. For these reasons and others, the create, open, and

query services are implemented separately for each object type.

Table 3-4 Standard Object Header Attributes

Attribute            Purpose
Object name          Makes an object visible to other processes for sharing
Object directory     Provides a hierarchical structure in which to store object names
Security descriptor  Determines who can use the object and what they can do with it (Note: it might be null for objects without a name.)
Quota charges        Lists the resource charges levied against a process when it opens a handle to the object
Open handle count    Counts the number of times a handle has been opened to the object
Open handles list    Points to the list of processes that have opened handles to the object (not present for all objects)
Object type          Points to a type object that contains attributes common to objects of this type
Reference count      Counts the number of times a kernel-mode component has referenced the address of the object


Type Objects

Object headers contain data that is common to all objects but that can take on different values

for each instance of an object. For example, each object has a unique name and can have a

unique security descriptor. However, objects also contain some data that remains constant for

all objects of a particular type. For example, you can select from a set of access rights specific

to a type of object when you open a handle to objects of that type. The executive supplies terminate

and suspend access (among others) for thread objects and read, write, append, and

delete access (among others) for file objects. Another example of an object-type-specific

attribute is synchronization, which is described shortly.

To conserve memory, the object manager stores these static, object-type-specific attributes

once when creating a new object type. It uses an object of its own, a type object, to record this

data. As Figure 3-19 illustrates, if the object-tracking debug flag (described in the “Windows

Global Flags” section later in this chapter) is set, a type object also links together all objects of

the same type (in this case the Process type), allowing the object manager to find and enumerate

them, if necessary.

Figure 3-19 Process objects and the process type object
[The figure shows the Process type object linking together all the process objects (Object 1 through Object 4) through their type-list links.]

Table 3-5 Generic Object Services

Service                    Purpose
Close                      Closes a handle to an object
Duplicate                  Shares an object by duplicating a handle and giving it to another process
Query object               Gets information about an object's standard attributes
Query security             Gets an object's security descriptor
Set security               Changes the protection on an object
Wait for a single object   Synchronizes a thread's execution with one object
Wait for multiple objects  Synchronizes a thread's execution with multiple objects


EXPERIMENT: Viewing Object Headers and Type Objects

You can see the list of type objects declared to the object manager with the Winobj tool from Sysinternals. After running Winobj, open the \ObjectTypes directory to see them.

You can look at the process object type data structure in the kernel debugger by first

identifying a process object with the !process command:

kd> !process 0 0


PROCESS 8a4ce668 SessionId: none Cid: 0004 Peb: 00000000 ParentCid: 0000

DirBase: 00039000 ObjectTable: e1001c88 HandleCount: 474.

Image: System

Then execute the !object command with the process object address as the argument:

kd> !object 8a4ce668

Object: 8a4ce668 Type: (8a4ceca0) Process

ObjectHeader: 8a4ce650

HandleCount: 2 PointerCount: 89

Notice that the object header starts 0x18 (24 decimal) bytes prior to the start of the

object body. You can view the object header with this command:

kd> dt _object_header 8a4ce650


+0x000 PointerCount : 79

+0x004 HandleCount : 2

+0x004 NextToFree : 0x00000002

+0x008 Type : 0x8a4ceca0

+0x00c NameInfoOffset : 0 ’’

+0x00d HandleInfoOffset : 0 ’’

+0x00e QuotaInfoOffset : 0 ’’


+0x00f Flags : 0x22 '"'

+0x010 ObjectCreateInfo : 0x80545620

+0x010 QuotaBlockCharged : 0x80545620

+0x014 SecurityDescriptor : 0xe10001dc

+0x018 Body : _QUAD

Now look at the object type data structure by obtaining its address from the Type field

of the object header data structure:

kd> dt _object_type 8a4ceca0


+0x000 Mutex : _ERESOURCE

+0x038 TypeList : _LIST_ENTRY [ 0x8a4cecd8 - 0x8a4cecd8 ]

+0x040 Name : _UNICODE_STRING "Process"

+0x048 DefaultObject : (null)

+0x04c Index : 5

+0x050 TotalNumberOfObjects : 0x30

+0x054 TotalNumberOfHandles : 0x1b4

+0x058 HighWaterNumberOfObjects : 0x3f

+0x05c HighWaterNumberOfHandles : 0x1b8


+0x0ac Key : 0x636f7250

+0x0b0 ObjectLocks : [4] _ERESOURCE

The output shows that the object type structure includes the name of the object type,

tracks the total number of active objects of that type, and tracks the peak number of

handles and objects of that type. The TypeInfo field stores the pointer to the data structure

that stores attributes common to all objects of the object type as well as pointers to

the object type’s methods:

kd> dt _object_type_initializer 8a4ceca0+60


+0x000 Length : 0x4c

+0x002 UseDefaultObject : 0 ’’

+0x003 CaseInsensitive : 0 ’’

+0x004 InvalidAttributes : 0xb0

+0x008 GenericMapping : _GENERIC_MAPPING

+0x018 ValidAccessMask : 0x1f0fff

+0x01c SecurityRequired : 0x1 ’’

+0x01d MaintainHandleCount : 0 ’’

+0x01e MaintainTypeList : 0 ’’

+0x020 PoolType : 0 ( NonPagedPool )

+0x024 DefaultPagedPoolCharge : 0x1000

+0x028 DefaultNonPagedPoolCharge : 0x288

+0x02c DumpProcedure : (null)

+0x030 OpenProcedure : (null)

+0x034 CloseProcedure : (null)

+0x038 DeleteProcedure : 0x805abe6e nt!PspProcessDelete+0

+0x03c ParseProcedure : (null)

+0x040 SecurityProcedure : 0x805cf682 nt!SeDefaultObjectMethod+0

+0x044 QueryNameProcedure : (null)

+0x048 OkayToCloseProcedure : (null)


Type objects can’t be manipulated from user mode because the object manager supplies no

services for them. However, some of the attributes they define are visible through certain

native services and through Windows API routines. The attributes stored in the type objects

are described in Table 3-6.

Synchronization, one of the attributes visible to Windows applications, refers to a thread’s

ability to synchronize its execution by waiting for an object to change from one state to

another. A thread can synchronize with executive job, process, thread, file, event, semaphore,

mutex, and timer objects. Other executive objects don’t support synchronization. An object’s

ability to support synchronization is based on whether the object contains an embedded dispatcher

object, a kernel object that is covered in the section “Low-IRQL Synchronization” later

in this chapter.

Object Methods

The last attribute in Table 3-6, methods, comprises a set of internal routines that are similar to

C++ constructors and destructors—that is, routines that are automatically called when an

object is created or destroyed. The object manager extends this idea by calling an object

method in other situations as well, such as when someone opens or closes a handle to an

object or when someone attempts to change the protection on an object. Some object types

specify methods, whereas others don’t, depending on how the object type is to be used.

When an executive component creates a new object type, it can register one or more methods

with the object manager. Thereafter, the object manager calls the methods at well-defined

points in the lifetime of objects of that type, usually when an object is created, deleted, or modified

in some way. The methods that the object manager supports are listed in Table 3-7.

Table 3-6 Type Object Attributes

Attribute                      Purpose
Type name                      The name for objects of this type ("process," "event," "port," and so on)
Pool type                      Indicates whether objects of this type should be allocated from paged or nonpaged memory
Default quota charges          Default paged and nonpaged pool values to charge to process quotas
Access types                   The types of access a thread can request when opening a handle to an object of this type ("read," "write," "terminate," "suspend," and so on)
Generic access rights mapping  A mapping between the four generic access rights (read, write, execute, and all) and the type-specific access rights
Synchronization                Indicates whether a thread can wait for objects of this type
Methods                        One or more routines that the object manager calls automatically at certain points in an object's lifetime


The object manager calls the open method whenever it creates a handle to an object, which it

does when an object is created or opened. However, only one object type, WindowStation, defines an open method. The WindowStation object type requires an open method so that

Win32k.sys can share a piece of memory with the process that serves as a desktop-related

memory pool.

An example of the use of a close method occurs in the I/O system. The I/O manager registers

a close method for the file object type, and the object manager calls the close method each

time it closes a file object handle. This close method checks whether the process that is closing

the file handle owns any outstanding locks on the file and, if so, removes them. Checking

for file locks isn’t something the object manager itself could or should do.

The object manager calls a delete method, if one is registered, before it deletes a temporary

object from memory. The memory manager, for example, registers a delete method for the section

object type that frees the physical pages being used by the section. It also verifies that any

internal data structures the memory manager has allocated for a section are deleted before the

section object is deleted. Once again, the object manager can’t do this work because it knows

nothing about the internal workings of the memory manager. Delete methods for other types

of objects perform similar functions.

The parse method (and similarly, the query name method) allows the object manager to relinquish

control of finding an object to a secondary object manager if it finds an object that exists

outside the object manager namespace. When the object manager looks up an object name, it

suspends its search when it encounters an object in the path that has an associated parse

method. The object manager calls the parse method, passing to it the remainder of the object

name it is looking for. There are two namespaces in Windows in addition to the object manager’s:

the registry namespace, which the configuration manager implements, and the file system

namespace, which the I/O manager implements with the aid of file system drivers. (See

Chapter 5 for more information on the configuration manager and Chapter 9 for more about

the I/O manager and file system drivers.)

Table 3-7 Object Methods

Method       When Method Is Called
Open         When an object handle is opened
Close        When an object handle is closed
Delete       Before the object manager deletes an object
Query name   When a thread requests the name of an object, such as a file, that exists in a secondary object namespace
Parse        When the object manager is searching for an object name that exists in a secondary object namespace
Security     When a process reads or changes the protection of an object, such as a file, that exists in a secondary object namespace


For example, when a process opens a handle to the object named \Device\Floppy0\

docs\resume.doc, the object manager traverses its name tree until it reaches the device object

named Floppy0. It sees that a parse method is associated with this object, and it calls the

method, passing to it the rest of the object name it was searching for—in this case, the string

\docs\resume.doc. The parse method for device objects is an I/O routine because the I/O

manager defines the device object type and registers a parse method for it. The I/O manager’s

parse routine takes the name string and passes it to the appropriate file system, which finds

the file on the disk and opens it.

The security method, which the I/O system also uses, is similar to the parse method. It is

called whenever a thread tries to query or change the security information protecting a file.

This information is different for files than for other objects because security information is

stored in the file itself rather than in memory. The I/O system, therefore, must be called to

find the security information and read or change it.

Object Handles and the Process Handle Table

When a process creates or opens an object by name, it receives a handle that represents its

access to the object. Referring to an object by its handle is faster than using its name because

the object manager can skip the name lookup and find the object directly. Processes can also

acquire handles to objects by inheriting handles at process creation time (if the creator specifies

the inherit handle flag on the CreateProcess call and the handle was marked as inheritable,

either at the time it was created or afterward by using the Windows SetHandleInformation

function) or by receiving a duplicated handle from another process. (See the Windows

DuplicateHandle function.)

All user-mode processes must own a handle to an object before their threads can use the

object. Using handles to manipulate system resources isn’t a new idea. C and Pascal (and

other language) run-time libraries, for example, return handles to opened files. Handles serve

as indirect pointers to system resources; this indirection keeps application programs from fiddling

directly with system data structures.

Note Executive components and device drivers can access objects directly because they are

running in kernel mode and therefore have access to the object structures in system memory.

However, they must declare their usage of the object by incrementing the reference count so

that the object won’t be deallocated while it’s still being used. (See the section “Object Retention”

later in this chapter for more details.)

Object handles provide additional benefits. First, except for what they refer to, there is no difference

between a file handle, an event handle, and a process handle. This similarity provides a consistent

interface to reference objects, regardless of their type. Second, the object manager has the

exclusive right to create handles and to locate an object that a handle refers to. This means that

the object manager can scrutinize every user-mode action that affects an object to see whether

the security profile of the caller allows the operation requested on the object in question.


EXPERIMENT: Viewing Open Handles

Run Process Explorer, and make sure the lower pane is enabled and configured to show

open handles. (Click on View, Lower Pane View, and then Handles). Then open a command

prompt and view the handle table for the new Cmd.exe process. You should see

an open file handle to the current directory. For example, if the current directory is C:\, Process Explorer shows a File handle open to C:\.

If you then change the current directory with the CD command, you will see in Process

Explorer that the handle to the previous current directory is closed and a new handle is

opened to the new current directory. The previous handle is highlighted briefly in red,

and the new handle is highlighted in green. The duration of the highlight can be

adjusted by clicking Options and then Difference Highlight Duration.

Process Explorer’s differences highlighting feature makes it easy to see changes in the

handle table. For example, if a process is leaking handles, viewing the handle table with

Process Explorer can quickly show what handle or handles are being opened but not

closed. This information can assist the programmer to find the handle leak.


You can also display the open handle table by using the command-line Handle tool from Sysinternals. For example, note the following partial output of Handle examining

the handle table for a Cmd.exe process before and after changing the directory:

C:\>handle -p cmd.exe

Handle v2.2

Copyright (C) 1997-2004 Mark Russinovich

Sysinternals -


cmd.exe pid: 3184 BIGDAVID\dsolomon

b0: File C:\

C:\>cd windows

C:\WINDOWS>handle -p cmd.exe



cmd.exe pid: 3184 BIGDAVID\dsolomon

b4: File C:\WINDOWS

An object handle is an index into a process-specific handle table, pointed to by the executive

process (EPROCESS) block (described in Chapter 6). The first handle index is 4, the second

8, and so on. A process’s handle table contains pointers to all the objects that the process has

opened a handle to. Handle tables are implemented as a three-level scheme, similar to the way

that the x86 memory management unit implements virtual-to-physical address translation,

giving a maximum of more than 16,000,000 handles per process. (See Chapter 7 for details

about memory management in x86 systems.)

In Windows 2000, when a process is created, the object manager allocates the top level of the

handle table, which contains pointers to the middle-level tables; the middle level, which contains

the first array of pointers to subhandle tables; and the lowest level, which contains the

first subhandle table. Figure 3-20 illustrates the Windows 2000 handle table architecture. In

Windows 2000, the object manager treats the low 24 bits of an object handle’s value as three

8-bit fields that index into each of the three levels in the handle table. In Windows XP and

Windows Server 2003, only the lowest level handle table is allocated on process creation—the

other levels are created as needed. In Windows 2000, the subhandle table consists of 255

usable entries. In Windows XP and Windows Server 2003, the subhandle table consists of as

many entries as will fit in a page minus one entry that is used for handle auditing. For example,

for x86 systems a page is 4096 bytes, divided by the size of a handle table entry (8 bytes),

which is 512, minus 1, which is a total of 511 entries in the lowest level handle table. In Windows

XP and Windows Server 2003, the mid-level handle table contains a full page of pointers

to subhandle tables, so the number of subhandle tables depends on the size of the page and

the size of a pointer for the platform.


Figure 3-20 Windows 2000 process handle table architecture

EXPERIMENT: Creating the Maximum Number of Handles

The test program Testlimit from Sysinternals has an

option to open handles to an object until it cannot open any more handles. You can use

this to see how many handles can be created in a single process on your system. Because

handle tables are allocated from paged pool, you might run out of paged pool before you

hit the maximum number of handles that can be created in a single process. To see how

many handles you can create on your system, follow these steps:

1. Download the Testlimit zip file from Sysinternals, and unzip it into a directory.
2. Run Process Explorer, and click View and then System Information. Notice the

current and maximum size of paged pool. (To display the maximum pool size values,

Process Explorer must be configured properly to access the symbols for the

kernel image, Ntoskrnl.exe.) Leave this system information display running so

that you can see pool utilization when you run the Testlimit program.

3. Open a command prompt.

4. Run the Testlimit program with the "-h" switch (do this by typing testlimit -h).

When Testlimit fails to open a new handle, it will display the total number of handles

it was able to create. If the number is less than approximately 16 million, you

are probably running out of paged pool before hitting the theoretical per-process

handle limit.

5. Close the command-prompt window; doing this will kill the Testlimit process,

thus closing all the open handles.

















As shown in Figure 3-21, on x86 systems, each handle entry consists of a structure with two

32-bit members: a pointer to the object (with flags), and the granted access mask. On 64-bit

systems, a handle table entry is 12 bytes long: a 64-bit pointer to the object header and a 32-

bit access mask. (Access masks are described in Chapter 8.)

On Windows 2000, the first 32-bit member contains both a pointer to the object header and

four flags. Because object headers are always 8-byte aligned, the low-order 3 bits of this field

are free for use as flags. An entry’s high bit is used as a lock. When the object manager translates

a handle to an object pointer, it locks the handle entry while the translation is in

progress. Because all objects are located in the system address space, the high bit of the object

pointer is set. (The addresses are guaranteed to be higher than 0x80000000 even on systems

with the /3GB boot switch.) Thus, the object manager can keep the high bit clear when a handle

table entry is unlocked and, in the process of locking the entry, set the bit and obtain the

object’s correct pointer value. The object manager needs to lock a process’s entire handle

table, using a handle table lock associated with each process, only when the process creates a

new handle or closes an existing handle. In Windows XP and Windows Server 2003, the lock

bit is the low-order bit of the object pointer. The flag that was stored in this low-order bit in

Windows 2000 is now stored in an unused bit in the access mask.

Figure 3-21 Structure of a handle table entry

The first flag indicates whether the caller is allowed to close this handle. The second flag is the

inheritance designation—that is, it indicates whether processes created by this process will get

a copy of this handle in their handle tables. As already noted, handle inheritance can be specified on handle creation or later with the Windows SetHandleInformation function. The third flag indicates whether

closing the object should generate an audit message. (This flag isn’t exposed to Windows—the

object manager uses it internally.)

System components and device drivers often need to open handles to objects that user-mode

applications shouldn’t have access to. This is done by creating handles in the kernel handle

table (referenced internally with the name ObpKernelHandleTable). The handles in this table

are accessible only from kernel mode and in any process context. This means that a kernelmode

function can reference the handle in any process context with no performance impact.

The object manager recognizes references to handles from the kernel handle table when the



high bit of the handle is set—that is, when references to kernel-handle-table handles have values

greater than 0x80000000. On Windows 2000, the kernel-handle table is an independent

handle table, but on Windows XP and Windows Server 2003 the kernel-handle table also

serves as the handle table for the System process.

EXPERIMENT: Viewing the Handle Table with the Kernel Debugger
The !handle command in the kernel debugger takes three arguments:

!handle <handle index> <flags> <processid>

The handle index identifies the handle entry in the handle table. (Zero means display all

handles.) The first handle is index 4, the second 8, and so on. For example, typing !handle

4 will show the first handle for the current process.

The flags you can specify are a bitmask, where bit 0 means display only the information

in the handle entry, bit 1 means display free handles (not just used handles), and bit 2

means display information about the object that the handle refers to. The following

command displays full details about the handle table for process ID 0x408:

kd> !handle 0 7 408

processor number 0

Searching for Process with Cid == 408

PROCESS 865f0790 SessionId: 0 Cid: 0408 Peb: 7ffdf000 ParentCid: 01dc

DirBase: 04fd3000 ObjectTable: 856ca888 TableSize: 21.

Image: i386kd.exe

Handle Table at e2125000 with 21 Entries in use

0000: free handle

0004: Object: e20da2e0 GrantedAccess: 000f001f

Object: e20da2e0 Type: (81491b80) Section

ObjectHeader: e20da2c8

HandleCount: 1 PointerCount: 1

0008: Object: 80b13330 GrantedAccess: 00100003

Object: 80b13330 Type: (81495100) Event

ObjectHeader: 80b13318

HandleCount: 1 PointerCount: 1

Object Security

When you open a file, you must specify whether you intend to read or to write. If you try to

write to a file that is opened for read access, you get an error. Likewise, in the executive, when

a process creates an object or opens a handle to an existing object, the process must specify a

set of desired access rights—that is, what it wants to do with the object. It can request either a set

of standard access rights (such as read, write, and execute) that apply to all object types or


specific access rights that vary depending on the object type. For example, the process can

request delete access or append access to a file object. Similarly, it might require the ability to

suspend or terminate a thread object.

When a process opens a handle to an object, the object manager calls the security reference

monitor, the kernel-mode portion of the security system, sending it the process’s set of desired

access rights. The security reference monitor checks whether the object’s security descriptor

permits the type of access the process is requesting. If it does, the reference monitor returns a

set of granted access rights that the process is allowed, and the object manager stores them in

the object handle it creates. How the security system determines who gets access to which

objects is explored in Chapter 8.

Thereafter, whenever the process’s threads use the handle, the object manager can quickly

check whether the set of granted access rights stored in the handle corresponds to the usage

implied by the object service the threads have called. For example, if the caller asked for read

access to a section object but then calls a service to write to it, the service fails.
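The handle-based access check described above can be sketched in C. This is an illustrative model only, with hypothetical type names and access masks; the executive's real handle entries and access-check paths are more involved.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical access masks, loosely modeled on Windows access rights. */
#define OBJ_READ   0x0001u
#define OBJ_WRITE  0x0002u

typedef struct {
    void     *object;         /* pointer the handle refers to */
    uint32_t  granted_access; /* mask stored at open time     */
} handle_entry_t;

/* At open time, the security check yields a granted-access mask that is
 * cached in the handle entry; later services only compare against it. */
bool object_check_access(const handle_entry_t *h, uint32_t desired)
{
    return (h->granted_access & desired) == desired;
}
```

A handle opened with only OBJ_READ granted fails the OBJ_WRITE check, which is the section-object example above: no security descriptor needs to be reconsulted, only the cached mask.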

Object Retention

There are two types of objects: temporary and permanent. Most objects are temporary—that is,

they remain while they are in use and are freed when they are no longer needed. Permanent

objects remain until they are explicitly freed. Because most objects are temporary, the rest of this

section describes how the object manager implements object retention—that is, retaining temporary

objects only as long as they are in use and then deleting them. Because all user-mode processes

that access an object must first open a handle to it, the object manager can easily track

how many of these processes, and even which ones, are using an object. Tracking these handles

represents one part in implementing retention. The object manager implements object retention

in two phases. The first phase is called name retention, and it is controlled by the number of open

handles to an object that exist. Every time a process opens a handle to an object, the object manager

increments the open handle counter in the object’s header. As processes finish using the

object and close their handles to it, the object manager decrements the open handle counter.

When the counter drops to 0, the object manager deletes the object’s name from its global

namespace. This deletion prevents new processes from opening a handle to the object.

The second phase of object retention is to stop retaining the objects themselves (that is, to

delete them) when they are no longer in use. Because operating system code usually accesses

objects by using pointers instead of handles, the object manager must also record how many

object pointers it has dispensed to operating system processes. It increments a reference count

for an object each time it gives out a pointer to the object; when kernel-mode components finish

using the pointer, they call the object manager to decrement the object’s reference count.

The system also increments the reference count when it increments the handle count, and

likewise decrements the reference count when the handle count decrements, because a handle

is also a reference to the object that must be tracked. (For further details on object retention,

see the DDK documentation on the functions ObReferenceObjectByPointer and ObDereferenceObject.)



Figure 3-22 illustrates two event objects that are in use. Process A has the first event open. Process B has both events open. In addition, the first event is being referenced by some kernel-mode structure; thus, the reference count is 3. So even if processes A and B closed their handles to the first event object, it would continue to exist because its reference count would still be 1. However, when process B closes its handle to the second event object, the object would be deleted.

Figure 3-22 Handles and reference counts

So even after an object’s open handle counter reaches 0, the object’s reference count might

remain positive, indicating that the operating system is still using the object. Ultimately, when

the reference count drops to 0, the object manager deletes the object from memory.
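The two-phase retention scheme can be modeled with a small C sketch. The structure and function names here are hypothetical; they mirror only the counting rules described above, not the executive's actual implementation.

```c
#include <stdbool.h>

/* Simplified model of the two counters kept in an object header. */
typedef struct {
    int  handle_count;   /* open handles (phase one: name retention)  */
    int  pointer_count;  /* all references, including one per handle  */
    bool name_deleted;   /* name removed from the global namespace    */
    bool object_deleted; /* storage actually freed                    */
} object_header_t;

void ob_open_handle(object_header_t *o)
{
    o->handle_count++;
    o->pointer_count++;              /* a handle is also a reference */
}

void ob_close_handle(object_header_t *o)
{
    if (--o->handle_count == 0)
        o->name_deleted = true;      /* phase one: the name goes away */
    if (--o->pointer_count == 0)
        o->object_deleted = true;    /* phase two: the object goes away */
}

void ob_reference(object_header_t *o) { o->pointer_count++; }

void ob_dereference(object_header_t *o)
{
    if (--o->pointer_count == 0)
        o->object_deleted = true;
}
```

Opening two handles and taking one kernel-mode reference yields a pointer count of 3, as in Figure 3-22; closing both handles deletes the name, but the object survives until the last dereference.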

Because of the way object retention works, an application can ensure that an object and its

name remain in memory simply by keeping a handle open to the object. Programmers who

write applications that contain two or more cooperating processes need not be concerned

that one process might delete an object before the other process has finished using it. In addition,

closing an application’s object handles won’t cause an object to be deleted if the operating

system is still using it. For example, one process might create a second process to execute

a program in the background; it then immediately closes its handle to the process. Because

the operating system needs the second process to run the program, it maintains a reference to

its process object. Only when the background program finishes executing does the object

manager decrement the second process’s reference count and then delete it.


Resource Accounting

Resource accounting, like object retention, is closely related to the use of object handles. A

positive open handle count indicates that some process is using that resource. It also indicates

that some process is being charged for the memory the object occupies. When an object’s handle

count and reference count drop to 0, the process that was using the object should no

longer be charged for it.

Many operating systems use a quota system to limit processes’ access to system resources.

However, the types of quotas imposed on processes are sometimes diverse and complicated,

and the code to track the quotas is spread throughout the operating system. For example, in

some operating systems, an I/O component might record and limit the number of files a process

can open, whereas a memory component might impose a limit on the amount of memory

a process’s threads can allocate. A process component might limit users to some maximum

number of new processes they can create or a maximum number of threads within a process.

Each of these limits is tracked and enforced in different parts of the operating system.

In contrast, the Windows object manager provides a central facility for resource accounting.

Each object header contains an attribute called quota charges that records how much the

object manager subtracts from a process’s allotted paged and/or nonpaged pool quota when

a thread in the process opens a handle to the object.
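As a rough sketch of this kind of central accounting, the following C fragment charges an object's paged pool quota against a process-wide quota block when a handle is opened and returns it on close. The names and the zero-means-unlimited convention follow the description above; they are not the real EPROCESS quota structures.

```c
#include <stddef.h>
#include <stdbool.h>

/* Illustrative per-process pool quota block (not the real layout). */
typedef struct {
    size_t paged_limit,    paged_usage;    /* limit of 0 == unlimited */
    size_t nonpaged_limit, nonpaged_usage;
} quota_block_t;

/* Charge the object's quota against the process when a handle is
 * opened; fail if the charge would exceed a nonzero limit. */
bool quota_charge_paged(quota_block_t *q, size_t charge)
{
    if (q->paged_limit != 0 && q->paged_usage + charge > q->paged_limit)
        return false;
    q->paged_usage += charge;
    return true;
}

/* Return the charge when the object is no longer used by the process. */
void quota_return_paged(quota_block_t *q, size_t charge)
{
    q->paged_usage -= charge;
}
```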

Each process on Windows points to a quota structure that records the limits and current values

for nonpaged pool, paged pool, and page file usage. (Type dt nt!_EPROCESS_QUOTA_ENTRY

in the kernel debugger to see the format of this structure.) These quotas default to 0 (no limit)

but can be specified by modifying registry values. (See NonPagedPoolQuota, PagedPoolQuota, and

PagingFileQuota under HKLM\System\CurrentControlSet\Session Manager\Memory Management.)

Note that all the processes in an interactive session share the same quota block (and

there’s no documented way to create processes with their own quota blocks).

Object Names

An important consideration in creating a multitude of objects is the need to devise a successful

system for keeping track of them. The object manager requires the following information

to help you do so:

A way to distinguish one object from another

A method for finding and retrieving a particular object

The first requirement is served by allowing names to be assigned to objects. This is an extension

of what most operating systems provide—the ability to name selected resources, files,

pipes, or a block of shared memory, for example. The executive, in contrast, allows any

resource represented by an object to have a name. The second requirement, finding and

retrieving an object, is also satisfied by object names. If the object manager stores objects by

name, it can find an object by looking up its name.


Object names also satisfy a third requirement, which is to allow processes to share objects.

The executive’s object namespace is a global one, visible to all processes in the system. One

process can create an object and place its name in the global namespace, and a second process

can open a handle to the object by specifying the object’s name. If an object isn’t meant to be

shared in this way, its creator doesn’t need to give it a name.

To increase efficiency, the object manager doesn’t look up an object’s name each time someone

uses the object. Instead, it looks up a name under only two circumstances. The first is

when a process creates a named object: the object manager looks up the name to verify that it

doesn’t already exist before storing the new name in the global namespace. The second is

when a process opens a handle to a named object: the object manager looks up the name,

finds the object, and then returns an object handle to the caller; thereafter, the caller uses the

handle to refer to the object. When looking up a name, the object manager allows the caller to

select either a case-sensitive or a case-insensitive search, a feature that supports POSIX and

other environments that use case-sensitive filenames.
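The two lookup paths (the collision check at create time and the find-and-return at open time), along with the case-sensitivity option, can be illustrated with a toy directory in C. The types are hypothetical stand-ins for the object manager's directory structures.

```c
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */
#include <stdbool.h>
#include <stddef.h>

/* Toy directory of named objects. */
typedef struct { const char *name; void *object; } dir_entry_t;

/* At create time, a non-NULL result means the name already exists;
 * at open time, a non-NULL result is the object to hand back. */
void *dir_lookup(const dir_entry_t *dir, int n, const char *name,
                 bool case_insensitive)
{
    for (int i = 0; i < n; i++) {
        bool same = case_insensitive
                        ? strcasecmp(dir[i].name, name) == 0
                        : strcmp(dir[i].name, name) == 0;
        if (same)
            return dir[i].object;
    }
    return NULL;
}
```

A POSIX-style caller that needs case-sensitive names would pass false for the last argument and so would not match "MYEVENT" against "MyEvent".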

Where the names of objects are stored depends on the object type. Table 3-8 lists the standard

object directories found on all Windows systems and what types of objects have their names

stored there. Of the directories listed, only \BaseNamedObjects and \GLOBAL?? (\?? on Windows

2000) are visible to user programs (see the Session Namespace section later in this

chapter for more information).

Because the base kernel objects such as mutexes, events, semaphores, waitable timers, and

sections have their names stored in a single object directory, no two of these objects can have

the same name, even if they are of a different type. This restriction emphasizes the need to

choose names carefully so that they don’t collide with other names (for example, prefix

names with your company and product name).

Table 3-8 Standard Object Directories

Directory                        Types of Object Names Stored
\GLOBAL?? (\?? on Windows 2000)  MS-DOS device names (\DosDevices is a symbolic
                                 link to this directory)
\BaseNamedObjects                Mutexes, events, semaphores, waitable timers,
                                 and section objects
\Callback                        Callback objects
\Device                          Device objects
\Driver                          Driver objects
\FileSystem                      File system driver objects and file system
                                 recognizer device objects
\KnownDlls                       Section names and path for known DLLs (DLLs
                                 mapped by the system at startup time)
\Nls                             Section names for mapped national language
                                 support tables
\ObjectTypes                     Names of types of objects
\RPC Control                     Port objects used by remote procedure calls
                                 (RPCs)


Object names are global to a single computer (or to all processors on a multiprocessor computer),

but they’re not visible across a network. However, the object manager’s parse method

makes it possible to access named objects that exist on other computers. For example, the

I/O manager, which supplies file object services, extends the functions of the object manager

to remote files. When asked to open a remote file object, the object manager calls a

parse method, which allows the I/O manager to intercept the request and deliver it to a network

redirector, a driver that accesses files across the network. Server code on the remote

Windows system calls the object manager and the I/O manager on that system to find the file

object and return the information back across the network.

EXPERIMENT: Looking at the Base Named Objects

You can see the list of base objects that have names with the Winobj tool from www.sysinternals.com. Run Winobj.exe and click on \BaseNamedObjects, as shown here:

The named objects are shown on the right. The icons indicate the object type.

Mutexes are indicated with a stop sign.

Sections (Windows file mapping objects) are shown as memory chips.

Events are shown as exclamation points.

Semaphores are indicated with an icon that resembles a traffic signal.

Symbolic links have icons that are curved arrows.

Table 3-8 Standard Object Directories (continued)

Directory                        Types of Object Names Stored
\Security                        Names of objects specific to the security
                                 subsystem
\Windows                         Windows subsystem ports and window stations

Object directories The object directory object is the object manager’s means for supporting

this hierarchical naming structure. This object is analogous to a file system directory and

contains the names of other objects, possibly even other object directories. The object directory

object maintains enough information to translate these object names into pointers to the

objects themselves. The object manager uses the pointers to construct the object handles that

it returns to user-mode callers. Both kernel-mode code (including executive components and

device drivers) and user-mode code (such as subsystems) can create object directories in

which to store objects. For example, the I/O manager creates an object directory named

\Device, which contains the names of objects representing I/O devices.

Symbolic links In certain file systems (on NTFS and some UNIX systems, for example), a

symbolic link lets a user create a filename or a directory name that, when used, is translated by

the operating system into a different file or directory name. Using a symbolic link is a simple

method for allowing users to indirectly share a file or the contents of a directory, creating a

cross-link between different directories in the ordinarily hierarchical directory structure.

The object manager implements an object called a symbolic link object, which performs a similar

function for object names in its object namespace. A symbolic link can occur anywhere

within an object name string. When a caller refers to a symbolic link object’s name, the object

manager traverses its object namespace until it reaches the symbolic link object. It looks

inside the symbolic link and finds a string that it substitutes for the symbolic link name. It

then restarts its name lookup.
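The substitute-and-restart behavior can be sketched as a string operation in C. The link and target strings in the example are illustrative of the MS-DOS device-name translation discussed next; the object manager's real parse logic operates on its directory structures rather than flat strings.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* If 'name' begins with the symbolic link's name, splice in the link's
 * target string; the caller then restarts its lookup on the result. */
bool symlink_substitute(const char *name, const char *link,
                        const char *target, char *out, size_t outsz)
{
    size_t linklen = strlen(link);
    if (strncmp(name, link, linklen) != 0)
        return false;                       /* not covered by this link */
    snprintf(out, outsz, "%s%s", target, name + linklen);
    return true;                            /* restart lookup with 'out' */
}
```

For example, resolving \??\C:\Windows against a link named \??\C: whose target is \Device\HarddiskVolume1 (an illustrative target name) yields \Device\HarddiskVolume1\Windows, on which the name lookup restarts.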

One place in which the executive uses symbolic link objects is in translating MS-DOS-style

device names into Windows internal device names. In Windows, a user refers to floppy and

hard disk drives using the names A:, B:, C:, and so on and serial ports as COM1, COM2, and

so on. The Windows subsystem makes these symbolic link objects protected, global data by

placing them in the object manager namespace under the \?? object directory on Windows

2000 and the \Global?? directory on Windows XP and Windows Server 2003.

Session Namespace

Windows NT was originally written with the assumption that only one user would log on to

the system interactively and that the system would run only one instance of any interactive

application. The addition of Windows Terminal Services in Windows 2000 Server and fast

user switching in Windows XP changed these assumptions, thus requiring changes to the

object manager namespace model to support multiple users. (For a basic description of terminal

services and sessions, see Chapter 1.)


A user logged on to the console session has access to the global namespace, a namespace that

serves as the first instance of the namespace. Additional sessions are given a session-private

view of the namespace known as a local namespace. The parts of the namespace that are localized

for each session include \DosDevices, \Windows, and \BaseNamedObjects. Making separate

copies of the same parts of the namespace is known as instancing the namespace.

Instancing \DosDevices makes it possible for each user to have different network drive letters

and Windows objects such as serial ports. On Windows 2000, the global \DosDevices directory

is named \?? and is the directory to which the \DosDevices symbolic link points, and

local \DosDevices directories are identified by the session id for the terminal server session.

On Windows XP and later, the global \DosDevices directory is named \Global?? and is the

directory to which \DosDevices points, and local \DosDevices directories are identified by

the logon session ID.

The \Windows directory is where Win32k.sys creates the interactive window station,

\WinSta0. A Terminal Services environment can support multiple interactive users, but each

user needs an individual version of WinSta0 to preserve the illusion that he or she is accessing

the predefined interactive window station in Windows. Finally, applications and the system

create shared objects in \BaseNamedObjects, including events, mutexes, and memory sections.

If two users are running an application that creates a named object, each user session

must have a private version of the object so that the two instances of the application don’t

interfere with one another by accessing the same object.

The object manager implements a local namespace by creating the private versions of the

three directories mentioned under a directory associated with the user’s session under \Sessions\

X (where X is the session identifier). When a Windows application in remote session

two creates a named event, for example, the object manager transparently redirects the

object’s name from \BaseNamedObjects to \Sessions\2\BaseNamedObjects.
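The redirection can be pictured as a simple name rewrite, sketched below in C with a hypothetical helper name; session 0 (the console session) keeps the global view.

```c
#include <stdio.h>
#include <string.h>

/* Rewrite a \BaseNamedObjects name into the session-local instance,
 * mirroring the transparent redirection described in the text. */
void localize_name(int session_id, const char *name,
                   char *out, size_t outsz)
{
    if (session_id != 0 &&
        strncmp(name, "\\BaseNamedObjects", 17) == 0)
        snprintf(out, outsz, "\\Sessions\\%d%s", session_id, name);
    else
        snprintf(out, outsz, "%s", name);  /* console session: global view */
}
```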

All object manager functions related to namespace management are aware of the instanced

directories and participate in providing the illusion that nonconsole sessions use the same

namespace as the console session. Windows subsystem DLLs prefix names passed by Windows

applications that reference objects in \DosDevices with \?? (for example, C:\Windows

becomes \??\C:\Windows). When the object manager sees the special \?? prefix, the steps it

takes depend on the version of Windows, but it always relies on a field named DeviceMap in

the executive process object (EPROCESS, which is described further in Chapter 6) that points

to a data structure shared by other processes in the same session. The DosDevicesDirectory

field of the DeviceMap structure points at the object manager directory that represents the

process’s local \DosDevices. The target directory varies depending on the system:

If the system is Windows 2000 and Terminal Services are not installed, the DosDevicesDirectory field of the DeviceMap structure of the process points at the \?? directory

because there are no local namespaces.


If the system is Windows 2000 and Terminal Services are installed, when a new session

becomes active the system copies all the objects from the global \?? directory into the

session’s local \DosDevices directory, and the DosDevicesDirectory field of the DeviceMap

structure points at the local directory.

On Windows XP and Windows Server 2003, the system does not make copies of global

objects in the local DosDevices directories. When the object manager sees a reference to

\??, it locates the process’s local \DosDevices by using the DosDevicesDirectory field of

the DeviceMap. If the object manager doesn’t find the object in that directory, it checks

the DeviceMap field of the directory object, and if it’s valid it looks for the object in the

directory pointed to by the GlobalDosDevicesDirectory field of the DeviceMap structure,

which is always \Global??.
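The local-then-global search order on Windows XP can be modeled as a two-level lookup. The toy directories below are illustrative stand-ins for the DosDevicesDirectory and GlobalDosDevicesDirectory targets of the DeviceMap.

```c
#include <string.h>
#include <stddef.h>

typedef struct { const char *name; void *object; } devmap_entry_t;

static void *dir_find(const devmap_entry_t *d, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(d[i].name, name) == 0)
            return d[i].object;
    return NULL;
}

/* Search the session's local \DosDevices first; fall back to the
 * global \Global?? directory only if the name isn't found locally. */
void *devicemap_lookup(const devmap_entry_t *local_dir, int nlocal,
                       const devmap_entry_t *global_dir, int nglobal,
                       const char *name)
{
    void *o = dir_find(local_dir, nlocal, name);
    return o ? o : dir_find(global_dir, nglobal, name);
}
```

Note that a local entry with the same name as a global one shadows it, which is why a local object hides the global object of the same name.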

Under certain circumstances, applications that are Terminal Services–aware need to access

objects in the console session even if the application is running in a remote session. The application

might want to do this to synchronize with instances of itself running in other remote

sessions or with the console session. For these cases, the object manager provides the special

override “\Global” that an application can prefix to any object name to access the global

namespace. For example, an application in session two opening an object named \Global\

ApplicationInitialized is directed to \BaseNamedObjects\ApplicationInitialized instead

of \Sessions\2\BaseNamedObjects\ApplicationInitialized.

On Windows XP and Windows Server 2003, an application that wants to access an object in

the global \DosDevices directory does not need to use the \Global prefix as long as the object

doesn’t exist in its local \DosDevices directory. This is because the object manager will automatically

look in the global directory for the object if it doesn’t find it in the local directory.

However, an application running on Windows 2000 with Terminal Services must always specify

the \Global prefix to access objects in the global \DosDevices directory.

EXPERIMENT: Viewing Namespace Instancing

You can see the object manager instance of the namespace by creating a session other

than the console session and then viewing the handle table for a process in that session.

On Windows XP Home Edition or on a Windows XP Professional system that is not a

member of a domain, disconnect the console session (by clicking Start, clicking Log Off,

and choosing Disconnect and Switch User, or by pressing the Windows key + L) and

logging in to a new account. If you have a Windows 2000 Server, Advanced Server, or

Datacenter Server system, run the Terminal Services client, connect to the server, and

log in.


Once you are logged in to the new session, run Winobj.exe from www.sysinternals.com

each active remote session. If you open one of these directories, you’ll see subdirectories

named \DosDevices, \Windows, and \BaseNamedObjects, which are the local

namespace subdirectories of the session. The following screen shot shows a local namespace.


Next run Process Explorer and select a process in the new session (such as

Explorer.exe), and then view the handle table (by clicking View, Lower Pane View, and

then Handles). You should see a handle to \Windows\Windowstations\WinSta0

underneath \Sessions\n, where n is the session id. Objects with global names will

appear under \Sessions\n\BaseNamedObjects.


Synchronization

The concept of mutual exclusion is a crucial one in operating systems development. It refers to

the guarantee that one, and only one, thread can access a particular resource at a time. Mutual

exclusion is necessary when a resource doesn’t lend itself to shared access or when sharing

would result in an unpredictable outcome. For example, if two threads copy a file to a printer

port at the same time, their output could be interspersed. Similarly, if one thread reads a memory

location while another one writes to it, the first thread will receive unpredictable data. In

general, writable resources can’t be shared without restrictions, whereas resources that aren’t

subject to modification can be shared. Figure 3-23 illustrates what happens when two threads

running on different processors both write data to a circular queue.


Figure 3-23 Incorrect sharing of memory

    Time   Processor A                       Processor B
    1      Get queue tail
    2                                        Get queue tail
    3      Insert data at current location
    4                                        Insert data at current location /*ERROR*/
    5      Increment tail pointer
    6                                        Increment tail pointer

Because the second thread got the value of the queue tail pointer before the first thread had

finished updating it, the second thread inserted its data into the same location that the first

thread had used, overwriting data and leaving one queue location empty. Even though this

figure illustrates what could happen on a multiprocessor system, the same error could occur

on a single-processor system if the operating system were to perform a context switch to the

second thread before the first thread updated the queue tail pointer.

Sections of code that access a nonshareable resource are called critical sections. To ensure correct

code, only one thread at a time can execute in a critical section. While one thread is writing

to a file, updating a database, or modifying a shared variable, no other thread can be

allowed to access the same resource. The pseudocode shown in Figure 3-23 is a critical section

that incorrectly accesses a shared data structure without mutual exclusion.
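Expressed in C, the queue-insertion critical section from Figure 3-23 becomes safe once all three steps execute under a lock. The sketch below uses a POSIX mutex purely for illustration; kernel code would use the primitives described in the rest of this chapter.

```c
#include <pthread.h>

#define QUEUE_SIZE 8

typedef struct {
    int             slots[QUEUE_SIZE];
    int             tail;   /* next free slot */
    pthread_mutex_t lock;   /* guards the critical section below */
} queue_t;

/* The read-tail / insert / increment-tail sequence is a critical
 * section; holding the mutex across all three steps prevents two
 * threads from inserting into the same slot. */
void queue_insert(queue_t *q, int value)
{
    pthread_mutex_lock(&q->lock);
    int t = q->tail;                    /* get queue tail             */
    q->slots[t] = value;                /* insert at current location */
    q->tail = (t + 1) % QUEUE_SIZE;     /* increment tail pointer     */
    pthread_mutex_unlock(&q->lock);
}
```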

The issue of mutual exclusion, although important for all operating systems, is especially

important (and intricate) for a tightly coupled, symmetric multiprocessing (SMP) operating system

such as Windows, in which the same system code runs simultaneously on more than one

processor, sharing certain data structures stored in global memory. In Windows, it is the kernel’s

job to provide mechanisms that system code can use to prevent two threads from modifying

the same structure at the same time. The kernel provides mutual-exclusion primitives

that it and the rest of the executive use to synchronize their access to global data structures.

Because the scheduler synchronizes access to its data structures at DPC/Dispatch level IRQL,

the kernel and executive cannot rely on synchronization mechanisms that would result in a

page fault or reschedule operation to synchronize access to data structures when the IRQL is

DPC/Dispatch level or higher (levels known as elevated or high IRQL). In the following sections, you’ll find out how the kernel and executive use mutual exclusion to protect their global



data structures when the IRQL is high and what mutual-exclusion and synchronization mechanisms

the kernel and executive use when the IRQL is low (below DPC/Dispatch level).

High-IRQL Synchronization

At various stages during its execution, the kernel must guarantee that one, and only one, processor

at a time is executing within a critical section. Kernel critical sections are the code segments

that modify a global data structure such as the kernel’s dispatcher database or its DPC

queue. The operating system can’t function correctly unless the kernel can guarantee that

threads access these data structures in a mutually exclusive manner.

The biggest area of concern is interrupts. For example, the kernel might be updating a global

data structure when an interrupt occurs whose interrupt-handling routine also modifies the

structure. Simple single-processor operating systems sometimes prevent such a scenario by

disabling all interrupts each time they access global data, but the Windows kernel has a more

sophisticated solution. Before using a global resource, the kernel temporarily masks those

interrupts whose interrupt handlers also use the resource. It does so by raising the processor’s

IRQL to the highest level used by any potential interrupt source that accesses the global

data. For example, an interrupt at DPC/dispatch level causes the dispatcher, which uses the

dispatcher database, to run. Therefore, any other part of the kernel that uses the dispatcher

database raises the IRQL to DPC/dispatch level, masking DPC/dispatch-level interrupts

before using the dispatcher database.
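The masking rule can be caricatured in a few lines of C: work that arrives at or below the current IRQL stays pending until the IRQL drops. The levels, names, and counters here are invented for illustration; the real HAL manipulates interrupt hardware, not integers.

```c
/* Toy single-processor model of IRQL masking. */
enum { PASSIVE_LEVEL = 0, DISPATCH_LEVEL = 2 };

static int current_irql   = PASSIVE_LEVEL;
static int pending_dpcs   = 0;   /* masked, waiting for the IRQL to drop */
static int delivered_dpcs = 0;

void raise_irql(int new_irql) { current_irql = new_irql; }

void queue_dpc(void)
{
    if (current_irql >= DISPATCH_LEVEL)
        pending_dpcs++;          /* masked: delivered later */
    else
        delivered_dpcs++;        /* delivered immediately   */
}

void lower_irql(int new_irql)
{
    current_irql = new_irql;
    if (current_irql < DISPATCH_LEVEL) {  /* unmasked: drain pending work */
        delivered_dpcs += pending_dpcs;
        pending_dpcs = 0;
    }
}
```

While the IRQL is at DISPATCH_LEVEL, dispatch-level work cannot interrupt the code that raised the IRQL, which is exactly the guarantee the dispatcher-database example relies on.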

This strategy is fine for a single-processor system, but it’s inadequate for a multiprocessor configuration.

Raising the IRQL on one processor doesn’t prevent an interrupt from occurring on

another processor. The kernel also needs to guarantee mutually exclusive access across several processors.


Interlocked Operations

The simplest form of synchronization mechanism relies on hardware support for manipulating integer values and performing comparisons in a multiprocessor-safe way. Such functions include InterlockedIncrement, InterlockedDecrement, InterlockedExchange, and InterlockedCompareExchange. The InterlockedDecrement function, for example, uses the x86 lock instruction prefix (for example, lock xadd) to lock the multiprocessor bus during the subtraction operation so that another processor that’s also modifying the memory location being decremented can’t change it between the decrement’s read of the original value and its write of the decremented value. This form of basic synchronization is used by the kernel and drivers.
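C11 atomics express the same operations portably; the sketch below is a rough equivalent of the Interlocked family, not the Windows implementation. Note the return conventions: InterlockedIncrement and InterlockedDecrement return the new value, while InterlockedCompareExchange returns the original value.

```c
#include <stdatomic.h>

/* Portable equivalents built on C11 atomic read-modify-write ops. */
long interlocked_increment(atomic_long *v)
{
    return atomic_fetch_add(v, 1) + 1;   /* returns the new value */
}

long interlocked_decrement(atomic_long *v)
{
    return atomic_fetch_sub(v, 1) - 1;   /* returns the new value */
}

/* Stores 'exchange' only if the current value equals 'comparand';
 * always returns the value observed before the operation. */
long interlocked_compare_exchange(atomic_long *v, long exchange,
                                  long comparand)
{
    long expected = comparand;
    atomic_compare_exchange_strong(v, &expected, exchange);
    return expected;
}
```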



Spinlocks

The mechanism the kernel uses to achieve multiprocessor mutual exclusion is called a spinlock.

A spinlock is a locking primitive associated with a global data structure, such as the DPC

queue shown in Figure 3-24.

Figure 3-24 Using a spinlock

    Processor A                          Processor B
    Try to acquire DPC queue spinlock    Try to acquire DPC queue spinlock
    Remove DPC from queue                Add DPC to queue
      (critical section)                   (critical section)
    Release DPC queue spinlock           Release DPC queue spinlock

Before entering either critical section shown in the figure, the kernel must acquire the spinlock

associated with the protected DPC queue. If the spinlock isn’t free, the kernel keeps trying

to acquire the lock until it succeeds. The spinlock gets its name from the fact that the

kernel (and thus, the processor) is held in limbo, “spinning,” until it gets the lock.

Spinlocks, like the data structures they protect, reside in global memory. The code to acquire

and release a spinlock is written in assembly language for speed and to exploit whatever locking

mechanism the underlying processor architecture provides. On many architectures, spinlocks

are implemented with a hardware-supported test-and-set operation, which tests the

value of a lock variable and acquires the lock in one atomic instruction. Testing and acquiring

the lock in one instruction prevents a second thread from grabbing the lock between the time

when the first thread tests the variable and the time when it acquires the lock.
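A minimal test-and-set spinlock can be written with C11 atomics, which compile down to the kind of single atomic instruction described above. This is only a sketch of the idea; it omits what the kernel versions also do, such as raising the IRQL before spinning.

```c
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;

void spin_acquire(spinlock_t *s)
{
    /* atomic_flag_test_and_set is the one-instruction test-and-acquire
     * the text describes: it sets the flag and returns the old value. */
    while (atomic_flag_test_and_set_explicit(&s->locked,
                                             memory_order_acquire))
        ;  /* spin until the current holder releases the lock */
}

void spin_release(spinlock_t *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Because testing and setting happen in one atomic step, no second thread can slip in between observing the lock as free and acquiring it.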

All kernel-mode spinlocks in Windows have an associated IRQL that is always at DPC/dispatch

level or higher. Thus, when a thread is trying to acquire a spinlock, all other activity at

the spinlock’s IRQL or lower ceases on that processor. Because thread dispatching happens at

DPC/dispatch level, a thread that holds a spinlock is never preempted because the IRQL

masks the dispatching mechanisms. This masking allows code executing a critical section protected

by a spinlock to continue executing so that it will release the lock quickly. The kernel

uses spinlocks with great care, minimizing the number of instructions it executes while it

holds a spinlock.




Note Because the IRQL is an effective synchronization mechanism on uniprocessors, the

spinlock acquisition and release functions of uniprocessor HALs don’t implement spinlocks—

they simply raise and lower the IRQL.

The kernel makes spinlocks available to other parts of the executive through a set of kernel

functions, including KeAcquireSpinlock and KeReleaseSpinlock. Device drivers, for example,

require spinlocks in order to guarantee that device registers and other global data structures

are accessed by only one part of a device driver (and from only one processor) at a time. Spinlocks

are not for use by user programs—user programs should use the objects described in the

next section.

Kernel spinlocks carry with them restrictions for code that uses them. Because spinlocks

always have an IRQL of DPC/dispatch level or higher, as explained earlier, code holding a

spinlock will crash the system if it attempts to make the scheduler perform a dispatch operation

or if it causes a page fault.

Queued Spinlocks

A special type of spinlock called a queued spinlock is used in some circumstances instead of a

standard spinlock. A queued spinlock is a form of spinlock that scales better on multiprocessors

than a standard spinlock. In general, Windows will use only standard spinlocks when it

expects there to be low contention for the lock.

A queued spinlock works like this: When a processor wants to acquire a queued spinlock that

is currently held, it places its identifier in a queue associated with the spinlock. When the processor

that’s holding the spinlock releases it, it hands the lock over to the first processor identified

in the queue. In the meantime, a processor waiting for a busy spinlock checks the status

not of the spinlock itself but of a per-processor flag that the processor ahead of it in the queue

sets to indicate that the waiting processor’s turn has arrived.

The fact that queued spinlocks result in spinning on per-processor flags rather than global

spinlocks has two effects. The first is that the multiprocessor’s bus isn’t as heavily trafficked by

interprocessor synchronization. The second is that instead of a random processor in a waiting

group acquiring a spinlock, the queued spinlock enforces first-in, first-out (FIFO) ordering to

the lock. FIFO ordering means more consistent performance across processors accessing the

same locks.
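A ticket lock, sketched below in C11, is the simplest lock with the FIFO hand-off property described above. It is not how Windows implements queued spinlocks (which spin on per-processor flags, in the style of an MCS lock, to reduce bus traffic), but it illustrates the ordering idea.

```c
#include <stdatomic.h>

/* FIFO ticket lock: waiters are served strictly in arrival order. */
typedef struct {
    atomic_uint next_ticket;  /* taken by each arriving waiter */
    atomic_uint now_serving;  /* advanced by each releaser     */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *t)
{
    unsigned my = atomic_fetch_add(&t->next_ticket, 1);
    while (atomic_load(&t->now_serving) != my)
        ;  /* spin until it is this waiter's turn */
}

void ticket_release(ticket_lock_t *t)
{
    atomic_fetch_add(&t->now_serving, 1);  /* hand off to the next waiter */
}
```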

Windows defines a number of global queued spinlocks by storing pointers to them in an array