How Do Windows NT System Calls REALLY Work?

2006-07-03 01:10
Source: http://www.codeguru.com/Cpp/W-P/system/devicedriverdevelopment/article.php/c8035/

Most texts that describe Windows NT system calls keep many of the important details in the dark. This leads to confusion when trying to understand exactly what is going on when a user-mode application "calls into" kernel mode. The following article will shed light on the exact mechanism that Windows NT uses when switching to kernel-mode to execute a system service. The description is for an x86 compatible CPU running in protected mode. Other platforms supported by Windows NT will have a similar mechanism for switching to kernel-mode.

What is kernel-mode?

Contrary to what most developers believe (even kernel-mode developers), there is no mode of the x86 CPU called "kernel-mode". Other CPUs, such as the Motorola 68000, have two processor modes "built into" the CPU, i.e. they have a flag in a status register that tells the CPU whether it is currently executing in user-mode or supervisor-mode. Intel x86 CPUs do not have such a flag. Instead, the privilege level of the code segment that is currently executing determines the privilege level of the executing program. Each code segment in an application that runs in protected mode on an x86 CPU is described by an 8-byte data structure called a Segment Descriptor. A segment descriptor contains (among other information) the start address of the code segment it describes, the length of the code segment and the privilege level at which the code in the segment will execute. Code that executes in a code segment with a privilege level of 3 is said to run in user mode, and code that executes in a code segment with a privilege level of 0 is said to run in kernel mode. In other words, kernel-mode (privilege level 0) and user-mode (privilege level 3) are attributes of the code and not of the CPU. Intel calls privilege level 0 "Ring 0" and privilege level 3 "Ring 3". The x86 CPU has two more privilege levels (rings 1 and 2) that Windows NT does not use. The reason is that Windows NT was designed to run on several other hardware platforms that may or may not have four privilege levels like the Intel x86 CPU.

The x86 CPU will not allow code that is running at a lower privilege level (numerically higher) to call directly into code that is running at a higher privilege level (numerically lower). If this is attempted, the CPU automatically generates a general protection (GP) exception. A general protection exception handler in the operating system is then called and can take the appropriate action (warn the user, terminate the application, etc.). Note that all of the protection discussed above, including the privilege levels, is a feature of the x86 CPU and not of Windows NT. Without this support from the CPU, Windows NT could not implement memory protection as described above.

Where do the Segment Descriptors reside?

Each code segment that exists in the system is described by a segment descriptor, and there are potentially a great many code segments in a system (each program may have several). The segment descriptors must therefore be stored somewhere the CPU can read them in order to grant or deny access to a program that wishes to execute code in a segment. Intel chose not to store all this information on the CPU chip itself but in main memory. Two tables in main memory store segment descriptors: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Two registers in the CPU hold the addresses and sizes of these descriptor tables so that the CPU can find the segment descriptors: the Global Descriptor Table Register (GDTR) and the Local Descriptor Table Register (LDTR). It is the operating system's responsibility to set up the descriptor tables and to load the GDTR and LDTR with the addresses of the GDT and LDT respectively. This has to be done very early in the boot process, even before the CPU is switched into protected mode, because without the descriptor tables no memory segments can be accessed in protected mode. Figure 1 below illustrates the relationship between the GDTR, LDTR, GDT and LDT.



Since there are two segment descriptor tables it is not enough to use an index to uniquely select a segment descriptor. A bit that identifies in which of the two tables the segment descriptor resides is necessary. The index combined with the table indicator bit is called a segment selector. The segment selector format is displayed below.




As can be seen in figure 2 above, the segment selector also contains a two-bit field called the Requestor Privilege Level (RPL). These bits are used to determine whether a certain piece of code may access the code segment descriptor that the selector points to. For instance, suppose code running at privilege level 3 (user mode) tries to jump to or call code in the segment described by that descriptor, and the privilege information indicates that only code running at privilege level 0 may use the segment. In that case a general protection exception occurs. This is how the x86 CPU makes sure that no ring 3 (user-mode) code can get access to ring 0 (kernel-mode) code. The truth is slightly more complicated than this. For the details of the RPL field, see "Protected Mode Software Architecture" in the further reading list. For our purposes it is enough to know that the RPL field is used for privilege checks on the code that tries to use the segment selector to read a segment descriptor.
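
The selector layout described above (index, table indicator bit, RPL) can be illustrated with a small sketch. It assumes the standard x86 layout of a 16-bit selector: RPL in bits 0-1, the table indicator in bit 2 and the index in bits 3-15. The names `Selector` and `decode_selector` are made up for this illustration.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # descriptor index within the selected table
    table: str  # "GDT" or "LDT", taken from the table indicator (TI) bit
    rpl: int    # requestor privilege level, 0..3


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    rpl = value & 0x3                        # bits 0-1
    table = "LDT" if value & 0x4 else "GDT"  # bit 2
    index = (value & 0xFFFF) >> 3            # bits 3-15
    return Selector(index, table, rpl)


# 0x0008 is the selector that appears in the debugger output later in this article.
print(decode_selector(0x0008))
```

For the selector 0x0008 this gives index 1 in the GDT with RPL 0, which agrees with the "GDT Index = 1, RPL = 0" line in the experiment at the end of the article.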


Interrupt gates

So if application code running in user-mode (at privilege level 3) cannot call code running in kernel-mode (at privilege level 0) how do system calls in Windows NT work? The answer again is that they use features of the CPU. In order to control transitions between code executing at different privilege levels, Windows NT uses a feature of the x86 CPU called an interrupt gate. In order to understand interrupt gates we must first understand how interrupts are used in an x86 CPU executing in protected mode.
Like most other CPUs, the x86 CPU has an interrupt vector table that contains information about how each interrupt should be handled. In real-mode, the x86 CPU's interrupt vector table simply contains pointers (4 byte values) to the Interrupt Service Routines that will handle the interrupts. In protected-mode, however, the interrupt vector table contains Interrupt Gate Descriptors which are 8 byte data structures that describe how the interrupt should be handled. An Interrupt Gate Descriptor contains information about what code segment the Interrupt Service Routine resides in and where in that code segment the ISR starts. The reason for having an Interrupt Gate Descriptor instead of a simple pointer in the interrupt vector table is the requirement that code executing in user-mode cannot directly call into kernel-mode. By checking the privilege level in the Interrupt Gate Descriptor the CPU can verify that the calling application is allowed to call the protected code at well defined locations (this is the reason for the name "Interrupt Gate", i.e. it is a well defined gate through which user-mode code can transfer control to kernel-mode code).
The Interrupt Gate Descriptor contains a Segment Selector which uniquely defines the Code Segment Descriptor that describes the code segment that contains the Interrupt Service Routine. In the case of our Windows NT system call, the segment selector points to a Code Segment Descriptor in the Global Descriptor Table. The Global Descriptor Table contains all Segment Descriptors that are "global", i.e. that are not associated with any particular process running in the system (in other words, the GDT contains Segment Descriptors that describe operating system code and data segments). See figure 3 below for the relationship between the Interrupt Descriptor Table Entry associated with the 'int 2e' instruction, the Global Descriptor Table Entry and the Interrupt Service Routine in the target code segment.
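
The privilege check on a gate can be sketched in a few lines. The rule modelled here is the one the x86 applies to a software interrupt ('int n'): the current privilege level (CPL) of the caller must be numerically less than or equal to the DPL stored in the gate descriptor, otherwise a general protection exception is raised. The function name is hypothetical, and the sketch ignores every other check the CPU performs.

```python
def software_interrupt_allowed(cpl: int, gate_dpl: int) -> bool:
    """Return True if code running at `cpl` may execute 'int n' through a gate with `gate_dpl`.

    Lower numbers mean more privilege, so the caller must be at least
    as privileged as the gate's DPL requires.
    """
    return cpl <= gate_dpl


# A gate with DPL = 3 is open to user-mode code (CPL = 3).
print(software_interrupt_allowed(cpl=3, gate_dpl=3))
# A gate with DPL = 0 is not; the CPU would raise a GP exception instead.
print(software_interrupt_allowed(cpl=3, gate_dpl=0))
```

This is why the gate for the system call interrupt must carry DPL = 3, while the code segment it leads to carries DPL = 0.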




Back to the NT system call

Having covered the background material, we are ready to describe exactly how a Windows NT system call finds its way from user-mode into kernel-mode. System calls in Windows NT are initiated by executing an 'int 2e' instruction. The 'int' instruction causes the CPU to execute a software interrupt, i.e. it goes to index 0x2e in the Interrupt Descriptor Table and reads the Interrupt Gate Descriptor at that location. The Interrupt Gate Descriptor contains the Segment Selector of the code segment that contains the Interrupt Service Routine (the ISR). It also contains the offset of the ISR within the target code segment. The CPU uses the Segment Selector in the Interrupt Gate Descriptor to index into the GDT or LDT (depending on the TI bit in the segment selector). Once the CPU has located the target segment descriptor, it loads the information from that descriptor into the CPU. It also loads the EIP register with the offset from the Interrupt Gate Descriptor. At this point the CPU is almost set up to start executing the ISR code in the kernel-mode code segment.
Translator's note: an interrupt is not the same as a call. An interrupt is not subject to the RPL restriction in the segment selector, whereas a call is.


The CPU switches automatically to the kernel-mode stack

Before the CPU starts to execute the ISR in the kernel-mode code segment, it needs to switch to the kernel-mode stack. The reason is that kernel-mode code cannot trust the user-mode stack to be valid or to have enough room. For instance, malicious user-mode code could set its stack pointer to point to invalid memory and then execute an 'int 2e' instruction, crashing the system when the kernel-mode functions use the invalid stack pointer. Each privilege level in the x86 protected-mode environment therefore has its own stack. When control is transferred to a higher privilege level through an interrupt gate as described above, the CPU automatically saves the user-mode program's SS, ESP, EFLAGS, CS and EIP registers on the kernel-mode stack. The Windows NT system service dispatcher function (KiSystemService) also needs access to the parameters that the user-mode code pushed onto its stack before it executed 'int 2e'. By convention, the user-mode code must set up the EDX register to contain a pointer to the parameters on the user-mode stack before executing the 'int 2e' instruction. KiSystemService can then simply copy as many arguments as the called system function needs from the user-mode stack to the kernel-mode stack before calling the system function. See figure 4 below for an illustration of this.
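
The automatic save described above, and the matching restore performed by IRET, can be modelled with a toy stack. This is only a sketch: the kernel-mode stack is a Python list, `append` stands for a push, and the register values are invented for the example.

```python
def enter_kernel(user_regs: dict, kernel_stack: list) -> None:
    """Model the CPU saving the interrupted user-mode context on the kernel-mode stack."""
    # Push order on a privilege-level change: SS, ESP, EFLAGS, CS, EIP.
    for name in ("ss", "esp", "eflags", "cs", "eip"):
        kernel_stack.append(user_regs[name])


def iret(kernel_stack: list) -> dict:
    """Model IRET: pop the saved values in the reverse order of the pushes."""
    restored = {}
    for name in ("eip", "cs", "eflags", "esp", "ss"):
        restored[name] = kernel_stack.pop()
    return restored


# Hypothetical user-mode register values, chosen only for illustration.
user_regs = {"ss": 0x23, "esp": 0x0012FF00, "eflags": 0x202, "cs": 0x1B, "eip": 0x77F81234}
kernel_stack = []

enter_kernel(user_regs, kernel_stack)
frame_top = kernel_stack[-1]       # the saved EIP ends up on top of the stack
restored = iret(kernel_stack)
print(restored == user_regs)
```

Because EIP is pushed last, it is the first value IRET pops, which is how execution resumes at the instruction after 'int 2e'.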




What system call are we calling?

Since all Windows NT system calls use the same 'int 2e' software interrupt to switch into kernel-mode, how does the user-mode code tell the kernel-mode code which system function to execute? The answer is that an index is placed in the EAX register before the 'int 2e' instruction is executed. The kernel-mode ISR looks in the EAX register and calls the specified kernel-mode function if all parameters passed from user-mode appear to be correct. The call parameters (for instance, those passed to our OpenFile function) are passed to the kernel-mode function by the ISR.
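
The dispatch step can be sketched as a table lookup followed by an argument copy. Everything in this sketch is hypothetical: the two services, the table layout and the function names are invented, and the real dispatcher performs many more checks. The user-mode stack is modelled as a Python list, and `args_offset` plays the role of the pointer to the parameters that user-mode code passes in a register.

```python
def example_add(a, b):
    """Hypothetical system service taking two arguments."""
    return a + b


def example_negate(a):
    """Hypothetical system service taking one argument."""
    return -a


# Hypothetical service table: index -> (handler, number of arguments).
SERVICE_TABLE = [(example_add, 2), (example_negate, 1)]


def dispatch(service_index, user_stack, args_offset):
    """Look up the service selected by the index and call it with copied arguments."""
    if not 0 <= service_index < len(SERVICE_TABLE):
        raise ValueError("invalid system service index")
    handler, argc = SERVICE_TABLE[service_index]
    # Copy exactly as many arguments as the service needs to the kernel side.
    kernel_args = list(user_stack[args_offset:args_offset + argc])
    if len(kernel_args) != argc:
        raise ValueError("not enough arguments on the user-mode stack")
    return handler(*kernel_args)


print(dispatch(0, [10, 32, 99], 0))
```

Storing the argument count next to each handler is what lets the dispatcher copy the right number of values without knowing anything else about the service.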
Returning from the system call

Once the system call has completed, the kernel-mode code returns by executing an IRET instruction, which makes the CPU restore the running program's original registers. IRET pops the saved register values from the kernel-mode stack and causes the CPU to continue execution in the user-mode code at the instruction immediately after 'int 2e'.

Experiment

By examining the Interrupt Gate Descriptor for entry 2e in the Interrupt Descriptor Table, we can confirm that the CPU finds the Windows NT system service dispatcher routine as described in this article. The code sample for this article contains a debugger extension for the WinDbg kernel-mode debugger that dumps a descriptor from the GDT, LDT or IDT.

Translator's note: the sample file that accompanied the original article is corrupted.

The WinDbg debugger extension is a DLL called 'protmode.dll' (Protected Mode). It is loaded into WinDbg by using the following command: ".load protmode.dll" after having copied the DLL into the directory that contains the kdextx86.dll for our target platform. Break into the WinDbg debugger (CTRL-C) once you are connected to your target platform. The syntax for displaying the IDT descriptor for 'int 2e' is "!descriptor IDT 2e". This dumps out the following information:

kd>!descriptor IDT 2e
------------------- Interrupt Gate Descriptor --------------------
IDT base = 0x80036400, Index = 0x2e, Descriptor @ 0x80036570
80036570 c0 62 08 00 00 ee 46 80
Segment is present, DPL = 3, System segment, 32-bit descriptor
Target code segment selector = 0x0008 (GDT Index = 1, RPL = 0)
Target code segment offset = 0x804662c0
------------------- Code Segment Descriptor --------------------
GDT base = 0x80036000, Index = 0x01, Descriptor @ 0x80036008
80036008 ff ff 00 00 00 9b cf 00
Segment size is in 4KB pages, 32-bit default operand and data size
Segment is present, DPL = 0, Not system segment, Code segment
Segment is not conforming, Segment is readable, Segment is accessed
Target code segment base address = 0x00000000
Target code segment size = 0x000fffff
The 'descriptor' command reveals the following:

The descriptor at index 2e in the IDT is at address 0x80036570.

The raw descriptor data is C0 62 08 00 00 EE 46 80.

This means that:

The descriptor is marked present, i.e. the present bit in the Interrupt Gate Descriptor is set.

Code running at privilege level 3, or at any more privileged level, can use this Interrupt Gate (DPL = 3). This is what allows user-mode code to execute 'int 2e'.

The Segment that contains the interrupt handler for our system call (2e) is described by a Segment Descriptor residing at index 1 in the GDT.

KiSystemService starts at offset 0x804662c0 within the target segment.
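
The raw bytes can also be decoded by hand. The sketch below assumes the standard layout of a 32-bit x86 interrupt gate: offset bits 0-15 in bytes 0-1, the segment selector in bytes 2-3, a reserved byte, an access byte (present bit, DPL, gate type) and offset bits 16-31 in bytes 6-7. The function name is made up for this illustration.

```python
import struct


def decode_interrupt_gate(raw: bytes) -> dict:
    """Decode an 8-byte x86 interrupt gate descriptor (little-endian)."""
    offset_low, selector, _reserved, access, offset_high = struct.unpack("<HHBBH", raw)
    return {
        "offset": (offset_high << 16) | offset_low,  # entry point of the ISR
        "selector": selector,                        # target code segment selector
        "present": bool(access & 0x80),              # bit 7 of the access byte
        "dpl": (access >> 5) & 0x3,                  # bits 5-6 of the access byte
        "gate_type": access & 0x0F,                  # 0xE = 32-bit interrupt gate
    }


# The bytes dumped by "!descriptor IDT 2e" above.
gate = decode_interrupt_gate(bytes.fromhex("c0 62 08 00 00 ee 46 80"))
print(hex(gate["offset"]), hex(gate["selector"]), gate["dpl"], gate["present"])
```

Decoding the dump this way yields offset 0x804662c0, selector 0x0008 and DPL 3, matching the values the debugger extension printed.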







The "!descriptor IDT 2e" command also dumps out the target code segment descriptor at index 1 in the GDT. This is an explanation of the data dumped from the GDT descriptor:

The Code Segment Descriptor at index 1 in the GDT is at address 0x80036008.

The raw descriptor data is FF FF 00 00 00 9B CF 00.

This means that:

The size is in 4KB pages. This means that the size field (0x000fffff) counts pages rather than bytes: it is the number of the last 4KB page in the segment, so the segment covers 0x000fffff + 1 = 0x100000 pages of 4096 bytes each. This yields 4GB, which is the size of the full address space that can be accessed from kernel-mode. In other words, the whole 4GB address space is described by this segment descriptor. This is the reason kernel-mode code can access any address in user-mode as well as in kernel-mode.

The segment is a kernel-mode segment (DPL=0).

The segment is not conforming. See further reading, "Protected Mode Software Architecture" for a full discussion of this field.

The segment is readable. This means that code can read from the segment. This is used for memory protection. See further reading, "Protected Mode Software Architecture" for a full discussion of this field.

The segment has been accessed. See further reading, "Protected Mode Software Architecture" for a full discussion of this field.
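
The same fields can be recovered from the raw bytes with a short sketch. It assumes the standard x86 segment descriptor layout: limit bits 0-15 in bytes 0-1, base bits 0-23 in bytes 2-4, an access byte (present, DPL, type bits), a byte holding the flags and limit bits 16-19, and base bits 24-31 in byte 7. The function name is made up for this illustration.

```python
import struct


def decode_segment_descriptor(raw: bytes) -> dict:
    """Decode an 8-byte x86 code or data segment descriptor (little-endian)."""
    limit_low, base_low, base_mid, access, flags_limit, base_high = struct.unpack("<HHBBBB", raw)
    limit = ((flags_limit & 0x0F) << 16) | limit_low
    granular = bool(flags_limit & 0x80)       # G bit: limit counts 4KB pages
    unit = 4096 if granular else 1
    return {
        "base": (base_high << 24) | (base_mid << 16) | base_low,
        "limit": limit,
        "size_bytes": (limit + 1) * unit,
        "dpl": (access >> 5) & 0x3,
        "present": bool(access & 0x80),
        "is_code": bool(access & 0x08),       # type bit 3: code (1) or data (0)
        "conforming": bool(access & 0x04),    # type bit 2, meaningful for code segments
        "readable": bool(access & 0x02),      # type bit 1, meaningful for code segments
        "accessed": bool(access & 0x01),      # type bit 0
    }


# The bytes dumped for GDT index 1 above.
seg = decode_segment_descriptor(bytes.fromhex("ff ff 00 00 00 9b cf 00"))
print(hex(seg["base"]), hex(seg["limit"]), seg["size_bytes"], seg["dpl"])
```

With a limit of 0xfffff and 4KB granularity the computed size is 4294967296 bytes, i.e. the 4GB discussed above.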



To build the ProtMode.dll WinDbg debugger extension DLL, open the project in Visual Studio 6.0 and click build. For an introduction of how to create debugger extensions like ProtMode.dll, see the SDK that comes with the "Debugging Tools for Windows" which is a free download from Microsoft.

Further Reading

For information on the Protected Mode of the Intel x86 CPU there are two great sources:

"Intel Architecture Software Developers Manual, Volume 3 - System Programming Guide". Available from Intel's web site in PDF format.

"Protected Mode Software Architecture" by Tom Shanley. Available from Amazon.com (published by Addison Wesley).

For more programming details about the x86 CPU, must-haves are:

Intel Architecture Software Developers Manual, Volume 1 - Basic Architecture.

Intel Architecture Software Developers Manual, Volume 2 - Instruction Set Reference Manual.

Both these books are available in PDF format on the Intel web site (you can also get a free hardcopy of these two books. Volume 3 is however only available in PDF format).