您的位置:首页 > 其它

Windbg内核调试之五: 一次利用Dump文件调试Deadlock的实战

2012-06-08 14:52 681 查看


/article/5316526.html

最近遇到项目的一个bug,安装产品之后系统hang住。大致的现象是系统logon之后,Taskbar显示,然后hang住,Ctrl+Alt+Del不能调出TaskManager,Mouse不能移动,硬盘LED不闪烁。由于机器不在手边,远程调试又极其慢(实际上是国外的机器),所以只好请求对方手动触发一个蓝屏(有关手动触发蓝屏的方式,请参考:http://msdn.microsoft.com/en-us/library/cc266483.aspx),得到一个完全dump文件(有关完全dump文件的设置,请参考:Windbg内核调试之四:
Dump文件分析),利用该dump文件分析系统hang住时刻的状态。

注:这个事例中,利用Ctrl+Scroll Lock+Scroll Lock手动触发蓝屏,此键盘消息可以响应。

祭起windbg,Open Crash Dump:



Symbol search path is: srv*c:\symbols*http://msdl.microsoft.com/download/symbols;E:\Win7.Symbols

Executable search path is:

Windows 7 Kernel Version 7201 MP (2 procs) Free x86 compatible

Product: WinNt, suite: TerminalServer SingleUserTS Personal

Built by: 7201.0.x86fre.winmain_win7ids.090601-1516

Kernel base = 0x8363e000 PsLoadedModuleList = 0x83786810

Debug session time: Thu Aug 6 17:04:40.745 2009 (GMT+8)

System Uptime: 0 days 0:09:38.352

Loading Kernel Symbols............................................................................................................

Loading User Symbols

Loading unloaded module list.....................................................................................................




系统是Window 7的RC版本7201,运行在x86系统上。先analyze –v一下,看看调试器给定的信息是否可用:



0: kd> !analyze -v

******************************************************************************

*

* Bugcheck Analysis

*

******************************************************************************

MANUALLY_INITIATED_CRASH (e2)

The user manually initiated this crash dump.

Arguments:

Arg1: 00000000

Arg2: 00000000

Arg3: 00000000

Arg4: 00000000

Debugging Details:

------------------

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0xE2

PROCESS_NAME: System

CURRENT_IRQL: 2

LAST_CONTROL_TRANSFER: from 9066a1aa to 8371ac88

STACK_TEXT:

837649e4 9066a1aa 000000e2 00000000 00000000 nt!KeBugCheckEx+0x1e

83764a0c 9066a2ba 88a60b48 000000c6 00000000 kbdhid!KbdHidProcessCrashDump+0x1e2

83764a30 8de20d50 00000001 83764a4c 000000c6 kbdhid!KbdHid_InsertCodesIntoQueue+0x88

83764a44 8de20edd 000000c6 00000000 9066a232 HIDPARSE!HidP_KbdPutKey+0x30

83764a68 8de20dbd 8de216a8 00000009 9066a232 HIDPARSE!HidP_ModifierCode+0x81

83764a8c 8de210c0 00000047 00000000 88a56890 HIDPARSE!HidP_TranslateUsage+0x61

83764abc 9066a50f 88a5692d 0000000e 00000000 HIDPARSE!HidP_TranslateUsageAndPagesToI8042ScanCodes+0x66

83764aec 836a6ab3 00000000 014b49d0 00000000 kbdhid!KbdHid_ReadComplete+0x18d

83764b34 8de0e424 83764b3c 83764b3c 88883020 nt!IopfCompleteRequest+0x128

83764b50 8de0e5df 02883008 88a28508 00000009 HIDCLASS!HidpDistributeInterruptReport+0xce

83764b8c 836a6ab3 00000000 8838c060 88a56de4 HIDCLASS!HidpInterruptReadComplete+0x1a7

83764bd0 92753868 8367e474 8902a6e0 00000000 nt!IopfCompleteRequest+0x128

83764c00 92754178 89147d70 8838c060 88c82b28 USBPORT!USBPORT_Core_iCompleteDoneTransfer+0x6e0

83764c2c 927579af 8683c028 8683c0f0 8683ca98 USBPORT!USBPORT_Core_iIrpCsqCompleteDoneTransfer+0x33b

83764c54 92751d18 8683c028 8683ca98 8683c002 USBPORT!USBPORT_Core_UsbIocDpc_Worker+0xbc

83764c78 836a6335 8683caa4 8683c002 00000000 USBPORT!USBPORT_Xdpc_Worker+0x173

83764cd4 836a6198 83767d20 83771280 00000000 nt!KiExecuteAllDpcs+0xf9

83764d20 836a5fb8 00000000 0000000e 00000000 nt!KiRetireDpcList+0xd5

83764d24 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0x38

STACK_COMMAND: kb

FOLLOWUP_IP:

kbdhid!KbdHidProcessCrashDump+1e2

9066a1aa 837d1000 cmp dword ptr [ebp+10h],0

SYMBOL_STACK_INDEX: 1

SYMBOL_NAME: kbdhid!KbdHidProcessCrashDump+1e2

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: kbdhid

IMAGE_NAME: kbdhid.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4a2487af

FAILURE_BUCKET_ID: 0xE2_kbdhid!KbdHidProcessCrashDump+1e2

BUCKET_ID: 0xE2_kbdhid!KbdHidProcessCrashDump+1e2

Followup: MachineOwner

---------



给出的信息似乎并不充分,只是针对目标系统蓝屏的一些信息,并不是想要的系统如果hang住的分析。看看进程列表:



0: kd> !process 0 0

**** NT ACTIVE PROCESS DUMP ****

PROCESS 85f22920 SessionId: none Cid: 0004 Peb: 00000000 ParentCid: 0000

DirBase: 00185000 ObjectTable: 8cc01c58 HandleCount: 2369.

Image: System

PROCESS 86c568a0 SessionId: none Cid: 0178 Peb: 7ffdf000 ParentCid: 0004

DirBase: b9c6d020 ObjectTable: 8cd97f30 HandleCount: 31.

Image: smss.exe

PROCESS 884de8f0 SessionId: 0 Cid: 0210 Peb: 7ffdd000 ParentCid: 01c4

DirBase: b9c6d060 ObjectTable: 8cc0ab68 HandleCount: 721.

Image: csrss.exe

............



进程太多就没全列出来。死锁有两种情况,一种是系统死锁,像本例一样,系统没有任何响应,这种死锁多由内核驱动引起,调试相对困难,也不易找到到底是哪个进程中的线程引发死锁,哪个进程中的线程持有资源;另一种是应用程序引发的死锁,这种情况系统多数是可以响应的,鼠标或桌面,只是某个应用程序hang住,也许是TaskManager,也许是其它应用程序。这种问题相对系统级死锁要简单的多,只要attach到hang住的应用程序上,一般都可排查出问题所在,多数情况下都是应用程序进程中的某个线程持有Critical Section,引起其它线程等待资源;另一种情况是并不是多个线程的相互等待,而是线程自旋无法继续执行。所以,判断应用程序的挂死,最有效的方法就是观察CPU的使用率(Ctrl+Alt+Del),看是否达到峰值或为0。

系统级死锁不容易找到引起问题的进程,需要关注的System、Csrss.exe、Service.exe这样比较重要的进程;应用程序死锁则重点关注applicaton本身就可以了。

本例明显是系统级的死锁,基本可以肯定是第三方驱动程序导致。进程列表是看不出什么可疑的原因的。既然怀疑是死锁的问题,运行一下!locks(!kdext*.locks,Kernal级)看看当前系统持有的锁:



0: kd> !locks

**** DUMP OF ALL RESOURCE OBJECTS ****

KD: Scanning for held locks.

Resource @ nt!PnpRegistryDeviceResource (0x837a3e80) Exclusively owned

Contention Count = 48

NumberOfExclusiveWaiters = 3

Threads: 85f55688-01<*>

Threads Waiting On Exclusive Access:

85f59798 85f59a70 8893cd48

KD: Scanning for held locks..................

Resource @ 0x87b286dc Shared 12 owning threads

Threads: 89484568-02<*> 88c5d030-03<*> 88c5a7e0-02<*> 88c5b030-02<*>

88c5cd48-02<*> 88c5c2f8-02<*> 88f858f8-02<*> 88df7030-02<*>

8932ab08-02<*> 889e4d48-02<*> 88afa3e8-02<*> 8936bd48-02<*>

KD: Scanning for held locks........................................................................................................

17099 total locks, 2 locks currently held




当前两个锁被持有,其中一个为Exclusively owned,另一个是共享锁。第一个锁为Exclusively owned访问,即专有访问,这种锁往往是引起死锁问题的原因。死锁的一种情况是由于系统正在执行的一个线程需要一个锁,而这个锁被当前系统未执行的某个线程所持有。系统中大部分是共享锁。

注意到第一个锁当前持有的线程为85f55688-01<*> (-01<*>代表当前持有的进程),等待队列中有三个线程在等待这个锁;第二个锁为共享锁,所有线程列表中的thread都持有它。

这两个锁实际上是两个resource的形式给出,通过命令看一下这两个resouce的结构:



0: kd> dt nt!_ERESOURCE -b 0x837a3e80

+0x000 SystemResourcesList : _LIST_ENTRY [ 0x8377f3c0 - 0x8377eac0 ]

+0x000 Flink : 0x8377f3c0

+0x004 Blink : 0x8377eac0

+0x008 OwnerTable : 0x87bfb068

+0x00c ActiveCount : 1

+0x00e Flag : 0x84

+0x010 SharedWaiters : 0x87db2088

+0x014 ExclusiveWaiters : 0x87db20c0

+0x018 OwnerEntry : _OWNER_ENTRY

+0x000 OwnerThread : 0x85f55688

+0x004 IoPriorityBoosted : 0y0

+0x004 OwnerReferenced : 0y0

+0x004 OwnerCount : 0y000000000000000000000000000001 (0x1)

+0x004 TableSize : 4

+0x020 ActiveEntries : 1

+0x024 ContentionCount : 0x30

+0x028 NumberOfSharedWaiters : 0

+0x02c NumberOfExclusiveWaiters : 3

+0x030 Address : (null)

+0x030 CreatorBackTraceIndex : 0

+0x034 SpinLock : 0

0: kd> dt nt!_ERESOURCE -b 0x87b286dc

+0x000 SystemResourcesList : _LIST_ENTRY [ 0x87b286a4 - 0x87b11234 ]

+0x000 Flink : 0x87b286a4

+0x004 Blink : 0x87b11234

+0x008 OwnerTable : 0x892d0130

+0x00c ActiveCount : 1

+0x00e Flag : 0

+0x010 SharedWaiters : (null)

+0x014 ExclusiveWaiters : (null)

+0x018 OwnerEntry : _OWNER_ENTRY

+0x000 OwnerThread : 0x89484568

+0x004 IoPriorityBoosted : 0y0

+0x004 OwnerReferenced : 0y0

+0x004 OwnerCount : 0y000000000000000000000000000010 (0x2)

+0x004 TableSize : 8

+0x020 ActiveEntries : 0xc

+0x024 ContentionCount : 0

+0x028 NumberOfSharedWaiters : 0

+0x02c NumberOfExclusiveWaiters : 0

+0x030 Address : (null)

+0x030 CreatorBackTraceIndex : 0

+0x034 SpinLock : 0



能看到ERESOURCE结构的OwnerThread域正是上述分析的持有锁的线程。ERESOURCE结构定义大致如下:



typedef struct _ERESOURCE {

LIST_ENTRY SystemResourcesList;

POWNER_ENTRY OwnerTable;

SHORT ActiveCount;

USHORT Flag;

PKSEMAPHORE SharedWaiters;

PKEVENT ExclusiveWaiters;

OWNER_ENTRY OwnerThreads[2];

ULONG ContentionCount;

USHORT NumberOfSharedWaiters;

USHORT NumberOfExclusiveWaiters;

union{

PVOID Address;

ULONG_PTR CreatorBackTraceIndex;

};

KSPIN_LOCK SpinLock;

} ERESOURCE, *PERESOURCE;




现在比较可疑的是线程0x85f55688所持有的资源锁,列出这个线程的stack trace看看有没有hot spot:



0: kd> !thread 0x85f55688

THREAD 85f55688 Cid 0004.0024 Teb: 00000000 Win32Thread: 00000000 WAIT: (WrLpcReply) KernelMode Non-Alertable

85f558bc Semaphore Limit 0x1

Waiting for reply to ALPC Message 8e4a46b0 : queued at port 88c5cc18 : owned by process 88c09400

Not impersonating

DeviceMap 8cc08a00

Owning Process 0 Image: <Unknown>

Attached Process 85f22920 Image: System

Wait Start TickCount 6525 Ticks: 30548 (0:00:07:56.551)

Context Switch Count 9141

UserTime 00:00:00.000

KernelTime 00:00:00.546

Win32 Start Address nt!ExpWorkerThread (0x836abb1a)

Stack Init 8e907fd0 Current 8e907220 Base 8e908000 Limit 8e905000 Call 0

Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

ChildEBP RetAddr Args to Child

8e907238 836aca58 85f55688 00000000 83767d20 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])

8e907270 836ab113 85f55748 85f55688 85f558bc nt!KiSwapThread+0x266

8e907298 836a524f 85f55688 85f55748 00000000 nt!KiCommitThreadWait+0x1df

8e907310 836f4ad7 85f558bc 00000011 00000000 nt!KeWaitForSingleObject+0x393

8e907338 838b3bae 85f558bc 00000000 00000000 nt!AlpcpSignalAndWait+0x7b

8e90735c 838a9c2f 00000000 8e9073c8 00000000 nt!AlpcpReceiveSynchronousReply+0x27

8e9073ec 8389d28f 88c5a6b0 00020000 8e907628 nt!AlpcpProcessSynchronousRequest+0x276

8e907448 838b4ad7 88c5a6b0 8e907628 00000000 nt!LpcpRequestWaitReplyPort+0x6a

8e907470 8368157a 80000c54 8e907628 8e907528 nt!NtRequestWaitReplyPort+0x4c

8e907470 8368024d 80000c54 8e907628 8e907528 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 8e907484)

8e9074f4 98b63434 80000c54 8e907628 8e907528 nt!ZwRequestWaitReplyPort+0x11 (FPO: [3,0,0])

WARNING: Stack unwind information not available. Following frames may be wrong.

8e90772c 98b638a1 00000000 00000000 98b7d6ec tmcomm!CWorkerThreadPool::CWorkerThreadPool+0xa54

8e907744 98b9d97e 8e90775c 00000000 98ba6238 tmcomm!KmCallUm+0x75

8e907784 98b9ee09 9530af7c 162a554d 98ba6238 tmactmon+0x397e

8e9077c0 98b9ef4d 0000000a 80000002 99ccc490 tmactmon+0x4e09

8e9077ec 98b9efbc 8e907844 00000000 0000000a tmactmon+0x4f4d

8e907800 98b8c450 8e907844 00000000 8e907950 tmactmon+0x4fbc

8e907828 98b91f97 8e907802 8e9078a4 8e907950 tmevtmgr+0x1450

8e90789c 98b921d7 00000000 88be10a4 8e907950 tmevtmgr!TMEvtCommunicateRoutine+0x2569

8e9078b4 98b8c450 8e907900 88be0edc 8e9079c8 tmevtmgr!TMEvtCommunicateRoutine+0x27a9

8e9078dc 98b90d0d 8e907903 8e9079a0 8e907a0c tmevtmgr+0x1450

8e9079ac 98b8e7b0 8e9079c8 8e9079e4 98b8e944 tmevtmgr!TMEvtCommunicateRoutine+0x12df

8e9079b8 98b8e944 88be0f5c 8e9079c8 8e907acc tmevtmgr+0x37b0

8e9079e4 8368157a 88be0f5c 8e907acc 82000000 tmevtmgr+0x3944

8e9079e4 8367f069 88be0f5c 8e907acc 82000000 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 8e907a0c)

8e907a7c 838266b5 8e907acc 82000000 8e907ad8 nt!ZwCreateKey+0x11 (FPO: [7,0,0])

8e907af0 8380e477 8e907cb8 82000000 00000000 nt!IopCreateRegistryKeyEx+0x61

8e907c18 8380e0bd 85f4d530 00000002 82000000 nt!IopOpenOrCreateDeviceRegistryKey+0x3b1

8e907c30 8c07e594 85f4d530 00000002 82000000 nt!IoOpenDeviceRegistryKey+0x18

8e907cc8 8c016489 87dbb0e0 00000000 00000000 ndis!ndisCheckAdapterBindings+0xa7 (FPO: [Non-Fpo])

8e907cf0 8c07d013 87dbbc70 87dbb0e0 8e907d50 ndis!ndisQueuedRestoreMiniportstack+0x36 (FPO: [Non-Fpo])

8e907d00 836abc27 87dbbc70 00000000 85f55688 ndis!ndisWorkItemHandler+0xe (FPO: [Non-Fpo])

8e907d50 8384c643 00000000 a63db72a 00000000 nt!ExpWorkerThread+0x10d

8e907d90 836fe059 836abb1a 00000000 00000000 nt!PspSystemThreadStartup+0x9e

00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19



上述stack trace没有看出什么可疑的状态。再看看另外一个线程0x89484568(持有共享锁):



0: kd> !thread 0x89484568

THREAD 89484568 Cid 0cd0.0f50 Teb: 7ffa6000 Win32Thread: ffb39180 WAIT: (Executive) KernelMode Non-Alertable

b44c6680 NotificationEvent

IRP List:

89493980: (0006,0244) Flags: 00000884 Mdl: 00000000

894d6db8: (0006,0244) Flags: 00000884 Mdl: 00000000

88c4e9a0: (0006,0244) Flags: 00000884 Mdl: 00000000

894d39a0: (0006,0244) Flags: 00000884 Mdl: 00000000

Not impersonating

DeviceMap 8e12ea70

Owning Process 89020508 Image: IMJPCMNT.EXE

Wait Start TickCount 4518 Ticks: 32555 (0:00:08:27.861)

Context Switch Count 330

UserTime 00:00:00.000

KernelTime 00:00:00.031

Win32 Start Address 0x697ee751

Stack Init b44c7fd0 Current b44c6550 Base b44c8000 Limit b44c5000 Call 6a0

Priority 10 BasePriority 8 PriorityDecrement 16 IoPriority 2 PagePriority 5

*** ERROR: Module load completed but symbols could not be loaded for flyfs.sys

ChildEBP RetAddr Args to Child

b44c6568 836aca58 89484568 00000000 807c2120 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])

b44c65a0 836ab113 89484628 89484568 b44c6680 nt!KiSwapThread+0x266

b44c65c8 836a524f 89484568 89484628 00000000 nt!KiCommitThreadWait+0x1df

b44c6640 8bd6054f b44c6680 00000000 00000000 nt!KeWaitForSingleObject+0x393

b44c66a8 8bd5c54a 8bd5c4aa 894d5c90 00000003 fltmgr!FltpPostSyncOperation+0x8d (FPO: [Non-Fpo])

b44c66cc 8bd59765 894d5c90 89435060 894d5c90 fltmgr!FltpGetNormalizedFileName+0x5e (FPO: [Non-Fpo])

b44c66e4 8bd43773 894d5c90 00000000 894d5c90 fltmgr!FltpCreateFileNameInformation+0x81 (FPO: [Non-Fpo])

b44c6704 8bd438c7 877273c0 00000000 00000000 fltmgr!HandleStreamListNotSupported+0x125 (FPO: [Non-Fpo])

b44c6734 8bd43fa3 c00000bb 89435068 87b287d8 fltmgr!FltpGetFileNameInformation+0xc7 (FPO: [Non-Fpo])

b44c675c 8bd97a1d 00435068 00000101 b3187e2c fltmgr!FltGetFileNameInformation+0x12b (FPO: [Non-Fpo])

b44c67d4 8bd892be 89435068 00000101 b3187e2c syscow32v!SysCowGetFileNameInformation+0xcd (FPO: [Non-Fpo]) (CONV: stdcall) [d:\code\engineering\products\systemok\restore\driver\support.cpp @ 2397]

b44c6868 8bd3daeb 89435068 b44c6888 b44c68b4 syscow32v!SysCowPreCreate+0x31e (FPO: [Non-Fpo]) (CONV: stdcall) [d:\code\engineering\products\systemok\restore\driver\create.cpp @ 299]

b44c68d4 8bd409f0 b44c6918 89493980 00000000 fltmgr!FltpPerformPreCallbacks+0x34d (FPO: [Non-Fpo])

b44c68ec 8bd541fe b44c6918 8bd57f3c 00000000 fltmgr!FltpPassThroughInternal+0x40 (FPO: [Non-Fpo])

b44c6900 8bd548b7 b44c6918 89493980 89493b98 fltmgr!FltpCreateInternal+0x24 (FPO: [Non-Fpo])

b44c6944 8367a567 87e5d668 87727008 89493bbc fltmgr!FltpCreate+0x2c9 (FPO: [Non-Fpo])

b44c695c 8bda7bd1 89493ba0 89493980 892be848 nt!IofCallDriver+0x63

WARNING: Stack unwind information not available. Following frames may be wrong.

b44c6980 8bda91ec 87e5d7b0 89493980 b44c6a30 flyfs+0x1bd1

b44c699c 8bda94f3 87e5d7b0 0000001b 894dab28 flyfs+0x31ec

b44c6a50 8bda9f60 87e5d7b0 89493980 b44c6a70 flyfs+0x34f3

b44c6a60 8bda9f8a 87e5d7b0 89493980 b44c6a88 flyfs+0x3f60

b44c6a70 8367a567 87e5d7b0 89493980 894dab84 flyfs+0x3f8a

b44c6a88 8387e5f5 9ce1a1da b44c6c30 00000000 nt!IofCallDriver+0x63

b44c6b60 8385f1a7 876f0678 87f6dde8 89403d20 nt!IopParseDevice+0xed7

b44c6bdc 83885215 00000000 b44c6c30 00000240 nt!ObpLookupObjectName+0x4fa

b44c6c3c 8387d573 b44c6dc0 85f6dde8 89484500 nt!ObOpenObjectByName+0x159

b44c6cb8 838b4d59 b44c6e04 00120089 b44c6dc0 nt!IopCreateFile+0x673

b44c6d14 83806c27 b44c6e04 00120089 b44c6dc0 nt!IoCreateFileEx+0x9e

b44c6d70 8bdac7c7 b44c6e04 00120089 b44c6dc0 nt!IoCreateFileSpecifyDeviceObjectHint+0x59

b44c6e10 8bda9593 0000004d b44c6ec0 87e5d7b0 flyfs+0x67c7

b44c6ec8 8bda9f60 87e5d7b0 894d6db8 b44c6ee8 flyfs+0x3593

b44c6ed8 8bda9f8a 87e5d7b0 894d6db8 b44c6f00 flyfs+0x3f60

b44c6ee8 8367a567 87e5d7b0 894d6db8 892e76ac flyfs+0x3f8a

b44c6f00 8387e5f5 9ce1a562 b44c70a8 00000000 nt!IofCallDriver+0x63

b44c6fd8 8385f1a7 876f0678 85f6dde8 894d8698 nt!IopParseDevice+0xed7

b44c7054 83885215 00000000 b44c70a8 00000240 nt!ObpLookupObjectName+0x4fa

b44c70b0 8387d573 b44c71d8 85f6dde8 00010000 nt!ObOpenObjectByName+0x159

b44c712c 838b4d59 894d5490 00100001 b44c71d8 nt!IopCreateFile+0x673

b44c7188 8bd5b8e5 894d5490 00100001 b44c71d8 nt!IoCreateFileEx+0x9e

b44c7220 8bd5bc59 00000000 00000000 0000fffe fltmgr!FltpExpandFilePathWorker+0x167 (FPO: [Non-Fpo])



IMJPCMNT.EXE是Microsoft输入法进程。
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: