您的位置：首页 > 其它

经典定位等待事件问题方法

2016-04-12 18:03 267 查看

今天遇到个问题，客户说某天的11:45开始，系统遇到了大量的cursor pin s wait on X，经历一个小时后自动消失，需要查找原因。

这报错一般是某个会话需要申请S模式的mutex，而mutex被其他会话以X模式占有了。查holder也很容易，11g版本前看p2raw的前8位，将16进制转换成10进制即为holder的sid，在11g之后只需直接看blocking_session即可。

在11g中，我们有ash，这个很方便能查到过去发生的一切。

（1）根据客户反馈的情况，cursor pin S wait on X最严重的一个时间段为12:25~12:35，取这个时间段为例。发现cursor: pin S wait on X的进程都被sid为2的进程堵塞。

SQL>
select event,blocking_session,sql_id,count(*)
from dba_hist_active_sess_history
ash

2 where
sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd
hh24:mi:ss')

3 and
sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd
hh24:mi:ss')

4 and
event='cursor: pin S wait on X'

5 group
by event,blocking_session,sql_id

6 /

EVENT
BLOCKING_SESSION
SQL_ID COUNT(*)

---------------------------------------------------------------- ---------------- ------------- ----------
cursor: pin
S wait
on X 2
8duehra3kdx62 1050

--cursor:
pin S
wait on
X的SQL为：
SQL> select
sql_text from
v$sql
where sql_id='8duehra3kdx62';

SQL_TEXT

--------------------------------------------------------------------------------

insert into mytable_a1_p1_h1 select * from mytable_a2_p2_h2

（2）SID为2的进程被sid为1129的进程堵塞：

SQL>
select event,blocking_session,sql_id,count(*)
from dba_hist_active_sess_history
ash

2 where
sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd
hh24:mi:ss')

3 and
sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd
hh24:mi:ss')

4 and
ash.session_id=2

5 group
by event,blocking_session,sql_id

6 /

EVENT
BLOCKING_SESSION
SQL_ID COUNT(*)

---------------------------------------------------------------- ---------------- ------------- ----------
library cache
lock 1129
8duehra3kdx62 60

SQL>

（3）sid为1129的进程被sid 951的堵塞：

SQL>
select event,blocking_session,sql_id,count(*)
from dba_hist_active_sess_history
ash

2 where
sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd
hh24:mi:ss')

3 and
sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd
hh24:mi:ss')

4 and
ash.session_id=1129

5 group
by event,blocking_session,sql_id

6 /

EVENT
BLOCKING_SESSION
SQL_ID COUNT(*)

---------------------------------------------------------------- ---------------- ------------- ----------
row cache
lock
951 5nkcsttxny4wz 60

SQL>

--而sid 1129的sql为：
SQL> select
sql_text from
v$sql
where sql_id='5nkcsttxny4wz';

SQL_TEXT

--------------------------------------------------------------------------------

truncate table mytable_a2_p2_h2

SQL>

（4）sid 951的进程被Wnnn的进程堵塞：

SQL>
select event,blocking_session,sql_id,program,count(*)
from dba_hist_active_sess_history
ash

2 where
sample_time>=to_timestamp('2013-06-24 12:25:00','yyyy-mm-dd
hh24:mi:ss')

3 and
sample_time<=to_timestamp('2013-06-24 12:35:00','yyyy-mm-dd
hh24:mi:ss')

4 and
ash.session_id=951

5 group
by event,blocking_session,sql_id,program

6 /

EVENT
BLOCKING_SESSION
SQL_ID PROGRAM
COUNT(*)

---------------------------------------------------------------- ---------------- ------------- ---------------------------------------------------------------- ----------

   oracle@g4as1010
(W004)
8
log file
sequential
read
oracle@g4as1010
(W004) 52

（5）最终的堵塞进程为Wnnn。

Wnnn是11g新的进程，有SMCO进程spawn出来，用于主动空间的管理(It perform proactive space allocation and space reclaimation)。

从上述SQL看，由于有insert又有truncate，因此空间清理将有SMCO和Wnnn进程介入。在操作的过程中，堵塞其他的进程的操作。

到这里，我们可以有个初步的解决方案：

既然是Wnnn进程堵塞了其他进程的操作，那么我们禁用这个11g的新功能，禁用SMCO的空间管理 “Tablespace-level space（Extent）pre-allocation”，即可避免该问题的发生。

alter
system
set "_ENABLE_SPACE_PREALLOCATION"=0

还原的反向操作为：

alter
system
set "_ENABLE_SPACE_PREALLOCATION"=3

注：

* 0 to turn off the tbs pre-extension feature.

* 1 To enable tablespace extension.

* 2 To enable segment growth.

* 3 To enable extent growth.

* 4 To enable chunk allocation.

但是我们到这里也会问一个问题，为什么Wnnn的操作没有blocking_session，却要操作那么久的时间。这时一个不常见的等待事件映入眼帘：log file sequential read 。

log file sequential read 一般是在恢复或者logmnr的时候，才将redo log file的内容顺序的读入到内存中，此时为什么需要做log file sequential read？

根据上述的时间点，我们进一步结合alertlog看，我们发现在对应的时间点，都有关于w004的ora600的报错。

Sun Jun 23 01:45:01 2013

Errors in file /aabb/sx/kgg/data/app/oracle/diag/rdbms/kgg/kgg/trace/kgg_w007_8283.trc (incident=104227):

ORA-00600: internal error code, arguments: [ktecgsc:kcbz_objdchk], [0], [0], [1], [], [], [], [], [], [], [], []

Incident details in: /aabb/sx/kgg/data/app/oracle/diag/rdbms/kgg/kgg/incident/incdir_104227/kgg_w007_8283_i104227.trc

……

结合ora -600 [ktecgsc:kcbz_objdchk]查询MOS，我们可以定位到一个11.2.0.2的bug 10237773，并且从该bug的Rediscovery Note中，也可以看到：The problem is likely to occur in a Wnnn background process ktssupd_segment_extblks_bkg is likely to appear on the stack.

Bug 10237773 ORA-600 [kcbz_check_objd_typ] / ORA-600 [ktecgsc:kcbz_objdchk]

This note gives a brief overview of bug 10237773.

The content was last updated on: 20-OCT-2011

Click here for details of each of the sections below.

Affects:

Product (Component) Oracle Server (Rdbms)

Range of versions believed to be affected Versions >= 11.2 but BELOW 12.1

Versions confirmed as being affected ? 11.2.0.2

Platforms affected Generic (all / most platforms affected)

Fixed:

This issue is fixed in ? 12.1 (Future Release)

? 11.2.0.3

? 11.2.0.2.4 Patch Set Update

? 11.2.0.2 Bundle Patch 12 for Exadata Database

? 11.2.0.2 Patch 5 on Windows Platforms

Symptoms: Related To:

? Internal Error May Occur (ORA-600)

? ORA-600 [ktecgsc:kcbz_objdchk]

? ORA-600 [kcbz_check_objd_typ]

? Stack is likely to include ktssupd_segment_extblks_bkg ? (None Specified)





Description

ORA-600 [kcbz_check_objd_typ] / ORA-600 [ktecgsc:kcbz_objdchk] can

occur as a background process tries to read a segment header into the

cache from disk for an object that has undergone a TRUNCATE or

similar operation changing the blocks data object id.

Rediscovery Notes:

The problem is likely to occur in a Wnnn background process

ktssupd_segment_extblks_bkg is likely to appear on the stack.

HOOKS OERI:ktecgsc:kcbz_objdchk OERI:kcbz_check_objd_typ STACKHAS:ktssupd_segment_extblks_bkg LIKELYAFFECTS XAFFECTS_11.2.0.1 XAFFECTS_V11020001 AFFECTS=11.2.0.1 XAFFECTS_11.2.0.2 XAFFECTS_V11020002 AFFECTS=11.2.0.2 XPRODID_5 PRODUCT_ID=5 PRODID-5 RDBMS XCOMP_RDBMS
COMPONENT=RDBMS TAG_OERI OERI FIXED_11.2.0.2.4 FIXED_11.2.0.2.BP12 FIXED_11.2.0.3 FIXED_12.1.0.0 FIXED_WIN:B202P05

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

References

Bug:10237773 (This link will only work for PUBLISHED bugs)

Note:245840.1 Information on the sections in this article

进一步分析trace文件，我们可以看到11：40到12：43都在dump redo record。因此，我们在那段时间看到wnnn进程的log file sequential read的等待。即wnnn进程在读取redo log file做dump。

Dump redo record（11：40到12：43）：

*** 2013-06-24 11:40:14.021
<<<<<<<<redo
record
dump的开始时间

REDO RECORD
- Thread:1
RBA: 0x011002.00080f4e.0024
LEN: 0x003c
VLD: 0x01
SCN: 0x0830.ffe6a968
SUBSCN:
3 06/23/2013
11:40:03

(LWN RBA:
0x011002.00080f27.0010
LEN: 0133
NST: 04d
SCN: 0x0830.ffe6a967)
CHANGE #1
TYP:0 CLS:4
AFN:4 DBA:0x01003c6a
OBJ:341379
SCN:0x0830.ffe6a8a7
SEQ:1 OP:14.2
ENC:0 RBL:0
ktelk redo:
xid: 0x0004.015.00137dd7

REDO RECORD
- Thread:1
RBA: 0x011002.00080f69.019c
LEN: 0x00f8
VLD: 0x01
SCN: 0x0830.ffe6a96a
SUBSCN:
38 06/23/2013
11:40:03
CHANGE #1
TYP:0 CLS:24
AFN:3 DBA:0x00cd58a4
OBJ:4294967295
SCN:0x0830.ffe6a96a
SEQ:3 OP:5.1
ENC:0 RBL:0
ktudb redo:
siz: 96
spc: 7696
flg: 0x0022
seq: 0x8a5b
rec: 0x06

   xid:
0x0004.015.00137dd7
ktubu redo:
slt: 21
rci: 5
opc: 14.5
objn: 1
objd: 341379
tsn: 4

.....

REDO RECORD
- Thread:1
RBA: 0x011002.00080fad.0178
LEN: 0x0034
VLD: 0x01
SCN: 0x0830.ffe6a96f
SUBSCN:
1 06/23/2013
11:40:03

(LWN RBA:
0x011002.00080fac.0010
LEN: 0026
NST: 04d
SCN: 0x0830.ffe6a96d)
CHANGE #1
TYP:0 CLS:4
AFN:4 DBA:0x01003c6a
OBJ:341379
SCN:0x0830.ffe6a96a
SEQ:1 OP:14.1
ENC:0 RBL:0
ktecush
redo: clear
extent
control lock

REDO RECORD
- Thread:1
RBA: 0x011002.00080fae.0010
LEN: 0x00c4
VLD: 0x01
SCN: 0x0830.ffe6a96f
SUBSCN:
3 06/23/2013
11:40:03
CHANGE #1
TYP:0 CLS:4
AFN:4 DBA:0x01003c6a
OBJ:341379
SCN:0x0830.ffe6a96f
SEQ:1 OP:13.28
ENC:0 RBL:0
High HWM

   Highwater::
0x033ffb58
ext#:
1
blk#:
8
ext size:
8

#blocks
in seg.
hdr's
freelists:
0

#blocks
below: 13

mapblk
0x00000000 offset:
1
lfdba:
0x01003c68 CHANGE #2
TYP:0 CLS:8
AFN:4 DBA:0x01003c68
OBJ:341379
SCN:0x0830.ffe6a96f
SEQ:1 OP:13.22
ENC:0 RBL:0
Redo on
Level1
Bitmap Block
Redo to
set hwm
Opcode:
32 Highwater::
0x033ffb58
ext#:
1
blk#:
8
ext size:
8

#blocks
in seg.
hdr's
freelists:
0

#blocks
below: 13

mapblk
0x00000000 offset:
1

REDO RECORD
- Thread:1
RBA: 0x011002.00080fb0.0050
LEN: 0x0084
VLD: 0x01
SCN: 0x0830.ffe6a970
SUBSCN:
10 06/23/2013
11:40:03
CHANGE #1
TYP:0 CLS:4
AFN:4 DBA:0x01003c6a
OBJ:341379
SCN:0x0830.ffe6a96f
SEQ:2 OP:13.28
ENC:0 RBL:0
Low HWM

   Highwater::
0x033ffb58
ext#:
1
blk#:
8
ext size:
8

#blocks
in seg.
hdr's
freelists:
0

#blocks
below: 13

....

*** 2013-06-24
11:40:20.839

*** 2013-06-24
12:43:25.874
descrip:"Thread 0001, Seq# 0000069858, SCN 0x08312e298f39-0xffffffffffff"

*** 2013-06-24
12:43:27.003 <<<<<<<<<<redo
record
dump的结束时间
END OF
DUMP REDO

另外，在trace文件中，也查到的bug中所说的ktssupd_segment_extblks_bkg的调用。

----- Call Stack Trace -----

calling call entry argument values in hex

location type point (? means dubious value)

-------------------- -------- -------------------- ----------------------------

ksedst1()+96 CALL skdstdst() FFFFFFFF7FFF1420 ?

   1006B0CA0 ? 000000000 ?

   00000000A ? 000000001 ?

   10BD98EB8 ?

……

   FFFFFFFF7FFFF8F0 ?

   FFFFFFFF7BD00200 ?

   FFFFFFFF7FFFB4A1 ?

ktecgshx()+28 CALL ktecgetsh() FFFFFFFF7FFFBA50 ?

   000000002 ? 000000002 ?

   FFFFFFFF7FFFB558 ?

   00010C222 ? 733B9EE60 ?

ktssupd_segment_ext CALL ktecgshx() FFFFFFFF7FFFBA50 ? <<<<<此处，发现调用ktssupd_segment_extblks_bkg函数
blks_bkg()+636 000000002
? 000000002 ?

   000000155 ?
733B9EE60 ?

   00010C212 ?
ktsj_execute_task()
PTR_CALL
ktssupd_segment_ext
71F1F2380 ? 000000024 ?

+676
blks_bkg() 000000004 ?
10A6E6000 ?

   10A6E6000 ?
00010C20D ?
ktsj_slave_main()+1
CALL ktsj_execute_task()
FFFFFFFF7FFFC570 ?
024
7417E8D78 ?
10C20E240 ?

   1023C8300 ?
000380025 ?

   10C20E090 ?
ksvrdp()+2304
PTR_CALL
ktsj_slave_main()
000000000 ? 000380000 ?

   000380025 ?
7448C78F8 ?

   000380025 ?
380025000 ?
opirip()+912 CALL ksvrdp()
000002000 ?
000000062 ?

   00038000B ?
74ABE9CD0 ?

   000001400 ?
000001648 ?
opidrv()+780 CALL opirip()
10A6E7000 ?
000000000 ?

   000380000 ?
00038000B ?

   38002A000 ?
000000001 ?
sou2o()+92 CALL opidrv()
000000032 ?
000000004 ?

   FFFFFFFF7FFFF530 ?

   0001EB250 ?

   FFFFFFFF7AB45110 ?

   FFFFFFFF7FFFF9B0 ?
opimai_real()+516
CALL sou2o()
FFFFFFFF7FFFF508 ?

   000000032 ?
000000004 ?

   FFFFFFFF7FFFF530 ?

   00010C000 ?
00010B800 ?
ssthrdmain()+320 PTR_CALL
opimai_real()
000000003 ?

   FFFFFFFF7FFFF7D8 ?

   FFFFFFFF7FFFFB2D ?

   FFFFFFFF7FFFFB8A ?

   000000000 ?
000000000 ?
main()+308 CALL ssthrdmain()
00010C000 ?
000000003 ?

   00044D000 ?
100644B60 ?

   10C233000 ?
00010C233 ?
_start()+380 CALL main()
000000003 ?

   FFFFFFFF7FFFF8E8 ?

   000000000 ?

   FFFFFFFF7FFFF7E0 ?

   FFFFFFFF7FFFF8F0 ?

   FFFFFFFF7BD00200 ?

---------------------
Binary
Stack Dump
---------------------

因此，问题的原因应该是这样的：

wnnn进程做空间管理时，读取segment header，此时有另外的进程执行truncate。触发bug 10237773 ，报错ora600[ktecgsc:kcbz_objdchk] >> 触发bug后需要dump redo，从11：40到12：43w004进程都在做redo的dump >> 做dump时w004进程（sid 951）处于log file sequential read等待 >>堵塞sid 1129进程，处于
row cache lock等待 >>堵塞sid 2进程，处于 library cache lock等待 >>堵塞其他的进程，处于 cursor pin s wait on x等待。

找到了问题根源，对应的解决方法也就明了了。workaround是禁用11g新特性，SMCO的空间管理，fix是打bug 10237773的补丁，或者升级到11.2.0.3。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航

经典 定位等待事件问题方法

经典定位等待事件问题方法