Oracle 11.2.0.3: analysis of row cache lock waits caused by the CJQ process preventing jobs from being stopped
2018-02-25 21:01
1. Symptoms

Application staff on a production system reported that a job they had launched was misbehaving and not running, leaving the related tables without data. They tried to stop the job but failed, and they could not re-launch it either.
2. Problem Analysis

Scheduler jobs normally show up with module 'DBMS_SCHEDULER'. Log in with a PL/SQL tool and run the following query:

select S.STATUS, S.INST_ID, S.SID, S.PADDR, S.SQL_ID, S.SQL_EXEC_START,
       S.ACTION, S.LAST_CALL_ET, S.BLOCKING_INSTANCE, S.BLOCKING_SESSION,
       S.EVENT, S.WAIT_CLASS
  from v$session S
 where module = 'DBMS_SCHEDULER';
The query showed one killed job: its session status was KILLED but the session had not been released. After obtaining the spid with the statement below, the OS process was killed with kill -9 on the server.
select p.spid from v$session s, v$process p where p.addr = s.paddr and s.sid = 508;
The application then re-launched the job, but it still would not run. On the server we saw that the job was stuck on the wait event row cache lock, blocked by session 1209, with P1 = 11. The query below returned no rows, so we could not locate where the row cache lock was occurring:

select * from v$rowcache where cache# = 11;
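As an aside, the cache id carried in P1 can usually be matched to a dictionary cache name by listing v$rowcache as a whole rather than filtering on a single cache#. A hedged sketch (on this system the filtered query returned no rows, even though the systemstate dump later in this article shows cid=11 is dc_objects):

```sql
-- List every dictionary cache id together with its parameter name.
-- Scanning the full list avoids missing an id that a filtered
-- query (where cache# = 11) fails to return, as happened here.
select distinct cache#, parameter
  from v$rowcache
 order by cache#;
```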
Querying session 1209 showed that it was the background process CJQ0. CJQn and Jnnn are both Oracle job queue processes, so why would a background process lock an ordinary job?

select * from v$session where sid = 1209;
In the job list shown by the PL/SQL tool, the application staff saw the following anomaly:
(The screenshot above is missing the job CSH_BRANCH_D_1DAY: it was captured after the problem had been handled. The job was present when the problem occurred, but we forgot to capture it at the time.) Many jobs showed as running, yet none of them had a session_id, which looked like zombie jobs. Since the jobs could not be killed through their sessions, we tried to stop them with the dbms_scheduler package, but they would not stop: the row cache lock event appeared again, still blocked by background session 1209.
A row cache lock cannot be traced to the specific locked object through P1 alone. Why would a background process block a job? Based on the results above, our preliminary analysis was that many zombie jobs had never been stopped cleanly, so re-launching the same jobs, or doing anything to them, ran into locks held by the background process CJQn.

SQL> show parameter processes

The parameter job_queue_processes was 1000 and aq_tm_processes was 10000; neither limit had been reached.
Further analysis: collect diagnostics on the server with oradebug. Log in to the database as sysdba:
$sqlplus / as sysdba
or
$sqlplus -prelim / as sysdba   <== when the database is already too slow, or so hung that it refuses connections

The following screenshots show the operations, using hanganalyze level 3 and systemstate dump level 258.
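The screenshots themselves are not reproduced here; as a sketch, the oradebug command sequence used to collect these dumps typically looks like the following (to be adapted to the environment; on RAC, the oradebug -g all form can collect from all instances):

```sql
-- Inside the sqlplus / as sysdba (or sqlplus -prelim) session:
oradebug setmypid
oradebug unlimit
-- hanganalyze at level 3; take it twice, about a minute apart,
-- so wait times can be compared between the two snapshots
oradebug hanganalyze 3
oradebug hanganalyze 3
-- systemstate dump at level 258 (short stack + dump, no lock elements)
oradebug dump systemstate 258
-- print the path of the trace file that was written
oradebug tracefile_name
```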
Meaning of the hanganalyze levels:
1-2: hanganalyze output only; no processes are dumped
3: level 2 + dump of processes in the IN_HANG state
4: level 3 + dump of the blockers in the wait chains (state LEAF/LEAF_NW/IGN_DMP)
5: level 4 + dump of all processes in the wait chains (state NLEAF)
Oracle recommends not going beyond level 3; level 3 is usually enough to diagnose the problem, and higher levels put extra load on the system.
systemstate dump also has several levels:
2: dump (excluding lock elements)
10: dump
11: dump + RAC global cache
256: short stack (function call stack)
258: 256+2 --> short stack + dump (excluding lock elements)
266: 256+10 --> short stack + dump
267: 256+11 --> short stack + dump + RAC global cache
Levels 11 and 267 dump the global cache and produce very large trace files; they are generally not recommended.
If there are not too many processes, level 266 is usually recommended, because it dumps each process's function call stack, which shows what each process is executing.
Generating short stacks is time-consuming, however: with very many processes, say 2000, it can take more than 30 minutes. In that case use level 10 or level 258; level 258 collects short stacks that level 10 does not, but collects less lock element data than level 10.
Below is an excerpt of the hanganalyze output. The main wait is row cache lock: session id 7 / process id 128 is blocked by session id 1209 / process id 49, consistent with the analysis above.
Chains most likely to have caused the hang: [a] Chain 1 Signature: 'rdbms ipc message'<='row cache lock' Chain 1 Signature Hash: 0xfe4c0898
Chain 1: ------------------------------------------------------------------------------- Oracle session identified by: { instance: 2 (cmbd.cmbd2) os id: 43516004 process id: 128, oracle@jkptdb2 session id: 7 session serial #: 47563 } is waiting for 'row cache lock' with wait info: { p1: 'cache id'=0xb p2: 'mode'=0x0 p3: 'request'=0x5 time in wait: 0.077179 sec heur. time in wait: 16 min 18 sec timeout after: 2.922821 sec wait id: 13467 blocking: 0 sessions wait history: * time between current wait and wait #1: 0.000045 sec 1. event: 'row cache lock' time waited: 3.000020 sec wait id: 13466 p1: 'cache id'=0xb p2: 'mode'=0x0 p3: 'request'=0x5 * time between wait #1 and #2: 0.000030 sec 2. event: 'row cache lock' time waited: 3.000016 sec wait id: 13465 p1: 'cache id'=0xb p2: 'mode'=0x0 p3: 'request'=0x5 * time between wait #2 and #3: 0.000032 sec 3. event: 'row cache lock' time waited: 3.000016 sec wait id: 13464 p1: 'cache id'=0xb p2: 'mode'=0x0 p3: 'request'=0x5 } and is blocked by => Oracle session identified by: { instance: 2 (cmbd.cmbd2) os id: 19792004 process id: 49, oracle@jkptdb2 (CJQ0) session id: 1209 session serial #: 9 } which is waiting for 'rdbms ipc message' with wait info: { p1: 'timeout'=0x14 time in wait: 0.168064 sec heur. time in wait: 2.171697 sec timeout after: 0.031936 sec wait id: 72756966 blocking: 1 session wait history: * time between current wait and wait #1: 0.000350 sec 1. event: 'rdbms ipc message' time waited: 0.200014 sec wait id: 72756965 p1: 'timeout'=0x14 * time between wait #1 and #2: 0.000304 sec 2. event: 'rdbms ipc message' time waited: 0.200016 sec wait id: 72756964 p1: 'timeout'=0x14 * time between wait #2 and #3: 0.000307 sec 3. event: 'rdbms ipc message' time waited: 0.200041 sec wait id: 72756963 p1: 'timeout'=0x14 }
Chain 1 Signature: 'rdbms ipc message'<='row cache lock' Chain 1 Signature Hash: 0xfe4c0898
Below are excerpts from the systemstate dump output.
Blocking process (session id 1209): PROCESS 49: CJQ0
The passage below says, roughly: system process 49, i.e. session id 1209, is blocking one session, sid 7, whose wait event is row cache lock, matching the analysis above. There are 1 sessions blocked by this session. Dumping one waiter: inst: 2, sid: 7, ser: 47563 wait event: 'row cache lock' p1: 'cache id'=0xb p2: 'mode'=0x0 p3: 'request'=0x5 row_wait_obj#: 520, block#: 11728, row#: 0, file# 1 min_blocked_time: 0 secs, waiter_cache_ver: 21656
Below we can see that the CJQ0 process holds an S lock, and that the object is CSH_BRANCH_D_1DAY_JOB, exactly the job we tried to stop above.
kssdmh: so 0x7000001b029ced8 (type 75, "row cache enqueues") may not have children ---------------------------------------- SO: 0x7000001b02dd868, type: 75, owner: 0x700000191e0d808, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x7000001aa6dc268, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0 row cache enqueue: count=2 session=7000001adb58460 object=7000001810e7238, mode=S savepoint=0x351e5fa5 row cache parent object: address=7000001810e7238 cid=11(dc_objects) hash=af8ef272 typ=21 transaction=0 flags=00000002 objectno=352585 ownerid=57 nsp=1 name=CSH_BRANCH_D_1DAY_JOB own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S status=VALID/-/-/-/-/-/-/-/- request=N release=FALSE flags=8 instance lock=QI af8ef272 41d157e6 set=0, complete=FALSE data= 00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000 00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Blocked process (session id 7): PROCESS 128: ---------------------------------------- Short stack dump: ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-47dc<-sskgpwwait()+32<-skgpwwait()+180 <-ksliwat()+11032<-kslwaitctx()+180<-kqrigt()+1424<-kqrLockAndPinPo()+532<-kqrpre1()+864<-kqrpre()+28 <-jskaObjGet()+1024<-jskaJobRun()+2788<-jsiRunJob()+776<-jsaRunJob()+780<-spefcmpa()+600 <-spefmccallstd()+1340<-pextproc()+168<-peftrusted()+164<-psdexsp()+276<-rpiswu2()+480 <-psdextp()+632<-pefccal()+588<-pefcal()+260<-pevm_FCAL()+144<-pfrinstr_FCAL()+96 <-pfrrun_no_tool()+96<-pfrrun()+1032<-plsql_run()+716<-peicnt()+320<-kkxexe()+508 <-opiexe()+17168<-kpoal8()+1856<-opiodr()+720<-ttcpip()+1028<-opitsk()+1508<-opiino()+940 <-opiodr()+720<-opidrv()+1132<-sou2o()+136<-opimai_real()+608<-ssthrdmain()+268 <-main()+204<-__start()+112
The process is currently waiting on row cache lock: Current Wait Stack: 0: waiting for 'row cache lock' cache id=0xb, mode=0x0, request=0x5 wait_id=13602 seq_num=13603 snap_id=1 wait times: snap=0.003162 sec, exc=0.003162 sec, total=0.003162 sec wait times: max=3.000000 sec, heur=23 min 0 sec wait counts: calls=1 os=1 in_wait=1 iflags=0x5a2
The following shows this process requesting an X lock, and the requested object is again CSH_BRANCH_D_1DAY_JOB, consistent with the analysis above: the CJQ process holds an S lock on CSH_BRANCH_D_1DAY_JOB, while the current process needs to operate on CSH_BRANCH_D_1DAY_JOB and therefore requests an X lock.
SO: 0x7000001b2c1be58, type: 75, owner: 0x70000019c72b078, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x7000001ab6cb410, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0 row cache enqueue: count=1 session=7000001ab826ce0 object=7000001810e7238, request=X savepoint=0x893a row cache parent object: address=7000001810e7238 cid=11(dc_objects) hash=af8ef272 typ=21 transaction=0 flags=00000002 objectno=352585 ownerid=57 nsp=1 name=CSH_BRANCH_D_1DAY_JOB own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S status=VALID/-/-/-/-/-/-/-/- request=N release=FALSE flags=8 instance lock=QI af8ef272 41d157e6 set=0, complete=FALSE data= 00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000 00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000
011044c6 00000000 07000001 810e7300
The dump also records the statement being executed: begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;
---------------------------------------- SO: 0x70000015d5f35c8, type: 78, owner: 0x7000001ab826ce0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x7000001ab6cb410, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8548 ID:, pg=0
LibraryObjectLock: Address=70000015d5f35c8 Handle=7000000ba27f998 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1
User=7000001ab826ce0 Session=7000001ab826ce0 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=5a52c6da LibraryHandle: Address=7000000ba27f998 Hash=a53cfcdd LockMode=N PinMode=0 LoadLockMode=0 Status=VALD ObjectName:Name=begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;
3. Problem Identification

Searching My Oracle Support for the keywords row cache lock cjq eventually led to Bug 16994952 - Unable to unschedule propagation due to CJQ self-deadlock/hang. The bug description matches the analysis above.
Bug 16994952  Unable to unschedule propagation due to CJQ self-deadlock / hang
The content was last updated on: 23-AUG-2015
Affects: Oracle Server (Rdbms). Versions confirmed as affected: 11.2.0.3. Platforms affected: generic (all / most platforms affected).
Fixed: the fix for 16994952 is first included in 12.1.0.1 (Base Release).
Symptoms: deadlock; hang (process hang); waits for "row cache lock"; the stack is likely to include jskaObjGet. Related to: Advanced Queuing; Job Queues.
Description:
Unable to unschedule propagation due to CJQ self-deadlock / hang.
Rediscovery Information:
This bug may be the cause if the CJQ0 process is stuck in a
self-deadlock waiting on "row cache lock" for a dc_objects
entry in X mode for an object that it already holds in S mode.
The stack is likely to include jskaObjGet
Workaround:
Kill the hung CJQ process
4. Problem Resolution

Apply the workaround: kill the hung CJQ process.
SQL> select spid from v$session s, v$process p where p.addr = s.paddr and s.sid = 1209;
After that, stopping the jobs through the dbms_scheduler package worked. Once the zombie jobs had been stopped one by one, the application could launch its jobs again and the system returned to normal.
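The stop step can be sketched as follows; the job name is the one from this incident, and the force option is worth noting (with force => true, stop_job terminates the job slave directly instead of asking the job to stop, and it requires the MANAGE SCHEDULER privilege):

```sql
-- Forcefully stop a stuck scheduler job (sketch; adapt the job name)
begin
  dbms_scheduler.stop_job(job_name => 'CMBD.CSH_BRANCH_D_1DAY_JOB',
                          force    => true);
end;
/
```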
5. Lessons Learned

How could this problem have been handled more quickly? Searching for something like "11g cannot stop job" is unlikely to turn up the answer, and working through the full analysis above takes considerable time. In fact, the moment the analysis showed that the blocker was the CJQ process, we should have suspected a bug and searched with precise keywords: row cache lock CJQ. For 11g and earlier releases, 99% of the problems we run into have already been documented or encountered by other users. Handling problems quickly, accurately, and safely therefore requires continuously accumulating experience and keeping a sound troubleshooting method. After resolving an incident, write up a concrete analysis report, keep the habit of logging faults, and share the case.