您的位置:首页 > 数据库

一次enq: CF - contention 导致数据库宕机的故障分析

2015-04-08 13:56 441 查看
数据库意外宕机,alert日志信息如下:

Sun Jul 27 01:02:48 2014

System State dumped to trace file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_diag_569650.trc

Sun Jul 27 01:03:48 2014

Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000

 by killing session 3289.1

Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000

 by terminating the process

LMD0: terminating instance due to error 2103

Sun Jul 27 01:03:48 2014

System state dump is made for local instance

Sun Jul 27 01:03:48 2014

Errors in file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_lms5_1048666.trc:

ORA-02103: PCC: inconsistent cursor cache (out-of-range cuc ref)

Sun Jul 27 01:03:48 2014

Trace dumping is performing id=[cdmp_20140727010348]

Sun Jul 27 01:03:49 2014

Shutting down instance (abort)

License high water mark = 1575

Sun Jul 27 01:03:57 2014

Termination issued to instance processes. Waiting for the processes to exit

Sun Jul 27 01:04:02 2014

Instance termination failed to kill one or more processes

Sun Jul 27 01:08:53 2014

Termination issued to instance processes. Waiting for the processes to exit

Sun Jul 27 01:08:59 2014

Instance termination failed to kill one or more processes

Sun Jul 27 01:12:30 2014

Shutting down instance (abort)

License high water mark = 1575

Termination issued to instance processes. Waiting for the processes to exit

Sun Jul 27 01:12:45 2014

Instance termination failed to kill one or more processes

Sun Jul 27 01:13:21 2014

Shutting down instance (abort)

License high water mark = 1575

Termination issued to instance processes. Waiting for the processes to exit

Sun Jul 27 01:13:36 2014

Instance termination failed to kill one or more processes

Sun Jul 27 01:21:06 2014

Instance terminated by USER, pid = 16228674

Sun Jul 27 01:21:52 2014

Instance terminated by USER, pid = 4215324

Sun Jul 27 01:28:44 2014

Instance terminated by USER, pid = 5947584

-----------------------------------------------------------------------------------------------------------------

分析:

首先看到,27号1点03分

oracle执行如下操作:  意思是杀死阻塞CF事件的进程号721256

Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000

 by killing session 3289.1

Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000

查看历史trace文件,查找进程721256在实例崩溃前是什么进程:

通过查看当时时间点的trace文件可知:jcgl2_ckpt_721256.trw

进程721256是oracle的检查点进程,也就是说,oracle自动杀死了ckpt进程

从而出现了如下信息:LMD0进程中止了实例

LMD0: terminating instance due to error 2103

那oracle为什么会自动杀死持有CF锁的进程呢?

为什么杀死:(参见metalink doc id:753290.1)

在oracle中:

a process was unable to obtain the CF enqueue within the specified timeout (default 900 sesconds)

如果某个进程持有CF锁,超过这个参数设置值则:

The instance terminated the blocker pid 721256 to avoid hanging up all the instances behind the CF enqueue blocker.

这个操作是受隐含参数 _kill_controlfile_enqueue_blocker 控制

若是想避免这种操作,可以将此参数设为false

当此参数设置为false的时候,不管任何情况的CF阻塞,oracle都不会再干预,

如果我们仅仅是想让oracle不干预oracle后台程序的话,即后台进程不被kill

则需设置参数:_kill_enqueue_blocker=1 

那么为什么ckpt会长时间持有CF锁:

关于ckpt持有锁CF:(doc id :406191.1)

出现进程持有CF锁时间过长问题的原因有:

1、very slow I/O subsystem where the control files are stored

2、frequent log switching -redo logs too small or insufficient logs

3、async I/O issue or multiple db_writers,you can't use both of them,back one out

4、OS / hardware issues

再次查看alert日志进行分析:

Sun Jul 27 00:44:14 2014

Thread 2 advanced to log sequence 388559 (LGWR switch)

  Current log# 19 seq# 388559 mem# 0: /gljcdata/zgls/redo19.log

Sun Jul 27 00:45:29 2014

Thread 2 advanced to log sequence 388560 (LGWR switch)

  Current log# 20 seq# 388560 mem# 0: /gljcdata/zgls/redo20.log

Sun Jul 27 00:46:07 2014

Thread 2 advanced to log sequence 388561 (LGWR switch)

  Current log# 11 seq# 388561 mem# 0: /gljcdata/zgls/redo11.log

Sun Jul 27 00:47:16 2014

Thread 2 advanced to log sequence 388562 (LGWR switch)

  Current log# 12 seq# 388562 mem# 0: /gljcdata/zgls/redo12.log

Sun Jul 27 01:02:48 2014

可以看出,在故障发生之前,redo log的切换大概在1分钟一次(切换频率过快),而且是有规律的,但是突然在零点47分之后,到1点02分之间,没有任何日志切换存在

大概时间是15分钟,也就是900秒,正好佐证了CF锁持有时间超过了900秒,导致oracle杀死该持有进程。

类似文档见(doc id 28045.1)

建议修改redo logfile大小,让系统保持日志切换大概在半小时左右一次,另外查看当时系统日志,是否存在I/O缓慢情况,是不是有异步I/O的使用。

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

另外,和CF锁相关,常会出现ora-2103,ora-600 [2103] 之类的错误,以后遇见此类错误的话,可以立马着手分析是否是CF问题
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: