Analysis of a database crash caused by enq: CF - contention
2015-04-08 13:56
The database went down unexpectedly. The alert log shows:
Sun Jul 27 01:02:48 2014
System State dumped to trace file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_diag_569650.trc
Sun Jul 27 01:03:48 2014
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
by killing session 3289.1
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
by terminating the process
LMD0: terminating instance due to error 2103
Sun Jul 27 01:03:48 2014
System state dump is made for local instance
Sun Jul 27 01:03:48 2014
Errors in file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_lms5_1048666.trc:
ORA-02103: PCC: inconsistent cursor cache (out-of-range cuc ref)
Sun Jul 27 01:03:48 2014
Trace dumping is performing id=[cdmp_20140727010348]
Sun Jul 27 01:03:49 2014
Shutting down instance (abort)
License high water mark = 1575
Sun Jul 27 01:03:57 2014
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:04:02 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:08:53 2014
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:08:59 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:12:30 2014
Shutting down instance (abort)
License high water mark = 1575
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:12:45 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:13:21 2014
Shutting down instance (abort)
License high water mark = 1575
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:13:36 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:21:06 2014
Instance terminated by USER, pid = 16228674
Sun Jul 27 01:21:52 2014
Instance terminated by USER, pid = 4215324
Sun Jul 27 01:28:44 2014
Instance terminated by USER, pid = 5947584
-----------------------------------------------------------------------------------------------------------------
Analysis:
First, note that at 01:03 on July 27 Oracle took the following action, i.e. it killed process 721256, which was blocking other sessions on the CF enqueue:
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
by killing session 3289.1
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
Searching the historical trace files for what process 721256 was before the instance crashed, the trace file from that point in time, jcgl2_ckpt_721256.trc, shows that process 721256 was the Oracle checkpoint (CKPT) process. In other words, Oracle automatically killed its own CKPT process, which produced the following message as LMD0 terminated the instance:
LMD0: terminating instance due to error 2103
So why would Oracle automatically kill the process holding the CF enqueue?
Why the kill (see MOS Doc ID 753290.1). In Oracle:
a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds)
If a process holds the CF enqueue beyond that timeout, then:
The instance terminated the blocker pid 721256 to avoid hanging up all the instances behind the CF enqueue blocker.
This behavior is controlled by the hidden parameter _kill_controlfile_enqueue_blocker.
To prevent it, set that parameter to false; Oracle will then never intervene in CF blocking, no matter the circumstances.
If instead we only want Oracle to leave its own background processes alone, i.e. background processes are never killed, set _kill_enqueue_blocker=1.
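As a sketch only (these are hidden underscore parameters and should be changed only under Oracle Support guidance; the SCOPE/SID clauses assume a RAC spfile):

```sql
-- Sketch: adjust CF-enqueue-blocker killing behavior. Hidden parameters --
-- change them only with Oracle Support guidance.

-- Disable killing of the CF enqueue blocker entirely, on all instances:
ALTER SYSTEM SET "_kill_controlfile_enqueue_blocker" = FALSE
  SCOPE = SPFILE SID = '*';

-- Or, to protect only background processes from being killed:
ALTER SYSTEM SET "_kill_enqueue_blocker" = 1
  SCOPE = SPFILE SID = '*';

-- SCOPE=SPFILE changes take effect only after an instance restart.
```

Note that disabling the kill entirely trades an instance crash for a potential cluster-wide hang behind the CF enqueue, so it is not a free fix.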
So why did CKPT hold the CF enqueue for so long?
On CKPT holding the CF enqueue (Doc ID 406191.1), the common causes of a process holding the CF enqueue for too long are:
1. very slow I/O subsystem where the control files are stored
2. frequent log switching - redo logs too small or insufficient logs
3. async I/O issues or multiple db_writers - you can't use both of them, back one out
4. OS / hardware issues
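When diagnosing this live, a query along the following lines (a sketch, assuming access to the GV$ views on a 10g RAC database) can show which sessions currently hold or wait on the CF enqueue and for how long:

```sql
-- Sketch: CF enqueue holders and waiters across RAC instances.
-- LMODE > 0 means the lock is held; REQUEST > 0 means the session is waiting.
SELECT l.inst_id, l.sid, s.program,
       DECODE(l.lmode, 0, 'WAITING', 'HOLDING') AS state,
       l.request,
       l.ctime AS seconds_in_state
FROM   gv$lock    l
JOIN   gv$session s
       ON s.inst_id = l.inst_id AND s.sid = l.sid
WHERE  l.type = 'CF'
ORDER  BY l.ctime DESC;
```

A background process such as CKPT showing HOLDING with seconds_in_state climbing toward 900 is exactly the pattern that precedes the kill seen in this alert log.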
Looking at the alert log again:
Sun Jul 27 00:44:14 2014
Thread 2 advanced to log sequence 388559 (LGWR switch)
Current log# 19 seq# 388559 mem# 0: /gljcdata/zgls/redo19.log
Sun Jul 27 00:45:29 2014
Thread 2 advanced to log sequence 388560 (LGWR switch)
Current log# 20 seq# 388560 mem# 0: /gljcdata/zgls/redo20.log
Sun Jul 27 00:46:07 2014
Thread 2 advanced to log sequence 388561 (LGWR switch)
Current log# 11 seq# 388561 mem# 0: /gljcdata/zgls/redo11.log
Sun Jul 27 00:47:16 2014
Thread 2 advanced to log sequence 388562 (LGWR switch)
Current log# 12 seq# 388562 mem# 0: /gljcdata/zgls/redo12.log
Sun Jul 27 01:02:48 2014
As shown, before the failure the redo logs were switching roughly once a minute (far too frequently) at a regular cadence, but then between 00:47 and 01:02 there were suddenly no log switches at all.
That gap is about 15 minutes, i.e. 900 seconds, which corroborates that the CF enqueue was held for more than 900 seconds, causing Oracle to kill the holding process.
A similar document is Doc ID 28045.1.
Recommendation: resize the redo logfiles so that the system switches logs roughly once every half hour, and also check the OS logs from that time window for signs of slow I/O and for whether async I/O was in use.
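A sketch of how to check the switch rate and enlarge the logs. The group number, 2G size, and file path below are illustrative assumptions (the path merely echoes the /gljcdata/zgls location seen in the log excerpt):

```sql
-- Sketch: log switches per hour over the last day.
-- A healthy target is on the order of a couple of switches per hour.
SELECT TRUNC(first_time, 'HH24') AS hour, COUNT(*) AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TRUNC(first_time, 'HH24')
ORDER  BY hour;

-- Sketch: add larger redo log groups for thread 2, then drop the
-- small groups once they are INACTIVE (check v$log.status first).
ALTER DATABASE ADD LOGFILE THREAD 2
  GROUP 31 ('/gljcdata/zgls/redo31.log') SIZE 2G;
ALTER DATABASE DROP LOGFILE GROUP 11;
```

Dropping a group only removes it from the control file; the OS file may need to be deleted separately.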
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Additionally, errors such as ORA-2103 and ORA-600 [2103] are often associated with the CF enqueue; whenever you see this class of error, it is worth immediately checking whether a CF problem is involved.