Analysis of a database crash caused by enq: CF - contention
2015-04-08 13:56
The database went down unexpectedly. The alert log shows:
Sun Jul 27 01:02:48 2014
System State dumped to trace file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_diag_569650.trc
Sun Jul 27 01:03:48 2014
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
by killing session 3289.1
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
by terminating the process
LMD0: terminating instance due to error 2103
Sun Jul 27 01:03:48 2014
System state dump is made for local instance
Sun Jul 27 01:03:48 2014
Errors in file /oracle/app/oracle/product/1020/admin/jcgl/bdump/jcgl2_lms5_1048666.trc:
ORA-02103: PCC: inconsistent cursor cache (out-of-range cuc ref)
Sun Jul 27 01:03:48 2014
Trace dumping is performing id=[cdmp_20140727010348]
Sun Jul 27 01:03:49 2014
Shutting down instance (abort)
License high water mark = 1575
Sun Jul 27 01:03:57 2014
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:04:02 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:08:53 2014
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:08:59 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:12:30 2014
Shutting down instance (abort)
License high water mark = 1575
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:12:45 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:13:21 2014
Shutting down instance (abort)
License high water mark = 1575
Termination issued to instance processes. Waiting for the processes to exit
Sun Jul 27 01:13:36 2014
Instance termination failed to kill one or more processes
Sun Jul 27 01:21:06 2014
Instance terminated by USER, pid = 16228674
Sun Jul 27 01:21:52 2014
Instance terminated by USER, pid = 4215324
Sun Jul 27 01:28:44 2014
Instance terminated by USER, pid = 5947584
-----------------------------------------------------------------------------------------------------------------
Analysis:
First, note that at 01:03 on July 27 Oracle took the following action, i.e. it killed process 721256, which was blocking other sessions on the CF enqueue:
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
by killing session 3289.1
Killing enqueue blocker (pid=721256) on resource CF-00000000-00000000
Searching the historical trace files for what process 721256 was before the instance crashed, the trace file from that point in time, jcgl2_ckpt_721256.trc, shows that process 721256 was the Oracle checkpoint (CKPT) process. In other words, Oracle automatically killed its own CKPT process, which produced the following message as LMD0 terminated the instance:
LMD0: terminating instance due to error 2103
So why would Oracle automatically kill the process holding the CF enqueue?
Why the kill (see MOS Doc ID 753290.1). In Oracle:
a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds)
If a process holds the CF enqueue beyond that timeout, then:
The instance terminated the blocker pid 721256 to avoid hanging up all the instances behind the CF enqueue blocker.
This behavior is controlled by the hidden parameter _kill_controlfile_enqueue_blocker.
To prevent it, set that parameter to false; Oracle will then never intervene in CF blocking, no matter the circumstances.
If instead we only want Oracle to leave its own background processes alone, i.e. background processes are never killed, set _kill_enqueue_blocker=1.
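As a sketch only (these are hidden underscore parameters and should be changed only under Oracle Support guidance; the SCOPE/SID clauses assume a RAC spfile):

```sql
-- Sketch: adjust CF-enqueue-blocker killing behavior. Hidden parameters --
-- change them only with Oracle Support guidance.

-- Disable killing of the CF enqueue blocker entirely, on all instances:
ALTER SYSTEM SET "_kill_controlfile_enqueue_blocker" = FALSE
  SCOPE = SPFILE SID = '*';

-- Or, to protect only background processes from being killed:
ALTER SYSTEM SET "_kill_enqueue_blocker" = 1
  SCOPE = SPFILE SID = '*';

-- SCOPE=SPFILE changes take effect only after an instance restart.
```

Note that disabling the kill entirely trades an instance crash for a potential cluster-wide hang behind the CF enqueue, so it is not a free fix.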
So why did CKPT hold the CF enqueue for so long?
On CKPT holding the CF enqueue (Doc ID 406191.1), the common causes of a process holding the CF enqueue for too long are:
1. very slow I/O subsystem where the control files are stored
2. frequent log switching - redo logs too small or insufficient logs
3. async I/O issues or multiple db_writers - you can't use both of them, back one out
4. OS / hardware issues
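When diagnosing this live, a query along the following lines (a sketch, assuming access to the GV$ views on a 10g RAC database) can show which sessions currently hold or wait on the CF enqueue and for how long:

```sql
-- Sketch: CF enqueue holders and waiters across RAC instances.
-- LMODE > 0 means the lock is held; REQUEST > 0 means the session is waiting.
SELECT l.inst_id, l.sid, s.program,
       DECODE(l.lmode, 0, 'WAITING', 'HOLDING') AS state,
       l.request,
       l.ctime AS seconds_in_state
FROM   gv$lock    l
JOIN   gv$session s
       ON s.inst_id = l.inst_id AND s.sid = l.sid
WHERE  l.type = 'CF'
ORDER  BY l.ctime DESC;
```

A background process such as CKPT showing HOLDING with seconds_in_state climbing toward 900 is exactly the pattern that precedes the kill seen in this alert log.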
Looking at the alert log again:
Sun Jul 27 00:44:14 2014
Thread 2 advanced to log sequence 388559 (LGWR switch)
Current log# 19 seq# 388559 mem# 0: /gljcdata/zgls/redo19.log
Sun Jul 27 00:45:29 2014
Thread 2 advanced to log sequence 388560 (LGWR switch)
Current log# 20 seq# 388560 mem# 0: /gljcdata/zgls/redo20.log
Sun Jul 27 00:46:07 2014
Thread 2 advanced to log sequence 388561 (LGWR switch)
Current log# 11 seq# 388561 mem# 0: /gljcdata/zgls/redo11.log
Sun Jul 27 00:47:16 2014
Thread 2 advanced to log sequence 388562 (LGWR switch)
Current log# 12 seq# 388562 mem# 0: /gljcdata/zgls/redo12.log
Sun Jul 27 01:02:48 2014
As shown, before the failure the redo logs were switching roughly once a minute (far too frequently) at a regular cadence, but then between 00:47 and 01:02 there were suddenly no log switches at all.
That gap is about 15 minutes, i.e. 900 seconds, which corroborates that the CF enqueue was held for more than 900 seconds, causing Oracle to kill the holding process.
A similar document is Doc ID 28045.1.
Recommendation: resize the redo logfiles so that the system switches logs roughly once every half hour, and also check the OS logs from that time window for signs of slow I/O and for whether async I/O was in use.
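A sketch of how to check the switch rate and enlarge the logs. The group number, 2G size, and file path below are illustrative assumptions (the path merely echoes the /gljcdata/zgls location seen in the log excerpt):

```sql
-- Sketch: log switches per hour over the last day.
-- A healthy target is on the order of a couple of switches per hour.
SELECT TRUNC(first_time, 'HH24') AS hour, COUNT(*) AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TRUNC(first_time, 'HH24')
ORDER  BY hour;

-- Sketch: add larger redo log groups for thread 2, then drop the
-- small groups once they are INACTIVE (check v$log.status first).
ALTER DATABASE ADD LOGFILE THREAD 2
  GROUP 31 ('/gljcdata/zgls/redo31.log') SIZE 2G;
ALTER DATABASE DROP LOGFILE GROUP 11;
```

Dropping a group only removes it from the control file; the OS file may need to be deleted separately.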
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Additionally, errors such as ORA-2103 and ORA-600 [2103] are often associated with the CF enqueue; whenever you see this class of error, it is worth immediately checking whether a CF problem is involved.