
A bug when using ASM on Solaris (ASM processes behave abnormally after the Solaris system has run for 248 days without a reboot)

2017-09-11 17:07


Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB
with ORA-29770 (Doc ID 2159643.1)





In this Document

Description
 Occurrence
 Symptoms
 Workaround
 Patches
 History
 References


APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.4 to 12.1.0.2 [Release 10.2 to 12.1]

Oracle Solaris on SPARC (64-bit)


DESCRIPTION

ASM or DB processes may start spinning in a RAC environment after the instance has been running continuously for more than 248 days. This issue only affects Solaris platforms and is due to a faulty C compiler optimization.

The same problem can also affect non-RAC/ASM sessions (in particular if SQLNET.EXPIRE_TIME is used, but this is not a requirement to hit the problem).
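The 248-day figure is consistent with a signed 32-bit counter of 10 ms clock ticks overflowing: 2^31 ticks × 10 ms ≈ 248.55 days. The sketch below only illustrates this arithmetic; the tick width and counter size are assumptions for illustration, not details confirmed by the note.

```python
# Why ~248 days? A signed 32-bit counter of 10 ms ticks overflows at 2**31.
TICK_SECONDS = 0.01   # assumed 10 ms (centisecond) timer tick
MAX_TICKS = 2 ** 31   # signed 32-bit overflow point

overflow_days = MAX_TICKS * TICK_SECONDS / 86400
print(f"Counter overflows after about {overflow_days:.2f} days")  # ~248.55
```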


OCCURRENCE

Affects only Solaris SPARC systems running Oracle RAC, or non-RAC systems using ASM.


SYMPTOMS

ASM or DB processes may start spinning in a RAC environment after the instance has been running continuously for more than 248 days. This issue only affects Solaris platforms and is due to a faulty C compiler optimization. The same problem can also affect non-RAC/ASM sessions (in particular if SQLNET.EXPIRE_TIME is used, but this is not a requirement to hit the problem).

- The RAC instance (DB or ASM) has been up for more than 248 days continuously with no shutdown.
- The spinning processes show stacks similar to: sslssalck <- sskgxp_alarm_set <- skgxp_setalarm() <- sslsstehdlr() <- __sighndlr() <- call_user_handler() <- __pollsys() <- _pollsys() ...
- Instance crashes can occur after one or more processes start to spin, due to various timeouts and blocked resources, so the symptoms of a crash caused by this issue can vary.
- On non-RAC/ASM systems, user sessions are seen to spin with a stack like the one above, especially if SQLNET.EXPIRE_TIME is set in the server-side SQLNET.ORA. (Note: EXPIRE_TIME is NOT required to encounter this bug, but if it is set the issue may be more visible.)
Example traces from 11gR2 on a two node SPARC (64-bit) cluster.

LMHB terminated one of the instances.

The following errors were reported in the Alert log:

Errors in file /oracle/product/diag/rdbms/<dbname>/<sid>/trace/<sid>_lmhb_29680.trc (incident=144129):
ORA-29770: global enqueue process LMON (OSID 29660) is hung for more than 70 seconds
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29680) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29680): terminating the instance due to error 29770
For each of the LM background processes (LMON, LMD0, LMS0, LMS1 and LCK0 in this case), the trace file shows information similar to the following:

*** 2012-06-17 02:06:31.607
==============================
LMON (ospid: 29660) has not moved for 30 sec (1339891590.1339891560)
: waiting for event 'rdbms ipc message' for 25 secs with wait_id 3381334669.
===[ Wait Chain ]===
Wait chain is empty.
*** 2012-06-17 02:06:36.617
==============================
LMD0 (ospid: 29662) has not moved for 32 sec (1339891595.1339891563)
: waiting for event 'ges remote message' for 32 secs with wait_id 2766541183.
===[ Wait Chain ]===
Wait chain is empty.

Note that in the trace output, the wait_id associated with each of the background processes does not change throughout the LMHB trace file.

Hence in this example, all LMON 'waiting for event' reports in the trace file show the same wait_id (3381334669).
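The "wait_id never changes" symptom can be checked mechanically. The sketch below parses lines shaped like the LMHB trace excerpt above and flags any background process whose wait_id never advances between reports; the line formats are assumptions taken from this excerpt, and real LMHB traces may vary.

```python
import re
from collections import defaultdict

# Assumed line formats, copied from the LMHB trace excerpt above:
#   "LMON (ospid: 29660) has not moved for 30 sec (...)"
#   ": waiting for event 'rdbms ipc message' for 25 secs with wait_id 3381334669."
PROC_RE = re.compile(r"^(\w+) \(ospid: \d+\) has not moved")
WAIT_RE = re.compile(r"wait_id (\d+)")

def stuck_processes(trace_lines):
    """Return processes that only ever report a single, unchanging wait_id."""
    wait_ids = defaultdict(set)
    current = None
    for line in trace_lines:
        m = PROC_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = WAIT_RE.search(line)
        if m and current:
            wait_ids[current].add(m.group(1))
    # One wait_id repeated across many reports suggests a spinning process.
    return [proc for proc, ids in wait_ids.items() if len(ids) == 1]

sample = [
    "LMON (ospid: 29660) has not moved for 30 sec (1339891590.1339891560)",
    ": waiting for event 'rdbms ipc message' for 25 secs with wait_id 3381334669.",
    "LMON (ospid: 29660) has not moved for 60 sec (1339891620.1339891560)",
    ": waiting for event 'rdbms ipc message' for 55 secs with wait_id 3381334669.",
]
print(stuck_processes(sample))  # ['LMON']
```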

A non-RAC ASM instance crashed after the following error messages:

Sat Dec 28 15:53:27 2013
NOTE: ASM client db0:db died unexpectedly.
NOTE: Process state recorded in trace file /opt/app/oracle/diag/asm/+asm/+ASM0/trace/+ASM0_ora_295.trc
Sat Dec 28 15:54:06 2013
Errors in file /opt/app/oracle/diag/asm/+asm/+ASM0/trace/+ASM0_pmon_28911.trc:
ORA-00490: PSP process terminated with error
PMON (ospid: 28911): terminating the instance due to error 490
Sat Dec 28 15:54:09 2013
ORA-1092 : opitsk aborting process
Sat Dec 28 15:54:09 2013
License high water mark = 6
Instance terminated by PMON, pid = 28911
USER (ospid: 3728): terminating the instance
Instance terminated by USER, pid = 3728
Sat Dec 28 15:54:15 2013
Starting ORACLE instance (normal)
The instance had been up for about 248 days (last startup 2013-04-24, crash on 2013-12-28):

Wed Apr 24 02:36:39 2013
Starting ORACLE instance (normal)
....

Sat Dec 28 15:54:06 2013
Errors in file /opt/app/oracle/diag/asm/+asm/+ASM0/trace/+ASM0_pmon_28911.trc:
ORA-00490: PSP process terminated with error
PMON (ospid: 28911): terminating the instance due to error 490
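The "about 248 days" of uptime in this example can be confirmed with simple date arithmetic on the two alert-log timestamps:

```python
from datetime import date

startup = date(2013, 4, 24)   # startup time from the alert log above
crash = date(2013, 12, 28)    # crash time from the alert log above
print((crash - startup).days)  # 248
```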

 


WORKAROUND

As a workaround, restart the instance before it reaches 248 days of uptime, or whenever you hit this problem.
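A minimal sketch of a restart-planning check, assuming you can obtain the instance startup time (for example from V$INSTANCE.STARTUP_TIME); the function and the sample startup time are hypothetical, for illustration only:

```python
from datetime import datetime

LIMIT_DAYS = 248  # uptime at which this bug can be hit

def days_until_restart(startup_time, now=None):
    """Days left before the instance reaches 248 days of uptime."""
    now = now or datetime.now()
    return LIMIT_DAYS - (now - startup_time).days

# Hypothetical startup time; in practice read it from V$INSTANCE.STARTUP_TIME.
print(days_until_restart(datetime(2017, 1, 1), now=datetime(2017, 8, 1)))  # 36
```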


PATCHES

Request and apply the patch for bug 18740837, or apply the 12.1.0.2.160719 (Jul 2016) Database Patch Set Update (DB PSU).


HISTORY

07/11/2016 - note created

08/05/2016 - first published

 


