
Rebuilding the DB after ASM diskgroup corruption in a virtual-machine RAC

2011-11-23 16:15

I had just built a 10gR2 RAC on a PC at work with VMware Workstation 8, using raw devices plus ASM. Soon after the install succeeded, the host machine was accidentally rebooted. When the VMs came back up, VMware warned that a virtual disk was damaged; I did not pay attention to it at the time. But starting the RAC then ran into trouble. At first, the symptom was that the following resources would not start along with the rest:

ora.node1.LISTENER_NODE1.lsnr

ora.node2.LISTENER_NODE2.lsnr

ora.RAC.RAC1.inst

ora.RAC.RAC2.inst

ora.RAC.db

Here is the startup attempt in detail:

[oracle@node1 bin]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....C1.inst application    OFFLINE   OFFLINE
ora....C2.inst application    OFFLINE   OFFLINE
ora.RAC.db     application    OFFLINE   OFFLINE
ora....SM1.asm application    OFFLINE   OFFLINE
ora....E1.lsnr application    OFFLINE   OFFLINE
ora.node1.gsd  application    OFFLINE   OFFLINE
ora.node1.ons  application    OFFLINE   OFFLINE
ora.node1.vip  application    OFFLINE   OFFLINE
ora....SM2.asm application    OFFLINE   OFFLINE
ora....E2.lsnr application    OFFLINE   OFFLINE
ora.node2.gsd  application    OFFLINE   OFFLINE
ora.node2.ons  application    OFFLINE   OFFLINE
ora.node2.vip  application    OFFLINE   OFFLINE

[oracle@node1 bin]$ crs_start -all

Attempting to start `ora.node1.vip` on member `node1`

Attempting to start `ora.node2.vip` on member `node2`

Start of `ora.node1.vip` on member `node1` succeeded.

Start of `ora.node2.vip` on member `node2` succeeded.

Attempting to start `ora.node1.ASM1.asm` on member `node1`

Attempting to start `ora.node2.ASM2.asm` on member `node2`

Start of `ora.node2.ASM2.asm` on member `node2` succeeded.

Attempting to start `ora.RAC.RAC2.inst` on member `node2`

Start of `ora.RAC.RAC2.inst` on member `node2` failed.

node1 : CRS-1018: Resource ora.node2.vip (application) is already running on node2

node1 : CRS-1018: Resource ora.node2.vip (application) is already running on node2

Start of `ora.node1.ASM1.asm` on member `node1` succeeded.

Attempting to start `ora.RAC.RAC1.inst` on member `node1`

Start of `ora.RAC.RAC1.inst` on member `node1` failed.

node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1

node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1

CRS-1002: Resource 'ora.node1.ons' is already running on member 'node1'

CRS-1002: Resource 'ora.node2.ons' is already running on member 'node2'

Attempting to start `ora.node1.gsd` on member `node1`

Attempting to start `ora.RAC.db` on member `node1`

Attempting to start `ora.node2.gsd` on member `node2`

Start of `ora.node1.gsd` on member `node1` succeeded.

Start of `ora.node2.gsd` on member `node2` succeeded.

Start of `ora.RAC.db` on member `node1` failed.

Attempting to start `ora.RAC.db` on member `node2`

Start of `ora.RAC.db` on member `node2` failed.

CRS-1006: No more members to consider

CRS-0215: Could not start resource 'ora.RAC.RAC1.inst'.

CRS-0215: Could not start resource 'ora.RAC.RAC2.inst'.

CRS-0215: Could not start resource 'ora.RAC.db'.

CRS-0223: Resource 'ora.node1.LISTENER_NODE1.lsnr' has placement error.

CRS-0223: Resource 'ora.node1.ons' has placement error.

CRS-0223: Resource 'ora.node2.LISTENER_NODE2.lsnr' has placement error.

CRS-0223: Resource 'ora.node2.ons' has placement error.

[oracle@node1 bin]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....C1.inst application    ONLINE    OFFLINE
ora....C2.inst application    ONLINE    OFFLINE
ora.RAC.db     application    ONLINE    OFFLINE
ora....SM1.asm application    ONLINE    ONLINE    node1
ora....E1.lsnr application    OFFLINE   OFFLINE
ora.node1.gsd  application    ONLINE    ONLINE    node1
ora.node1.ons  application    ONLINE    ONLINE    node1
ora.node1.vip  application    ONLINE    ONLINE    node1
ora....SM2.asm application    ONLINE    ONLINE    node2
ora....E2.lsnr application    OFFLINE   OFFLINE
ora.node2.gsd  application    ONLINE    ONLINE    node2
ora.node2.ons  application    ONLINE    ONLINE    node2
ora.node2.vip  application    ONLINE    ONLINE    node2

Try bringing the listeners up first:

[oracle@node1 bin]$ crs_start ora.node1.LISTENER_NODE1.lsnr

Attempting to start `ora.node1.LISTENER_NODE1.lsnr` on member `node1`

Start of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.

[oracle@node1 bin]$ crs_start ora.node2.LISTENER_NODE2.lsnr

Attempting to start `ora.node2.LISTENER_NODE2.lsnr` on member `node2`

Start of `ora.node2.LISTENER_NODE2.lsnr` on member `node2` succeeded.

Next, start the two database instances. Here the problem shows up: the instances cannot be brought up:

[oracle@node1 bin]$ crs_start ora.RAC.RAC1.inst

Attempting to start `ora.RAC.RAC1.inst` on member `node1`

Start of `ora.RAC.RAC1.inst` on member `node1` failed.

node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1

CRS-0215: Could not start resource 'ora.RAC.RAC1.inst'.

Check the relevant logs.

First, the ASM alert log:

alert_+ASM1.log

Wed Nov 23 15:14:12 2011

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Interface type 1 eth1 192.168.91.0 configured from OCR for use as a cluster interconnect

Interface type 1 eth0 192.168.88.0 configured from OCR for use as a public interface

Picked latch-free SCN scheme 2

Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/app/product/10.2.0/db_1/dbs/arch

Autotune of undo retention is turned off.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

ksdpec: called for event 13740 prior to event group initialization

Starting up ORACLE RDBMS Version: 10.2.0.1.0.

System parameters with non-default values:

large_pool_size = 12582912

instance_type = asm

cluster_database = TRUE

instance_number = 1

remote_login_passwordfile= EXCLUSIVE

background_dump_dest = /opt/app/admin/+ASM/bdump

user_dump_dest = /opt/app/admin/+ASM/udump

core_dump_dest = /opt/app/admin/+ASM/cdump

asm_diskgroups = DATA1

Cluster communication is configured to use the following interface(s) for this instance

192.168.91.100

Wed Nov 23 15:14:13 2011

cluster interconnect IPC version:Oracle UDP/IP

IPC Vendor 1 proto 2

PMON started with pid=2, OS id=25132

DIAG started with pid=3, OS id=25134

PSP0 started with pid=4, OS id=25136

LMON started with pid=5, OS id=25138

LMD0 started with pid=6, OS id=25140

LMS0 started with pid=7, OS id=25142

MMAN started with pid=8, OS id=25152

DBW0 started with pid=9, OS id=25154

LGWR started with pid=10, OS id=25156

CKPT started with pid=11, OS id=25158

SMON started with pid=12, OS id=25160

RBAL started with pid=13, OS id=25162

GMON started with pid=14, OS id=25164

Wed Nov 23 15:14:13 2011

lmon registered with NM - instance id 1 (internal mem no 0)

Wed Nov 23 15:14:13 2011

Reconfiguration started (old inc 0, new inc 1)

ASM instance

List of nodes:

0 1

Global Resource Directory frozen

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Wed Nov 23 15:14:14 2011

LMS 0: 0 GCS shadows cancelled, 0 closed

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Post SMON to start 1st pass IR

Wed Nov 23 15:14:14 2011

LMS 0: 0 GCS shadows traversed, 0 replayed

Wed Nov 23 15:14:14 2011

Submitted all GCS remote-cache requests

Post SMON to start 1st pass IR

Fix write in gcs resources

Reconfiguration complete

LCK0 started with pid=15, OS id=25208

Wed Nov 23 15:14:15 2011

SQL> ALTER DISKGROUP ALL MOUNT

Wed Nov 23 15:14:15 2011

NOTE: cache registered group DATA1 number=1 incarn=0x6f877cd9

* allocate domain 1, invalid = TRUE

freeing rdom 1

Received dirty detach msg from node 1 for dom 1

Wed Nov 23 15:14:22 2011

Loaded ASM Library - Generic Linux, version 2.0.4 (KABI_V2) library for asmlib interface

Wed Nov 23 15:14:22 2011

ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]

Wed Nov 23 15:14:22 2011

ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]

Wed Nov 23 15:14:23 2011

NOTE: Hbeat: instance first (grp 1)

Wed Nov 23 15:14:27 2011

NOTE: start heartbeating (grp 1)

NOTE: cache opening disk 0 of grp 1: DATA1_0000 path:/dev/raw/raw3

Wed Nov 23 15:14:27 2011

NOTE: F1X0 found on disk 0 fcn 0.0

NOTE: cache opening disk 1 of grp 1: DATA1_0001 path:/dev/raw/raw4

NOTE: cache mounting (first) group 1/0x6F877CD9 (DATA1)

* allocate domain 1, invalid = TRUE

kjbdomatt send to node 1

Wed Nov 23 15:14:27 2011

NOTE: attached to recovery domain 1

Wed Nov 23 15:14:27 2011

NOTE: starting recovery of thread=1 ckpt=3.315

NOTE: starting recovery of thread=2 ckpt=3.50

WARNING: cache failed to read fn=3 indblk=0 from disk(s): 1

ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]

NOTE: a corrupted block was dumped to the trace file

System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc

NOTE: cache initiating offline of disk 1 group 1

WARNING: offlining disk 1.3914828841 (DATA1_0001) with mask 0x3

NOTE: PST update: grp = 1, dsk = 1, mode = 0x6

Wed Nov 23 15:14:27 2011

ERROR: too many offline disks in PST (grp 1)

Wed Nov 23 15:14:27 2011

NOTE: halting all I/Os to diskgroup DATA1

NOTE: active pin found: 0x0x2427ccd0

NOTE: active pin found: 0x0x2427cc64

Abort recovery for domain 1

NOTE: crash recovery signalled OER-15130

ERROR: ORA-15130 signalled during mount of diskgroup DATA1

NOTE: cache dismounting group 1/0x6F877CD9 (DATA1)

Wed Nov 23 15:14:28 2011

kjbdomdet send to node 1

detach from dom 1, sending detach message to node 1

Wed Nov 23 15:14:28 2011

Dirty detach reconfiguration started (old inc 1, new inc 1)

List of nodes:

0 1

Global Resource Directory partially frozen for dirty detach

* dirty detach - domain 1 invalid = TRUE

0 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

Wed Nov 23 15:14:28 2011

freeing rdom 1

Wed Nov 23 15:14:28 2011

WARNING: dirty detached from domain 1

Wed Nov 23 15:14:28 2011

ERROR: diskgroup DATA1 was not mounted

Wed Nov 23 15:14:28 2011

WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted

Wed Nov 23 15:14:28 2011

Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:

ORA-15001: diskgroup "DATA1" does not exist or is not mounted

[oracle@node1 bdump]$

The following two passages show that ASM hit errors while mounting the diskgroup:

.....

WARNING: cache failed to read fn=3 indblk=0 from disk(s): 1

ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]

NOTE: a corrupted block was dumped to the trace file

System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc

NOTE: cache initiating offline of disk 1 group 1

WARNING: offlining disk 1.3914828841 (DATA1_0001) with mask 0x3

....

Wed Nov 23 15:14:28 2011

WARNING: dirty detached from domain 1

Wed Nov 23 15:14:28 2011

ERROR: diskgroup DATA1 was not mounted

Wed Nov 23 15:14:28 2011

WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted

Wed Nov 23 15:14:28 2011

Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:

ORA-15001: diskgroup "DATA1" does not exist or is not mounted
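Reading the ORA-15196 arguments by the usual convention (my interpretation, not something the log states): endian_kfbh is the endianness byte of the generic ASM block header, and the trailing [0 != 1] is the value found versus the value expected. The odd-looking 2147483648 is exactly 0x80000000, a lone high bit, which suggests the block contents are outright garbage rather than a subtly damaged header:

```shell
# The fourth ORA-15196 argument rendered in hex: a single high bit set,
# typical of an overwritten or garbage-filled header field.
printf '%x\n' 2147483648   # -> 80000000
```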

Look at the trace file itself:

cat /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc | less

It contains the following error:

******************************************************

*** 2011-11-23 15:14:28.703

ksedmp: internal or fatal error

ORA-00600: internal error code, arguments: [723], [529336], [529336], [memory leak], [], [], [], []

Current SQL information unavailable - no SGA.

cat /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc| less

/opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production

With the Partitioning, Real Application Clusters, OLAP and Data Mining options

ORACLE_HOME = /opt/app/product/10.2.0/db_1

System name: Linux

Node name: node1

Release: 2.6.18-164.el5

Version: #1 SMP Tue Aug 18 15:51:54 EDT 2009

Machine: i686

Instance name: +ASM1

Redo thread mounted by this instance: 0 <none>

Oracle process number: 17

Unix process pid: 25521, image: oracle@node1 (B000)

*** SERVICE NAME:() 2011-11-23 15:14:28.679

*** SESSION ID:(33.1) 2011-11-23 15:14:28.679

ORA-15001: diskgroup "DATA1" does not exist or is not mounted

However you look at it, the diskgroup never mounted successfully. Try mounting it by hand first:

[oracle@node1 bdump]$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 - Production on Wed Nov 23 15:33:36 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL> exit

[oracle@node1 bdump]$ export ORACLE_SID=+ASM1

[oracle@node1 bdump]$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 - Production on Wed Nov 23 15:33:41 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL> conn /as sysdba

Connected.

SQL> desc v$asm_diskgroup;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 GROUP_NUMBER                                       NUMBER
 NAME                                               VARCHAR2(30)
 SECTOR_SIZE                                        NUMBER
 BLOCK_SIZE                                         NUMBER
 ALLOCATION_UNIT_SIZE                               NUMBER
 STATE                                              VARCHAR2(11)
 TYPE                                               VARCHAR2(6)
 TOTAL_MB                                           NUMBER
 FREE_MB                                            NUMBER
 REQUIRED_MIRROR_FREE_MB                            NUMBER
 USABLE_FILE_MB                                     NUMBER
 OFFLINE_DISKS                                      NUMBER
 UNBALANCED                                         VARCHAR2(1)
 COMPATIBILITY                                      VARCHAR2(60)
 DATABASE_COMPATIBILITY                             VARCHAR2(60)

SQL> set linesize 150

SQL> column name format a30;

SQL> column state format a10;

SQL> select name,state from v$asm_diskgroup;

NAME                           STATE
------------------------------ ----------
DATA1                          DISMOUNTED

Sure enough, the diskgroup is not mounted. Try mounting it by hand:

SQL> alter diskgroup data1 mount;

alter diskgroup data1 mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15130: diskgroup "DATA1" is being dismounted

ORA-15066: offlining disk "DATA1_0001" may result in a data loss

SQL>

The mount fails. Back to the alert log:

[oracle@node1 bdump]$ tail -50 alert_+ASM1.log

NOTE: F1X0 found on disk 0 fcn 0.0

NOTE: cache opening disk 1 of grp 1: DATA1_0001 path:/dev/raw/raw4

NOTE: cache mounting (first) group 1/0x26277CDE (DATA1)

* allocate domain 1, invalid = TRUE

kjbdomatt send to node 1

Wed Nov 23 15:37:49 2011

NOTE: attached to recovery domain 1

Wed Nov 23 15:37:49 2011

NOTE: starting recovery of thread=1 ckpt=3.315

NOTE: starting recovery of thread=2 ckpt=3.50

WARNING: cache failed to read fn=3 indblk=0 from disk(s): 1

ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]

NOTE: a corrupted block was dumped to the trace file

System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_21931.trc

NOTE: cache initiating offline of disk 1 group 1

WARNING: offlining disk 1.3914828843 (DATA1_0001) with mask 0x3

NOTE: PST update: grp = 1, dsk = 1, mode = 0x6

Wed Nov 23 15:37:49 2011

ERROR: too many offline disks in PST (grp 1)

Wed Nov 23 15:37:49 2011

NOTE: halting all I/Os to diskgroup DATA1

NOTE: active pin found: 0x0x2427ccd0

NOTE: active pin found: 0x0x2427cc64

Abort recovery for domain 1

NOTE: crash recovery signalled OER-15130

ERROR: ORA-15130 signalled during mount of diskgroup DATA1

NOTE: cache dismounting group 1/0x26277CDE (DATA1)

Wed Nov 23 15:37:51 2011

kjbdomdet send to node 1

detach from dom 1, sending detach message to node 1

Wed Nov 23 15:37:51 2011

Dirty detach reconfiguration started (old inc 1, new inc 1)

List of nodes:

0 1

Global Resource Directory partially frozen for dirty detach

* dirty detach - domain 1 invalid = TRUE

0 GCS resources traversed, 0 cancelled

Wed Nov 23 15:37:51 2011

freeing rdom 1

Dirty Detach Reconfiguration complete

Wed Nov 23 15:37:51 2011

WARNING: dirty detached from domain 1

Wed Nov 23 15:37:51 2011

ERROR: diskgroup DATA1 was not mounted

Wed Nov 23 15:37:52 2011

WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted

Wed Nov 23 15:37:52 2011

Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:

ORA-15001: diskgroup "DATA1" does not exist or is not mounted

ORA-15001: diskgroup "DATA1" does not exist or is not mounted

The fatal error is ORA-15196: invalid ASM block header, which means the disk has corrupt blocks.

[oracle@node1 bdump]$ oerr ora 15196

15196, 00000, "invalid ASM block header [%s:%s] [%s] [%s] [%s] [%s != %s]"

// *Cause: ASM encountered an invalid metadata block.

// *Action: Contact Oracle Support Services.

//

[oracle@node1 bdump]$ oerr ora 15001

15001, 00000, "diskgroup \"%s\" does not exist or is not mounted"

// *Cause: An operation failed because the diskgroup specified does not

// exist or is not mounted by the current ASM instance.

// *Action: Verify that the diskgroup name used is valid, that the

// diskgroup exists, and that the diskgroup is mounted by

// the current ASM instance.

//
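Before deciding the header is gone, it can also be eyeballed from the OS. Oracle ships a kfed utility that reads ASM metadata blocks directly (kfed read <device>; in 10g it may need to be linked manually first), but even a plain hex dump of the first block shows whether anything recognizable survives. A minimal sketch - the function and the scratch file are illustrative, with the scratch file standing in for /dev/raw/raw4:

```shell
# Dump the first 64 bytes of a device (or file) as hex plus printable
# characters, to check whether a header tag such as the ASMLib
# "ORCLDISK" provision string is still readable.
dump_header() {
    dd if="$1" bs=64 count=1 2>/dev/null | od -A d -t x1z
}

# Scratch file standing in for a labelled disk; a real check would
# point dump_header at /dev/raw/raw4 instead.
sample=$(mktemp /tmp/asmhdr.XXXXXX)
printf 'ORCLDISK' > "$sample"
dump_header "$sample"
```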

Out of options. Fortunately this is only a test environment, so rebuild:

First drop the DB with dbca, then recreate the diskgroup, and finally recreate the DB.

Run the following as root on both nodes; note that raw3 and raw4 are the devices the diskgroup will be created on:

dd if=/dev/zero of=/dev/raw/raw3 bs=1024 count=4

dd if=/dev/zero of=/dev/raw/raw4 bs=1024 count=4
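Those dd commands zero only the first 4 KB of each device, which should be enough to clear the ASM disk header so the disks show up as CANDIDATE again; the rest of the device is left untouched. A small demonstration against a scratch file (the file is a stand-in for a raw device; note conv=notrunc, which a regular file needs but a block device does not):

```shell
# Scratch 1 MB "disk" filled with a recognizable non-zero pattern.
disk=$(mktemp /tmp/fakedisk.XXXXXX)
dd if=/dev/urandom of="$disk" bs=1024 count=1024 2>/dev/null

# Same shape as the rebuild step above: zero only the first 4 KB.
# conv=notrunc stops dd truncating the regular file to 4 KB;
# writing to a block device would not truncate anyway.
dd if=/dev/zero of="$disk" bs=1024 count=4 conv=notrunc 2>/dev/null

# The first 4 KB now contain no non-zero bytes...
head -c 4096 "$disk" | tr -d '\0' | wc -c
# ...while the overall size is still 1 MB.
wc -c < "$disk"
```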

Then recreate the diskgroup:

SQL> column header_status format a15;
SQL> column path format a30;
SQL> select header_status,path from v$asm_disk;

HEADER_STATUS   PATH
--------------- ------------------------------
CANDIDATE       /dev/raw/raw3
CANDIDATE       /dev/raw/raw4
UNKNOWN         ORCL:VOL2
FOREIGN         /dev/raw/raw1
UNKNOWN         ORCL:VOL1
FOREIGN         /dev/raw/raw2

6 rows selected.

SQL> create diskgroup datadisk1 external redundancy disk '/dev/raw/raw3' name d1 disk '/dev/raw/raw4' name d2;

Diskgroup created.

Finally, recreate the DB with dbca.

After the rebuild, I restarted the VMs and the host several times, and the problem did not come back. OK.

-The End-