
ASM normal redundancy and failure groups spanning SANs

Julian Dyke has started an interesting thread on the Oak Table mailing list after the latest UKOUG RAC and HA SIG. Unfortunately I couldn’t attend that event; I wish I had, as I knew it would be great.

Anyway, the question revolved around an ASM disk group created with normal redundancy spanning two storage arrays. In theory this protects against the failure of an entire array, although at a high price: all ASM disks exported from one array form a single failure group. Remember that the disks in a failure group all fail together if the supporting infrastructure (network, HBA, controller etc.) fails. So what would happen with such a setup if you followed these steps:

  • Shut down the array for failure group B
  • Stop the database
  • Shut down the second array, i.e. failure group A
  • Do some more maintenance…
  • Start the failure group B SAN
  • Start the database
  • Start the failure group A SAN

ASM can tolerate the failure of one failgroup (capacity permitting), so the failure of failure group B should not bring the disk group down, which would mean an immediate loss of service. But what happens when that failure group comes back up after the data in the surviving failure group has been modified? Will there be data corruption?

Replaying

To simulate two storage arrays, my distinguished filer01 and filer02 OpenFiler appliances have been used, each exporting two approximately 4G “LUNs” to my database host. At the time I only had access to my 11.1.0.7 two node RAC system; if time permits I’ll repeat this with 10.2.0.5 and 11.2.0.2 (the RAC cluster in the SIG presentation was 10.2). I am skipping the bit about the LUN creation and presentation to the hosts, and assume the following setup:

[root@rac11gr1node1 ~]# iscsiadm -m session
tcp: [1] 192.168.99.51:3260,1 iqn.2006-01.com.openfiler:filer02DiskB
tcp: [2] 192.168.99.50:3260,1 iqn.2006-01.com.openfiler:filer01DiskA
tcp: [3] 192.168.99.51:3260,1 iqn.2006-01.com.openfiler:filer02DiskA
tcp: [4] 192.168.99.50:3260,1 iqn.2006-01.com.openfiler:filer01DiskB

192.168.99.50 is my first OpenFiler instance, 192.168.99.51 the second. As you can see, each exports DISKA and DISKB. Mapped to the hosts, this is the target mapping (use iscsiadm --mode session --print 3 to find out):

  • filer02DiskB: /dev/sda
  • filer01DiskA: /dev/sdb
  • filer02DiskA: /dev/sdc
  • filer01DiskB: /dev/sdd

I am using ASMLib (as always in the lab) to label these disks:

[root@rac11gr1node1 ~]# oracleasm listdisks
DATA1
DATA2
FILER01DISKA
FILER01DISKB
FILER02DISKA
FILER02DISKB
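
For completeness, the labels would have been created along these lines on one node, followed by a scandisks on the other. This is a sketch only; the partition names are assumptions derived from the device mapping above:

[root@rac11gr1node1 ~]# oracleasm createdisk FILER01DISKA /dev/sdb1
[root@rac11gr1node1 ~]# oracleasm createdisk FILER01DISKB /dev/sdd1
[root@rac11gr1node1 ~]# oracleasm createdisk FILER02DISKA /dev/sdc1
[root@rac11gr1node1 ~]# oracleasm createdisk FILER02DISKB /dev/sda1
[root@rac11gr1node2 ~]# oracleasm scandisks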

DATA1 and DATA2 will not play a role in this article; I’m interested in the other disks. Assuming the scandisks command completed on all nodes, I can add the disks to the new disk group:

SQL> select path from v$asm_disk

PATH
--------------------------------------------------------------------------------
ORCL:FILER01DISKA
ORCL:FILER01DISKB
ORCL:FILER02DISKA
ORCL:FILER02DISKB
ORCL:DATA1
ORCL:DATA2

Let’s create the disk group. The important part is to create one failure group per storage array. By the way, this is no different from extended distance RAC!

SQL> create diskgroup fgtest normal redundancy
 2  failgroup filer01 disk 'ORCL:FILER01DISKA', 'ORCL:FILER01DISKB'
 3  failgroup filer02 disk 'ORCL:FILER02DISKA', 'ORCL:FILER02DISKB'
 4  attribute 'compatible.asm'='11.1';

Diskgroup created.

With that done, let’s have a look at the ASM disk information:

SQL> select MOUNT_STATUS,HEADER_STATUS,STATE,REDUNDANCY,FAILGROUP,PATH from v$asm_disk where group_number=2;

MOUNT_S HEADER_STATU STATE    REDUNDA FAILGROUP                      PATH
------- ------------ -------- ------- ------------------------------ --------------------
CACHED  MEMBER       NORMAL   UNKNOWN FILER01                        ORCL:FILER01DISKA
CACHED  MEMBER       NORMAL   UNKNOWN FILER01                        ORCL:FILER01DISKB
CACHED  MEMBER       NORMAL   UNKNOWN FILER02                        ORCL:FILER02DISKA
CACHED  MEMBER       NORMAL   UNKNOWN FILER02                        ORCL:FILER02DISKB

I have set the disk repair time to 24 hours and raised compatible parameters for RDBMS and ASM to 11.1, resulting in these attributes:

SQL> select * from v$asm_attribute

NAME                           VALUE                GROUP_NUMBER ATTRIBUTE_INDEX ATTRIBUTE_INCARNATION READ_ON SYSTEM_
------------------------------ -------------------- ------------ --------------- --------------------- ------- -------
disk_repair_time               3.6h                            2               0                     1 N       Y
au_size                        1048576                         2               8                     1 Y       Y
compatible.asm                 11.1.0.0.0                      2              20                     1 N       Y
compatible.rdbms               11.1.0.0.0                      2              21                     1 N       Y
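
For reference, these values are set with ALTER DISKGROUP ... SET ATTRIBUTE; a minimal sketch (the 86400 second drop timer that appears in the alert log later on corresponds to the 24 hour disk repair time):

SQL> alter diskgroup fgtest set attribute 'disk_repair_time' = '24h';
SQL> alter diskgroup fgtest set attribute 'compatible.rdbms' = '11.1';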

Unlike 11.2, where disk groups are managed as Clusterware resources, 11.1 requires you to mount them manually on each node or to append the new disk group to ASM_DISKGROUPS. You should query gv$asm_diskgroup.state to ensure the new disk group is mounted on all cluster nodes.
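
On 11.1 that means something along these lines on each node that should mount the new group; a sketch only, where 'DATA' stands in for whatever disk groups are already listed in the parameter:

SQL> alter diskgroup fgtest mount;
SQL> alter system set asm_diskgroups='DATA','FGTEST' scope=spfile sid='*';
SQL> select inst_id, name, state from gv$asm_diskgroup where name = 'FGTEST';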

I need some data! A small demo database can be restored to the new disk group to provide some experimental playground. This is quite easily done using an RMAN duplicate with the correct {db|log}_file_name_convert parameters set.
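
The relevant settings for such a duplicate could look like this. This is a sketch only, assuming a source database reachable as 'source', files currently in +DATA, the clone going to +FGTEST, and 'demo' as a made-up clone name:

# auxiliary instance parameters for the clone
*.db_file_name_convert='+DATA','+FGTEST'
*.log_file_name_convert='+DATA','+FGTEST'
*.db_create_file_dest='+FGTEST'

# then, with the auxiliary instance started nomount:
$ rman target sys@source auxiliary /
RMAN> duplicate target database to demo;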

Mirror

The disk group is created with normal redundancy, which means that ASM will create a mirror copy of every primary extent, taking failure groups into consideration. I wanted to ensure that the data is actually mirrored on the new disk group, which has group number 2. I need to get this information from V$ASM_FILE and V$ASM_ALIAS:

SQL> select * from v$asm_file where group_number = 2

GROUP_NUMBER FILE_NUMBER COMPOUND_INDEX INCARNATION BLOCK_SIZE     BLOCKS      BYTES      SPACE TYPE                 REDUND STRIPE CREATION_ MODIFICAT R
------------ ----------- -------------- ----------- ---------- ---------- ---------- ---------- -------------------- ------ ------ --------- --------- -
 2         256       33554688   747669775      16384       1129   18497536   78643200 CONTROLFILE          HIGH   FINE   05-APR-11 05-APR-11 U
 2         257       33554689   747669829       8192      69769  571547648 1148190720 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         258       33554690   747669829       8192      60161  492838912  990904320 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         259       33554691   747669829       8192      44801  367009792  739246080 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         260       33554692   747669831       8192      25601  209723392  424673280 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         261       33554693   747669831       8192        641    5251072   12582912 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         262       33554694   747670409        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         263       33554695   747670409        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         264       33554696   747670417        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         265       33554697   747670417        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         266       33554698   747670419       8192       2561   20979712   44040192 TEMPFILE             MIRROR COARSE 05-APR-11 05-APR-11 U

11 rows selected.

SQL> select * from v$asm_alias where group_NUMBER=2

NAME                           GROUP_NUMBER FILE_NUMBER FILE_INCARNATION ALIAS_INDEX ALIAS_INCARNATION PARENT_INDEX REFERENCE_INDEX A S
------------------------------ ------------ ----------- ---------------- ----------- ----------------- ------------ --------------- - -
RAC11G                                    2  4294967295       4294967295           0                 3     33554432        33554485 Y Y
CONTROLFILE                               2  4294967295       4294967295          53                 3     33554485        33554538 Y Y
current.256.747669775                     2         256        747669775         106                 3     33554538        50331647 N Y
DATAFILE                                  2  4294967295       4294967295          54                 1     33554485        33554591 Y Y
SYSAUX.257.747669829                      2         257        747669829         159                 1     33554591        50331647 N Y
SYSTEM.258.747669829                      2         258        747669829         160                 1     33554591        50331647 N Y
UNDOTBS1.259.747669829                    2         259        747669829         161                 1     33554591        50331647 N Y
UNDOTBS2.260.747669831                    2         260        747669831         162                 1     33554591        50331647 N Y
USERS.261.747669831                       2         261        747669831         163                 1     33554591        50331647 N Y
ONLINELOG                                 2  4294967295       4294967295          55                 1     33554485        33554644 Y Y
group_1.262.747670409                     2         262        747670409         212                 1     33554644        50331647 N Y
group_2.263.747670409                     2         263        747670409         213                 1     33554644        50331647 N Y
group_3.264.747670417                     2         264        747670417         214                 1     33554644        50331647 N Y
group_4.265.747670417                     2         265        747670417         215                 1     33554644        50331647 N Y
TEMPFILE                                  2  4294967295       4294967295          56                 1     33554485        33554697 Y Y
TEMP.266.747670419                        2         266        747670419         265                 1     33554697        50331647 N Y

The USERS tablespace, which I am most interested in, has file number 261. I chose it for this example as it’s only 5 MB in size; taking the 1 MB allocation unit into account, this means I don’t have to trawl through thousands of lines of output when getting the extent map.

Credit where credit is due: the next queries are partly based on the excellent work by Luca Canali from CERN, who has looked at ASM internals for a while. Make sure you have a look at the reference available here: https://twiki.cern.ch/twiki/bin/view/PDBService/ASM_Internals. To answer the question of where the extents making up my USERS tablespace are stored, we need to have a look at X$KFFXP, the file extent pointer view:

SQL> select GROUP_KFFXP,DISK_KFFXP,AU_KFFXP from x$kffxp where number_kffxp=261 and group_kffxp=2 order by disk_kffxp;

GROUP_KFFXP DISK_KFFXP   AU_KFFXP
----------- ---------- ----------
 2          0        864
 2          0        865
 2          0        866
 2          1        832
 2          1        831
 2          1        833
 2          2        864
 2          2        866
 2          2        865
 2          3        832
 2          3        833
 2          3        831

12 rows selected.

As you can see, I have a number of extents, all evenly spread over my disks. I can verify that this information is correct by querying the X$KFDAT view as well, which contains similar information but is more related to the disk to AU mapping:

SQL> select GROUP_KFDAT,NUMBER_KFDAT,AUNUM_KFDAT from x$kfdat where fnum_kfdat = 261 and group_kfdat=2

GROUP_KFDAT NUMBER_KFDAT AUNUM_KFDAT
----------- ------------ -----------
 2            0         864
 2            0         865
 2            0         866
 2            1         831
 2            1         832
 2            1         833
 2            2         864
 2            2         865
 2            2         866
 2            3         831
 2            3         832
 2            3         833

12 rows selected.
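
To see explicitly that each extent has a copy in both failure groups, one could join X$KFFXP to V$ASM_DISK; a sketch based on the views used above:

SQL> select d.failgroup, x.disk_kffxp, count(*) au_count
  2    from x$kffxp x, v$asm_disk d
  3   where x.group_kffxp = d.group_number
  4     and x.disk_kffxp  = d.disk_number
  5     and x.number_kffxp = 261
  6     and x.group_kffxp  = 2
  7   group by d.failgroup, x.disk_kffxp
  8   order by 1, 2;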

OK, so I am confident that my data is actually mirrored; otherwise the following test would not make any sense. I have double checked that the disks in failgroup FILER01 actually belong to my OpenFiler “filer01”, and the same for filer02. Going back to the original scenario:

Shut down Filer02

This will take down all the disks of failgroup FILER02 (failure group B in the scenario above). Two minutes after taking the filer down I checked whether it was indeed shut down:

martin@dom0:~> sudo xm list | grep filer
filer01                                    183   512     1     -b----   1159.6
filer02                                          512     1              1179.6
filer03                                    185   512     1     -b----   1044.4

Yes, no doubt about it: it’s down. What would the effect be? Surely I/O errors, but I wanted to force a check. Connected to +ASM2 I issued a “select * from v$asm_disk”. This caused quite significant logging in the instance’s alert.log:

NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
Wed Apr 06 17:17:39 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61330 bufp:0x2b29977b3e00 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61220 bufp:0x2b29977b0200 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:17:58 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61440 bufp:0x2b29977a6400 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:18:03 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (86400) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (86400) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.
Wed Apr 06 17:19:26 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61440 bufp:0x2b29977b0200 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:20:10 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61000 bufp:0x2b29977a6400 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:21:07 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (86217) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (86217) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.
Wed Apr 06 17:27:15 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (85849) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (85849) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.

So taking down the failgroup didn’t cause an outage. The other ASM instance complained as well a little later; the interesting line in its log is “all mirror sides found readable, no repair required”:

2011-04-06 17:16:58.393000 +01:00
NOTE: initiating PST update: grp = 2, dsk = 2, mode = 0x15
NOTE: initiating PST update: grp = 2, dsk = 3, mode = 0x15
kfdp_updateDsk(): 24
kfdp_updateDskBg(): 24
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:2, diskname:ORCL:FILER02DISKA disk:0x2.0x97a6d9f7 au:1
 iop:0x2b9ea4855e70 bufp:0x2b9ea4850a00 offset(bytes):1052672 iosz:4096 operation:2(Write) synchronous:1
 result: 4 osderr:0x3 osderr1:0x2e pid:870
NOTE: group FGTEST: updated PST location: disk 0000 (PST copy 0)
2011-04-06 17:17:03.508000 +01:00
NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
NOTE: PST update grp = 2 completed successfully
NOTE: initiating PST update: grp = 2, dsk = 2, mode = 0x1
NOTE: initiating PST update: grp = 2, dsk = 3, mode = 0x1
kfdp_updateDsk(): 25
kfdp_updateDskBg(): 25
2011-04-06 17:17:07.454000 +01:00
NOTE: group FGTEST: updated PST location: disk 0000 (PST copy 0)
NOTE: PST update grp = 2 completed successfully
NOTE: cache closing disk 2 of grp 2: FILER02DISKA
NOTE: cache closing disk 3 of grp 2: FILER02DISKB
SUCCESS: extent 0 of file 267 group 2 repaired by offlining the disk
NOTE: repairing group 2 file 267 extent 0
SUCCESS: extent 0 of file 267 group 2 repaired - all mirror sides found readable, no repair required
2011-04-06 17:19:04.526000 +01:00
GMON SlaveB: Deferred DG Ops completed.
2011-04-06 17:22:07.487000 +01:00
GMON SlaveB: Deferred DG Ops completed.

No interruption of service though, which is good: the GV$ASM_CLIENT view reported all database instances still connected.

SQL> select * from gv$asm_client;

 INST_ID GROUP_NUMBER INSTANCE_NAME                                                    DB_NAME  STATUS       SOFTWARE_VERSIO COMPATIBLE_VERS
---------- ------------ ---------------------------------------------------------------- -------- ------------ --------------- ---------------
 2            2 rac11g2                                                          rac11g   CONNECTED    11.1.0.7.0      11.1.0.0.0
 1            2 rac11g1                                                          rac11g   CONNECTED    11.1.0.7.0      11.1.0.0.0

The result in the V$ASM_DISK view was as follows:

SQL> select name,state,header_status,path,failgroup from v$asm_disk;

NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 NORMAL   UNKNOWN      ORCL:FILER02DISKA
 NORMAL   UNKNOWN      ORCL:FILER02DISKB
DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2
FILER01DISKA                   NORMAL   MEMBER       ORCL:FILER01DISKA                                  FILER01
FILER01DISKB                   NORMAL   MEMBER       ORCL:FILER01DISKB                                  FILER01
FILER02DISKA                   NORMAL   UNKNOWN                                                         FILER02
FILER02DISKB                   NORMAL   UNKNOWN                                                         FILER02

8 rows selected.

As I expected, the disks for failgroup FILER02 are gone, and so is the information about the failure group. My disk repair time should be high enough to protect me from having to rebuild the whole disk group. Now I’m really curious whether my database can become corrupted, so I’ll increase the SCN.

[oracle@rac11gr1node1 ~]$ . setsid.sh rac11g
[oracle@rac11gr1node1 ~]$ sq

SQL*Plus: Release 11.1.0.7.0 - Production on Wed Apr 6 17:24:18 2011

Copyright (c) 1982, 2008, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 5999304

SQL> begin
 2   for i in 1..5 loop
 3    execute immediate 'alter system switch logfile';
 4   end loop;
 5  end;
 6  /

PL/SQL procedure successfully completed.

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 5999378

SQL>

Back to the test case.

Stop the Database

[oracle@rac11gr1node1 ~]$ srvctl stop database -d rac11g
[oracle@rac11gr1node1 ~]$ srvctl status database -d rac11g
Instance rac11g2 is not running on node rac11gr1node1
Instance rac11g1 is not running on node rac11gr1node2

Done, this part was simple. Next they stopped their first filer. To prevent bad things from happening I’ll shut down ASM on all nodes as well. I hope that doesn’t invalidate the test, but I can’t see how ASM would not run into problems if the other failgroup went down too while it was up.
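
Stopping ASM in 11.1 is done per node with srvctl; a sketch using the node names from this cluster:

[oracle@rac11gr1node1 ~]$ srvctl stop asm -n rac11gr1node1
[oracle@rac11gr1node1 ~]$ srvctl stop asm -n rac11gr1node2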

Shut down Filer01 and start Filer02

Also quite simple. Shutting down this filer allows me to follow the story. After filer01 was down I started filer02, curious as to how ASM would react. I have deliberately NOT put disk group FGTEST into ASM_DISKGROUPS; I want to mount it manually to get a better understanding of what happens.

After having started ASM on both nodes, I queried V$ASM_DISK and tried to mount the disk group:

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0                                NORMAL   MEMBER       ORCL:FILER02DISKA
 1                                NORMAL   MEMBER       ORCL:FILER02DISKB
 2                                NORMAL   UNKNOWN      ORCL:FILER01DISKA
 3                                NORMAL   UNKNOWN      ORCL:FILER01DISKB
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2

6 rows selected.

Oops, now they are both gone…

SQL> alter diskgroup fgtest mount;
alter diskgroup fgtest mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing
ORA-15042: ASM disk "0" is missing
ORA-15080: synchronous I/O operation to a disk failed
ORA-15080: synchronous I/O operation to a disk failed

OK, I have a problem here. Both ASM instances report I/O errors against the FGTEST disk group, and I can’t mount it. That means I can’t open the database either; in a way that proves I won’t get corruption. But neither will I have a database. Which is worse?

Can I get around this problem?

I think I’ll have to start filer01 and see if that makes a difference. Hopefully I can recover my system with the information in failgroup FILER01. Soon after filer01 came online I tried the query against v$asm_disk again and attempted the mount.

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0                                NORMAL   MEMBER       ORCL:FILER02DISKA
 1                                NORMAL   MEMBER       ORCL:FILER02DISKB
 2                                NORMAL   MEMBER       ORCL:FILER01DISKA
 3                                NORMAL   MEMBER       ORCL:FILER01DISKB
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2

6 rows selected.

That worked!

Wed Apr 06 17:45:32 2011
SQL> alter diskgroup fgtest mount
NOTE: cache registered group FGTEST number=2 incarn=0x72c150d7
NOTE: cache began mount (first) of group FGTEST number=2 incarn=0x72c150d7
NOTE: Assigning number (2,0) to disk (ORCL:FILER01DISKA)
NOTE: Assigning number (2,1) to disk (ORCL:FILER01DISKB)
NOTE: Assigning number (2,2) to disk (ORCL:FILER02DISKA)
NOTE: Assigning number (2,3) to disk (ORCL:FILER02DISKB)
Wed Apr 06 17:45:33 2011
NOTE: start heartbeating (grp 2)
kfdp_query(): 12
kfdp_queryBg(): 12
NOTE: cache opening disk 0 of grp 2: FILER01DISKA label:FILER01DISKA
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: FILER01DISKB label:FILER01DISKB
NOTE: cache opening disk 2 of grp 2: FILER02DISKA label:FILER02DISKA
NOTE: F1X0 found on disk 2 fcn 0.0
NOTE: cache opening disk 3 of grp 2: FILER02DISKB label:FILER02DISKB
NOTE: cache mounting (first) group 2/0x72C150D7 (FGTEST)
Wed Apr 06 17:45:33 2011
* allocate domain 2, invalid = TRUE
kjbdomatt send to node 0
Wed Apr 06 17:45:33 2011
NOTE: attached to recovery domain 2
NOTE: cache recovered group 2 to fcn 0.7252
Wed Apr 06 17:45:33 2011
NOTE: LGWR attempting to mount thread 1 for diskgroup 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.7252 ABA
NOTE: seq=3 blk=337
NOTE: cache mounting group 2/0x72C150D7 (FGTEST) succeeded
NOTE: cache ending mount (success) of group FGTEST number=2 incarn=0x72c150d7
Wed Apr 06 17:45:33 2011
kfdp_query(): 13
kfdp_queryBg(): 13
NOTE: Instance updated compatible.asm to 11.1.0.0.0 for grp 2
SUCCESS: diskgroup FGTEST was mounted
SUCCESS: alter diskgroup fgtest mount

The V$ASM_DISK view is nicely updated and everything seems to be green:

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2
 0 FILER01DISKA                   NORMAL   MEMBER       ORCL:FILER01DISKA                                  FILER01
 1 FILER01DISKB                   NORMAL   MEMBER       ORCL:FILER01DISKB                                  FILER01
 2 FILER02DISKA                   NORMAL   MEMBER       ORCL:FILER02DISKA                                  FILER02
 3 FILER02DISKB                   NORMAL   MEMBER       ORCL:FILER02DISKB                                  FILER02

6 rows selected.

Brilliant. But will it have an effect on the database?

Starting the Database

Even though things looked ok, they weren’t! I didn’t expect this to happen:

[oracle@rac11gr1node1 ~]$ srvctl start database -d rac11g
PRKP-1001 : Error starting instance rac11g2 on node rac11gr1node1
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL*Plus: Release 11.1.0.7.0 - Production on Wed Apr 6 17:48:58 2011
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Copyright (c) 1982, 2008, Oracle.  All rights reserved.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Enter user-name: Connected to an idle instance.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL> ORACLE instance started.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Total System Global Area 1720328192 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Fixed Size                    2160392 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Variable Size              1291847928 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Database Buffers    419430400 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Redo Buffers                  6889472 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [9572],
rac11gr1node1:ora.rac11g.rac11g2.inst:[9533], [0x000000000], [], [], [], [], [], [], [], []
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
rac11gr1node1:ora.rac11g.rac11g2.inst:With the Partitioning, Real Application Clusters, OLAP, Data Mining
rac11gr1node1:ora.rac11g.rac11g2.inst:and Real Application Testing options
rac11gr1node1:ora.rac11g.rac11g2.inst:
CRS-0215: Could not start resource 'ora.rac11g.rac11g2.inst'.
PRKP-1001 : Error starting instance rac11g1 on node rac11gr1node2
CRS-0215: Could not start resource 'ora.rac11g.rac11g1.inst'.

Oops. A quick search on Metalink revealed note Ora-00600: [Kccpb_sanity_check_2], [3621501],[3621462] On Startup (Doc ID 435436.1). The explanation for the ORA-600 is that “the seq# of the last read block is higher than the seq# of the control file header block.” Oracle Support explains it with a lost write, but here the situation is quite different. Interesting! I have to leave that for another blog post.

Summer Seminars

I am doing a couple of one day seminars with Oracle University, currently planned for Austria and Switzerland. They go by the title “Grid Infrastructure and Database High Availability Deep Dive” and can be found on the Oracle University website.

To save you from having to get the abstract, I copied it from the Oracle University website:

Providing a highly available database architecture fit for today’s fast changing requirements can be a complex task. Many technologies are available to provide resilience, each with its own advantages and possible disadvantages. This seminar begins with an overview of available HA technologies (hard and soft partitioning of servers, cold failover clusters, RAC and RAC One Node) and complementary tools and techniques to provide recovery from site failure (Data Guard or storage replication).

In the second part of the seminar, we look at Grid Infrastructure in great detail. Oracle Grid Infrastructure is the latest incarnation of the Clusterware HA framework which successfully powers every single 10g and 11g RAC installation. Despite its widespread implementation, many of its features are still not well understood by its users. We focus on Grid Infrastructure, what it is, what it does and how it can be put to best use, including the creation of an active/passive cold failover cluster for web and database resources. Special focus will be placed on the various storage options (Cluster File System, ASM, etc), the cluster interconnect and other implementation choices and on troubleshooting Grid Infrastructure. In the final part of the seminar, we explore Real Application Clusters and its various uses, from HA to scalability to consolidation. We discuss patching and workload management, coding for RAC and other techniques that will allow users to maximise the full potential of the package.

See you there if you are interested!

Copying the password file for RAC databases

This post is inspired by a recent thread on the oracle-l mailing list. In the post “11g RAC orapw file issue- RAC nodes not updated” the fact that the password file is local to the instance was brought up. In fact, all users granted the SYSOPER or SYSDBA privilege are stored in the password file, and changing the password for the SYS user on one instance doesn’t mean the change is reflected on the other RAC instances. Furthermore, your Data Guard configuration will break as well, since the SYS account is used to log in to the standby database.

On a related note, a change of the SYS password for the ASM instance in the GRID_HOME will propagate to all cluster nodes automatically, a fact I first saw mentioned on the weblog of the Dutch Prutser, Harald van Breederode.

Now, to get over the annoyance of having to manually copy the new password file to all cluster nodes, I have written a small shell script which I use for all my Linux clusters. It takes the ORACLE_SID of the local instance as input, then works out the corresponding ORACLE_HOME and copies the password file to all instances in the cluster, as listed in the output of olsnodes. The script can deal with separation of duties, i.e. systems where the GRID_HOME is owned by a different owner than the RDBMS ORACLE_HOME. The script is by no means perfect and could be extended to deal with a more general setup. My assumption is that there is a 1:1 mapping of instance number to cluster node, for example instance PROD1 will be hosted on the first cluster node, prodnode1.

The script is shown below, it’s been written and tested on Linux:

#!/bin/bash

# A small and simple script to copy a password file
# to all nodes of a cluster
# This works for me, it doesn't necessarily work for you,
# and the script is provided "as is"-I will not take
# responsibility for its operation and it comes with no
# warranty of any sort
#
# Martin Bach 2011
#
# You are free to use the script as you feel fit, but please
# retain the reference to the author.
#
# Usage: requires the local ORACLE_SID as a parameter.
# requires the ORACLE_SID or DBNAME to be in oratab

ORACLE_SID=$1
[[ $ORACLE_SID == "" ]] && {
 echo usage `basename $0` ORACLE_SID
 exit 1
}

#### TUNEABLES

# change to /var/opt/oracle/oratab for Solaris
ORATAB=/etc/oratab
GRID_HOME=/u01/crs/11.2.0.2

#### this section doesn't normally have to be changed

DBNAME=${ORACLE_SID%*[0-9]}
ORACLE_HOME=`grep $DBNAME $ORATAB | awk -F":" '{print $2}'`
[[ $ORACLE_HOME == "" ]] && {
 echo cannot find ORACLE_HOME for database $DBNAME in $ORATAB
 exit 2
}

cd $ORACLE_HOME/dbs
cp -v orapw$ORACLE_SID /tmp
INST=1

echo starting copy of passwordfile
for NODE in `$GRID_HOME/bin/olsnodes`; do
 echo copying orapw$ORACLE_SID to $NODE as orapw${DBNAME}${INST}
 scp orapw$ORACLE_SID $NODE:${ORACLE_HOME}/dbs/orapw${DBNAME}${INST}
 INST=$(( $INST + 1))
done
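
A typical run from the node hosting instance PROD1 would look like this (the script name is made up and the output abridged):

[oracle@prodnode1 ~]$ ./copy_orapw.sh PROD1
`orapwPROD1' -> `/tmp/orapwPROD1'
starting copy of passwordfile
copying orapwPROD1 to prodnode1 as orapwPROD1
copying orapwPROD1 to prodnode2 as orapwPROD2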

It’s fairly straightforward: we first get the ORACLE_SID and use it to look up the ORACLE_HOME for the database. The GRID_HOME has to be hard coded to keep the script compatible with pre-11.2 databases, where you could have a CRS_HOME different from the ASM_HOME. For Oracle < 11.2 you need to set the GRID_HOME variable to your Clusterware home.

The DBNAME is the $ORACLE_SID without the trailing number, which I need in order to work out the SIDs on the other cluster nodes. Before the password file is copied from the local node to all cluster nodes, a copy is taken to /tmp, just in case.

The main logic is in the loop provided by the output of olsnodes, and the local password file is copied across all cluster nodes.

Feel free to use it at your own risk, and modify/distribute as needed. This works well for me, especially across an 8 node cluster.

Troubleshooting Grid Infrastructure startup

This has been an interesting story today when one of my blades decided to reboot after an EXT3 journal error. The hard facts first:

  • Oracle Linux 5.5 with kernel 2.6.18-194.11.4.0.1.el5
  • Oracle 11.2.0.2 RAC
  • Bonded NICs for private and public networks
  • BL685-G6 with 128G RAM

First I noticed the node had problems when I tried to list all databases configured on the cluster. I got the dreaded “cannot communicate with the CRSD”:

[oracle@node1.example.com] $ srvctl config database
PRCR-1119 : Failed to look up CRS resources of database type
PRCR-1115 : Failed to find entities of type resource that match filters (TYPE ==ora.database.type) and contain attributes DB_UNIQUE_NAME,ORACLE_HOME,VERSION
Cannot communicate with crsd

Not too great, especially since everything worked when I left yesterday. What could have gone wrong? An obvious reason would be a reboot, and fair enough, there has been one:

[grid@node1.example.com] $ uptime
09:09:22 up  2:40,  1 user,  load average: 1.47, 1.46, 1.42

The next step was to check if the local CRS stack was up, or better, to check what was down. Sometimes it’s only crsd which has a problem. In my case everything was down:

[grid@node1.example.com] $ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[grid@node1.example.com] $ crsctl check cluster -all
**************************************************************
node1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
node2,node3, node4, node5, node6, node7, node8

The CRS-4404 was slightly misleading: I assumed all cluster nodes were down after a clusterwide reboot, since sometimes a single node reboot triggers worse things. However, logging on to node 2 I saw that all but the first node were OK.

CRSD really needs CSSD to be up and running, and CSSD requires the OCR to be there. I wanted to know if the OCR was impacted in any way:

[grid@node1.example.com] $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
ORA-29701: unable to connect to Cluster Synchronization Service

Well, it seemed that the OCR location was unavailable. I know that on this cluster the OCR is stored in ASM. Common reasons for the PROC-26 error are:

  • Unix admin upgrades the kernel but forgets to upgrade the ASMLib kernel module (common grief with ASMLib!)
  • Storage is not visible on the host, i.e. SAN connectivity broken/taken away (happens quite frequently with storage/sys admin unaware of ASM)
  • Permissions not set correctly on the block devices (not an issue when using asmlib)

I checked ASMLib and it reported a working status:

[oracle@node1.example.com] $ /etc/init.d/oracleasm status
Checking if ASM is loaded: yes
Checking if /dev/oracleasm is mounted: yes

That was promising: /dev/oracleasm/ was populated and the matching kernel modules were loaded. /etc/init.d/oracleasm listdisks listed all my disks as well. Physical storage not being accessible (PROC-26) seemed a bit unlikely now.

I could rule out permission problems since ASMLib was working fine, and I also ruled out the kernel upgrade/missing module problem by comparing the ASMLib RPM with the kernel version: they matched. So maybe it’s storage related?
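
One quick extra check would have been to read a block from one of the ASMLib devices directly; DISK1 below is a placeholder for any name returned by listdisks:

[root@node1 ~]# dd if=/dev/oracleasm/disks/DISK1 of=/dev/null bs=1M count=1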

Why did the node go down?

Good question, and usually one to ask the unix administration team. Luckily I have a good contact placed right inside that team and could get the following excerpt from /var/log/messages around the time of the crash (06:31 this morning):

Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192116, count = 1
Mar 17 06:26:06 node1 kernel: Aborting journal on device dm-2.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192216, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192166, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192122, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 55 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192140, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 56 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks in system zones - Block = 8192174, count = 1
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_free_blocks_sb: Journal has aborted
Mar 17 06:26:06 node1 last message repeated 10 times
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_truncate: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_orphan_del: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_reserve_inode_write: Journal has aborted
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2) in ext3_delete_inode: Journal has aborted
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 kernel: ext3_abort called.
Mar 17 06:26:06 node1 kernel: EXT3-fs error (device dm-2): ext3_journal_start_sb: Detected aborted journal
Mar 17 06:26:06 node1 kernel: Remounting filesystem read-only
Mar 17 06:26:06 node1 kernel: __journal_remove_journal_head: freeing b_committed_data
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:19030
Mar 17 06:26:06 node1 snmpd[25651]: Connection from UDP: [127.0.0.1]:41076
Mar 17 06:26:06 node1 snmpd[25651]: Received SNMP packet(s) from UDP: [127.0.0.1]:41076
Mar 17 06:26:09 node1 kernel: SysRq : Resetting
Mar 17 06:31:15 node1 syslogd 1.4.1: restart.

So it looks like a file system error triggered the reboot. I’m glad the box came back up OK on its own. The $GRID_HOME/log/<hostname>/alert<hostname>.log didn’t show anything specific to storage; normally you would see it start counting the node down if it lost contact with the voting disks (in this case OCR and voting disks share the same disk group).

And why does Clusterware not start?

After some more investigation it seemed there was no underlying problem with the storage, so I tried to manually start the cluster, tailing the ocssd.log file for possible clues.

[root@node1 ~]# crsctl start cluster
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2674: Start of 'ora.cssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.cssd' on 'node1'
CRS-2681: Clean of 'ora.cssd' on 'node1' succeeded
CRS-5804: Communication error with agent process
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'

… the command eventually failed. The ocssd.log file showed this:

...
2011-03-17 09:47:49.073: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996354 ms, node 0x2aaab008a260 { host 'node4', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-a7491097, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.084: [GIPCHALO][1081923904] gipchaLowerProcessNode: no valid interfaces found to node for 10996364 ms, node 0x2aaab008a630 { host 'node6', haName 'CSS_lngdsu1-c1', srcLuid b04d4b7b-2f6ece1c, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [61 : 61], createTime 10936224, flags 0x4 }
2011-03-17 09:47:49.113: [    CSSD][1113332032]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846440, LATS 10996434, lastSeqNo 30846437, uniqueness 1300108895, timestamp 1300355268/3605443434
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355257, LATS 10996434, lastSeqNo 31355254, uniqueness 1300344405, timestamp 1300355268/10388584
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372473, LATS 10996434, lastSeqNo 31372470, uniqueness 1297097908, timestamp 1300355268/3605182454
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384686, LATS 10996434, lastSeqNo 31384683, uniqueness 1297098093, timestamp 1300355268/3604696294
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388819, LATS 10996434, lastSeqNo 31388816, uniqueness 1297098327, timestamp 1300355268/3604712934
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612975, LATS 10996434, lastSeqNo 29612972, uniqueness 1297685443, timestamp 1300355268/3603054884
2011-03-17 09:47:49.158: [    CSSD][1090197824]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203293, LATS 10996434, lastSeqNo 31203290, uniqueness 1297156000, timestamp 1300355268/3604855704
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 3, node33, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355258, LATS 10996434, lastSeqNo 31355255, uniqueness 1300344405, timestamp 1300355268/10388624
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 4, node4, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31372474, LATS 10996434, lastSeqNo 31372471, uniqueness 1297097908, timestamp 1300355268/3605182494
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 5, node5, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31384687, LATS 10996434, lastSeqNo 31384684, uniqueness 1297098093, timestamp 1300355268/3604696304
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 6, node6, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31388821, LATS 10996434, lastSeqNo 31388818, uniqueness 1297098327, timestamp 1300355268/3604713224
2011-03-17 09:47:49.161: [    CSSD][1085155648]clssnmvDHBValidateNCopy: node 7, node7, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 29612977, LATS 10996434, lastSeqNo 29612974, uniqueness 1297685443, timestamp 1300355268/3603055224
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 2, node2, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 30846441, LATS 10996474, lastSeqNo 30846438, uniqueness 1300108895, timestamp 1300355269/3605443654
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 3, node3, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31355259, LATS 10996474, lastSeqNo 31355256, uniqueness 1300344405, timestamp 1300355268/10389264
2011-03-17 09:47:49.197: [    CSSD][1094928704]clssnmvDHBValidateNCopy: node 8, node8, has a disk HB, but no network HB, DHB has rcfg 176226183, wrtcnt, 31203294, LATS 10996474, lastSeqNo 31203291, uniqueness 1297156000, timestamp 1300355269/3604855914
2011-03-17 09:47:49.619: [    CSSD][1116485952]clssnmSendingThread: sending join msg to all nodes
...

The interesting bit is the “but no network HB”, i.e. something must be wrong with the network configuration. I quickly checked the output of ifconfig and found a missing entry for my private interconnect, bond1.251. If you are unsure which interface that is, it is defined in the GPnP profile (the profile XML excerpt from the original post is not reproduced here; a way of checking it yourself is sketched below).
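
If you want to query the profile yourself, it can be dumped with gpnptool; a minimal sketch for 11.2, run as the Grid Infrastructure owner (the tr/grep is only there to pick out the network element):

$ $GRID_HOME/bin/gpnptool get 2>/dev/null | tr '<' '\n' | grep -i cluster_interconnect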

Now that’s a starting point! I tried to bring up bond1.251, but that failed:

[root@node1 network-scripts]# ifup bond1.251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
ERROR: could not add vlan 251 as bond1.251 on dev bond1

The “invalid argument” didn’t mean too much to me, so I traced the ifup script with bash -x to get more information about which argument was invalid:

[root@node1 network-scripts]# which ifup
/sbin/ifup
[root@node1 network-scripts]# view /sbin/ifup
# turned out it's a shell script! Let's run with debug output enabled
[root@node1 network-scripts]# bash -x /sbin/ifup bond1.251
+ unset WINDOW
...
+ MATCH='^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$'
+ [[ bond1.251 =~ ^(eth|hsi|bond)[0-9]+\.[0-9]{1,4}$ ]]
++ echo bond1.251
++ LC_ALL=C
++ sed 's/^[a-z0-9]*\.0*//'
+ VID=251
+ PHYSDEV=bond1
+ [[ bond1.251 =~ ^vlan[0-9]{1,4}? ]]
+ '[' -n 251 ']'
+ '[' '!' -d /proc/net/vlan ']'
+ test -z ''
+ VLAN_NAME_TYPE=DEV_PLUS_VID_NO_PAD
+ /sbin/vconfig set_name_type DEV_PLUS_VID_NO_PAD
+ is_available bond1
+ LC_ALL=
+ LANG=
+ ip -o link
+ grep -q bond1
+ '[' 0 = 1 ']'
+ return 0
+ check_device_down bond1
+ echo bond1
+ grep -q :
+ LC_ALL=C
+ ip -o link
+ grep -q 'bond1[:@].*,UP'
+ return 1
+ '[' '!' -f /proc/net/vlan/bond1.251 ']'
+ /sbin/vconfig add bond1 251
ERROR: trying to add VLAN #251 to IF -:bond1:-  error: Invalid argument
+ /usr/bin/logger -p daemon.info -t ifup 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
+ echo 'ERROR: could not add vlan 251 as bond1.251 on dev bond1'
ERROR: could not add vlan 251 as bond1.251 on dev bond1
+ exit 1

Hmmm, so it seemed that the underlying interface bond1 was missing, which was true. The output of ifconfig didn’t show it as configured, and trying to start it manually using ifup bond1 failed as well. It turned out that the ifcfg-bond1 file was missing and had to be recreated from the documentation. All network configuration files on Red Hat based systems live in /etc/sysconfig/network-scripts/ifcfg-<interfaceName>. With the recreated file in place, I was back in the running:

[root@node1 network-scripts]# ll *bond1*
-rw-r--r-- 1 root root 129 Mar 17 10:07 ifcfg-bond1
-rw-r--r-- 1 root root 168 May 19  2010 ifcfg-bond1.251
[root@node1 network-scripts]# ifup bond1
[root@node1 network-scripts]# ifup bond1.251
Added VLAN with VID == 251 to IF -:bond1:-
[root@node1 network-scripts]#
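
For illustration, the recreated ifcfg-bond1 looked roughly like this; the values are assumptions and the bonding options depend on your setup:

# /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
BONDING_OPTS="mode=1 miimon=100"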

Now I could try to start the lower stack again:

CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.evmd' on 'node1'
CRS-2676: Start of 'ora.evmd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'node1'
CRS-2681: Clean of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
[root@node1 network-scripts]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Brilliant, problem solved. This is actually the first time that an incorrect network configuration has prevented a cluster I look after from starting. The best indication in this case is in the gipcd log file, but it didn’t occur to me to look at it as the error initially appeared to be storage related.

DBCA failing to create RAC databases

This is a weird problem I ran into today. As part of an automation project the code deploys RAC One Node databases across a cluster, depending on the capacity available on each node. The nodes are currently BL685c G6 blades with 128G RAM, to be upgraded to G7s later.

Now, my problem was that after the weekend we couldn’t deploy any more RAC One databases, except on one node. DBCA simply created single instance databases instead. Newly created databases were properly registered in the OCR and their build completed OK, but not as RAC One databases. Take for example this database:

$ srvctl config database -d MYDB
Database unique name: MYDB
Database name: MYDB
Oracle home: /u01/app/oracle/product/11.2.0.2
Oracle user: oracle
Spfile: +DATA/MYDB/spfileMYDB.ora
Domain: example.com
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: MYDB
Database instances: MYDB
Disk Groups: DATA
Mount point paths:
Services:
Type: SINGLE
Database is administrator managed

How come? We are sure that we pass the RACOneNode flag to dbca, and indeed it can be found in the command line. Trying again, I spotted the arguments in the process listing (alongside the sys and system passwords … you should change these as soon as DBCA completes!):

[rac]oracle@node2.example.com $ ps -ef|grep MYDB
oracle   14865 14854 65 11:22 ?        00:00:07 /u01/app/oracle/product/11.2.0.2/jdk/jre/bin/java -Doracle.installer.not_bootstrap=true -DORACLE_HOME=/u01/app/oracle/product/11.2.0.2 -DSET_LAF= -Dsun.java2d.font.DisableAlgorithmicStyles=true -Dice.pilots.html4.ignoreNonGenericFonts=true -DDISPLAY= -DJDBC_PROTOCOL=thin -mx128m -classpath ... oracle.sysman.assistants.dbca.Dbca -silent -createDatabase -templateName /u01/app/oracle/product/admin/templates/Default.dbc -gdbName MYDB.example.com -RACOneNode -RACOneNodeServiceName MYDB_APP.example.com -sid MYDB -sysPassword xxx -systemPassword xxx -emConfiguration NONE -totalMemory 4096 -storageType ASM -asmSysPassword xxx -diskGroupName DATA -initParams db_create_file_dest=+DATA,cpu_count=1 -nodelist node2
oracle   15415  5109  0 11:22 pts/0    00:00:00 grep MYDB

So why the problem? Looking at the dbca trace file I found these lines

[main] [ 2011-03-14 14:15:31.845 GMT ] [SQLEngine.initialize:363]  Starting Reader Thread...
[main] [ 2011-03-14 14:15:31.927 GMT ] [OracleHome.initOptions:1240]  executing: startup nomount pfile='/u01/app/oracle/product/11.2.0.2/dbs/initDBUA0.ora'
[main] [ 2011-03-14 14:15:55.417 GMT ] [SQLEngine.done:2167]  Done called
[main] [ 2011-03-14 14:15:55.418 GMT ] [OracleHome.initOptions:1247]  ORA-00304: requested INSTANCE_NUMBER is busy

oracle.sysman.assistants.util.sqlEngine.SQLFatalErrorException: ORA-00304: requested INSTANCE_NUMBER is busy

 at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeImpl(SQLEngine.java:1655)
 at oracle.sysman.assistants.util.sqlEngine.SQLEngine.executeSql(SQLEngine.java:1903)
 at oracle.sysman.assistants.util.OracleHome.initOptions(OracleHome.java:1241)
 at oracle.sysman.assistants.dbca.backend.SilentHost.initialize(SilentHost.java:179)
 at oracle.sysman.assistants.dbca.Dbca.execute(Dbca.java:116)
 at oracle.sysman.assistants.dbca.Dbca.main(Dbca.java:180)
[main] [ 2011-03-14 14:15:55.420 GMT ] [OracleHome.initOptions:1250]  executing: select parameter from v$option where value='TRUE'
[main] [ 2011-03-14 14:15:55.420 GMT ] [SQLEngine.reInitialize:735]  Reinitializing SQLEngine...

Interesting: the ORA-00304 error sticks out. (The DBCA logs are in $ORACLE_BASE/cfgtoollogs/dbca/<dbName>/ in 11.2, by the way.) Further down the logfile dbca then determines that the RAC option is not available. This isn’t true, and I checked on each node:

$ cd $ORACLE_HOME/rdbms/lib
$ nm -r libknlopt.a | grep -c kcsm.o
1

That was identical on all nodes, so we definitely had RAC compiled into the oracle binary. I also compared the size and timestamp of the oracle binaries in each $ORACLE_HOME, only to find them identical. However dbca didn’t seem impressed with my contradicting evidence and went on creating single instance databases. That was now becoming a little inconvenient.

I then tried relocating one of the successfully created RAC One databases to the nodes where we had problems building them, hoping to find out more about the problem. At this stage I was convinced there was a problem with semaphores or other SysV IPC structures.

I certainly didn’t want to use the Windows Fix and reboot!

Moving On

So to recap, we should be able to build RAC (One Node) databases as the option is compiled into the binary, and yet it doesn’t work. From the trace I gathered that Oracle builds an auxiliary instance first, and uses initDBUA0.ora in $ORACLE_HOME to start it. So where are its logs, where is its diagnostic dest? It turns out to be $ORACLE_HOME/log/ - simply set your ADR base to this location and use the familiar commands. And this finally gave me a clue:

*** 2011-03-14 12:06:40.104
2011-03-14 12:06:40.104: [ CSSCLNT]clssgsGroupJoin: member in use group(0/DBDBUA0)
kgxgnreg: error: status 14
kgxgnreg: error: member number 0 is already in use
kjxgmjoin: can not join the group (DBDBUA0) with id 0 (inst 1)
kjxgmjoin: kgxgn error 3
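
For the record, getting at these traces is just a matter of pointing adrci at the non-standard base; a quick sketch, the path being the 11.2.0.2 home used throughout this post:

$ adrci
adrci> set base /u01/app/oracle/product/11.2.0.2/log
adrci> show homes
adrci> show alert -tail 50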

So somewhere else in the cluster had to be a DBUA0 instance that prevented my new instance from starting. A quick trawl through the process table on all nodes revealed that DBUA was active on node6. Shutting that down solved the problem!
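
A loop along these lines would have found the culprit a lot quicker - a quick sketch, with the node names obviously made up:

$ for node in node1 node2 node3 node4 node5 node6; do
>   echo "### $node"
>   ssh $node "ps -ef | grep -v grep | grep ora_pmon_DBUA"
> done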

Summary

DBCA is a nice tool to create databases, together with user definable templates it is really flexible. From a technical point of view it works as follows:

  • For RAC and RAC One Node it tries to create an auxiliary instance, called DBUA0, as a cluster database. If DBUA0 is already in use on the same node, it will use DBUA1 and so on.
  • Next it will rename the database to what we assign on the command line
  • It then performs a lot more actions which are not of relevance here.

In my case, one of these DBUA0 auxiliary instances was still present on a different node in the cluster as the result of a crashed database creation. When a subsequent call to dbca created another auxiliary (cluster!) DBUA0 instance on a different node, it wasn’t aware that a DBUA0 already existed, and LMON refused to create it. This is expected behaviour: instance names have to be unique across the cluster. The DBUA0 of node2, for example, clashed with the one on node6.

Why did it work on p6 then, I hear you ask? DBCA seems to have logic to establish that DBUA0 is already in use on the local node, in which case it uses DBUA1 next.

Update

I got this update from Oracle Support who acknowledge this as a bug:

Notes 16-Mar-2011 1:21:16 GMT+00:00 PM Oracle Support
Unscheduled
Generic Note
————————
Dear Martin,

Following bug is created for this particular issue.

Bug 11877668 – DBCA DOESN’T CREATE A RAC ONE DATABASE IN SILENT MODE

Cluster callouts to create blackouts in EM

Finally I got around to providing a useful example for a cluster callout script. It is actually on the verge of taking too long-remember that scripts in the $GRID_HOME/racg/usrco/ directory should execute quickly. Before deploying this, you should definitely ensure that the script executes quickly enough-the “time” utility can help you with this. Nevertheless this has been necessary to work around a limitation of Grid Control: RAC One Node databases are not supported in GC 11.1 (I complained about that earlier).

The Problem

The problem I am trying to alleviate with the script arises when using srvctl relocate database: another instance (usually called dbName_2) is started so that existing sessions can survive the relocation if they use TAF or FAN/FCF.

This poses a big problem to Grid Control though: the second instance didn’t exist when you registered the database as a target, hence GC doesn’t know about it. Subsequently you may get paged that the database is down when in reality it is not. Receiving one of these “false positive” alarms at 02:00 in the morning is annoying at best. Strictly speaking, Grid Control is right in assuming that the database is down: although detected as a cluster database target, it only consists of 1 instance. If that’s down, it has to be assumed that the whole cluster database is down. In a perfect world we wouldn’t have this problem: GC would be aware that the RON database had moved to another node in the cluster and would update its configuration accordingly. This is planned for the next major release sometime later in 2011. Apparently dbconsole already has the ability to deal with such a situation.

Now with the background explained, management had to weigh the options: either don’t register the RAC One database in Grid Control and have no monitoring at all, or bite the bullet and have monitoring only while the initial instance is started on the primary node. The decision was made to have (limited) monitoring. To prevent the DBA from being woken up unnecessarily I developed the simple script below to automatically create a blackout in GC if the “_2” instance starts. The blackout is subsequently lifted when the “_1” instance starts.

Room for improvement: the script assumes that a RON database can have a maximum of 2 member servers. If your database can run on more than 2 nodes you should use relocate_target when the _1 instance comes up on a different node from the one GC expects.

The Script

My algorithm checks for cluster events, and if an instance dbName_2 starts, I create a blackout on the initial instance to prevent being paged until Oracle have come up with a better solution (we are flying blind once the 2nd database instance has started).

The script assumes that you have deployed emcli on each cluster node (or on ACFS). EMCLI is the Enterprise Manager Command Line Interface; it is located on your OMS together with the installation instructions. This is the default location: https://oms.example.com:7799/em/console/emcli/download - 7799 is the default port for Grid Control.
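
Setting emcli up against the OMS is a one-off task per node (or per ACFS location) and looks roughly like this - a sketch, where the user name and directory are just examples and -trustall skips certificate validation:

$ export JAVA_HOME=/shared/acfs/emcli/jdk1.6.0_24
$ export PATH=$JAVA_HOME/bin:$PATH
$ /shared/acfs/emcli/emcli setup -url="https://oms.example.com:7799/em" \
>    -username=sysman -trustall -dir=/shared/acfs/emcli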

Let’s have a look at the script:

#!/bin/bash

# enable debugging if needed
set -x
exec >> /tmp/autoBlackout.log 2>&1

EVENTTYPE=$1

# only SERVICEMEMBER populates instance, database and service as needed
# for the blackout section below.
if [ "$EVENTTYPE" != "SERVICEMEMBER" ]; then
 exit 0
fi

# adjust to your needs or set to the empty string if you are not using db_domain
# assumes that both database and service have the same domain
DOMAIN=example.com

# bail out if there are too many instances of this script running. Inform the
# admin via email
ME=`basename $0`
RUNNING=`ps -ef | grep -v grep | grep $ME | wc -l`
if [ $RUNNING -ge 6 ]; then
 echo Too many instances of this script running, aborting
 echo Too many instances of $ME running, aborting |
 mail -s "$RUNNING instances of $ME detected on `hostname`" admin@example.com
 exit 1
fi

# set up for emcli (emcli requires jdk 1.6)
JAVA_HOME=/shared/acfs/emcli/jdk1.6.0_24
PATH=$JAVA_HOME/bin:$PATH
EMCLI=/shared/acfs/emcli/emcli
export JAVA_HOME PATH

# turn off debugging for a moment - the below parsing of the command line
# parameters is very verbose.
set +x

# read the parameters passed to us-modified version of a script
# found at rachelp.nl
for ARGS in $*;
 do
 PROPERTY=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $1}'`
 VALUE=`echo $ARGS | /bin/awk -F"=|[ ]" '{print $2}'`
 case $PROPERTY in
 VERSION|version)    VERSION=$VALUE ;;
 SERVICE|service)    SERVICE=$VALUE ;;
 DATABASE|database)    DATABASE=$VALUE ;;
 INSTANCE|instance)    INSTANCE=$VALUE ;;
 HOST|host)        HOST=$VALUE ;;
 STATUS|status)        STATUS=$VALUE ;;
 REASON|reason)        REASON=$VALUE ;;
 CARD|card)        CARDINALITY=$VALUE ;;
 TIMESTAMP|timestamp)    LOGDATE=$VALUE ;;
 ??:??:??)        LOGTIME=$VALUE ;;
 esac
 done

# and turn debugging on again
set -x

# targets are reported in lower case :( Someone please suggest a better
# way to get a lower case string to upper case
DATABASE=`echo $DATABASE | tr "[a-z]" "[A-Z]"`

# targets affected are rac_database and the oracle_database (instance)
# not using emcli here as it has to be quick. A rac_database target is a
# composite target, consisting of multiple oracle_database targets. In
# RAC One Node there is only one instance - see output from GC below:
# $ emcli get_targets | grep "RAC"
# 0       Down           oracle_database  RAC.example.com_RAC_1
# 0       Down           rac_database     RAC.example.com

# define what we want to black out (only ever the primary instance!)
BLACKOUT_NAME=blackout_${DATABASE}
BLACKOUT_TARGETS="$DATABASE.${DOMAIN}:rac_database;${DATABASE}.${DOMAIN}_${DATABASE}_1:oracle_database"

# create a blackout if the secondary instance is up (we only ever register the _1 instance)
# the blackout duration is indefinite-it will be stopped and lifted automatically. You may
# want to limit this to a few hours to raise visibility.
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_2" ]]; then
 echo create blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI create_blackout -name=${BLACKOUT_NAME} -add_targets=${BLACKOUT_TARGETS} \
 -reason="auto blackout" -schedule="frequency:once;duration:-1"
fi

# disable the blackout if instance *_1 starts
# this is where the script could be improved if the RON database can run on more
# than 2 nodes. You could use emcli relocate_target to relocate the target to another
# node
if [[ $STATUS == "up" && ${INSTANCE: -2} == "_1" ]]; then
 echo remove blackout
 $EMCLI login -username=user -password=supersecretpassword
 $EMCLI stop_blackout -name=${BLACKOUT_NAME}
 $EMCLI delete_blackout -name=${BLACKOUT_NAME}
fi

I tried to add a lot of comments to the script, which should make it easy for you to adjust it. I recommend you store it in ACFS and mount that directory on all cluster nodes. Create a symbolic link from the ACFS to $GRID_HOME/racg/usrco/ to make maintenance easier. You could enable log rotation for the logfile in /tmp if you liked, otherwise keep an eye on it so it doesn’t grow to gigabytes.
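
Deployment then boils down to something like the following sketch; the script location is an example, and the callout arguments simply mimic what the parser above expects, so you can test and time an invocation by hand (bear in mind it will attempt the emcli calls):

$ ln -s /shared/acfs/scripts/autoBlackout.sh $GRID_HOME/racg/usrco/autoBlackout.sh
$ time $GRID_HOME/racg/usrco/autoBlackout.sh SERVICEMEMBER VERSION=1.0 \
>    service=RON_APP.example.com database=ron instance=RON_2 host=node2 status=up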

Using wget and proxy to download patches from MOS

This is a rather quick note, but it can be quite useful in certain situations. I currently look after a system which is quite difficult to jump on. That means before I get to do a “sudo su - oracle” I need to get to a jump-off box, ssh to 2 other machines and then log in as myself. It’s secure, but not user friendly. Especially in this case, where I needed to run the latest RDA for an open support request.

So rather than “dragging” the RDA with me on each box I used the new (Flash) interface to get a small shell script which you just need to deploy to your machine and run. It then connects to updates.oracle.com and does its magic.

The script works mostly fine, but depending on your environment you have to make small changes. My example is for Solaris 10; any Linux should just work out of the box.

To start with you need to log in to the Flash version of My Oracle Support (unsurprisingly) and click on “patches and updates”. Enter your patch number, for example 10376971 for RDA on Solaris SPARC 64bit. On the results page, click on the description (“remote Diagnostics Agent OCM Bundle….”). The line will now be highlighted. Next, click on the download icon to open a new pop up window. The magic is in the “wget options” link down to the left. Clicking on it gives you the option to have a script created for you or to copy the script to the clipboard.

I chose the latter, and created /tmp/get.sh with the contents from the clipboard. The script is shown here:

#!/bin/sh

#
# Generated Wed, 23 Feb 2011 08:49:57 Coordinated Universal Time
# Start of user configurable variables
#

# SSO username and password
SSO_USERNAME=user@company.com
SSO_PASSWORD=

# E-Delivery token
# The EPD_TOKEN will expire 48 hours after the following generation date
# Wed, 23 Feb 2011 08:49:57 Coordinated Universal Time
EPD_TOKEN=

# Path to wget command
WGET="/usr/bin/wget"

# Location of cookie file
COOKIE_FILE=/tmp/$$.cookies

# Log directory and file
LOGDIR=.
LOGFILE=$LOGDIR/wgetlog-`date +%m-%d-%y-%H:%M`.log

# Output directory and file
OUTPUT_DIR=.

#
# End of user configurable variable
#

if [ "$SSO_PASSWORD " = " " ]
then
 echo "Please edit script and set SSO_PASSWORD"
 exit
fi

# Contact updates site so that we can get SSO Params for logging in
SSO_RESPONSE=`$WGET https://updates.oracle.com/Orion/Services/download 2>&1|grep Location`

# Extract request parameters for SSO
SSO_TOKEN=`echo $SSO_RESPONSE| cut -d '=' -f 2|cut -d ' ' -f 1`
SSO_SERVER=`echo $SSO_RESPONSE| cut -d ' ' -f 2|cut -d 'p' -f 1,2`
SSO_AUTH_URL=sso/auth
AUTH_DATA="ssousername=$SSO_USERNAME&password=$SSO_PASSWORD&site2pstoretoken=$SSO_TOKEN"

# The following command to authenticate uses HTTPS. This will work only if the wget in the environment
# where this script will be executed was compiled with OpenSSL. Remove the --secure-protocol option
# if wget was not compiled with OpenSSL
# Depending on the preference, the other options are --secure-protocol= auto|SSLv2|SSLv3|TLSv1
$WGET --secure-protocol=auto --post-data $AUTH_DATA --save-cookies=$COOKIE_FILE --keep-session-cookies $SSO_SERVER$SSO_AUTH_URL -O sso.out >> $LOGFILE 2>&1

rm -f sso.out

$WGET  --load-cookies=$COOKIE_FILE --save-cookies=$COOKIE_FILE --keep-session-cookies "https://updates.oracle.com/Orion/Services/download/p10376971_422_SOLARIS64.zip?aru=13243257&patch_file=p10376971_422_SOLARIS64.zip" -O $OUTPUT_DIR/p10376971_422_SOLARIS64.zip   >> $LOGFILE 2>&1

# Cleanup
rm -f $COOKIE_FILE

You need to check the SSO_USERNAME and SSO_PASSWORD variables to match your settings. I needed to make some minor modifications for my Solaris 10 installation. For example, wget is in the Sun Freeware directory (/usr/sfw/bin), which is not in the PATH. I also changed wget to not check for certificates, as shown in this example (line 20 of the script):

WGET="/usr/sfw/bin/wget --no-check-certificate"

I also had to set a proxy: set the following environment variables, either in the script or on the command line:

export http_proxy=http://proxy.host:port
export https_proxy=https://proxy.host:port

With this set, everything worked smoothly. It pays off to run the script with the -x option, as in bash -x /tmp/get.sh. As you can see, the script writes a log file to your current working directory, and it saves the patch there as well. Be careful not to fill up /tmp with a 3GB download :)
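
For completeness, a typical run then looks something like this; the proxy host and port are obviously placeholders:

$ export https_proxy=https://proxy.example.com:3128
$ cd /tmp && bash -x ./get.sh
$ tail wgetlog-*.log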

This technique obviously depends on the wget utility being available. For security reasons your sys admin may not have installed it, in which case you might use some fancy cascaded port-forwarding to get the patch to your box (or ask someone with more permissions, after having spent 30 minutes raising a ticket which is going to be executed in the next 3 weeks).

Happy patching

RAC One Node and Database Protection

An email from fellow Oak Table Member James Morle about RAC One Node and failover got me thinking about the capabilities of the product.

I have written about RON (RAC One Node) in earlier posts, but haven’t really explored what happens with session failover during a database relocation.

Overview

So to clarify what happens in these two scenarios I have developed a simple test. Taking a RON database plus a service, I modified both to suit my test needs. Connected to the service, I performed a database relocation to see what happens. Next I killed the instance (I wasn’t able to reboot the node) to simulate what happens when the node crashes.

Setup

The setup used an existing database, “RON”. It also had a service defined already, but that needed tweaking. The database was defined as follows:

$ srvctl config database -d RON
Database unique name: RON
Database name: RON
Oracle home: /u01/app/oracle/product/11.2.0.2
Oracle user: oracle
Spfile: +DATA/RON/spfileRON.ora
Domain: example.com
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: RON
Database instances:
Disk Groups: DATA
Mount point paths:
Services: RON_APP.example.com
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: RON
Candidate servers: node1,node2
Database is administrator managed

The service was initially defined as follows:

srvctl config service -d RON
Service name: RON_APP.example.com
Service is enabled
Server pool: RON
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: RON_1
Available instances:
[rac]oracle@node1.example.com $

Some of these attributes require special attention; we have 3 categories to deal with: the preferred instances for TAF, the TAF setup itself, and the runtime load balancing (RLB) advisory.

Transparent Application Failover with “real” RAC only works if there are two preferred instances. As you can see, the service has only one preferred instance. Not sure if that can be changed though… Let’s try:

srvctl modify service -s RON_APP -d RON -i RON_1,RON_2
PRKO-2007 : Invalid instance name: RON_2

I wasn’t surprised: RON is not a RAC database, so it has only 1 active instance. When registering a RAC One Node database you don’t add instances as you would with a RAC database; instead you set the database type to RACONENODE (srvctl add database -d name -c RACONENODE …).

I recommend setting TAF properties at the service level; that way you don’t miss crucial parameters in your tnsnames.ora file. This is the preferred way of doing it, at least since 11.2. Changing the service is straightforward:

srvctl modify service -d RON -s RON_APP.example.com \
> -P BASIC -e SESSION -m BASIC

This instructs the service to use a BASIC TAF policy, a failover type of SESSION and the BASIC failover method. These parameters would normally be configured in the CONNECT_DATA section of your tnsnames.ora file.
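
For comparison, the client-side equivalent would look roughly like the entry below in tnsnames.ora; with the service-level settings above you don’t need it, and the RETRIES/DELAY values are purely illustrative:

RON =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = scan1.example.com)(PORT = 1825))
    (CONNECT_DATA =
      (SERVICE_NAME = RON_APP.example.com)
      (FAILOVER_MODE = (TYPE = SESSION)(METHOD = BASIC)(RETRIES = 180)(DELAY = 5))
    )
  )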

With these changes made, the service configuration has changed to the below:

$ srvctl config service -d RON -s RON_APP.example.com
Service name: RON_APP.example.com
Service is enabled
Server pool: RON
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: RON_1
Available instances:
[rac]oracle@node1.example.com $

I also wanted to change the defaults to a more suitable RLB configuration. Instead of a CLB goal of “LONG” I wanted to set it up for “SHORT”. And I needed emphasis on SERVICE_TIME as well (my intention was to run swingbench). The change and resulting service configuration are shown below:

$ srvctl modify service -d RON -s RON_APP.example.com \
> -j short -B service_time

$ srvctl config service -d RON -s RON_APP.example.com
Service name: RON_APP.example.com
Service is enabled
Server pool: RON
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: SHORT
Runtime Load Balancing Goal: SERVICE_TIME
TAF policy specification: BASIC
Edition:
Preferred instances: RON_1
Available instances:
[rac]oracle@node1.example.com $

My SCAN name was scan1.example.com, and I ensured that local_listener was pointing to my non-default port of 1821 and that remote_listener used the EZConnect syntax (“scan1.example.com:1825”). It is very important to set the local_listener parameter if you are not using the default port of 1521!
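
For reference, this is roughly how these two parameters would be set - a sketch only, the VIP host name is made up and the ports are the ones mentioned above:

SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=node1-vip.example.com)(PORT=1821))' scope=both sid='RON_1';
SQL> alter system set remote_listener='scan1.example.com:1825' scope=both sid='*';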

On my client system I defined a local TNS alias “RON” to connect  to the RON database. Note that it doesn’t use any TAF parameters.

C:\oracle\product\11.2.0\client_1\network\admin>tnsping ron

TNS Ping Utility for 32-bit Windows: Version 11.2.0.2.0 - Production on 15-FEB-2011 17:07:44

Copyright (c) 1997, 2010, Oracle.  All rights reserved.

Used parameter files:
C:\oracle\product\11.2.0\client_1\network\admin\sqlnet.ora

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = scan1.example.com)(PORT = 1825)
)) (CONNECT_DATA = (SERVICE_NAME = LRON_APP.uk.db.com)))
OK (10 msec)

Database Relocation

Oracle’s promise is that you don’t lose your session during a database relocation. Let’s see if that is actually true. Using my setup I connected to the RON database (sorry for the broken formatting!):

C:\oracle\product\11.2.0\client_1\network\admin>sqlplus martin/test@ron

SQL*Plus: Release 11.2.0.2.0 Production on Tue Feb 15 16:50:08 2011

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL> select username,failover_method,failover_type,failed_over from gv$session where username='MARTIN';

USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
------------------------------ ---------- ------------- ---
MARTIN                         BASIC      SESSION       NO

OK, so TAF is working. At this time I started the relocate command:

[RON_1]oracle@node1.example.com $ srvctl relocate database -d RON -n node2 -w 1 -v
Configuration updated to two instances
Instance RON_2 started

I then repeatedly re-ran my SQL statement from above while the instance was relocating. The output is shown below.

SQL> select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN';

 INST_ID USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
---------- ------------------------------ ---------- ------------- ---
 1 MARTIN                         NONE       NONE          NO
 1 MARTIN                         BASIC      SESSION       NO
 2 MARTIN                         NONE       NONE          NO

SQL> select * from v$active_instances;

INST_NUMBER INST_NAME
----------- ------------------------------------------------------------
 1 node1.example.com:RON_1
 2 node2.example.com:RON_2

SQL> select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN';

 INST_ID USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
---------- ------------------------------ ---------- ------------- ---
 1 MARTIN                         NONE       NONE          NO
 1 MARTIN                         BASIC      SESSION       NO
 2 MARTIN                         NONE       NONE          NO

As you can see, the configuration change to 2 instances is reflected in the output of my query against v$active_instances. You can also see the number of sessions increasing; pay attention to the inst_id column: a new session has been created on the second instance.

In my other session I saw the relocation completing:

Services relocated
Waiting for 1 minutes for instance RON_1 to stop.....
Instance RON_1 stopped
Configuration updated to one instance

I ran my query one more time to see what happened: would my session survive?

SQL> /
select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'
*
ERROR at line 1:
ORA-25408: can not safely replay call

SQL> /

 INST_ID USERNAME                       FAILOVER_M FAILOVER_TYPE FAI
---------- ------------------------------ ---------- ------------- ---
 2 MARTIN                         BASIC      SESSION       YES

SQL>
SQL> select * from v$active_instances;

INST_NUMBER INST_NAME
----------- ------------------------------------------------------------
 2 node2.example.com:RON_2

Well, it did survive. Note the ORA-25408 error. That’s expected: since I’m using the SQL*Plus client I don’t have the opportunity to trap the error and replay my OCI call. You should catch this exception in Java or your preferred development environment. I have provided an example in chapter 11 of Pro Oracle Database 11g RAC on Linux.

Node failure

I couldn’t see how a session would survive in the case of a node failure… I said this in my email to James:

> I cannot see how TAF or FCF work in case of a node failure. In my
> tests I did for the book TAF only worked if there was a service with
> at least 2 preferred instances. And FCF requires a FAN aware
> connection pool which is rare to find. RAC one however only has only 1
> active node (unless you relocate it).

But better test before jumping to conclusions!

I could only kill an instance rather than the server, which would have been a better test. I assumed Clusterware would try to restart the failed instance on the same node a few times and then relocate the resource if the start was not successful.

I knew the database was now running on node 2 so I killed the SMON process. That sure results in an instance crash.

oracle@node2.example.com $ ps -ef | grep RON | grep smon
oracle   14208     1  0 16:52 ?        00:00:00 ora_smon_RON_2
$ kill -9 14208

Just as you would expect, the session didn’t survive this (how could it? There is no active second instance!)

SQL> /
select inst_id,username,failover_method,failover_type,failed_over from gv$session where username='MARTIN'
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 15649
Session ID: 183 Serial number: 3

And just as expected, Clusterware restarted the failed instance the second it detected the failure. The node’s alert log showed this:

System State dumped to trace file /u01/app/oracle/product/admin/RON/admin/diag/rdbms/RON/RON_2/trace/RON_2_diag_14174.trc
ORA-1092 : opitsk aborting process
2011-02-15 16:58:29.255000 +00:00
ORA-1092 : opitsk aborting process
License high water mark = 6
2011-02-15 16:58:32.569000 +00:00
Instance terminated by PMON, pid = 14164
USER (ospid: 16101): terminating the instance
Instance terminated by USER, pid = 16101
2011-02-15 16:58:35.286000 +00:00
Starting ORACLE instance (normal)
2011-02-15 16:58:36.477000 +00:00
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
2011-02-15 16:58:42.687000 +00:00
Private Interface 'bond1:1' configured from GPnP for use as a private interconnect.
[name='bond1:1',...
..., use=public/1]

Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u01/app/oracle/product/11.2.0.2/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 – 64bit Production

Summary

That concludes my testing. RON works better than a “classic” active/passive cluster and allows sessions to stay connected when migrating the database to a different host. And it makes it easier to convert the database to a full RAC database (I have an example but it needs a bit of tidying up before posting). On the other hand, virtualisation technology has allowed us to do the same for quite some time now. Xen and OracleVM can relocate domUs, and vmotion is the commercial alternative. Whatever suits your needs.

11.1 GC agent refuses to start

This is a follow-up post to my previous tale of how not to move networks for RAC. After having successfully restarted the cluster as described here in a previous post, I went on to install a Grid Control 11.1 system. This was to be on Solaris 10 SPARC. Why SPARC? Not my platform of choice when it comes to Oracle software, but my customer has a huge SPARC estate and wants to make the most of it.

After the OMS had been built (hopefully I’ll find time to document this as it can be quite tricky on SPARC) I wanted to secure the agents on my cluster against it. That worked ok for the first node:

  • emctl clearstate agent
  • emctl secure agent
  • emctl start agent

Five minutes later the agent appeared in my list of agents in the Grid Control console. With this success backing me I went to do the same on the next cluster node.

Here things were different-here’s the sequence of commands I used:

$ emctl stop agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved
$

I didn’t pay too much attention to the fact that there was no acknowledgement of the completion of the stop command. I noticed something wasn’t quite right when I tried to get the agent’s status:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
emctl stop agent
Error connecting to https://node2.example.com:3872/emd/main

Now that should have reported that the agent was down. Strange. I tried a few more commands,  such as the following one to start the agent.

[agent]oracle@node2.example.com $ emctl start agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already running

Which wasn’t the case-there was no agent process whatsoever in the process table. I also checked the emd.properties file. Note that the emd.properties file is in $AGENT_HOME/hostname/sysman/config/ now instead of $AGENT_HOME/sysman/config as it was in 10g.
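
A quick sanity check of what the agent believes its own URL and upload target to be looks roughly like this - a sketch, and note that the directory name is the host name the agent was installed with:

$ grep -E "^(EMD_URL|REPOSITORY_URL)" \
>    $AGENT_HOME/`hostname`/sysman/config/emd.properties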

Everything looked correct, and even a comparison with the first node didn’t reveal any discrepancy. So I scratched my head a little more until I found a MOS note on the subject stating that the agent cannot listen on multiple addresses. The note is for 10g only and has the rather clumsy title “Grid Control Agent Startup: ‘emctl start agent’ Command Returns ‘Agent is already running’ Although the Agent is Stopped” (Doc ID 1079424.1).

Although it states that it is for 10g and multiple NICs, it got me thinking. And indeed, the /etc/hosts file had not been updated, leaving the old cluster address in /etc/hosts while the new one was in DNS.

# grep node2 /etc/hosts
10.x.x4.42            node2.example.com node2
172.x.x.1x8          node2-priv.example.com node2-priv
# host node2.example.com
node2.example.com has address 10.x5.x8.3
[root@node2 ~]# grep ^hosts /etc/nsswitch.conf
hosts:      files dns

This also explained why the agent started on the first node-it had an updated /etc/hosts file. Why the other nodes didn’t have their hosts file updated will forever remain a mystery.
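
A simple loop like the following would have caught the discrepancy on all nodes in one go; node names and domain are examples:

$ for node in node1 node2 node3 node4 node5 node6 node7 node8; do
>   echo "### $node"
>   ssh $node "grep $node /etc/hosts; host $node.example.com"
> done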

Things then changed dramatically after the hosts file had been updated:

$ emctl status agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running

Note how emctl now acknowledges that the agent is down. I successfully secured and started the agent:

$ emctl secure agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Agent is already stopped...   Done.
Securing agent...   Started.
Enter Agent Registration Password :
Securing agent...   Successful.

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent is Not Running
$ emctl start agent

Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
Starting agent .............. started.

One smaller problem remained:

$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:00:13
Last successful upload                       : 2011-02-14 10:00:19
Total Megabytes of XML files uploaded so far :    11.56
Number of XML files pending upload           :      188
Size of XML files pending upload(MB)         :    65.89
Available disk space on upload filesystem    :    60.11%
Collection Status                            : Disabled by Upload Manager
Last successful heartbeat to OMS             : 2011-02-14 10:00:17
---------------------------------------------------------------
Agent is Running and Ready

The interesting line is the Collection Status: “Disabled by Upload Manager”. That’s because a lot of data hasn’t been transferred yet. Let’s force an upload; I know the communication between agent and OMS is working, so that should resolve the issue.

$ emctl upload
$ emctl status agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 11.1.0.1.0
OMS Version       : 11.1.0.1.0
Protocol Version  : 11.1.0.0.0
Agent Home        : /u01/app/oracle/product/agent11g/node8.example.com
Agent binaries    : /u01/app/oracle/product/agent11g
Agent Process ID  : 14045
Parent Process ID : 14014
Agent URL         : https://node8.example.com:3872/emd/main
Repository URL    : https://oms.example.com:1159/em/upload
Started at        : 2011-02-14 09:59:03
Started by user   : oracle
Last Reload       : 2011-02-14 10:02:12
Last successful upload                       : 2011-02-14 10:02:53
Total Megabytes of XML files uploaded so far :    91.12
Number of XML files pending upload           :       22
Size of XML files pending upload(MB)         :     1.50
Available disk space on upload filesystem    :    60.30%
Last successful heartbeat to OMS             : 2011-02-14 10:02:19
---------------------------------------------------------------
Agent is Running and Ready

That’s about it: a few minutes later the agent was visible in the console. Now that only had to be repeated for the remaining 6 nodes…

NB: For the reasons shown in this article I don’t endorse duplicating host information in /etc/hosts and DNS-a resilient DNS infrastructure should always be used to store this kind of information.

Troubleshooting ora.net1.network on an 8 node cluster

It seems I am doing a lot of fixing of broken stuff recently. This time I have been asked to repair a broken 8 node RAC cluster on OEL 5.5 with Oracle RAC 11.2.0.2. The system had been moved into a different, more secure network, and its firewalls prevented all access to the machines except for ILO. Another case of “security through obscurity”. The new network didn’t allow any clients to connect to the 8 node RAC cluster, which meant some quite expensive kit was sitting idle. The cluster is not in production; it is still being built to specification, but this accessibility problem has been holding up the project for a little while now.

Yesterday brought a breakthrough: the netops team found an error in their configuration and for the first time the hosts could be accessed via ssh. Unfortunately for me that access is only possible via audited gateways using PowerBroker, to which I don’t have access. An alternative was the ILO interface, which has not yet been hardened to production standards. So after some internal discussion I was given the ILO access credentials. This is good and bad: good, because I could finally get at a thoroughly broken system, and bad because there is no copy and paste with a Java based console. And if that wasn’t bad enough, I had to content myself with 80×24 characters on the console (albeit in very big letters). I pretty much needed all of my 24″ screen to display it. But I digress.

When logging on, I found the following situation:

  • Only 1 out of 8 nodes had OHAS/CRSD started. The others were still down: a kernel upgrade had taken place, but the ASMLib kernel module hadn’t been upgraded at the same time. The first node had the correct RPM installed and ASMLib had done its magic on that node.
  • Clusterware’s lower stack was up. However, the ora.net1.network resource and all resources depending on it (listener, SCAN, SCAN listener, etc.) were down. Not a single byte went over the public network. That was strange.

Running /sbin/ifconfig has been a dream on this machine – I saw all 3 SCAN IPs on it, and all 8 node virtual IP addresses. Plus it has 6 NICs for Oracle, bonded into pairs of 2. And this is exactly where the confusion starts. I found the following bonded interfaces defined:

  • bond0
  • bond1.251
  • bond0.212

It took a while to figure out why these interfaces were named the way they were, but apparently the suffix is the VLAN ID. It also transpired that one of my colleagues had tried to replace the previously used bond0.212 with bond0 as the public network. He was however not successful in doing so, leaving the cluster in the state it was in.
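
For background: on RHEL/OEL 5 a VLAN-tagged interface on top of a bond is just another ifcfg file with the VLAN ID appended to the device name and VLAN=yes set; a hypothetical example:

# /etc/sysconfig/network-scripts/ifcfg-bond0.212
DEVICE=bond0.212
VLAN=yes
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.xxx.xxx.2
NETMASK=255.255.254.0
USERCTL=no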

He had used oifcfg to update the public interface definition, with the following end result:

$ oifcfg getif
bond1.251  172.xxx.0  global  cluster_interconnect
bond0  10.2xxx8.0  global  public
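
For reference, the change itself would have been along these lines - a sketch only, using the masked subnet shown above:

$ oifcfg delif -global bond0.212
$ oifcfg setif -global bond0/10.2xxx8.0:public
$ oifcfg getif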

He also changed the vip configuration, with the result shown here:

srvctl config vip -n node11
VIP exists: /node1-vip/10.2xx8.13/10.2xx8.0/255.255.255.0/bond0, hosting node node11

However

The VIP remained unimpressed:

srvctl start vip -n node1
PRCR-1079 : Failed to start resource ora.node1.vip
CRS-2674: Start of 'ora.net1.network' on 'node1' failed
CRS-2632: There are no more servers to try to place resource 'ora.node1.vip' on that would satisfy its placement policy

That’s where I have been asked to cast a keen eye over the installation.

The Investigation

First of all, I could find nothing wrong with what had been done so far. Starting my investigation, I thought there was something wrong with the public network, so I decided to shut it down:

# ifdown bond0

I then checked the network configuration in /etc/sysconfig/network-scripts. The settings are shown here:

ifcfg-bond0

DEVICE=bond0
BONDING_OPTS="use_carrier=0 miimon=0 mode=1 arp_interval=10000 arp_ip_target=10.xxx.4 primary=eth0"
BOOTPROTO=none
ONBOOT=yes
NETWORK=10.2xxx.0
NETMASK=255.255.254.0
IPADDR=10.xxx.2
USERCTL=no

ifcfg-eth0

DEVICE=eth0
HWADDR=f4:ce:46:87:fa:d0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

ifcfg-eth1

DEVICE=eth1
HWADDR=f4:ce:46:87:fa:d4
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no

The MAC addresses in ifcfg-eth* matched the output of the ifconfig command. In the lab I occasionally have the problem that my configuration files don’t match the real MAC addresses and therefore my NICs don’t come up, but this wasn’t the case here.

I then checked whether the bonding kernel module was loaded correctly. Usually you’d find that in /etc/modprobe.conf, but there was no entry. I added these lines as per the documentation:

alias bond0 bonding
alias bond1 bonding
alias bond1.251 bonding

With that all done I brought the bond0 interface back up (don’t ever try to bring down the private interconnect - it will cause a node eviction!). Still nothing. The output of crsctl status resource -t remained “OFFLINE” for resource ora.net1.network. BTW, you cannot manually start a network resource using srvctl (it’s an ora.* resource, so don’t even think about trying crsctl start resource ora.net1.network :). All you can do with a network resource is get its configuration (srvctl config network -k 1 …) and modify it (srvctl modify network -k 1 …).
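
To illustrate, checking and correcting the network resource definition looks roughly like this - a sketch using the masked subnet from this post; the netmask must match the one in ifcfg-bond0:

$ srvctl config network -k 1
$ srvctl modify network -k 1 -S 10.2xxx.0/255.255.254.0/bond0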

ORAROOTAGENT is responsible for starting the network, and it will try to do so every second or so. That’s CRSD’s ORAROOTAGENT by the way, the log file is in $GRID_HOME/log/`hostname -s`/agent/crsd/orarootagent_root/orarootagent_root.log.
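
Watching the agent retry is then as simple as tailing that log, for example:

$ tail -f $GRID_HOME/log/`hostname -s`/agent/crsd/orarootagent_root/orarootagent_root.log | grep -i net1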

After the modification to bond0 I could now ping the IP associated with it, so at least that was a success. One thing I learned that day is that the MAC address of the bonded interface matches that of the primary eth* interface; in my case it was eth0’s, i.e. f4:ce:46:87:fa:d0. If the primary NIC failed, the bond would presumably assume the backup NIC’s MAC address. So in summary:

  • the network bonding was correctly configured
  • I could ping bond0

At this point I could see no reason why the network resource failed to start. Maybe a typo in the configuration? The network configuration can be queried with 2 commands: oifcfg and srvctl config network. I tried oifcfg first; oifcfg getif returned:

bond0 10.xx.x2.0           "good"
bond0 10.xx.x8.0           "old/bad"
bond1.251 172.xx.xx.160    interconnect
bond1.251 169.254.0.0

Hmmm, where’s that second bond0 interface from? The bond1.251 interface is in use and working; the 172.xxx IP matches the IP address assigned in ifcfg-bond1.251. The second entry for bond1.251 is created by the HAIP resource and has to do with the highly available cluster interconnect, which uses multicasting for communication (to the frustration of many users who upgraded to 11.2.0.2 only to find out that the lower stack doesn’t start on the second and subsequent nodes).

So to be sure that I was really seeing something unusual I compared the output with another node in the cluster. There I found only 3 entries: bond0 and bond1.251 plus the 169.254.0.0 HAIP entry. I initially tried to remove the bad network with oifcfg delif, but that didn’t work. I then verified the output of srvctl config network to see if it matched what I expected. And here was a surprise: the network was listed with a wrong subnet mask. Instead of 255.255.254.0 (note the “254”!) I found 255.255.255.0. That was easy to fix, and while I was back trying to delete the old network using oifcfg I suddenly realised that the cluster had sprung back into life. Small typo, big consequences! Finally all the resources depending on ora.net1.network were started, including SCAN VIPs, SCAN listeners, listeners, VIPs…

References for NIC bonding on RHEL5