NFS

Oracle 12c RAC on Oracle Linux 6 and 7 using NFS

Following on from the last post, I’ve brought my NFS RAC stuff up to date as well.

I noticed I had not done a RAC install using NFS on Oracle Linux 6, so I threw that in for good measure too. :)

How to get insights into the Linux Kernel

This is probably as much a note-to-self as anything else. Recently I have enjoyed some more in-depth research into how the Linux kernel works, and to that end I started fairly low-level. Theoretically speaking, you need to understand the hardware-software interface before you can understand the upper levels, although in practice you can get by with less knowledge. If you are truly interested in how computers work, you might want to consider reading up on some background. Some very knowledgeable people I deeply respect have recommended books by David A. Patterson and John L. Hennessy. I have these two:

TCP Trace Analysis for NFS

How do we know where latency comes from when there is a disparity between the I/O latency reported by the I/O subsystem and the latency reported on the client box requesting the I/O?

For example, if I have an Oracle database requesting I/O and Oracle says an 8 KB request takes 50 ms, yet the storage subsystem says 8 KB I/Os are taking 1 ms (on average), then where do the 49 extra ms come from?

When the I/O subsystem is connected to Oracle via NFS, there are a lot of layers that could be causing the extra latency.

Where does the difference in latency come from between NFS Server and Oracle’s timing of pread?
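
A good starting point for this kind of analysis is to capture the NFS traffic on the database host and compare the wire-level response times with what Oracle and the array report. A minimal capture sketch (the interface name is an assumption; port 2049 is the standard NFS port):

tcpdump -i eth0 -s 0 -w /tmp/nfs_client.pcap port 2049

The resulting pcap can then be analyzed for request/response round trips at the TCP layer and compared against the latency reported by Oracle's wait events and by the storage array.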


Build your own stretch cluster part V

This post is about the installation of Grid Infrastructure, and this is where it gets really exciting: the third NFS voting disk is going to be presented, and I am going to show you how simple it is to add it to the disk group chosen for OCR and voting disks.

Let’s start with the installation of Grid Infrastructure. This is really simple, and I won’t go into too much detail. Start by downloading the required file from MOS: a simple search for patch 10098816 should bring you to the download page for 11.2.0.2 for Linux; just make sure you select the 64-bit version. The file we need just now is called p10098816_112020_Linux-x86-64_3of7.zip. The file names don’t necessarily relate to their contents; the readme helps you find out which piece of the puzzle is used for what functionality.

I alluded to my software distribution method in one of the earlier posts; here is the detail. My dom0 exports the /m directory to the 192.168.99.0/24 network, the one accessible to all my domUs. This really simplifies software deployments.
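
For reference, the setup amounts to something like this (a sketch only; the dom0 address, the export options and the target path are assumptions that need adjusting to your environment):

# /etc/exports on the dom0
/m  192.168.99.0/24(ro)

# on each domU, as root
mount -t nfs 192.168.99.1:/m /m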

So starting off, the file has been unzipped:

openSUSE-112-64-minimal:/m/download/db11.2/11.2.0.2 # unzip -q p10098816_112020_Linux-x86-64_3of7.zip

This creates the subdirectory “grid”. Switch back to edcnode1 and log in as oracle. As I already explained, I won’t use different accounts for Grid Infrastructure and the RDBMS in this example.

If you have not already done so, mount the /m directory on the domU (which requires root privileges). Move to the newly unzipped “grid” directory under your mount point and begin to set up user equivalence. On edcnode1 and edcnode2, create RSA and DSA keys for SSH:

[oracle@edcnode1 ~]$ ssh-keygen -t rsa

Any questions can be answered with the return key; it is important to leave the passphrase empty. Repeat the call to ssh-keygen with the argument “-t dsa”. Navigate to ~/.ssh and create the authorized_keys file as follows:

[oracle@edcnode1 .ssh]$ cat *.pub >> authorized_keys

Then copy the authorized_keys file to edcnode2 and add the public keys:

[oracle@edcnode1 .ssh]$ scp authorized_keys oracle@edcnode2:`pwd`
[oracle@edcnode1 .ssh]$ ssh oracle@edcnode2

If you are prompted, add the host to the ~/.ssh/known_hosts file by typing in “yes”.

[oracle@edcnode2 .ssh]$ cat *.pub >> authorized_keys

Change the permissions on the authorized_keys file to 0400 on both hosts, otherwise it won’t be considered when you try to log in. With all of this done, you can add all the as-yet unknown hosts to each node’s known_hosts file. The easiest way is a for loop:

[oracle@edcnode1 ~]$ for i in edcnode1 edcnode2 edcnode1-priv edcnode2-priv; do ssh $i hostname; done

Run this twice on each node, acknowledging the prompt asking whether the new address should be added. Important: ensure that there is no banner (/etc/motd, .profile, .bash_profile, etc.) writing to stdout or stderr, or you are going to see strange error messages about user equivalence not being set up correctly.
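
A quick way to test for a chatty profile or banner (a sketch): run a remote command that produces no output of its own and confirm that nothing comes back:

[oracle@edcnode1 ~]$ ssh edcnode2 /bin/true

If this prints anything at all, something in the login environment is writing to the terminal and needs to be silenced, or guarded with a check for interactive sessions.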

I hear you say: but 11.2 can create user equivalence in OUI now. That is of course correct, but I wanted to run cluvfy at this point, which requires a working setup.

Cluster Verification

It is good practice to run a check to see if the prerequisites for the Grid Infrastructure installation are met, and keep the output. Change to the NFS mount where the grid directory is exported, and execute runcluvfy.sh as in this example:

[oracle@edcnode1 grid]$ ./runcluvfy.sh stage -pre crsinst -n edcnode1,edcnode2 -verbose -fixup 2>&1 | tee /tmp/preCRS.tx

The nice thing is that you can run the fixup script now to fix kernel parameter settings:

[root@edcnode2 ~]# /tmp/CVU_11.2.0.2.0_oracle/runfixup.sh
/usr/bin/id
Response file being used is :/tmp/CVU_11.2.0.2.0_oracle/fixup.response
Enable file being used is :/tmp/CVU_11.2.0.2.0_oracle/fixup.enable
Log file location: /tmp/CVU_11.2.0.2.0_oracle/orarun.log
Setting Kernel Parameters...
fs.file-max = 327679
fs.file-max = 6815744
net.ipv4.ip_local_port_range = 9000 65500
net.core.wmem_max = 262144
net.core.wmem_max = 1048576

Repeat this on the second node, edcnode2. Obviously you should fix any other problem cluvfy reports before proceeding.

In the previous post I created the /u01 mount point. Double-check that /u01 is actually mounted, otherwise you’d end up writing to your root_vg’s root_lv, which is not an ideal situation.
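
A quick check might look like this (a sketch; the device names will differ on your system):

[root@edcnode1 ~]# df -h /u01
[root@edcnode1 ~]# mount | grep /u01

If df reports the root file system for /u01, the logical volume is not mounted.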

You are now ready to start the installer: type in ./runInstaller to start the installation.
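
For completeness, launched from the NFS-mounted grid directory this might look as follows (the DISPLAY value is only an example; any working X display, or an ssh -X session, will do):

[oracle@edcnode1 grid]$ export DISPLAY=workstation:0.0
[oracle@edcnode1 grid]$ ./runInstaller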

Grid Installation

This is rather mundane, and instead of providing screenshots, I opted for a description of the steps to execute in the OUI session.

  • Screen 01: Skip software updates (I don’t have an Internet connection on my lab)
  • Screen 02: Install and configure Grid Infrastructure for a cluster
  • Screen 03: Advanced Installation
  • Screen 04: Keep defaults or add additional languages
  • Screen 05: Cluster Name: edc, SCAN name edc-scan, SCAN port: 1521, do not configure GNS
  • Screen 06: Ensure that both hosts are listed in this screen. Add/edit as appropriate. Hostnames are edcnode{1,2}.localdomain, VIPs are to be edcnode{1,2}-vip.localdomain. Enter the oracle user’s password and click on next
  • Screen 07: Assign eth0 to public, eth1 to private and eth2 to “do not use”.
  • Screen 08: Select ASM
  • Screen 09: disk group name: OCRVOTE with NORMAL redundancy. Tick the boxes for “ORCL:OCR01FILER01”, “ORCL:OCR01FILER02” and “ORCL:OCR02FILER01”
  • Screen 10: Choose suitable passwords for SYS and ASMSNMP
  • Screen 11: Don’t use IPMI
  • Screen 12: Assign DBA to OSDBA, OSOPER and OSASM. Again, in the real world you should think about role separation and assign different groups
  • Screen 13: ORACLE_BASE: /u01/app/oracle, Software location: /u01/app/11.2.0/grid
  • Screen 14: Oracle inventory: /u01/app/oraInventory
  • Screen 15: Ignore all; there should only be references to swap, cvuqdisk, ASM device checks and NTP. If you have additional warnings, fix them first!
  • Screen 16: Click on install!

The usual installation will now take place. At the end, run the root.sh script on edcnode1 and after it completes, on edcnode2. The output is included here for completeness:

[root@edcnode1 u01]# /u01/app/11.2.0/grid/root.sh 2>&1 | tee /tmp/root.sh.out
Running Oracle 11g root script...

The following environment variables are set as:
 ORACLE_OWNER= oracle
 ORACLE_HOME=  /u01/app/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
 Copying dbhome to /usr/local/bin ...
 Copying oraenv to /usr/local/bin ...
 Copying coraenv to /usr/local/bin ...

Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
 root wallet
 root wallet cert
 root cert export
 peer wallet
 profile reader wallet
 pa wallet
 peer wallet keys
 pa wallet keys
 peer cert request
 pa cert request
 peer cert
 pa cert
 peer root cert TP
 profile reader root cert TP
 pa root cert TP
 peer pa cert TP
 pa peer cert TP
 profile reader pa cert TP
 profile reader peer cert TP
 peer user cert
 pa user cert
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-2672: Attempting to start 'ora.mdnsd' on 'edcnode1'
CRS-2676: Start of 'ora.mdnsd' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'edcnode1'
CRS-2676: Start of 'ora.gpnpd' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'edcnode1'
CRS-2672: Attempting to start 'ora.gipcd' on 'edcnode1'
CRS-2676: Start of 'ora.gipcd' on 'edcnode1' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'edcnode1'
CRS-2672: Attempting to start 'ora.diskmon' on 'edcnode1'
CRS-2676: Start of 'ora.diskmon' on 'edcnode1' succeeded
CRS-2676: Start of 'ora.cssd' on 'edcnode1' succeeded

ASM created and started successfully.

Disk Group OCRVOTE created successfully.

clscfg: -install mode specified
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4256: Updating the profile
Successful addition of voting disk 38f2caf7530c4f67bfe23bb170ed2bfe.
Successful addition of voting disk 9aee80ad14044f22bf6211b81fe6363e.
Successful addition of voting disk 29fde7c3919b4fd6bf626caf4777edaa.
Successfully replaced voting disk group with +OCRVOTE.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).
CRS-2672: Attempting to start 'ora.asm' on 'edcnode1'
CRS-2676: Start of 'ora.asm' on 'edcnode1' succeeded
CRS-2672: Attempting to start 'ora.OCRVOTE.dg' on 'edcnode1'
CRS-2676: Start of 'ora.OCRVOTE.dg' on 'edcnode1' succeeded
ACFS-9200: Supported
ACFS-9200: Supported
CRS-2672: Attempting to start 'ora.registry.acfs' on 'edcnode1'
CRS-2676: Start of 'ora.registry.acfs' on 'edcnode1' succeeded
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded

[root@edcnode2 ~]# /u01/app/11.2.0/grid/root.sh 2>&1 | tee /tmp/rootsh.out
Running Oracle 11g root script...

The following environment variables are set as:
 ORACLE_OWNER= oracle
 ORACLE_HOME=  /u01/app/11.2.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
 Copying dbhome to /usr/local/bin ...
 Copying oraenv to /usr/local/bin ...
 Copying coraenv to /usr/local/bin ...

Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/11.2.0/grid/crs/install/crsconfig_params
Creating trace directory
LOCAL ADD MODE
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node edcnode1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
[root@edcnode2 ~]#

Congratulations! You have a working setup! Check if everything is ok:

[root@edcnode2 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.OCRVOTE.dg
 ONLINE  ONLINE       edcnode1
 ONLINE  ONLINE       edcnode2
ora.asm
 ONLINE  ONLINE       edcnode1                 Started
 ONLINE  ONLINE       edcnode2
ora.gsd
 OFFLINE OFFLINE      edcnode1
 OFFLINE OFFLINE      edcnode2
ora.net1.network
 ONLINE  ONLINE       edcnode1
 ONLINE  ONLINE       edcnode2
ora.ons
 ONLINE  ONLINE       edcnode1
 ONLINE  ONLINE       edcnode2
ora.registry.acfs
 ONLINE  ONLINE       edcnode1
 ONLINE  ONLINE       edcnode2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
 1        ONLINE  ONLINE       edcnode2
ora.LISTENER_SCAN2.lsnr
 1        ONLINE  ONLINE       edcnode1
ora.LISTENER_SCAN3.lsnr
 1        ONLINE  ONLINE       edcnode1
ora.cvu
 1        ONLINE  ONLINE       edcnode1
ora.edcnode1.vip
 1        ONLINE  ONLINE       edcnode1
ora.edcnode2.vip
 1        ONLINE  ONLINE       edcnode2
ora.oc4j
 1        ONLINE  ONLINE       edcnode1
ora.scan1.vip
 1        ONLINE  ONLINE       edcnode2
ora.scan2.vip
 1        ONLINE  ONLINE       edcnode1
ora.scan3.vip
 1        ONLINE  ONLINE       edcnode1
[root@edcnode2 ~]#

[root@edcnode1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).

Adding the NFS voting disk

It’s about time to deal with this subject. If you have not done so already, start the domU “filer03”. Log in as openfiler and ensure that the NFS server is started; on the Services tab, click on Enable next to the NFS server if needed. Next navigate to the Shares tab, where you should find the volume group and logical volume created earlier. The volume group I created is called “ocrvotenfs_vg”, and it has one logical volume, “nfsvol_lv”. Click on the name of the LV to create a new share. I named the new share “ocrvote” – enter this in the popup window and click on “create sub folder”.

The new share should now appear underneath nfsvol_lv. Proceed by clicking on “ocrvote” to set the share’s properties. Before you get to enter these, click on “make share”. Scroll down to the host access configuration section in the following screen. In this section you can configure all sorts of technologies: SMB, NFS, WebDAV, FTP and RSYNC. For this example, everything but NFS should be set to “NO”.

For NFS, the story is different: ensure you set the radio button to “RW” for both hosts, then click on Edit for each machine. This is important! The anonymous UID and GID must match the grid owner’s uid and gid. In my scenario I entered “500” for both; you can check your settings by running the id command as oracle, which prints the UID and GID plus other information.
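
For reference, checking this might look as follows (the group names shown are just an illustration; the numeric values are what has to match the entries on the filer):

[oracle@edcnode1 ~]$ id
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba)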

The UID/GID mapping then has to be set to all_squash, IO mode to sync, and write delay to wdelay. Leave the default for “requesting origin port”, which was set to “secure < 1024” in my configuration.
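
In traditional /etc/exports terms, the resulting export is roughly equivalent to an entry like the following (a sketch only; Openfiler maintains the exports for you, and the exact path and client specification depend on your configuration):

/mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote  192.168.101.0/24(rw,sync,wdelay,secure,all_squash,anonuid=500,anongid=500)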

I decided to create /ocrvote on both nodes to mount the NFS export:

[root@edcnode2 ~]# mkdir /ocrvote

Edit the /etc/fstab file to make the mount persistent across reboots. I added this line to the file on both nodes:

192.168.101.52:/mnt/ocrvotenfs_vg/nfsvol_lv/ocrvote /ocrvote nfs rw,bg,hard,intr,rsize=32768,wsize=32768,tcp,noac,nfsvers=3,timeo=600,addr=192.168.101.51

The “addr” option instructs Linux to use the storage network to mount the share. Now you are ready to mount the export on all nodes, using the “mount /ocrvote” command.
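
A quick sanity check on each node (a sketch) is to mount the export and confirm the NFS options were picked up:

[root@edcnode1 ~]# mount /ocrvote
[root@edcnode1 ~]# mount | grep ocrvote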

I changed the ownership of the exported directory on the filer to the uid/gid combination of the oracle account (or, on an installation with a separate grid software owner, to its uid/gid combination):

[root@filer03 ~]# cd /mnt/ocrvotenfs_vg/nfsvol_lv/
[root@filer03 nfsvol_lv]# ls -l
total 44
-rw-------  1 root    root     6144 Sep 24 15:38 aquota.group
-rw-------  1 root    root     6144 Sep 24 15:38 aquota.user
drwxrwxrwx  2 root    root     4096 Sep 24 15:26 homes
drwx------  2 root    root    16384 Sep 24 15:26 lost+found
drwxrwsrwx  2 ofguest ofguest  4096 Sep 24 15:31 ocrvote
-rw-r--r--  1 root    root      974 Sep 24 15:45 ocrvote.info.xml
[root@filer03 nfsvol_lv]# chown 500:500 ocrvote
[root@filer03 nfsvol_lv]# ls -l
total 44
-rw-------  1 root root  7168 Sep 24 16:09 aquota.group
-rw-------  1 root root  7168 Sep 24 16:09 aquota.user
drwxrwxrwx  2 root root  4096 Sep 24 15:26 homes
drwx------  2 root root 16384 Sep 24 15:26 lost+found
drwxrwsrwx  2  500  500  4096 Sep 24 15:31 ocrvote
-rw-r--r--  1 root root   974 Sep 24 15:45 ocrvote.info.xml
[root@filer03 nfsvol_lv]#

ASM requires zero-padded files as “disks”, so create one:

[root@filer03 nfsvol_lv]# dd if=/dev/zero of=ocrvote/nfsvotedisk01 bs=1G count=2
[root@filer03 nfsvol_lv]# chown 500:500 ocrvote/nfsvotedisk01

Add the third voting disk

Almost there! Before performing any change to the cluster configuration, it is always a good idea to take a backup.

[root@edcnode1 ~]# ocrconfig -manualbackup

edcnode1     2010/09/24 17:11:51     /u01/app/11.2.0/grid/cdata/edc/backup_20100924_171151.ocr

You only need to do this on one node. Recall that the current state is:

[oracle@edcnode1 ~]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   29fde7c3919b4fd6bf626caf4777edaa (ORCL:OCR02FILER01) [OCRVOTE]
Located 3 voting disk(s).

ASM sees it the same way:

SQL> select mount_status,header_status, name,failgroup,library
 2  from v$asm_disk
 3  /

MOUNT_S HEADER_STATU NAME                           FAILGROUP       LIBRARY
------- ------------ ------------------------------ --------------- ------------------------------------------------------------
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CLOSED  PROVISIONED                                                 ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED  MEMBER       OCR01FILER01                   OCR01FILER01    ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED  MEMBER       OCR01FILER02                   OCR01FILER02    ASM Library - Generic Linux, version 2.0.4 (KABI_V2)
CACHED  MEMBER       OCR02FILER01                   OCR02FILER01    ASM Library - Generic Linux, version 2.0.4 (KABI_V2)

7 rows selected.

Now here’s the idea: you add the NFS location to the ASM diskstring in addition to “ORCL:*” and all is well. But that didn’t work:

SQL> show parameter disk  

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string
asm_diskstring                       string      ORCL:*
SQL> 

SQL> alter system set asm_diskstring = 'ORCL:*, /ocrvote/nfsvotedisk01' scope=memory sid='*';
alter system set asm_diskstring = 'ORCL:*, /ocrvote/nfsvotedisk01' scope=memory sid='*'
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-15014: path 'ORCL:OCR01FILER01' is not in the discovery set

Regardless of what I tried, the system complained. Grudgingly I used the GUI – asmca.

After starting asmca, click on Disk Groups, then select disk group “OCRVOTE” and right-click to “add disks”. The trick is to click on “change discovery path”. Enter “ORCL:*, /ocrvote/nfsvotedisk01” (without the quotes) into the dialog field and close it. Strangely, the NFS disk now appears. Tick two boxes: the one in front of the disk path, and the quorum box. A click on the OK button starts the magic, and you should be presented with a success message. The ASM instance reports a little more:

ALTER SYSTEM SET asm_diskstring='ORCL:*','/ocrvote/nfsvotedisk01' SCOPE=BOTH SID='*';
2010-09-29 10:54:52.557000 +01:00
SQL> ALTER DISKGROUP OCRVOTE ADD  QUORUM DISK '/ocrvote/nfsvotedisk01' SIZE 500M /* ASMCA */
NOTE: Assigning number (1,3) to disk (/ocrvote/nfsvotedisk01)
NOTE: requesting all-instance membership refresh for group=1
2010-09-29 10:54:54.445000 +01:00
NOTE: initializing header on grp 1 disk OCRVOTE_0003
NOTE: requesting all-instance disk validation for group=1
NOTE: skipping rediscovery for group 1/0xd032bc02 (OCRVOTE) on local instance.
2010-09-29 10:54:57.154000 +01:00
NOTE: requesting all-instance disk validation for group=1
NOTE: skipping rediscovery for group 1/0xd032bc02 (OCRVOTE) on local instance.
2010-09-29 10:55:00.718000 +01:00
GMON updating for reconfiguration, group 1 at 5 for pid 27, osid 15253
NOTE: group 1 PST updated.
NOTE: initiating PST update: grp = 1
GMON updating group 1 at 6 for pid 27, osid 15253
2010-09-29 10:55:02.896000 +01:00
NOTE: PST update grp = 1 completed successfully
NOTE: membership refresh pending for group 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:05.285000 +01:00
GMON querying group 1 at 7 for pid 18, osid 4247
NOTE: cache opening disk 3 of grp 1: OCRVOTE_0003 path:/ocrvote/nfsvotedisk01
GMON querying group 1 at 8 for pid 18, osid 4247
SUCCESS: refreshed membership for 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:06.528000 +01:00
SUCCESS: ALTER DISKGROUP OCRVOTE ADD  QUORUM DISK '/ocrvote/nfsvotedisk01' SIZE 500M /* ASMCA */
2010-09-29 10:55:08.656000 +01:00
NOTE: Attempting voting file refresh on diskgroup OCRVOTE
NOTE: Voting file relocation is required in diskgroup OCRVOTE
NOTE: Attempting voting file relocation on diskgroup OCRVOTE
NOTE: voting file allocation on grp 1 disk OCRVOTE_0003
2010-09-29 10:55:10.047000 +01:00
NOTE: voting file deletion on grp 1 disk OCR02FILER01
NOTE: starting rebalance of group 1/0xd032bc02 (OCRVOTE) at power 1
Starting background process ARB0
ARB0 started with pid=29, OS id=15446
NOTE: assigning ARB0 to group 1/0xd032bc02 (OCRVOTE) with 1 parallel I/O
2010-09-29 10:55:13.178000 +01:00
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
2010-09-29 10:55:15.533000 +01:00
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0xd032bc02 (OCRVOTE)
GMON updating for reconfiguration, group 1 at 9 for pid 31, osid 15451
NOTE: group 1 PST updated.
2010-09-29 10:55:17.907000 +01:00
NOTE: membership refresh pending for group 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:20.481000 +01:00
GMON querying group 1 at 10 for pid 18, osid 4247
SUCCESS: refreshed membership for 1/0xd032bc02 (OCRVOTE)
2010-09-29 10:55:23.490000 +01:00
NOTE: Attempting voting file refresh on diskgroup OCRVOTE
NOTE: Voting file relocation is required in diskgroup OCRVOTE
NOTE: Attempting voting file relocation on diskgroup OCRVOTE

Superb! But did it kick out the correct disk? Yes it did: you now see OCR01FILER01 and OCR01FILER02 plus the NFS disk:

[oracle@edcnode1 ~]$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38f2caf7530c4f67bfe23bb170ed2bfe (ORCL:OCR01FILER01) [OCRVOTE]
 2. ONLINE   9aee80ad14044f22bf6211b81fe6363e (ORCL:OCR01FILER02) [OCRVOTE]
 3. ONLINE   6107050ad9ba4fd1bfebdf3a029c48be (/ocrvote/nfsvotedisk01) [OCRVOTE]
Located 3 voting disk(s).

Preferred Mirror Read

One of the cool new 11.1 features allows administrators of stretch RAC systems to instruct ASM instances to read mirrored extents rather than primary extents. This can speed up data access in cases where data would otherwise have been sent from the remote array. Setting this parameter is crucial to many implementations. In preparation for the RDBMS installation (to be detailed in the next post), I created a disk group consisting of four ASM disks, two from each filer. The syntax for the disk group creation is as follows:

SQL> create diskgroup data normal redundancy
  2  failgroup sitea disk 'ORCL:ASM01FILER01','ORCL:ASM01FILER02'
  3* failgroup siteb disk 'ORCL:ASM02FILER01','ORCL:ASM02FILER02'
SQL> /

Diskgroup created.

As you can see, all disks for sitea come from filer01 and form one failure group. The other disks, originating from filer02, form the second failure group.

You can see the result in v$asm_disk, as this example shows:

SQL> select name,failgroup from v$asm_disk;

NAME                           FAILGROUP
------------------------------ ------------------------------
ASM01FILER01                   SITEA
ASM01FILER02                   SITEA
ASM02FILER01                   SITEB
ASM02FILER02                   SITEB
OCR01FILER01                   OCR01FILER01
OCR01FILER02                   OCR01FILER02
OCR02FILER01                   OCR02FILER01
OCRVOTE_0003                   OCRVOTE_0003

8 rows selected.

Now all that remains to be done is to instruct the ASM instances to read from the local storage if possible. This is performed by setting an instance-specific init.ora parameter. I used the following syntax:

SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEB' scope=both sid='+ASM2';

System altered.

SQL> alter system set asm_preferred_read_failure_groups='DATA.SITEA' scope=both sid='+ASM1';

System altered.
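
To verify the setting took effect, a quick look at v$asm_disk from each ASM instance should show the PREFERRED_READ flag set only for the disks in the local failure group. A sketch (assuming the environment points at the local +ASM instance):

[oracle@edcnode1 ~]$ sqlplus -s / as sysasm <<'EOF'
select name, failgroup, preferred_read from v$asm_disk where name like 'ASM%';
EOF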

So I’m all set for the next step, the installation of the RDBMS software. But that’s for another post…

Kernel NFS fights back… Oracle throughput matches Direct NFS with latest Solaris improvements

After my recent series of postings, I was made aware of David Lutz’s blog on NFS client performance with Solaris.  It turns out that you can vastly improve the performance of NFS clients using a new parameter to adjust the number of client connections.

root@saemrmb9> grep rpcmod /etc/system
set rpcmod:clnt_max_conns=8

This parameter was introduced in a patch for various flavors of Solaris. For details on the various flavors, see David Lutz’s recent blog entry on improving NFS client performance. Soon, it should be the default in Solaris, making out-of-the-box client performance scream.

DSS query throughput with Kernel NFS

I re-ran the DSS query referenced in my last entry and now kNFS matches the throughput of dNFS with 10gigE.


Kernel NFS throughput with Solaris 10 Update 8 (set rpcmod:clnt_max_conns=8)

This is great news for customers not yet on Oracle 11g.  With this latest fix to Solaris, you can match the throughput of Direct NFS on older versions of Oracle.  In a future post, I will explore the CPU impact of dNFS and kNFS with OLTP style transactions.


Direct NFS vs Kernel NFS bake-off with Oracle 11g and Solaris… and the winner is

NOTE: Please see my next entry on Kernel NFS performance and the improvements that come with the latest Solaris.

==============

After experimenting with dNFS it was time to do a comparison with the “old” way.  I was a little surprised by the results, but I guess that really explains why Oracle decided to embed the NFS client into the database :)

bake-off with OLTP style transactions

This experiment was designed to load up a machine, a T5240, with OLTP style transactions until no more CPU was available.  The dataset was big enough to push about 36,000 read IOPS and 1,500 write IOPS during peak throughput.  As you can see, dNFS performed well, which allowed the system to scale until the DB server CPU was fully utilized.  On the other hand, kernel NFS throttles after 32 users and is unable to use the available CPU to scale transactional throughput.

lower cpu overhead yields better throughput

A common measure for benchmarks is to figure out how many transactions per CPU are possible.  Below, I plotted the CPU consumed for a particular transaction rate.  This chart shows the total measured CPU (user+system) for a given TPS rate.


dNFS vs kNFS (TPS/CPU)

As expected, the transaction rate per CPU is greater when using dNFS vs kNFS.  Please do note that this is a T5240 machine with 128 threads, or virtual CPUs.  I don’t want to go into the semantics of sockets, cores, pipelines, and threads, but thought it was at least worth noting.  Oracle sees a thread of a T5240 as a CPU, so that is what I used for this comparison.

silly little torture test

When doing the OLTP style tests with a normal sized SGA, I was not able to fully utilize the 10gigE interface or the Sun 7410 storage.  So, I decided to do a silly little micro benchmark with a really small SGA.  This benchmark just does simple read-only queries that essentially result in a bunch of random 8k IO.  I have included the output from the Fishworks analytics below for both kNFS and dNFS.


Random IOPS with kNFS and Sun Open Storage


Random IOPS with dNFS and Sun 7410 open storage

I was able to hit ~90K IOPS with 729MB/sec of throughput with just one 10gigE interface connected to Sun 7410 unified storage.  This is an excellent result with Oracle 11gR2 and dNFS for a random IO test… but there is still more bandwidth available.  So, I decided to do a quick DSS style query to see if I could break the 1GB/sec barrier.

===dNFS===
SQL> select /*+ parallel(item,32) full(item) */ count(*) from item;
 COUNT(*)
----------
 40025111
Elapsed: 00:00:06.36

===kNFS===
SQL> select /*+ parallel(item,32) full(item) */ count(*) from item;
 COUNT(*)
----------
 40025111

Elapsed: 00:00:16.18

kNFS table scan


dNFS table scan

Excellent: with a simple scan I was able to do 1.14GB/sec with dNFS, more than doubling the throughput of kNFS.

configuration notes and basic tuning

I was running on a T5240 with Solaris 10 Update 8.

$ cat /etc/release
Solaris 10 10/09 s10s_u8wos_08a SPARC
Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009

This machine has a built-in 10gigE interface which uses multiple threads to increase throughput.  Out of the box, there is very little to tune as long as you are on Solaris 10 Update 8.  I experimented with various settings, but found that only basic TCP settings were required.

ndd -set /dev/tcp tcp_recv_hiwat 400000
ndd -set /dev/tcp tcp_xmit_hiwat 400000
ndd -set /dev/tcp tcp_max_buf 2097152
ndd -set /dev/tcp tcp_cwnd_max 2097152
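
Note that ndd settings do not survive a reboot, so they normally go into a startup script. To confirm a value took effect, you can read it back, for example:

ndd -get /dev/tcp tcp_recv_hiwat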

Finally, on the storage front, I was using the Sun Storage 7410 Unified Storage server as the NFS server for this test.  This server was born out of the Fishworks project and is an excellent platform for deploying NFS-based databases… watch out, NetApp.

what does it all mean?

dNFS wins hands down.  Standard kernel NFS essentially allows only one client connection per mount point, so eventually we see data queued up behind a mount point, which clips the throughput far too soon.  Direct NFS solves this problem by having each Oracle shadow process mount the device directly.  Also, with dNFS, all the usual tuning and mount options are not necessary: Oracle knows which options are most efficient for transferring blocks of data and configures the connection properly.

When I began down this path of discovery, I was only using NFS attached storage because nothing else was available in our lab… and IO was not initially a huge part of the project at hand.  Being a performance guy who benchmarks systems to squeeze out the last percentage point of performance, I was skeptical about NAS devices.  Traditionally, NAS was limited by slow networks and clumsy SW stacks.   But times change.   Fast 10gigE networks and Fishworks storage combined with clever SW like Direct NFS really showed this old dog a new trick.


Monitoring Direct NFS with Oracle 11g and Solaris… peeling back the layers of the onion.

When I start a new project, I like to check performance from as many layers as possible.  This helps to verify things are working as expected and helps me to understand how the pieces fit together.  During my recent work with dNFS and Oracle 11gR2, I started down the path of monitoring performance and was surprised to see that things are not always as they seem.  This post will explore the various ways to monitor and verify performance when using dNFS with Oracle 11gR2 and Sun Open Storage (“Fishworks”).

why is iostat lying to me?

iostat(1M) is one of the most common tools to monitor IO.  Normally, I can see activity on local devices as well as NFS mounts via iostat.  But with dNFS, my device seems idle in the middle of a performance run.

bash-3.0$ iostat -xcn 5
cpu
us sy wt id
8  5  0 87
extended device statistics
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0    6.2    0.0   45.2  0.0  0.0    0.0    0.4   0   0 c1t0d0
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 toromondo.west:/export/glennf
cpu
us sy wt id
7  5  0 89
extended device statistics
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0   57.9    0.0  435.8  0.0  0.0    0.0    0.5   0   3 c1t0d0
0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 toromondo.west:/export/glennf

From the DB server perspective, I can’t see the IO.  I wonder what the array looks like.

what does fishworks analytics have to say about IO?

The analytics package available with fishworks is the best way to verify performance with Sun Open Storage.  This package is easy to use and indeed I was quickly able to verify activity on the array.

There are 48,987 NFSv3 operations/sec and ~403MB/sec going through the nge13 interface; this array is cooking pretty good.  So, let’s take a peek at the network on the DB host.

nicstat to the rescue

nicstat is a wonderful tool developed by Brendan Gregg at Sun to show network performance.  It shows you the critical data for monitoring network speeds and feeds by displaying packet size, utilization, and rates for the various interfaces.

root@saemrmb9> nicstat 5
Time          Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
15:32:11    nxge0    0.11    1.51    1.60    9.00   68.25   171.7  0.00   0.00
15:32:11    nxge1  392926 13382.1 95214.4 95161.8  4225.8   144.0  33.3   0.00

So, from the DB server point of view, we are transferring about 390MB/sec… which correlates to what we saw with the analytics from Fishworks.  Cool!

why not use DTrace?

Ok, I wouldn’t be a good Sun employee if I didn’t use DTrace once in a while.  I was curious to see the Oracle calls for dNFS so I broke out my favorite tool from the DTrace Toolkit. The “hotuser” tool shows which functions are being called the most.  For my purposes, I found an active Oracle shadow process and searched for NFS related functions.

root@saemrmb9> hotuser -p 681 |grep nfs
^C
oracle`kgnfs_getmsg                                         1   0.2%
oracle`kgnfs_complete_read                                  1   0.2%
oracle`kgnfswat                                             1   0.2%
oracle`kgnfs_getpmsg                                        1   0.2%
oracle`kgnfs_getaprocdata                                   1   0.2%
oracle`kgnfs_processmsg                                     1   0.2%
oracle`kgnfs_find_channel                                   1   0.2%
libnfsodm11.so`odm_io                                       1   0.2%
oracle`kgnfsfreemem                                         2   0.4%
oracle`kgnfs_flushmsg                                       2   0.4%
oracle`kgnfsallocmem                                        2   0.4%
oracle`skgnfs_recvmsg                                       3   0.5%
oracle`kgnfs_serializesendmsg                               3   0.5%

So, yes it seems Direct NFS is really being used by Oracle 11g.

performance geeks love V$ tables

There is a set of V$ tables that allows you to sample the performance of dNFS as seen by Oracle.  I like V$ tables because I can write SQL scripts until I run out of Mt. Dew.  The following views are available to monitor activity with dNFS.

  • v$dnfs_servers: Shows a table of servers accessed using Direct NFS.
  • v$dnfs_files: Shows a table of files now open with Direct NFS.
  • v$dnfs_channels: Shows a table of open network paths (or channels) to servers for which Direct NFS is providing files.
  • v$dnfs_stats: Shows a table of performance statistics for Direct NFS.

With some simple scripting, I was able to monitor the NFS IOPS by sampling the v$dnfs_stats view.  The script samples the nfs_read and nfs_write operation counts, pauses for 5 seconds, then samples again to determine the rate.
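
A minimal sketch of such a sampler (assuming sqlplus is on the PATH and the environment points at the local database instance; column names are taken from v$dnfs_stats):

#!/bin/bash
# Sample the cumulative dNFS read+write counts, wait 5 seconds,
# sample again and print the per-second rate.
ops () {
  sqlplus -s / as sysdba <<'EOF'
set heading off feedback off pages 0
select sum(nfs_read + nfs_write) from v$dnfs_stats;
EOF
}
echo "timestmp|nfsiops"
while true
do
  t1=$(ops); sleep 5; t2=$(ops)
  echo "$(date +%H:%M:%S)|$(( (t2 - t1) / 5 ))"
done

Sampled during the benchmark run, the output looked like this: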

timestmp|nfsiops
15:30:31|48162
15:30:36|48752
15:30:41|48313
15:30:46|48517.4
15:30:51|48478
15:30:56|48509
15:31:01|48123
15:31:06|48118.8

Excellent!  Oracle shows 48,000 NFS IOPS which agrees with the analytics from Fishworks.

what about the AWR?

Consulting the AWR shows “Physical reads” in agreement as well.

Load Profile              Per Second    Per Transaction   Per Exec   Per Call
~~~~~~~~~~~~         ---------------    --------------- ---------- ----------
      DB Time(s):               93.1            1,009.2       0.00       0.00
       DB CPU(s):               54.2              587.8       0.00       0.00
       Redo size:            4,340.3           47,036.8
   Logical reads:          385,809.7        4,181,152.4
   Block changes:                9.1               99.0
  Physical reads:           47,391.1          513,594.2
 Physical writes:                5.7               61.7
      User calls:           63,251.0          685,472.3
          Parses:                5.3               57.4
     Hard parses:                0.0                0.1
W/A MB processed:                0.1                1.1
          Logons:                0.1                0.7
        Executes:           45,637.8          494,593.0
       Rollbacks:                0.0                0.0
    Transactions:                0.1

so, why is iostat lying to me?

iostat(1M) monitors IO to devices and NFS mount points.  But with Oracle Direct NFS, the mount point is bypassed and each shadow process simply mounts the files directly.  To monitor dNFS traffic you have to use the other methods described here.  Hopefully, this post was instructive on how to peel back the layers in order to gain visibility into dNFS performance with Oracle and Sun Open Storage.
