An introduction to collectl

Some of you may have seen on Twitter that I was working on understanding collectl. So why did I start with this? First of all, I was after a tool that records a lot of information on a Linux box. It can also play information back, but that is out of scope for this introduction.

In the past I have used nmon to do similar things, and still love it for what it does. Especially in conjunction with the nmon-analyzer, an Excel plug-in, it can create very impressive reports. How does collectl compare?

Getting collectl

Getting collectl is quite easy: get it from SourceForge: http://sourceforge.net/projects/collectl/

The project website, including very good documentation, is available from SourceForge as well, but uses a slightly different URL: http://collectl.sourceforge.net/

I suggest you get the architecture-independent RPM and install it on your system. This is all you need to get started! The impatient could type “collectl” at the command prompt now to get some information. Let’s have a look at the output:

$ collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
1   0  1163  10496    113     14     18      4      8     55      5      19
0   0  1046  10544      0      0      2      3    164    195     30      60
0   0  1279  10603    144      9    746    148     20     67     11      19
3   0  1168  10615    144      9    414     69     14     69      5      20
1   0  1121  10416    362     28    225     19     11     71      8      35
Ouch!

The “ouch” has been caused by my CTRL-c to stop the execution.

Collectl is organised to work by subsystems; the default is to print the CPU, disk and network subsystems, aggregated.

If you don’t know what information you are after, you could use the --all flag to display aggregated information across all subsystems. Be warned that you need a large screen for all that output! For even more output, add the --verbose flag to the --all option, and you will need at least a 22” screen. The verbose flag prints more output, as the name suggests. For the disk subsystem you can view the difference:

$ collectl -sd -i 5 --verbose
waiting for 5 second sample...

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
162     136     10     15      187      30     19      9
109      24      9     11      566     118     23     24
Ouch!
$ collectl -sd -i 5
waiting for 5 second sample...
#<----------Disks----------->
#KBRead  Reads KBWrit Writes
9865     73    190     23
Ouch!

Each subsystem can be queried individually; the default monitoring interval is 1 second. The man page for collectl lists the following subsystems:

SUMMARY SUBSYSTEMS

b - buddy info (memory fragmentation)
c - CPU
d - Disk
f - NFS V3 Data
i - Inode and File System
j - Interrupts
l - Lustre
m - Memory
n - Networks
s - Sockets
t - TCP
x - Interconnect
y - Slabs (system object caches)

As the name suggests, these subsystems provide summary information. Summaries are fine for a first overview, but don’t forget that the information is aggregated and detail is lost.

From an Oracle point of view I’d probably be most interested in the CPU, disk and memory usage. If you are using RAC, network usage can also be interesting.
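As a small sketch based on those subsystem letters (the five-second interval is an arbitrary choice), monitoring the CPU, disk, memory and network summaries together could look like this:

$ collectl -scdmn -i 5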

Detailed subsystem information is available for these (again taken from the excellent manual page):

C - CPU
D - Disk
E - Environmental data (fan, power, temp),  via ipmitool
F - NFS Data
J - Interrupts
L - Lustre OST detail OR client Filesystem detail
N - Networks
T - 65 TCP counters only available in plot format
X - Interconnect
Y - Slabs (system object caches)
Z - Processes

You can combine subsystems, and you can combine detail and summary information. Bear in mind though that this becomes a lot of information for a PuTTY session or gnome-terminal!

In interactive mode, you might want to consider the --home flag, which does a top-like refresh and prints real-time information without scrolling: very neat!
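For instance, a sketch combining the CPU summary with disk detail and the top-like refresh might look like this; the subsystem letters are taken from the lists above:

$ collectl -scD --home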

But even with the --home option, digesting all that information visually can be a bit daunting, which leads me to my next section.

Generating graphical output

While all the textual information is nice and good, it is difficult to visualise. Collectl can help you with that as well. All you need to do is generate a file in tab format, which is as simple as adding the -P and -f options. Since you can’t be overwhelmed by the information gathered in a file (unlike on standard output), you could use the detail switches. If you have the luxury, create the file in a directory exported via Samba and analyse it with Excel or other utilities. It’s possible to use gnuplot as well, but I found that a bit lacking for interactive use. The collectl-utils provide a CGI script to analyse collectl files on the host, which can be convenient. Here is an example recording the disk, memory and network subsystems with a monitoring interval of 15 seconds. The file will be in “Plot” format (-P) and goes to /export/collectl/plotfiles:

$ collectl -sdmn -i 15 -P -f /export/collectl/plotfiles

Note that you can’t use the verbose flag here, and you also shouldn’t use a file name with the -f switch!

The resulting file is called hostname-yyyymmdd.tab. After renaming it to hostname-yyyymmdd.txt it can quite easily be imported using your favourite spreadsheet application. Imagine all the graphs you could produce with it! Also the header contains interesting information:

################################################################################
# Collectl:   V3.5.1-1  HiRes: 1  Options: -sdmn -i 15 -P -f /export/collectl/plotfiles
# Host:       node1 DaemonOpts:
# Distro:     Red Hat Enterprise Linux Server release 5.5 (Tikanga)  Platform:
# Date:       20110805-142647  Secs: 1312550807 TZ: +0100
# SubSys:     dmn Options: z Interval: 1 NumCPUs: 16  NumBud: 0 Flags: i
# Filters:    NfsFilt:  EnvFilt:
# HZ:         100  Arch: x86_64-linux-thread-multi PageSize: 4096
# Cpu:        AuthenticAMD Speed(MHz): 2210.190 Cores: 4  Siblings: 4
# Kernel:     2.6.18-194.el5  Memory: 65990460 kB  Swap: 16809976 kB
# NumDisks:   173 DiskNames: c0d0 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq sdar sdas sdat sdau sdav sdaw sdax sday sdaz sdba sdbb sdbc sdbd sdbe sdbf sdbg sdbh sdbi sdbj sdbk sdbl sdbm sdbn sdbo sdbp sdbq sdbr sdbs sdbt sdbu sdbv sdbw sdbx sdby sdbz sdca sdcb sdcc sdcd sdce sdcf sdcg sdch sdci sdcj sdck sdcl sdcm sdcn sdco sdcp sdcq sdcr sdcs sdct sdcu sdcv sdcw sdcx sdcy sdcz sdda sddb sddc sddd sdde sddf sddg dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6 dm-7 dm-8 dm-9 dm-10 dm-11 dm-12 dm-13 dm-14 dm-15 dm-16 dm-17 dm-18 dm-19 dm-20 dm-21 dm-22 dm-23 dm-24 dm-25 dm-26 dm-27 dm-28 dm-29 dm-30 dm-31 dm-32 dm-33 dm-34 dm-35 dm-36 dm-37 dm-38 dm-39 dm-40 dm-41 dm-42 dm-43 dm-44 dm-45 dm-46 dm-47 dm-48 dm-49 dm-50 dm-51 dm-52 dm-53 dm-54 dm-55 dm-56 dm-57 dm-58 dm-59 dm-60
# NumNets:    8 NetNames: lo: eth0: eth1: eth2: eth3: sit0: bond0: bond1:
# SCSI:       DA:0:00:00: ... DA:2:00:00:00
################################################################################

This should be enough to remind you of where you were running this test.
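Coming back to the rename mentioned earlier: using the host name and date from this sample header, it could be as simple as the following (directory taken from the earlier command):

$ cd /export/collectl/plotfiles
$ mv node1-20110805.tab node1-20110805.txt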

Run duration and interval

Use the -i flag to change the monitoring interval; this is the same as you’d do with SAR or iostat/vmstat and the like. You could then either use the -c option to count n samples, or alternatively use -R to run for n weeks, days, hours, minutes or seconds, each of which is abbreviated with its first letter. For example, to run for 15 minutes with samples taken every 15 seconds, you’d say collectl -i 15 -R 15m.
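For illustration, here is that run next to the equivalent expressed as a sample count (15 minutes at 15-second samples is 60 samples):

$ collectl -i 15 -R 15m
$ collectl -i 15 -c 60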

Quick and dirty

If you need an interactive overview of what’s going on, top-style, you can use the --top flag. This will print output very similar to the top command, but this time you have a lot more options to sort on. Use collectl --showtopopts to list them. This is so cool that I couldn’t help just listing the options here:

$ collectl --showtopopts
The following is a list of --top's sort types which apply to either
process or slab data.  In some cases you may be allowed to sort
by a field that is not part of the display if you so desire

TOP PROCESS SORT FIELDS

Memory
vsz    virtual memory
rss    resident (physical) memory

Time
syst   system time
usrt   user time
time   total time

I/O
rkb    KB read
wkb    KB written
iokb   total I/O KB

rkbc   KB read from pagecache
wkbc   KB written to pagecache
iokbc  total pagecacge I/O
ioall  total I/O KB (iokb+iokbc)

rsys   read system calls
wsys   write system calls
iosys  total system calls

iocncl Cancelled write bytes

Page Faults
majf   major page faults
minf   minor page faults
flt    total page faults

Miscellaneous (best when used with --procfilt)
cpu    cpu number
pid    process pid
thread total process threads (not counting main)

TOP SLAB SORT FIELDS

numobj    total number of slab objects
actobj    active slab objects
objsize   sizes of slab objects
numslab   number of slabs
objslab   number of objects in a slab
totsize   total memory sizes taken by slabs
totchg    change in memory sizes
totpct    percent change in memory sizes
name      slab names
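As a sketch of how one of these sort fields might be used, sorting processes by total I/O could look like the following; note that the exact way the sort field is passed to --top is my assumption, so check --showtopopts on your version:

$ collectl --top=iokb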

Filtering information

Let’s say you are running multiple ASM disk groups in your system, but you are only interested in the performance of disk group DATA. The -sD flag will print all the information for all disks (LUNs) of the system. Collectl reports disks as the native devices and dm- devices. For multipathed devices you obviously want to look at the dm- device. You could use the multipath -ll command to map dm- devices to WWIDs and, ultimately, your disks. Let’s say you found out that the disks you need to look at are /dev/dm-{1,3,5,8}; you could then use the --dskfilt flag, which takes a Perl regex. In my example, I could use the following command to check on those disks:

collectl -sD -c 1 --dskfilt "dm-(1\b|3\b|5\b|8\b)"
waiting for 1 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
dm-1             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-3             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-5             0      0    0    0       0      0    1    1       0     0     0      0    0
dm-8             0      0    0    0       0      0    0    0       0     0     0      0    0
$

Note the “\b” boundary, which is my uneducated way of saying that the expression should match dm-1, but not dm-10, or anything else that extends beyond the number one.

Additional filters you can apply can be found in the output of collectl --showsubopts as well as in the subsystem options section of the man page.

Summary

Used correctly, collectl is the Swiss Army knife for system monitoring; the level of detail which can be gathered is breathtaking. Thanks Mark Seger! And apologies for all the good stuff I’ve been missing!

So you don’t see any disks when trying to install ASM?

This is a post that highlights the differences between operating systems, but also the fact that sometimes it is hard to break out of a habit once you have got used to it. My background is that of a Linux enthusiast; I equally like Solaris and AIX, but I have a little less exposure to those.

Background

I have recently been asked to look at RAC on SPARC, which I gladly did. The system I was given had the usual software stack for RAC at this customer’s site. It comprised:

  • Solaris 10 Update 9 64bit on SPARC
  • EMC Power Path 5.1
  • EMC VMAX storage – 10x10G LUNs for a specific performance test

The Power Path configuration had already been in place when I got the machine, and I was allocated /dev/emcpower2[0-9] for my ASM testing. For the experienced Linux user who relies on device-mapper-multipath, the Power Path naming convention can be a bit confusing at first. For reference, the pseudo devices we are interested in for ASM are created under /dev/rdsk/, the “raw” device directory for character-based access, rather than the block devices in /dev/dsk/. By default, the Power Path devices are called “emcpower”, followed by a number and a letter (on SPARC). An example would be /dev/rdsk/emcpower20c.

The number (20) is just a sequence number and doesn’t have any other meaning as far as I know. It also doesn’t seem to be consistent across the cluster nodes by default. The suffix letter (“c”) does have a meaning though: it indicates the slice of the device. For example, the letter “a” indicates slice 0, xxxc is the whole disk, and xxxg is slice 6.

You can use the format command to have a look at the slices currently defined. Simply type “partition” and then “print” to display the current partition table.

Now all the disks I was given had a slice 6, from cylinder 274 to the end of the disk:

# format
[disk selection not shown here]
format> partition
partition> print
Current partition table (original):
Total disk cylinders available: 10238 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 -   136      128.44MB    (137/0/0)     263040
  1       swap    wu     137 -   273      128.44MB    (137/0/0)     263040
  2     backup    wu       0 - 10237        9.37GB    (10238/0/0) 19656960
  3 unassigned    wm       0                0         (0/0/0)            0
  4 unassigned    wm       0                0         (0/0/0)            0
  5 unassigned    wm       0                0         (0/0/0)            0
  6        usr    wm     274 - 10237        9.12GB    (9964/0/0)  19130880
  7 unassigned    wm       0                0         (0/0/0)            0

partition> quit

If you’d like to see how the disks are connected to the array, or if there are any faults, use powermt display dev=all, or alternatively look at a specific device:

# powermt display dev=emcpower20
Pseudo name=emcpower20a
Symmetrix ID=00x2x4x0x6x4
Logical device ID=0x0x
state=alive; policy=SymmOpt; priority=0; queued-IOs=0;
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
3072 pci@400/pci@0/pci@c/fibre-channel@0/fp@0,0 c2t50000974C00A6118d17s0 FA  7eA   active  alive       0      0
3077 pci@500/pci@0/pci@9/fibre-channel@0/fp@0,0 c3t50000974C00A611Cd17s0 FA  8eA   active  alive       0      0

If you care to learn more about how your system administrator configured the paths (I have 2 HBAs as you can see), use “powermt display”. Also note that the disks c2t5….s0 and c3t5…s0 are the native devices aggregated into emcpower20a.

I have changed the permissions on the pseudo devices to 664 and made them owned by oracle:oinstall before launching OUI. In the disk discovery screen I changed the disk discovery string to /dev/rdsk/emcpower2* but didn’t see a single disk in the OUI window. That was odd.

Troubleshooting disk discovery

This error led me to look at /tmp/OraInstall/installActions.log, which referenced a call to kfod, the disk discovery tool. The entry in the log file showed this content:

WARNING: SRVCTL not found as there is no Grid Infrastructure home on the box.
 INFO: Executing [/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod, nohdr=true, verbose=true, disks=all, status=true\
 op=disks, asm_diskstring='/dev/rdsk/emcpower2*']
 with environment variables {XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt, DISPLAY=localhost:12.0,\
 PWD=/u01/app/oracle/stage/grid, LC_CTYPE=en_GB.ISO8859-1, _=./runInstaller, USER=oracle,\
 NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat, LC_COLLATE=en_GB.ISO8859-1, CLASSPATH=, HOME=/export/home/oracle,\
 LC_NUMERIC=en_GB.ISO8859-1, SSH_CONNECTION=10.128.46.56 52883 10.129.48.214 22,\
 LD_LIBRARY_PATH=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/../lib,\
 ORACLE_HOME=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin, SHELL=/usr/bin/bash, LOGNAME=oracle,\
 SSH_TTY=/dev/pts/3, SHLVL=1, LC_MONETARY=en_GB.ISO8859-1, MANPATH=/usr/share/man:/usr/openwin/share/man:\
 /db/pub/man:/db/pub/infra/samba/current/man:/db/pub/infra/rsync/current/man:/db/pub/infra/rcs/current/man:\
 /db/pub/infra/tcpdump/current/man:/usr/dt/share/man:/usr/java1.2/man:/usr/apache/man:/usr/perl5/man:/usr/j2se/man:\
 /usr/local/man:/opt/SUNWconn/ge/man:/opt/SUNWconn/man:/opt/SUNWsan/man:/opt/VRTS/man:/opt/DBpam/man:\
 /opt/SUNWexplo/man:/opt/SUNWcstu/man:/opt/VRTSvlic/man, SSH_CLIENT=10.128.46.56 52883 22, \
 MAIL=/var/mail//oracle, TZ=GB-Eire, LC_TIME=en_GB.ISO8859-1, JAVA_HOME=, LC_MESSAGES=C, PS1=[\u@\h \W]> , \
 OLDPWD=/u01/app/oracle/stage, LC_ALL=, TERM=xterm, TMOUT=0, PATH=/usr/bin:/usr/ccs/bin:/usr/bin:/opt/EMCpower/bin:\
 /etc/emc/bin:/etc:/zones/app/oracle/stage/grid/install}
 INFO: Starting Output Reader Threads for process /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod
 INFO: Parsing KFOD-00311: Error scanning device /dev/rdsk/c3t50000974C00A611Cd16s7
[more errors skipped]
..

(Re-) Discovery of KFOD

So therein lies the problem! As a side effect I rediscovered the command line options used with kfod! Maybe I can use this for some experiments… All you need to do is to set environment variables for ORACLE_HOME and LD_LIBRARY_PATH as per the above output.
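For instance, based on the paths in the log output above, a minimal sketch of re-running kfod by hand might look like this (the temporary OraInstall directory obviously changes with every installer run):

$ export ORACLE_HOME=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin
$ export LD_LIBRARY_PATH=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/../lib
$ $ORACLE_HOME/kfod op=disks disks=all status=true asm_diskstring='/dev/rdsk/emcpower2*'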

After a little fiddling around I remembered that, unlike in the tests with ZFS I did before, I can’t pass the whole disk to ASM (the VTOC is in the first part of the disk!), but rather have to use a slice. So by changing the disk string to /dev/rdsk/emcpower2*g I got a successful report from kfod, executed as “oracle”:

$ /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod verbose=true, disks=all status=true op=disks \
> asm_diskstring='/dev/rdsk/emcpower2*g'
--------------------------------------------------------------------------------
 Disk          Size Header    Path                                     User     Group
================================================================================
   1:       9341 Mb CANDIDATE /dev/rdsk/emcpower20g                    oracle   oinstall
   2:       9341 Mb CANDIDATE /dev/rdsk/emcpower21g                    oracle   oinstall
   3:       9341 Mb CANDIDATE /dev/rdsk/emcpower22g                    oracle   oinstall
   4:       9341 Mb CANDIDATE /dev/rdsk/emcpower23g                    oracle   oinstall
   5:       9341 Mb CANDIDATE /dev/rdsk/emcpower24g                    oracle   oinstall
   6:       9341 Mb CANDIDATE /dev/rdsk/emcpower25g                    oracle   oinstall
   7:       9341 Mb CANDIDATE /dev/rdsk/emcpower26g                    oracle   oinstall
   8:       9341 Mb CANDIDATE /dev/rdsk/emcpower27g                    oracle   oinstall
   9:       9341 Mb CANDIDATE /dev/rdsk/emcpower28g                    oracle   oinstall
  10:       9341 Mb CANDIDATE /dev/rdsk/emcpower29g                    oracle   oinstall
KFOD-00311: Error scanning device /dev/rdsk/emcpower2g
ORA-15025: could not open disk "/dev/rdsk/emcpower2g"
SVR4 Error: 13: Permission denied
Additional information: 42
Additional information: 272256
Additional information: 12699536

I haven’t worked out which wildcard characters are valid in the ASM_DISKSTRING (‘/dev/rdsk/emcpower2[0-9]g’ didn’t work), so I left it there. The emcpower2g device was not meant to be used in the test anyway.

With this setting I returned to the OUI window and continued the installation.

Lesson Learned

Oracle doesn’t always have useful logs, but in this case it was very useful to look at the install-actions file in /tmp. Also remember that creating ASM disks on Solaris and other platforms is different from doing so on Linux.

Using Connection Manager to protect a database

I have known about Oracle’s Connection Manager for quite a while but never managed to use it in anger; in short, there was no need to do so. Now, however, I have been asked to help find a solution to an interesting problem.

In summary, my customer is running DR and UAT in the same data centre. For technical reasons they rely on RMAN to refresh UAT; array mirror splits on the SAN (which would be way faster!) are not possible. The requirement is to prevent an RMAN session with target=UAT and auxiliary=DR from overwriting the DR databases, all of which are RAC databases on 11.2. The architecture included 2 separate networks for the DR hosts and the UAT hosts: DR was on 192.168.99.0/24 whereas UAT was on 192.168.100.0/24. A gateway host with two NICs connects the two. Initially there was a firewall on the gateway host to prevent traffic from UAT to DR, but because of the way Oracle connections work this proved impossible (the firewall was a simple set of iptables rules). After initial discussions I decided to look at Connection Manager more closely, as it is hailed as a solution to Oracle connectivity and firewalls.

Thinking more closely about the problem I realised that the firewall plus static routes approach might not be the best one. So I decided to perform a test which didn’t involve iptables or custom routes, and to rely on CMAN alone to protect the database. A UAT refresh isn’t something that’s going to happen very frequently, which allows me to shut down CMAN and thus prevent any communication between the two networks. This is easier than maintaining a firewall; remember, the easiest way to do something is not to do it at all.

Overview of Connection Manager

For those of you who aren’t familiar with CMAN, here’s a short summary (based on the official Oracle documentation).

Configuration of Oracle Connection Manager (CMAN) allows clients to connect through a firewall [I haven’t verified this yet, ed]. CMAN is an executable that allows clients to connect despite a firewall being in place between the client and server. CMAN is similar to the Listener in that it reads a configuration file [called cman.ora, ed], which contains the address on which Oracle Connection Manager listens for incoming connections. CMAN starts similarly to the Listener and will enter a LISTEN state.

This solution [to the firewall issue with TCP redirects, ed] will make the REDIRECT happen inside the firewall so the client will not see it; CMAN acts as a proxy service between the client and the real database listener.

Interestingly, Connection Manager is fully integrated into the FAN/FCF framework and equally suitable for UCP connection pools.

Technically speaking, Oracle database instances require the initialisation parameters local_listener and remote_listener to be set. In RAC databases this is usually the case out of the box; however, in addition to the SCAN, the remote_listener must include the CMAN listener as well. An example is provided later in this document.
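As a hedged sketch only (the alias names are the ones used later in this post, and they must resolve via the database’s tnsnames.ora), setting the two parameters could look like this:

SQL> alter system set local_listener='LOCAL_CMANDB' scope=both sid='*';
SQL> alter system set remote_listener='REMOTE_CMANDB' scope=both sid='*';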

Installing Connection Manager

Connection Manager is now part of the Oracle client, and you can install it by choosing the “custom” option. From the list of selectable options, pick “Oracle Net Listener” and “Oracle Connection Manager”.

From there on it’s exactly the same as any other client installation.

Testing

A quick test with 2 separate networks reveals that the concept actually works. The following hosts are used:

  • cman      192.168.99.224
  • cmandb    192.168.100.225
  • client    192.168.99.31

The networks in use are:

  • Public network:  192.168.99.0/24
  • Private network: 192.168.100.0/24

As you can see CMANDB is on a different network than the other hosts.

Connection manager has been installed on host “cman”, with IP 192.168.99.224. The listener process has been configured to listen on port 1821. The corresponding cman.ora file has been configured as follows in $CLIENT_HOME/network/admin:

cman1 =
  (configuration=
    (address=(protocol=tcp)(host=192.168.99.224)(port=1821))
    (rule_list=
      (rule=(src=*)(dst=127.0.0.1)(srv=cmon)(act=accept))
      (rule=(src=192.168.99.0/24)(dst=192.168.100.225)(srv=*)(act=accept))
    )
  )

The file has been left in this minimalistic state deliberately: the only connection possible is to the database host. The cmon service must be allowed, otherwise the startup of the Connection Manager processes will fail.
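To bring up the Connection Manager processes with this configuration, something along these lines should work; cman1 is the alias defined in the cman.ora above, and the exact cmctl invocation is from memory, so treat it as a sketch:

$ cmctl startup -c cman1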

The gateway host has 2 network interfaces, one for each network:

[root@cman ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:16:3E:F2:34:56
inet addr:192.168.99.224  Bcast:192.168.99.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:1209 errors:0 dropped:0 overruns:0 frame:0
TX packets:854 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:105895 (103.4 KiB)  TX bytes:147462 (144.0 KiB)
Interrupt:17

eth1      Link encap:Ethernet  HWaddr 00:16:3E:52:4A:56
inet addr:192.168.100.224  Bcast:192.168.100.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:334 errors:0 dropped:0 overruns:0 frame:0
TX packets:151 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:36425 (35.5 KiB)  TX bytes:22141 (21.6 KiB)
Interrupt:16

Its routing table is defined as follows:

[root@cman ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.100.0   0.0.0.0         255.255.255.0   U     0      0        0 eth1
192.168.101.0   0.0.0.0         255.255.255.0   U     0      0        0 eth2
192.168.99.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth2
[root@cman ~]

Configuration on host “CMANDB”

Note that host 192.168.100.225 is on the private network! The database CMANDB has its local and remote listener configured using the below entries in tnsnames.ora:

[oracle@cmandb admin]$ cat tnsnames.ora
LOCAL_CMANDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(PORT = 1521)(HOST = 192.168.100.225))
    )
  )

REMOTE_CMANDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(PORT = 1821)(HOST = 192.168.99.224))
      (ADDRESS = (PROTOCOL = TCP)(PORT = 1521)(HOST = 192.168.100.225))
    )
  )

CMAN_LSNR =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(PORT = 1521)(HOST = 192.168.99.224))
    )
  )

  • local listener is set to local_cmandb
  • remote listener is set to remote_cmandb
  • CMAN_LSNR is used for a test to verify that a connection to the CMAN listener is possible from this host

The host had only one network interface, connecting to the CMAN host:

[root@cmandb ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:16:3E:12:14:51
inet addr:192.168.100.225  Bcast:192.168.100.255  Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:627422 errors:0 dropped:0 overruns:0 frame:0
TX packets:456584 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2241758430 (2.0 GiB)  TX bytes:32751153 (31.2 MiB)

The only change to the system was the addition of a default gateway. Unfortunately the CMAN process cannot listen on more than one IP address:

# route add default gw 192.168.100.224 eth0

The following routing table was in use during the testing:

[root@cmandb ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.100.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
0.0.0.0         192.168.100.224 0.0.0.0         UG    0      0        0 eth0

After the database’s local_listener and remote_listener parameters had been changed to LOCAL_CMANDB and REMOTE_CMANDB, the log on the CMAN host recorded a number of service registrations. This is important as it allows Connection Manager to hand the connection request off to the database:

[oracle@cman trace]$ grep cmandb *
08-JUL-2011 10:19:55 * service_register * cmandb * 0
08-JUL-2011 10:25:28 * service_update * cmandb * 0
08-JUL-2011 10:35:28 * service_update * cmandb * 0
[...]
08-JUL-2011 11:15:10 * service_update * cmandb * 0
08-JUL-2011 11:15:28 * service_update * cmandb * 0
[oracle@cman trace]$

Additionally, the CMAN processes now know about the database service CMANDB:

CMCTL:cman1> show services
Services Summary...
Proxy service "cmgw" has 1 instance(s).
Instance "cman", status READY, has 2 handler(s) for this service...
Handler(s):
"cmgw001" established:0 refused:0 current:0 max:256 state:ready

(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=23589))
"cmgw000" established:0 refused:0 current:0 max:256 state:ready

(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=31911))
Service "cmandb" has 1 instance(s).
Instance "cmandb", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(PORT=1521)(HOST=192.168.100.225)))
Service "cmandbXDB" has 1 instance(s).
Instance "cmandb", status READY, has 1 handler(s) for this service...
Handler(s):
"D000" established:0 refused:0 current:0 max:1022 state:ready
DISPATCHER 
(ADDRESS=(PROTOCOL=tcp)(HOST=cmandb.localdomain)(PORT=50347))
Service "cmon" has 1 instance(s).
Instance "cman", status READY, has 1 handler(s) for this service...
Handler(s):
"cmon" established:1 refused:0 current:1 max:4 state:ready

(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=59541))
The command completed successfully.
CMCTL:cman1>

Connectivity Test

With the setup completed it was time to perform a test from a third host on the (public) network. Its IP address is 192.168.99.31. The below TNSnames entries were created:

[oracle@client admin]$ cat tnsnames.ora
CMAN =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PORT=1821)(HOST=192.168.99.224)(PROTOCOL=TCP))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = cmandb)
      (SERVER = DEDICATED)
    )
  )

DIRECT =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PORT=1521)(HOST=192.168.100.225)(PROTOCOL=TCP))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = cmandb)
      (SERVER = DEDICATED)
    )
  )

The CMAN entry uses the Connection Manager gateway host to connect to database CMANDB, whereas the DIRECT entry tries to bypass the latter. A tnsping should show whether or not this is possible.

[oracle@client admin]$ tnsping direct
TNS Ping Utility for Linux: Version 11.2.0.1.0 - Production on 08-JUL-2011 11:19:47

Copyright (c) 1997, 2009, Oracle.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias

Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PORT=1521)(HOST=192.168.100.225)(PROTOCOL=TCP))) (CONNECT_DATA = (SERVICE_NAME = cmandb) (SERVER = DEDICATED)))

TNS-12543: TNS:destination host unreachable

[oracle@client admin]$ tnsping cman

TNS Ping Utility for Linux: Version 11.2.0.1.0 - Production on 08-JUL-2011 10:53:27

Copyright (c) 1997, 2009, Oracle.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias

Attempting to contact (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PORT=1821)(HOST=192.168.99.224)(PROTOCOL=TCP))) (CONNECT_DATA = (SERVICE_NAME = cmandb) (SERVER = DEDICATED)))

OK (0 msec)

[oracle@client admin]$

Now that proves that a direct connection is impossible, and also that the Connection Manager’s listener is working. A tnsping doesn’t imply that a connection is possible though; this requires an end-to-end test with SQL*Plus:

[oracle@client admin]$ sqlplus system/xxx@cman

SQL*Plus: Release 11.2.0.1.0 Production on Fri Jul 8 10:55:39 2011

Copyright (c) 1982, 2009, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select instance_name,host_name from v$instance;

INSTANCE_NAME
----------------
HOST_NAME
----------------------------------------------------------------
cmandb
cmandb.localdomain

The successful connection is also recorded in the CMAN log file:

Fri Jul 08 10:55:39 2011

08-JUL-2011 10:55:39 * (CONNECT_DATA=(SERVICE_NAME=cmandb)(SERVER=DEDICATED)(CID=(PROGRAM=sqlplus@client)(HOST=client)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.99.31)(PORT=16794)) * establish * cmandb * 0

(LOG_RECORD=(TIMESTAMP=08-JUL-2011 10:55:39)(EVENT=Ready)(CONN NO=0))

Additional date for my Grid Infrastructure seminar in the UK

I was very pleasantly surprised that Oracle University are offering another day for my “Grid Infrastructure and Database High Availability Deep Dive” seminar. In addition to the imminent seminars in June (I blogged about them earlier), this one is in London, England. For anyone interested, here is the link:

http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getCourseDesc?dc=D70365_1439069&p_org_id=28&lang=US

The date has been set to October 10th, so there is plenty of time still, but nevertheless I hope to see you there!


Tuning Log and Trace levels for Clusterware 11.2

With the introduction of Clusterware 11.2 a great number of command line tools have either been deprecated ($ORA_CRS_HOME/bin/crs_* and others) or merged into other tools. This is especially true for crsctl, which is now the tool to access and manipulate low level resources in Clusterware.

This also implies that some of the notes on Metalink are no longer applicable to Clusterware 11.2, such as the one detailing how to get more detailed information in the logs. Not that the log information wasn’t already rather comprehensive if you asked me…

And here comes a warning: don’t change the log levels unless you have a valid reason, or under the instructions of support. Higher log levels than the defaults tend to generate too much data, filling up the GRID_HOME and potentially killing the node.

Log File Location

The location for logs in Clusterware hasn’t changed much since the unified log structure was introduced in 10.2 and documented in “CRS and 10g/11.1 Real Application Clusters (Doc ID 259301.1)”. It has been extended though, and quite dramatically so in 11.2, which is documented as well in one of the better notes from support: “11gR2 Clusterware and Grid Home – What You Need to Know (Doc ID 1053147.1)”

The techniques for getting debug and trace information described, for example, in “Diagnosability for CRS / EVM / RACG (Doc ID 357808.1)” don’t really apply any more as the syntax has changed.

Getting Log Levels in 11.2

If you are interested in the log level at which a specific Clusterware resource operates, you can use the crsctl get log res resourceName call, as in this example:

# crsctl get log res ora.asm
Get Resource ora.asm Log Level: 1

Don’t forget to apply the “-init” flag when you want to query resources which are part of the lower stack:

[root@lonengbl312 ~]# crsctl get log res ora.cssd
CRS-4655: Resource ‘ora.cssd’ could not be found.
CRS-4000: Command Get failed, or completed with errors.
[root@lonengbl312 ~]# crsctl get log res ora.cssd -init
Get Resource ora.cssd Log Level: 1

Interestingly, most Clusterware daemons leave a detailed message in their log file at startup stating which module uses which logging level. Take CSSD for example:

2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = GIPCNM, LogLevel = 2, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = GIPCGM, LogLevel = 2, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = GIPCCM, LogLevel = 2, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = CLSF, LogLevel = 0, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 0, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = GPNP, LogLevel = 1, TraceLevel = 0
2011-06-02 09:06:14.847: [    CSSD][1531968800]clsu_load_ENV_levels: Module = OLR, LogLevel = 0, TraceLevel = 0

This doesn’t match the output of the previous command. Now how do you check this on the command line? crsctl to the rescue again; it now has a “log” option:

# crsctl get log -h
Usage:
  crsctl get {log|trace} {mdns|gpnp|css|crf|crs|ctss|evm|gipc} "<name1>,..."
 where
   mdns          multicast Domain Name Server
   gpnp          Grid Plug-n-Play Service
   css           Cluster Synchronization Services
   crf           Cluster Health Monitor
   crs           Cluster Ready Services
   ctss          Cluster Time Synchronization Service
   evm           EventManager
   gipc          Grid Interprocess Communications
   <name1>, ...  Module names ("all" for all names)

  crsctl get log res <resname>
 where
   <resname>     Resource name

That’s good information, but the crucial bit about the module names and other parameters is missing. The next question I asked myself was: how do I find out which sub-modules are part of, say, crsd, apart from looking at the trace file? That information used to be documented in the OTN reference, but it’s now something you can evaluate from Clusterware as well:

[root@lonengbl312 ~]# crsctl lsmodules -h
Usage:
  crsctl lsmodules {mdns|gpnp|css|crf|crs|ctss|evm|gipc}
 where
   mdns  multicast Domain Name Server
   gpnp  Grid Plug-n-Play Service
   css   Cluster Synchronization Services
   crf   Cluster Health Monitor
   crs   Cluster Ready Services
   ctss  Cluster Time Synchronization Service
   evm   EventManager
   gipc  Grid Interprocess Communications

Back to the question as to which modules make up CSSD, I got this answer from Clusterware, matching the log file output:

# crsctl lsmodules css
List CSSD Debug Module: CLSF
List CSSD Debug Module: CSSD
List CSSD Debug Module: GIPCCM
List CSSD Debug Module: GIPCGM
List CSSD Debug Module: GIPCNM
List CSSD Debug Module: GPNP
List CSSD Debug Module: OLR
List CSSD Debug Module: SKGFD

To get detailed information about the log level of an individual CSSD module, I can now finally use the crsctl get {log|trace} command with the daemon and module name. My CSSD logfile stated that CSSD:OLR had log level 0 and trace level 0, which is confirmed by Clusterware:

# crsctl get log css "OLR"
Get CSSD Module: OLR  Log Level: 0

Note that the name of the module has to be in upper case. If you are lazy like me you could use the “all” keyword instead of the module name to list all the information in one command:

# crsctl get log css all
Get CSSD Module: CLSF  Log Level: 0
Get CSSD Module: CSSD  Log Level: 2
Get CSSD Module: GIPCCM  Log Level: 2
Get CSSD Module: GIPCGM  Log Level: 2
Get CSSD Module: GIPCNM  Log Level: 2
Get CSSD Module: GPNP  Log Level: 1
Get CSSD Module: OLR  Log Level: 0
Get CSSD Module: SKGFD  Log Level: 0

Setting Log Levels

Let’s have a look at how to increase log levels for the critical components. I am picking CSSD here, simply because I want to know whether Oracle records the chicken-and-egg problem of having voting files in ASM in the ocssd.log file. (CSSD needs the voting files to start, and ASM needs CSSD to run, but if the voting files are in ASM, then we have gone full circle.) With all the information provided in the section above, that’s actually quite simple to do.

# crsctl set log -h
Usage:
  crsctl set {log|trace} {mdns|gpnp|css|crf|crs|ctss|evm|gipc} "<name1>=<lvl1>,..."
 where
   mdns          multicast Domain Name Server
   gpnp          Grid Plug-n-Play Service
   css           Cluster Synchronization Services
   crf           Cluster Health Monitor
   crs           Cluster Ready Services
   ctss          Cluster Time Synchronization Service
   evm           EventManager
   gipc          Grid Interprocess Communications
   <name1>, ...  Module names ("all" for all names)
   <lvl1>, ...   Module log/trace levels

  crsctl set log res <resname>=<lvl1>
 where
   <resname>     Resource name
   <lvl1>        Agent log levels

The object I would like to research is the voting file discovery. I know that the ASM disks have an entry in the header indicating whether or not a disk contains a voting file, in the kfdhdb.vfstart and kfdhdb.vfend fields as shown here:

# kfed read /dev/oracleasm/disks/OCR1  | grep kfdhdb.vf
kfdhdb.vfstart:                     256 ; 0x0ec: 0x00000100
kfdhdb.vfend:                       288 ; 0x0f0: 0x00000120

With the default log information, I can see the voting disk discovery happening as shown in this cut down version of the ocssd.log:

2011-06-02 09:06:19.941: [    CSSD][1078901056]clssnmvDDiscThread: using discovery string  for initial discovery
2011-06-02 09:06:19.941: [   SKGFD][1078901056]Discovery with str::
2011-06-02 09:06:19.941: [   SKGFD][1078901056]UFS discovery with ::
2011-06-02 09:06:19.941: [   SKGFD][1078901056]OSS discovery with ::
2011-06-02 09:06:19.941: [   SKGFD][1078901056]Discovery with asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: str ::
2011-06-02 09:06:19.941: [   SKGFD][1078901056]Fetching asmlib disk :ORCL:OCR1:

2011-06-02 09:06:19.942: [   SKGFD][1078901056]Handle 0x58d2a60 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:OCR1:
2011-06-02 09:06:19.942: [   SKGFD][1078901056]Handle 0x58d3290 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:OCR2:
2011-06-02 09:06:19.942: [   SKGFD][1078901056]Handle 0x58d3ac0 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:OCR3:

2011-06-02 09:06:19.953: [    CSSD][1078901056]clssnmvDiskVerify: discovered a potential voting file
2011-06-02 09:06:19.953: [   SKGFD][1078901056]Handle 0x5c22670 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk  :ORCL:OCR1:
2011-06-02 09:06:19.954: [    CSSD][1078901056]clssnmvDiskVerify: Successful discovery for disk ORCL:OCR1, UID 4700d9a1-93094ff5-bf8b96e1-7efdb883, Pending CIN 0:1301401766:0, Committed CIN 0:1301401766:0
2011-06-02 09:06:19.954: [   SKGFD][1078901056]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x5c22670 for disk :ORCL:OCR1:

2011-06-02 09:06:19.956: [    CSSD][1078901056]clssnmvDiskVerify: Successful discovery of 3 disks
2011-06-02 09:06:19.956: [    CSSD][1078901056]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2011-06-02 09:06:19.957: [    CSSD][1078901056]clssnmCompleteVFDiscovery: Completing voting file discovery

Now I’m setting the new log level for SKGFD, as this seems to be responsible for the disk discovery.

# crsctl set log css "SKGFD:5"
Set CSSD Module: SKGFD  Log Level: 5

Does it make a difference? Restarting Clusterware should give more clues. Looking at the logfile I could tell there was no difference. I tried using the trace level now for the same module:

# crsctl get trace css SKGFD
Get CSSD Module: SKGFD  Trace Level: 5

Interestingly that wasn’t picked up!

$ grep "Module = SKGFD" ocssd.log | tail -n1
2011-06-02 11:31:06.707: [    CSSD][427846944]clsu_load_ENV_levels: Module = SKGFD, LogLevel = 2, TraceLevel = 0

So I decided to leave it at this point, as I was running out of time for this test. I’d be interested to hear if anyone has successfully raised the log level for this component; on the other hand, I wonder if it is ever necessary to do so. I can confirm, however, that raising the log level for the CSSD module created a lot more output, actually so much that it seems to equal a system-wide SQL trace, so don’t change log levels unless support directs you to.


A more user friendly multipath.conf

During some recent work involving a stretched 11.2.0.2 RAC for a SAP implementation at a customer site, I researched TimeFinder/Clone backups. As part of this exercise I have been able to experiment with RHEL (OEL) 5.6 and the new device-mapper-multipath package on the mount host. I have been very pleasantly surprised by a new feature, which I’d like to share.

Background of this article

Device Mapper Multipath is the “native” Linux multipathing software, as opposed to vendor-supplied multipathing such as EMC’s Power Path or Hitachi’s HDLM.

My customer’s setup is rather unique for a SAP environment as it uses Oracle Enterprise Linux and not Solaris/SPARC or AIX on the Power platform with an active/passive solution. Well if that doesn’t make it sound unique, the fact that there is a plan to run Oracle 11.2.0.2 RAC potentially across sites using ASM and ACFS certainly makes this deployment stand out from the rest.

The reason I mention this is simple: it requires a lot of engineering effort to certify components for this combination. For example, it was not trivial to get vendor support for Solutions Enabler, which the storage engineering team uses for connectivity with the VMAX arrays, and so on. After all, Oracle Enterprise Linux is a fairly new platform, and Red Hat certainly has an advantage when it comes to vendor certification.

What hasn’t been achieved was a certification of the EMC Power Path software for use with SAP on Linux, for reasons unknown to me. The result was simple: the setup will use the device-mapper multipath package that comes with the Linux distribution.

Configuring multipath

Now, with this said, I started looking at MOS for relevant support notes about Linux’s native multipathing and found some. The summary of this research is available on this blog; I have written about it in earlier articles.

What I didn’t know until now was that the multipath.conf file allows you to define the ownership and mode of a device. As an ASM user this is very important to me: experience taught me that incorrect permissions are one of the main reasons for ASM failing to start. Remember that root owns block devices by default.

Consider this example:

multipaths {
    ...
    multipath {
        wwid    360a98000486e58526c34515944703277
        alias   ocr01
        mode    660
        uid     501
        gid     502
    }
    ...
}

The above entry in the multipaths section translates into English as follows:

  • If you encounter a device with the WWID 36a…277
  • Give it an alias name of “OCR01” in /dev/mapper
  • Set the mode to 660 (i.e. rw-rw----)
  • Assign device ownership to the user with UID 501 (maps to “grid” on my system)
  • Assign the group of the device to 502 (maps to “asmdba” on my system)

The path settings are defined globally and don’t need to be mentioned explicitly for each device unless you prefer to override them. I like to use an alias, although it isn’t really necessary since ASM relies on a 4k header in a block device to store its identity. If you don’t choose to alias a device, I recommend you use the user-friendly name instead, mainly for aesthetic reasons.
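If you go down the user-friendly-names route rather than aliasing every device, a minimal sketch of the corresponding defaults section might look like this (verify the option against the multipath.conf documentation shipped with your version):

defaults {
    user_friendly_names yes
}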

Why is this really cool? For me it implies two things:

  • I can now be independent of ASMLib which provided device name stability and set the permissions on the block devices correctly for ASM
  • I don’t need to create udev rules to set the permissions (or /etc/rc.local or whichever other way you chose before to set permissions)

Nice! One less headache, as I have to say that I didn’t really like udev…

How to use the RAC FAN Java API

I researched an interesting new feature available with Oracle 11g Release 2, the so-called RAC FAN API, when writing the workload management chapter for the RAC book. The RAC FAN API is documented in the Oracle® Database JDBC Developer’s Guide, 11g Release 2 (11.2), available online, but when it came to the initial documentation following the 11.2.0.1 release on Linux it was pretty useless. The good news is that it has improved!

The RAC FAN Java API

The aim of this API is to allow a Java application to listen to FAN events by creating a subscription to the RAC nodes’ ONS processes. The application then registers a FanEventListener, based on the subscription, which can pick up instances of the following events:

  • ServiceDownEvent
  • NodeDownEvent
  • LoadAdvisoryEvent

All of these are in the oracle.simplefan package; the Javadoc reference can be found in the official documentation.

When it was initially released in 11.2.0.1 I tried to get the FAN subscription to work without any luck; the registration with the ONS didn’t work. Schematically, this is how it should work:

The application (shown at the top) requires an Oracle client for the ONS configuration and libraries, and the setup is very similar to how you’d set up FCF. With the ons.config file pointing to the RAC nodes’ ONS processes you should be able to read FAN events as they occur on the source system. The ons.config file in $CLIENT_HOME/opmn/conf has to contain the local and remote ONS ports, as well as all the RAC nodes with their remote ONS port. When subscribing to the events, you specify the service name you’d like to receive events for.

My 4-node RAC cluster comprises the nodes rac11gr2drnode{1,2,3,4}, with the local ONS listening on port 6200 for external and port 6100 for internal requests. This can be checked using the onsctl debug command on any cluster node. The client’s ons.config contains these lines:

localport:6100
remoteport:6200
nodes=rac11gr2drnode1:6200, rac11gr2drnode2:6200,rac11gr2drnode3:6200, rac11gr2drnode4:6200

Unfortunately I couldn’t subscribe to the remote ONS. I know that my setup is valid because of the UCP/FCF testing I did previously (check my earlier blog postings about how to get started with UCP and Tomcat 6). Or maybe it doesn’t work on Windows, who knows? Trying to subscribe to the remote ONS, my application bails out with the following error:

Exception in thread “main” oracle.ons.SubscriptionException: Subscription request timed out after 30000 millseconds.  Possible causes: OPMN may not be running, you may have an OPMN running in an alternate ORACLE_HOME using duplicate port values, or OPMN may be misconfigured.
at oracle.ons.SubscriptionNotification.waitForReply(SubscriptionNotification.java:83)
at oracle.ons.ONS.addSubscriber(ONS.java:956)
at oracle.ons.Subscriber.realStartup(Subscriber.java:103)
at oracle.ons.Subscriber.<init>(Subscriber.java:79)
at oracle.simplefan.impl.FanSubscription.subscribe(FanSubscription.java:228)
at oracle.simplefan.impl.FanSubscription.createFanSubscription(FanSubscription.java:46)
at oracle.simplefan.impl.FanManager.subscribe(FanManager.java:120)
at de.mbh.TestClass.<init>(TestClass.java:21)
at de.mbh.TestClass.main(TestClass.java:53)

I tried my favourite search engine on the exception but couldn’t find any useful information. This is not a big deal for me; I can start a local ONS on the client. This is shown here:

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl ping
ons is not running ...

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl start
onsctl start: ons started

C:\oracle\product\11.2.0\client_1\opmn\bin>onsctl debug
HTTP/1.1 200 OK
Content-Length: 3627
Content-Type: text/html
Response:

== client:6200 5872 11/05/26 11:47:16 ==
Home: c:\oracle\product\11.2.0\client_1

======== ONS ========
IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
10.xxx.xx.98  6200 4dde2fb0 00000006 00000008

Listener:

TYPE                BIND ADDRESS               PORT  SOCKET
-------- --------------------------------------- ----- ------
Local                                  127.0.0.1  6100    344
Remote                                       any  6200    352

Servers: (3)

INSTANCE NAME                  TIME   SEQUENCE  FLAGS     DEFER
---------------------------------------- -------- -------- -------- ----------
dbInstance_rac11gr2drnode1_6200          4d53d6a4 00029496 00000002          0
10.xxx.xx.155 6200
dbInstance_rac11gr2drnode2_6200          4d3eb9e2 0006cbeb 00000002          0
10.xxx.xx157 6200
dbInstance_rac11gr2drnode3_6200 4d5937d6 0001990f 00000002          0
10.xxx.xx.158 6200

Connection Topology: (4)

IP                      PORT   VERS  TIME
--------------------------------------- ----- ----- --------
10.xxx.xx.158  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.157 6200
**                            10.xxx.xx.98 6200
10.xxx.xx.157  6200     4 4dde2fb2
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.158 6200
**                            10.xxx.xx.98 6200
10.xxx.xx.155  6200     4 4dde2fb2
**                            10.xxx.xx.98 6200
**                           10.xxx.xx.157 6200
**                           10.xxx.xx.158 6200
10.xxx.xx.98  6200     4 4dde2fb2=
**                           10.xxx.xx.155 6200
**                           10.xxx.xx.158 6200
**                           10.xxx.xx.157 6200

Server connections:

ID            CONNECTION ADDRESS              PORT  FLAGS  SENDQ REF WSAQ
-------- --------------------------------------- ----- ------ ----- --- ----
0                           10.xxx.xx.155  6200 010405 00000 001  ---
1                           10.xxx.xx.156  6200 002405 00000 001
2                           10.xxx.xx.157  6200 010405 00000 001  --
3                           10.xxx.xx.158  6200 010405 00000 001  ---

Client connections:

ID            CONNECTION ADDRESS              PORT  FLAGS  SENDQ REF SUB W
-------- --------------------------------------- ----- ------ ----- --- --- -
4                                internal     0 01008a 00000 001 002
request                               127.0.0.1  6100 03201a 00000 001 000

Worker Ticket: 28/28, Last: 11/05/26 11:47:15

THREAD   FLAGS
-------- --------
120 00000012
124 00000012
128 00000012

Resources:

Notifications:
Received: Total 12 (Internal 6), in Receive Q: 0
Processed: Total 12, in Process Q: 0

Pool Counts:
Message: 1, Link: 1, Ack: 1, Match: 1

C:\oracle\product\11.2.0\client_1\opmn\bin>

With a local ONS started on my client, I can actually subscribe to the ONS and make use of the events. The easiest way is to simply decode the load balancing events, as I did in my code, shown below (modified version of the code in the Oracle documentation to make it work):

package de.mbh;

import oracle.simplefan.FanSubscription;
import oracle.simplefan.FanEventListener;
import oracle.simplefan.FanManager;
import oracle.simplefan.LoadAdvisoryEvent;
import oracle.simplefan.NodeDownEvent;
import oracle.simplefan.ServiceDownEvent;
import java.util.Properties;

public class TestClass {

    TestClass() {
        System.out.println("Hello");

        Properties p = new Properties();
        p.put("serviceName", "OEMSRV");

        System.setProperty("oracle.ons.oraclehome", "c:\\oracle\\product\\11.2.0\\client_1");
        System.out.println(System.getProperty("oracle.ons.oraclehome"));

        FanSubscription sub = FanManager.getInstance().subscribe(p);

        System.out.println("I'm subscribed!");

        sub.addListener(new FanEventListener() {

            public void handleEvent(ServiceDownEvent arg0) {
                System.out.println("Service Down registered!");
            }

            public void handleEvent(NodeDownEvent arg0) {
                System.out.println("Node Down Event Registered");
            }

            public void handleEvent(LoadAdvisoryEvent arg0) {
                System.out.println("Just got a Load Advisory event");

                System.out.println("originating database: " + arg0.getDatabaseUniqueName());
                System.out.println("originating instance: " + arg0.getInstanceName());
                System.out.println("Service Quality     : " + arg0.getServiceQuality());
                System.out.println("Percent             : " + arg0.getPercent());
                System.out.println("Service Name        : " + arg0.getServiceName());
                System.out.println("Service Quality     : " + arg0.getServiceQuality());
                System.out.println("Observed at         : " + arg0.getTimestamp() + "\n\n");
            }
        });
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        TestClass tc = new TestClass();

        int i = 0;
        while (i < 100000) {
            try {
                Thread.sleep(100);
                i++;
            } catch (Exception e) {
                System.out.println(e);
            }
        }

        System.out.println("execution ended");
    }

}

When compiling or executing the code, you need the simplefan.jar and ons.jar files in your classpath, and chances are that you need ojdbc6.jar as well.
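For reference, compiling and running the class could look roughly like this on Linux; the jar locations are assumptions, and on Windows the classpath separator is ; instead of ::

$ javac -cp .:ons.jar:simplefan.jar:ojdbc6.jar de/mbh/TestClass.java
$ java -cp .:ons.jar:simplefan.jar:ojdbc6.jar de.mbh.TestClass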

Starting the application reveals Load Balancing events being read:

Hello
c:\oracle\product\11.2.0\client_1
I'm subscribed!
Just got a Load Advisory event
originating database: LNGDS1RD
originating instance: dbInstance_rac11gr2drnode3_6200
Service Quality     : -1
Percent             : 51
Service Name        : OEMSRV
Service Quality     : -1
Observed at         : Thu May 26 11:56:03 BST 2011

Just got a Load Advisory event
originating database: LNGDS1RD
originating instance: dbInstance_rac11gr2drnode3_6200
Service Quality     : -1
Percent             : 49
Service Name        : OEMSRV
Service Quality     : -1
Observed at         : Thu May 26 11:56:03 BST 2011

This proves that you can write your own connection cache, which allows you to react to *down events and rebalance your session differently.

Offloading production backups to a different storage array

For quite a while Oracle DBAs have performed split mirror backups using special devices called “Business Continuance Volumes” or BCVs for short. A BCV is a special mirror copy of a LUN on the same storage array as the primary copy.

In a BCV backup scenario, the database is first put into hot backup mode, after which the storage administrator (usually) “splits” the mirror. Once the mirror is split, the database is taken out of hot backup mode and resumes normal operation. A new Oracle instance on a different host can then mount the split mirror copy of the database for backups. The use of this technology for refreshing a test environment is out of scope of this article. The figure below demonstrates the idea:

The advantage of such an approach is that the backup operation, initiated from the mount host, should not impact the performance of the production database. Once the backup is complete, the BCV for the ARCH diskgroup should be re-synchronised with the source LUN, whereas the one for the DATA disk group should not. This allows us to quickly recover from problems with the primary LUNs; more on that in a later post.
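
To make the sequence on the production side a little more concrete, a minimal sketch could look like the following. The split itself is array-specific and usually performed by the storage team, so it appears only as a placeholder here:

$ sqlplus -s / as sysdba <<EOF
alter database begin backup;
EOF

$ # placeholder: the storage team splits/establishes the BCV at this point

$ sqlplus -s / as sysdba <<EOF
alter database end backup;
alter system archive log current;
EOF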

One Step Further

On my current site this process has been refined. One of the requirements was that SRDF should be used as the means for disaster recovery. I should say that SRDF, or Symmetrix Remote Data Facility, is my customer’s preferred method for DR, and I by no means want to advertise EMC here; it just so happened that I was working on EMC storage for this project.

I should also note that Data Guard cannot be used due to an application constraint (an ACFS file system is an integral part of the application and must be backed up together with the database).

All storage is managed by ASM, which in many ways makes life easier for me. The ASM LUNs or “disks” carry all the required information in the disk header. So after the cloned LUNs have been presented to the mount host, all I need to do is make them available to the OS and optionally run an “oracleasm scandisks” as root to detect them. From then on I should be able to simply mount the disk group (either via SQL*Plus or srvctl in 11.2). The actual backup requires a few more steps; these are shown below.

Before going further into detail let’s have a look at the architecture first:

Two data centres are in use: the local one is used for production in normal operations, including backups. One of the design requirements was that backups can be taken in either data centre, to cater for a DR situation.

Split mirror backups as shown in the above figure are taken in the local data centre in normal operations. In case of DR, the remote data centre will be configured to take backups. For this to happen, it is necessary to clone the replicated LUNs (which would be activated in a DR event), much as it is done for the local data centre’s split mirror backups. As an added advantage, the clones can be used to create pre- and post-batch “backups” that would be activated in case of a horrible failure of the batch/end-of-year processing.

Taking the Backups

To be able to take a backup of the split mirror, a few things are necessary. Most of these are documented in MOS note “RMAN and Split Mirror Disk Backups [ID 302615.1]”. You certainly require a recovery catalogue database in the first place. As a first step you register the database in the catalog; you perform this step connected to the production database and the recovery catalog, as shown in the example below:

$ rman target / catalog=rman/rmanPwd@rccat

Recovery Manager: Release 11.2.0.2.0 - Production on Fri May 20 10:39:00 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: PROD (DBID=1796248120)
connected to recovery catalog database

RMAN> register database;

database registered in recovery catalog
starting full resync of recovery catalog
full resync complete

RMAN> exit

If you like, you can configure defaults at this stage; I opted for the following (still connected to the primary database and recovery catalog):

RMAN> configure retention policy to redundancy 3;

new RMAN configuration parameters:
CONFIGURE RETENTION POLICY TO REDUNDANCY 3;
new RMAN configuration parameters are successfully stored

RMAN> CONFIGURE DEFAULT DEVICE TYPE TO SBT_TAPE;

new RMAN configuration parameters:
CONFIGURE DEFAULT DEVICE TYPE TO 'SBT_TAPE';
new RMAN configuration parameters are successfully stored

RMAN> configure controlfile autobackup off;

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP OFF;
new RMAN configuration parameters are successfully stored

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE sbt_tape to '%F';

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE 'SBT_TAPE' TO '%F';
new RMAN configuration parameters are successfully stored

RMAN> configure device type sbt_tape parallelism 4 backup type to backupset;

new RMAN configuration parameters:
CONFIGURE DEVICE TYPE 'SBT_TAPE' PARALLELISM 4 BACKUP TYPE TO BACKUPSET;
new RMAN configuration parameters are successfully stored

RMAN> exit

Now switch over to the mount host for some real work.

The high level steps for taking a backup of the “split mirror” are:

  1. Ensure that the cloned LUNs are presented to the mount host’s operating system, including appropriate zoning on the fabric
  2. Ensure that the multi-pathing solution of choice is correctly configured for all paths to the cloned LUNs
  3. Ensure that the ASM disks are known to the local ASM instance. That may include running /etc/init.d/oracleasm scandisks as root, or putting the relevant rules into /etc/udev/rules.d/

These steps are deliberately generic and depend on your OS and storage stack. I won’t go into detail here, but you might find some relevant bits and pieces on my blog if you are interested.
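
On a system using ASMLib, for example, the disk detection in step 3 could boil down to the following two commands, run as root (a sketch only; the second command merely confirms that the cloned disks are now visible):

# /etc/init.d/oracleasm scandisks
# /etc/init.d/oracleasm listdisks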

The next step is to mount the disk group in ASM. Initially this has to be done from the command line; from 11.1 onwards you have to connect as SYSASM:

SQL> alter diskgroup DGName mount;

In Oracle 11.2 this automatically creates a resource for disk group DGName in the OCR, which is very convenient (especially in RAC environments). Next time, all you need to do is execute “srvctl start diskgroup -g DGName” as the grid software owner.

For pre-11.2 environments you might want to consider updating the “asm_diskstring” initialisation parameter accordingly.
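
For example (a sketch only; the discovery strings are assumptions and must match the way the cloned disks are presented on your mount host, and the ASM instance is assumed to use an spfile):

SQL> alter system set asm_diskstring = '/dev/oracleasm/disks/*', '/dev/mapper/clone*';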

Once the ASM disk groups your database requires are mounted, it’s time to mount the database. If you haven’t done so already, register the database in the OCR and ensure that you add the spfile option as well, as in this example (Oracle RAC users would additionally add their instances):

$ srvctl add database -d PROD -o $ORACLE_HOME -c SINGLE -p '+data/PROD/spfilePROD.ora' \
> -s mount -a DATA,ARCH

As per the MOS note referenced above, the database on the mount host must be started using a BACKUP controlfile. Otherwise you’d end up with these RMAN errors after the first backup:

RMAN-3014: Implicit resync of recovery catalog failed
RMAN-6038: Recovery catalog package detected an error
RMAN-20035: Invalid high RECID error

To do so, create a backup controlfile on the primary database before splitting the mirror. The process is very much the same as the one you would use for a physical standby database:

SQL> alter database backup controlfile to '/tmp/backup.ctl';

Database altered.

This controlfile now needs to be made available to the mount host; an elegant way would be to use DBMS_FILETRANSFER for this task, or asmcmd’s copy command.
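
If the two hosts can reach each other over the network, the simplest option is probably a straight file copy, for example (host name and destination directory are assumptions):

$ scp /tmp/backup.ctl oracle@mounthost:/tmp/PROD/backup.ctl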

Once it’s on the mount host, say in /tmp/PROD/backup.ctl, it needs to be made available to Oracle. The easiest way is to use RMAN for this:

Connect to the mirror instance as SYSDBA; do not connect to the recovery catalog.

$ rman target /

Recovery Manager: Release 11.2.0.2.0 - Production on Fri May 20 10:53:40 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: PROD (DBID=1796248120, not open)

RMAN> shutdown immediate

using target database control file instead of recovery catalog
database dismounted
Oracle instance shut down

RMAN> startup nomount
connected to target database (not started)
Oracle instance started

Total System Global Area    9219969024 bytes

Fixed Size                     2234056 bytes
Variable Size               4630513976 bytes
Database Buffers            4563402752 bytes
Redo Buffers                  23818240 bytes

RMAN> restore controlfile from '/tmp/PROD/backup.ctl';

Starting restore at 20-MAY-11
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=200 instance=PROD device type=DISK

channel ORA_DISK_1: copied control file copy
output file name=+ARCH/prod/controlfile/current.256.737725307
output file name=+DATA/prod/controlfile/current.256.737725307
Finished restore at 20-MAY-11

If you were really desperate to back up the CURRENT controlfile (which is part of the clone), you could have done so prior to restoring the backup controlfile. You must not connect to the recovery catalog in this case; see MOS note 1111369.1 for more information.

Querying V$DATABASE.CONTROLFILE_TYPE should now return “BACKUP”. With the setup completed, you are ready to back the database up from the cloned LUNs. Connect to RMAN again, this time using the backup instance and the recovery catalog, and initiate the backup. For example:

run {
allocate channel t1 type sbt_tape parms 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
allocate channel t2 type sbt_tape parms 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
backup database;
}

This completes the backup of the database. One caveat exists: the controlfile is not backed up as part of this process; RMAN doesn’t back up a “backup” controlfile:

$ rman target / catalog rman/rmanPwd@rccat

Recovery Manager: Release 11.2.0.2.0 - Production on Tue May 24 12:51:22 2011

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: GB1 (DBID=1796248120, not open)
connected to recovery catalog database

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP ON;

new RMAN configuration parameters:
CONFIGURE CONTROLFILE AUTOBACKUP ON;
new RMAN configuration parameters are successfully stored

RMAN> run {
2> allocate channel t1 type sbt_tape parms
3> 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
3> backup tablespace USERS;
4> }

allocated channel: t1
channel t1: SID=199 instance=PROD device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.1.0

Starting backup at 24-MAY-11
channel t1: starting full datafile backup set
channel t1: specifying datafile(s) in backup set
input datafile file number=00006 name=+DATA/prod/datafile/users.264.737727697
channel t1: starting piece 1 at 24-MAY-11
channel t1: finished piece 1 at 24-MAY-11
piece handle=1cmd4ott_1_1 tag=TAG20110524T125316
comment=API Version 2.0,MMS Version 5.5.1.0
channel t1: backup set complete, elapsed time: 00:00:01
Finished backup at 24-MAY-11

RMAN-06497: WARNING: control file is not current, control file AUTOBACKUP skipped
released channel: t1

RMAN>  exit

It’s not difficult at all to get around this problem: as part of the regular archive log backups you are performing on the production database anyway, you add a “backup current controlfile” command and a “resync catalog”.
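
A sketch of such an archive log backup run, executed on the production database, could look like this (channel parameters as used above; adjust them to your media manager):

$ rman target / catalog rman/rmanPwd@rccat <<EOF
run {
  allocate channel t1 type sbt_tape parms 'ENV=(TDPO_OPTFILE=/u01/app/oracle/product/admin/PROD/tdp/tdpo_PROD.opt)';
  backup archivelog all;
  backup current controlfile;
}
resync catalog;
exit
EOF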

At this stage I should note that the backup process wouldn’t be different in principle if additional backups were to be taken at the DR site (licensing questions aside). Instead of cloning the primary LUNs in the local data centre, the storage admins would clone the replicated LUNs (the “R2s” in EMC-speak) and bring these clones up on the DR mount host.

Summary

The approach described above was interesting from my personal point of view, as I hadn’t used this concept before. I’m an Oracle DBA and feel most comfortable when I have things in hand; relying on another team for BAU database tasks is a new experience.

The process description deliberately leaves product names out unless they were part of the command output; the concept is quite universal and by no means tied to a specific vendor.

One of the requirements is that your tape solution can talk to the primary host(s) and the mount host in both the local and the remote data centre. I have seen sites (including rather large ones) where the DR site was not configured to take backups. You could argue that you shouldn’t operate from your DR site for long, but that is by no means an excuse for not having backups. Suppose you run into block corruption: how would you recover from that without a backup? Then again, the storage team usually argue that block corruption doesn’t happen on their high-end arrays.

References

  • RMAN and Split Mirror Disk Backups [ID 302615.1]
  • RMAN Backup of Controlfile fails from Split Mirror / BCV Copy. Is there any way to take controlfile backup from Split Mirror / BCV so that the Production DB is not used for Controlfile backup? [ID 1111369.1]
  • Supported Backup, Restore and Recovery Operations using Third Party Snapshot Technologies [ID 604683.1]

A quick word of warning about database PSU 11.2.0.2.2

I am playing around with Grid Infrastructure 11.2.0.2 PSU 2 and found an interesting note on My Oracle Support regarding the Patch Set Update. This reminds me that it’s always a good idea to search for the patch number on Metalink before applying a PSU. It also seems to be a good idea to wait a few days before trying a PSU (or CPU) on your DEV environment for the first time (and don’t even think about applying a PSU to production without thorough testing!).

OK, back to the story: there is a known issue with the patch set, which has to do with the change in mutex wait behaviour introduced by the PSU. To quote MOS note “Oracle Database Patch Set Update 11.2.0.2.2 Known Issues (Doc ID 1291879.1)”, Patch 12431716 is a recommended patch for 11.2.0.2.2. In fact, Oracle strongly recommends applying the patch for Bug 12431716 – Unexpected change in mutex wait behavior in 11.2.0.2.2 PSU (higher CPU possible).

In a nutshell, not applying the patch can cause your system to suffer from excessive CPU usage and more mutex contention than expected. More information can be found in the description of Bug 12431716 “Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU”, which is worth reading.
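
If you are unsure whether the one-off is already in place, a quick way to check is to search the OPatch inventory of the home in question, for example:

$ $ORACLE_HOME/OPatch/opatch lsinventory | grep 12431716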

Besides this, the PSU was applied without any problems to my four-node cluster; I just wish there were a way to roll out a new version of OPatch to all cluster nodes’ $GRID_HOME and $ORACLE_HOME in one command. The overall process for the PSU is the same as already described in my previous post about Bundle Patch 3:

  1. Get the latest version of OPatch
  2. Deploy OPatch to $GRID_HOME and $ORACLE_HOME (ensure permissions are set correctly for the OPatch in $GRID_HOME!)
  3. Unzip the PSU (Bug 11724916 – 11.2.0.2.2 Patch Set Update (PSU) (Doc ID 11724916.8)), for example to /tmp/PSU
  4. Change directory to where you unzipped (/tmp/PSU) and become root
  5. Ensure that $GRID_HOME/OPatch is part of the path
  6. Read the readme
  7. Create an OCM response file and save it to say, /tmp/ocm.rsp
  8. Start the patching as root: run opatch auto and supply the full path to the OCM response file (/tmp/ocm.rsp); see the sketch after this list
  9. Apply the aforementioned one-off patch

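Steps 7 and 8 could look roughly like the following sketch. The paths are the examples from the list above; the location of emocmrsp is an assumption and may differ depending on your OPatch version:

$ # as the software owner: create the OCM response file
$ $ORACLE_HOME/OPatch/ocm/bin/emocmrsp -no_banner -output /tmp/ocm.rsp

# as root, with $GRID_HOME/OPatch first in the PATH:
# opatch auto /tmp/PSU -ocmrf /tmp/ocm.rsp
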
Then you wait; after a little while spent tailing the log file in $GRID_HOME/cfgtoollogs/ and having a coffee, the process eventually finishes. Repeat on each node and you’re done. I’m really happy there aren’t those long readme files anymore with 8 steps to be performed, partially as root, partially as CRS owner/RDBMS owner. It reduces the time it takes to apply a PSU significantly.

Happy patching!

Pro Oracle Database 11g RAC on Linux is available for Kindle

Admittedly I hadn’t checked for a little while, but an email from my co-author Steve Shaw prompted me to go to the Amazon website and look it up.

And yes, it’s real! Our book is now finally available as a Kindle version. How great is that?!?

There isn’t really a lot more to say about this subject. I wonder how many techies are interested in the Kindle version after the PDF has been out for quite a while. If you read this and decide to get the Kindle version, could you please let me know how you liked it? Personally I think the book is well suited to the Amazon reader, as it’s mostly text, which suits the device well.

Happy reading!