
11g Release 2

Beware of ACFS when upgrading to 11.2.0.3

This post is about a potential pitfall when migrating from 11.2.0.x to the next point release. I stumbled over this problem on a two-node cluster.

The operating system is Oracle Linux 5.5 running 11.2.0.2.3, and I wanted to go to 11.2.0.3.0. As you know, Grid Infrastructure upgrades are out-of-place; in other words, they require a separate Oracle home. This is also one of the reasons I wouldn’t want less than 20G for the Grid Infrastructure mount points in anything but a lab environment …

Now when you are upgrading from 11.2.0.x to 11.2.0.3 you need to apply a one-off patch, but make sure it is the correct one! Search for patch number 12539000 (11203:ASM UPGRADE FAILED ON FIRST NODE WITH ORA-03113) and apply the one that matches your version, paying attention to the PSUs! As usual, the obligatory OPatch update has to be performed beforehand as well.
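For reference, the mechanics follow the familiar OPatch routine. Below is a minimal sketch; the staging paths and the 11.2.0.2 Grid home location are illustrative only, and for Grid Infrastructure homes you should follow the patch readme (which may mandate “opatch auto” run as root):

# update OPatch in the home to be patched (OPatch itself is patch 6880880)
$ cd /u01/crs/11.2.0.2
$ unzip -o /tmp/p6880880_112000_Linux-x86-64.zip

# then apply the one-off and verify it is registered in the inventory
$ cd /tmp/12539000
$ /u01/crs/11.2.0.2/OPatch/opatch apply
$ /u01/crs/11.2.0.2/OPatch/opatch lsinventory | grep 12539000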

An interesting problem with ext4 on Oracle Linux 5.5

I have run into an interesting problem with my Red Hat 5.5 installation. Naively I assumed that since ext4 has been around for a long time, it would be stable. For a test I performed for a friend, I created my database files on a file system formatted with ext4 and mounted it the same way I would have mounted an ext3 file system:

$ mount | grep ext4
/dev/mapper/mpath43p1 on /u02/oradata type ext4 (rw)

Now when I tried to create a data file of a certain size within a tablespace, I got block corruption, which I found very interesting. My first thought was: you must have a corrupt file system. So I shut down all processes accessing /u02/oradata and gave the file system a thorough check.
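A thorough check in this case means unmounting the file system and forcing a full fsck; a quick sketch using the device from above:

# umount /u02/oradata
# fsck.ext4 -f /dev/mapper/mpath43p1    # from e4fsprogs on EL5; -f forces a full check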

The tale of restoring the OCR and voting files on Linux for RAC 11.2.0.2

As part of a server move from one data centre to another I enjoyed working in the depths of Clusterware. This was a rather simple case though: the public IP addresses were the only part of the package to change. One caveat was the recreation of the +OCR disk group I am using for the OCR and the 3 copies of the voting file. I decided to rely on the backups I took before the server move.

Once the kit had been rewired in the new data centre, it was time to get active. The /etc/multipath.conf file had to be touched to add the new LUNs for my +OCR disk group. I have described the process in a number of articles, for example here:

http://martincarstenbach.wordpress.com/2011/01/14/adding-storage-dynamic...

A few facts before we start:

Adding another node for RAC 11.2.0.3 on Oracle Linux 6.1 with kernel-UEK

As I hinted at in my last post about installing Oracle 11.2.0.3 on Oracle Linux 6.1 with kernel UEK, I have been planning another article about adding a node to a cluster.

I deliberately started the installation of my RAC system with only one node, so that my moderately spec’d hardware could later cope with the addition of a second cluster node. In previous versions of Oracle there was a problem with node additions: the $GRID_HOME/oui/bin/addNode.sh script performed pre-requisite checks that used to fail when you had used ASMLib. Unfortunately, due to my setup I couldn’t test whether that has been solved (I didn’t use ASMLib).

Cluvfy

Installing Grid Infrastructure 11.2.0.3 on Oracle Linux 6.1 with kernel UEK

Installing Grid Infrastructure 11.2.0.3 on Oracle Linux 6.1

Yesterday was the big day: Oracle released 11.2.0.3 for Linux x86 and x86-64. Time to download and experiment! The following assumes you have already configured RAC 11g Release 2 before; it’s not a step-by-step guide. I expect those to spring up like mushrooms in the next few days, especially since the weekend allows people to do the same as I did!

The Operating System

I have prepared a Xen domU for 11.2.0.3, using the latest Oracle Linux 6.1 build I could find. In summary, I am using the following settings:

  • Oracle Linux 6.1 64-bit
  • Oracle Linux Server-uek (2.6.32-100.34.1.el6uek.x86_64)
  • Initially installed to use the “database server” package group
  • 3 NICs – 2 for the HAIP resource and the private interconnect with IP addresses in the ranges of 192.168.100.0/24 and 192.168.101.0/24. The public NIC is on 192.168.99.0/24
    • Node 1 uses 192.168.(99|100|101).129 for eth0, eth1 and eth2. The VIP uses 192.168.99.130
    • Node 2 uses 192.168.(99|100|101).131 for eth0, eth1 and eth2. The VIP uses 192.168.99.132
    • The SCAN is on 192.168.99.(133|134|135)
    • All naming resolution is done via my dom0 bind9 server
  • I am using an 8GB virtual disk for the operating system, and a 20G LUN for the Oracle Grid and RDBMS homes. The 20G are subdivided into 2 LVs of 10G each, mounted to /u01/app/oracle and /u01/crs/11.2.0.3 (see the LVM sketch after this list). Note you now seem to need 7.5G for GRID_HOME
  • All software is owned by the oracle user
  • Shared storage is provided by the xen blktap driver
    • 3 x 1G LUNs for +OCR containing OCR and voting disks
    • 1 x 10G for +DATA
    • 1 x 10G for +RECO
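For completeness, carving the 20G LUN into the two logical volumes went along these lines; a sketch assuming the LUN appears as /dev/xvdb, with volume group and LV names of my choosing:

# pvcreate /dev/xvdb
# vgcreate oraclevg /dev/xvdb
# lvcreate -L 10G -n oraclebase oraclevg
# lvcreate -l 100%FREE -n gridhome oraclevg
# mkfs -t ext4 /dev/oraclevg/oraclebase
# mkfs -t ext4 /dev/oraclevg/gridhome
# mkdir -p /u01/app/oracle /u01/crs/11.2.0.3
# mount /dev/oraclevg/oraclebase /u01/app/oracle    # plus matching /etc/fstab entries
# mount /dev/oraclevg/gridhome /u01/crs/11.2.0.3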

Configuring Oracle Linux 6.1

Installation of the operating environment is beyond the scope of this article, and it hasn’t really changed much since 5.x. All I did was install the “database server” package group. I have written an article for fans of xen-based para-virtualisation; although initially written for 6.0, it applies equally to 6.1. Here’s the xen native domU description (you can easily convert it to xenstore format using libvirt):

# cat node1.cfg
name="rac11_2_0_3_ol61_node1"
memory=4096
maxmem=8192
vcpus=4
on_poweroff="destroy"
on_reboot="restart"
on_crash="destroy"
localtime=0
builder="linux"
bootargs=""
extra=" "
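# the shared ASM disks below use 'w!' so they can be attached writable to both domUs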
disk=[
'file:/var/lib/xen/images/rac11_2_0_3_ol61_node1/disk0,xvda,w',
'file:/var/lib/xen/images/rac11_2_0_3_ol61_node1/oracle,xvdb,w',
'file:/var/lib/xen/images/rac11_2_0_3_ol61_shared/ocr1,xvdc,w!',
'file:/var/lib/xen/images/rac11_2_0_3_ol61_shared/ocr2,xvdd,w!',
'file:/var/lib/xen/images/rac11_2_0_3_ol61_shared/ocr3,xvde,w!',
'file:/var/lib/xen/images/rac11_2_0_3_ol61_shared/data1,xvdf,w!',
'file:/var/lib/xen/images/rac11_2_0_3_ol61_shared/fra1,xvdg,w!'
]
vif=[
'mac=00:16:1e:2b:1d:ef,bridge=br1',
'mac=00:16:1e:2b:1a:e1,bridge=br2',
'mac=00:16:1e:2a:1d:1f,bridge=br3',
]
bootloader = "pygrub"

Use the “xm create node1.cfg” command to start the domU. After the OS was ready I installed the following additional software to satisfy the installation requirements:

  • compat-libcap1
  • compat-libstdc++-33
  • libstdc++-devel
  • gcc-c++
  • ksh
  • libaio-devel

This is most easily done via yum and the public YUM server Oracle provides, which also has instructions on how to set up your repository.

# yum install compat-libcap1 compat-libstdc++-33 libstdc++-devel gcc-c++ ksh libaio-devel
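Setting the repository up is little more than dropping the repo file into place; a sketch, using the public-yum URL as it was at the time of writing (check the site for the current instructions):

# cd /etc/yum.repos.d
# wget http://public-yum.oracle.com/public-yum-ol6.repo
# then enable the ol6_latest and UEK channels in the file as needed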

On the first node only I wanted a VNC-like interface for a graphical installation. The older vnc-server package I loved from 5.x isn’t available anymore; the package you need is now called tigervnc-server. It also requires a new viewer, to be downloaded from SourceForge. On the first node you might want to install these, unless you are brave enough to use a silent installation:

  • xorg-x11-utils
  • xorg-x11-server-utils
  • twm
  • tigervnc-server
  • xterm

Ensure that SELinux and iptables are turned off. SELinux is still configured in /etc/sysconfig/selinux, where the setting has to be at least permissive. You can use “chkconfig iptables off” to disable the firewall service at boot. Check that there are no filter rules using “iptables -L”.
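In commands, the above boils down to something like this; a lab shortcut, not a security recommendation:

# sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/sysconfig/selinux
# setenforce 0               # takes effect immediately
# chkconfig iptables off     # don't start the firewall at boot
# service iptables stop
# iptables -L                # verify: no filter rules remain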

I created the oracle account using the usual steps; this hasn’t changed since 11.2.0.2.

A few changes to /etc/sysctl.conf were needed; you can copy and paste the example below and append it to your existing settings. Be sure to raise the limits where you have more resources!

kernel.shmall = 4294967296
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 6815744
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
net.ipv4.conf.eth1.rp_filter = 0
net.ipv4.conf.eth2.rp_filter = 0

Also ensure that you change the rp_filter for your private interconnect devices to 0 (or 2); mine are eth1 and eth2. This is a new requirement for reverse path filtering introduced with 11.2.0.3.
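The new settings can be activated without a reboot and verified immediately:

# sysctl -p                             # re-read /etc/sysctl.conf
# sysctl net.ipv4.conf.eth1.rp_filter   # should return 0 (or 2) for the interconnect NICs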

ASM “disks” must be owned by the GRID owner. The easiest way to change the permissions of the ASM disks is to create a new set of udev rules, such as the following:

# cat 61-asm.rules
 KERNEL=="xvd[cdefg]1", OWNER="oracle", GROUP="asmdba", MODE="0660"

After a quick “start_udev” as root these were applied.

Note that, as per my domU config file, I actually know the device names are persistent, so it was easy to come up with this solution. In real life you would use the dm-multipath package, which now allows setting the owner, group and permissions for every ASM LUN in /etc/multipath.conf.
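For illustration only, such a multipath stanza could look like the one below; the WWID is made up, and the uid/gid are the numeric IDs of the oracle user and the asmdba group:

multipaths {
    multipath {
        wwid   360a98000375334754a346b61774f6b45   # hypothetical WWID
        alias  ocr1
        uid    500      # numeric uid of oracle
        gid    501      # numeric gid of asmdba
        mode   0660
    }
}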

There was an interesting problem initially in that kfod seemed to trigger a change of permissions back to root:disk whenever it ran. Changing the ownership back to oracle only lasted until the next execution of kfod. The only fix I could come up with involved the udev rules.

Good news for those who suffered from the multicast problem introduced in 11.2.0.2: cluvfy now knows about it and checks for it during the post hwos stage (I had already installed cvuqdisk):

[oracle@rac11203node1 grid]$ ./runcluvfy.sh stage -post hwos -n rac11203node1

Performing post-checks for hardware and operating system setup

Checking node reachability...
Node reachability check passed from node "rac11203node1"

Checking user equivalence...
User equivalence check passed for user "oracle"

Checking node connectivity...

Checking hosts config file...

Verification of the hosts config file successful

Node connectivity passed for subnet "192.168.99.0" with node(s) rac11203node1
TCP connectivity check passed for subnet "192.168.99.0"

Node connectivity passed for subnet "192.168.100.0" with node(s) rac11203node1
TCP connectivity check passed for subnet "192.168.100.0"

Node connectivity passed for subnet "192.168.101.0" with node(s) rac11203node1
TCP connectivity check passed for subnet "192.168.101.0"

Interfaces found on subnet "192.168.99.0" that are likely candidates for VIP are:
rac11203node1 eth0:192.168.99.129

Interfaces found on subnet "192.168.100.0" that are likely candidates for a private interconnect are:
rac11203node1 eth1:192.168.100.129

Interfaces found on subnet "192.168.101.0" that are likely candidates for a private interconnect are:
rac11203node1 eth2:192.168.101.129

Node connectivity check passed

Checking multicast communication...

Checking subnet "192.168.99.0" for multicast communication with multicast group "230.0.1.0"...
Check of subnet "192.168.99.0" for multicast communication with multicast group "230.0.1.0" passed.

Checking subnet "192.168.100.0" for multicast communication with multicast group "230.0.1.0"...
Check of subnet "192.168.100.0" for multicast communication with multicast group "230.0.1.0" passed.

Checking subnet "192.168.101.0" for multicast communication with multicast group "230.0.1.0"...
Check of subnet "192.168.101.0" for multicast communication with multicast group "230.0.1.0" passed.

Check of multicast communication passed.
Check for multiple users with UID value 0 passed
Time zone consistency check passed

Checking shared storage accessibility...

Disk                                  Sharing Nodes (1 in count)
------------------------------------  ------------------------
/dev/xvda                             rac11203node1
/dev/xvdb                             rac11203node1
/dev/xvdc                             rac11203node1
/dev/xvdd                             rac11203node1
/dev/xvde                             rac11203node1
/dev/xvdf                             rac11203node1
/dev/xvdg                             rac11203node1

Shared storage check was successful on nodes "rac11203node1"

Post-check for hardware and operating system setup was successful.

As always, I tried to fix as many problems as possible before invoking runInstaller. The “-fixup” option to runcluvfy is again very useful. I strongly recommend running the fixup script prior to executing the OUI binary.

The old trick of removing /etc/ntp.conf causes the NTP check to complete OK, in which case you get the CTSS daemon (ctssd) for time synchronisation. You should not do this in production; consistent times in the cluster are paramount!
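Once the stack is up you can check which mode CTSS is running in; with a working NTP configuration it reports observer mode, without one it goes active:

$ crsctl check ctss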

During my first attempts I encountered an issue with the check for free space later in the installation. OUI wants 7.5G for GRID_HOME, even though the installation “only” took around 3G in the end. I exported TMP and TEMP to point to my 10G mount point to avoid this warning:

$ export TEMP=/u01/crs/temp
$ export TMP=/u01/crs/temp
$ ./runInstaller

The installation procedure for Grid Infrastructure 11.2.0.3 is almost exactly the same as for 11.2.0.2, except for the option to change the AU size for the initial disk group you create:


Once you have completed the wizard, it’s time to hit the “install” button. The magic again happens in the root.sh file, or rootupgrade.sh if you are upgrading. I included the root.sh output so you have something to compare against:

Performing root user operation for Oracle 11g

The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME=  /u01/crs/11.2.0.3

Enter the full pathname of the local bin directory: [/usr/local/bin]: Creating y directory...
Copying dbhome to y ...
Copying oraenv to y ...
Copying coraenv to y ...

Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/crs/11.2.0.3/crs/install/crsconfig_params
Creating trace directory
User ignored Prerequisites during installation
OLR initialization - successful
root wallet
root wallet cert
root cert export
peer wallet
profile reader wallet
pa wallet
peer wallet keys
pa wallet keys
peer cert request
pa cert request
peer cert
pa cert
peer root cert TP
profile reader root cert TP
pa root cert TP
peer pa cert TP
pa peer cert TP
profile reader pa cert TP
profile reader peer cert TP
peer user cert
pa user cert
Adding Clusterware entries to upstart
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac11203node1'
CRS-2676: Start of 'ora.mdnsd' on 'rac11203node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac11203node1'
CRS-2676: Start of 'ora.gpnpd' on 'rac11203node1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac11203node1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac11203node1'
CRS-2676: Start of 'ora.gipcd' on 'rac11203node1' succeeded
CRS-2676: Start of 'ora.cssdmonitor' on 'rac11203node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac11203node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac11203node1'
CRS-2676: Start of 'ora.diskmon' on 'rac11203node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac11203node1' succeeded

ASM created and started successfully.

Disk Group OCR created successfully.

clscfg: -install mode specified
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4256: Updating the profile
Successful addition of voting disk 1621f2201ab94f32bf613b17f62982b0.
Successful addition of voting disk 337a3f0b8a2d4f7ebff85594e4a8d3cd.
Successful addition of voting disk 3ae328cce2b94f3bbfe37b0948362993.
Successfully replaced voting disk group with +OCR.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   1621f2201ab94f32bf613b17f62982b0 (/dev/xvdc1) [OCR]
2. ONLINE   337a3f0b8a2d4f7ebff85594e4a8d3cd (/dev/xvdd1) [OCR]
3. ONLINE   3ae328cce2b94f3bbfe37b0948362993 (/dev/xvde1) [OCR]
Located 3 voting disk(s).
CRS-2672: Attempting to start 'ora.asm' on 'rac11203node1'
CRS-2676: Start of 'ora.asm' on 'rac11203node1' succeeded
CRS-2672: Attempting to start 'ora.OCR.dg' on 'rac11203node1'
CRS-2676: Start of 'ora.OCR.dg' on 'rac11203node1' succeeded
CRS-2672: Attempting to start 'ora.registry.acfs' on 'rac11203node1'
CRS-2676: Start of 'ora.registry.acfs' on 'rac11203node1' succeeded
Configure Oracle Grid Infrastructure for a Cluster ... succeeded

That’s it! After returning to the OUI screen you run the remaining assistants and are finally rewarded with the success message.

Better still, I could now log in to SQL*Plus and was rewarded with the new version:

$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.3.0 Production on Sat Sep 24 22:29:45 2011

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL>

Summary

You might remark that only one node has ever been referenced in the output. That is correct: my lab box has limited resources, and I’d like to test the addNode.sh script for each new release, so please be patient! I’m planning an article about upgrading to 11.2.0.3 soon, as well as one about the addition of a node. One thing I noticed was the abnormally high CPU usage of the CSSD processes (ocssd.bin, cssdagent and cssdmonitor), something I find alarming at the moment.

top - 22:53:19 up  1:57,  5 users,  load average: 5.41, 4.03, 3.77
Tasks: 192 total,   1 running, 191 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.2%sy,  0.0%ni, 99.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4102536k total,  3500784k used,   601752k free,    59792k buffers
Swap:  1048568k total,     4336k used,  1044232k free,  2273908k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27646 oracle    RT   0 1607m 119m  53m S 152.0  3.0  48:57.35 /u01/crs/11.2.0.3/bin/ocssd.bin
27634 root      RT   0  954m  93m  55m S 146.0  2.3  31:45.50 /u01/crs/11.2.0.3/bin/cssdagent
27613 root      RT   0  888m  91m  55m S 96.6  2.3  5124095h /u01/crs/11.2.0.3/bin/cssdmonitor
28110 oracle    -2   0  485m  14m  12m S  1.3  0.4   0:34.65 asm_vktm_+ASM1
28126 oracle    -2   0  499m  28m  15m S  0.3  0.7   0:04.52 asm_lms0_+ASM1
28411 root      RT   0  500m 144m  59m S  0.3  3.6  5124095h /u01/crs/11.2.0.3/bin/ologgerd -M -d /u01/crs/11.2.0.3/crf/db/rac11203node1
32394 oracle    20   0 15020 1300  932 R  0.3  0.0  5124095h top
1 root      20   0 19336 1476 1212 S  0.0  0.0   0:00.41 /sbin/init

...

11.2.0.2 certainly didn’t use that much CPU across 4 cores…

Update: I have just repeated the same installation on VirtualBox 4.1.2 with less potent hardware, and funnily enough the CPU problem has disappeared. How is that possible? I need to understand more, and maybe update the Xen host to something more recent.

Oracle 11.2.0.3-can’t be long now!

Update: well, it’s out, actually. See the comments below. However, the certification matrix hasn’t been updated, so it’s anyone’s guess whether Oracle Linux/Red Hat 6 are certified at this point in time.

Tanel Poder already announced it a few days ago: 11.2.0.3 must be ready for release very soon. It has even been spotted on the “latest patchset” page on OTN, only to be removed quickly. After another tweet from Laurent Schneider, it was time to investigate what’s new. The easiest way is to point your browser to tahiti.oracle.com and type “11.2.0.3” into the search box. You are going to find a wealth of new information!

As a RAC person by heart I am naturally interested in RAC features first. The new features I spotted in the Grid Infrastructure installation guide for Linux are listed here:

http://download.oracle.com/docs/cd/E11882_01/install.112/e22489/whatsnew.htm#CEGJBBBB

Additional information for this article was taken from the “New Features” guide:

http://download.oracle.com/docs/cd/E11882_01/server.112/e22487/chapter1_11203.htm#NEWFTCH1-c

So the question is: what’s in it for us?

ASM Cluster File System

As I expected, there is now support for ACFS and ADVM on Oracle’s own kernel. This has been overdue for a while. I remember how surprised I was when I installed RAC on Oracle Linux 5.5 with the UEK kernel, only to see that infamous “…not supported” output when the installer probed the kernel version. Supported kernels are UEK 2.6.32-100.34.1 and subsequent updates to the 2.6.32-100 kernels, on both OL5 and OL6.

A big surprise is support for ACFS on SLES; I thought that was pretty much dead in the water after all that messing around from Novell. ACFS has always worked on SLES 10 up to SP3, but it never did for SLES 11. The requirement is SLES 11 SP1, and it has to be 64-bit.

There are quite a few additional changes to ACFS. For example, it’s now possible to use ACFS replication and tagging on Windows.

Upgrade

If a node fails during the execution of the rootupgrade.sh script while the cluster is being upgraded, the operation can now be completed using the new “force” flag.

Random Bits

I found the following note on MOS regarding the time zone file: Actions For DST Updates When Upgrading To Or Applying The 11.2.0.3 Patchset [ID 1358166.1]. I suggest you have a look at that note, as it mentions a new pre-upgrade script you need to download, plus corrective actions for 11.2.0.1 and 11.2.0.2. I’m sure it’s going to be mentioned in the 11.2.0.3 patch readme as well.

There are also changes expected with the dreaded mutex problem on busy systems: MOS note WAITEVENT: “library cache: mutex X” [ID 727400.1] lists 11.2.0.3 as the release where many of the problems related to this are fixed. Time will tell if they are…

Further enhancements focus on Warehouse Builder and XML. SQL Apply and LogMiner have also been enhanced, which is good news for users of Streams and logical standby databases.

Summary

It is much too early to say anything else about the patchset. Quite a few important documents don’t have a “new features” section yet; that includes the Real Application Clusters Administration and Deployment Guide, which I’ll cover as soon as it’s out. From a quick glance at the still unreleased patchset, it seems to be less of a radical change than 11.2.0.2 was, which is a good thing. Unfortunately the certification matrix hasn’t been updated yet; I am very keen to see support for Oracle/Red Hat Linux 6.x.

Essential tools for Exadata performance experiments

Like I said in a previous post, I have started working in the Exadata performance field, which is really exciting, especially once you get it to work really fast!

Also, finding out what your session is spending time on is important if you are just getting started.

I found the following indispensable tools for Exadata performance analysis:

  • Session Snapper from Tanel Poder, available from his website
  • The Tuning and Diagnostic Pack license
  • The Real Time SQL Monitor package
  • A good understanding of DBMS_XPLAN
  • Oracle 10046 traces
  • collectl – I have blogged about it before here and mention it for reference only

There are probably a lot more than those, but these are the bare essentials. Let’s have a look at them in a bit more detail.

Snapper

I don’t think I have to lose a single word about snapper; it’s an established tool for performance diagnostics taught by Tanel in his advanced troubleshooting course. I use snapper primarily for the analysis of smart scans. Below is an example for a session using serial execution to scan a table (best viewed with Firefox):

SQL> select count(1) from order_items;

SQL> @snapper all 5 1 243
Sampling SID 243 with interval 5 seconds, taking 1 snapshots...
setting stats to all due option = all

-- Session Snapper v3.52 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

-------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
-------------------------------------------------------------------------------------------------------------------------------------
243, SOE       , STAT, session logical reads                                     ,     95.02k,        19k,
243, SOE       , STAT, user I/O wait time                                        ,        331,       66.2,
243, SOE       , STAT, non-idle wait time                                        ,        332,       66.4,
243, SOE       , STAT, non-idle wait count                                       ,      1.51k,      302.2,
243, SOE       , STAT, physical read total IO requests                           ,        776,      155.2,
243, SOE       , STAT, physical read total multi block requests                  ,        746,      149.2,
243, SOE       , STAT, physical read requests optimized                          ,         30,          6,
243, SOE       , STAT, physical read total bytes                                 ,    778.67M,    155.73M,
243, SOE       , STAT, cell physical IO interconnect bytes                       ,    778.67M,    155.73M,
243, SOE       , STAT, consistent gets                                           ,     95.06k,     19.01k,
243, SOE       , STAT, consistent gets from cache                                ,          7,        1.4,
243, SOE       , STAT, consistent gets from cache (fastpath)                     ,          7,        1.4,
243, SOE       , STAT, consistent gets direct                                    ,     95.05k,     19.01k,
243, SOE       , STAT, physical reads                                            ,     95.05k,     19.01k,
243, SOE       , STAT, physical reads direct                                     ,     95.05k,     19.01k,
243, SOE       , STAT, physical read IO requests                                 ,        776,      155.2,
243, SOE       , STAT, physical read bytes                                       ,    778.67M,    155.73M,
243, SOE       , STAT, calls to kcmgcs                                           ,          7,        1.4,
243, SOE       , STAT, file io wait time                                         ,        394,       78.8,
243, SOE       , STAT, Number of read IOs issued                                 ,        776,      155.2,
243, SOE       , STAT, no work - consistent read gets                            ,     95.15k,     19.03k,
243, SOE       , STAT, cell flash cache read hits                                ,         30,          6,
243, SOE       , TIME, DB CPU                                                    ,      1.96s,   391.94ms,    39.2%, |@@@@      |
243, SOE       , TIME, sql execute elapsed time                                  ,         6s,       1.2s,   120.0%, |@@@@@@@@@@|
243, SOE       , TIME, DB time                                                   ,         6s,       1.2s,   120.0%, |@@@@@@@@@@|
243, SOE       , WAIT, direct path read                                          ,       3.3s,   660.48ms,    66.0%, |@@@@@@@   |
243, SOE       , WAIT, kfk: async disk IO                                        ,        5ms,        1ms,      .1%, |          |
--  End of Stats snap 1, end=2011-08-23 16:20:03, seconds=5

-----------------------------------------------------------------------
Active% | SQL_ID          | EVENT                     | WAIT_CLASS
-----------------------------------------------------------------------
74% | 1p7y7pvzmxx3y   | direct path read          | User I/O
26% | 1p7y7pvzmxx3y   | ON CPU                    | ON CPU

--  End of ASH snap 1, end=2011-08-23 16:20:03, seconds=5, samples_taken=47

Watch out for the “cell smart” entries to figure out if a smart scan is happening. In the above example, there wasn’t one.

The same query again, but this time making use of smart scan offloading:

SQL> select /*+full(t)*/ count(1) from order_items t;

SQL> @snapper all 5 1 243
Sampling SID 243 with interval 5 seconds, taking 1 snapshots...
setting stats to all due option = all

-- Session Snapper v3.52 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

-------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
-------------------------------------------------------------------------------------------------------------------------------------
243, SOE       , STAT, session logical reads                                     ,    427.69k,     85.54k,
243, SOE       , STAT, user I/O wait time                                        ,        165,         33,
243, SOE       , STAT, non-idle wait time                                        ,        167,       33.4,
243, SOE       , STAT, non-idle wait count                                       ,      3.52k,        703,
243, SOE       , STAT, enqueue waits                                             ,         12,        2.4,
243, SOE       , STAT, enqueue requests                                          ,         10,          2,
243, SOE       , STAT, enqueue conversions                                       ,         15,          3,
243, SOE       , STAT, enqueue releases                                          ,         10,          2,
243, SOE       , STAT, global enqueue gets sync                                  ,         25,          5,
243, SOE       , STAT, global enqueue releases                                   ,         10,          2,
243, SOE       , STAT, physical read total IO requests                           ,      4.56k,      911.6,
243, SOE       , STAT, physical read total multi block requests                  ,      4.31k,      862.2,
243, SOE       , STAT, physical read total bytes                                 ,       3.5G,     700.6M,
243, SOE       , STAT, cell physical IO interconnect bytes                       ,      1.52G,    303.95M,
243, SOE       , STAT, ges messages sent                                         ,         12,        2.4,
243, SOE       , STAT, consistent gets                                           ,    427.54k,     85.51k,
243, SOE       , STAT, consistent gets from cache                                ,         75,         15,
243, SOE       , STAT, consistent gets from cache (fastpath)                     ,         75,         15,
243, SOE       , STAT, consistent gets direct                                    ,    427.47k,     85.49k,
243, SOE       , STAT, physical reads                                            ,    427.47k,     85.49k,
243, SOE       , STAT, physical reads direct                                     ,    427.47k,     85.49k,
243, SOE       , STAT, physical read IO requests                                 ,      4.56k,        911,
243, SOE       , STAT, physical read bytes                                       ,       3.5G,    700.36M,
243, SOE       , STAT, calls to kcmgcs                                           ,         75,         15,
243, SOE       , STAT, file io wait time                                         ,     17.45k,      3.49k,
243, SOE       , STAT, cell physical IO bytes eligible for predicate offload     ,      3.49G,       699M,
243, SOE       , STAT, cell smart IO session cache lookups                       ,          5,          1,
243, SOE       , STAT, cell smart IO session cache hits                          ,          5,          1,
243, SOE       , STAT, cell physical IO interconnect bytes returned by smart scan,      1.52G,    303.25M,
243, SOE       , STAT, cell session smart scan efficiency                        ,         -3,        -.6,
243, SOE       , STAT, table scans (long tables)                                 ,          5,          1,
243, SOE       , STAT, table scans (direct read)                                 ,          5,          1,
243, SOE       , STAT, table scan rows gotten                                    ,    129.59M,     25.92M,
243, SOE       , STAT, table scan blocks gotten                                  ,    422.53k,     84.51k,
243, SOE       , STAT, cell scans                                                ,          5,          1,
243, SOE       , STAT, cell blocks processed by cache layer                      ,    501.97k,    100.39k,
243, SOE       , STAT, cell blocks processed by txn layer                        ,    501.97k,    100.39k,
243, SOE       , STAT, cell blocks processed by data layer                       ,     426.9k,     85.38k,
243, SOE       , STAT, cell blocks helped by minscn optimization                 ,    501.91k,    100.38k,
243, SOE       , STAT, cell simulated session smart scan efficiency              ,       3.5G,    699.17M,
243, SOE       , STAT, cell IO uncompressed bytes                                ,       3.5G,    699.17M,
243, SOE       , TIME, DB CPU                                                    ,      3.98s,   796.68ms,    79.7%, |@@@@@@@@  |
243, SOE       , TIME, sql execute elapsed time                                  ,      6.31s,      1.26s,   126.2%, |@@@@@@@@@@|
243, SOE       , TIME, DB time                                                   ,      6.31s,      1.26s,   126.2%, |@@@@@@@@@@|
243, SOE       , WAIT, enq: KO - fast object checkpoint                          ,     2.01ms,      402us,      .0%, |          |
243, SOE       , WAIT, cell smart table scan                                     ,      1.65s,   329.33ms,    32.9%, |@@@@      |
243, SOE       , WAIT, events in waitclass Other                                 ,     5.69ms,     1.14ms,      .1%, |          |
--  End of Stats snap 1, end=2011-08-23 16:21:16, seconds=5

-----------------------------------------------------------------------
Active% | SQL_ID          | EVENT                     | WAIT_CLASS
-----------------------------------------------------------------------
66% | adhm3mbjfzysd   | ON CPU                    | ON CPU
34% | adhm3mbjfzysd   | cell smart table scan     | User I/O

--  End of ASH snap 1, end=2011-08-23 16:21:16, seconds=5, samples_taken=44

The “enq: KO - fast object checkpoint” and “cell smart table scan” events are giveaways for a smart scan.

DBMS_XPLAN

Another proven, useful tool in the arsenal of the performance analyst. For Exadata you might want to use the display_cursor function. It takes a sql_id, a child number and a format parameter. I like the format ‘ALL’ best, as it gives me the most information about a statement.

Note that even though you might get the word “storage” in the execution plan, it doesn’t mean you actually see a smart scan in the end! Always check the session wait events and session counters for the word “smart” to make sure one is occurring.

SQL> select * from table(dbms_xplan.display_cursor('3xjhbw9m5u9qu',format=>'ALL'))
2  /

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
SQL_ID  3xjhbw9m5u9qu, child number 0
-------------------------------------
select count(*) from order_items where product_id > 10

Plan hash value: 2209137760

-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                 |       |       |   249K(100)|          |
|   1 |  SORT AGGREGATE               |                 |     1 |     4 |            |          |
|*  2 |   INDEX STORAGE FAST FULL SCAN| ITEM_PRODUCT_IX |   338M|  1293M|   249K  (2)| 00:50:00 |
-------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

1 - SEL$1
2 - SEL$1 / ORDER_ITEMS@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("PRODUCT_ID">10)
filter("PRODUCT_ID">10)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=0) COUNT(*)[22]

31 rows selected.
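Before moving on, here is a quick way to perform the “smart” counter check mentioned above from a second session; a sketch for SID 243 as used throughout this post:

SQL> select s.name, m.value
  2    from v$sesstat m, v$statname s
  3   where m.statistic# = s.statistic#
  4     and m.sid = 243
  5     and s.name like '%smart%'
  6     and m.value > 0;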

This leads me to the next tool, SQL Monitoring, and it rocks.

Real Time SQL Monitoring

This is yet another awesome tool in your repository if you have the license. It allows you to check what’s happening during the execution of a SQL statement, which is more than useful in larger data warehouse style queries.

The easiest and most accessible way to view the report is OEM Grid Control or the database console. I personally like to run the report in a PuTTY session on a large screen. To make it easier to run the report for a given SQL ID I created the script “report.sql” below:

set trim on long 2000000 longchunksize 2000000
set trimspool on  pagesize 0 echo off timing off linesize 1000
select DBMS_SQLTUNE.REPORT_SQL_MONITOR(
sql_id => '&1',
report_level=>'ALL')
from dual;

With this at hand you can query v$session for a SQL_ID and feed it into the script.
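Finding the SQL_ID is a one-liner; a sketch for my SOE user, after which you pass the SQL_ID to the script:

SQL> select sid, sql_id, sql_child_number
  2    from v$session
  3   where username = 'SOE' and status = 'ACTIVE';

SQL> @report 3xjhbw9m5u9qu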

Unfortunately I can’t upload text files to WordPress for security reasons, and posting the output here isn’t possible since the report is rather wide. I decided to run the report with HTML output by using type “EM” instead, and post screenshots. The command I used is shown below:

SQL> select dbms_sqltune.report_sql_monitor(sql_id=>'&1', type=>'EM', report_level=>'+histogram') from dual;

The output of the script would be far too wide, so I moved it into the attached EM-type report, taken here for an insert /*+ append */ into table select ….

You can even use this to look at statistics gathering: in the two examples below I executed dbms_stats; the report was created for a stats gathering session with a DOP of 16.

What I like best about the monitoring report is that it points out where you are in the execution plan, and it provides an estimate of how much work has already been done. More intelligent people have written lots about using these tools; off the top of my head I’d check Greg Rahn’s and Doug Burns’ weblogs for more information.

Oracle 10046 trace

Cary Millsap has written the standard work on extended trace files, and I strongly recommend reading his book and papers on the subject.

Whilst you can run queries in isolation and easily identify them during testing, the matter gets considerably more difficult as soon as you have to trace a session coming from a connection pool or other third-party application. Most of them are not instrumented, making tracing by module or action difficult.

Of course you could use the MRTools to enable tracing on a wider scale and then trawl through the trace files with their help, but it might be difficult to get permission to do so on a busy production system.

The lesson learned for me was how to enable tracing in parallel query: this MOS note has the answer: “Master Note: How to Get a 10046 trace for a Parallel Query [ID 1102801.1]”. It becomes more difficult if your PQ spans instances, but there are ways around this. You should also consider using DBMS_MONITOR to enable session trace if you can’t use “alter session set events” for some reason. Be sure to read all of the note, especially the bit at the bottom which explains the new 11g interface using the “alter session set sql_trace” syntax.
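For reference, the DBMS_MONITOR route looks like this; the SID and serial number are made up and would come from v$session:

SQL> exec dbms_monitor.session_trace_enable(session_id=>243, serial_num=>1234, waits=>true, binds=>false)

SQL> exec dbms_monitor.session_trace_disable(session_id=>243, serial_num=>1234)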

I personally like setting the tracefile identifier to something more easily identifiable. ADRCI has a nice option to show trace files; if your ADR is consolidated on a cluster file system you can simply run “show tracefile -t” within adrci to list them all and pick the ones with your custom tracefile identifier.
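Put together, a short example of both techniques; the identifier is arbitrary:

SQL> alter session set tracefile_identifier = 'MARTIN_TEST';

-- then, on the OS, list matching trace files in time order
$ adrci
adrci> show tracefile %MARTIN_TEST% -t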

What about command line utilities?

In the past I liked to use strace to see what a process was doing while “on CPU”. Unfortunately this isn’t really an option with Exadata, as most of the calls are purely network-related. After all, RDS is a network protocol. Consider this output from strace during a 10-second query (taken on a database node):

#  strace -cp 29031
Process 29031 attached - interrupt to quit
Process 29031 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
88.85    0.033064           2     14635         5 poll
5.47    0.002035           0     14754      7377 setsockopt
2.68    0.000999         999         1           munmap
2.43    0.000906           0     22075      7321 recvmsg
0.56    0.000209           0      7399           sendmsg
0.00    0.000000           0         8         6 read
0.00    0.000000           0         2           write
0.00    0.000000           0         3           mmap
0.00    0.000000           0        60           rt_sigprocmask
0.00    0.000000           0        15         5 rt_sigreturn
0.00    0.000000           0        30           setitimer
0.00    0.000000           0         1           semctl
0.00    0.000000           0       262           getrusage
0.00    0.000000           0       112           times
0.00    0.000000           0         1           io_setup
0.00    0.000000           0         1           io_destroy
0.00    0.000000           0        23           semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00    0.037213                 59382     14714 total

Not too enlightening: all the I/O happens in the cells.

Summary

These are the tools I found very useful for getting started with Exadata performance analysis, and they are very good for the job at hand. Note that snapper can also be used to get information about parallel query, and SQL Monitor is a blessing when it comes to looking at huge parallel queries involving lots of tables.

Why is my Exadata smart scan not offloading?

After receiving the excellent “Expert Oracle Exadata” book I decided to spend some time looking into Exadata performance, having previously spent most of my time on infrastructure-related questions such as “how can I prevent someone from overwriting my DR system with UAT data”, patching, etc.

Now there is one thing to keep in mind with Exadata: you need lots of data before it breaks into even a little sweat. Luckily for me, one of my colleagues had performed some testing on this environment. For reasons unknown, the swingbench order entry benchmark (version 2.3) was chosen. For those who don’t know the OE benchmark: it’s a heavy OLTP-style workload simulating a web shop where users browse products, place orders, etc. OE is optimised for single block I/O, and despite what you may have heard, Exadata doesn’t provide a noticeable benefit for these queries.

Anyway, what I liked was the fact that the order_items table had about 350 million rows organised in about 8GB. From discussions with Frits Hoogland I knew that a full scan of such a table takes between 40 and 60 seconds, depending on system load.

SOE.ORDER_ITEMS

Here are some of the interesting facts around the table:

SQL> select partition_name,num_rows,blocks,avg_space,chain_cnt,global_stats,user_stats,stale_stats
2  from dba_tab_statistics
3  where table_name = 'ORDER_ITEMS'
4  /

PARTITION_NAME                   NUM_ROWS     BLOCKS  AVG_SPACE  CHAIN_CNT GLO USE STA
------------------------------ ---------- ---------- ---------- ---------- --- --- ---
                                 349990815    1144832          0          0 YES NO  NO
SYS_P341                         21866814      71297          0          0 YES NO  NO
SYS_P342                         21889112      72317          0          0 YES NO  NO
SYS_P343                         21877726      71297          0          0 YES NO  NO
SYS_P344                         21866053      71297          0          0 YES NO  NO
SYS_P345                         21870127      71297          0          0 YES NO  NO
SYS_P346                         21887971      72317          0          0 YES NO  NO
SYS_P347                         21875056      71297          0          0 YES NO  NO
SYS_P348                         21891454      72317          0          0 YES NO  NO
SYS_P349                         21883576      72317          0          0 YES NO  NO
SYS_P350                         21859704      71297          0          0 YES NO  NO
SYS_P351                         21866820      71297          0          0 YES NO  NO
SYS_P352                         21865681      71297          0          0 YES NO  NO
SYS_P353                         21865239      71297          0          0 YES NO  NO
SYS_P354                         21870373      71297          0          0 YES NO  NO
SYS_P355                         21882656      71297          0          0 YES NO  NO
SYS_P356                         21872453      71297          0          0 YES NO  NO

17 rows selected.

It has 4 indexes, some of which are reverse key indexes:

SQL> select index_name,index_type from user_indexes where table_name = 'ORDER_ITEMS';

INDEX_NAME                     INDEX_TYPE
------------------------------ ---------------------------
ITEM_ORDER_IX                  NORMAL/REV
ORDER_ITEMS_UK                 NORMAL
ORDER_ITEMS_PK                 NORMAL
ITEM_PRODUCT_IX                NORMAL/REV

SQL>

This was going to be interesting (I have to admit I didn’t initially check for reverse key indexes). All I wanted was a huge table to see if a smart scan really is so mind-blowingly fast. These examples can easily be reproduced by generating the SOE schema with the oewizard; just make sure you select the use of partitioning (you should have a license for it).

Performance Testing

My plan was to start off with serial execution, then use parallel query, and check the execution times. As with all performance tuning of this kind you should have a copy of session snapper from Tanel Poder available. At the time of this writing, the latest version was 3.52, available from Tanel’s blog.

I also wanted to see when a smart scan kicked in. Here’s the first test with serial execution:

10:44:37 SQL> select count(*) from order_items
10:44:40   2  /

COUNT(*)
----------
350749016

Elapsed: 00:00:47.54

OK, that doesn’t look like a smart scan happened; 47 seconds is a little too slow. As always, check using snapper to confirm:

SQL> @snapper all 5 1 243
Sampling SID 243 with interval 5 seconds, taking 1 snapshots...
setting stats to all due option = all

-- Session Snapper v3.52 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

-------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
-------------------------------------------------------------------------------------------------------------------------------------
243, SOE       , STAT, session logical reads                                     ,     81.07k,     16.21k,
243, SOE       , STAT, user I/O wait time                                        ,        299,       59.8,
243, SOE       , STAT, non-idle wait time                                        ,        300,         60,
243, SOE       , STAT, non-idle wait count                                       ,      1.26k,        251,
243, SOE       , STAT, physical read total IO requests                           ,        634,      126.8,
243, SOE       , STAT, physical read total multi block requests                  ,        634,      126.8,
243, SOE       , STAT, physical read total bytes                                 ,    664.14M,    132.83M,
243, SOE       , STAT, cell physical IO interconnect bytes                       ,    664.14M,    132.83M,
243, SOE       , STAT, consistent gets                                           ,     81.07k,     16.21k,
243, SOE       , STAT, consistent gets from cache                                ,          1,         .2,
243, SOE       , STAT, consistent gets from cache (fastpath)                     ,          1,         .2,
243, SOE       , STAT, consistent gets direct                                    ,     81.07k,     16.21k,
243, SOE       , STAT, physical reads                                            ,     81.07k,     16.21k,
243, SOE       , STAT, physical reads direct                                     ,     81.07k,     16.21k,
243, SOE       , STAT, physical read IO requests                                 ,        634,      126.8,
243, SOE       , STAT, physical read bytes                                       ,    664.14M,    132.83M,
243, SOE       , STAT, calls to kcmgcs                                           ,          1,         .2,
243, SOE       , STAT, file io wait time                                         ,        395,         79,
243, SOE       , STAT, Number of read IOs issued                                 ,        635,        127,
243, SOE       , STAT, no work - consistent read gets                            ,      81.1k,     16.22k,
243, SOE       , TIME, DB CPU                                                    ,      1.19s,   237.96ms,    23.8%, |@@@       |
243, SOE       , TIME, sql execute elapsed time                                  ,      4.01s,   801.37ms,    80.1%, |@@@@@@@@  |
243, SOE       , TIME, DB time                                                   ,      4.01s,   801.37ms,    80.1%, |@@@@@@@@  |
243, SOE       , WAIT, direct path read                                          ,      2.99s,   598.63ms,    59.9%, |@@@@@@    |
243, SOE       , WAIT, kfk: async disk IO                                        ,     4.25ms,    849.4us,      .1%, |          |
--  End of Stats snap 1, end=2011-08-17 10:30:47, seconds=5

-----------------------------------------------------------------------
Active% | SQL_ID          | EVENT                     | WAIT_CLASS
-----------------------------------------------------------------------
62% | b0hcgjs21yrq9   | direct path read          | User I/O
38% | b0hcgjs21yrq9   | ON CPU                    | ON CPU

--  End of ASH snap 1, end=2011-08-17 10:30:47, seconds=5, samples_taken=42

PL/SQL procedure successfully completed.

Right, no smart scan; this puzzled me. To recap, smart scans happen only if:

  • Direct path reads are used
  • The row source can be offloaded
  • The parameter cell_offload_processing is set to true (I think it’s “always” in 11.2)
  • There are no chained or migrated rows

Now let’s check these.

I can clearly see from the snapper output that direct path reads have happened: check. It’s also worth remembering that the decision to perform direct path reads is made on a segment basis, and each partition is a segment: keep that in mind!

You can check the execution plan to find out if the row source can be offloaded as in this example:

SQL> select * from table(dbms_xplan.display(format=>'+projection'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2209137760

-----------------------------------------------------------------------------------------
| Id  | Operation                     | Name            | Rows  | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                 |     1 |   249K  (1)| 00:49:52 |
|   1 |  SORT AGGREGATE               |                 |     1 |            |          |
|   2 |   INDEX STORAGE FAST FULL SCAN| ITEM_PRODUCT_IX |   349M|   249K  (1)| 00:49:52 |
-----------------------------------------------------------------------------------------

Column Projection Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------

1 - (#keys=0) COUNT(*)[22]

14 rows selected.

You can see that the INDEX STORAGE FAST FULL SCAN operation is used; in other words, the row source can be offloaded. Strange that it didn’t happen though. The parameter cell_offload_processing was set to true on my system. Do you remember that the index is a reverse key index? I wondered if that had anything to do with it.

To rule out a problem with the direct path reads I set “_serial_direct_read”=true and tried again, but it didn’t make a difference.

Another way to check for smart scans on a live system is v$sql: the columns IO_CELL_UNCOMPRESSED_BYTES and IO_CELL_OFFLOAD_RETURNED_BYTES are cumulative counters for smart scan activity. If they are 0, as in my case, they indicate some sort of issue.
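These counters can be queried directly; a sketch for the statement from the snapper output above:

SQL> select sql_id,
  2         io_cell_offload_eligible_bytes,
  3         io_cell_uncompressed_bytes,
  4         io_cell_offload_returned_bytes
  5    from v$sql
  6   where sql_id = 'b0hcgjs21yrq9';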

This continued with parallel query: 8 or 64 slaves and still no smart scan. I even traced the execution, but there was not a single pxxx trace file with the word “smart” in it (and I made sure I captured the waits).

What was going on?

SQL_ID  4vkkk105tny60, child number 0
-------------------------------------
select /*+parallel(64)*/ count(*) from order_items

Plan hash value: 544438321

--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name            | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                 |       |  4306 (100)|          |        |      |            |
|   1 |  SORT AGGREGATE                   |                 |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR                  |                 |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)            | :TQ10000        |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                |                 |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR            |                 |   349M|  4306   (1)| 00:00:52 |  Q1,00 | PCWC |            |
|*  6 |       INDEX STORAGE FAST FULL SCAN| ITEM_PRODUCT_IX |   349M|  4306   (1)| 00:00:52 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

1 - SEL$1
6 - SEL$1 / ORDER_ITEMS@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

6 - storage(:Z>=:Z AND :Z<=:Z)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=0) COUNT()[22]
2 - SYS_OP_MSR()[10]
3 - (#keys=0) SYS_OP_MSR()[10]
4 - (#keys=0) SYS_OP_MSR()[10]

Note
-----
- automatic DOP: Computed Degree of Parallelism is 64

OK, then I got fed up with that ITEM_PRODUCT_IX, forced a full table scan, and voilà: a smart scan happened.

SQL> select * from table(dbms_xplan.display_cursor('5a1x1v72ujf8s', format=>'ALL'));

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  5a1x1v72ujf8s, child number 0
-------------------------------------
select /*+ full(t) parallel(t, 8) */ count(*) from order_items t

Plan hash value: 661298821

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name        | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |             |       | 43358 (100)|          |       |       |        |      |            |
|   1 |  SORT AGGREGATE                |             |     1 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR               |             |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000    |     1 |            |          |       |       |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |             |     1 |            |          |       |       |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR         |             |   349M| 43358   (1)| 00:08:41 |     1 |    16 |  Q1,00 | PCWC |            |
|*  6 |       TABLE ACCESS STORAGE FULL| ORDER_ITEMS |   349M| 43358   (1)| 00:08:41 |     1 |    16 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

1 - SEL$1
6 - SEL$1 / T@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

6 - storage(:Z>=:Z AND :Z<=:Z)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=0) COUNT()[22]
2 - SYS_OP_MSR()[10]
3 - (#keys=0) SYS_OP_MSR()[10]
4 - (#keys=0) SYS_OP_MSR()[10]

37 rows selected.

To be sure that was correct, I retried with serial execution (check the time):

11:44:10 SQL> select /*+ full(t) single */ count(*) from order_items t;

COUNT(*)
----------
350749016

Elapsed: 00:00:12.02

@snapper all 5 1 241

Sampling SID 243 with interval 5 seconds, taking 1 snapshots...
setting stats to all due option = all

-- Session Snapper v3.52 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

-------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
-------------------------------------------------------------------------------------------------------------------------------------
243, SOE       , STAT, session logical reads                                     ,    482.88k,     96.58k,
243, SOE       , STAT, application wait time                                     ,          1,         .2,
243, SOE       , STAT, user I/O wait time                                        ,        158,       31.6,
243, SOE       , STAT, non-idle wait time                                        ,        159,       31.8,
243, SOE       , STAT, non-idle wait count                                       ,      3.72k,      744.8,
243, SOE       , STAT, enqueue waits                                             ,         15,          3,
243, SOE       , STAT, enqueue requests                                          ,         14,        2.8,
243, SOE       , STAT, enqueue conversions                                       ,         21,        4.2,
243, SOE       , STAT, enqueue releases                                          ,         14,        2.8,
243, SOE       , STAT, global enqueue gets sync                                  ,         35,          7,
243, SOE       , STAT, global enqueue releases                                   ,         14,        2.8,
243, SOE       , STAT, physical read total IO requests                           ,      5.23k,      1.05k,
243, SOE       , STAT, physical read total multi block requests                  ,      4.91k,        982,
243, SOE       , STAT, physical read total bytes                                 ,      3.95G,    790.99M,
243, SOE       , STAT, cell physical IO interconnect bytes                       ,      1.72G,    343.34M,
243, SOE       , STAT, ges messages sent                                         ,         21,        4.2,
243, SOE       , STAT, consistent gets                                           ,    482.88k,     96.58k,
243, SOE       , STAT, consistent gets from cache                                ,         97,       19.4,
243, SOE       , STAT, consistent gets from cache (fastpath)                     ,         97,       19.4,
243, SOE       , STAT, consistent gets direct                                    ,    482.78k,     96.56k,
243, SOE       , STAT, physical reads                                            ,    482.78k,     96.56k,
243, SOE       , STAT, physical reads direct                                     ,    482.78k,     96.56k,
243, SOE       , STAT, physical read IO requests                                 ,      5.23k,      1.05k,
243, SOE       , STAT, physical read bytes                                       ,      3.95G,    790.99M,
243, SOE       , STAT, calls to kcmgcs                                           ,         97,       19.4,
243, SOE       , STAT, file io wait time                                         ,      19.7k,      3.94k,
243, SOE       , STAT, cell physical IO bytes eligible for predicate offload     ,      3.95G,    790.89M,
243, SOE       , STAT, cell smart IO session cache lookups                       ,          7,        1.4,
243, SOE       , STAT, cell smart IO session cache hits                          ,          7,        1.4,
243, SOE       , STAT, cell physical IO interconnect bytes returned by smart scan,      1.72G,    343.38M,
243, SOE       , STAT, table scans (long tables)                                 ,          7,        1.4,
243, SOE       , STAT, table scans (direct read)                                 ,          7,        1.4,
243, SOE       , STAT, table scan rows gotten                                    ,    147.19M,     29.44M,
243, SOE       , STAT, table scan blocks gotten                                  ,    480.07k,     96.01k,
243, SOE       , STAT, cell scans                                                ,          7,        1.4,
243, SOE       , STAT, cell blocks processed by cache layer                      ,    576.41k,    115.28k,
243, SOE       , STAT, cell blocks processed by txn layer                        ,    576.41k,    115.28k,
243, SOE       , STAT, cell blocks processed by data layer                       ,    485.06k,     97.01k,
243, SOE       , STAT, cell blocks helped by minscn optimization                 ,    576.28k,    115.26k,
243, SOE       , STAT, cell simulated session smart scan efficiency              ,      3.97G,     794.6M,
243, SOE       , STAT, cell IO uncompressed bytes                                ,      3.97G,    794.93M,
243, SOE       , TIME, DB CPU                                                    ,      2.63s,   526.72ms,    52.7%, |@@@@@@    |
243, SOE       , TIME, sql execute elapsed time                                  ,      4.01s,   801.13ms,    80.1%, |@@@@@@@@  |
243, SOE       , TIME, DB time                                                   ,      4.01s,   801.13ms,    80.1%, |@@@@@@@@  |
243, SOE       , WAIT, enq: KO - fast object checkpoint                          ,     2.19ms,    438.8us,      .0%, |          |
243, SOE       , WAIT, cell smart table scan                                     ,      1.58s,   316.49ms,    31.6%, |@@@@      |
243, SOE       , WAIT, events in waitclass Other                                 ,     8.54ms,     1.71ms,      .2%, |          |
--  End of Stats snap 1, end=2011-08-17 12:53:04, seconds=5

-----------------------------------------------------------------------
Active% | SQL_ID          | EVENT                     | WAIT_CLASS
-----------------------------------------------------------------------
66% | cfhpsq29gb49m   | ON CPU                    | ON CPU
34% | cfhpsq29gb49m   | cell smart table scan     | User I/O

--  End of ASH snap 1, end=2011-08-17 12:53:04, seconds=5, samples_taken=47

PL/SQL procedure successfully completed.

Confusing, confusing. It was time to ask the experts: Frits Hoogland, who reproduced the behaviour in his environment (11.2.0.1 and 11.2.0.2), and Kerry Osborne, both of whom thought a smart scan should have happened even with the index. An index fast full scan certainly used to be offloadable, and we think there is a regression bug visible with the cells. My version of the cell software is 11.2.2.2 with 11.2.0.1 BP 6. I first thought it might be a problem specific to 11.2.0.1 and that version of the cell software, but it probably isn’t: Kerry tested on 11.2.0.2 BP 10 with cellsrv 11.2.2.3.2, and Frits tested on 11.2.0.1 and 11.2.0.2 BP 6 with cellsrv 11.2.2.2.0_LINUX.X64_101206.2-1.

Summary

What we are seeing isn’t something we should be seeing. I am trying to get this raised as an SR and see what happens. From my testing it seems that it potentially impacts anyone with indexes on their tables.

A day at Cisco’s labs and an introduction to the UCS

I recently had the immense pleasure of visiting Cisco’s labs at Bedfont Lakes for a day of intensive information exchange about their UCS offering. To summarise the day: I was impressed, even more so by the fact that there is more to come. I assume a few more blog posts about UCS will be published here once I have had some time to benchmark the system.

I knew about UCS from a presentation at the UKOUG user group, but it didn’t occur to me at the time how much potential lies behind the technology. This potential is something Cisco sadly fail to make clear on their website, which is otherwise very good once you understand the UCS concept, as it gives you many details about the individual components.

I should stress that I am not paid or otherwise financially motivated to write this article; it’s pure interest in technology that made me write this blog post. Good technology deserves a mention, and that is what I would like to give it.

What is the UCS anyway?

When I mentioned to friends that I was going to see Cisco to have a look at their blade server offering, I got strange looks. Indeed, Cisco hasn’t been known as a manufacturer of blades before; it’s only recently (in industry terms) that they entered the market. However, instead of providing YABE (yet another blade enclosure), they engineered theirs quite nicely.

If you like, the UCS is an appliance-like environment you can use for all sorts of workloads. It can be fitted into a standard 42U rack and currently consists of these components (brackets contain product designations for further reading):

  • Two (clustered) Fabric Interconnects (UCS 6120 or 6140 series) providing 20 or 40 10G ports, each configurable either as an uplink into the core network or as a server link down to a UCS chassis. These ports carry both Ethernet and FCoE traffic from the UCS chassis
  • Two Fabric Extenders (UCS 2100 series), which go into the blade enclosure and provide connectivity up to the Fabric Interconnects. Each UCS 2104 fabric extender (FEX) provides 40Gb/s of bandwidth to the Interconnect, controlled by QoS policies
  • Blade enclosures (UCS 5100 series), which contain 8 half-width or 4 full-width blades
  • Different models of half-width and full-width UCS B-series blades, providing up to 512GB of RAM and Intel Xeon 7500-series processors
  • 10GE adapters, which are Converged Network Adapters (CNAs); in other words, they can carry both Fibre Channel over Ethernet and regular Ethernet traffic

The Fabric Interconnects can take extension modules with Fibre Channel ports to link to an FC switch; no new technology is introduced, and existing arrays can be used. Existing Fibre Channel backup solutions can also be reused.

Another interesting feature is the management software, UCS Manager. It’s integrated into the Fabric Interconnect, using a few gigabytes of flash storage. Not only is it used to manage a huge number of blades, it can also stage firmware for each component. At a suitable time the firmware can be upgraded in a rolling fashion, except (obviously) for the Fabric Interconnects themselves; these, however, can take advantage of the clustering functionality, so that complete firmware upgrades can be undertaken without a system-wide outage.

Fibre Channel over Ethernet

What I was very keen to learn about was the adoption of FCoE in UCS. Ever since it was released, UCS has used FCoE for storage traffic inside the system. I can imagine that this must have been difficult to sell, since FCoE was a very young standard at the time, and probably still is.

For those of you who don’t know FCoE: broadly speaking, it is FC payloads in Ethernet frames. Since Ethernet was never designed to work like Fibre Channel, certain amendments had to be made to the 802.x standards. The so-modified Ethernet is often referred to as Data Centre Ethernet (DCE) or Converged Enhanced Ethernet (CEE). In a way, FCoE competes with established Fibre Channel and emerging alternatives such as iSCSI or even SRP to become the storage transport of the future. History has shown that Ethernet is very resilient and versatile; it might well win the battle for unified connectivity, if implemented correctly. And by unified I mean network and storage traffic on the same wire. I was told that the next generation of UCS will not have dedicated Fibre Channel ports in the fabric switches: all ports are unified, and all you need is the right SFP to attach either a fibre cable or 10G Ethernet.

The Fabric Interconnects in the current version use traditional, but aggregated, 8Gb/s Fibre Channel to reach the storage.

Nice features

UCS introduces the idea of a service profile, which is probably the biggest differentiator between it and other blade solutions. A blade in the enclosure can take any role and configuration you assign to it. It took me a little while to understand this, but an analogy helped: think of a blade as something configurable, similar to a VM; before you can put something on it, you first have to define it. Amongst the things you set are the boot order (SAN boot is highly recommended, and we’ll see why shortly), which VSAN to use, which vNICs to use in which VLAN, and so on. Instead of having to provide the same information over and over again, it’s possible to define pools and templates to draw this information from.

Technicalities aside, once you have defined a service profile (let’s take a RAC node as an example), you assign it to a blade in the enclosure. A few seconds later you’ll see the blade boot from the storage array, and you are done. If the SAN LUNs don’t contain a bootable operating system, you can use the KVM console to create one.

Another nice touch is the use of 10G Ethernet throughout. The two switches do not operate in spanning tree mode, which would limit the uplink speed to 10G (a single path).

There is obviously more, but I think this blog post has already become longer than it should be. I might blog more about the system at a later stage, but not before adding this final section:

Exadata?

The question that immediately springs to mind is: how does it compare to Exadata? Is it Exadata competition? Well, probably not. UCS is a blade system, but it doesn’t feature InfiniBand or the zero-copy iDB protocol. It doesn’t come with its own, more or less directly attached, storage subsystem. It can’t do smart scans, create storage indexes, or perform other cell offloading, and it can’t do EHCC: all of these are exclusive to Exadata.

This can be either good or bad for you; the answer is of course “it depends”. If your workload is highly geared towards DSS and warehousing in general, and you have the requirement to churn through terabytes of data quickly, then Exadata probably is the answer.

If you need to consolidate, say, your SPARC hardware onto x86, a trend I see a lot, then you may not need Exadata; in fact, you might be better off waiting for Exadata to mature some more if you are really after it. UCS scores many points by not breaking completely with traditional data centre operations: you still use a storage array to which you connect the blades. This makes provisioning a test database as simple as cloning the production LUNs. Admittedly you get a lot of throughput from IPoIB as Exadata uses it, but I doubt that an RMAN duplicate from active database is faster than creating a clone on the storage array. UCS also allows you to use storage-level replication such as SRDF or Continuous Access (or whatever name other vendors give it).

Summary

In summary, UCS is a well-engineered blade system using many interesting technologies. I am especially looking forward to FCoE multi-hop, which should be available with UCS 2. Imagine the I/O bandwidth you could get with all these aggregated links!

Disclaimer

Am I an expert on UCS? Nope, not by a long shot. So it could be that certain things described in this blog post are inaccurate. If you spot any, please use the comment option below to make me aware of them!

An introduction to collectl

Some of you may have seen on Twitter that I was working on understanding collectl. So why did I start looking at it? First of all, I was after a tool that records a lot of information on a Linux box. Collectl can also play recorded information back, but that is out of scope for this introduction.

In the past I have used nmon to do similar things, and still love it for what it does. Especially in conjunction with the nmon-analyzer, an Excel plug-in, it can create very impressive reports. How does collectl compare?

Getting collectl

Getting collectl is quite easy: download it from SourceForge at http://sourceforge.net/projects/collectl/

The project website, including very good documentation, is hosted on SourceForge as well, but under a slightly different URL: http://collectl.sourceforge.net/

I suggest you get the architecture-independent RPM and install it on your system. This is all you need to get started! The impatient could type “collectl” at the command prompt now to get some information. Let’s have a look at the output:

$ collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
1   0  1163  10496    113     14     18      4      8     55      5      19
0   0  1046  10544      0      0      2      3    164    195     30      60
0   0  1279  10603    144      9    746    148     20     67     11      19
3   0  1168  10615    144      9    414     69     14     69      5      20
1   0  1121  10416    362     28    225     19     11     71      8      35
Ouch!

The “Ouch!” was caused by my pressing CTRL-C to stop the execution.

Collectl is organised around subsystems; by default it prints aggregated information for the CPU, disk and network subsystems.

If you don’t know what information you are after, you could use the --all flag to display aggregated information across all subsystems. Be warned that you need a large screen for all that output! For even more, add the --verbose flag to --all, and you need at least a 22” screen. As the name suggests, the verbose flag prints more detailed output; for the disk subsystem you can view the difference:

$ collectl -sd -i 5 --verbose
waiting for 5 second sample...

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
162     136     10     15      187      30     19      9
109      24      9     11      566     118     23     24
Ouch!
$ collectl -sd -i 5
waiting for 5 second sample...
#<----------Disks----------->
#KBRead  Reads KBWrit Writes
9865     73    190     23
Ouch!

Each subsystem can be queried individually; the default monitoring interval is 1 second. The man page for collectl lists the following subsystems:

SUMMARY SUBSYSTEMS

b - buddy info (memory fragmentation)
c - CPU
d - Disk
f - NFS V3 Data
i - Inode and File System
j - Interrupts
l - Lustre
m - Memory
n - Networks
s - Sockets
t - TCP
x - Interconnect
y - Slabs (system object caches)

As the name suggests, these subsystems provide summary information. Summaries are OK for a first overview, but don’t forget that the information is aggregated and detail is lost.

From an Oracle point of view I’d probably be most interested in CPU, disk and memory usage. If you are using RAC, network usage is also interesting.
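
A possible invocation covering exactly those areas combines the summary subsystems for CPU, disk, memory and network; the 5-second interval below is chosen arbitrarily:

# CPU, disk, memory and network summaries, sampled every 5 seconds
$ collectl -scdmn -i 5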

Detailed subsystem information is available for these (again taken from the excellent manual page):

C - CPU
D - Disk
E - Environmental data (fan, power, temp),  via ipmitool
F - NFS Data
J - Interrupts
L - Lustre OST detail OR client Filesystem detail
N - Networks
T - 65 TCP counters only available in plot format
X - Interconnect
Y - Slabs (system object caches)
Z - Processes

You can combine subsystems, and you can combine detail and summary information. Bear in mind though that this becomes a lot of information for a PuTTY session or gnome-terminal!
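
As a sketch, the following mixes the lowercase (summary) and uppercase (detail) subsystem letters to show the CPU summary alongside per-disk detail, again with an arbitrary 5-second interval:

# CPU summary plus per-disk detail in one run
$ collectl -scD -i 5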

In interactive mode you might want to consider the --home flag, which does a top-like refresh and prints real-time information without scrolling: very neat!

But even with the --home option, digesting all that information visually can be a bit daunting, which leads me to my next section.
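
A quick sketch of the flag in action, here watching per-disk detail with a refresh every two seconds:

# top-style, non-scrolling view of per-disk statistics
$ collectl -sD -i 2 --home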

Generating graphical output

While all the textual information is nice and good, it is difficult to visualise. Collectl can help you with that as well. All you need to do is generate a file in tab format, which is as simple as adding the -P and -f options. Since you can’t be overwhelmed by the information gathered in a file (unlike on standard out), you could use the detail switches. If you have the luxury, create the file in a directory exported via Samba and analyse it with Excel or other utilities. It’s possible to use gnuplot as well, but I found that a bit lacking for interactive use. The collectl-utils provide a CGI script to analyse collectl files on the host, which can be convenient. Here is an example for measuring disk, memory and network usage with a monitoring interval of 15 seconds. The file will be in “Plot” format (-P) and goes to /export/collectl/plotfiles:

$ collectl -sdmn -i 15 -P -f /export/collectl/plotfiles

Note that you can’t use the verbose flag here, and you shouldn’t pass a file name to the -f switch either; it takes a directory!

The resulting file is called hostname-yyyymmdd.tab. After renaming it to hostname-yyyymmdd.txt, it can quite easily be imported into your favourite spreadsheet application. Imagine all the graphs you could produce with it! The header also contains interesting information:

################################################################################
# Collectl:   V3.5.1-1  HiRes: 1  Options: -sdmn -i 15 -P -f /export/collectl/plotfiles
# Host:       node1 DaemonOpts:
# Distro:     Red Hat Enterprise Linux Server release 5.5 (Tikanga)  Platform:
# Date:       20110805-142647  Secs: 1312550807 TZ: +0100
# SubSys:     dmn Options: z Interval: 1 NumCPUs: 16  NumBud: 0 Flags: i
# Filters:    NfsFilt:  EnvFilt:
# HZ:         100  Arch: x86_64-linux-thread-multi PageSize: 4096
# Cpu:        AuthenticAMD Speed(MHz): 2210.190 Cores: 4  Siblings: 4
# Kernel:     2.6.18-194.el5  Memory: 65990460 kB  Swap: 16809976 kB
# NumDisks:   173 DiskNames: c0d0 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq sdar sdas sdat sdau sdav sdaw sdax sday sdaz sdba sdbb sdbc sdbd sdbe sdbf sdbg sdbh sdbi sdbj sdbk sdbl sdbm sdbn sdbo sdbp sdbq sdbr sdbs sdbt sdbu sdbv sdbw sdbx sdby sdbz sdca sdcb sdcc sdcd sdce sdcf sdcg sdch sdci sdcj sdck sdcl sdcm sdcn sdco sdcp sdcq sdcr sdcs sdct sdcu sdcv sdcw sdcx sdcy sdcz sdda sddb sddc sddd sdde sddf sddg dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6 dm-7 dm-8 dm-9 dm-10 dm-11 dm-12 dm-13 dm-14 dm-15 dm-16 dm-17 dm-18 dm-19 dm-20 dm-21 dm-22 dm-23 dm-24 dm-25 dm-26 dm-27 dm-28 dm-29 dm-30 dm-31 dm-32 dm-33 dm-34 dm-35 dm-36 dm-37 dm-38 dm-39 dm-40 dm-41 dm-42 dm-43 dm-44 dm-45 dm-46 dm-47 dm-48 dm-49 dm-50 dm-51 dm-52 dm-53 dm-54 dm-55 dm-56 dm-57 dm-58 dm-59 dm-60
# NumNets:    8 NetNames: lo: eth0: eth1: eth2: eth3: sit0: bond0: bond1:
# SCSI:       DA:0:00:00: ... DA:2:00:00:00
################################################################################

This should be enough to remind you where and how you ran the test.

Run duration and interval

Use the -i flag to change the monitoring interval, just as you would with sar or iostat/vmstat and the like. You can then either use the -c option to take n samples, or alternatively use -R to run for n weeks, days, hours, minutes or seconds, each abbreviated with its first letter. For example, to run for 15 minutes with samples taken every 15 seconds, you’d say collectl -i 15 -R 15m.
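
Spelled out as commands:

# sample every 15 seconds for 15 minutes
$ collectl -i 15 -R 15m

# alternatively, take exactly 60 samples at the default 1-second interval
$ collectl -c 60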

Quick and dirty

If you need an interactive, top-style overview of what’s going on, you can use the --top flag. This prints output very similar to the top command, but gives you a lot more options to sort on: use collectl --showtopopts to list them. This is so cool that I couldn’t help listing the options here; a usage sketch follows the listing:

$ collectl --showtopopts
The following is a list of --top's sort types which apply to either
process or slab data.  In some cases you may be allowed to sort
by a field that is not part of the display if you so desire

TOP PROCESS SORT FIELDS

Memory
vsz    virtual memory
rss    resident (physical) memory

Time
syst   system time
usrt   user time
time   total time

I/O
rkb    KB read
wkb    KB written
iokb   total I/O KB

rkbc   KB read from pagecache
wkbc   KB written to pagecache
iokbc  total pagecache I/O
ioall  total I/O KB (iokb+iokbc)

rsys   read system calls
wsys   write system calls
iosys  total system calls

iocncl Cancelled write bytes

Page Faults
majf   major page faults
minf   minor page faults
flt    total page faults

Miscellaneous (best when used with --procfilt)
cpu    cpu number
pid    process pid
thread total process threads (not counting main)

TOP SLAB SORT FIELDS

numobj    total number of slab objects
actobj    active slab objects
objsize   sizes of slab objects
numslab   number of slabs
objslab   number of objects in a slab
totsize   total memory sizes taken by slabs
totchg    change in memory sizes
totpct    percent change in memory sizes
name      slab names
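
As promised above, a short usage sketch: sorting the process display by total I/O, optionally narrowed down with --procfilt. I am assuming here that the u<user> form of --procfilt selects processes by owner; check the man page for the exact syntax.

# top-style process display sorted by total I/O in KB
$ collectl --top iokb

# the same, restricted to processes owned by oracle (procfilt syntax
# assumed, see the man page)
$ collectl --top iokb --procfilt uoracle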

Filtering information

Let’s say you are running multiple ASM disk groups in your system, but you are only interested in the performance of disk group DATA. The -sD flag will print the information for all disks (LUNs) in the system; collectl reports disks both as native devices and as dm- devices. For multipathed devices you obviously want to look at the dm- device. You can use the multipath -ll command to map a dm- device to its WWID and ultimately to your disks. Let’s say you found out that the disks you need to look at are /dev/dm-{1,3,5,8}; you can then use the --dskfilt flag, which takes a Perl regex. In my example, I could use the following command to check on those disks:

collectl -sD -c 1 --dskfilt "dm-(1\b|3\b|5\b|8\b)"
waiting for 1 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
dm-1             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-3             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-5             0      0    0    0       0      0    1    1       0     0     0      0    0
dm-8             0      0    0    0       0      0    0    0       0     0     0      0    0
$

Note the “\b” boundary, which is my uneducated way of saying that the expression should match dm-1 but not dm-10, or anything else extending beyond the single digit.
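
If you are unsure which dm- devices belong to your disk group in the first place, the mapping mentioned above can be done along these lines (as root):

# show the multipath topology; each map lists its dm- name next to the WWID
$ multipath -ll

# alternatively, resolve the /dev/mapper aliases to their dm- names
$ ls -l /dev/mapper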

Additional filters can be found in the output of collectl --showsubopts, as well as in the “subsystem options” section of the manpage.

Summary

Used correctly, collectl is the Swiss army knife of system monitoring; the level of detail it can gather is breathtaking. Thanks, Mark Seger! And apologies for all the good stuff I’ve been missing!