New Dtrace book $9.99

UPDATE: the book sale only lasted a couple of days and is over now; the book is back to the regular price, last I checked.

———————————————————-

New Dtrace book $9.99
http://www.informit.com/deals/

As a fan of Active Session History (ASH) I’ve always wanted a version of ASH for operating systems. Operating systems, especially UNIX/Linux and their ilk, have generally had ways of generating basic information such as which process is active in a certain time slice, but getting more detailed information was less than obvious, if it was possible at all. Now DTrace changes all that.

Let’s take an example. What if I wanted to see all the communication and timings between an NFS client and an NFS server? What happens when I do:

dd if=/dev/zero of=foo bs=8k count=1

on an NFS-mounted file system?
The first thing we could do is get the timing, size, and file being accessed:

#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
      printf("%-16s %-18s %2s %-8s %6s %s\n", "TIME(us)",
          "CLIENT", "OP", "OFFSET", "BYTES", "PATHNAME");
}

nfsv3:::op-read-start
{
      rtime[args[1]->noi_xid] = timestamp;
      roffset[args[1]->noi_xid] =  args[2]->offset / 1024;
      rsize[args[1]->noi_xid] =  args[2]->count;
      rname[args[1]->noi_xid] =  args[1]->noi_curpath;
}

nfsv3:::op-read-done
/   rtime[args[1]->noi_xid] /
{
      printf("%-16d %-18s %2s %-8d %6d %s\n", (timestamp - rtime[args[1]->noi_xid])/ 1000,
          args[0]->ci_remote, "R",
          roffset[args[1]->noi_xid] ,
          rsize[args[1]->noi_xid] ,
          args[1]->noi_curpath);
}

nfsv3:::op-write-start
{
      wtime[args[1]->noi_xid] = timestamp;
      woffset[args[1]->noi_xid] =  args[2]->offset / 1024;
      wsize[args[1]->noi_xid] =  args[2]->count;
      wname[args[1]->noi_xid] =  args[1]->noi_curpath;
}

nfsv3:::op-write-done
/   wtime[args[1]->noi_xid] /
{
      printf("%-16d %-18s %2s %-8d %6d %s\n", (timestamp - wtime[args[1]->noi_xid]) / 1000,
          args[0]->ci_remote, "W",
          woffset[args[1]->noi_xid] ,
          wsize[args[1]->noi_xid] ,
          args[1]->noi_curpath);
}

Now if I run the above DTrace script and then kick off the dd, I get output like:

TIME(us)         CLIENT             OP OFFSET    BYTES PATHNAME
1079             172.16.100.194      W 0          8192 /domain0/group0/vdb1/datafile/u01/oracle/oradata/orcl/foo

The above data says that it took about 1ms (1079us) to finish the NFS write operation.
The "dd" command itself reports roughly 5.8ms (5861us):

[oracle@kylelinux orcl]$ dd if=/dev/zero of=foo bs=8k count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.005861 seconds, 140 kB/s

Where did all the extra time come from? One area is the TCP/IP layer that NFS runs on top of.
We can trace that too. Here is a detailed trace of the "dd" at the TCP/IP level:

    delta      send     recd        seq      ack  flags
                    <-- 120          2436     4870 ACK|PUSH|
      198       124 -->              4870     2556 ACK|PUSH|
      514           <-- 0            2556     4994 ACK|
       33           <-- 120          2556     4994 ACK|PUSH|
       37       124 -->              4994     2676 ACK|PUSH|
      280           <-- 116          2676     5118 ACK|PUSH|
       39       116 -->              5118     2792 ACK|PUSH|
      208           <-- 116          2792     5234 ACK|PUSH|
       42       116 -->              5234     2908 ACK|PUSH|
      264           <-- 120          2908     5350 ACK|PUSH|
       59       124 -->              5350     3028 ACK|PUSH|
      224           <-- 152          3028     5474 ACK|PUSH|
      771       148 -->              5474     3180 ACK|PUSH|
      430           <-- 1448         3180     5622 ACK|
      111           <-- 1448         4628     5622 ACK|
      138           <-- 1448         6076     5622 ACK|
      332           <-- 1448         7524     5622 ACK|
      105           <-- 1448         8972     5622 ACK|
       94           <-- 1088          420     5622 ACK|PUSH|
      615       164 -->              5622     1508 ACK|PUSH|
      225           <-- 116          1508     5786 ACK|PUSH|
       57       116 -->              5786     1624 ACK|PUSH|
    39863           <-- 0            1624     5902 ACK|

(Notice that the last ACK looks like a lazy ACK to me, as it arrives well after dd says it has finished.)
How would I get that info? It gets a little more complex, but here is the DTrace script for the TCP level:

#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
#pragma D option dynvarsize=64m

inline int TICKS=$1;
inline string ADDR=$$2;

dtrace:::BEGIN
{
       TIMER = ( TICKS != NULL ) ?  TICKS : 1 ;
       ticks = TIMER;
       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}

tcp:::send,
tcp:::receive
/   title == 0
/
{
      printf("%9s %8s %6s %8s %8s %8s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "flags"
      );
        title=TITLE;
}

tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL )
    && curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
/
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    flags="";
    flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),flags);
    flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),flags);

    printf("%9d %8d %6d %8d --> %8s %8d %8d %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[4]->tcp_seq % 10000,
        args[4]->tcp_ack % 10000,
        flags
      );
    flag=0;
    title--;
}

tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && nfs[args[1]->cs_cid]
/
{
      delta=timestamp-walltime;
      walltime=timestamp;
      flags="";
      flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),flags);
      flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),flags);
      printf("%9d %8d %6d %8s <-- %-8d %8d %8d %s \n",         delta/1000,         args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[4]->tcp_seq % 10000,
        args[4]->tcp_ack % 10000,
        flags
      );
    flag=0;
    title--;
}

Now I can run

sudo nfs_details.d 1 172.16.100.194

to get the above output.
The above output can get pretty overwhelming, so what about outputting the summaries every second?

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option dynvarsize=64m
#pragma D option quiet
inline int TICKS=$1;
inline string ADDR=$$2;

dtrace:::BEGIN
{
       TIMER = ( TICKS != NULL ) ?  TICKS : 1 ;
       ticks = TIMER;
       TITLE = 10;
       title = 0;
       start = 1;
}

/* ***** begin initialization */
/*
    curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
    or could have done
    execname == "nfsd"
*/
tcp:::send
/   (args[2]->ip_daddr == ADDR || ADDR == NULL )&&
    curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
/
{
    printf("execname %s\n",execname);
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    @send_tot_sz[args[2]->ip_daddr] =  sum( args[2]->ip_plength - args[4]->tcp_offset);
    @send_tot_ct[args[2]->ip_daddr] =  count();
}

tcp:::receive, tcp:::send
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && nfs[args[1]->cs_cid]
    && start == 1
/
{
       start =0;
       recd_small_tm_beg[args[1]->cs_cid] = 0;
       recd_small_tm_end[args[1]->cs_cid] = 0;
       recd_small_ct[args[1]->cs_cid]=0;
       recd_small_sz[args[1]->cs_cid]=0;

       recd_big_tm_beg[args[1]->cs_cid] = 0;
       recd_big_tm_end[args[1]->cs_cid] = 0;
       recd_big_ct[args[1]->cs_cid]=0;
       recd_big_sz[args[1]->cs_cid]=0;

       send_big_tm_beg[args[1]->cs_cid] = 0;
       send_big_tm_end[args[1]->cs_cid] = 0;
       send_big_ct[args[1]->cs_cid]=0;
       send_big_sz[args[1]->cs_cid]=0;

       send_small_tm_beg[args[1]->cs_cid] = 0;
       send_small_tm_end[args[1]->cs_cid] = 0;
       send_small_ct[args[1]->cs_cid]=0;
       send_small_sz[args[1]->cs_cid]=0;
}
/* ***** end initialization */

/* ***************** BEG _OUT_  big RTT  TIME ****************** */
tcp:::send
/     (args[2]->ip_daddr == ADDR || ADDR == NULL )
    && curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
    && (args[2]->ip_plength - args[4]->tcp_offset) > 300
/
{
    send_big_unacked[args[1]->cs_cid] = 1;
    send_big_sq[args[1]->cs_cid] = args[4]->tcp_seq;

    send_big_tm_beg[args[1]->cs_cid] =
      send_big_tm_beg[args[1]->cs_cid] == 0
    ? timestamp : send_big_tm_beg[args[1]->cs_cid] ;

    send_big_tm_end[args[1]->cs_cid] = timestamp;

    send_big_sz[args[1]->cs_cid] += (args[2]->ip_plength - args[4]->tcp_offset);
    send_big_ct[args[1]->cs_cid] ++;
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && send_big_unacked[args[1]->cs_cid] == 1
    && nfs[args[1]->cs_cid]
    && (args[2]->ip_plength - args[4]->tcp_offset)  >= 0
/
{
/* kyle
   just take out sequence check
   response seems to be lazy on last ACK , ie slower
   process that asked for data
    && args[4]->tcp_ack > send_big_sq[args[1]->cs_cid]

   if we want to use a beginning and ending sequence :
    recd_sq_beg[args[1]->cs_cid] =
       recd_sq_beg[args[1]->cs_cid] == 0
      ? args[4]->tcp_seq : recd_sq_beg[args[1]->cs_cid] ;

        printf("receive > 0 %d size %d  end %d\n",
              timestamp - send_big_tm_end[args[1]->cs_cid] ,
             (args[2]->ip_plength - args[4]->tcp_offset) ,
              timestamp - send_big_tm_beg[args[1]->cs_cid]
              );

*/
        @rtt_big_tm_beg[args[2]->ip_saddr] =  avg(timestamp - send_big_tm_beg[args[1]->cs_cid]);
        @rtt_big_tm_end[args[2]->ip_saddr] =  avg(timestamp - send_big_tm_end[args[1]->cs_cid]);
        @rtt_big_sz[args[2]->ip_saddr] =  avg(send_big_sz[args[1]->cs_cid]);
        @rtt_big_ct[args[2]->ip_saddr] =  avg(send_big_ct[args[1]->cs_cid]);
        @rtt_big_tot_ct[args[2]->ip_saddr] =  count();

        send_big_sq[args[1]->cs_cid] = 0;
        send_big_unacked[args[1]->cs_cid] = 0;
        send_big_tm_end[args[1]->cs_cid] = 0;
        send_big_tm_beg[args[1]->cs_cid] = 0;
        send_big_sz[args[1]->cs_cid] = 0;
        send_big_ct[args[1]->cs_cid] = 0;
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && send_big_unacked[args[1]->cs_cid] == 1
    && args[4]->tcp_ack > send_big_sq[args[1]->cs_cid]
    && nfs[args[1]->cs_cid]
    && (args[2]->ip_plength - args[4]->tcp_offset) < 0
/
{
        send_big_sq[args[1]->cs_cid] = 0;
        send_big_unacked[args[1]->cs_cid] = 0;
        send_big_tm_end[args[1]->cs_cid] = 0;
        send_big_tm_beg[args[1]->cs_cid] = 0;
        send_big_sz[args[1]->cs_cid] = 0;
        send_big_ct[args[1]->cs_cid] = 0;
}
 /* ***************** END  _OUT_ big RTT  TIME ****************** */

 /* ***************** BEGIN _OUT_  small RTT  TIME ****************** */
tcp:::send
/     (args[2]->ip_daddr == ADDR || ADDR == NULL )
    && curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
    && (args[2]->ip_plength - args[4]->tcp_offset) > 0
    && (args[2]->ip_plength - args[4]->tcp_offset) <= 300
/
{
    send_small_unacked[args[1]->cs_cid] = 1;
    send_small_sq[args[1]->cs_cid] = args[4]->tcp_seq;

    send_small_tm_beg[args[1]->cs_cid] =
      send_small_tm_beg[args[1]->cs_cid] == 0
    ? timestamp : send_small_tm_beg[args[1]->cs_cid] ;

    send_small_tm_end[args[1]->cs_cid] = timestamp;

    send_small_sz[args[1]->cs_cid] += args[2]->ip_plength - args[4]->tcp_offset;
    send_small_ct[args[1]->cs_cid] ++;
}

/*
    && args[4]->tcp_ack > send_small_sq[args[1]->cs_cid]
*/
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && send_small_unacked[args[1]->cs_cid] == 1
    && nfs[args[1]->cs_cid]
    && (args[2]->ip_plength - args[4]->tcp_offset) >   0
/
{
        @rtt_small_tm_beg[args[2]->ip_saddr] =  avg(timestamp - send_small_tm_beg[args[1]->cs_cid]);
        @rtt_small_tm_end[args[2]->ip_saddr] =  avg(timestamp - send_small_tm_end[args[1]->cs_cid]);
        @rtt_small_sz[args[2]->ip_saddr] =  avg(send_small_sz[args[1]->cs_cid]);
        @rtt_small_ct[args[2]->ip_saddr] =  avg(send_small_ct[args[1]->cs_cid]);

        send_small_sq[args[1]->cs_cid] = 0;
        send_small_unacked[args[1]->cs_cid] = 0;
        send_small_tm_end[args[1]->cs_cid] = 0;
        send_small_tm_beg[args[1]->cs_cid] = 0;
        send_small_sz[args[1]->cs_cid] =0;
        send_small_ct[args[1]->cs_cid] =0;
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && send_small_unacked[args[1]->cs_cid] == 1
    && args[4]->tcp_ack > send_small_sq[args[1]->cs_cid]
    && nfs[args[1]->cs_cid]
    && (args[2]->ip_plength - args[4]->tcp_offset) ==   0
/
{
        send_small_sq[args[1]->cs_cid] = 0;
        send_small_unacked[args[1]->cs_cid] = 0;
        send_small_tm_end[args[1]->cs_cid] = 0;
        send_small_tm_beg[args[1]->cs_cid] = 0;
        send_small_sz[args[1]->cs_cid] =0;
        send_small_ct[args[1]->cs_cid] =0;
}
 /* ***************** END  _OUT_ small RTT  TIME ****************** */

/* ***************** START _IN_ SMALL RESPONSE TIME ****************** */
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && nfs[args[1]->cs_cid]
    && (args[2]->ip_plength - args[4]->tcp_offset) >   0
    && (args[2]->ip_plength - args[4]->tcp_offset) <= 300
/
{
       recd_small_unacked[args[1]->cs_cid] = 1;

       recd_small_tm_beg[args[1]->cs_cid] =
          recd_small_tm_beg[args[1]->cs_cid] == 0
        ? timestamp : recd_small_tm_beg[args[1]->cs_cid] ;

       recd_small_sq_end[args[1]->cs_cid] = args[4]->tcp_seq;
       recd_small_tm_end[args[1]->cs_cid] = timestamp;

       @recd_tot_sz[args[2]->ip_saddr] = sum(args[2]->ip_plength - args[4]->tcp_offset);
       @recd_tot_ct[args[2]->ip_saddr] = count();

       recd_small_sz[args[1]->cs_cid] += args[2]->ip_plength - args[4]->tcp_offset;
       recd_small_ct[args[1]->cs_cid] ++;

}
/*
    && args[4]->tcp_ack > recd_small_sq_end[args[1]->cs_cid]
*/
tcp:::send
/      (args[2]->ip_daddr == ADDR || ADDR == NULL )
    && curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
    && recd_small_unacked[args[1]->cs_cid] == 1
    && (args[2]->ip_plength - args[4]->tcp_offset) >   0
/
{
/*
        printf("send > 0 beg %d size %d end %d \n",
                            timestamp - recd_small_tm_beg[args[1]->cs_cid],
                           (args[2]->ip_plength - args[4]->tcp_offset) ,
                            timestamp - recd_small_tm_end[args[1]->cs_cid]
                            );
*/
        @resp_small_tm_beg[args[2]->ip_daddr] =  avg(timestamp - recd_small_tm_beg[args[1]->cs_cid]);
        @resp_small_tm_end[args[2]->ip_daddr] =  avg(timestamp - recd_small_tm_end[args[1]->cs_cid]);
        @resp_small_sz[args[2]->ip_daddr] =  avg(recd_small_sz[args[1]->cs_cid]);
        @resp_small_ct[args[2]->ip_daddr] =  avg(recd_small_ct[args[1]->cs_cid]);
             recd_small_tm_beg[args[1]->cs_cid] = 0;
             recd_small_tm_end[args[1]->cs_cid] = 0;
             recd_small_sq_end[args[1]->cs_cid] = 0;
             recd_small_ct[args[1]->cs_cid]  =0;
             recd_small_sz[args[1]->cs_cid]  =0;
             recd_small_unacked[args[1]->cs_cid] = 0;
}
/*
    && args[4]->tcp_ack > recd_small_sq_end[args[1]->cs_cid]
*/
tcp:::send
/      (args[2]->ip_daddr == ADDR || ADDR == NULL )
    && curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
    && recd_small_unacked[args[1]->cs_cid] == 1
    && (args[2]->ip_plength - args[4]->tcp_offset) ==   0
/
{
             recd_small_tm_beg[args[1]->cs_cid] = 0;
             recd_small_tm_end[args[1]->cs_cid] = 0;
             recd_small_sq_end[args[1]->cs_cid] = 0;
             recd_small_ct[args[1]->cs_cid]  =0;
             recd_small_sz[args[1]->cs_cid]  =0;
             recd_small_unacked[args[1]->cs_cid] = 0;
}
/* ***************** end _IN_ SMALL RESPONSE TIME ****************** */
/* ****** BIG _IN_ RESP ******************* */
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL )
    && nfs[args[1]->cs_cid]
    && (args[2]->ip_plength - args[4]->tcp_offset) >  300
/
{
       recd_big_unacked[args[1]->cs_cid] = 1;

       recd_big_tm_beg[args[1]->cs_cid] =
          recd_big_tm_beg[args[1]->cs_cid] == 0
        ? timestamp : recd_big_tm_beg[args[1]->cs_cid] ;

       recd_big_sq_end[args[1]->cs_cid] = args[4]->tcp_seq;
       recd_big_tm_end[args[1]->cs_cid] = timestamp;

       @recd_tot_sz[args[2]->ip_saddr] = sum(args[2]->ip_plength - args[4]->tcp_offset);
       @recd_tot_ct[args[2]->ip_saddr] = count();

       recd_big_sz[args[1]->cs_cid] += args[2]->ip_plength - args[4]->tcp_offset;
       recd_big_ct[args[1]->cs_cid] ++;

}
/*
    && args[4]->tcp_ack > recd_big_sq_end[args[1]->cs_cid]
*/
tcp:::send
/      (args[2]->ip_daddr == ADDR || ADDR == NULL )
    && curpsinfo->pr_psargs == "/usr/lib/nfs/nfsd"
    && recd_big_unacked[args[1]->cs_cid] == 1
/
{
        @resp_big_tm_beg[args[2]->ip_daddr] =  avg(timestamp - recd_big_tm_beg[args[1]->cs_cid]);
        @resp_big_tm_end[args[2]->ip_daddr] =  avg(timestamp - recd_big_tm_end[args[1]->cs_cid]);
        @resp_big_sz[args[2]->ip_daddr] =  avg(recd_big_sz[args[1]->cs_cid]);
        @resp_big_ct[args[2]->ip_daddr] =  avg(recd_big_ct[args[1]->cs_cid]);
        @resp_big_tot_ct[args[2]->ip_daddr] =  count();
             recd_big_tm_beg[args[1]->cs_cid] = 0;
             recd_big_tm_end[args[1]->cs_cid] = 0;
             recd_big_sq_end[args[1]->cs_cid] = 0;
             recd_big_ct[args[1]->cs_cid]  =0;
             recd_big_sz[args[1]->cs_cid]  =0;
             recd_big_unacked[args[1]->cs_cid] = 0;
}
/* ***************** END _IN_ RESPONSE TIME ****************** */

profile:::tick-1sec
/ ticks > 0 /
{ ticks--; }

profile:::tick-1sec
/ title == 0 /
{
        printf("%-15s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s  \n",
                                                   "client",
                                                   "out_rtt",
                                                   "in_resp",
                                                   "in_bytes",
                                                   "in_cnt",
                                                   "i_avg_sz",
                                                   "i_avg_tm",
                                                   "o_bytes_",
                                                   "out_cnt",
                                                   "o_avg_sz",
                                                   "o_avg_tm"
                                            );
        title=TITLE;
}

profile:::tick-1sec
/ ticks == 0 /
{
/*
*/

        normalize(@recd_tot_ct,TIMER);
        normalize(@send_tot_ct,TIMER);
        normalize(@recd_tot_sz,(TIMER) );
        normalize(@send_tot_sz,(TIMER) );
        normalize(@resp_small_tm_beg,1000);
        normalize(@resp_small_tm_end,1000);
        normalize(@rtt_small_tm_beg,1000);
        normalize(@rtt_small_tm_end,1000);
        normalize(@resp_big_tm_beg,1000);
        normalize(@rtt_big_tm_beg,1000);
        normalize(@resp_big_tm_end,1000);
        normalize(@rtt_big_tm_end,1000);
  /*                    1   2    3    4    5     6   7    8    9   10    */
        printa("%-15s %@8d %@8d %@8d %@8d %@8d %@8d %@8d %@8d %@8d %@8d  \n",
                                                   @rtt_small_tm_end,
                                                   @resp_small_tm_end,
                                                   @recd_tot_sz,
                                                   @recd_tot_ct,
                                                   @resp_big_sz,
                                                   @resp_big_tm_beg,
                                                   @send_tot_sz,
                                                   @rtt_big_sz,
                                                   @send_tot_ct,
                                                   @rtt_big_tm_beg
                                                   );

                                                   trunc(@rtt_big_tot_ct);
                                                   trunc(@resp_big_tot_ct);

                                                   trunc(@send_tot_sz);
                                                   trunc(@recd_tot_sz);
                                                   trunc(@send_tot_ct);
                                                   trunc(@recd_tot_ct);

                                                   trunc(@resp_small_tm_beg);
                                                   trunc(@resp_small_tm_end);
                                                   trunc(@resp_small_sz);
                                                   trunc(@resp_small_ct);
                                                   trunc(@resp_big_tm_beg);
                                                   trunc(@resp_big_tm_end);
                                                   trunc(@resp_big_sz);
                                                   trunc(@resp_big_ct);

                                                   trunc(@rtt_small_tm_beg);
                                                   trunc(@rtt_small_tm_end);
                                                   trunc(@rtt_small_sz);
                                                   trunc(@rtt_small_ct);
                                                   trunc(@rtt_big_tm_beg);
                                                   trunc(@rtt_big_tm_end);
                                                   trunc(@rtt_big_sz);
                                                   trunc(@rtt_big_ct);
        title--;
        ticks = TIMER;
}

Now if I run the above DTrace script and re-run my dd, I get the following output:

client           out_rtt  in_resp in_bytes   in_cnt i_avg_sz i_avg_tm o_bytes_  out_cnt o_avg_sz o_avg_tm
172.16.100.194       764     1968     9188       13     8328     4873     1032        0        8        0

This is telling me that:

  • round trips took 764us
  • server response time to incoming requests was 1968us
  • there were 9188 incoming bytes
  • there were 13 incoming packets
  • the average size of packets above 300 bytes was 8328 bytes
  • the average time to respond to packets above 300 bytes was 4873us

My dd command (in this case) reported 5.8ms (5800us), so the TCP response time to the 8K packet accounts for most of the dd elapsed time.

Deadlock

Here’s an example of a slightly less common data deadlock graph (dumped from 11gR2, in this case):

[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TX-00010006-00002ade        16      34     X             12      50           S
TX-00050029-000037ab        12      50     X             16      34           S
session 34: DID 0001-0010-00000021	session 50: DID 0001-000C-00000024
session 50: DID 0001-000C-00000024	session 34: DID 0001-0010-00000021
Rows waited on:
Session 50: no row
Session 34: no row
Information on the OTHER waiting sessions:
Session 50:
  pid=12 serial=71 audsid=1560855 user: 52/TEST_USER
  O/S info: user: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
            program: sqlplus.exe
  application name: SQL*Plus, hash value=3669949024
  Current SQL Statement:
  update t1 set n3 = 99 where id = 100
End of information on OTHER waiting sessions.
Current SQL statement for this session:
update t1 set n3 = 99 where id = 200

The anomaly is that the waiters are both waiting on S (share) mode locks for a TX enqueue.

It’s fairly well known that Share (and Share sub-exclusive, SSX) lock waits on TM locks are almost a guarantee of a missing “foreign key index”; and it’s also fairly well known that Share lock waits on TX locks can be due to bitmap collisions, ITL (interested transaction list) waits, and various RI (referential integrity) collisions, including simultaneous inserts of the same primary key.

A cause of TX/4 waits that is less well known, or overlooked (because the feature is less widely used), is simple data collisions on IOTs (index organized tables). In the example above t1 is an IOT with a primary key of id. Sessions 34 and 50 have both tried to update the rows with ids 100 and 200, in the opposite order. If this were a normal heap table the deadlock graph would show waits for eXclusive TX locks; because it’s an IOT (and therefore similar in some respects to a primary key wait) we see waits for Share TX locks.
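
For reference, here is a minimal sketch of how this kind of deadlock could be reproduced. The table name, column and ids come from the trace above; the rest (a two-column IOT, 200 rows, the session labels) is just an assumption to keep the example small:

-- t1 as an index organized table keyed on id, as in the trace above
create table t1 (id number primary key, n3 number) organization index;

insert into t1 select rownum, 0 from dual connect by level <= 200;
commit;

-- session 1
update t1 set n3 = 99 where id = 100;

-- session 2
update t1 set n3 = 99 where id = 200;

-- session 1 (blocks, waiting on session 2's TX enqueue)
update t1 set n3 = 99 where id = 200;

-- session 2 (closes the cycle; one session gets ORA-00060 after a few seconds)
update t1 set n3 = 99 where id = 100;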

SCAN Computers Customer Service…

I mentioned in a previous post that I was planning another customer services rant and here it comes.

A few weeks ago I decided to buy a new server from SCAN Computers. Let me start by saying I’ve used SCAN lots of times in the past for components and they’ve always been great. That’s one of the reasons I decided to use them for a pre-built system. I picked the relevant configuration (2x Xeon quad, 24G RAM, 6 Hard drives etc.) and paid my cash. What followed left me with a rather bad taste in my mouth…

  • When I ordered the kit I was given an estimated delivery date of 2 days. I realized this was a rubbish automated message, but I later received a mail to say they were busy and couldn’t start the build for another week.
  • After some more time, I was contacted and told they couldn’t start the build for a little longer because they were waiting for Intel to deliver the server motherboard and chassis.
  • Once the build started I received an email as it passed each checkpoint. Once the build was complete it went into QA, then I received no more notifications.
  • After 7 days I got pissed off and phoned them to ask what was going on. I was then told the build had failed QA and they were trying to rectify the issue in conjunction with Intel.
  • After a few more days I was contacted and told they couldn’t get all the hard drives I ordered. I agreed to have it delivered with just 5 hard drives, rather than wait an unspecified time for the missing disk to arrive.
  • The refund for the missing hard drive has still not arrived at my bank.
  • The next email I got was telling me the server had failed to be delivered because they couldn’t get hold of me. Admittedly, my account had an incorrect mobile number on it, but the land line number was correct. I received no calls on the land line on the days when they allegedly tried to call. They also didn’t attempt the delivery without contact, even though the address on my account was correct and I was in on the day in question.
  • The delivery was finally made 5 days after the first “failed” delivery.
  • I opened the box and found there was no DVD drive installed. I phoned SCAN to ask what was going on. It seems the option “Use DVD Writer included in chassis”, actually means, “No DVD Writer installed”. I kicked up a fuss about this and was sent a DVD drive for free, which I had to install myself. The DVD drive came the next day, along with a data cable, but there was no converter cable for the power, so I had to go out and buy one.
  • The lack of a DVD drive didn’t stop me starting to use the server. I just did a network installation of Fedora 14. In doing so, I noticed the RAID configuration was incorrect. I asked for a RAID 1 system disk and all the other disks just to be left alone. Instead the additional disks were in a RAID 0 set. It took me quite a while and a couple of phone calls to realize that in order to access the RAID configuration, the BIOS had to be switched out of silent boot.

So it’s just short of 2 months down the line, lots of phone calls and a refund for a missing disk still outstanding…

I would like to mention the server itself is awesome and the individuals I dealt with in customer service were nice enough, but the whole process has aged me. When you are sitting in the SCAN phone queue the hold music tells you that you’re a valued customer and SCAN are transparent and aim to provide the best customer experience possible. I think not…

Would I recommend them? From a kit perspective they are cool. You can choose the components you want (within reason) and it has been put together and cabled well. Unfortunately, the kit is only part of the issue. The whole experience has been really poor…

Cheers

Tim…




Three InfiniBand Interface Cards

Continuing from the previous post:

To get beyond 40Gbps, multiple interface cards have to be installed. Two wasn't enough for me, so I bought a motherboard that can take three:

For the CPU I chose an Intel i7-2600K.
With this as the head, splitting the storage cluster into three and accessing it in parallel should easily break 10GBps. Exadata uses two cards, so... this is one more.

Continued in "Waiting for PCI Express 3.0".

How Oracle Result Cache and Smart Scans Work Together

I got an interesting question today from a guy who is reading through the Alpha version of our Exadata Book. I’m paraphrasing, but Ajay basically wanted to know about the interaction between the relatively new Result Cache feature and Smart Scans. The short answer is that they work well together, exactly as you would hope. But let me back up for a minute.

The Result Cache feature was introduced in 11gR1. It’s a pretty cool feature; Tom Kyte calls it JIT MV (just-in-time materialized view).

It basically caches the result set of a query as opposed to blocks from the underlying objects that are necessary to process the query. It is designed for situations where data is relatively static and fairly long running queries are repeated many times. Think of a DW that is updated nightly with some key queries that are repeated several times a day. As you might expect, the performance impact can be dramatic.
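
As an aside, you can check whether the feature is active and what has been cached so far with the standard result cache views and package; a quick sketch (not part of the test below):

-- Is the result cache enabled on this instance?
select dbms_result_cache.status from dual;

-- What result sets are currently cached, and are they still usable?
select type, status, name
  from v$result_cache_objects
 where type = 'Result';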

The question is interesting because Smart Scans read result sets directly into the PGA as opposed to retrieving blocks into the buffer cache. Sound familiar? Anyway, here’s an example of executing an Offloadable statement with a RESULT_CACHE hint:

SYS@SANDBOX> -- flush the result cache in case something is still hanging around
SYS@SANDBOX> exec DBMS_RESULT_CACHE.flush;
 
PL/SQL procedure successfully completed.
 
Elapsed: 00:00:00.07
SYS@SANDBOX> -- run an offloadable statement a few times
SYS@SANDBOX> select avg(pk_col) from kso.skew2 a where col1 > 1;
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:14.34
SYS@SANDBOX> /
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:12.16
SYS@SANDBOX> /
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:13.01
SYS@SANDBOX> -- run the same statement with a result_cache hint
SYS@SANDBOX> select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1;
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:11.97
SYS@SANDBOX> -- check execution statistics
SYS@SANDBOX> @fsx
Enter value for sql_text: %skew2%
Enter value for sql_id: 
 
SQL_ID         CHILD  PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
6fduhxkpdx1mc      0 4220890033      1      11.88      0 Yes          74.57 select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1
8vwynva819s92      0 4220890033      3      13.08      0 Yes          74.56 select avg(pk_col) from kso.skew2 a where col1 > 1
 
Elapsed: 00:00:00.13

So first I flushed all result cache entries from the SGA (it’s stored in the shared pool). Then I ran an offloadable SELECT statement 3 times.

The admittedly simple SQL statement takes between 12 and 14 seconds (the average elapsed time over three executions is 13 seconds). And the first execution with the RESULT_CACHE hint (before the Result Cache has been built) takes roughly the same amount of elapsed time. The fsx.sql script shows that both versions were offloaded and each cut out approximately the same amount of payload. (I’ve written about the fsx.sql script previously here: How to Tune an Exadata)

So clearly, using the RESULT_CACHE hint did not disable Smart Scans.

Now let’s see if Smart Scans disabled the Result Cache. To test this I’ll just run the hinted statement a few more times.

SYS@SANDBOX> 
SYS@SANDBOX> select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1;
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:00.08
 
. . .
 
SYS@SANDBOX> /
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:00.09
SYS@SANDBOX> @fsx
Enter value for sql_text: %skew2%
Enter value for sql_id: 
 
SQL_ID         CHILD  PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
6fduhxkpdx1mc      0 4220890033     10       1.19      0 Yes          74.57 select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1
8vwynva819s92      0 4220890033      3      13.08      0 Yes          74.56 select avg(pk_col) from kso.skew2 a where col1 > 1
 
Elapsed: 00:00:00.12

So as you can see, the execution time dropped from over 10 seconds to less than 0.1 seconds. The average elapsed time for the 10 executions was about 1 second, including the first execution that took about 12 seconds. The OFFLOAD column returned by the fsx script is somewhat misleading in this case, as described in my previous post. It basically tells us if any of the executions was offloaded. Let’s clear the entry from the cursor cache and check the stats again. I’ll do this using a script called flush_sql.sql which uses the DBMS_SHARED_POOL.PURGE procedure. Then I’ll re-run the statement a few times.
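
The flush_sql.sql script itself isn’t reproduced here, but a minimal sketch of the idea, assuming the usual DBMS_SHARED_POOL.PURGE recipe (the substitution-variable handling is illustrative), would look something like this:

-- flush_sql.sql-style sketch: purge one cursor from the shared pool
column cursor_name new_value cursor_name

select address || ',' || hash_value as cursor_name
  from v$sqlarea
 where sql_id = '&&sql_id';

-- the 'C' flag tells DBMS_SHARED_POOL.PURGE the object is a cursor
exec dbms_shared_pool.purge('&cursor_name', 'C');

undefine sql_id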

 
SYS@SANDBOX> @flush_sql
Enter value for sql_id: 6fduhxkpdx1mc
 
PL/SQL procedure successfully completed.
 
Elapsed: 00:00:00.07
SYS@SANDBOX> select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1;
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:00.10
 
. . . 
 
SYS@SANDBOX> /
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:00.07
SYS@SANDBOX> @fsx
Enter value for sql_text: %skew2%
Enter value for sql_id: 
 
SQL_ID         CHILD  PLAN_HASH  EXECS  AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------------------------------------------
6fduhxkpdx1mc      0 4220890033      5        .00      0 No             .00 select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1
8vwynva819s92      0 4220890033      3      13.08      0 Yes          74.56 select avg(pk_col) from kso.skew2 a where col1 > 1
 
Elapsed: 00:00:00.12
 
SYS@SANDBOX> @dplan
Enter value for sql_id: 6fduhxkpdx1mc
Enter value for child_no: 
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  6fduhxkpdx1mc, child number 0
-------------------------------------
select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1
 
Plan hash value: 4220890033
 
----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |       |       |   178K(100)|          |
|   1 |  RESULT CACHE               | 2gq0a1pgm8yk88cqvfxtxhchny |       |       |            |          |
|   2 |   SORT AGGREGATE            |                            |     1 |    11 |            |          |
|*  3 |    TABLE ACCESS STORAGE FULL| SKEW2                      |   127M|  1342M|   178K  (1)| 00:35:42 |
----------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - storage("COL1">1)
       filter("COL1">1)
 
Result Cache Information (identified by operation id):
------------------------------------------------------
 
   1 -

So it’s clear from the elapsed time and the fsx script and the xplan output that the statement used the Result Cache.

Cool, Smart Scan the first time, Result Cache after that.

Let’s see if it behaves as expected after the data has been changed (it should revert to a Smart Scan, which should also rebuild the Result Cache for use on subsequent executions). And while we’re at it, let’s see how long the statement takes to execute without Smart Scan (I’ll do that first, actually).

SYS@SANDBOX> -- turn offloading off
SYS@SANDBOX> alter session set cell_offload_processing=false;
 
Session altered.
 
Elapsed: 00:00:00.06
SYS@SANDBOX> select avg(pk_col) from kso.skew2 a where col1 > 1;
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:52.09
SYS@SANDBOX> -- turn offloading back on
SYS@SANDBOX> alter session set cell_offload_processing=true;
 
Session altered.
 
Elapsed: 00:00:00.06
SYS@SANDBOX> @flush_sql
Enter value for sql_id: 6fduhxkpdx1mc
 
PL/SQL procedure successfully completed.
 
Elapsed: 00:00:00.08
SYS@SANDBOX> -- change the data
SYS@SANDBOX> update kso.skew2 set col1 =col1*1 where rownum < 10
SYS@SANDBOX> /
 
9 rows updated.
 
Elapsed: 00:00:00.07
SYS@SANDBOX> commit;
 
Commit complete.
 
Elapsed: 00:00:00.06
SYS@SANDBOX>  -- first run after data changed
SYS@SANDBOX> select /*+ result_cache */ avg(pk_col) from kso.skew2 a where col1 > 1;
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:10.60
SYS@SANDBOX> /
 
AVG(PK_COL)
-----------
   16049999
 
Elapsed: 00:00:00.09

So the statement takes almost a minute when executed without Smart Scan (52 seconds). I then updated a few records and re-ran the statement. As expected the Result Cache entry was invalidated and the statement reverted to a Smart Scan. I didn’t show any proof that a Smart Scan was used, but take my word for it (the execution time was only 10.6 seconds, not 52) and apparently it rebuilt the Result Cache because the next execution took about 0.1 seconds.
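
A simple way to confirm the invalidation itself, not shown above, is to check the status of the cached result objects before and after the DML:

-- 'Published' results flip to 'Invalid' once a dependency is modified
select status, count(*)
  from v$result_cache_objects
 where type = 'Result'
 group by status;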

So there you have it. The result cache appears to work well with Smart Scans.

Reminder and Public Appearances 2011

First, a reminder: my Advanced Oracle Troubleshooting v2.0 online seminar starts next week. This is your last chance to sign up; I can accept registrations until Sunday :-)

I won’t do another AOT seminar before Oct (or Nov) this year. More details and sign-up here:

I have rescheduled my Advanced SQL Tuning and Partitioning & Parallel Execution for Performance seminars too. I will do them in September/October. Unfortunately I’m too busy right now to do them before the summer.

Public Appearances:

  • I will be speaking at the UKOUG Exadata Special Event in London on 18th April
  • I have submitted a few papers for Oracle OpenWorld in San Francisco as well (end of Sep/beginning of Oct), all about Exadata. Let’s see how it goes, but I’ll be there anyway, which means that I’ll probably show up at the Oracle Closed World event too!

And that’s all the travel I will do this year…

Virtual Conferences:

I’ll soon announce the 2nd EsSN virtual conference too ;-)

Free online stuff:

Perhaps in a month or so I will do another hacking session (I’ll plan 2 hours this time, 1 hour isn’t nearly enough for going deep). The topic will probably be about low-level details of SQL plan execution internals… stay tuned!

ASM normal redundancy and failure groups spanning SANs

Julian Dyke has started an interesting thread on the Oak Table mailing list after the latest UKOUG RAC and HA SIG. Unfortunately I couldn’t attend that event; I wish I had, as I knew it would be great.

Anyway, the question revolved around an ASM disk group created with normal redundancy spanning two storage arrays. In theory this should protect against the failure of an array, although at a high price: all the ASM disks exported from one array form a single failure group. Remember that the disks in a failure group all fail if the supporting infrastructure (network, HBA, controller etc.) fails. So what would happen with such a setup if you followed these steps:

  • Shut down the array for failure group 2 (failgroup B)
  • Stop the database
  • Shut down the second array, failure group 1 (failgroup A)
  • Do some more maintenance…
  • Start up the failgroup B SAN
  • Start the database
  • Start up the failgroup A SAN

ASM can tolerate the failure of one failgroup (capacity permitting), so the failure of failgroup 2 should not bring the disk group down, which would mean an immediate loss of service. But what happens if that failgroup comes back up after the data in the other failure group has been modified? Will there be data corruption?
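
“Capacity permitting” is the key phrase here; a quick sanity check (standard V$ASM_DISKGROUP columns, not something from the original thread) looks like this:

-- USABLE_FILE_MB already subtracts the space ASM needs to re-mirror
-- everything after losing one failure group; it should stay above zero
select name, type, total_mb, free_mb,
       required_mirror_free_mb, usable_file_mb
  from v$asm_diskgroup;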

Replaying

To simulate two storage arrays, my distinguished filer01 and filer02 OpenFiler appliances have been used, each exporting two approximately 4G “LUNs” to my database host. At the time I only had access to my 11.1.0.7 two-node RAC system; if time permits I’ll repeat this with 10.2.0.5 and 11.2.0.2. The RAC cluster in the SIG presentation was 10.2. I am skipping the bit about LUN creation and presentation to the hosts, and assume the following setup:

[root@rac11gr1node1 ~]# iscsiadm -m session
tcp: [1] 192.168.99.51:3260,1 iqn.2006-01.com.openfiler:filer02DiskB
tcp: [2] 192.168.99.50:3260,1 iqn.2006-01.com.openfiler:filer01DiskA
tcp: [3] 192.168.99.51:3260,1 iqn.2006-01.com.openfiler:filer02DiskA
tcp: [4] 192.168.99.50:3260,1 iqn.2006-01.com.openfiler:filer01DiskB

192.168.99.50 is my first OpenFiler instance, 192.168.99.51 the second. As you can see, each exports DiskA and DiskB. Mapped to the host, this is the target mapping (use iscsiadm --mode session --print 3 to find out):

  • filer02DiskB: /dev/sda
  • filer01DiskA: /dev/sdb
  • filer02DiskA: /dev/sdc
  • filer01DiskB: /dev/sdd

I am using ASMLib (as always in the lab) to label these disks:

[root@rac11gr1node1 ~]# oracleasm listdisks
DATA1
DATA2
FILER01DISKA
FILER01DISKB
FILER02DISKA
FILER02DISKB

DATA1 and DATA2 will not play a role in this article, I’m interested in the other disks. Assuming that the scandisks command completed on all nodes, I can add the disks to the new diskgroup:

SQL> select path from v$asm_disk

PATH
--------------------------------------------------------------------------------
ORCL:FILER01DISKA
ORCL:FILER01DISKB
ORCL:FILER02DISKA
ORCL:FILER02DISKB
ORCL:DATA1
ORCL:DATA2

Let’s create the diskgroup. The important part is to create one failure group per storage array. By the way, this is no different from extended distance RAC!

SQL> create diskgroup fgtest normal redundancy
 2  failgroup filer01 disk 'ORCL:FILER01DISKA', 'ORCL:FILER01DISKB'
 3  failgroup filer02 disk 'ORCL:FILER02DISKA', 'ORCL:FILER02DISKB'
 4  attribute 'compatible.asm'='11.1';

Diskgroup created.

With that done let’s have a look at the asm disk information:

SQL> select MOUNT_STATUS,HEADER_STATUS,STATE,REDUNDANCY,FAILGROUP,PATH from v$asm_disk where group_number=2;

MOUNT_S HEADER_STATU STATE    REDUNDA FAILGROUP                      PATH
------- ------------ -------- ------- ------------------------------ --------------------
CACHED  MEMBER       NORMAL   UNKNOWN FILER01                        ORCL:FILER01DISKA
CACHED  MEMBER       NORMAL   UNKNOWN FILER01                        ORCL:FILER01DISKB
CACHED  MEMBER       NORMAL   UNKNOWN FILER02                        ORCL:FILER02DISKA
CACHED  MEMBER       NORMAL   UNKNOWN FILER02                        ORCL:FILER02DISKB

I have set the disk repair time to 24 hours and raised compatible parameters for RDBMS and ASM to 11.1, resulting in these attributes:

SQL> select * from v$asm_attribute

NAME                           VALUE                GROUP_NUMBER ATTRIBUTE_INDEX ATTRIBUTE_INCARNATION READ_ON SYSTEM_
------------------------------ -------------------- ------------ --------------- --------------------- ------- -------
disk_repair_time               3.6h                            2               0                     1 N       Y
au_size                        1048576                         2               8                     1 Y       Y
compatible.asm                 11.1.0.0.0                      2              20                     1 N       Y
compatible.rdbms               11.1.0.0.0                      2              21                     1 N       Y
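
For reference, these attributes are changed with the standard ALTER DISKGROUP syntax; a sketch with illustrative values:

alter diskgroup fgtest set attribute 'disk_repair_time' = '24h';
alter diskgroup fgtest set attribute 'compatible.asm'   = '11.1';
alter diskgroup fgtest set attribute 'compatible.rdbms' = '11.1';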

Unlike 11.2, where disk groups are managed as Clusterware resources, 11.1 requires you to either mount them manually or append the new disk group to ASM_DISKGROUPS. You should query gv$asm_diskgroup.state to ensure the new diskgroup is mounted on all cluster nodes.
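
A minimal sketch of that manual step on 11.1 (the 'DATA' group in the parameter value is an assumption about the existing configuration):

-- mount the new group on this ASM instance
alter diskgroup fgtest mount;

-- make it mount automatically at the next ASM startup ('DATA' is assumed)
alter system set asm_diskgroups = 'DATA', 'FGTEST' scope=spfile sid='*';

-- confirm it is mounted on every node
select inst_id, name, state from gv$asm_diskgroup where name = 'FGTEST';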

I need some data! A small demo database can be restored to the new disk group to provide an experimental playground. This is quite easily done using an RMAN duplicate with the correct {db|log}_file_name_convert parameters set.

Mirror

The diskgroup is created with normal redundancy, which means that ASM will create a mirror for every primary extent, taking failure groups into consideration. I wanted to ensure that the data is actually mirrored on the new disk group, which has group number 2. I need to get this information from V$ASM_FILE and V$ASM_ALIAS:

SQL> select * from v$asm_file where group_number = 2

GROUP_NUMBER FILE_NUMBER COMPOUND_INDEX INCARNATION BLOCK_SIZE     BLOCKS      BYTES      SPACE TYPE                 REDUND STRIPE CREATION_ MODIFICAT R
------------ ----------- -------------- ----------- ---------- ---------- ---------- ---------- -------------------- ------ ------ --------- --------- -
 2         256       33554688   747669775      16384       1129   18497536   78643200 CONTROLFILE          HIGH   FINE   05-APR-11 05-APR-11 U
 2         257       33554689   747669829       8192      69769  571547648 1148190720 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         258       33554690   747669829       8192      60161  492838912  990904320 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         259       33554691   747669829       8192      44801  367009792  739246080 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         260       33554692   747669831       8192      25601  209723392  424673280 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         261       33554693   747669831       8192        641    5251072   12582912 DATAFILE             MIRROR COARSE 05-APR-11 05-APR-11 U
 2         262       33554694   747670409        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         263       33554695   747670409        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         264       33554696   747670417        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         265       33554697   747670417        512     102401   52429312  120586240 ONLINELOG            MIRROR FINE   05-APR-11 05-APR-11 U
 2         266       33554698   747670419       8192       2561   20979712   44040192 TEMPFILE             MIRROR COARSE 05-APR-11 05-APR-11 U

11 rows selected.

SQL> select * from v$asm_alias where group_NUMBER=2

NAME                           GROUP_NUMBER FILE_NUMBER FILE_INCARNATION ALIAS_INDEX ALIAS_INCARNATION PARENT_INDEX REFERENCE_INDEX A S
------------------------------ ------------ ----------- ---------------- ----------- ----------------- ------------ --------------- - -
RAC11G                                    2  4294967295       4294967295           0                 3     33554432        33554485 Y Y
CONTROLFILE                               2  4294967295       4294967295          53                 3     33554485        33554538 Y Y
current.256.747669775                     2         256        747669775         106                 3     33554538        50331647 N Y
DATAFILE                                  2  4294967295       4294967295          54                 1     33554485        33554591 Y Y
SYSAUX.257.747669829                      2         257        747669829         159                 1     33554591        50331647 N Y
SYSTEM.258.747669829                      2         258        747669829         160                 1     33554591        50331647 N Y
UNDOTBS1.259.747669829                    2         259        747669829         161                 1     33554591        50331647 N Y
UNDOTBS2.260.747669831                    2         260        747669831         162                 1     33554591        50331647 N Y
USERS.261.747669831                       2         261        747669831         163                 1     33554591        50331647 N Y
ONLINELOG                                 2  4294967295       4294967295          55                 1     33554485        33554644 Y Y
group_1.262.747670409                     2         262        747670409         212                 1     33554644        50331647 N Y
group_2.263.747670409                     2         263        747670409         213                 1     33554644        50331647 N Y
group_3.264.747670417                     2         264        747670417         214                 1     33554644        50331647 N Y
group_4.265.747670417                     2         265        747670417         215                 1     33554644        50331647 N Y
TEMPFILE                                  2  4294967295       4294967295          56                 1     33554485        33554697 Y Y
TEMP.266.747670419                        2         266        747670419         265                 1     33554697        50331647 N Y

My USERS tablespace, the one I am most interested in, has file number 261. I chose it for this example as it’s only 5M in size; taking my 1 MB allocation unit into account, that means I don’t have to trawl through thousands of lines of output when getting the extent map.

Credit where credit is due: the next queries are partly based on the excellent work by Luca Canali from CERN, who has looked at ASM internals for a while. Make sure you have a look at the excellent reference available here: https://twiki.cern.ch/twiki/bin/view/PDBService/ASM_Internals. So, to see which extents make up my USERS tablespace, we need to have a look at X$KFFXP, i.e. the file extent pointers view:

SQL> select GROUP_KFFXP,DISK_KFFXP,AU_KFFXP from x$kffxp where number_kffxp=261 and group_kffxp=2 order by disk_kffxp;

GROUP_KFFXP DISK_KFFXP   AU_KFFXP
----------- ---------- ----------
 2          0        864
 2          0        865
 2          0        866
 2          1        832
 2          1        831
 2          1        833
 2          2        864
 2          2        866
 2          2        865
 2          3        832
 2          3        833
 2          3        831

12 rows selected.

As you can see, I have a number of extents, all evenly spread over my disks. I can verify that this information is correct by querying the X$KFDAT view as well, which contains similar information but is more related to the disk-to-AU mapping:

SQL> select GROUP_KFDAT,NUMBER_KFDAT,AUNUM_KFDAT from x$kfdat where fnum_kfdat = 261 and group_kfdat=2

GROUP_KFDAT NUMBER_KFDAT AUNUM_KFDAT
----------- ------------ -----------
 2            0         864
 2            0         865
 2            0         866
 2            1         831
 2            1         832
 2            1         833
 2            2         864
 2            2         865
 2            2         866
 2            3         831
 2            3         832
 2            3         833

12 rows selected.

OK, so I am confident that my data is actually mirrored; otherwise the following test would not make any sense. I have double-checked that the disks in failgroup FILER01 actually belong to my OpenFiler “filer01”, and the same for filer02. Going back to the original scenario:

Shut down Filer02

This will take down all the disks of failure group FILER02 (failgroup B in the scenario above). Two minutes after taking the filer down I checked that it was indeed shut down:

martin@dom0:~> sudo xm list | grep filer
filer01                                    183   512     1     -b----   1159.6
filer02                                          512     1              1179.6
filer03                                    185   512     1     -b----   1044.4

Yes, no doubt about it: it’s down. What would the effect be? Surely I/O errors, but I wanted to force a check, so connected to +ASM2 I issued a “select * from v$asm_disk” command. This caused quite significant logging in the instance’s alert.log:

NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
Wed Apr 06 17:17:39 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61330 bufp:0x2b29977b3e00 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61220 bufp:0x2b29977b0200 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:17:58 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61440 bufp:0x2b29977a6400 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:18:03 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (86400) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (86400) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.
Wed Apr 06 17:19:26 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61440 bufp:0x2b29977b0200 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:20:10 2011
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKA disk:0x0.0x97954459 au:0
 iop:0x2b2997b61000 bufp:0x2b29977a6400 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:0, diskname:ORCL:FILER02DISKB disk:0x1.0x9795445a au:0
 iop:0x2b2997b61550 bufp:0x2b29977b1600 offset(bytes):0 iosz:4096 operation:1(Read) synchronous:0
 result: 4 osderr:0x3 osderr1:0x2e pid:6690
Wed Apr 06 17:21:07 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (86217) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (86217) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.
Wed Apr 06 17:27:15 2011
WARNING: Disk (FILER02DISKA) will be dropped in: (85849) secs on ASM inst: (2)
WARNING: Disk (FILER02DISKB) will be dropped in: (85849) secs on ASM inst: (2)
GMON SlaveB: Deferred DG Ops completed.

Taking down the failgroup didn’t cause an outage. The other ASM instance complained as well a little later; the interesting line in its log is “all mirror sides found readable, no repair required”:

2011-04-06 17:16:58.393000 +01:00
NOTE: initiating PST update: grp = 2, dsk = 2, mode = 0x15
NOTE: initiating PST update: grp = 2, dsk = 3, mode = 0x15
kfdp_updateDsk(): 24
kfdp_updateDskBg(): 24
WARNING: IO Failed. subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so dg:2, diskname:ORCL:FILER02DISKA disk:0x2.0x97a6d9f7 au:1
 iop:0x2b9ea4855e70 bufp:0x2b9ea4850a00 offset(bytes):1052672 iosz:4096 operation:2(Write) synchronous:1
 result: 4 osderr:0x3 osderr1:0x2e pid:870
NOTE: group FGTEST: updated PST location: disk 0000 (PST copy 0)
2011-04-06 17:17:03.508000 +01:00
NOTE: ASMB process exiting due to lack of ASM file activity for 5 seconds
NOTE: PST update grp = 2 completed successfully
NOTE: initiating PST update: grp = 2, dsk = 2, mode = 0x1
NOTE: initiating PST update: grp = 2, dsk = 3, mode = 0x1
kfdp_updateDsk(): 25
kfdp_updateDskBg(): 25
2011-04-06 17:17:07.454000 +01:00
NOTE: group FGTEST: updated PST location: disk 0000 (PST copy 0)
NOTE: PST update grp = 2 completed successfully
NOTE: cache closing disk 2 of grp 2: FILER02DISKA
NOTE: cache closing disk 3 of grp 2: FILER02DISKB
SUCCESS: extent 0 of file 267 group 2 repaired by offlining the disk
NOTE: repairing group 2 file 267 extent 0
SUCCESS: extent 0 of file 267 group 2 repaired - all mirror sides found readable, no repair required
2011-04-06 17:19:04.526000 +01:00
GMON SlaveB: Deferred DG Ops completed.
2011-04-06 17:22:07.487000 +01:00
GMON SlaveB: Deferred DG Ops completed.

No interruption of service though, which is good: the GV$ASM_CLIENT view reported all database instances still connected.

SQL> select * from gv$asm_client;

 INST_ID GROUP_NUMBER INSTANCE_NAME                                                    DB_NAME  STATUS       SOFTWARE_VERSIO COMPATIBLE_VERS
---------- ------------ ---------------------------------------------------------------- -------- ------------ --------------- ---------------
 2            2 rac11g2                                                          rac11g   CONNECTED    11.1.0.7.0      11.1.0.0.0
 1            2 rac11g1                                                          rac11g   CONNECTED    11.1.0.7.0      11.1.0.0.0

The result in the V$ASM_DISK view was as follows:

SQL> select name,state,header_status,path,failgroup from v$asm_disk;

NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 NORMAL   UNKNOWN      ORCL:FILER02DISKA
 NORMAL   UNKNOWN      ORCL:FILER02DISKB
DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2
FILER01DISKA                   NORMAL   MEMBER       ORCL:FILER01DISKA                                  FILER01
FILER01DISKB                   NORMAL   MEMBER       ORCL:FILER01DISKB                                  FILER01
FILER02DISKA                   NORMAL   UNKNOWN                                                         FILER02
FILER02DISKB                   NORMAL   UNKNOWN                                                         FILER02

8 rows selected.

As I expected, the disks for failgroup filer02 are gone, and so is the information about the failure group. My disk repair time should be high enough to protect me from having to rebuild the whole disk group.
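To double-check that assumption before going any further, the disk group's disk_repair_time attribute can be queried. A minimal sketch (V$ASM_ATTRIBUTE is usable here because compatible.asm is 11.1; the 24h figure is only what the 86400-second countdown in the alert.log suggests):

-- how long will ASM keep the offlined filer02 disks before dropping them?
select
	g.name		as diskgroup,
	a.name		as attribute,
	a.value
from
	v$asm_attribute	a,
	v$asm_diskgroup	g
where
	g.group_number = a.group_number
and	g.name = 'FGTEST'
and	a.name = 'disk_repair_time';

-- if more time were needed, it could be extended, e.g.:
-- alter diskgroup fgtest set attribute 'disk_repair_time' = '24h';

Now I'm really curious whether my database can become corrupted, so I'll increase the SCN: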

[oracle@rac11gr1node1 ~]$ . setsid.sh rac11g
[oracle@rac11gr1node1 ~]$ sq

SQL*Plus: Release 11.1.0.7.0 - Production on Wed Apr 6 17:24:18 2011

Copyright (c) 1982, 2008, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 5999304

SQL> begin
 2   for i in 1..5 loop
 3    execute immediate 'alter system switch logfile';
 4   end loop;
 5  end;
 6  /

PL/SQL procedure successfully completed.

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 5999378

SQL>

Back to the test case.

Stop the Database

[oracle@rac11gr1node1 ~]$ srvctl stop database -d rac11g
[oracle@rac11gr1node1 ~]$ srvctl status database -d rac11g
Instance rac11g2 is not running on node rac11gr1node1
Instance rac11g1 is not running on node rac11gr1node2

Done; this part was simple. Next they stopped their first filer. To prevent bad things from happening I'll shut down ASM on all nodes. I hope that doesn't invalidate the test, but I can't see how ASM could avoid problems if the other failgroup went down as well.
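A sketch of what that shutdown might look like with srvctl in 11.1 (node names as used elsewhere in this post; the exact commands depend on how ASM is registered with Clusterware):

[oracle@rac11gr1node1 ~]$ srvctl stop asm -n rac11gr1node1
[oracle@rac11gr1node1 ~]$ srvctl stop asm -n rac11gr1node2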

Shut down Filer01 and start Filer02

Also quite simple. Shutting down this filer will allow me to follow the story. After filer01 was down I started filer02, curious as to how ASM would react. I have deliberately NOT put disk group FGTEST into the ASM_DISKGROUPS parameter, because I want to mount it manually to get a better understanding of what happens.
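For reference, the auto-mount list (and the discovery string, which still covers all the ORCL:* disks) can be checked from SQL*Plus; a quick sketch, with FGTEST expected to be absent from asm_diskgroups:

SQL> show parameter asm_diskgroups
SQL> show parameter asm_diskstring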

After having started ASM on both nodes, I queried V$ASM_DISK and tried to mount the disk group:

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0                                NORMAL   MEMBER       ORCL:FILER02DISKA
 1                                NORMAL   MEMBER       ORCL:FILER02DISKB
 2                                NORMAL   UNKNOWN      ORCL:FILER01DISKA
 3                                NORMAL   UNKNOWN      ORCL:FILER01DISKB
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2

6 rows selected.

Oops, now they are both gone…

SQL> alter diskgroup fgtest mount;
alter diskgroup fgtest mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing
ORA-15042: ASM disk "0" is missing
ORA-15080: synchronous I/O operation to a disk failed
ORA-15080: synchronous I/O operation to a disk failed

OK, I have a problem here. Both ASM instances report I/O errors with the FGTEST disk group, and I can't mount it. That means I can't mount the database either; in a way that proves I won't have corruption, but neither will I have a database. Which is worse?

Can I get around this problem?

I think I'll have to start filer01 and see if that makes a difference; hopefully I can recover my system with the information in failgroup filer01. Soon after filer01 came online I ran the query against V$ASM_DISK again and re-attempted the mount.

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0                                NORMAL   MEMBER       ORCL:FILER02DISKA
 1                                NORMAL   MEMBER       ORCL:FILER02DISKB
 2                                NORMAL   MEMBER       ORCL:FILER01DISKA
 3                                NORMAL   MEMBER       ORCL:FILER01DISKB
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2

6 rows selected.

That worked!

Wed Apr 06 17:45:32 2011
SQL> alter diskgroup fgtest mount
NOTE: cache registered group FGTEST number=2 incarn=0x72c150d7
NOTE: cache began mount (first) of group FGTEST number=2 incarn=0x72c150d7
NOTE: Assigning number (2,0) to disk (ORCL:FILER01DISKA)
NOTE: Assigning number (2,1) to disk (ORCL:FILER01DISKB)
NOTE: Assigning number (2,2) to disk (ORCL:FILER02DISKA)
NOTE: Assigning number (2,3) to disk (ORCL:FILER02DISKB)
Wed Apr 06 17:45:33 2011
NOTE: start heartbeating (grp 2)
kfdp_query(): 12
kfdp_queryBg(): 12
NOTE: cache opening disk 0 of grp 2: FILER01DISKA label:FILER01DISKA
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: FILER01DISKB label:FILER01DISKB
NOTE: cache opening disk 2 of grp 2: FILER02DISKA label:FILER02DISKA
NOTE: F1X0 found on disk 2 fcn 0.0
NOTE: cache opening disk 3 of grp 2: FILER02DISKB label:FILER02DISKB
NOTE: cache mounting (first) group 2/0x72C150D7 (FGTEST)
Wed Apr 06 17:45:33 2011
* allocate domain 2, invalid = TRUE
kjbdomatt send to node 0
Wed Apr 06 17:45:33 2011
NOTE: attached to recovery domain 2
NOTE: cache recovered group 2 to fcn 0.7252
Wed Apr 06 17:45:33 2011
NOTE: LGWR attempting to mount thread 1 for diskgroup 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.7252 ABA
NOTE: seq=3 blk=337
NOTE: cache mounting group 2/0x72C150D7 (FGTEST) succeeded
NOTE: cache ending mount (success) of group FGTEST number=2 incarn=0x72c150d7
Wed Apr 06 17:45:33 2011
kfdp_query(): 13
kfdp_queryBg(): 13
NOTE: Instance updated compatible.asm to 11.1.0.0.0 for grp 2
SUCCESS: diskgroup FGTEST was mounted
SUCCESS: alter diskgroup fgtest mount

The V$ASM_DISK view is nicely updated and everything seems to be green:

SQL> select disk_number,name,state,header_status,path,failgroup from v$asm_disk;

DISK_NUMBER NAME                           STATE    HEADER_STATU PATH                                               FAILGROUP
----------- ------------------------------ -------- ------------ -------------------------------------------------- ------------------------------
 0 DATA1                          NORMAL   MEMBER       ORCL:DATA1                                         DATA1
 1 DATA2                          NORMAL   MEMBER       ORCL:DATA2                                         DATA2
 0 FILER01DISKA                   NORMAL   MEMBER       ORCL:FILER01DISKA                                  FILER01
 1 FILER01DISKB                   NORMAL   MEMBER       ORCL:FILER01DISKB                                  FILER01
 2 FILER02DISKA                   NORMAL   MEMBER       ORCL:FILER02DISKA                                  FILER02
 3 FILER02DISKB                   NORMAL   MEMBER       ORCL:FILER02DISKB                                  FILER02

6 rows selected.

Brilliant! But will it have an effect on the database?

Starting the Database

Even though things looked ok, they weren’t! I didn’t expect this to happen:

[oracle@rac11gr1node1 ~]$ srvctl start database -d rac11g
PRKP-1001 : Error starting instance rac11g2 on node rac11gr1node1
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL*Plus: Release 11.1.0.7.0 - Production on Wed Apr 6 17:48:58 2011
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Copyright (c) 1982, 2008, Oracle.  All rights reserved.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Enter user-name: Connected to an idle instance.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL> ORACLE instance started.
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:Total System Global Area 1720328192 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Fixed Size                    2160392 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Variable Size              1291847928 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Database Buffers    419430400 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:Redo Buffers                  6889472 bytes
rac11gr1node1:ora.rac11g.rac11g2.inst:ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [9572],
rac11gr1node1:ora.rac11g.rac11g2.inst:[9533], [0x000000000], [], [], [], [], [], [], [], []
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:
rac11gr1node1:ora.rac11g.rac11g2.inst:SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
rac11gr1node1:ora.rac11g.rac11g2.inst:With the Partitioning, Real Application Clusters, OLAP, Data Mining
rac11gr1node1:ora.rac11g.rac11g2.inst:and Real Application Testing options
rac11gr1node1:ora.rac11g.rac11g2.inst:
CRS-0215: Could not start resource 'ora.rac11g.rac11g2.inst'.
PRKP-1001 : Error starting instance rac11g1 on node rac11gr1node2
CRS-0215: Could not start resource 'ora.rac11g.rac11g1.inst'.

Oops. A quick search on Metalink revealed note "Ora-00600: [Kccpb_sanity_check_2], [3621501], [3621462] On Startup" (Doc ID 435436.1). The explanation for the ORA-600 is that "the seq# of the last read block is higher than the seq# of the control file header block", which matches the arguments above: the last block read was at sequence 9572 while the control file header was still at 9533. Oracle Support explains it as a lost write, but here the situation is quite different. Interesting! I have to leave that for another blog post.

ORDER BY

This is a quick note on the importance of ORDER BY for the order of the result set produced by a SELECT. The mantra is very simple:

Without an order_by_clause, no guarantee exists that the same query executed more than once will retrieve rows in the same order.
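A minimal illustration, using a hypothetical table T with columns ID and VAL:

-- the order of these rows is not guaranteed and may change between
-- executions, after a patch, or after statistics are regathered
select id, val from t;

-- only an explicit ORDER BY guarantees the order of the result set
select id, val from t order by id;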

Poll: Oracle Licensing Audit Results

Easy reading for you today, folks, and literally a couple of clicks to vote. Cheers!

Rows per block

A recent question on the OTN database forum:

Can any one please point to me a document or a way to calculate the average number of rows per block in Oracle 10.2.0.3

One answer would be to collect stats and then approximate it as block size / avg_row_len, although you have to worry about things like row overheads, the row directory, and block overheads before you can be sure you've got it right. On top of this, the average might not be too helpful anyway. So here's another (not necessarily fast) option that gives you more information about the blocks that have any rows in them (I picked the source$ table from a 10g system because source$ is often good for some extreme behaviour); a sketch of the quick stats-based estimate comes first, then the full query.
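The stats-based sketch (assuming statistics are reasonably current; the first figure is the naive block size / avg_row_len, the second is simply num_rows / blocks, and both ignore the overheads just mentioned):

select
	t.table_name,
	t.num_rows,
	t.blocks,
	t.avg_row_len,
	round(ts.block_size / nullif(t.avg_row_len,0), 2)	as blocksize_over_rowlen,
	round(t.num_rows / nullif(t.blocks,0), 2)		as numrows_over_blocks
from
	dba_tables	t,
	dba_tablespaces	ts
where
	ts.tablespace_name = t.tablespace_name
and	t.owner = 'SYS'
and	t.table_name = 'SOURCE$'
;

Now the detailed version: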


break on report

compute sum of tot_blocks on report
compute sum of tot_rows   on report

column avg_rows format 999.99

select
	twentieth,
	min(rows_per_block)			min_rows,
	max(rows_per_block)			max_rows,
	sum(block_ct)				tot_blocks,
	sum(row_total)				tot_rows,
	round(sum(row_total)/sum(block_ct),2)	avg_rows
from
	(
	select
		ntile(20) over (order by rows_per_block) twentieth,
		rows_per_block,
		count(*)			block_ct,
		rows_per_block * count(*)	row_total
	from
		(
		select
			fno, bno, count(*) rows_per_block
		from
			(
			select
				dbms_rowid.rowid_relative_fno(rowid)	as fno,
			dbms_rowid.rowid_block_number(rowid)	as bno
			from
				source$
			)
		group by
			fno, bno
		)
	group by
		rows_per_block
	order by
		rows_per_block
	)
group by
	twentieth
order by
	twentieth
;

I've used the ntile() function to split the results into 20 lines; obviously you might want to change this according to the expected variation in row counts for your target table. In my case the results looked like this:

 TWENTIETH   MIN_ROWS   MAX_ROWS TOT_BLOCKS   TOT_ROWS AVG_ROWS
---------- ---------- ---------- ---------- ---------- --------
         1          1         11       2706       3470     1.28
         2         12         22         31        492    15.87
         3         23         34         30        868    28.93
         4         35         45         20        813    40.65
         5         46         57         13        664    51.08
         6         59         70         18       1144    63.56
         7         71         81         23       1751    76.13
         8         82         91         47       4095    87.13
         9         92        101         79       7737    97.94
        10        102        111        140      14976   106.97
        11        112        121        281      32799   116.72
        12        122        131        326      41184   126.33
        13        132        141        384      52370   136.38
        14        142        151        325      47479   146.09
        15        152        161        225      35125   156.11
        16        162        171        110      18260   166.00
        17        172        181         58      10207   175.98
        18        182        191         18       3352   186.22
        19        193        205         22       4377   198.95
        20        206        222         16       3375   210.94
                                 ---------- ----------
sum                                    4872     284538

Of course, the moment you see a result like this it prompts you to ask more questions.

Is the "bell curve" effect that you can see centred around the 13th ntile indicative of a normal distribution of row lengths? If so, why is the first ntile such an extreme outlier? Is that indicative of a number of peculiarly long rows, did time of arrival have a special effect, is it the result of a particular pattern of delete activity … and so on.
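One way to start chasing that first-ntile outlier is a follow-up query against the sparsely populated blocks; a sketch (hypothetical, using vsize(source) as a rough proxy for row length):

select
	dbms_rowid.rowid_relative_fno(rowid)	as fno,
	dbms_rowid.rowid_block_number(rowid)	as bno,
	count(*)				as rows_in_block,
	round(avg(vsize(source)))		as avg_line_len
from
	sys.source$
group by
	dbms_rowid.rowid_relative_fno(rowid),
	dbms_rowid.rowid_block_number(rowid)
having
	count(*) <= 11
order by
	rows_in_block
;

If the average line length in those blocks turns out to be unusually high then long rows are the likely explanation; if not, delete patterns or time of arrival become the more interesting suspects.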

Averages are generally very bad indicators if you’re worried about the behaviour of an Oracle system.