Oakies Blog Aggregator

Announcing “SLOB Recipes”

I’ve started updating the SLOB Resources page with links to “recipes” for specific SLOB testing scenarios. The first installment is the recipe for loading an 8TB scale SLOB 2.3 Multiple Schema Model on a 2-socket Linux host attached to EMC XtremIO. Recipes will include (at a minimum) the relevant SLOB program output (e.g., setup.sh or runit.sh), the init.ora and slob.conf.

Please keep an eye on the SLOB Resources page for updates…and don’t miss the first installment. It’s quite interesting.


Filed under: oracle

Oracle 12 and latches, part 2

In my previous post, I looked at non-shared latches and how Oracle performs the latching. This post describes how latching works for shared latches.

The information is quite internal; if you landed directly on this page, it might be a good idea to start with my first post on this topic.

A famous example for shared latches is the ‘cache buffers chains’ latch.

For the sake of the test I simply scanned a test table a few times, and used a little gdb script to look at calls to the function ksl_get_shared_latch (on Linux x86-64 the first five arguments of a function are passed in the rdi, rsi, rdx, rcx and r8 registers, which is what the script's printf reads):

break ksl_get_shared_latch
  commands
    silent
    printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
    c
  end

After getting the results, I looked up the latch addresses (laddr) in V$LATCH.ADDR, and searched for a cache buffers chains latch.

Next, I set up the same construction as in the previous investigation into non-shared latches: I started two sqlplus / as sysdba sessions. In the first session, I simply took a shared latch in exclusive mode (16 as the fifth argument).

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x94af8768 1 0 2303 16
Function returned 1

This takes latch 0x94af8768 in mode 16, exclusive mode.

I essentially did the same as in the previous investigation: I ran perf record on the session trying to get the latch, and used perf report to look at the functions used during the spinning for the shared latch. If the spinning is not clearly visible, set "_spin_count" to a higher value.

After examining the perf report output, I worked out the relevant functions. Here is the gdb script with the relevant functions that I discovered:

break ksl_get_shared_latch
  commands
    silent
    printf "ksl_get_shared_latch laddr:%x, willing:%d, where:%d, why:%d, mode:%d\n", $rdi, $rsi, $rdx, $rcx, $r8
    c
  end
break kslgess
  commands
    silent
    printf "kslgess %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslskgs
  commands
    silent
    printf "kslskgs %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break *0xc291a9
  commands
    silent
    printf " kslskgs loop: %d\n", $r15d
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

If you attach to the second process and source the breakpoint script, this is what you will see:

ksl_get_shared_latch laddr:94af8768, willing:1, where:0, why:2303, mode:16
kslgess 94af8768, 16, 1, 0
kslskgs 94af8768, 1, 464092136, 464092728
 kslskgs loop: 2000
 kslskgs loop: 1999
...
 kslskgs loop: 2
 kslskgs loop: 1
kslwlmod 464092480, -1780205088, -1800435864, 1
kslskgs 94af8768, 1, 464092136, 464092728
 kslskgs loop: 1
skgpwwait 464091920, 1628424256, -1780204168, 0
sskgpwwait 464091920, 1628424256, -1780204168, 0
semop 1409027, 464091720, 1, -1

1-A shared latch is taken using the distinct ksl_get_shared_latch function call. In this function an attempt to get the latch is made, and the function returns if it succeeds.
2-Next comes kslgess (kernel service latch get spin shared latch). It does not appear that a latch get attempt is made in this function.
3-kslskgs: kernel service latch spin kernel get shared (this is a guess). Here the latch get is tried up to _spin_count-1 times.
4-8 This is the kslskgs function spinning over the latch get.
9-kslwlmod: kernel service latch wait list modification. Here the process is (almost?) done spinning, and registers itself in the post/wait interface so it gets posted when the latch is freed.
10/11-Here the latch is tried one last time.
12-skgpwwait: system kernel generic post/wait wait: the function that sets up the waiting so the process can get posted.
13-sskgpwwait: system system kernel generic post/wait wait: probably the port-specific code for the post/wait setup.
14-semop: semaphore operation: the session starts sleeping on a semaphore, waiting to get posted. As indicated previously, this keeps the process off the CPU unless it is posted.

This also means that instead of the (very) CPU-heavy spinning on latches that was done in the old days, this mechanism is very light on the CPU and appears considerably more efficient.
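
To make this concrete, here is a minimal C sketch of the spin-then-sleep pattern just described. Everything in it is my own illustration: the latch is reduced to a single atomic flag, SPIN_COUNT mirrors the 2000 iterations seen in the gdb output, the wait list bookkeeping is omitted, and Oracle's real implementation is of course not public.

#include <stdatomic.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define SPIN_COUNT 2000                  /* mirrors the 2000 iterations seen above */

static atomic_int latch;                 /* 0 = free, 1 = held in exclusive mode */
static int semid;                        /* System V semaphore the waiter sleeps on */

static int try_get(void)                 /* one atomic latch get attempt */
{
    int expected = 0;
    return atomic_compare_exchange_strong(&latch, &expected, 1);
}

void get_shared_latch_exclusive(void)
{
    if (try_get())                       /* ksl_get_shared_latch: first attempt */
        return;
    for (int i = SPIN_COUNT; i > 0; i--) /* kslskgs: spin, counting down like the gdb output */
        if (try_get())
            return;
    /* kslwlmod: register on the latch wait list (bookkeeping omitted), then try once more */
    if (try_get())
        return;
    /* skgpwwait -> sskgpwwait -> semop: sleep until the holder posts the semaphore */
    struct sembuf wait_op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    semop(semid, &wait_op, 1);           /* a real implementation would retry after waking */
}

int main(void)
{
    semid = semget(IPC_PRIVATE, 1, 0600); /* demo setup only */
    get_shared_latch_exclusive();         /* the latch is free here, so this returns at once */
    return 0;
}

The shape of the code is the point: one cheap attempt, a bounded spin, wait list registration, one final attempt, and only then a blocking semop.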

A word of caution: it seems a shared latch taken in exclusive mode this way cannot be freed; the session that took the latch will hang when calling kslfre, and the session that was waiting for the latch gets:

ORA-03113: end-of-file on communication channel
ORA-24323: value not allowed

This seems to be expected behaviour, as indicated by Stefan Koehler and Andrey Nikolaev, and it is not fixed in Oracle version 12.1.0.2.

Oracle 12 and latches

Oracle DBAs who are so old that they remember the days before Oracle 11.2 probably remember the tuning efforts for latches. I can still recall the latch number for cache buffers chains off the top of my head: number 98. In older versions it was a different number: 157.

But it seems latches have become less of a problem in the modern days of Oracle 11.2 and higher. Still, when I generate heavy concurrency I can see some latch waits. (I am talking about you and SLOB, mister Closson).

I decided to look into latches on an Oracle 12.1.0.2 instance on Oracle Linux 7. This might also be a good time to go through how you think latches work for yourself: it might be different from what you think or have been taught.

In order to understand how latching works, I searched for Oracle-related traces. I could find event 10005, but it did not return anything latching-related. My current understanding is that event 10005 is Oracle KST tracing, for which the results are put in X$TRACE.

Luckily, I could get a great head start by studying the work of Andrey Nikolaev. However, that work seems to be strictly Solaris-based.

I decided to take a look at how this works on Linux. To do this, I set up a system for the specific purpose of this test. Disclaimer: The techniques below are for educational purposes only, and should never be done on a real database!

The work of Andrey shows kslgetl() as the overall latching function, which does:
kslgetl
– sskgslgf (immediate latch get)
– kslges (wait latch get)
— kskthbwt
— kslwlmod (setup wait list)
— sskgslgf (immediate latch get)
— skgpwwait (sleep latch get)
— semop

In order to do latch gets and generate latch misses in a very predictable way, I used the method that Andrey shows on his website (which he attributed to Tanel): using oradebug call to invoke the latch get function directly. In order to get latch waits, you need at least two processes doing something with a latch: one holding the latch, and another one requesting it. To facilitate this, I set up two sqlplus / as sysdba sessions.

Taking a latch manually can be done using the kslgetl or the ksl_get_shared_latch functions. Freeing a latch is done using the kslfre function. The kslfre function just takes the latch address as argument. The kslgetl and ksl_get_shared_latch functions take the following arguments:
1-latch address
2-immediate get (0 means yes, 1 means no)
3-where (X$KSLLW.INDX)
4-why (X$KSLWSC.INDX)
5-mode (8=shared,16=exclusive; only for ksl_get_shared_latch function)

Immediate mode get for a non-shared latch.
An immediate mode get just tries to take a latch once. In order to manually do an immediate latch get, I did the following:

Session 1:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 1

This means session 1 has taken latch 0x60023a80. I requested the latch as an immediate get (second argument 0), but that does not really matter here: the latch was free, so the session got it.

If you want to check this, simply use V$LATCHHOLDER view to verify (in another session):

SQL> select * from v$latchholder;

        PID        SID LADDR            NAME                                                                   GETS     CON_ID
---------- ---------- ---------------- ---------------------------------------------------------------- ---------- ----------
        38        134 0000000060023A80 cache table scan latch                                                  709          0

Please mind that with a manually taken latch, you need to free the latch before you try to do anything else with your session, otherwise you will encounter a (non-critical) ORA-600. Freeing a latch is done using oradebug call kslfre with one argument: the latch address.

Session 2:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetl 0x60023a80 0 0 2442
Function returned 0

The 'Function returned 0' means the immediate latch get failed. As you can see, this was an immediate get because the second argument is 0.

In order to understand which functions are involved, I first used the Linux perf record utility. However, because the immediate get does not spin, and a latch get by design has speed and low overhead among its principal design criteria, I could not see the function in the samples.

This meant I needed to go to one of the tools I have used extensively in the past: gdb (the GNU debugger). You need to attach to the Oracle database server process locally on the database server. Here is what I did:

# gdb -p 4600
...
(gdb) set pagination off
(gdb) rbreak ^ksl.*
...
Breakpoint 262 at 0x8ea7b0
 ksl_event_stats_rollup;
(gdb) commands 1-262
type commands for breakpoint(s) 1-262, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

The first gdb command turns off having to press enter for every full screen of gdb output; the second breaks on all functions in the oracle executable that start with 'ksl'. The commands command defines what gdb executes when any of breakpoints 1-262 is encountered, which here is just 'c': continue.

Now, with the debugger set, I executed the kslgetl function again:

Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 247, 0x000000000c8e10a0 in kslws_check_waitstack ()
Breakpoint 240, 0x000000000c8dccf0 in ksl_get_shared_latch ()
Breakpoint 244, 0x000000000c8de960 in kslfre ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 242, 0x000000000c8ddcb0 in kslgetl ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()
Breakpoint 251, 0x000000000c8e5720 in kslwtectx ()
Breakpoint 253, 0x000000000c8e78e0 in kslwt_end_snapshot ()
Breakpoint 252, 0x000000000c8e7320 in kslwt_update_stats_int ()
Breakpoint 245, 0x000000000c8dedf0 in kslwtbctx ()
Breakpoint 246, 0x000000000c8e08e0 in kslwt_start_snapshot ()

It is important to understand that Oracle does a lot of other things besides the latch get via kslgetl. Most of the functions above start with kslwt: the Oracle wait interface. A couple of times a shared latch is taken (as can be seen from the function ksl_get_shared_latch) and freed (kslfre). The important part here is: kslgetl is executed once, and did not descend into any other function to try to get the latch.
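
Conceptually, an immediate get is nothing more than a single atomic attempt that reports success or failure, matching the "Function returned 1/0" values shown by oradebug above. A minimal C illustration of such a try-once get (my own simplification, not Oracle code):

#include <stdatomic.h>

/* An immediate ("no-wait") latch get: a single atomic attempt, no spinning,
   no wait list registration. Returns 1 on success and 0 on failure. */
int latch_get_immediate(atomic_int *latch)
{
    int expected = 0;                /* only succeeds if the latch is free */
    return atomic_compare_exchange_strong(latch, &expected, 1);
}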

Getting a latch in willing-to-wait mode for a non-shared latch.
Now let's do something a bit more exciting: requesting a taken latch in willing-to-wait mode. The first session does exactly the same, simply taking the latch. The second session's call needs to change a little bit to indicate it is willing to wait:

SQL> oradebug call kslgetl 0x60023a80 1 0 2442

This will call additional functions. In order to understand what these functions are, I used perf record, perf report and perf script.

I was able to create a smaller, more specific gdb script to see what is going on:

break kslgetl
  commands
    silent
    printf "kslgetl laddr:%x, willing:%d, where:%d, why:%d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslges
  commands
    silent
    printf "kslges %x, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break kslwlmod
  commands
    silent
    printf "kslwlmod %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break skgpwwait
  commands
    silent
    printf "skgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break sskgpwwait
  commands
    silent
    printf "sskgpwwait %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end
break semop
  commands
    silent
    printf "semop %d, %d, %d, %d\n", $rdi, $rsi, $rdx, $rcx
    c
  end

This is what the gdb output looks like when the latch get in willing-to-wait mode is executed:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
kslwlmod 13311368, -1780327896, 1610758784, 1
skgpwwait 13311608, -1767360, -1780326976, 0
sskgpwwait 13311608, -1767360, -1780326976, 0
semop 360451, 13310840, 1, -1

Interestingly, if the latch is not already taken, this is what the latch get sequence looks like:

kslgetl laddr:60023a80, willing:1, where:0, why:2442

In other words, for getting a non-shared latch in willing-to-wait mode:

1-the function kslgetl is called, which tries to take the latch.
If the latch can be taken, the function returns; if not:
2-the function kslges (kernel service latch get spinning) is called, which supposedly also tries to take the same latch.
If the latch still cannot be taken, the next function is:
3-the function kslwlmod (kernel service latch waiting list modify) is entered.
In this function the process registers itself as waiting in the post/wait list.
4-the function skgpwwait (system kernel generic post/wait wait) is entered.
This function sets up the waiting for the process so it can be posted.
5-the function sskgpwwait (system system kernel generic post/wait wait) is entered.
My current understanding is that the 'ss' functions contain the platform-specific code for database functions.
6-the (operating system) function semop (semaphore operation) is called.
This makes the process sleep waiting on a semaphore (an operating system signalling mechanism). This way, the process will not be runnable on the CPU unless the semaphore is posted.

The information that is missing here is the spinning. The earlier work of Andrey Nikolaev showed that in the Solaris port of the database, distinct functions (sskgslgf [immediate] and sskgslspin [spin]) were used to get the latch, which made it easy to count.

Some searching around revealed that a CPU register holds this information. Add this to the above gdb script:

break *0xc29b51
  commands
    silent
    printf " kslges loop: %d\n", $ecx
    c
  end

And try to get a taken non-shared latch in willing-to-wait mode:

kslgetl laddr:60023a80, willing:1, where:0, why:2442
kslges 60023a80, 0, 1, 0
 kslges loop: 19999
 kslges loop: 19998
...
 kslges loop: 1
 kslges loop: 0
kslwlmod 1208568840, -1780327896, 1610758784, 1
skgpwwait 1208569080, -1281169344, -1780326976, 0
sskgpwwait 1208569080, -1281169344, -1780326976, 0
semop 360451, 1208568312, 1, -1

So… this reveals that getting a non-shared latch in willing-to-wait mode will spin 10*_spin_count times (20,000 with the default _spin_count of 2000), not _spin_count times. What is even more interesting is that for the described type of latch there is no (short) timed sleep: the kslges function spins 10*_spin_count times for the latch, then puts itself on the wait list and goes to sleep on a semaphore. The way the latching mechanism works has been described (by Oracle, AFAIK) as a process spinning _spin_count times trying to get a latch, then sleeping for some time, then spinning again, sleeping again, and so on.

I strace'd the process holding the latch to see whether it issues a semctl system call to signal the waiting process when the latch is freed, and indeed that is what happens. This is different from the "old days", when processes spinning on latches (cache buffers chains latches most of the time) were really hammering the system and would eat up a lot of CPU time trying to get a latch.
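
As an illustration of that post side (again a sketch under my own simplifying assumptions, not Oracle code): the holder clears the latch and then sets the waiter's semaphore with semctl, which makes the semop the waiter is sleeping in return.

#include <stdatomic.h>
#include <sys/sem.h>

union semun { int val; };            /* Linux requires the caller to define this */

/* Free the latch and post the first process on the wait list, mirroring the
   kslfre/semctl behaviour observed under strace. All names are illustrative. */
void latch_free_and_post(atomic_int *latch, int semid, int waiter_sem)
{
    atomic_store(latch, 0);                  /* release the latch */
    union semun arg = { .val = 1 };
    semctl(semid, waiter_sem, SETVAL, arg);  /* wake the sleeping waiter */
}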

Part of the classic latching problem, specifically the cache buffers chains latching problem, has quite probably been mitigated by shared latches, which were introduced into Oracle in different versions for different ports.

Watch out for a second blog post in which I do the same investigation for shared latches.

Tagged: debugger, gdb, internals, latch, oracle, oradebug, sqlplus

Descending Indexes

I’ve written about optimizer defects with descending indexes before now but a problem came up on the OTN database forum a few days ago that made me decide to look very closely at an example where the arithmetic was clearly defective. The problem revolves around a table with two indexes, one on a date column (TH_UPDATE_TIMESTAMP) and the other a compound index which starts with the same column in descending order (TH_UPDATE_TIMESTAMP DESC, TH_TXN_CODE). The optimizer was picking the “descending” index in cases where it was clearly the wrong index (even after the statistics had been refreshed and all the usual errors relating to date-based indexes had been discounted). Here’s an execution plan from the system which shows that there’s something wrong with the optimizer:


SELECT COUNT(*) FROM TXN_HEADER WHERE TH_UPDATE_TIMESTAMP BETWEEN SYSDATE-30 AND SYSDATE;

---------------------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE    |                  |     1 |     8 |            |          |
|*  2 |   FILTER           |                  |       |       |            |          |
|*  3 |    INDEX RANGE SCAN| TXN_HEADER_IDX17 |  1083K|  8462K|     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

There are two clues here: first, Oracle has used the (larger) two-column index when the single-column index is (almost guaranteed to be) the better choice simply because it is smaller; secondly, we have a cost of 4 to acquire 1M rowids through an (implicitly b-tree) index range scan, which would require at least 250,000 index entries per leaf block to keep the cost that low (or 2,500 if you set the optimizer_index_cost_adj to 1; so it might just be possible if you had a 32KB block size).

The OP worked hard to supply information we asked for, and to follow up any suggestions we made; in the course of this we got the information that the table had about 90M rows and this “timestamp” column had about 45M distinct values ranging from 6th Sept 2012 to 2nd July 2015 with no nulls.

Based on this information I modelled the problem in an instance of 11.2.0.4 (the OP was using 11.2.0.3).


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                                          id,
        to_date('06-SEP-2012 15:13:00','dd-mon-yyyy hh24:mi:ss') +
                trunc((rownum - 1 )/4) / (24 * 60)                      upd_timestamp,
        mod(rownum - 1,10)                                              txn_code,
        rpad('x',50)                                                    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 4 * 1030  *  24 * 60
;

create index t1_asc on t1(upd_timestamp) nologging;
create index t1_desc on t1(upd_timestamp desc) nologging;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

My data set has 4 rows per minute from 6th Sept 2012 to 3rd July 2015, with both an ascending and descending index on the upd_timestamp column. For reference, here are the statistics about the two indexes:


INDEX_NAME               SAMPLE     BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ---------- ----------- -----------------
T1_DESC                 5932800          2       17379            154559
T1_ASC                  5932800          2       15737             59825

The ascending index is smaller with a significantly better clustering_factor, so for queries where either index would be a viable option the ascending index is the one (we think) that the optimizer should choose. In passing, the 5.9M index entries figure is exactly the number of rows in the table – these stats were computed automatically as the indexes were created.

Here’s a simple query with execution plan (with rowsource execution stats):

select max(id) from t1 where upd_timestamp between
to_date('01-jun-2015','dd-mon-yyyy')                and
to_date('30-jun-2015','dd-mon-yyyy')

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |      1 |        |    29 (100)|      1 |00:00:01.29 |    4710 |
|   1 |  SORT AGGREGATE              |         |      1 |      1 |            |      1 |00:00:01.29 |    4710 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1      |      1 |    167K|    29   (0)|    167K|00:00:00.98 |    4710 |
|*  3 |    INDEX RANGE SCAN          | T1_DESC |      1 |    885 |     5   (0)|    167K|00:00:00.35 |     492 |
---------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."SYS_NC00005$">=HEXTORAW('878CF9E1FEF8FEFAFF')  AND
              "T1"."SYS_NC00005$"<=HEXTORAW('878CF9FEF8FEF8FF') )
       filter((SYS_OP_UNDESCEND("T1"."SYS_NC00005$")>=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss') AND SYS_OP_UNDESCEND("T1"."SYS_NC00005$")<=TO_DATE(' 2015-06-30 00:00:00',
              'syyyy-mm-dd hh24:mi:ss')))

The optimizer’s index estimate of 885 rowids is a long way off the actual rowcount of 167,000 – even though I have perfect stats describing uniform data and the query is simple enough that the optimizer should be able to get a good estimate. Mind you, the predicate section looks a lot messier than you might expect, and the table estimate is correct (though not consistent with the index estimate, of course).

Here’s the plan I get when I make the descending index invisible:


--------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |      1 |        |  2146 (100)|      1 |00:00:01.25 |    2134 |
|   1 |  SORT AGGREGATE              |        |      1 |      1 |            |      1 |00:00:01.25 |    2134 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1     |      1 |    167K|  2146   (1)|    167K|00:00:00.94 |    2134 |
|*  3 |    INDEX RANGE SCAN          | T1_ASC |      1 |    167K|   453   (2)|    167K|00:00:00.31 |     446 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("UPD_TIMESTAMP">=TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "UPD_TIMESTAMP"<=TO_DATE(' 2015-06-30 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

With the descending index invisible we can see that the cardinality estimate for the ascending index is correct, and notice how much higher this makes the cost. It's not surprising that the optimizer picked the wrong index with such a difference in cost. The question now is why the optimizer got the index cardinality (i.e. selectivity, hence cost) so badly wrong.

The answer is that the optimizer has made the same mistake that applications make when storing dates as character strings. It's using the normal range-based calculation for the sys_op_descend() values recorded against the virtual column and has lost all understanding of the fact that these values represent dates. I can demonstrate this most easily by creating one more table, inserting a couple of rows, gathering stats, and showing you what the internal storage of a couple of "descending" dates looks like.


drop table t2;

create table t2 (d1 date);
create index t2_i1 on t2(d1 desc);

insert into t2 values('01-Jun-2015');
insert into t2 values('30-Jun-2015');
commit;

select d1, sys_nc00002$ from t2;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T2',
                method_opt       => 'for all columns size 1',
                cascade          => false
        );
end;
/

column endpoint_value format 999,999,999,999,999,999,999,999,999,999,999,999
break on table_name skip 1

select
        table_name, column_name, endpoint_number, endpoint_value, to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') end_hex
from
        user_tab_histograms
where
        table_name in ('T1','T2')
and     column_name like 'SYS%'
order by
        table_name, column_name, endpoint_number
;

I’ve put the dates 1st June 2015 and 30th June 2015 into the table because those were the values I used in the where clause of my query. Here are the results showing the internal representation of the (descending) index column and the stored low and high values for the virtual columns in both t1 and t2.


D1        SYS_NC00002$
--------- ------------------------
01-JUN-15 878CF9FEF8FEF8FF
30-JUN-15 878CF9E1FEF8FEFAFF

TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER                                   ENDPOINT_VALUE END_HEX
-------------------- -------------------- --------------- ------------------------------------------------ ---------------------------------
T1                   SYS_NC00005$                       0  703,819,340,111,320,000,000,000,000,000,000,000    878CF8FCEFF30BCEBC8823F7000000
                     SYS_NC00005$                       1  703,880,027,955,346,000,000,000,000,000,000,000    878FF6F9EFF20256F3B404F7400000

T2                   SYS_NC00002$                       0  703,819,411,001,549,000,000,000,000,000,000,000    878CF9E1FEF8F24C1C7CED46200000
                     SYS_NC00002$                       1  703,819,419,969,389,000,000,000,000,000,000,000    878CF9FEF8FEEED28AC5B22A200000

If you check the t2 data (sys_nc00002$) values against the end_hex values in the histogram data you'll see they match up to the first 6 bytes (12 digits). Oracle has done its standard processing – take the first 6 bytes of the column, convert to decimal, round to the most significant 15 digits (technically round(value, -21)), convert and store the result as hex.

So let's do some arithmetic. The selectivity of a range scan that is in-bounds is (informally): "range we want" / "total range". I've set up t2 to show us the values we need to calculate the range we want, and I've reported the t1 values to allow us to calculate the total range: in each case we just subtract the lower value (endpoint number 0) from the higher value for the sys_nc0000N$ column. So (ignoring the 21 trailing zeros everywhere) our selectivity is:

  • (703,819,419,969,389 – 703,819,411,001,549) / ( 703,880,027,955,346 – 703,819,340,111,320) = 0.00014777
  • We have 5.9M rows in the table, so the cardinality estimate should be about: 5,932,800 * 0.00014777 = 876.69

The actual cardinality estimate was 885 – but we haven’t allowed for the exact form of the range-based predicate: we should add 1/num_distinct at both ends because the range is a closed range (greater than or EQUAL to, less than or EQUAL to) – which takes the cardinality estimate up to 884.69 which rounds to the 885 that the optimizer produced.
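
If you want to replay that arithmetic, here is a small C program using the endpoint values from the histogram output above; the num_distinct of 1,483,200 follows from the data definition (4 rows per minute means one distinct timestamp per four rows).

#include <stdio.h>

int main(void)
{
    /* histogram endpoints from user_tab_histograms, 21 trailing zeros dropped */
    double want_lo  = 703819411001549.0;  /* t2 endpoint 0: 30-Jun-2015, descending */
    double want_hi  = 703819419969389.0;  /* t2 endpoint 1: 01-Jun-2015, descending */
    double total_lo = 703819340111320.0;  /* t1 endpoint 0: column low value  */
    double total_hi = 703880027955346.0;  /* t1 endpoint 1: column high value */

    double num_rows     = 5932800.0;
    double num_distinct = 1483200.0;      /* 5,932,800 rows / 4 rows per timestamp */

    double selectivity = (want_hi - want_lo) / (total_hi - total_lo);
    /* closed range: add 1/num_distinct at each end of the range */
    double cardinality = num_rows * (selectivity + 2.0 / num_distinct);

    printf("selectivity = %.8f\n", selectivity);  /* ~0.00014777       */
    printf("cardinality = %.2f\n", cardinality);  /* ~884.69, i.e. 885 */
    return 0;
}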

Conclusion

This note shows that Oracle is (at least for dates) losing information about the underlying data when dealing with the virtual columns associated with descending columns in indexes. As demonstrated, that can lead to extremely bad selectivity estimates for predicates based on the original columns, and these selectivity estimates make it possible for Oracle to choose the wrong index and introduce a spuriously low cost into the calculation of the full execution plan.

Footnote

This note seems to match bug note 11072246 "Wrong Cardinality estimations for columns with DESC indexes", but that bug is reported as fixed in 12.1, while this test case reproduces in 12.1.0.2.

 

I’ve Been Made an Oracle Ace Director

Well, I guess the title of this post says it all. As I tweeted yesterday:

I’m grateful, proud, honoured and overall just Jolly Chuffed to have been made an Oracle Ace Director! #ACED

I can now put this label on my belongings

I can now put this label on my belongings

I’ve been an Oracle ACE since 2011 and I’m really happy to be making the step up to being an ACE Director. What does being an ACE Director mean? Well, it certainly does not mean that I am technically brilliant. As my community role is a technical one, I do have to be competent and experienced to be an ACE (or Associate or Director) – but there are many, many people out there who are technically superior to me and are not {and may well not want to be} ACEs of any kind.

To be an ACE of any flavour you have to be committed to supporting the Oracle User Community. The whole ACE program is, I believe, more about recognising and supporting that user community than anything else. Actually, the ACE program web site states this (ACE Program FAQ). To become an ACE Director you have to demonstrate that you have been actively supporting the community for a while (please do not ask me to quantify “a while”) and that you are committed to continuing that activity for at least 12 months. There are some specific activities and commitments that come with the badge but that is balanced by a commitment by the Ace Program to give you some support in doing so (this does not include being paid, it is still voluntary). As I understand it, all ACEs and ACE Directors are reviewed every 12 months and can be re-designated if your community activity has changed.

As I said above, there are a lot of technically strong people who are not and never will be ACEs. This is often because user community activity is not their thing – they have little interest in blogging, presenting, writing or volunteering for user groups. I also know some people who do all those things but they would rather do that with no specific acknowledgement by Oracle Corporation. I guess I am saying that though I am proud to now be an Oracle ACE Director, the main thing it tells you about me is that I am passionate about the user community and I am happy {heck, Jolly Chuffed} to be recognised by Oracle for that. And I am happy for that dialogue to be two-way also. One of the conditions of being an ACE Director is you play a part in representing the user community to Oracle.

Does this mean I have "drunk the Oracle Kool-Aid", as I think some of my American friends would call it? No. Before I became an Oracle ACE I chatted to several friends already on the program, and no one I know has been told not to say anything or been sanctioned by the ACE Program for criticising some aspect of Oracle Tech. We are still free to be Bitter Old Men & Women (apart from the Bitter Young ones of course). Anyone who has followed my blog for a while, seen me present a few times or spent a couple of evenings in the pub with me will know that I can, at times, be quite critical of aspects of the corporation or its software. There is no gagging of us ACEs that I am aware of.

Will being an Oracle ACE Director alter my user community activity? Well, it might. I was doing a lot for the community before now; I made a decision 2 or 3 years ago to become more active in the user community {for the simple and selfish reason that I like doing it a lot more than I like commuting in and out of London every day}. You don't do all of this for the ACE recognition; you do it for other reasons and maybe get the ACE badges on the way. But the program helps the Directors a little more, opens a few more doors. So I think I'll be able to step it up a little more. I'm really looking forward to that.

I’ll stop there. If you are interested in another Oracle ACE Director’s take on the role, check out this video by my friend Tim Hall.

Wrong Results

It is interesting how a combination of technologies in Oracle can interact in a way that produces seemingly bizarre outcomes.

Consider the following query:

SQL> with v as
(
select 20 n from dual
) select distinct v.n
from v, t
where v.n=t.n(+);

no rows selected

Note that I'm doing a left outer join; however, the query somehow managed to lose the single row I have in the subquery factoring clause (and, just in case you're wondering, I've used subquery factoring for simplicity: replacing it with a real table makes no difference).

The first reaction is how could something as simple as this go so terribly wrong?

Let's take a look at the plan:



SQL> with v as
(
select 20 n from dual
) select distinct v.n
from v, t
where v.n=t.n(+);

Execution Plan
----------------------------------------------------------
Plan hash value: 627307704

------------------------------------------------------------------------------------------
| Id  | Operation                        | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |      |     1 |     9 |     6  (17)| 00:00:01 |
|   1 |  HASH UNIQUE                     |      |     1 |     9 |     6  (17)| 00:00:01 |
|*  2 |   FILTER                         |      |       |       |            |          |
|*  3 |    HASH JOIN OUTER               |      |     1 |     9 |     5   (0)| 00:00:01 |
|   4 |     VIEW                         |      |     1 |     3 |     2   (0)| 00:00:01 |
|   5 |      FAST DUAL                   |      |     1 |       |     2   (0)| 00:00:01 |
|*  6 |     MAT_VIEW REWRITE ACCESS FULL | MV_T |    10 |    60 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("MV_T"."GID"=0)
3 - access("V"."N"="MV_T"."N"(+))
6 - filter("MV_T"."N"(+) IS NOT NULL)

As it turns out, query rewrite is in play, but the real oddity lies in the Predicate Information section. Operation #2 reads filter("MV_T"."GID"=0), which took me by surprise (where did it come from?) as well as explaining why the row went missing. A predicate like that essentially turns our query into an inner join!

The answer lies in how this materialized view was created, and in a cool trick the optimizer tried to pull off that didn't quite work out. I created the table and the materialized view in the following way:

SQL> create table t as
2 select mod(level, 10) n, level m
3 from dual
4 connect by level <= 1000;
Table created

SQL> create materialized view mv_t enable query rewrite as
2 select n, sum(m), grouping_id(n) gid
3 from t
4 group by rollup(n);
Materialized view created

The materialized view has two levels of aggregation due to the use of a rollup: the detail rows per value of n, plus a grand total row. I'm also using the grouping_id function to identify each grouping level. So what the optimizer tried to do is apply the GID = 0 predicate in order to eliminate the rollup data, but, unfortunately, as a result it also turned our query into an inner join equivalent (final query taken from the optimizer trace):

SELECT DISTINCT "V"."N" "N"
FROM (SELECT 20 "N" FROM "SYS"."DUAL" "DUAL") "V", "ROOT"."MV_T" "MV_T"
WHERE "MV_T"."GID" = 0
AND "V"."N" = "MV_T"."N"(+)

Of course the correct predicate in this case should be "MV_T"."GID" (+) = 0.

I have tried the above test case on both 11.2.0.3 and 12.1.0.2.0 with both versions producing the same wrong results.

PQ Index anomaly

Here’s an oddity prompted by a question that appeared on Oracle-L last night. The question was basically: “Why can’t I build an index in parallel when it’s a single column with most of the rows set to null and only a couple of values for the non-null entries?”

That’s an interesting question, since the description of the index shouldn’t produce any reason for anything to go wrong, so I spent a few minutes trying to emulate the problem. I created a table with 10M rows and a column that was 3% ‘Y’ and 0.1% ‘N’, then created and dropped an index on it in parallel a few times. The report I used to prove that the index build had run as a parallel build showed an interesting waste of resources. Here’s the code to build the table and index:


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        case
                when mod(rownum,100) < 3 then 'Y'
                when mod(rownum,1000) = 7 then 'N'
        end                     flag,
        rownum                  id,
        rpad('x',30)            padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7
;

-- gather stats here

explain plan for
create index t1_i1 on t1(flag) parallel 4 nologging
;

select * from table(dbms_xplan.display);

create index t1_i1 on t1(flag) parallel 4 nologging;

select index_name, degree, leaf_blocks, num_rows from user_indexes;
alter index t1_i1 noparallel;

As you can see, I’ve used explain plan to get Oracle’s prediction of the cost and size, then I’ve created the index, then checked its size (and set it back to serial from its parallel setting). Here are the results of the various queries (from 11.2.0.4) – it’s interesting to note that Oracle thinks there will be 10M index entries when we know that “completely null entries don’t go into the index”:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |          |    10M|    19M|  3073   (3)| 00:00:16 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |    10M|    19M|            |          |  Q1,01 | P->S | QC (ORDER) |
|   3 |    INDEX BUILD NON UNIQUE| T1_I1    |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     SORT CREATE INDEX    |          |    10M|    19M|            |          |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE          |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANGE      | :TQ10000 |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | P->P | RANGE      |
|   7 |        PX BLOCK ITERATOR |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| T1       |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------

Note
-----
   - estimated index size: 243M bytes

INDEX_NAME           DEGREE                                   LEAF_BLOCKS   NUM_ROWS
-------------------- ---------------------------------------- ----------- ----------
T1_I1                4                                                562     310000
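
That num_rows figure of 310,000 is exactly what the flag expression predicts: 3% 'Y' plus 0.1% 'N', and the two conditions never overlap because a row with mod(rownum,1000) = 7 has mod(rownum,100) = 7. A quick C cross-check of the case logic:

#include <stdio.h>

int main(void)
{
    long entries = 0;
    for (long rownum = 1; rownum <= 10000000; rownum++) {
        if (rownum % 100 < 3)            /* 'Y': 3% of the rows   */
            entries++;
        else if (rownum % 1000 == 7)     /* 'N': 0.1% of the rows */
            entries++;
    }
    printf("non-null index entries: %ld\n", entries);  /* prints 310000 */
    return 0;
}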

Although the plan says it’s going to run parallel, and even though the index says it’s a parallel index, we don’t have to believe that the creation ran as a parallel task – so let’s check v$pq_tqstat, the “parallel query table queue” statistics – and this is the result I got:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Ranger                 1 QC                      12        528          4          0           0
                      Producer               1 P004               2786931   39161903          9          1           0
                                             1 P005               2422798   34045157         11          1           0
                                             1 P006               2359251   33152158         12          1           0
                                             1 P007               2431032   34160854         14          2           0
                      Consumer               1 P000               3153167   44520722          3          0           0
                                             1 P001               1364146   19126604          4          1           0
                                             1 P002               2000281   28045742          3          0           0
                                             1 P003               3482406   48826476          3          0           0

                    1 Producer               1 P000                     1        298          0          0           0
                                             1 P001                     1        298          0          0           0
                                             1 P002                     1        298          0          0           0
                                             1 P003                     1         48          0          0           0
                      Consumer               1 QC                       4       1192          2          0           0

Check the num_rows column – the first set of slaves distributed 10M rows and roughly 140MB of data to the second set of slaves – and we know that most of those rows will hold (null, rowid) which are not going to go into the index. 97% of the data that went through the message queues would have been thrown away by the second set of slaves, and “should” have been discarded by the first set of slaves.

As for the original question about the index not being built in parallel – maybe it was, but not very parallel. You’ll notice that the parallel distribution at operation 6 in the plan is “RANGE”. If 97% of your data is null and only 3% of your data is going to end up in the index then you’d need to run at higher than parallel 33 to see any long-lasting executions – because at parallel 33 just one slave in the second set will get all the real data and do all the work of sorting and building the index, while the other slaves will (or ought to) be just throwing their data away as it arrives. When you’ve got 500M rows with only 17M non-null entries (as the OP had) to deal with, maybe by the time you get around to looking the only thing still happening is the one slave that’s building a 17M-row index.

Of course, one of the reasons I wanted to look at the row distribution in v$pq_tqstat was that I wanted to check whether I was going to see all the data going to one slave, or a spread across 2 slaves (Noes to the left, Ayes to the right – as they used to say in the UK House of Commons), or whether Oracle had been very clever and decided to distribute the rows by key value combined with rowid to get a nearly even spread. I’ll have to set up a different test case to check whether that last option is possible.

Footnote

There was another little oddity that might be a simpler explanation of why the OP’s index creation might actually have run serially. I dropped and recreated the index in my test case several times and at one point I noticed (from view v$pq_slave) that I had 16 slave processes live (though, at that point, IDLE). Since I was the only user of the instance my session should probably have been re-using the same set of slaves each time I ran the test; instead, at some point, one of my test runs had started up a new set of slaves. Possibly something similar had happened to the OP, and over the course of building several indexes one after the other his session had reached the stage where it tried to start “yet another” set of slaves, failed, and decided to run serially rather than reuse any of the slaves that were nominally available and IDLE.

Update

It gets worse. I decided to query v$px_sesstat (joined to v$statname) while the query was running, and caught some statistics just before the build completed. Here are a few critical numbers taken from the 4 sessions that received the 10M rows and built the final index:

Coord   Grp Deg    Set  Sno   SID
264/1     1 4/4      1    1   265
---------------------------------
            physical writes direct                            558
            sorts (memory)                                      1
            sorts (rows)                                2,541,146

264/1     1 4/4      1    2    30
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,218,809

264/1     1 4/4      1    3    35
---------------------------------
            physical writes direct                          7,110
            physical writes direct temporary tablespace     7,110
            sorts (disk)                                        1
            sorts (rows)                                2,886,184

264/1     1 4/4      1    4   270
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,353,861

Not only did Oracle pass 10M rows from one slave set to the other, the receiving slave set sorted those rows before discarding them. One of the slaves even ran short of memory and spilled its sort to disc. And we can see (physical writes direct = 558) that just one slave was responsible for writing all the “real” data for that index.

 

Update 2

A couple of follow-ups on the thread have introduced some other material that’s worth reading.  An item from Mohamed Houri about what happens when a parallel slave is still assigned to an executing statement but isn’t given any work to do for a long time; and an item from Stefan Koehler about _px_trace and tracking down why the degree of parallelism of a statement was downgraded.

This Is Not Glossy Marketing But You Still Won’t Believe Your Eyes. EMC XtremIO 4.0 Snapshot Refresh For Agile Test / Dev Storage Provisioning in Oracle Database Environments.

This is just a quick blog post to direct readers to a YouTube video I recently created to help explain to someone how flexible EMC XtremIO Snapshots are. The power of this array capability is probably most appreciated in the realm of provisioning storage for Test and Development environments.

Although this is a silent motion picture I think it will speak volumes–or at least 1,000 words.

Please note: This is just a video demonstration to show the base mechanisms and how they relate to Oracle Database with Automatic Storage Management. This is not a scale demonstration. XtremIO snapshots are supported into the thousands, and extremely powerful “sibling trees” are fully supported.

Not Your Father’s Snapshot Technology

No storage array on the market is as flexible as XtremIO in the area of writable snapshots. This video demonstration shows how snapshots allow the administrator of a “DEV” host–using Oracle ASM–to quickly refresh to current or past versions of ASM disk group contents from the “PROD” environment.

The principles involved in this demonstration are:

  1. XtremIO snapshots are crash consistent.
  2. XtremIO snapshots are immediately created, writeable and space efficient. There is no fixed “donor” relationship. Snapshots can be created from other snapshots and refreshes can go in any direction.
  3. XtremIO snapshot refresh does not involve the host operating system. Snapshot and volume contents can be immediately “swapped” (refreshed) at the array level without any action on the host.

Regarding number 3 on that list, I’ll point out that while the operating system does not play a role in the snapshot operations per se, applications will be sensitive to the contents of storage changing underneath them. It is only for this reason that there are any host actions at all.

Are Host Operations Involved? Crash Consistent Does Not Mean Application-Coherent

The act of refreshing XtremIO snapshots does not change the SCSI WWN information so hosts do not have any way of knowing the contents of a LUN have changed. In the Oracle Database use case the following must be considered:

  1. With a file system based database one must unmount the file systems before refreshing a snapshot otherwise the file system will be corrupted. This should not alarm anyone. A snapshot refresh is an instantaneous content replacement at the array level. Operationally speaking, file system based databases only require database instance shutdown and the unmounting of the file system in preparation for application-coherent snapshot refresh.
  2. With an ASM based database one must dismount the ASM disk group in preparation for snapshot refresh. Since dismounting a disk group is a database-side action rather than an operating system one, ASM database snapshot restore does not involve system administration in any way.

The video is 5 minutes long and it will show you the following happenings along a timeline:

  1. “PROD” and “DEV” database hosts (one physical and one virtual) each showing the same Oracle database (identical DBID) and database creation time as per dictionary views. This establishes the “donor”<->clone relationship. DEV is a snapshot of PROD. It is begat of a snapshot of a PROD consistency group
  2. A single-row token table called “test” in the PROD database has value “1.” The DEV database does not even have the token table (DEV is independent of PROD…it’s been changing…but its origins are rooted in PROD as per point #1)
  3. At approximately 41 seconds into the video I take a snapshot of the PROD consistency group with “value 1” in the token table. This step prepares for “time travel” later in the demonstration
  4. I then update the PROD token table to contain the value “42”
  5. At ~2:02 into the video I have already dismounted DEV ASM disk groups and started clobbering DEV with the current state of PROD via a snapshot refresh. This is “catching up to PROD”
    1. Please note: No action at all was needed on the PROD side. The refresh of DEV from PROD is a logical, crash-consistent point in time image
  6. At ~2:53 into the video you’ll see that the DEV database instance has already been booted and that it has value “42” (step #4). This means DEV has “caught up to PROD”
  7. At ~3:32 you’ll see that I use dd(1) to copy the redo LUN over the data LUN on the DEV host to introduce ASM-level corruption
  8. At 3:57 the DEV database is shown as corrupted. In actuality, the ASM disk group holding the DEV database is corrupted
  9. In order to demonstrate traveling back in time, and to recover from the dd(1) corrupting of the ASM disk group,  you’ll see at 4:31 I chose to refresh from the snapshot I took at step #3
  10. At 5:11 you’ll see that DEV has healed from the dd(1) destruction of the ASM disk group, the database instance is booted, and the value in the token table is reverted to 1 (step #3) thus DEV has traveled back in time

Please note: in the YouTube box you can click to view full screen, or watch on youtube.com if the video quality is a problem.

More Information

For information on the fundamentals of EMC XtremIO snapshot technology please refer to the following EMC paper: The fundamentals of XtremIO snapshot technology

For independent validation of XtremIO snapshot technology in a highly-virtualized environment with Oracle Database 12c please click on the following link: Principled Technologies, Inc Whitepaper

For a proven solution whitepaper showing massive scale data sharing with XtremIO snapshots please click on the following link: EMC Whitepaper on massive scale database consolidation via XtremIO

Filed under: oracle

Snap Clone on Exadata

Recently I created a new screenwatch that walks you through the new features of Snap Clone on Exadata in Enterprise Manager 12c version 12.1.0.5. The screenwatch covered using Snap Clone on Exadata both from a multitenant architecture perspective and from a standard non-container database perspective. Unfortunately, putting both together meant the resulting video was around 20 minutes long.

Generally, we try to make our screenwatches shorter than that, as the evidence tends to show people lose concentration and interest if the screenwatches are that long. So we decided to split the screenwatch in two: one part for vanilla Snap Clone on Exadata, and the other for Snap Clone Multitenant on Exadata.

The problem with doing that was we didn’t have time to re-record the screenwatches appropriately, so we simply edited the screenwatch into two separate ones. Personally, I wasn’t happy with the result, because it meant some of what I said in one screenwatch was now covered in the other, and vice versa. So here are the links to all three. If you want to watch the shorter ones, click on either of the first two links, but if you feel you can keep your concentration going for long enough to watch the whole thing, click on the third link instead. :)

Snap Clone on Exadata
Snap Clone Multitenant on Exadata
Using Snap Clone on Exadata – the whole story

How long does a logical IO take?

This is a question that I have played with for a long time. There have been statements about logical IO performance (“logical IO is x times faster than physical IO”), but nobody could answer the question of what the actual logical IO time is. Of course you can see part of it in the system and session statistics (v$sysstat/v$sesstat), in the statistic “session logical reads”. However, if you divide the number of logical reads by the total time a query took, the resulting logical IO time is too high, because that assumes all the time the query took was spent doing logical IO, which obviously is not the case: there is time spent on parsing, maybe physical IO, and so on. Also, doing it that way calculates an average, and averages are known to hide actual behaviour.

Luckily, with Red Hat Enterprise Linux and Oracle Linux version 7, the stock kernel is version 3.10. This kernel version supports systemtap userspace return probes. A systemtap probe is a way to trigger an action when a certain event (the probed event) starts; a return probe triggers an action when the probed function finishes. The Oracle UEK version 3 kernel at the time of writing is 3.8, which does not support this.

My current understanding is that consistent reads are handled by the Oracle database C function kcbgtcr(). Current reads are quite probably handled by the function kcbgcur(). Having said that, I know of at least one exception to this: scans on hybrid columnar compressed segments do not use the kcbgtcr() function.

Please mind that all function name translations, like kcbgtcr (kernel cache buffers get consistent read), are pure guesses, albeit somewhat educated guesses, as a lot of internet publications use these names, including My Oracle Support itself.

With the 3.10 kernel, we can create a small systemtap script to measure the time between the start and the end of the kcbgtcr routine. If you want to experiment with this, it is probably best to download the latest version of systemtap and compile it yourself; that is why the shebang points to /usr/local/bin/stap.

#!/usr/local/bin/stap

global latency

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	printf("

Now make the script executable, and run it against a database session:

# ./lio.stap -x 3877

Next, I execute a scan in the database foreground session, and watch the systemtap script output:
(execution of the systemtap script can be cancelled by pressing CTRL-c)

...
>kcbgtcr
<kcbgtcr, latency(us): …
>kcbgtcr
<kcbgtcr, latency(us): …
>kcbgtcr
<kcbgtcr, latency(us): …
>kcbgtcr
<kcbgtcr, latency(us): …

This shows the function being executed by the database session. However, it seems the time spent in the kcbgtcr() function is not consistent.

In order to get a better overview, we can add a histogram of the kcbgtcr latencies:

#!/usr/local/bin/stap
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
	printf(">kcbgtcr\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	latency[pid()] = gettimeofday_us() - latency[pid()]
	latency_histogram <<< latency[pid()]
	printf("@count(latency_histogram) > 0 ) {
		printf("\n==kcbgtcr latency==\n")
		println(@hist_log(latency_histogram))
	}
}

Next, attach the systemtap script to the database session again, and issue a scan. Once the scan in the database session is finished, press CTRL-c to finish the systemtap script:

...

==kcbgtcr latency==
  value |-------------------------------------------------- count
      0 |                                                       0
      1 |@@@@@@@@@@@@@@@@@@@@                                5998
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  14284
      4 |                                                     268
      8 |                                                      58
     16 |                                                     137
     32 |                                                      25
     64 |                                                       1
    128 |                                                      15
    256 |                                                      69
    512 |                                                       1
   1024 |                                                       0
   2048 |                                                       0
   4096 |                                                       1
   8192 |                                                       5
  16384 |                                                      35
  32768 |                                                      39
  65536 |                                                       8
 131072 |                                                       0
 262144 |                                                       1

Okay, the majority of the calls fall in the 1 and 2 microsecond buckets (@hist_log uses power-of-two buckets), but there are a lot of executions that fall completely outside of these, up to 262144 microseconds (262 milliseconds)!

What could cause these huge changes in logical IO time?

At this point I am turning the squelch way down and making a lot of information visible (consider this a warning!). Here is the systemtap script I am using:

#!/usr/local/bin/stap
#
global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
	if ( probefunc() == "kcbgtcr" )
			latency[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kc*").return {
	printf("%s < %s", thread_indent(-1), ppfunc())
	if ( ppfunc() == "kcbgtcr" ) {
		latency[pid()] = gettimeofday_us() - latency[pid()]
		latency_histogram <<< latency[pid()]
		printf(" -- latency(us): %d", latency[pid()])
	}
	printf("\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("qe*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kt*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kd*").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiosq0") {
	printf("=PARSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiexe") {
	printf("=EXECUTE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opifch2") {
	printf("=FETCH\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("opiclo") {
	printf("=CLOSE\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtbctx") {
	printf("=KSLWTBCTX\n")
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kslwtectx") {
	printf("=KSLWTECTX\n")
}
probe process("/lib64/libaio.so.1").function("io_submit") {
	printf("=IO_SUBMIT\n")
}
probe process("/lib64/libaio.so.1").function("io_getevents_0_4") {
	printf("=IO_GETEVENTS\n")
}
probe process("/lib64/libc.so.6").function("pread64") {
	printf("=PREAD\n")
}

Warning! This script will take a long time to compile into a kernel module!! Systemtap works by dynamically generating a kernel module from the systemtap script and inserting it into the kernel. Because I injected a huge chunk of code to run in the kernel, a lot of things are potentially influenced; I found the Oracle database dropping cores, for example.
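
If all you need is the latency histogram, one way to reduce (though not remove) this observer effect is to strip the per-call printf calls from the earlier histogram script, so each probe only updates an aggregate. A minimal variant of that script, trimmed by me:

#!/usr/local/bin/stap

global latency, latency_histogram

probe begin {
	printf("Begin.\n")
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	latency[pid()] = gettimeofday_us()
}

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	# aggregate only; no per-call output keeps the probe body cheap
	latency_histogram <<< gettimeofday_us() - latency[pid()]
}

probe end {
	if ( @count(latency_histogram) > 0 )
		println(@hist_log(latency_histogram))
}

This does not remove the cost of the probes themselves, but it avoids formatting and shipping a line of output for every single function call.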

After running this against a session and doing a very simple full table scan with a count(*), I searched for the logical IO, in other words where the kcbgtcr() function was performed. Here it is:

166149 oracle_5487_tes(5487):     < kdst_fetch
166155 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166159 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets
166162 oracle_5487_tes(5487):     > kdst_fetch
166164 oracle_5487_tes(5487):      > kdst_fetch0
166167 oracle_5487_tes(5487):       > kcbipnns
166170 oracle_5487_tes(5487):       < kcbipnns
166173 oracle_5487_tes(5487):       > kcbrls
166177 oracle_5487_tes(5487):        > kcbrls_direct
166181 oracle_5487_tes(5487):        < kcbrls_direct
166183 oracle_5487_tes(5487):       < kcbrls
166185 oracle_5487_tes(5487):       > kdstsnb
166188 oracle_5487_tes(5487):       < kdstsnb
166191 oracle_5487_tes(5487):       > ktrget2
166194 oracle_5487_tes(5487):        > ktsmg_max_query
166197 oracle_5487_tes(5487):        < ktsmg_max_query
166200 oracle_5487_tes(5487):        > kcbgtcr
166204 oracle_5487_tes(5487):         > kcbldrget
166209 oracle_5487_tes(5487):          > kcblgt
166211 oracle_5487_tes(5487):          < kcblgt
166213 oracle_5487_tes(5487):          > kcbzvb
166216 oracle_5487_tes(5487):           > kcbhvbo
166219 oracle_5487_tes(5487):            > kcbhxoro
166222 oracle_5487_tes(5487):            < kcbhxoro
166224 oracle_5487_tes(5487):           < kcbhvbo
166226 oracle_5487_tes(5487):          < kcbzvb
166228 oracle_5487_tes(5487):          > kcbztek_trace_blk
166230 oracle_5487_tes(5487):          < kcbztek_trace_blk
166233 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
166236 oracle_5487_tes(5487):           > kcbtgobj
166239 oracle_5487_tes(5487):            > kd4obj
166251 oracle_5487_tes(5487):            < kd4obj
166254 oracle_5487_tes(5487):           < kcbtgobj
166257 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
166260 oracle_5487_tes(5487):         < kcbldrget
166262 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 62
166265 oracle_5487_tes(5487):        > ktrgcm
166268 oracle_5487_tes(5487):         > kcbcge
166271 oracle_5487_tes(5487):         < kcbcge
166273 oracle_5487_tes(5487):         > ktcckv
166275 oracle_5487_tes(5487):         < ktcckv
166285 oracle_5487_tes(5487):        < ktrgcm
166286 oracle_5487_tes(5487):       < ktrget2
166289 oracle_5487_tes(5487):       > kdr9ir2blk
166291 oracle_5487_tes(5487):       < kdr9ir2blk
166293 oracle_5487_tes(5487):      < kdst_fetch0
166294 oracle_5487_tes(5487):     < kdst_fetch
166297 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
166300 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

How to read: “>” means entering a function, “<” means returning from a function.
I selected a piece of the systemtap output/tracing where the counting procedure is visible.

The first row is “< kdst_fetch”, in other words: returning from kernel data scan table fetch. So this function has just performed a fetch. Not surprisingly, the next function is qeaeCn1SerialRowsets, which I think is the count function.

After the count, the kdst_fetch function is entered again, then kdst_fetch0. The next functions are kcbipnns, kcbrls and kcbrls_direct. Probably these functions are related to pinning and releasing blocks. See Alexander Anokhin’s post on that.

A few functions further on we encounter kcbgtcr(). Interestingly, it is followed by the kcbldrget() function, which is kernel cache buffers direct path loader get. In other words, Oracle has chosen to do a direct path read, because this is the function that starts off the direct path read code. The next function, kcblgt(), requests a block. This means that blocks read in this code path come from PGA memory, not SGA (buffer cache) memory.

The next functions, kcbzvb, kcbhvbo and kcbhxoro are the block XOR checking functionality. The function actually performing this (sxorchk) is not visible because I did not probe for it.
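
If you want the checksum routine to show up in the trace too, a probe pair in the same style as the script above can be added. A sketch; mind that sxorchk, like the other function names, is a guess:

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("sxorchk") {
	printf("%s > %s\n", thread_indent(1), probefunc())
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("sxorchk").return {
	printf("%s < %s\n", thread_indent(-1), ppfunc())
}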

The next functions are not clear to me at this point (kcbztek_trace_blk, kcbl_objdchk_with_cache_reread, kcbtgobj, kd4obj). Then the kcbldrget returns, and the kcbgtcr function too.

Here we can see that probing a huge number of functions does influence the performance of a process. While previously we saw kcbgtcr() take 2us, here, probably because of all the probes, the kcbgtcr function took 62us.

The next piece of execution is done by the ktrgcm function. This is handling undo and buffer cleanout. After that function, we cross kdr9ir2blk (function unknown), after which kdst_fetch0 and kdst_fetch return, and the count is done using qeaeCn1SerialRowsets.

Okay, so far so good, but the question was: why do some logical IOs/kcbgtcr() executions take excessively longer?

After a little searching, I was able to find a very good reason:

169252 oracle_5487_tes(5487):     > kdst_fetch
169254 oracle_5487_tes(5487):      > kdst_fetch0
169256 oracle_5487_tes(5487):       > kcbipnns
169258 oracle_5487_tes(5487):       < kcbipnns
169259 oracle_5487_tes(5487):       > kcbrls
169262 oracle_5487_tes(5487):        > kcbrls_direct
169264 oracle_5487_tes(5487):        < kcbrls_direct
169265 oracle_5487_tes(5487):       < kcbrls
169267 oracle_5487_tes(5487):       > kdstsnb
169269 oracle_5487_tes(5487):       < kdstsnb
169270 oracle_5487_tes(5487):       > ktrget2
169272 oracle_5487_tes(5487):        > ktsmg_max_query
169274 oracle_5487_tes(5487):        < ktsmg_max_query
169275 oracle_5487_tes(5487):        > kcbgtcr
169278 oracle_5487_tes(5487):         > kcbldrget
169280 oracle_5487_tes(5487):          > kcblgt
169283 oracle_5487_tes(5487):           > kcblrs
169286 oracle_5487_tes(5487):            > kdsdrcbk
169288 oracle_5487_tes(5487):            < kdsdrcbk
169291 oracle_5487_tes(5487):            > kcbldio
169296 oracle_5487_tes(5487):             > kcfaioe
169299 oracle_5487_tes(5487):             < kcfaioe
169301 oracle_5487_tes(5487):             > kcflbi
169314 oracle_5487_tes(5487):              > kcf_hard_ftype_check
169317 oracle_5487_tes(5487):              < kcf_hard_ftype_check
=IO_SUBMIT
169416 oracle_5487_tes(5487):             < kcflbi
169420 oracle_5487_tes(5487):            < kcbldio
169425 oracle_5487_tes(5487):            > kcblcffln
169429 oracle_5487_tes(5487):            < kcblcffln
169432 oracle_5487_tes(5487):           < kcblrs
169434 oracle_5487_tes(5487):           > kcblsinc
169438 oracle_5487_tes(5487):           < kcblsinc
169440 oracle_5487_tes(5487):           > kcblcio
169443 oracle_5487_tes(5487):            > kcblci
169447 oracle_5487_tes(5487):             > kcflci
=IO_GETEVENTS
=IO_GETEVENTS
169481 oracle_5487_tes(5487):              > kcflwi
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTBCTX
=IO_GETEVENTS
=IO_GETEVENTS
=KSLWTECTX
179477 oracle_5487_tes(5487):              < kcflwi
179484 oracle_5487_tes(5487):             < kcflci
179488 oracle_5487_tes(5487):            < kcblci
179491 oracle_5487_tes(5487):           < kcblcio
179494 oracle_5487_tes(5487):          < kcblgt
179497 oracle_5487_tes(5487):          > kcbzvb
179509 oracle_5487_tes(5487):           > kcbhvbo
179513 oracle_5487_tes(5487):            > kcbhxoro
179516 oracle_5487_tes(5487):            < kcbhxoro
179518 oracle_5487_tes(5487):           < kcbhvbo
179520 oracle_5487_tes(5487):          < kcbzvb
179539 oracle_5487_tes(5487):          > kcbztek_trace_blk
179544 oracle_5487_tes(5487):          < kcbztek_trace_blk
179549 oracle_5487_tes(5487):          > kcbl_objdchk_with_cache_reread
179555 oracle_5487_tes(5487):           > kcbtgobj
179559 oracle_5487_tes(5487):            > kd4obj
179562 oracle_5487_tes(5487):            < kd4obj
179563 oracle_5487_tes(5487):           < kcbtgobj
179565 oracle_5487_tes(5487):          < kcbl_objdchk_with_cache_reread
179569 oracle_5487_tes(5487):         < kcbldrget
179571 oracle_5487_tes(5487):        < kcbgtcr -- latency(us): 10295
179576 oracle_5487_tes(5487):        > ktrgcm
179580 oracle_5487_tes(5487):         > kcbcge
179582 oracle_5487_tes(5487):         < kcbcge
179585 oracle_5487_tes(5487):         > ktcckv
179587 oracle_5487_tes(5487):         < ktcckv
179589 oracle_5487_tes(5487):        < ktrgcm
179591 oracle_5487_tes(5487):       < ktrget2
179593 oracle_5487_tes(5487):       > kdr9ir2blk
179606 oracle_5487_tes(5487):       < kdr9ir2blk
179609 oracle_5487_tes(5487):      < kdst_fetch0
179611 oracle_5487_tes(5487):     < kdst_fetch
179616 oracle_5487_tes(5487):     > qeaeCn1SerialRowsets
179620 oracle_5487_tes(5487):     < qeaeCn1SerialRowsets

If you go through the calls, you will see that the start is exactly the same, up to and including the call to kcblgt. This time, after kcbgtcr > kcbldrget (the consistent read function choosing the direct path read code), the kcblgt function does not return immediately, but rather starts off a lot of extra code path.

This code path fetches new blocks. The most striking thing here is that kcbgtcr requests the blocks, and the physical IO is done on behalf of the consistent read request, in other words: on behalf of the logical IO. This is obvious if you think about it: if you want to read blocks, you have to check whether they are available, and if they are not available, you have to fetch them.
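
To get an idea how much of the kcbgtcr() time is really embedded physical IO, the IO wait function visible in the trace (kcflwi) can be timed inside kcbgtcr. The following is my own sketch along that line; kcflwi is taken from the trace above, not from any documentation, so treat it as an assumption:

#!/usr/local/bin/stap

global lio_start, pio_start, pio_time, lio_pure, lio_with_pio

probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr") {
	lio_start[pid()] = gettimeofday_us()
	pio_time[pid()] = 0
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcflwi") {
	pio_start[pid()] = gettimeofday_us()
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcflwi").return {
	# attribute the IO wait time to the surrounding logical IO
	pio_time[pid()] += gettimeofday_us() - pio_start[pid()]
}
probe process("/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle").function("kcbgtcr").return {
	total = gettimeofday_us() - lio_start[pid()]
	if ( pio_time[pid()] > 0 )
		lio_with_pio <<< total
	else
		lio_pure <<< total
}
probe end {
	if ( @count(lio_pure) > 0 ) {
		printf("\n==kcbgtcr without embedded physical IO==\n")
		println(@hist_log(lio_pure))
	}
	if ( @count(lio_with_pio) > 0 ) {
		printf("\n==kcbgtcr with embedded physical IO==\n")
		println(@hist_log(lio_with_pio))
	}
}

Run against a session with -x like the earlier scripts, the second histogram should sit in the physical IO latency range, while the first should stay in the low microseconds.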

As a conclusion: if kcbgtcr(), together with kcbgcur() and a couple of other functions, is considered the logical IO, then a logical IO has a variable time instead of a constant one. Because physical IO is requested inside the logical IO request, technically the physical IO is part of the logical IO. This means it is technically incorrect to state that a physical IO is slower than a logical IO, because the physical IO is part of the logical IO request that needed it, and as such a physical IO cannot be slower than the logical IO it belongs to.

Tagged: internals, linux, oracle, profiling, systemtap