Oakies Blog Aggregator

Alternative opinions about Oracle Linux vs. RHEL…

Jay Weinshenker has written a couple of good posts in response to my recent post on Oracle Linux vs. RHEL.

I don’t agree 100% with all of his points, but I always think it’s good to hear different sides of the story and I certainly enjoyed reading them.

Cheers

Tim…




The FizzBuzz Oracle Database Coding Challenge

July 26, 2011 Through a web search I located a page titled “Coding Horror: Why Can’t Programmers.. Program?“  A simple question was asked in an interview, and apparently 199 of 200 programmers struggled to build a solution for the problem in less than ten minutes.  The problem must be that the 199 people who did not succeed [...]
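
The excerpt above doesn’t include the author’s solutions, but for reference the classic FizzBuzz task (print the numbers 1 to 100, replacing multiples of 3 with “Fizz”, multiples of 5 with “Buzz”, and multiples of both with “FizzBuzz”) can be expressed as a single SQL statement. This is just one possible sketch, not the approach from the linked post:

select case
         when mod(level,15) = 0 then 'FizzBuzz'
         when mod(level,3)  = 0 then 'Fizz'
         when mod(level,5)  = 0 then 'Buzz'
         else to_char(level)
       end as fizzbuzz
from dual
connect by level <= 100;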

IOT 2 – First examples and proofs

<.. IOT1 – Basics
IOT3 – Great reductions in IO for IOTs ..>
IOT4 – Boosting Buffer Cache Efficiency ..>
IOT5 – Primary Key issues ..>

In my first post on IOTs I ran through the basics of what they are. Here I am going to create some test tables and show you a few things.

I am going to create a simple PARENT table with 9,999 records and then two CHILD tables. CHILD_HEAP, a normal table, and CHILD_IOT, an Index Organized Table. They have the same columns and will hold very similar data.

All of this is on Oracle 11.1 but behaves exactly the same on 10.2. The block size is 8K and the tablespaces are auto segment space managed.
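
The tablespaces themselves are not shown in the post; a minimal sketch of how DATA_01 and INDEX_01 might be created (the datafile names and sizes here are purely hypothetical) would be:

create tablespace data_01
  datafile '/u01/oradata/mdw/data_01.dbf' size 1g
  extent management local
  segment space management auto;

create tablespace index_01
  datafile '/u01/oradata/mdw/index_01.dbf' size 1g
  extent management local
  segment space management auto;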

Here are the creation statements:

--first create the parent table, keyed by ID.
-- The other columns are not significant, they just represent "information"
create table mdw.parent
(id       number(10)    not null
,name     varchar2(100) not null
,date_1   date          not null
,num_1    number(2)
,num_2    number(2)
,constraint pare_pk primary key(id)
 using index tablespace index_01
)
tablespace data_01
/
--
--Now put my 9999 parents into the table.
insert into parent
select rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level < 10000
/
--
-- create the table to hold the children as a heap table
create table child_heap
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chhe_pk primary key(pare_id,cre_date)
 using index tablespace index_01
)
tablespace data_01
/
--
-- create the table to hold the children as an IOT table
create table child_iot
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chio_pk primary key(pare_id,cre_date)
-- using index tablespace index_01 -- CANNOT STATE for IOT. State in table definition
)
ORGANIZATION INDEX -- This is it. This makes the table an IOT
tablespace data_01
/

There are only two differences between the statements creating the CHILD_HEAP and the CHILD_IOT tables.

The main one is the inclusion of the line ORGANIZATION INDEX, which is what instructs Oracle to create the table as an IOT. Note that it does not state the index and you cannot state the index; the IOT is created based on the Primary Key.
The other change is that you can no longer state the tablespace for the Primary Key index. I’ve not played with this at all, but I don’t think you can state anything with “using index”, as the table storage clauses are used for the Primary Key index. I personally find this a little illogical, as it is the index segment that is created, but I guess others would find it more natural that you still state this at the table level.

When I create IOTs on a real system, I put the IOT in a table tablespace {I still maintain table and index tablespaces, for reasons I won’t go into here}. I put it there as it holds the actual data. If I lose that Primary Key index I am losing real data, not duplicated data.

I then populated the two CHILD tables with data. The method of creating this test data is very important.

I am simulating a very common situation, where data is coming in for a set of Parents (think customers, accounts, scientific instruments, financial entities) and the data is coming in as a record or set of records each day, i.e. not where the parent and all of its child records are created at one time, like an order and its order lines. I am simulating where the child data is created a few records at a time, not all in one go.

The code is simple. It loops for one hundred days, and for each day it creates 10,000 records for random parents. On each day any given parent will have none, one or several records. On average, each parent will end up with 100 records, but some will have more and some less. The key thing is that the data for any given parent is created a record at a time, with lots of records created for other parents before the next record for that given parent.

The two tables will have the same pattern of data but not identical data. {I could have seeded the random number generator to make the two data sets the same, but this will do}. Below is the statement for one table; you just change the table name to populate each table. {BTW I like using the from dual connect by level <=x method of getting the number of rows desired – it is fast and neat, once you have seen it once}.

declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
  v_str:=dbms_random.string('U',60);
  insert into CHILD_HEAP
    (pare_id,cre_date,vc_1,date_1,num_1,num_2)
  select
    trunc(dbms_random.value(1,v_num))
   ,sysdate-(100-i) + (rownum/(60*60*24) )
   ,substr(v_str,1,51+mod(rownum,10))
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/

I then gathered object stats on the tables.
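
The exact call isn’t shown in the post; something along these lines with DBMS_STATS would do it (a sketch, assuming the MDW schema used above):

begin
  dbms_stats.gather_table_stats(ownname => 'MDW', tabname => 'CHILD_HEAP', cascade => true);
  dbms_stats.gather_table_stats(ownname => 'MDW', tabname => 'CHILD_IOT',  cascade => true);
end;
/
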
Let’s check the size of the tables:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CHILD%';

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHILD_HEAP      TABLE           DATA_01              12288

1 row selected.

ONE row? Where is the other table, where is CHILD_IOT? It does not exist.

Remember from my first post that I made the comment I would have preferred it if Index Organized Tables had been called something like ‘Table Containing Indexes’? The table data has been placed in the Primary Key index and the table segment does not even exist. If you start using IOTs this will catch you out periodically – it does me anyway and I’ve been using them on and off for years :-) .
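
If you want to confirm what has happened in the data dictionary, the IOT shows up in USER_TABLES with an IOT_TYPE of ‘IOT’ (and no segment), and its Primary Key index shows up in USER_INDEXES with an index type of ‘IOT - TOP’. A quick sketch:

select table_name, iot_type, tablespace_name
from   user_tables
where  table_name like 'CHILD%';

select index_name, index_type, tablespace_name
from   user_indexes
where  table_name like 'CHILD%';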

Let’s look at the size of the primary key indexes:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CH%PK'
and segment_name not like '%ORD%'

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHHE_PK         INDEX           INDEX_01              4224
CHIO_PK         INDEX           DATA_01              19456

2 rows selected.

Note that the Primary Key index for CHILD_HEAP, CHHE_PK, is there and is 4,224 blocks in size, and the CHILD_IOT Primary Key, CHIO_PK, is a lot larger at 19,456 blocks. In fact, not only is the CHIO_PK index larger than the CHILD_HEAP table, it is larger than the CHILD_HEAP table and the CHHE_PK index combined. So much for me saying in the last post that IOTs can save disk space? I’ll come back to that in a later post…

Here are some other stats from one of my scripts:

mdw11> @tab_sci_own
owner for Table: mdw
Name for Table: child_heap

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_HEAP          1000,000      12,137    83 YES NO  250711 22:01 NO     1000000

INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHHE_PK         NOR NO  UNI  2      4,034    1000,000      995,857          1          1 250711 22:02

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHHE_PK                      CHILD_HEAP       1   PARE_ID
CHHE_PK                      CHILD_HEAP       2   CRE_DATE

--
--
owner for Table: mdw
Name for Table: child_iot

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_IOT           1000,000                83 YES NO  250711 22:03 NO     1000000

INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHIO_PK         IOT NO  UNI  2     17,855     910,881            0          1          1 250711 22:03

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHIO_PK                      CHILD_IOT        1   PARE_ID
CHIO_PK                      CHILD_IOT        2   CRE_DATE

Note the lack of BLOCKS for the CHILD_IOT table and the CLUSTERING_FACTOR of 0 for the CHIO_PK.

The clustering factor is the number of times Oracle, when scanning the whole index in order, would have to swap to a different Table block to look up the table record for each index entry. If it is close to the number of blocks in the table, then the clustering factor is low and the order of records in the table matches the order of entries in the index. This would make index range scans that need to visit the table reasonably efficient.

If the clustering factor is close to the number of records in the table then it means there is no correlation between index order and table row order, and such index range scans that have to visit the table would be inefficient. Again, this is significant and will be the major topic of the next post.
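
A simple way to see this for yourself is to compare each index’s clustering factor with the blocks and rows of its table, for example (a sketch using the standard dictionary views):

select i.index_name, i.clustering_factor, t.blocks, t.num_rows
from   user_indexes i, user_tables t
where  i.table_name = t.table_name
and    t.table_name in ('CHILD_HEAP','CHILD_IOT');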

The depth of the index does not change, being 3 in each case (a BL, or blevel, of 2).

So, can we see evidence of the theoretical efficiency of looking up single records via the IOT that I mentioned in the first post? Here we go {oh, usual disclaimer, I run the code twice and show the second run, to remove the parsing overhead}:

-- First the Heap table
select * from child_HEAP where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 20:13:21','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- ------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11  LUTFHOCIJNYREYICQNORREAJOVBRIHFVLXNIGIVZDMFJCTGYFWC
25-JUN-11         11         16
1 row selected.

Execution Plan
------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |     1 |    83 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |     1 |    83 |     3   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | CHHE_PK    |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets

--and now the IOT table

select * from child_IOT where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 21:23:41','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- -------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11
CSIGBHSXWNDDTCFRCNWYPRNLEQWPCRYTXQQZHACDEXHOBEYXLNYBHRUHJ
27-JUN-11          7         52
1 row selected.

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |    83 |     2   (0)| 00:00:01 |
|*  1 |  INDEX UNIQUE SCAN| CHIO_PK |     1 |    83 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

{I had to look up the exact values of CRE_DATE of a couple of records to do the above queries}

To look up a single row with the heap table you can see that the explain plan was to carry out a unique scan on the primary key and then look up the row via the rowid, and this took 4 consistent gets: 3 to walk down the index and get the rowid, and one to look up the table block.

For the IOT table the explain plan reveals that there was simply an index unique scan of the Primary Key, nothing more. All the data for the row was there in the index entry rather than just a rowid. Thus only 3 consistent gets were required.

For single row lookups on the Primary Key, IOTs are more efficient than traditional Heap tables with a Primary Key index. {Please, no one point out that if all the columns you need are in the index you also do not need to go to the table, that is a different topic}.
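
If you want to repeat the comparison yourself, SQL*Plus autotrace will show the plan and the consistent gets in one go. A sketch (remember the CRE_DATE has to be a value that actually exists in your randomly generated data):

set autotrace traceonly explain statistics
select * from child_heap
where  pare_id = 1234
and    cre_date = to_date('24-JUN-11 20:13:21','DD-MON-YY HH24:MI:SS');
set autotrace off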

Quite a few people have shown this efficiency before but the next step is far, far more interesting and shows a much more significant impact of IOTs. That is the topic of the next post :-) .

For now, I am going to finish off with what happens with range scans as I suggested they could slow down with an IOT.
Below, I select count(*) for just one of the parent values.

select count(*) from child_heap where pare_id = 2

  COUNT(*)
----------
        98

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHHE_PK |   100 |   400 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

--
--

select count(*) from child_iot where pare_id = 2

  COUNT(*)
----------
        93

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHIO_PK |   100 |   400 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets

Both statements carry out a range scan on the Primary Key of the table. For the normal HEAP table this takes 3 consistent gets, which is no surprise as we have an 8k block size and only around 100 rows for a given parent, so they happen to fit into one leaf block of the index. So Oracle works down the depth of the index and looks at one block.

For the IOT the scan also works down the index, but the entries span more than one leaf block. Even though there are fewer entries, 93 compared to 98, they do not fit in a single leaf block, and thus the total number of consistent gets rises to 4.

Admittedly I was a little lucky in my example above. Sometimes the entries for one parent will span 2 blocks in the heap table’s Primary Key index, and occasionally the entries for the IOT will fit into 2 blocks. But if you look at the number of leaf blocks in the earlier stats (4,034 for the normal index and 17,855 for the IOT, both holding the same 1,000,000 entries), usually the 100 or so entries for a single parent in the normal index will all fall into one block and the entries for the IOT will fall into between 2 and 3 blocks.
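
As a rough sanity check on those numbers: 1,000,000 index entries in 4,034 leaf blocks is roughly 250 entries per leaf block for CHHE_PK, so the ~100 entries for one parent will usually fit in a single leaf block; 1,000,000 entries in 17,855 leaf blocks is only about 56 entries per leaf block for CHIO_PK, so the same ~100 entries will usually span two or three blocks.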

A select count(*) will full scan the smallest segment that can satisfy the query. Let’s try it:

mdw11> select count(*) from child_heap

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |   989   (1)| 00:00:15 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHHE_PK |  1000K|   989   (1)| 00:00:15 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          2  db block gets
       4109  consistent gets
       4088  physical reads

mdw11> select count(*) from child_iot

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |  4359   (1)| 00:01:05 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHIO_PK |  1000K|  4359   (1)| 00:01:05 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      19298  consistent gets
      19246  physical reads

The number of consistent gets (and physical reads) is close to the number of leaf blocks in the two segments, though higher. This is because Oracle is scanning the whole index, leaf blocks and branch blocks. The scan is far more expensive for the IOT, simply because the index is so much larger. I’ve not shown detailed timings, but on my little laptop the count(*) takes about 3 seconds on CHILD_HEAP and about 5 seconds on CHILD_IOT.

That is enough for one post.


O-1 Again

I don’t often waste time searching the internet for evidence of my brilliance – but sometimes you’ve just got to do it ;) My O-1 visa (aliens of exceptional ability) was up for renewal, and the immigration authority decided to ask me for more evidence of the significance of my work.

In part, I think, this was because they had decided that Cary Millsap and Tom Kyte were colleagues of mine and therefore couldn’t be used as referees for Cost-Based Oracle. This put me into a weird “Catch-22” situation – everyone I know who’s in a position to act as a referee is someone I know quite well, which means they’re just as much a “colleague” as Tom or Cary. So anyone I could ask to write a reference presumably wouldn’t be acceptable because I’d asked them!

In the end I asked the presidents of a number of international Oracle User Groups and editors of various magazines to make a statement for me – and they were all very helpful, for which I am grateful.

I also found a couple of dozen books and a handful of academic papers that reference my work – and that must have helped too. I also did some searches for salary surveys for the USA – it looks as if my fee income puts me somewhere between a top-rate gynaecologist and a top-rate surgeon (there’s an analogy lurking there somewhere).

Anyway, my passport arrived back from the US Embassy early this morning – so I’m ready for my next trip to the USA (California and Minnesota – 16th Aug)

OpenWorld Suggest-a-Sessions @ Oracle Mix

There has been quite a bit written about Oracle Mix Voting (let me refer to it as OMV for short) this year and I don’t think I should waste your time restating any of this. What I wanted to do is to take a step back and try to list “stakeholders” of OMV. Before I [...]

Method-R Tools

It’s rare to find professional-level tools that are written by the people who use them. Most professional tools are written by teams of people who have little idea of how they are used, and many users write tools that never reach a professional level. Thus it is a pleasure to see the tools put out by Method R with Cary Millsap and Jeff Holt. Jeff and Cary have been well known in the industry for their work and contributions to the world of Oracle performance tuning. With Method R Tools they not only share the know-how and tools they use to tune systems, but they have polished them into solid, clean, tight professional-level tools that are dependable, high quality and written with integrity.

See: Pythian’s blog on Method-R tools

Method R Tools

  1. MR Trace ($49.95) – automatically creates, collects and manages trace files for queries run in Oracle’s SQL Developer using extended tracefile options targeted at collecting pertinent performance information for optimizing SQL statements.
  2. MR Tools (starting at $397)

mrskew - your trace file profiler and data miner, the core MR Tools product, check out Karl Arao’s examples

other tools

  • mrls - trace file lister, examples, list trace files sorted by  LIO or sorted by rows returned
  • mrcallrm - trace file cropper – retaining required info and filtering undesired

Trace file timings:

  • mrtim - timestamp converter
  • mrtimfix - your trace file tim value fixer, fix 11gR1 bug in trace files & change timing to UNIX epoch
  • mrnl - pre-pend lines in trace with line #, time, duration and other options as well
  3. Method R Profiler (starting at $2,500)

The Method R Profiler is the gold standard of the Method R tools. The Profiler takes all the know-how and analysis and wraps it up in a clean, crisp HTML UI that makes for easy browsing, exploration and analysis.

I look forward to blogging soon on some of the innovative and powerful features of Method R Profiler

 

A quick summary of things in the report

 

1. Task-Level Profiles
1.1 Profile by subroutine
   subroutines:
     events like:
      db file sequential read
      db file scattered read
   spark line
   table, each line is a bucket between 1us and 1000 secs, in orders of 10
      call count
      total seconds
1.2 next subroutine

3. Contribution to Subroutine by Statement
3.1 Statement Contributions to <>
     statement text (with sql hash)
       seconds duration
       count
       mean
       min
       max
     drill down to stats
3.2 next subroutine

4. Statement Detail
4.1 Statement Report for <>
   Oracle hash value:
   Full statement text: drilldown
   Re-use: # distinct text
   Text size: x lines, y bytes
   # of distinct plans: X
   Optimizer goals:
   response time contribution:
4.1.1 Profile by Contribution to Parent Cursor
4.1.2 Profile by Subroutine
     waits, CPU, time by
     seconds, count, mean, min, max, skew sparkline
4.1.3 Profile by Database Call
      for
        FETCH
        between calls
        EXEC
        PARSE
        cursor close
     show
       seconds
       CPU
        rows processed
        buffer cache access total
         CR
         CU
      PIOs
      Library cache misses
4.1.4 Execution Plans and Placeholder Values
      Execution plan
        seconds for row
        seconds including descendants
        rows returned
        CR LIOs
        PIOs read
        PIOs writes
4.1.5 Statement Text
4.2 ... next statement ...
5. Configuration information

The Oracle Exadata IO Resource Manager Limit Clause, part 2

In my previous post I described how the IO Resource Manager (IORM) in the Exadata storage cells can be used both to guarantee a minimum amount of IO a database can get (which is what is covered in most material available) and also to set a maximum amount of IO. This is what Oracle calls an inter-database resource manager plan. It is set and configured at the cell level using CellCLI with ‘alter iormplan dbplan’.

But Randy Johnson (who is one of the authors of the upcoming book ‘Expert Oracle Exadata’) commented that this is not the only way to limit IO: it can be done using the Database Resource Manager (DBRM) too. This means you can set a limit on CPU resource usage at the database layer, which gets pushed to the storage layer and enforces IO limits according to the limit set at the database layer. Interesting!

Let’s try it!

Here’s a simple query without any resource management:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:00:44.00

Now create a simple resource manager plan, with one resource group which limits usage to 1%:

BEGIN
DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

DBMS_RESOURCE_MANAGER.CREATE_PLAN(
PLAN => 'TESTPLAN',
COMMENT => 'Testplan! Limit a user to 1 percent');

DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP (
CONSUMER_GROUP => 'LIMITED',
COMMENT => 'This is a limited group');

DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
PLAN => 'TESTPLAN',
GROUP_OR_SUBPLAN => 'LIMITED',
COMMENT => 'This group is limited to 1 percent',
MAX_UTILIZATION_LIMIT => 1);

DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
PLAN => 'TESTPLAN',
GROUP_OR_SUBPLAN => 'OTHER_GROUPS',
COMMENT => 'Mandatory group');

DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
(DBMS_RESOURCE_MANAGER.ORACLE_USER, 'CG', 'LIMITED');

DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

Mind the “MAX_UTILIZATION_LIMIT => 1”! Using this plan, no sessions in this database are limited except for the Oracle user ‘CG’, which has a maximum utilization limit of 1 percent.

Now grant access to the resource group ‘LIMITED’ for the user ‘CG’:

BEGIN
DBMS_RESOURCE_MANAGER_PRIVS.grant_switch_consumer_group(
grantee_name => 'CG',
consumer_group => 'LIMITED',
grant_option => FALSE);
END;
/

And activate the resource manager plan:

ALTER SYSTEM SET RESOURCE_MANAGER_PLAN = 'TESTPLAN' sid='*';
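
A quick way to check that the plan is now active is to query V$RSRC_PLAN (a sketch; the view lists the currently active resource plans):

select name, is_top_plan from v$rsrc_plan;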

And try the same simple SQL again:

SQL> select count(*) from cg_var;

COUNT(*)
----------
1596587000

Elapsed: 00:36:10.21

This is clearly limited!
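
To confirm that the slow run really executed in the LIMITED consumer group, you could check the session from another connection, for example (a sketch):

select sid, username, resource_consumer_group
from   v$session
where  username = 'CG';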

The resource management is done on the cell/storage layer. This means the wait events do not indicate resource management (!). When looking at the waits, the process is mostly waiting on the ‘cell smart table scan’ event.

Tagged: exadata, IORM, management, oracle, resource, resource management, resource manager

So you don’t see any disks when trying to install ASM?

This is a post that highlights the differences between operating systems, but also the fact that sometimes it is hard to break out of a habit once you have got used to it. My background is that of a Linux enthusiast; even though I equally like Solaris and AIX, I have a little less exposure to those.

Background

I have recently been asked to look at RAC on SPARC, which I gladly did. The system I was given had the usual software stack for RAC at this customer’s site. It comprised:

  • Solaris 10 Update 9 64bit on SPARC
  • EMC Power Path 5.1
  • EMC VMAX storage – 10x10G LUNs for a specific performance test

The Power Path configuration had already been put in place when I got the machine, and I was allocated /dev/emcpower2[0-9] for my ASM testing. For the experienced Linux user who relies on device-mapper-multipath, the Power Path naming convention can be a bit confusing at first. For reference, the pseudo devices we are interested in for ASM are created under /dev/rdsk/ – the “raw” device directory for “character”-based access, rather than the block devices in /dev/dsk/. By default, the Power Path devices are called “emcpower”, followed by a number and a letter (on SPARC). An example would be /dev/rdsk/emcpower20c.

The number (20) is just a sequence number and doesn’t have any other meaning as far as I know. It also doesn’t seem to be consistent across the cluster nodes by default. The suffix letter (“c”) does have a meaning though: it indicates the slice of the device. For example, the letter “a” indicates slice 0, xxxc is the whole disk, and xxxg is slice 6.

You can use the format command to have a look at the slices currently defined. Simply type “partition” and then “print” to print it.

Now all the disks I was given had a slice 6, from cylinder 274 to the end of the disk:

# format
[disk selection not shown here]
format> partition
partition> print
Current partition table (original):
Total disk cylinders available: 10238 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 -   136      128.44MB    (137/0/0)     263040
  1       swap    wu     137 -   273      128.44MB    (137/0/0)     263040
  2     backup    wu       0 - 10237        9.37GB    (10238/0/0) 19656960
  3 unassigned    wm       0                0         (0/0/0)            0
  4 unassigned    wm       0                0         (0/0/0)            0
  5 unassigned    wm       0                0         (0/0/0)            0
  6        usr    wm     274 - 10237        9.12GB    (9964/0/0)  19130880
  7 unassigned    wm       0                0         (0/0/0)            0

partition> quit

If you’d like to see how the disks are connected to the array, or if there are any faults, use powermt display dev=all, or alternatively look at a specific device:

# powermt display dev=emcpower20
Pseudo name=emcpower20a
Symmetrix ID=00x2x4x0x6x4
Logical device ID=0x0x
state=alive; policy=SymmOpt; priority=0; queued-IOs=0;
==============================================================================
--------------- Host ---------------   - Stor -   -- I/O Path --  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode    State   Q-IOs Errors
==============================================================================
3072 pci@400/pci@0/pci@c/fibre-channel@0/fp@0,0 c2t50000974C00A6118d17s0 FA  7eA   active  alive       0      0
3077 pci@500/pci@0/pci@9/fibre-channel@0/fp@0,0 c3t50000974C00A611Cd17s0 FA  8eA   active  alive       0      0

If you care to learn more about how your system administrator configured the paths (I have 2 HBAs as you can see), use “powermt display”. Also note the disks c2t5….s0 and c3t5…s0 are the native devices aggregated into emcpower20a.

I have changed the permissions on the pseudo devices to 664 and made them owned by oracle:oinstall before launching OUI. In the disk discovery screen I changed the disk discovery string to /dev/rdsk/emcpower2* but didn’t see a single disk in the OUI window. That was odd.

Troubleshooting disk discovery

This error led me to look at /tmp/OraInstall/installActions.log which referenced a call to kfod, the disk discovery tool. The entry in the log file showed this content:

WARNING: SRVCTL not found as there is no Grid Infrastructure home on the box.
 INFO: Executing [/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod, nohdr=true, verbose=true, disks=all, status=true\
 op=disks, asm_diskstring='/dev/rdsk/emcpower2*']
 with environment variables {XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt, DISPLAY=localhost:12.0,\
 PWD=/u01/app/oracle/stage/grid, LC_CTYPE=en_GB.ISO8859-1, _=./runInstaller, USER=oracle,\
 NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat, LC_COLLATE=en_GB.ISO8859-1, CLASSPATH=, HOME=/export/home/oracle,\
 LC_NUMERIC=en_GB.ISO8859-1, SSH_CONNECTION=10.128.46.56 52883 10.129.48.214 22,\
 LD_LIBRARY_PATH=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/../lib,\
 ORACLE_HOME=/tmp/OraInstall2011-07-25_11-33-50AM/ext/bin, SHELL=/usr/bin/bash, LOGNAME=oracle,\
 SSH_TTY=/dev/pts/3, SHLVL=1, LC_MONETARY=en_GB.ISO8859-1, MANPATH=/usr/share/man:/usr/openwin/share/man:\
 /db/pub/man:/db/pub/infra/samba/current/man:/db/pub/infra/rsync/current/man:/db/pub/infra/rcs/current/man:\
 /db/pub/infra/tcpdump/current/man:/usr/dt/share/man:/usr/java1.2/man:/usr/apache/man:/usr/perl5/man:/usr/j2se/man:\
 /usr/local/man:/opt/SUNWconn/ge/man:/opt/SUNWconn/man:/opt/SUNWsan/man:/opt/VRTS/man:/opt/DBpam/man:\
 /opt/SUNWexplo/man:/opt/SUNWcstu/man:/opt/VRTSvlic/man, SSH_CLIENT=10.128.46.56 52883 22, \
 MAIL=/var/mail//oracle, TZ=GB-Eire, LC_TIME=en_GB.ISO8859-1, JAVA_HOME=, LC_MESSAGES=C, PS1=[\u@\h \W]> , \
 OLDPWD=/u01/app/oracle/stage, LC_ALL=, TERM=xterm, TMOUT=0, PATH=/usr/bin:/usr/ccs/bin:/usr/bin:/opt/EMCpower/bin:\
 /etc/emc/bin:/etc:/zones/app/oracle/stage/grid/install}
 INFO: Starting Output Reader Threads for process /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod
 INFO: Parsing KFOD-00311: Error scanning device /dev/rdsk/c3t50000974C00A611Cd16s7
[more errors skipped]
..

(Re-) Discovery of KFOD

So therein lies the problem! As a side effect I rediscovered the command line options used with kfod! Maybe I can use this for some experiments… All you need to do is to set environment variables for ORACLE_HOME and LD_LIBRARY_PATH as per the above output.

After a little fiddling around I remembered that, unlike in the tests with ZFS I did before, I can’t pass the whole disk to ASM (the VTOC is in the first part of the disk!), but rather have to use a slice. So by changing the disk string to /dev/rdsk/emcpower2*g I got a successful report from kfod, executed as “oracle”:

$ /tmp/OraInstall2011-07-25_11-33-50AM/ext/bin/kfod verbose=true, disks=all status=true op=disks \
> asm_diskstring='/dev/rdsk/emcpower2*g'
--------------------------------------------------------------------------------
 Disk          Size Header    Path                                     User     Group
================================================================================
   1:       9341 Mb CANDIDATE /dev/rdsk/emcpower20g                    oracle   oinstall
   2:       9341 Mb CANDIDATE /dev/rdsk/emcpower21g                    oracle   oinstall
   3:       9341 Mb CANDIDATE /dev/rdsk/emcpower22g                    oracle   oinstall
   4:       9341 Mb CANDIDATE /dev/rdsk/emcpower23g                    oracle   oinstall
   5:       9341 Mb CANDIDATE /dev/rdsk/emcpower24g                    oracle   oinstall
   6:       9341 Mb CANDIDATE /dev/rdsk/emcpower25g                    oracle   oinstall
   7:       9341 Mb CANDIDATE /dev/rdsk/emcpower26g                    oracle   oinstall
   8:       9341 Mb CANDIDATE /dev/rdsk/emcpower27g                    oracle   oinstall
   9:       9341 Mb CANDIDATE /dev/rdsk/emcpower28g                    oracle   oinstall
  10:       9341 Mb CANDIDATE /dev/rdsk/emcpower29g                    oracle   oinstall
KFOD-00311: Error scanning device /dev/rdsk/emcpower2g
ORA-15025: could not open disk "/dev/rdsk/emcpower2g"
SVR4 Error: 13: Permission denied
Additional information: 42
Additional information: 272256
Additional information: 12699536

I haven’t worked out which wildcard characters are valid in the ASM_DISKSTRING (‘/dev/rdsk/emcpower2[0-9]g’ didn’t work), so I left it there. The emcpower2g device was not meant to be used in the test anyway.

With this setting I returned to the OUI window and continued the installation.

Lesson Learned

Oracle doesn’t always have useful logs, but in this case it was very useful to look at the install-actions file in /tmp. And also remember that preparing disks for ASM on Solaris and other platforms is different from Linux.

Logical I/O - Evolution: Part 2 - 9i, 10g Prefetching

In the initial part of this series I've explained some details regarding logical I/O, using a Nested Loop Join as an example.

To recap I've shown in particular:

- Oracle can re-visit pinned buffers without performing logical I/O

- There are different variants of consistent gets - a "normal" one involving buffer pin/unpin cycles requiring two latch acquisitions and a short-cut variant that visits the buffer while holding the corresponding "cache buffers chains" child latch ("examination") and therefore only requiring a single latch acquisition

- Although two statements use a similar execution plan and produce the same number of logical I/Os, one is significantly faster and scales better than the other

The initial part used the "classic" shape of the Nested Loop Join, but Oracle introduced in recent releases various enhancements in that area - in particular in 9i the "Table Prefetching" and in 11g the Nested Loop Join Batching using "Vector/Batched I/O".

Although these enhancements have been introduced primarily to optimize the physical I/O patterns, they could also have an influence on logical I/O.

The intention of Prefetching and Batching seems to be the same - they both are targeted towards the usually most expensive part of the Nested Loop Join: The random table lookup as part of the inner row source. By trying to "prefetch" or "batch" physical I/O operations caused by this random block access Oracle attempts to minimize the I/O waits.

I might cover the effect on physical I/O of both "Prefetching" and "Batching" in separate posts, here I'll only mention that you might see "db file scattered read" or "db file parallel read" multi-block I/O operations instead of single block "db file sequential read" operations for the random table access with those optimizations (Index prefetching is also possible, by the way). Note also that if you see the Prefetching or Batching plan shape it does not necessarily mean that it is actually going to happen at execution time - Oracle monitors the effectiveness of the Prefetching and can dynamically decide whether it will be used or not.

10.2.0.4 Table Prefetching - Random order

Let's enable table prefetching in 10.2 and re-run the original test case. The first run will use the different order variant of T1 and T2:

Inner row source Unique Index - T1 different order than T2

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time     | Buffers |
----------------------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |      1 |00:00:04.12 |    310K |
|   2 |   NESTED LOOPS                 |        |      1 |   100K |  202K  (1) |   100K |00:00:03.90 |    310K |
|   3 |    TABLE ACCESS FULL           | T2     |      1 |   100K |  2716  (1) |   100K |00:00:00.30 |   10010 |
|   4 |    TABLE ACCESS BY INDEX ROWID | T1     |   100K |      1 |     2  (0) |   100K |00:00:02.71 |    300K |
|*  5 |     INDEX UNIQUE SCAN          | T1_IDX |   100K |      1 |     1  (0) |   100K |00:00:01.20 |    200K |
----------------------------------------------------------------------------------------------------------------

Inner row source Non-Unique Index - T1 different order than T2

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time     | Buffers |
----------------------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |      1 |00:00:05.03 |    311K |
|   2 |   TABLE ACCESS BY INDEX ROWID  | T2     |      1 |      1 |     2  (0) |   100K |00:00:04.40 |    311K |
|   3 |    NESTED LOOPS                |        |      1 |   100K |  202K  (1) |   200K |00:00:03.02 |    211K |
|   4 |     TABLE ACCESS FULL          | T1     |      1 |   100K |  2716  (1) |   100K |00:00:00.30 |   10010 |
|*  5 |     INDEX RANGE SCAN           | T2_IDX |   100K |      1 |     1  (0) |   100K |00:00:01.49 |    201K |
----------------------------------------------------------------------------------------------------------------

As you'll see, in 10g even with table prefetching enabled the unique index variant looks the same and performs similarly to the original post.

This changes in 11g by the way, where the unique index variant also supports the table prefetching plan shape.

For the non-unique variant you'll see a different shape of the execution plan where the inner row source random table lookup is actually a parent operation to the Nested Loop Join (and hence will only be started once and consumes the information generated by the child Nested Loop operation).

Note that in the case of nested Nested Loop Joins only the inner-most row source will make use of the Table Prefetching shape. The same applies to the 11g Nested Loop Join Batching. If you happen to have several Nested Loop Joins that are not directly nested then each of the inner-most row sources might use the Table Prefetching/Batching shape - which means that it can be used more than once as part of a single execution plan.

If you compare the Runtime profile of the non-unique index variant with the original Runtime profile without Table Prefetching you'll not see any difference in terms of logical I/O, however it becomes obvious that the overall execution is actually slightly faster (more significant with row source sampling overhead enabled). In particular the random table access requires significantly less time than in the original Runtime profile, so it seems to be more efficient, although it is still slower than the unique index variant.

Begin Update

Having focused on the logical I/O I completely forgot to mention the inconsistency in the A-Rows column (thanks to Flado who pointed this out in his comment below), which shows 200K rows for the Nested Loop operation although only 100K rows have been identified in the inner index lookup. I believe this is an inconsistency that also shows up when performing an SQL trace so it seems to be a problem with the row source statistics. In principle with this plan shape the Nested Loop Join operation seems to account for the sum of both the rows identified in the driving row source and the inner index lookup, rather than the expected number of rows identified in the inner index lookup only.

However, as mentioned below in the "Statistics" section there is another anomaly - a consistent get and "buffer is pinned count" for every row looked up in the inner table, so this might not be just coincidence but another indicator that there is really some excess work happening with Table Prefetching.

By the way - both anomalies are still present in 11.1 / 11.2 when using Table Prefetching there.

End Update

Let's have a look at the session statistics.

Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
STAT..consistent gets 310,012 311,101 1,089
STAT..consistent gets from cache 310,012 311,101 1,089
STAT..session logical reads 310,012 311,101 1,089
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..buffer is not pinned count 200,001 99,997 -100,004
STAT..buffer is pinned count 189,998 290,006 100,008
STAT..consistent gets - examination 300,001 100,007 -199,994
STAT..no work - consistent read gets 10,001 211,084 201,083
LATCH.cache buffers chains 320,031 522,195 202,164

So the only significant difference in this case is the increased "buffer is pinned count" / decreased "buffer is not pinned count" statistics, although the number of logical I/Os stays the same. I don't know if this really means excess work with Table Prefetching enabled or whether this is an instrumentation problem. Nevertheless, with Table Prefetching enabled in this case you'll end up with both a "buffer is pinned count" and a "consistent get" for each row looked up in the inner row source table operation. The number of logical I/Os and latch acquisitions stays the same, so it's not obvious from the statistics why this performs better than the non-Table Prefetching case - according to the statistics it even performs more work, but maybe the random table access as parent operation to the Nested Loop allows more efficient processing requiring fewer CPU cycles.
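
For reference, the session statistics shown here can be captured by snapshotting the relevant counters for your own session before and after each run; a minimal sketch (10g and later):

select sn.name, ss.value
from   v$sesstat ss, v$statname sn
where  sn.statistic# = ss.statistic#
and    ss.sid = sys_context('userenv','sid')
and    sn.name in ('consistent gets','consistent gets - examination',
                   'buffer is pinned count','buffer is not pinned count');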

10.2.0.4 Table Prefetching - Same (Random) order

Let's change the data order and use either the same "Pseudo-Random" order (by uncommenting the second "dbms_random.seed(0)" call) or order by ID - it doesn't matter with Table Prefetching in 10g.

Inner row source Unique Index - T1 and T2 same order

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time     | Buffers |
----------------------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |      1 |00:00:03.91 |    310K |
|   2 |   NESTED LOOPS                 |        |      1 |   100K |  202K  (1) |   100K |00:00:03.70 |    310K |
|   3 |    TABLE ACCESS FULL           | T2     |      1 |   100K |  2716  (1) |   100K |00:00:00.30 |   10010 |
|   4 |    TABLE ACCESS BY INDEX ROWID | T1     |   100K |      1 |     2  (0) |   100K |00:00:02.54 |    300K |
|*  5 |     INDEX UNIQUE SCAN          | T1_IDX |   100K |      1 |     1  (0) |   100K |00:00:01.14 |    200K |
----------------------------------------------------------------------------------------------------------------

Inner row source Non-Unique Index - T1 and T2 same order

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time     | Buffers |
----------------------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |      1 |00:00:04.54 |    221K |
|   2 |   TABLE ACCESS BY INDEX ROWID  | T2     |      1 |      1 |     2  (0) |   100K |00:00:03.90 |    221K |
|   3 |    NESTED LOOPS                |        |      1 |   100K |  202K  (1) |   200K |00:00:02.82 |    211K |
|   4 |     TABLE ACCESS FULL          | T1     |      1 |   100K |  2716  (1) |   100K |00:00:00.30 |   10010 |
|*  5 |     INDEX RANGE SCAN           | T2_IDX |   100K |      1 |     1  (0) |   100K |00:00:01.40 |    201K |
----------------------------------------------------------------------------------------------------------------

Now we really see a difference: The unique index variant still shows the same results, but the non-unique variant saves logical I/O on the random table access - and is faster than with random order - coming closer to the unique index variant performance.

Whereas the index range scan still requires approx. 200,000 logical I/Os, the random table access only requires 10,000 logical I/Os instead of 100,000.

The session statistics:

Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
LATCH.cache buffers chains 320,023 342,213 22,190
STAT..consistent gets 310,012 221,110 -88,902
STAT..consistent gets from cache 310,012 221,110 -88,902
STAT..session logical reads 310,012 221,110 -88,902
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..no work - consistent read gets 10,001 121,093 111,092
STAT..buffer is not pinned count 200,001 10,006 -189,995
STAT..buffer is pinned count 189,998 379,997 189,999
STAT..consistent gets - examination 300,001 100,007 -199,994

The session statistics confirm this: The "buffer is pinned count" increases by another 90,000 for the non-unique index variant, which corresponds to the 90,000 fewer logical I/Os performed as part of the random table access operation.

The number of latch acquisitions decreases accordingly so that we end up with a comparable number as with the unique index variant.

Scalability

If you run the non-unique index Table Prefetching variant with the concurrent execution test harness you'll see a correspondingly slightly increased scalability, although it still does not scale as well as the unique index variant.

Summary

Table Prefetching was introduced in Oracle 9i in order to optimize the random physical access in Nested Loop Joins, however it also seems to have a positive effect on logical I/O. The effectiveness of this optimization depends on the data order - if the data from the driving row source is in the same order as the inner row source table, buffers can be kept pinned. Note that the same doesn't apply to the index lookup - even if the data is ordered by ID, and consequently the same index branch and leaf blocks will be accessed repeatedly with each iteration, a buffer pinning optimization could not be observed.

In the next part we'll see what happens with this example in Oracle 11g and its new features.

Homecoming

I’ve been on holiday, walking in the Lake District, for a few days, so you haven’t seen much from me for a while (apart from a couple of post-dated items).

Now that I’m back I’ve got several hundred email messages, a couple of jobs, and a load of writing to catch up on, so you’re still not going to see much output. However, I have a list of odds and ends that I’ve filed for later use, and I started tidying this list up a few weeks ago, so I thought I’d share the tidy bit with you – just to make sure you have some interesting reading over the next few days.

It’s published as a WordPress “page” rather than a post; I’ll probably post a new temporary link to it from time to time so that people can see when it gets updated. I’ve also added it to the “Special Links” list in the right-hand column.