Oakies Blog Aggregator

Updated (XML) Content Section

“Once per year I try to update the “XML Content” page that, in principle, should…

STATISTICS COLLECTOR Row Source Operation and LOBs

The aim of the STATISTICS COLLECTOR row source operation, which is used in adaptive plans, is to buffer all data produced by its child operation until it is known whether the inflection point is crossed. It goes without saying that buffering requires memory and, therefore, Oracle Database has to limit the amount of memory that can be allocated for that purpose. As a result, in some situations no adaptive plans can be used because according to the query optimizer estimations too much memory is required.

A special case of the previous limitation is when the row source operation under the STATISTICS COLLECTOR operation produces a LOB. In fact, when a LOB is detected, independently of the amount of data that is expected to be produced, the adaptive plan is bypassed by the query optimizer.

Let’s have a look at an example:

  • Create the test table T1
SQL> CREATE TABLE t1 (id, n, pad)
  2  AS
  3  SELECT rownum, rownum, lpad('*',100,'*')
  4  FROM dual
  5  CONNECT BY level <= 10000;

SQL> INSERT INTO t1
  2  SELECT 10000+rownum, 666, lpad('*',100,'*')
  3  FROM dual
  4  CONNECT BY level <= 150;



SQL> execute dbms_stats.gather_table_stats(user,'t1')
  • Create the test table T2
SQL> CREATE TABLE t2 (id, n, pad)
  2  AS
  3  SELECT rownum, rownum, lpad('*',100,'*')
  4  FROM dual
  5  CONNECT BY level <= 10000;


SQL> execute dbms_stats.gather_table_stats(user,'t2')
  • The query optimizer selects an adaptive plan when T1 and T2 are joined
SQL> EXPLAIN PLAN FOR
  2  SELECT *
  3  FROM t1, t2
  4  WHERE t1.id = t2.id
  5  AND t1.n = 666;

SQL> SELECT * FROM table(dbms_xplan.display(format=>'basic +predicate +note +adaptive'));

Plan hash value: 1837274416

|   Id  | Operation                     | Name  |
|     0 | SELECT STATEMENT              |       |
|- *  1 |  HASH JOIN                    |       |
|     2 |   NESTED LOOPS                |       |
|     3 |    NESTED LOOPS               |       |
|-    4 |     STATISTICS COLLECTOR      |       |
|  *  5 |      TABLE ACCESS FULL        | T1    |
|  *  6 |     INDEX UNIQUE SCAN         | T2_PK |
|     7 |    TABLE ACCESS BY INDEX ROWID| T2    |
|-    8 |   TABLE ACCESS FULL           | T2    |

Predicate Information (identified by operation id):

   1 - access("T1"."ID"="T2"."ID")
   5 - filter("T1"."N"=666)
   6 - access("T1"."ID"="T2"."ID")

   - this is an adaptive plan (rows marked '-' are inactive)
  • Create the test table T1L which contains the same data as table T1; notice that a CLOB is used instead of a VARCHAR2
SQL> CREATE TABLE t1l
  2  AS
  3  SELECT id, n, to_clob(pad) AS pad
  4  FROM t1;


SQL> execute dbms_stats.gather_table_stats(user,'t1l')
  • The query optimizer does not select an adaptive plan when T1L and T2 are joined
SQL> EXPLAIN PLAN FOR
  2  SELECT *
  3  FROM t1l, t2
  4  WHERE t1l.id = t2.id
  5  AND t1l.n = 666;

SQL> SELECT * FROM table(dbms_xplan.display(format=>'basic +predicate +note +adaptive'));

Plan hash value: 810895618

| Id  | Operation                    | Name  |
|   0 | SELECT STATEMENT             |       |
|   1 |  NESTED LOOPS                |       |
|   2 |   NESTED LOOPS               |       |
|*  3 |    TABLE ACCESS FULL         | T1L   |
|*  4 |    INDEX UNIQUE SCAN         | T2_PK |

Predicate Information (identified by operation id):

   3 - filter("T1L"."N"=666)
   4 - access("T1L"."ID"="T2"."ID")

Note that when the query optimizer detects the limitation just described, in the 10053 trace it writes the following information:

AP: Adaptive joins bypassed for table T2 @ SEL$1 due to LOB-producing table under statistics collector
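A quick way to check whether a given 10053 trace file contains this message is to search for its fixed prefix. The following is a minimal shell sketch: the function name and the throwaway sample file are illustrative assumptions, and the search string is simply the start of the message quoted above.

```shell
# count_ap_bypass FILE
# Prints how many "adaptive joins bypassed" messages a trace file contains.
# The function name is hypothetical; the pattern is the prefix of the
# message quoted above.
count_ap_bypass() {
  # grep -c exits non-zero when the count is 0, so neutralise the status
  grep -c 'AP: Adaptive joins bypassed' "$1" || true
}

# Demo on a temporary file holding the message quoted above.
sample=$(mktemp)
printf '%s\n' 'AP: Adaptive joins bypassed for table T2 @ SEL$1 due to LOB-producing table under statistics collector' > "$sample"
count_ap_bypass "$sample"
rm -f "$sample"
```

In practice the argument would be the path of the trace file written by the session that had event 10053 enabled.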

The script I wrote to check this behaviour can be downloaded here.


I have a simple script that creates two identical tables, collects stats (with no histograms) on the pair of them, then executes a join. Here’s the SQL to create the first table:

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	trunc(dbms_random.value(0,1000))	n_1000,
	trunc(dbms_random.value(0,750))		n_750,
	trunc(dbms_random.value(0,600))		n_600,
	trunc(dbms_random.value(0,400))		n_400,
	trunc(dbms_random.value(0,90))		n_90,
	trunc(dbms_random.value(0,72))		n_72,
	trunc(dbms_random.value(0,40))		n_40,
	trunc(dbms_random.value(0,3))		n_3
from
	generator	v1,
	generator	v2
where
	rownum <= 1e6
;

-- gather stats: no histograms

The two tables have 1,000,000 rows each and t2 is created from t1 with a simple “create as select”. The columns are all defined to be integers, and the naming convention is simple – n_400 holds 400 distinct values with uniform distribution from 0 – 399, n_750 holds 750 values from 0 – 749, and so on.

Here’s the simple query:

select
        t1.*, t2.*
from
        t1, t2
where
        t1.n_400 = 0
and     t2.n_72  = t1.n_90
and     t2.n_750 = t1.n_600
and     t2.n_400 = 1
;

Since I’ve created no indexes you might expect the query to do a couple of tablescans and a hash join to get its result – and you’d be right; but what do you think the predicted cardinality would be?

Here are the results from running explain plan on the query and then reporting the execution plan – for three different versions of Oracle:
| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)|
|   0 | SELECT STATEMENT     |             |    96 |  4992 |  1230  (10)|
|*  1 |  HASH JOIN           |             |    96 |  4992 |  1230  (10)|
|*  2 |   TABLE ACCESS FULL  | T1          |  2500 | 65000 |   617  (11)|
|*  3 |   TABLE ACCESS FULL  | T2          |  2500 | 65000 |   613  (10)|

Predicate Information (identified by operation id):
   1 - access("T2"."N_750"="T1"."N_600" AND "T2"."N_72"="T1"."N_90")
   2 - filter("T1"."N_400"=0)
   3 - filter("T2"."N_400"=1)

| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |      |   116 |  6032 |  1229  (10)| 00:00:07 |
|*  1 |  HASH JOIN         |      |   116 |  6032 |  1229  (10)| 00:00:07 |
|*  2 |   TABLE ACCESS FULL| T1   |  2500 | 65000 |   616  (11)| 00:00:04 |
|*  3 |   TABLE ACCESS FULL| T2   |  2500 | 65000 |   612  (10)| 00:00:04 |

Predicate Information (identified by operation id):
   1 - access("T2"."N_750"="T1"."N_600" AND "T2"."N_72"="T1"."N_90")
   2 - filter("T1"."N_400"=0)
   3 - filter("T2"."N_400"=1)

| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |      |  2554 |   139K|  1225  (10)| 00:00:07 |
|*  1 |  HASH JOIN         |      |  2554 |   139K|  1225  (10)| 00:00:07 |
|*  2 |   TABLE ACCESS FULL| T1   |  2500 | 70000 |   612  (10)| 00:00:04 |
|*  3 |   TABLE ACCESS FULL| T2   |  2500 | 70000 |   612  (10)| 00:00:04 |

Predicate Information (identified by operation id):
   1 - access("T2"."N_72"="T1"."N_90" AND "T2"."N_750"="T1"."N_600")
   2 - filter("T1"."N_400"=0)
   3 - filter("T2"."N_400"=1)

The change for (which is still there for I didn’t check to see if it also appears in is particularly worrying. When you see a simple query like this changing cardinality on the upgrade you can be fairly confident that some of your more complex queries will change their plans – even if there are no clever new optimizer transformations coming into play.
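As a rough point of comparison, the commonly quoted join cardinality arithmetic (the product of the two filtered row counts, divided by the larger number of distinct values for each join predicate) can be applied to the figures above. This is only a sketch of that textbook arithmetic, not a description of what any particular optimizer version does; the function name is made up for illustration.

```shell
# join_card ROWS1 ROWS2 NDV1A NDV2A NDV1B NDV2B
# Textbook estimate for a two-predicate equi-join:
#   rows1 * rows2 / ( max(ndv1a, ndv2a) * max(ndv1b, ndv2b) )
# The function name is hypothetical; it is not an Oracle utility.
join_card() {
  awk -v r1="$1" -v r2="$2" -v a1="$3" -v a2="$4" -v b1="$5" -v b2="$6" 'BEGIN {
    da = (a1 > a2) ? a1 : a2      # larger NDV for the first join predicate
    db = (b1 > b2) ? b1 : b2      # larger NDV for the second join predicate
    printf "%.1f\n", r1 * r2 / (da * db)
  }'
}

# 2500 rows survive each filter; join columns are n_90/n_72 and n_600/n_750.
join_card 2500 2500 90 72 600 750    # prints 92.6
```

The result is in the same region as the two lower estimates (96 and 116) but matches none of the three plans exactly, so each version is clearly doing something more than this simple calculation.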

I’ll write up an explanation of how the optimizer has produced three different estimates some time over the next couple of weeks; but if you want an earlier answer this is one of the things I’ll be covering in my presentation on calculating selectivity at “Super Sunday” at UKOUG Tech 14.

UKOUG 2014

So it’s that time of year when I have to decide on my timetable for the UKOUG annual conference. Of course, I never manage to stick to it, but in principle here are the sessions I’ve highlighted:


Sunday

  • 12:30 – How to Avoid a Salted Banana – Lothar Flatz
  • 13:30 – Calculating Selectivity  – Me
  • 15:00 – Advanced Diagnostics Revisited – Julian Dyke
  • 16:00 – Testing Jumbo Frames for RAC – Neil Johnson


Monday

  • 9:00 – Oracle Indexes Q & A session – Richard Foote
  • 10:00 – How Oracle works in 50 minutes – Martin Widlake
  • 11:30 – Predictive Queries in 12c – Brendan Tierney
  • 14:30 – Oracle Database In-Memory DB and the Query Optimizer – Christian Antognini
  • 16:00 – Instrumenting, Analysing, & Tuning the Performance of Oracle ADF Applications – Frank Houweling
  • 17:00 – Techniques for Strategic Testing – Clive King


Tuesday

  • 9:30 – Top Five Things You Need To Know About Oracle Database In-Memory Option – Maria Colgan
  • 10:30 – How to Write Better PL/SQL – Andrew Clarke
  • 12:00 – Optimizer Round Table – Tony Hasler
  • 14:00 – What we had to Unlearn & Learn when Moving from M9000 to Super Cluster – Philippe Fierens
  • 15:00 – Maximum Availability Architecture: A Recipe for Disaster? – Julian Dyke
  • 16:30 – Chasing the Holy Grail of HA – Implementing Transaction Guard & Application Continuity in Oracle Database 12c – Mark Bobak
  • 17:30 – Five Hints for Efficient SQL – Me


Wednesday

  • 9:00 – Fundamentals of Troubleshooting (without graphics) pt.1 – Me
  • 10:00 – Fundamentals of Troubleshooting (without graphics) pt.2 – Me
  • 11:30 – Indexing in Exadata – Richard Foote



How do you Explain Oracle in 50 Minutes?

I’ve done a very “brave”* thing. I’ve put forward a talk to this year’s UKOUG Tech14 conference titled “How Oracle Works – in under 50 minutes”. Yes, I really was suggesting I could explain to people how the core of Oracle functions in that time. Not only that, but the talk is aimed at those new to Oracle technology. And it got accepted, so I have to present it. I can’t complain about that too much, I was on the paper selection committee…

* – “brave”, of course, means “stupid” in this context.

As a result I am now strapped to the chair in front of my desk, preparing an attempt to explain the overall structure of an Oracle instance, how data moves in and out of storage, how ACID works and a few other things. Writing this blog is just avoidance behaviour on my part as I delay going back to it.

Is it possible? I’m convinced it is.

If you ignore all the additional bits, the things that not all sites use, such as Partitioning, RAC, Resource Manager, Materialized Views etc, etc, etc, then that removes a lot. And if not everyone uses it, then it is not core.
There is no need or intention on my part to talk about details of the core – for example, how the Cost Based aspect of the optimizer works, Oracle permissions or the steps needed for instance recovery. We all use those but the details are ignored by some people for their whole career {not usually people who I would deem competent, despite them holding down jobs as Oracle technicians, but they do}.

You are left with a relatively small set of things going on. Don’t get me wrong, it is still a lot of stuff to talk about and is almost certainly too much for someone to fully take in and digest in the time I have. I’m going to have to present this material as if I am possessed. But my intention is to describe a whole picture that makes sense and will allow people to understand the flow. Then, when they see presentations on aspects of it later in the conference, there is more chance it will stick. I find I need to be taught something 3 or 4 times. The first time simply opens my mind to the general idea, the second time I retain some of the details and the third or fourth time I start integrating it into what I already knew.

My challenge is to say enough so that it makes sense and *no more*. I have developed a very bad habit of trying to cram too much into a presentation and of course this is a real danger here. I’m trying to make it all visual. There will be slides of text, but they are more for if you want to download the talk after the conference. However, drawing pictures takes much, much, much longer than banging down a half dozen bullet points.

One glimmer in the dark is that there is a coffee break after my session. I can go right up to the wire and then take questions after I officially stop, if I am not wrestled to the ground and thrown out the room.

If anyone has any suggestions or comments about what I should or should not include, I’d love to hear them.

This is all part of my intention to provide more conference content for those new to Oracle. As such, this “overview” talk is at the start of the first day of the main conference, 10am Monday. I have to thank my fellow content organisers for allowing me to stick it in where I wanted it. If you are coming to the conference and don’t know much Oracle yet – then I am amazed you read my blog (or any other blog other than maybe AskTom). But if you have colleagues or friends coming who are still relatively new to the tech, tell them to look out for my talk. I really hope it will help them get that initial understanding.

I had hoped to create a fully fledged thread of intro talks running through all of Monday and Tuesday, but I brought the idea up too late. We really needed to promote the idea at the call for papers and then maybe source a couple of talks. However, using the talks that were accepted we did manage to get a good stab at a flow of intro talks through Monday. I would suggest:

  • 08:50 – Welcome and Introduction
    • Get there in time for the intro if you can, as if you are newish to the tech you are probably newish to a conference
  • 09:00 – RMAN the basics, by Michael Abbey.
    • If you are a DBA type, backup/recovery is your number one concern.
  • 10:00 – How Oracle Works in 50 Minutes
    • I think I have said enough!
  • 11:30 – All about Joins by Tony Hasler
    • Top presenter, always good content
  • 12:30 – Lunch. Go and talk to people, lots of people, find some people you might like to talk with again. *don’t stalk anyone*
  • 13:20 – Go to the Oracle Keynote.
    • Personally, I hate whole-audience keynotes, I am sick of being told every year how “there has never been a better time to invest in oracle technology” – but this one is short and after it there is a panel discussion by technical experts.
  • 14:30 is a bit tricky. Tim Hall on Analytical Functions is maybe a bit advanced, but Tim is a brilliant teacher and it is an intro to the subject. Failing that, I’d suggest the Oracle Enterprise Manager round table hosted by Dev Nayak as Database-centric oracle people should know OEM.
  • 16:00 – Again a bit tricky for someone new but I’d plump for The role of Privileges and Roles in Oracle 12C by Carl Dudley. He lectures (lectured?) in database technology and knows his stuff, but this is a New Feature talk…
  • 17:00 – Tuning by Explain Plan by Arian Stijf
    • This is a step-by-step guide to understanding the most common tool used for performance tuning
  • 17:50 onwards – go to the exhibition drinks, the community drinks and just make friends. One of the best things to come out of conferences is meeting people and swapping stories.

I better get back to drawing pictures. Each one takes me a day and I need about 8 of them. Whoops!

Little things worth knowing: creating the Swingbench order entry schema “lights out”

This is a little note, primarily to myself I guess, about the creation of the order entry schema (part of Swingbench, written by Dominic Giles) when no VNC sessions are available (although you can almost always use port-forwarding :). Instead, you can create the schema on the command line. I always execute commands on remote systems in screen for increased peace of mind. Should the network drop, the order entry generation will continue as if nothing ever happened.

Like many others I use Swingbench during trainings and presentations to have some activity on a system. Very useful for demonstrating ASH and OEM, and many other things too!

This article references swingbench 2.5, the latest version at the time of writing. It requires a JDK to be present; I simply used the one bundled with Oracle which is my target database version. You first start by downloading Swingbench from Dominic’s website and unzipping it in a directory from where you want to run it. It is usually a good idea to keep the load generator on a different (set of) machine(s). As long as you can connect to the database you want to use to house the SOE schema you’re fine.

Creating the SOE schema

Before you can run swingbench (or minibench/charbench) you need to create the SOE schema. The oewizard is the tool of choice. If you haven’t done so, add java to your path. Then you can have a look at the available options:

$ JAVA_HOME=/u01/app/oracle/product/
$ export PATH=$PATH:$JAVA_HOME/bin

$ ./oewizard -h
usage: parameters:
 -allindexes             build all indexes for schema
 -bigfile                use big file tablespaces
 -c            wizard config file
 -cl                     run in character mode
 -compositepart          use a composite paritioning model if it exisits
 -compress               use default compression model if it exists
 -create                 create benchmarks schema
 -cs      connectring for database
 -dba          dba username for schema creation
 -dbap         password for schema creation
 -debug                  turn on debugging output
 -debugf                 turn on debugging output to file (debug.log)
 -df           datafile name used to create schema in
 -drop                   drop benchmarks schema
 -dt         driver type (oci|thin)
 -g                      run in graphical mode (default)
 -generate               generate data for benchmark if available
 -h,--help               print this message
 -hashpart               use hash paritioning model if it exists
 -hcccompress            use HCC compression if it exisits
 -nocompress             don't use any database compression
 -noindexes              don't build any indexes for schema
 -nopart                 don't use any database partitioning
 -normalfile             use normal file tablespaces
 -oltpcompress           use OLTP compression if it exisits
 -p            password for benchmark schema
 -part                   use default paritioning model if it exists
 -pkindexes              only create primary keys for schema
 -rangepart              use a range paritioning model if it exisits
 -s                      run in silent mode
 -scale           mulitiplier for default config
 -sp    the number of softparitions used. Defaults to cpu
 -tc       the number of threads(parallelism) used to
                         generate data. Defaults to cpus*2
 -ts         tablespace to create schema in
 -u            username for benchmark schema
 -v                      run in verbose mode when running from command
 -version       version of the benchmark to run

Note that not all partitioning schemes are valid in this case, and neither are all compression methods. I found the below command to create the schema the way I want it.

./oewizard -allindexes -cl -create -cs //cluster-scan/sbench -dba martin \
> -dbap notTellingYouHere -ts soe_tbs -part -p soe -scale 8 -tc 8 -u soe -v

SwingBench Wizard
Author  :        Dominic Giles
Version :

Running in Lights Out Mode using config file : oewizard.xml
Connecting to : jdbc:oracle:thin:@//cluster-scan/sbench
Starting script ../sql/soedgcreateuser.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 104 millisecond(s)
Starting run
Starting script ../sql/soedgdrop2.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 55 millisecond(s)
Starting script ../sql/soedgcreatetableshash2.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 520 millisecond(s)
Starting script ../sql/soedgviews.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 19 millisecond(s)
Starting script ../sql/soedgsqlset.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 84 millisecond(s)
Inserting data into table ADDRESSES_11499978
Inserting data into table ADDRESSES_3999993

-- more data

Completed processing table LOGON_7943271 in 0:00:25
Connection cache closed
Starting script ../sql/soedganalyzeschema2.sql
Script completed in 0 hour(s) 1 minute(s) 20 second(s) 701 millisecond(s)
Starting script ../sql/soedgconstraints2.sql
Script completed in 0 hour(s) 1 minute(s) 41 second(s) 520 millisecond(s)
Starting script ../sql/soedgindexes2.sql
Script completed in 0 hour(s) 0 minute(s) 44 second(s) 685 millisecond(s)
Starting script ../sql/soedgsequences2.sql
Script completed in 0 hour(s) 0 minute(s) 3 second(s) 53 millisecond(s)
Starting script ../sql/soedgpackage2_header.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 215 millisecond(s)
Starting script ../sql/soedgpackage2_body.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 127 millisecond(s)
Starting script ../sql/soedgsetupmetadata.sql
Script completed in 0 hour(s) 0 minute(s) 0 second(s) 614 millisecond(s)

|           Datagenerator Run Stats        |
Connection Time                        0:00:00.005
Data Generation Time                   0:10:38.185
DDL Creation Time                      0:03:51.630
Total Run Time                         0:14:29.823
Rows Inserted per sec                      151,713
Data Generated (MB) per sec                   12.1
Actual Rows Generated                   97,718,288

Connecting to : jdbc:oracle:thin:@//cluster-scan/sbench

Post Creation Validation Report
The creation of the schema appears to have been unsuccessful. See the following sections for further details.

Valid Objects
Valid Code :

Invalid Objects (1)
Invalid Tables :
Invalid Indexes :
Invalid Views :
Invalid Sequences :
Invalid Code : 'ORDERENTRY'

Missing Objects (0)
Missing Tables :
Missing Indexes : 
Missing Views :
Missing Sequences :
Missing Code :
Schema Created

To compile the package, SYS needs to grant execute on DBMS_LOCK to SOE.

This results in a schema with the following number of rows:

SQL> declare
  2   v_cnt number;
  3  begin
  4   for i in (select table_name from dba_tables where owner='SOE') loop
  5    execute immediate 'select count(*) from soe.' || i.table_name into v_cnt;
  6    dbms_output.put_line(i.table_name || ' - ' || to_char(v_cnt, '999,999,999'));
  7   end loop;
  8  end;
  9  /
ORDER_ITEMS -   34,316,243
ORDERS -   11,438,320
INVENTORIES -      896,853
LOGON -   19,063,872
CUSTOMERS -    8,000,000
ADDRESSES -   12,000,000
CARD_DETAILS -   12,000,000
WAREHOUSES -        1,000

PL/SQL procedure successfully completed.

Elapsed: 00:00:02.70

SQL> compute sum of m on report
SQL> break on report

SQL> select sum(bytes)/power(1024,2) m, segment_name, segment_type
  2   from dba_segments where owner = 'SOE'
  3  group by segment_name, segment_type
  4  order by 1;

              M SEGMENT_NAME                   SEGMENT_TYPE
--------------- ------------------------------ ------------------
            .06 WAREHOUSES                     TABLE
            .06 PRODUCT_INFORMATION_PK         INDEX
            .06 ORDERENTRY_METADATA            TABLE
            .13 WHS_LOCATION_IX                INDEX
            .13 PROD_CATEGORY_IX               INDEX
            .13 WAREHOUSES_PK                  INDEX
            .13 PROD_SUPPLIER_IX               INDEX
            .13 PROD_NAME_IX                   INDEX
            .13 PRD_DESC_PK                    INDEX
            .25 PRODUCT_INFORMATION            TABLE
            .31 PRODUCT_DESCRIPTIONS           TABLE
          16.69 INV_WAREHOUSE_IX               INDEX
          16.69 INV_PRODUCT_IX                 INDEX
          18.00 INVENTORY_PK                   INDEX
          88.00 INVENTORIES                    TABLE
         169.00 CUST_ACCOUNT_MANAGER_IX        INDEX
         177.06 CUSTOMERS_PK                   INDEX
         204.06 CUST_DOB_IX                    INDEX
         242.19 ORD_SALES_REP_IX               INDEX
         253.38 ORDER_PK                       INDEX
         254.31 CUST_FUNC_LOWER_NAME_IX        INDEX
         265.06 ORD_CUSTOMER_IX                INDEX
         265.69 ADDRESS_PK                     INDEX
         265.75 CARD_DETAILS_PK                INDEX
         276.88 CARDDETAILS_CUST_IX            INDEX
         278.13 ORD_WAREHOUSE_IX               INDEX
         278.50 ADDRESS_CUST_IX                INDEX
         291.88 ORD_ORDER_DATE_IX              INDEX
         347.81 CUST_EMAIL_IX                  INDEX
         717.88 ITEM_PRODUCT_IX                INDEX
         788.94 ITEM_ORDER_IX                  INDEX
         864.00 LOGON                          TABLE PARTITION
         968.00 CARD_DETAILS                   TABLE PARTITION
       1,352.00 CUSTOMERS                      TABLE PARTITION
       1,408.00 ADDRESSES                      TABLE PARTITION
       1,544.00 ORDERS                         TABLE PARTITION
       2,368.00 ORDER_ITEMS                    TABLE PARTITION
37 rows selected.

Happy benchmarking!

truncated ddl in 12c (and

Just curious if anyone else is seeing this behaviour..

SQL> set long 500000
SQL> @pt "select * from v$sql where sql_id = '1km492z723vpu'"
SQL_TEXT                      : alter table scott.emp
SQL_FULLTEXT                  : alter table scott.emp
SQL_ID                        : 1km492z723vpu

Suffice to say, there’s a lot more to that SQL statement, but it’s lost in v$sql. Once complete, it does appear in full in v$sqlstats, but it’s a pain when you can’t see the whole statement.

Physical IO on Linux

I posted a fair amount of stuff on how Oracle is generating IOs, and especially large IOs, meaning more than one Oracle block, so > 8KB. This is typically what is happening when the Oracle database is executing a row source which does a full segment scan. Let’s start off with a quiz: what do you think is the maximum IO size the Oracle engine is capable of requesting from the operating system (that is, the IO size as seen at the SCI (system call interface) layer)? If you have made up your answer, remember it, and read on!

The real intention of this blogpost is to describe what is going on in the Oracle database kernel, but also what is being done in the Linux kernel. Being a performance specialised Oracle DBA means you have to understand what the operating system does. I often see that it’s of the utmost importance to understand how an IO ends up as a request at the NAS or SAN head, so you understand what a storage admin is talking about.

Many people (including myself in the past) would state that the maximum IO size on Linux is 1MB. For the Linux 2.6 kernel and higher this statement is incorrect because there is no such thing as a single maximum IO size on Linux. There used to be one in the Linux 2.4 era, which was set with the maxphys parameter, but that time is long gone. In order to find out what it is now, let’s test and see!

First let’s get a Linux system and a big table!
The system I got is a VMWare Fusion VM, running Linux 3.8.13-44.1.5.el6uek.x86_64 (UEK3) on Oracle Linux 6u6. On top of that I am using the Oracle database and grid infrastructure version This system is using udev for providing disk access (as opposed to asmlib)
The redundancy mode of ASM is external, although for reading (what I will be covering) this doesn’t matter.

Now let’s get a normal database session, and use a combination of sql_trace with waits (10046/8) and strace to see how the Oracle database interfaces with the kernel. Please mind I’ve prepared a heap table with no indexes on it, so a count(*) on it always will result in a full table scan. Also, the buffercache is sized small enough (or the table is created large enough, it depends on how you look at it) to have the session make the decision to do a direct path read, instead of a buffered read. If you don’t know what that means: please search this blog on direct path reads, or even better, download my presentation ‘about multiblock reads’.

The direct path read decision is visible via the ‘direct path read’ wait event. If you get a full table scan operation and see ‘db file scattered read’ waits, you are doing a buffered scan.
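To tell the two cases apart in a SQL trace file, it is enough to count the wait lines for each event. The sketch below assumes only that wait lines carry the event name as nam='...', as in the 10046 trace output; the function name is hypothetical, and the sample line is the wait line this session writes for its direct path read.

```shell
# count_scan_waits FILE
# Counts 'direct path read' and 'db file scattered read' wait lines in a
# SQL trace file. The function name is made up for this illustration.
count_scan_waits() {
  # grep -c exits non-zero on a zero count, hence the "|| true"
  direct=$(grep -c "nam='direct path read'" "$1" || true)
  buffered=$(grep -c "nam='db file scattered read'" "$1" || true)
  echo "direct path read: $direct, db file scattered read: $buffered"
}

# Demo on a temporary file holding one wait line in trace format.
sample=$(mktemp)
printf '%s\n' "WAIT #139717129509840: nam='direct path read' ela= 390730 file number=4 first dba=741376 block cnt=4096 obj#=20474 tim=19200710199" > "$sample"
count_scan_waits "$sample"    # prints: direct path read: 1, db file scattered read: 0
rm -f "$sample"
```

Because the pattern includes the closing quote, other events whose names merely start the same way (for example 'direct path read temp') are not counted.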

In most cases, you will get a maximum value of 1MB if possible, which seems to support the generally assumed 1MB maximum operating system induced IO size. Why? Well, because you probably set the DB_FILE_MULTIBLOCK_READ_COUNT parameter to 128, which means you have explicitly set the Oracle process not to do IO with a size more than 1MB (8192*128). In this blogpost, I explain that Oracle can request IOs bigger than 1MB.

In the blogpost series on extra huge database IOs, I show that Oracle can do huge (1MB+) IOs, but the physical request size (what Oracle actually requests at the SCI layer, visible with the pread/pwrite/io_submit/io_getevents functions) still is 1MB. This limit is imposed by the physical storage structure which the database uses with ASM, called allocation unit (often called ‘AU’). The default size of an allocation unit is 1MB. The allocation unit can be seen in both the database and the ASM instance with the following query:

SYS@+ASM AS SYSASM> select name, allocation_unit_size from v$asm_diskgroup;

------------------------------ --------------------
DATA					    1048576

How about doing an unbuffered read on a filesystem? I’ve created a database on an (XFS, but this doesn’t matter AFAIK) filesystem, and tried to set DB_FILE_MULTIBLOCK_READ_COUNT to its maximum value. I’ve done this by setting DB_FILE_MULTIBLOCK_READ_COUNT to 10000 (ten thousand), and then bouncing the database to see what the number has become. In my case, the value became 4096. I think this is the limit for Oracle on Linux x86_64, but I would love to hear if you have gotten different results:

I set 10000:

SYS@fv12102 AS SYSDBA> select name, value from v$spparameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   10000

But Oracle limits this to 4096:

SYS@fv12102 AS SYSDBA> select name, value from v$parameter where name like 'db_file_multiblock%';

NAME						   VALUE
-------------------------------------------------- ----------------------------------------------------------------------
db_file_multiblock_read_count			   4096

Okay. Let’s start our investigation at that point: a database which is set up with a DB_FILE_MULTIBLOCK_READ_COUNT set to 4096, alias 32MB (with a block size of 8KB), and a table which got extents large enough to accommodate huge (32MB) IOs.
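The relation between DB_FILE_MULTIBLOCK_READ_COUNT and the largest multiblock IO the database will ask for is a simple multiplication by the block size. A small shell sketch of that arithmetic, using a hypothetical helper name and the 8KB block size mentioned above:

```shell
# max_io_bytes MBRC BLOCK_SIZE
# Largest multiblock IO request in bytes: read count times block size.
# The helper name is made up for this illustration.
max_io_bytes() {
  echo $(( $1 * $2 ))
}

max_io_bytes 128 8192     # prints 1048576  (1MB, the commonly seen limit)
max_io_bytes 4096 8192    # prints 33554432 (32MB, the value tested here)
```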

Fire up a regular database session, and enable sql trace at level 8:

$ sqlplus ts/ts@//localhost/v11204
SQL> alter session set events 'sql_trace level 8';

Now start another session as root on the database server, and find the PID of the server process of the sqlplus process we just created above. Issue strace with verbose writing setting:

# strace -e write=all -e all -p PID
Process PID attached - interrupt to quit

Okay, we are set up and ready to go, but there is one additional thing: because of the way direct path reads work, they would probably show few waits on a system with fast IO capabilities. One way to get the waits back is to limit the IO capabilities of the process. Doing so is documented in this article.

Now issue the full table scan on a large table in sqlplus while strace is attached:

SQL> select count(*) from bigtab;

Now take a peek at the strace output!
The output first shows IOs as we expect:

io_getevents(139717184229376, 1, 128, {{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1
times(NULL)                             = 431386800
write(7, "\n*** 2014-11-24 13:09:28.028\n", 29) = 29
 | 00000  0a 2a 2a 2a 20 32 30 31  34 2d 31 31 2d 32 34 20  .*** 201 4-11-24  |
 | 00010  31 33 3a 30 39 3a 32 38  2e 30 32 38 0a           13:09:28 .028.    |
lseek(7, 0, SEEK_CUR)                   = 31181
write(7, "WAIT #139717129509840: nam='dire"..., 130) = 130
 | 00000  57 41 49 54 20 23 31 33  39 37 31 37 31 32 39 35  WAIT #13 97171295 |
 | 00010  30 39 38 34 30 3a 20 6e  61 6d 3d 27 64 69 72 65  09840: n am='dire |
 | 00020  63 74 20 70 61 74 68 20  72 65 61 64 27 20 65 6c  ct path  read' el |
 | 00030  61 3d 20 33 39 30 37 33  30 20 66 69 6c 65 20 6e  a= 39073 0 file n |
 | 00040  75 6d 62 65 72 3d 34 20  66 69 72 73 74 20 64 62  umber=4  first db |
 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |
 | 00070  37 34 20 74 69 6d 3d 31  39 32 30 30 37 31 30 31  74 tim=1 92007101 |
 | 00080  39 39                                             99                |

What is visible here, is first the reap of an I/O request (with asynchronous IO on Linux this is typically the io_getevents() call). If you take a close look at the arguments of the io_getevents() call (taken from the manpage of io_getevents):

int io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);

And then focus on the struct io_event:

struct io_event {
         __u64           data;           /* the data field from the iocb */
         __u64           obj;            /* what iocb this event came from */
         __s64           res;            /* result code for this event */
         __s64           res2;           /* secondary result */
};

The above description is taken from the annotated Linux kernel source, as available here: I use this site for navigating the Linux kernel source. What is important is that the third field (io_event.res) holds the result of the IO request, which for a successful read is the number of bytes transferred. Having learned this, look again at the io_getevents call. The size of the IO reaped above is 33554432 bytes, which is 33554432/1024/1024 = 32MB. Yes, that’s a single IO of 32MB! This is also consistent with the wait line a little lower:

 | 00050  61 3d 37 34 31 33 37 36  20 62 6c 6f 63 6b 20 63  a=741376  block c |
 | 00060  6e 74 3d 34 30 39 36 20  6f 62 6a 23 3d 32 30 34  nt=4096  obj#=204 |

Block count = 4096 * 8192 (block size) = 33554432
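
The same check can be done with a few lines of Python. This is only a sketch: the regular expression is an assumption about the shape of the strace line shown above (one io_event printed as {data, obj, res, res2}), and the 8KB block size is the one mentioned in the text.

```python
import re

# The io_getevents line exactly as strace printed it above.
STRACE_LINE = (
    "io_getevents(139717184229376, 1, 128, "
    "{{0x7f126dd3d780, 0x7f126dd3d780, 33554432, 0}}, {600, 0}) = 1"
)

# Hypothetical pattern: a single io_event printed as {data, obj, res, res2}.
IO_EVENT = re.compile(r"\{\{(0x[0-9a-f]+), (0x[0-9a-f]+), (-?\d+), (-?\d+)\}\}")


def reaped_bytes(line):
    """Return io_event.res (the third field) from a strace io_getevents line."""
    match = IO_EVENT.search(line)
    if match is None:
        raise ValueError("no io_event found in line")
    return int(match.group(3))


BLOCK_SIZE = 8192   # database block size in bytes (8KB, from the text)
BLOCK_COUNT = 4096  # "block cnt" from the WAIT line

io_size = reaped_bytes(STRACE_LINE)
# Prints the size in bytes, the size in MB, and whether it matches the wait line.
print(io_size, io_size // (1024 * 1024), io_size == BLOCK_COUNT * BLOCK_SIZE)
```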

So, whatever you thought was possible: the correct answer on my operating system (Linux x86_64) with Oracle is 32MB. It turned out the big IOs in the ASM case were limited by the allocation unit size of 1MB.

The next thing I wondered about is how this matches the maximum IO size of the disk devices as visible to the operating system. You can request 32MB, but a normal SCSI disk doesn’t do 32MB IOs. Of course, in my case the SCSI disk really is a VMware virtual disk device.

Let’s keep the 32MB IO in mind, now dive from the top layer, the SCI (system call interface) where an IO enters the kernel to the bottom of the kernel from an IO perspective, to the block device. The block device settings are found in /sys/block/DEVICE/queue. The maximum IO size the device is capable of is found in max_hw_sectors_kb. This is in kilobytes, and read only (can’t change hardware, right?). In my case this is:

[root@bigmachine queue]# cat max_hw_sectors_kb

My disk supports a maximum of 4M for an IO size! But this is not what is used, the actual setting is in max_sectors_kb:

[root@bigmachine queue]# cat max_sectors_kb

That’s half a megabyte!
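
If you want to compare both limits from a script, a minimal Python sketch could look like the following. The function name and the sysfs_root parameter are hypothetical (the parameter only exists so the function can be pointed at a test directory); the two file names are the ones shown above.

```python
from pathlib import Path


def read_queue_limits(device, sysfs_root="/sys/block"):
    """Return (max_hw_sectors_kb, max_sectors_kb) in KB for a block device.

    device is the kernel name of the disk, for example "sdb".
    """
    queue = Path(sysfs_root) / device / "queue"
    # Hardware limit: read only, the largest IO the device advertises.
    hw_limit_kb = int((queue / "max_hw_sectors_kb").read_text().strip())
    # Actual limit used by the block layer, never larger than the hardware limit.
    soft_limit_kb = int((queue / "max_sectors_kb").read_text().strip())
    return hw_limit_kb, soft_limit_kb

# Example call on a real system: read_queue_limits("sdb")
```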

So…we got (up to) 32MB sized IO requests coming in, and a device that is set to 512KB IOs. This means that somewhere between the SCI and the device, there is a mechanism that scatters the request into pieces of at most the device’s maximum IO size and, once those pieces are done, gathers the results back into the original request.
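
To get a feeling for the numbers, here is a small Python sketch of such a scatter step. It is purely illustrative and not how the kernel implements it; split_io is a hypothetical helper, and the sizes are the 32MB request and the 512KB limit from the text.

```python
def split_io(offset, length, max_io_bytes):
    """Split one request into (offset, length) pieces of at most max_io_bytes."""
    if max_io_bytes <= 0:
        raise ValueError("max_io_bytes must be positive")
    pieces = []
    end = offset + length
    while offset < end:
        size = min(max_io_bytes, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


REQUEST_BYTES = 32 * 1024 * 1024  # one 32MB IO as submitted by the database
MAX_SECTORS_KB = 512              # device queue limit, in KB

pieces = split_io(0, REQUEST_BYTES, MAX_SECTORS_KB * 1024)
print(len(pieces))  # 64 device-sized pieces for a single 32MB request
```

So a single 32MB request has to be cut into 64 pieces before the device can handle it, and all 64 have to complete before the original request can be reported as done.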

There are a couple of layers in the Linux kernel through which the call travels (including common functions):

-SCI/system call interface: system_call, sys_io_submit…. (io_submit, do_io_submit, io_submit_one; these seem to be in the VFS layer)
-VFS/virtual filesystem: aio_run_iocb, do_aio_read, xfs_file_read_iter, generic_file_read_iter, xfs_vm_direct_IO, bio_*, kiocb_batch_refill
-Block layer: blk_finish_plug, blk_flush_plug_list, queue_unplugged, __blk_run_queue, blk_run_queue
-SCSI layer: scsi_*
-Device driver: mptspi_qcmd, mptscsih_qcmd, mpt_put_msg_frame

(Note: there seems to be consensus that the above mentioned layers exist, although different sources use different wording and different numbers of layers. Also, there doesn’t seem to be a very clear description of what is done by which layer, and what typically defines a kernel layer. For some functions it is clear they belong to a certain layer (for example aio_run_iocb in Linux/fs/aio.c, blk_finish_plug in Linux/block/blk-core.c, etc.); for some layers, like the SCI layer, there doesn’t seem to be a clear layer definition when looking at where the functions are defined. Also please mind the SCSI layer is implemented as a driver, just like the actual device driver for the hardware. This is very understandable, but makes it a bit harder to see it in a layered way.)

System Call Interface (SCI)
The request enters kernel space via the SCI. The function of the SCI is to elevate a process to system priority to perform a kernel mode task, like (but not limited to) doing I/O. The system call implementation on Linux makes use of a wrapper function in glibc, which executes the system call on behalf of the user’s request. The reason for mentioning this is that sometimes the glibc wrapper “hides” the real system call, for example when calling the semtimedop() function:

(gdb) break semtimedop
Breakpoint 1 at 0x3bb38eb090: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) c

Breakpoint 1, semtimedop () at ../sysdeps/unix/syscall-template.S:82

Above is a gdb (GNU debugger) session which attaches to an Oracle background process, which I know is sleeping in the system call semtimedop() when idle. A breakpoint is set on the semtimedop function, and the execution of the attached process is resumed. It then breaks on the function, showing the source code at which the break happened. Instead of showing the actual semtimedop function, it shows the pseudo function in glibc which wraps this system call. This hides the arguments of calling the semtimedop() function. My current workaround is to read the CPU registers which “carry” the arguments (RDI, RSI, RDX, RCX, R8, R9 for the first 6 arguments in most cases).
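
As a small aid for that workaround, the Python sketch below pairs each semtimedop() argument with the register that should hold it at the breakpoint, and with the gdb command to print it. The helper name is hypothetical; the parameter names are taken from the semtimedop manual page, and the register order is the one listed above.

```python
# Argument registers in order, as listed in the text above.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

# Parameter names of semtimedop(), from its manual page.
SEMTIMEDOP_ARGS = ("semid", "sops", "nsops", "timeout")


def gdb_print_commands(arg_names):
    """Return (argument name, gdb command) pairs for up to six arguments."""
    if len(arg_names) > len(ARG_REGISTERS):
        raise ValueError("only the first six arguments are passed in registers")
    return [(name, "print $" + reg) for name, reg in zip(arg_names, ARG_REGISTERS)]


for name, command in gdb_print_commands(SEMTIMEDOP_ARGS):
    print(name, "->", command)
```

Typing the generated commands at the gdb prompt while stopped at the breakpoint shows the raw argument values; pointer arguments such as sops and timeout still need to be dereferenced by hand.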

Virtual File System (VFS)
The next layer is the virtual filesystem. Here we see functions specific to asynchronous or synchronous IO, and to doing direct IO or not, and also the actual filesystem specific functions (in my case XFS; when ext4 is used, you will see specific functions for that. I highly recommend XFS!). This layer also uses a structure called ‘request_queue’, which keeps track of the actual IO requests for a block device. Each individual request is a struct ‘request’, which contains one or more structs called ‘bio’ that describe the request, and each ‘bio’ points to a structure called ‘bio_vec’, which points to the pages that store the disk request contents. This is all set up and created in kernel memory by the user process in system mode. It’s my assumption that the properties of the disk device (= maximum advertised IO size) are taken into account when the VFS filesystem implementation creates requests and all necessary structs and memory areas. Please mind it’s important that enough memory is available to set up the necessary structures, and enough CPU to make this happen. Also, some of the crucial structures for doing IO (request, bio, bio_vec) seem to be set up in this layer. An IO can’t be done without a memory area for the IO request to hold the data for sending it to the device (alias a write), or a memory area for the IO request to hold the data which is fetched from the device (alias a read).

The funny thing is that when you use ASM (the simple version 11.2 ASM with a local ASM instance and local disk devices), you will still see some functions of the VFS layer, because you use a disk device which is opened using the local filesystem. Examples of these functions are: aio_run_iocb, do_aio_read.

Block Layer
The next layer is the block layer. Here the request queue is handled, and I/O scheduling is done. Oracle advises the deadline scheduler in all cases. The scheduler works by plugging a request queue, much like a plug in your bathtub, letting the requests enter the queue. Having multiple requests in a queue means it can be optimised by reordering the requests, and merging adjacent requests up to the device’s advertised maximum IO size. Once a request’s timeout expires, or the requesting process finishes submitting IO, the queue is unplugged.

SCSI layer
The SCSI layer is responsible for communicating with SCSI devices to do IOs.

Device driver
The device driver layer is the layer that truly physically communicates with a device, and implements the device specific communication. In my case the functions start with mpt, which is the driver for LSI PCI adapters.

To see how IO flows through the block layer, there is a tool called blktrace. Actually this is a mini-suite of tools consisting of blktrace (tracing the IO requests through the block layer), blkparse (parsing the output of blktrace to make it human readable), btrace (a script to combine blktrace and blkparse), and btt (a blktrace output post-processing tool), among others.

In order to use blktrace, the debug file system of the Linux kernel needs to be mounted. Here is how that is done:

# mount -t debugfs debugfs /sys/kernel/debug

If the kernel debugfs is not mounted, you get the following message:

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
Invalid debug path /sys/kernel/debug: 0/Success

I use blktrace in this article for looking at the IO requests to understand what is going on. The workflow for this use of blktrace is:
– create a trace file of the block flow using blktrace
– make the trace file human readable via blkparse or analyse via btt (block trace times)

Actually, you can parse the output of blktrace directly via blkparse using ‘blktrace -d DEVICE -o - | blkparse -i -’. To make that even simpler, the script ‘btrace’ exists to do exactly that.

Here’s what that looks like (depending on the number of processes using the device, the output can be huge; this is only a snippet):

[root@bigmachine ~]# btrace /dev/oracleasm/disk1
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]

What is shown here, is the typical flow of an IO in the block layer:
Q – Queue. A request starts off sending a notification on the intent to queue at the given location.
G – Get request. A struct request is allocated.
P – Plug. When the block device queue is empty, the queue is plugged in order to receive further IOs and have the ability to optimise (merge and/or reorder) them before the data is sent to the device.
I – Insert. A request is sent to the IO scheduler for addition to the internal queue and later service by the driver. The request is fully allocated at this time.
U – Unplug. The start of sending requests to the driver.
D – Driver. A request has been sent to the driver and removed from the queue.
C – Complete. A previously issued request to the driver has been completed.
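
When there is a lot of output, it helps to take the lines apart programmatically. The Python sketch below parses the seven sample lines shown above into fields. The field names and the Event type are my own labels, and the parser assumes the default output layout as it appears in this snippet, nothing more.

```python
from collections import namedtuple

# Field names are illustrative labels for the columns in the output above.
Event = namedtuple("Event", "device cpu seq time pid action rwbs sector nblocks")

SAMPLE = """\
  8,16   0       57     0.260669503  2421  Q  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       58     0.260672502  2421  G  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       59     0.260673231  2421  P   N [asm_gmon_+asm]
  8,16   0       60     0.260674895  2421  I  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       61     0.260675745  2421  U   N [asm_gmon_+asm] 1
  8,16   0       62     0.260677119  2421  D  WS 4088 + 8 [asm_gmon_+asm]
  8,16   0       63     0.260882884     0  C  WS 4088 + 8 [0]
"""


def parse_line(line):
    """Parse one output line; sector and nblocks stay None for P and U events."""
    f = line.split()
    sector = nblocks = None
    if len(f) >= 10 and f[8] == "+":
        sector, nblocks = int(f[7]), int(f[9])
    # The timestamp is kept as a string to avoid floating point rounding.
    return Event(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5], f[6],
                 sector, nblocks)


events = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
actions = "".join(event.action for event in events)
print(actions)  # the order in which the IO passed through the block layer
```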

The main point here is that you can truly see how the IO requests flow through the block layer and are issued to the storage device. In other words, you can see how the block layer receives the IOs, and exactly what is submitted to the driver as a request for the physical storage layer.

This is a microscopic view of the disk IOs. In most cases, when you want to gain information on block layer IO processing, a summarised view is more useful; that is provided by processing blktrace output with btt. Here is an example of using btt.

First capture IO events using blktrace:

[root@bigmachine ~]# blktrace -w 60 -d /dev/oracleasm/disk1 -o - | blkparse -d sdb.blkparse -i -

In this example I captured IOs for 60 seconds. You can leave out ‘-w 60’, and press interrupt (ctrl-c) when you deem enough IO has been recorded. This produces a binary file ‘sdb.blkparse’, which can be used with btt:

This is the first part, the flow through the block layer until IO completion:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.239795347   3.002829973         238
Q2G               0.000000001   0.159337842   3.011192142         264
G2I               0.000000679   0.000001724   0.000011618         264
I2D               0.000000764   0.000007633   0.000153436         264
D2C               0.000000001   0.103328167   3.012509148         233
Q2C               0.000000001   0.270961298   3.012516496         233

Note: times are in seconds.
Q2Q – Time between IO requests.
Q2G – Time it takes for a request struct to be allocated.
G2I – Time it takes for the request to be inserted in the device’s queue.
I2D – Time spent in the device queue waiting to be issued to the driver.
D2C – Time spent between issuing to the driver and completion of the request. This includes controller, storage. This is the same figure as the ‘svctm’ column with iostat -x.
Q2C – Total time spent in block layer and physical IO. This is the same figure as the ‘await’ column with iostat -x.
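
To make these phases concrete, the Python sketch below computes them by hand for the single IO in the btrace snippet shown earlier (the Q, G, I, D and C timestamps, in seconds, are copied from that output). It is a worked example of the definitions above, not a reimplementation of btt.

```python
from decimal import Decimal

# Timestamps (seconds) of one IO, copied from the btrace output shown earlier.
STAMPS = {
    "Q": Decimal("0.260669503"),
    "G": Decimal("0.260672502"),
    "I": Decimal("0.260674895"),
    "D": Decimal("0.260677119"),
    "C": Decimal("0.260882884"),
}

# Each phase name is "<start event>2<end event>".
PHASES = ("Q2G", "G2I", "I2D", "D2C", "Q2C")

phases = {name: STAMPS[name[2]] - STAMPS[name[0]] for name in PHASES}

for name in PHASES:
    print(name, phases[name])

# The four consecutive phases add up to the total time in the block layer.
total = phases["Q2G"] + phases["G2I"] + phases["I2D"] + phases["D2C"]
print(total == phases["Q2C"])
```

For this IO nearly all of the time is D2C, the time between issuing to the driver and completion; the block layer phases before that take only a few microseconds each.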

The second part is the device overhead section:

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%
---------- | --------- --------- --------- --------- ---------
   Overall |  66.6284%   0.0007%   0.0000%   0.0032%  38.1339%

This is partly the same as the IO flow table above, but expressed as a percentage of where the total time of the IO is spent.
Q2G – Request struct allocation.
G2I – Insertion in the device queue.
Q2M – Total time until merge.
I2D – Time spent in the queue until it was dispatched to the driver.
D2C – Time spent on doing the IO after submitting the request to the driver.


Line Numbers

One of the presentations I went to at the DOAG conference earlier this month was called “PL/SQL Tuning, finding the perf. bottleneck with hierarchical profiler” by Radu Parvu from Finland. If you do a lot of PL/SQL programming and haven’t noticed the dbms_hprof package yet, make sure you take a good look at it.

A peripheral question that came up at the end of the session asked about problems with line numbers in pl/sql procedures; why, when you get a run-time error, does the reported line number sometimes look wrong, and how do you find the right line. I can answer (or give at least one reason for) the first part, but not the second part; Julian Dontcheff had an answer for the second bit, but unfortunately I failed to take a note of it.

Here’s the SQL to create, run and list a very simple (and silly) procedure.

define m_value = 3
set timing on

create or replace procedure silly
as
        m_n number;
begin
        for i in 1..1000000 loop
                m_n := exp(sqrt(ln(&m_value)));
        end loop;
end;
/

execute silly

list

Here’s the output I got from running this, and the thing I want you to note is the time it takes to run, and the line number reported for the assignment:

Procedure created.

Elapsed: 00:00:00.01

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.00
  1  create or replace procedure silly
  2  as
  3          m_n number;
  4  begin
  5          for i in 1..1000000 loop
  6                  m_n := exp(sqrt(ln(&m_value)));
  7          end loop;
  8* end;

It seems to take my instance 0.00 seconds to perform 1,000,000 evaluations of the exponent of the square root of the natural logarithm of 3 at line 6 of the code. But let’s make two changes; first, let’s try that with the value -1 (which is a little careless if you know your logarithms).

Procedure created.

Elapsed: 00:00:00.02
BEGIN silly; END;

ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "TEST_USER.SILLY", line 5
ORA-06512: at line 1

Notice that the 6502 error is reported at line 5, not line 6.

Now let’s keep the value -1, but start the script with the command: alter session set plsql_optimize_level = 0; (the default level is 2, the range is 0 to 3):

Session altered.

Elapsed: 00:00:00.00

Procedure created.

Elapsed: 00:00:00.01
BEGIN silly; END;

ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "TEST_USER.SILLY", line 6
ORA-06512: at line 1

Reducing the pl/sql optimisation level to zero (or one) results in the error message reporting the problem at line 6 – which matches our original code. So let’s check the effect of running the code at level zero with a valid number for the input.

Session altered.

Elapsed: 00:00:00.00

Procedure created.

Elapsed: 00:00:00.02

PL/SQL procedure successfully completed.

Elapsed: 00:00:24.56

The run time for our pl/sql call has gone up from 0.00 seconds to 24.56 seconds.

Like all good compilers the pl/sql compiler had recognised (at level 2) that we were assigning a constant inside a loop, so it had (effectively) rewritten our code to move the assignment outside the loop, effectively swapping lines 6 and 5, storing the change in the “object” level code, but not in the database copy of the source. The benefit we get (from the default settings) is a saving of 25 seconds of CPU, the (small) penalty we pay is that the lines reported for run-time errors aren’t always going to identify the text that caused the problem.
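
The effect of that rewrite can be imitated by hand. The Python sketch below is only an analogy (Python does not do this optimisation itself, and it says nothing about what the pl/sql compiler actually generates): the first function follows the source as written, the second shows the same work with the constant expression moved in front of the loop.

```python
import math


def silly_as_written(value, iterations=1000):
    """Evaluate the expression on every pass, as the source code reads."""
    m_n = None
    for _ in range(iterations):
        m_n = math.exp(math.sqrt(math.log(value)))  # the "line 6" assignment
    return m_n


def silly_hoisted(value, iterations=1000):
    """Evaluate the expression once, before the loop starts."""
    constant = math.exp(math.sqrt(math.log(value)))  # moved in front of the loop
    m_n = None
    for _ in range(iterations):
        m_n = constant
    return m_n


print(silly_as_written(3) == silly_hoisted(3))
```

Both versions return the same value for 3, and both fail for -1, but in the hoisted version the failure happens on a line that sits in front of the loop body, which is the same kind of shift as the line 5 versus line 6 difference above.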

HOWTO: Create a Structured XMLIndex, using the In-Memory Column Store

In Oracle database version, Oracle introduced the Oracle In-Memory Database option. It is possible…