Oakies Blog Aggregator

Free Book: Introducing Microsoft SQL Server 2012

Michael McLaughlin‘s blog has a post about a free MS SQL Server book (Introducing Microsoft SQL Server 2012) available here. If you are into MS SQL Server, you can’t get a better deal than that. :)

Remember, you can still get hold of a free copy of Tom Kyte‘s book (Expert Oracle Database Architecture, 2nd Edition) from Red Gate.

Cheers

Tim…


Performance testing with Virident PCIe SCM

Thanks to the kind introduction from Kevin Closson, I was given the opportunity to benchmark the Virident PCIe flash cards. I have written a little review of the testing conducted, mainly using SLOB. To my great surprise, Virident gave me access to a top-of-the-line 2s12c24t Westmere-EP system with lots of memory.

In summary, the testing shows that the “flash revolution” is happening, and that there are lots of vendors out there building solutions for HPC and Oracle database workloads alike. Have a look at the attached PDF for the full story if you are interested. When looking at the numbers, please bear in mind it was a two-socket system! I’m confident the server could not max out the cards.

Full article:

Virident testing martin bach consulting

Subquery Factoring

I have a small collection of postings where I’ve described anomalies or limitations in subquery factoring (the “with subquery”, or Common Table Expression (CTE) to give it its official ANSI name). Here’s another example of Oracle’s code not behaving consistently. You may recognise the basic query from yesterday’s example of logical tuning – so I won’t reprint the code to generate the data sets. The examples in this note were created on 11.2.0.2 – we start with a simple query and its execution plan:

select
	*
from
	t2
where
	not exists (
		select	null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2
	)
and	not exists (
		select	null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2 + 1000
	)
;

-------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost  |
-------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |    50 |  7200 |    16 |
|   1 |  NESTED LOOPS ANTI  |       |    50 |  7200 |    16 |
|   2 |   NESTED LOOPS ANTI |       |  4999 |   644K|    16 |
|   3 |    TABLE ACCESS FULL| T2    |  5000 |   585K|    16 |
|*  4 |    INDEX UNIQUE SCAN| T3_PK |     1 |    12 |       |
|*  5 |   INDEX UNIQUE SCAN | T3_PK |  5000 | 60000 |       |
-------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("N3A"="N2A" AND "N3B"="N2B" AND "ID3"="ID2"+1000)
   5 - access("N3A"="N2A" AND "N3B"="N2B" AND "ID3"="ID2")

Now what happens if I take this piece of SQL (which is really the core of a much more complex query), put it into a CTE, and reference it later? In one of the queries in the client’s application I wanted to materialize a piece of code like this because the result set was used multiple times in the body of the query. You’ll notice that I got two anti-joins when my test case ran as a “naked” SQL statement, but look what happens when I try to use the CTE mechanism in the simplest possible way:

with	cte as (
select
	/*+ materialize */
	*
from
	t2
where
	not exists (
		select
			null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2
	)
and	not exists (
		select
			null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2 + 1000
	)
)
select
	*
from
	cte
;

------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                        | Rows  | Bytes | Cost  |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                             |  5000 |   468K|  5029 |
|   1 |  TEMP TABLE TRANSFORMATION |                             |       |       |       |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D6641_40101879 |       |       |       |
|*  3 |    FILTER                  |                             |       |       |       |
|   4 |     TABLE ACCESS FULL      | T2                          |  5000 |   585K|    15 |
|*  5 |     INDEX UNIQUE SCAN      | T3_PK                       |     1 |    12 |     1 |
|*  6 |     INDEX UNIQUE SCAN      | T3_PK                       |     1 |    12 |     1 |
|   7 |   VIEW                     |                             |  5000 |   468K|    14 |
|   8 |    TABLE ACCESS FULL       | SYS_TEMP_0FD9D6641_40101879 |  5000 |   585K|    14 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter( NOT EXISTS (SELECT 0 FROM "T3" "T3" WHERE "ID3"=:B1 AND "N3B"=:B2
              AND "N3A"=:B3) AND  NOT EXISTS (SELECT 0 FROM "T3" "T3" WHERE "ID3"=:B4+1000 AND
              "N3B"=:B5 AND "N3A"=:B6))
   5 - access("N3A"=:B1 AND "N3B"=:B2 AND "ID3"=:B3)
   6 - access("N3A"=:B1 AND "N3B"=:B2 AND "ID3"=:B3+1000)

When I try to materialize the view, Oracle reverts to a pair of filter subqueries. If I take away the /*+ materialize */ hint the CTE goes inline and the query optimizes with anti-joins – just as it did in the original form – so the behaviour isn’t as inconsistent as some of the earlier cases I’ve documented. I can get the behaviour I want by adding /*+ unnest */ hints to the two subqueries (see the sketch below), so the problem isn’t a show-stopper, but it’s just a little irritating to have to do this.
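
For reference, this is roughly what the hinted version looks like (a sketch only – it simply repeats the materialized CTE with an /*+ unnest */ hint added to each of the two subqueries):

with	cte as (
select
	/*+ materialize */
	*
from
	t2
where
	not exists (
		select	/*+ unnest */
			null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2
	)
and	not exists (
		select	/*+ unnest */
			null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2 + 1000
	)
)
select
	*
from
	cte
;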

Here’s what I found in the 10053 trace file with the attempt to materialize:

SU:   Checking validity of unnesting subquery SEL$2 (#0)
SU:     SU bypassed: invalidated.
SU:   Validity checks failed.
SU:   Checking validity of unnesting subquery SEL$3 (#0)
SU:     SU bypassed: invalidated.
SU:   Validity checks failed.

That’s just a little strange, considering that the equivalent section of the trace for the version where the CTE goes inline reads as follows:

SU:   Checking validity of unnesting subquery SEL$2 (#3)
SU:   Passed validity checks.
SU:   Unnesting subquery query block SEL$2 (#3)
SU: Transform ALL/NOTEXISTS subquery into a regular antijoin.
SU:   Checking validity of unnesting subquery SEL$3 (#2)
SU:   Passed validity checks.
SU:   Unnesting subquery query block SEL$3 (#2)
SU: Transform ALL/NOTEXISTS subquery into a regular antijoin.

Thank goodness we can still hint – and then generate SQL Baselines.
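
For completeness, capturing the hinted plan as a baseline might look something like this (a minimal sketch using the 11g DBMS_SPM package; the sql_id and plan_hash_value placeholders are hypothetical values you would look up from the cursor cache after running the hinted statement):

declare
	l_loaded	pls_integer;
begin
	-- load the hinted plan from the cursor cache into a SQL plan baseline
	l_loaded := dbms_spm.load_plans_from_cursor_cache(
		sql_id          => '&hinted_sql_id',		-- hypothetical placeholder
		plan_hash_value => &hinted_plan_hash_value	-- hypothetical placeholder
	);
	dbms_output.put_line('plans loaded: ' || l_loaded);
end;
/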

 

When is a foreign key not a foreign key...

I learn or relearn something new every day about Oracle.  Just about every day really!

Last week I was in Belgrade, Serbia, delivering a seminar, and an attendee reminded me of something I knew once but had totally forgotten about.  It had to do with foreign keys and the dreaded NULL value.

Many of you might think the following is not possible. We'll start with the tables:

ops$tkyte%ORA11GR2> create table p
  2  ( x int,
  3    y int,
  4    z int,
  5    constraint p_pk primary key(x,y)
  6  )
  7  /
Table created.


ops$tkyte%ORA11GR2> create table c
  2  ( x int,
  3    y int,
  4    z int,
  5    constraint c_fk_p foreign key (x,y) references p(x,y)
  6  )
  7  /
Table created.

Looks like a normal parent-child relationship - a row may exist in C if and only if a parent row exists in P.  If that is true - then how can this happen:
ops$tkyte%ORA11GR2> select count( x||y ) from p;

COUNT(X||Y)
-----------
          0

ops$tkyte%ORA11GR2> select count( x||y ) from c;

COUNT(X||Y)
-----------
          1
There are zero records in P - none.  There is at least one record in C and that record has a non-null foreign key.  What is happening?
It has to do with NULLs and foreign keys and the default "MATCH NONE" rule in place.  If your foreign key allows NULLs and your foreign key is a composite key - then you must be careful of the condition whereby only SOME of the foreign key attributes are not null.  For example - to achieve the above magic, I inserted:
ops$tkyte%ORA11GR2> insert into c values ( 1, null, 0 );
1 row created.
The database cannot validate a foreign key when it is partially null.  In order to enforce the "MATCH FULL" option of a foreign key, you would want to add a constraint to your table:
ops$tkyte%ORA11GR2> alter table c add constraint check_nullness
  2  check ( ( x is not null and y is not null ) or
  3          ( x is null and y is null ) )
  4  /
Table altered.
That will ensure either:
  • All of the columns are NULL in the foreign key
  • None of the columns are NULL in the foreign key
As long as that constraint is in place - your foreign key will work as you probably think it should work.
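
As a quick sanity check (a sketch - ORA-02290 is the generic "check constraint violated" error you would expect here), the insert that caused the problem should now be rejected:

-- with check_nullness in place, a partially null key should
-- now fail with ORA-02290: check constraint violated
insert into c values ( 1, null, 0 );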

Logical tuning

Here’s a model of a problem I solved quite recently at a client site. The client’s query was much more complex and the volume of data much larger, but this tiny, two table, example is sufficient to demonstrate the key principle. (Originally I thought I’d have to use three tables to model the problem, which is why you may find my choice of table names a little odd). I ran this example on 11.2.0.2 – which was the client version:

create table t2
as
select
	rownum					id2,
	trunc(dbms_random.value(0,1000))	n2a,
	trunc(dbms_random.value(0,1000))	n2b,
	lpad(rownum,6,'0')			vc2,
	lpad('x',100,'x')			padding
from
	all_objects
where
	rownum <= 5000;

alter table t2 add constraint t2_pk primary key(id2, n2a);

create table t3
as
select
	rownum					id3,
	trunc(dbms_random.value(0,1000))	n3a,
	trunc(dbms_random.value(0,1000))	n3b,
	lpad(rownum,6,'0')			vc3,
	lpad('x',100,'x')			padding
from
	all_objects
where
	rownum <= 5000;

alter table t3 add constraint t3_pk primary key(n3a, n3b, id3);

-- now collect stats on the table and execute this query (with autotrace enabled)

select
	*
from
	t2
where
	not exists (
		select	/*+ unnest */
			null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	(id3 = id2 or id3 = id2 + 1000)
	)
;

select * from table(dbms_xplan.display);

You’ll note that I’ve included the /*+ unnest */ hint in the subquery because I want Oracle to run this as a hash anti-join; and it would appear to be legal (and simple) to do this given the various not null constraints and primary key information. In fact the plan uses a filter subquery:

------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes | Cost  |
------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |  5000 |   585K|  5015 |
|*  1 |  FILTER            |       |       |       |       |
|   2 |   TABLE ACCESS FULL| T2    |  5000 |   585K|    15 |
|*  3 |   INDEX RANGE SCAN | T3_PK |     1 |    12 |     2 |
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( NOT EXISTS (SELECT /*+ UNNEST */ 0 FROM "T3" "T3" WHERE
              "N3B"=:B1 AND "N3A"=:B2 AND ("ID3"=:B3 OR "ID3"=:B4+1000)))
   3 - access("N3A"=:B1 AND "N3B"=:B2)
       filter("ID3"=:B1 OR "ID3"=:B2+1000)

I suspect that the optimizer code bypasses the anti-join because of the (carefully bracketed) disjunctive (OR) predicate. On the client site this resulted in the subquery being executed 9 million times, reducing an intermediate data set from 9M rows to 2M rows at a cost of 27 million buffer visits and about 60 CPU seconds. Fortunately I was able to dredge up a little bit of propositional calculus and quote the following equivalence (one of De Morgan’s laws):

        not( A or B ) <=> (not A and not B)

Equally fortunately I didn’t have to worry about three-valued logic (all relevant columns were declared not null), so I was able to rewrite the query in the form:

select
	*
from
	t2
where
	not exists (
		select	null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2
	)
and	not exists (
		select	null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 = id2 + 1000
	)
;

With this code Oracle did two unnests and converted to hash anti-joins in both cases (at least, that’s what happened on the client site – my small sample switched to nested loop anti-joins):

-------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost  |
-------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |    50 |  7200 |    15 |
|   1 |  NESTED LOOPS ANTI  |       |    50 |  7200 |    15 |
|   2 |   NESTED LOOPS ANTI |       |  4999 |   644K|    15 |
|   3 |    TABLE ACCESS FULL| T2    |  5000 |   585K|    15 |
|*  4 |    INDEX UNIQUE SCAN| T3_PK |     1 |    12 |       |
|*  5 |   INDEX UNIQUE SCAN | T3_PK |  5000 | 60000 |       |
-------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("N3A"="N2A" AND "N3B"="N2B" AND "ID3"="ID2"+1000)
   5 - access("N3A"="N2A" AND "N3B"="N2B" AND "ID3"="ID2")

The change to hash anti-joins was a huge benefit (the nested loop anti-join would have improved things for the client to a degree, but in some cases there’s not really an enormous difference between a filter subquery and an equivalent nested loop anti/semi-join). In this case the query run time dropped from 95 seconds to 27 seconds – all of it CPU time.

Update: a check of the 10053 trace file for 10.2.0.3 shows the following:

SU:   Checking validity of unnesting subquery SEL$2 (#2)
SU:     SU bypassed: Invalid correlated predicates.
SU:   Validity checks failed.

On the other hand, it is possible to get unnesting and a hash anti-join with the predicate: id3 between id2 and id2 + 1000. (But see comment 6 and its reply)
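
For reference, that version of the subquery would look like this (a sketch only – note that the between predicate is not logically equivalent to the original pair of equality checks; it merely demonstrates that a single range predicate can still unnest):

select
	*
from
	t2
where
	not exists (
		select	null
		from 	t3
		where	n3a = n2a
		and	n3b = n2b
		and	id3 between id2 and id2 + 1000
	)
;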

Linux : HandBrake – Streaming DVDs from my NAS…

When my TV broke a few months ago I made the decision not to replace it. That means I only get to watch DVDs on the computer or stuff streamed on the web (BBC iPlayer, ITV Player or 4OD) using my iPad. I’m pretty happy with the situation as it prevents me wasting too much time in front of the TV. My only issue was being tied to the computer for DVDs. Yesterday I entered the 21st century and started streaming DVDs to my iPad.

A little Googling revealed HandBrake is about as simple as it gets, where DVD video transcoders on Linux are concerned. With that installed I saved a copy of a DVD (Alien) into the “movies” folder on my NAS, which is pre-configured for streaming videos. That’s nice and simple.

The next step was to get an iPad app to allow me to watch the streamed content. There are a bunch of them around, but I ended up with PlugPlayer. It automatically detected my NAS and worked immediately.

So now I can watch DVDs streamed to my iPad. All I need is a 60TB hard disk so I can store all my DVDs. :)

Cheers

Tim…

 


60TB Disk Drives?

I was reading a story where Seagate were talking about 60TB disk drives. That’s all well and good, but how quickly can I get data to and from them? If I need a certain number of spindles to get the performance I require, then I’m just going to end up with masses of wasted capacity.

I can picture the scene now. I have a database of “x” terabytes in size and I need “y” number of spindles to get the performance I require, so I end up having to buy disks amounting to “z” petabytes of space to meet my performance needs. Not only is it hard to justify, but you know the “spare” capacity will get used to store stuff that’s got nothing to do with my database.
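
To put some purely illustrative numbers on that (my own assumptions, not figures from the story): suppose a 10TB database needs 20,000 IOPS and each spindle delivers roughly 200 IOPS. That’s 20,000 / 200 = 100 spindles, and at 60TB per drive you’ve bought 6PB of raw capacity – 600 times the size of the database.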

Just think of those 60TB bad-boys in a RAID5 configuration. Shudder. :)

Feel free to insert a, “SSD/Flash will solve the world’s storage problems”, comment of your choice here. :)

Cheers

Tim….



VirtualBox 4.1.16 Released…

VirtualBox 4.1.16 has been released. You can download it from the usual place and see the fixes in the changelog.

I’ve downloaded the Fedora 17 build, ready for next week’s final release of Fedora 17… :)

Cheers

Tim…



All about RAC and MTU with a video

Let’s first discuss how RAC traffic works before continuing. The environment for the discussion is a 2-node cluster with an 8K database block size, using the UDP protocol for cache fusion. (BTW, the UDP and RDS protocols are supported on UNIX platforms, whereas Windows uses the TCP protocol.)

UDP protocol, fragmentation, and assembly

UDP is a higher-level protocol implemented on top of the IP protocol (UDP/IP). Cache Fusion uses the UDP protocol to send packets over the wire (Exadata uses the RDS protocol though).

MTU defines the Maximum Transfer Unit of an IP packet. Let us consider an example with the MTU set to 1500 on a network interface. One 8K block transfer cannot be performed with just one IP packet, as the IP packet size (1500 bytes) is less than 8K. So, one transfer of an 8K UDP buffer is fragmented into 6 IP packets and sent over the wire. On the receiving side, those 6 packets are reassembled to create one UDP buffer of size 8K. After the assembly, that UDP buffer is delivered to a UDP port of a UNIX process. Usually, a foreground process will be listening on that port to receive the UDP buffer.
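
As rough arithmetic (my own back-of-the-envelope numbers, assuming the standard 20-byte IP header and 8-byte UDP header, and ignoring any Oracle-level message header): an 8192-byte block plus the UDP header gives about 8200 bytes of IP payload; each 1500-byte frame carries at most 1500 - 20 = 1480 payload bytes; and 8200 / 1480 rounds up to 6 fragments (five full 1480-byte fragments plus a final fragment of roughly 800 bytes).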

Consider what happens if the MTU is set to 9000 on the network interface: an 8K buffer can then be transmitted over the wire with just one IP packet. There is no need for fragmentation or reassembly with MTU=9000 as long as the block (plus headers) fits within the 9000-byte frame. MTU=9000 is also known as a jumbo frame configuration. (But if the database block size is greater than the jumbo frame size, fragmentation and reassembly are still required. For example, for a 32KB block size with MTU=9000, there will be three 9K IP packets and one 5K IP packet to transmit.)

Fragmentation and reassembly are performed at the OS kernel layer, and hence it is the responsibility of the kernel and the stack below it to complete the fragmentation and assembly. The Oracle code simply calls the send and receive system calls, passing the buffers to populate.

A few LMS system calls on the Solaris platform:

0.6178  0.0001 sendmsg(30, 0xFFFFFFFF7FFF7060, 32768)          = 8328
0.6183  0.0004 sendmsg(30, 0xFFFFFFFF7FFFABE0, 32768)          = 8328
0.6187  0.0001 sendmsg(36, 0xFFFFFFFF7FFFBA10, 32768)          = 144
...
0.7241  0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768)          = 192
0.7243  0.0001 recvmsg(27, 0xFFFFFFFF7FFF9A10, 32768)          = 192

UDP vs TCP

If you talk to a network admin about the use of UDP for cache fusion, there will usually be a few eyebrows raised. From a RAC point of view, UDP is the right choice over TCP for cache fusion traffic. With TCP/IP, every packet transfer has overhead: a connection needs to be set up, the packet sent, and the process must wait for a TCP acknowledgement before considering the packet send complete. In a busy RAC system, we are talking about 2-3 milliseconds for a packet transfer, and with TCP/IP we probably could not achieve that level of performance. With UDP, a packet transfer is considered complete as soon as the packet is sent, and error handling is done by the Oracle code itself. As you know, a reliable network is a key to RAC stability; if nearly all packets (close to 100%) are delivered without drops, UDP is a good choice over TCP/IP for performance reasons.

If there are reassembly failures, then that is a function of an unreliable network, the kernel, or something else, but it has nothing to do with the choice of the UDP protocol itself. Of course, RDS is better than UDP as the error handling is offloaded to the fabric, but it usually requires an InfiniBand fabric for a proper RDS setup. For that matter, VPN connections use the UDP protocol too.

IP identification?

In a busy system, there will be thousands of IP packets travelling through the interface in any given second. So, obviously, there will be many IP packets from different UDP buffers received by the interface. Also, because these ethernet frames can be delivered in any order, how does the kernel know how to assemble them properly? More critically, how does the kernel know that the 6 IP packets from one UDP buffer belong together, and the order of those IP packets?

Each of these IP packets has an IP identification and a fragment offset. Review the wireshark files uploaded in this blog entry and you will see that all 6 IP packets have the same IP identification. That ID and the fragment offset are used by the kernel to assemble the IP packets together to create the UDP buffer.

Identification: 0x533e (21310)
..
Fragment offset: 0

Reassembly failures

What happens if an IP packet is lost, assuming MTU=1500 bytes?

From the wireshark files with mtu1500, you will see that each of the packets has a fragment offset. That fragment offset and the IP identification are used to reassemble the IP packets to create the 8K UDP buffer. Consider 6 puzzle pieces, each piece with markings; the kernel uses those markings (offset and IP ID) to reassemble the packets. Now consider the case where one of the 6 packets never arrives: the kernel threads will keep those 5 IP packets in memory for 64 seconds (the Linux kernel parameter ipfrag_time controls that time) before declaring a reassembly failure. Without receiving the missing IP packet, the kernel cannot reassemble the UDP buffer, and so a reassembly failure is declared.

The Oracle foreground process will wait for 30 seconds (it used to be 300 seconds or so in older versions of RAC) and, if the packet has not arrived within that timeout period, the FG process will declare a ‘gc lost packet’ and re-request the block. Of course, the kernel memory allocated for IP fragmentation and assembly is constrained by the kernel parameters ipfrag_high_thresh and ipfrag_low_thresh, and lower values for these kernel parameters can lead to reassembly failures too (and that’s why it is important to follow all best practices from the RAC installation guides).

BTW, there are a few other reasons for ‘gc lost packets’ too. High CPU usage can also lead to ‘gc lost packet’ failures: the process may not have enough CPU time to drain the buffers, the network buffers allocated for that process become full, and so the kernel will drop incoming packets.

It is probably better to explain these concepts visually, so I created a video. When you watch this video, notice that there is an HD button at the top of the video. Play it in HD mode so that you will have a better learning experience.

You can get the presentation file from the video here: MTU

Wireshark files explained in the video can be reviewed here:
wireshark_1500mtu
wireshark_9000mtu

BTW, when you review the video, you will see that I had a little bit of trouble identifying the packet in the wireshark output initially. I understood the reason for not seeing the packets filled with DEADBEEF characters. Why do you think I didn’t see the packets initially?

Also, it looks like the video quality is not that great when embedded. If you want the actual mp4 files, email me and maybe I can upload them to a drop box for you to download.

MTU Video

Nested Loop Join Costing

The basic formula for calculating the cost of a Nested Loop Join is pretty straightforward and has been described and published several times.

In principle it is the cost of acquiring the driving row source plus the cost of acquiring the inner row source of the Nested Loop as many times as the cardinality of the driving row source dictates.

Cost (outer rowsource) + Cost (inner rowsource) * Card (outer rowsource)

Obviously there are cases where Oracle has introduced refinements to the above formula, so that it no longer holds. Here is one of those cases, and it is probably not uncommon.

Let's start with a simple two-table join that shows the above formula in action. It represents a parent-child relationship where the parent table has 10,000 rows with a unique identifier, and the child table has 100 rows for each parent row, 1,000,000 rows in total.

set echo on

create table t
as
select
        rownum as id
      , rownum as attr1
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 10000
;

exec dbms_stats.gather_table_stats(null, 't')

create table t2
as
select
        rownum as id
      , mod(rownum, 10000) + 1 as fk
      , mod(rownum, 20) + 1 as attr1
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't2')

create index t2_idx on t2 (fk);

explain plan for
select  /*+ use_nl(t t2) leading(t) index(t2) */
        *
from
        t
      , t2
where
        t.attr1 <= 500
and     t2.fk = t.id;

set pagesize 0 linesize 200 tab off

select * from table(dbms_xplan.display);

which gives the following execution plan in 11.2:

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        | 50005 |    10M| 51087   (1)| 00:10:14 |
|   1 |  NESTED LOOPS                |        |       |       |            |          |
|   2 |   NESTED LOOPS               |        | 50005 |    10M| 51087   (1)| 00:10:14 |
|*  3 |    TABLE ACCESS FULL         | T      |   500 | 54500 |    45   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | T2_IDX |   100 |       |     2   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2     |   100 | 11300 |   102   (0)| 00:00:02 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T"."ATTR1"<=500)
   4 - access("T2"."FK"="T"."ID")

There are a couple of noteworthy comments:

1. T2.FK and T.ID have the same number of distinct keys each (10,000), so assuming corresponding primary and foreign key constraints in place this means that there is at least one matching row for every parent row.

2. There is a filter on T that restricts the driving row source to 500 rows. It is interesting to note that this filter results in a "biased join" where Oracle picks the NUM_DISTINCT from the other table for the join selectivity calculation rather than using the greatest NUM_DISTINCT of both, in this case arriving at a correct cardinality estimate of approx. 50,000

3. The T2_IDX index has a worst-case clustering factor due to the scattering of T2.FK via the MOD function, hence the resulting cost of a single iteration of the Nested Loop join is 100 for picking up 100 rows from the table (plus 2 for the index access)

The overall cost for the loop is 45 for acquiring the driving row source plus 500 times 102, the remaining part being CPU costing overhead. This matches above formula.
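
Spelled out: 45 + 500 * 102 = 51,045, and the plan shows a total cost of 51087, the difference of 42 being the CPU costing overhead.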

Now let's re-create table T2 using a different distribution of the T2.FK column:

drop table t2;

purge table t2;

create table t2
as
select
        rownum as id
      , mod(rownum, 500) + 1 as fk
      , mod(rownum, 20) + 1 as attr1
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't2')

create index t2_idx on t2 (fk);

The only difference now is that the FK column no longer has 10,000 distinct keys but only 500.

Of course, on average 2,000 rows in T2 now match a parent row in T, but obviously not every parent row has a match in T2 any more.
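
(The arithmetic: 1,000,000 rows divided by 500 distinct FK values gives 2,000 rows per value, and only 500 of the 10,000 parent ids can find a match at all.)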

Let's review the execution plan for the same SQL statement as above:

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |   998K|   211M| 52609   (1)| 00:10:32 |
|   1 |  NESTED LOOPS                |        |       |       |            |          |
|   2 |   NESTED LOOPS               |        |   998K|   211M| 52609   (1)| 00:10:32 |
|*  3 |    TABLE ACCESS FULL         | T      |   500 | 54500 |    45   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | T2_IDX |  2000 |       |     5   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2     |  1996 |   220K|  2007   (1)| 00:00:25 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T"."ATTR1"<=500)
   4 - access("T2"."FK"="T"."ID")

Since now 2,000 rows match a single parent key on average, the cost per iteration has increased accordingly. The index range scan needs to visit some root / branch blocks plus four leaf blocks on average, resulting in a cost of 5. The table access needs to visit 2,000 rows on average, hence with the same worst-case clustering factor the cost per iteration is 2,000 plus CPU overhead.

Given the fact that we need to do this 500 times, the final cost of the Nested Loop join ought to be something close to one million, but surprisingly it is only a little bit higher than the cost of the previous scenario.
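
A back-of-the-envelope check against the original formula: 45 + 500 * 2007 = 1,003,545, which is almost exactly the 1003K cost that shows up below once the refinement is disabled via fix control; yet the plan above reports only 52609.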

So clearly the original formula doesn't apply here, although the cost for a single iteration in the execution plan seems to match the expectations.

It looks like Oracle, a long way back, introduced a refinement to the original formula for the case where the join column of the inner row source has fewer distinct keys than the join column of the driving row source.

The idea behind it seems to be that this is what Oracle calls a "sparse" join: obviously not every row from the driving row source will find a match in the inner row source, hence some of the loop iterations should end with the index lookup not finding a match, with no need to visit the inner table afterwards. The refinement hence calculates a lower average cost of the table visit per loop iteration.

This is of course true when looking at the join by itself. But if you are unlucky and a corresponding filter on the driving row source turns the "sparse" join into a (rather) "dense" join, then this refinement can lead to a significant cost underestimate of a Nested Loop join, potentially driving the optimizer towards favouring that join method over others.

And it is exactly that scenario that the above query simulates: the filter on T results in a full match for all 500 driving rows, and the actual cost of the Nested Loop join ought to be closer to one million than 50,000 in this case.

The refinement to the cost calculation seems to be based on the ratio between the NUM_DISTINCT of the two join columns: in my example the ratio is 10,000:500, so the overall cost is only approx. 1/20 of the original cost.

There are some further details about how the formula deals with a small number of loop iterations. For example, the first iteration gets the full original cost, a number of further iterations (again apparently corresponding to the ratio, here for example 20) get only the index cost added, and from then on the costing of the table access levels off at the original cost downscaled by the ratio (2,000 / 20, which is 100 in this case).
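
Putting those pieces together for this example gives a rough reconstruction (my own back-of-the-envelope interpretation of the behaviour described above, not an official formula): 45 for the driving full table scan, plus 2007 for the first iteration, plus 20 * 5 for the index-only iterations, plus the remaining 479 iterations at roughly 5 + 100 each, i.e. 45 + 2007 + 100 + 479 * 105 = 52,447 - in the right ballpark of the 52609 shown in the plan, the remainder presumably being CPU costing overhead and rounding in the details.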

The refinement apparently was added in release 10.1.0.3 and can be found in the Fix Control list as bug number 3120429. The text for this fix is "account for join key sparsity in computing NL index access cost", and it apparently only applies if the inner row source uses an index access.

This also means that the original costing can be activated by disabling the fix:

explain plan for
select  /*+ opt_param('_fix_control' '3120429:0') use_nl(t t2) leading(t) index(t2) */
        *
from
        t
      , t2
where
        t.attr1 <= 500
and     t2.fk = t.id;

set pagesize 0 linesize 200 tab off

select * from table(dbms_xplan.display);

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |   998K|   211M|  1003K  (1)| 03:20:41 |
|   1 |  NESTED LOOPS                |        |       |       |            |          |
|   2 |   NESTED LOOPS               |        |   998K|   211M|  1003K  (1)| 03:20:41 |
|*  3 |    TABLE ACCESS FULL         | T      |   500 | 54500 |    45   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | T2_IDX |  2000 |       |     5   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| T2     |  1996 |   220K|  2007   (1)| 00:00:25 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T"."ATTR1"<=500)
   4 - access("T2"."FK"="T"."ID")

So if you happen to have such cases, where the filter on a driving row source leads to an underestimate of the Nested Loop join cost as outlined here, using the fix control allows you to arrive at more reasonable costing figures.

I recently had such a case at a client where only a small part of a fully populated time dimension was used in the foreign key of a fact table, but the filter on the time dimension led to exactly the scenario described here: the join wasn't really sparse and the Nested Loop join cost was significantly underestimated.