Oakies Blog Aggregator

Tips on Submitting an Abstract to Conference

Related posts: “The Tech15 committee and my role” and “Who plans the Tech15 content”.

The call for papers for UKOUG Tech15 has gone out. This is how most of the content for a large conference is sourced: by asking the community to submit abstracts for consideration. With smaller conferences and user group meetings the organisers can often get enough talks by hassling (sorry, asking) people they know.


Firstly, I would encourage anyone who has considered talking at conference but never has, to submit an abstract. We could easily fill the whole event with known speakers but we don’t: we have a policy of having some New Blood at every conference. {If you are in the UK and want to try out presenting then a great way to do so is at a smaller user group meeting, for example the next RAC/Database UKOUG SIG meeting on July 1st :-) – it’s a friendly, relaxed way to get into presenting. Get in touch with me if it appeals to you}.

You can click on this link to go to the submission page, but before you do…

When you submit an abstract for a conference, you are not actually at that point communicating with your potential audience. You are communicating with the handful of people who are tasked with going through all the submissions and selecting the papers. With the UKOUG conference you are also communicating with the volunteers who will judge abstracts. And we, the agenda planning committee, take those judging scores very seriously. It is a large part of how we attempt to ensure we select the talks on topics that people want to hear about, as well as the people who you want to hear talk.

So when you get to the field where you describe your proposed presentation (the abstract), I would suggest you don’t want to be “teasing the audience” at this point. The people who are judging and selecting the papers are seasoned conference attendees. A catchy title might get you noticed, but if the abstract does not clearly state what your talk is about, what you intend to cover and who you expect your audience to be, it is less likely that your abstract will get selected.

Also, if you have looked at the call-for-papers page and associated notes and have seen that we are particularly interested in some area (eg “what you need to know about ….” for the database stream) and your paper addresses it, it is worth making that a little bit obvious. The agenda planning day is hectic, we get tired and tetchy and our brains start to leak out of our ears. If your abstract is clear about what you are talking about, you are increasing your chances of selection.

In years gone by we gave people the option of providing two versions of their abstract – one for judging and one for promoting the talk (the one that gets put in the conference notes, which your potential audience will read to decide if your talk is worth their attention). However, many people felt this was asking for the same information twice, so we have reverted to a single abstract this year. You can, however, change your abstract text after your talk has been accepted {but note, we are wise to people trying to change the actual content of the talk later on – we LOOK at the changes!}. So sell your talk to the committee and judges now and worry about the catchy reference to your favourite movie afterwards.

I used to make my submission abstract humorous (well, in my eyes) but I don’t anymore, or at least I tone it down. If anything, I make the abstract factual and simple. As an example:


How Oracle Works in 50 Minutes
—————————————–
This is a high level but technical talk about the key processes that underlie the Oracle database. I describe what an instance is and how the server process is the gateway to the instance. I cover how the redo mechanism works and why it is critical, how data is moved into and out of memory, delayed block cleanout, what a commit IS, the importance of the undo tablespace and how a read consistent image is maintained for the user. The intended audience is new DBAs or developers who have never been taught how the database works and at the end of the talk they will understand how these key processes work.

OK, the description is a bit boring but you know exactly what my talk is going to be about and if it will fit in your conference.

So what happens when you click on the above link to submit an abstract? You will see the below front screen:

Submission Screen

I would suggest you not only read this screen but also check out the menu on the left of the screen. Look at the “Hints & Tips” and also the stream you are intending to submit to (eg “Systems” if you want to present on Exadata). If you are unsure which area your talk fits in, check them all out.

The big red “Submit an Abstract” button will actually take you to the same place that the left-menu “Speaker Application” takes you to. The first step of submitting an abstract is actually saying who you are and registering on the system. If you are willing to judge abstracts (ie you ticked that box) you will then get to indicate which topics in which streams you are willing to judge. THEN you will be put into the “Speaker Lounge” and you can enter your abstract by clicking the “Submit” button.

When you come back to the system you can go straight to the Speaker Lounge; the system will show you your details again so you can correct anything. You will see what abstract(s) you have submitted and can click on them to check or change anything, or click on “Submit” to add another abstract.

Think carefully before you submit 15 abstracts. As a general rule, more than 3 and you start to reduce your chances of having a paper selected. The people judging your papers will score you down if you submit too many – it’s as if you dilute your judging scores across all the abstracts.

Enjoy.

Little things worth knowing: what does opatchauto actually do?

I recently applied system patch 20132450 to my 12.1.0.2.0 installation on a 2 node RAC system on Oracle Linux 7.1. While ensuring that OPatch is the latest version available I came across an interesting command line option in opatchauto. It is called “-generateSteps”.

[oracle@rac12sby1 ~]$ $ORACLE_HOME/OPatch/opatchauto apply -help
OPatch Automation Tool
Copyright (c) 2015, Oracle Corporation.  All rights reserved.


DESCRIPTION
    Apply a System Patch to Oracle Home. User specified the patch
    location or the current directory will be taken as the patch location.
    opatchauto must run from the GI Home as root user.

SYNTAX
/OPatch/opatchauto apply
                            [-analyze]
                            [-database  ]
                            [-generateSteps]
                            [-invPtrLoc  ]
                            [-jre  ] 
                            [-norestart ]
                            [-nonrolling ]
                            [-ocmrf  ]
                            [-oh  ]
                            [  ]

OPTIONS


       -analyze
              This option runs all the required prerequisite checks to confirm
              the patchability of the system without actually patching or 
              affecting the system in any way.

       -database
              Used to specify the RDBMS home(s) to be patched. Option value 
              is oracle database name separated by comma.

       -generateSteps
              Generate the manual steps of apply session. These steps are what
              'opatchauto apply' does actually.

       -invPtrLoc
              Used to locate the oraInst.loc file when the installation used 
              the -invPtrLoc. This should be the path to the oraInst.loc file.

       -jre
              This option uses JRE (java) from the
              specified location instead of the default location
              under Oracle Home.

       -norestart
              This option tells opatchauto not to restart the Grid Infrastructure
              stack and database home resources after patching.

       -nonrolling
              This option makes the patching session run in 'nonrolling' mode.
              It is required that the stack on the local node is running while
              it must be stopped on all the remaining nodes before the patching
              session starts.

       -ocmrf 
              This option specifies the absolute path to the OCM response file. 
              It is required if the target Oracle Home doesn't have OCM 
              installed and configured.

       -oh
              This option specifies Oracle Home(s) to be patched. This can be a
              RAC home, a Grid Home or comma separated list of multiple homes.


PARAMETERS
      Patch Location
                    Path to the location for the patch. If the patch 
                    location is not specified, then the current
                    directory is taken as the patch location.


Example:
      To patch GI home and all RAC homes: 
      '/OPatch/opatchauto apply'

      To patch multiple homes:
      '/OPatch/opatchauto apply -oh ,,'

      To patch databases running from RAC homes only:
      '/OPatch/opatchauto apply -database db1,db2...dbn'

      To patch software-only installation:
      '/OPatch/opatchauto apply -oh ' OR
      '/OPatch/opatchauto apply -oh '


opatchauto succeeded.
[oracle@rac12sby1 ~]$ 

This could be quite interesting, so I wanted to see what it does with the system patch. I don’t have a database installed in my cluster yet; it’s going to be the standby site for another cluster (rac12pri). Unfortunately opatchauto still skips RDBMS homes without a database in them in my tests, so I ended up using the “-oh” flag in this case. Here’s the result, formatted for better readability. In reality each call to opatchauto is on a single line:

[root@rac12sby1 ~]# opatchauto apply /u01/patches/20132450 \
-oh /u01/app/12.1.0.2/grid -generateSteps -ocmrf /u01/patches/ocm.rsp
OPatch Automation Tool
Copyright (c) 2015, Oracle Corporation.  All rights reserved.

OPatchauto version : 12.1.0.1.5
OUI version        : 12.1.0.2.0
Running from       : /u01/app/12.1.0.2/grid

Invoking opatchauto utility "generateapplysteps"

To apply the patch, Please do the following manual actions:

Step 1 As the "oracle" user on the host "rac12sby1", please run the following commands:
  Action 1.1 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch version -oh /u01/app/12.1.0.2/grid -invPtrLoc \
/u01/app/12.1.0.2/grid/oraInst.loc -v2c 12.1.0.1.5

Step 2 As the "oracle" user on the host "rac12sby1", please run the following commands:
  Action 2.1 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory -invPtrLoc \
/u01/app/12.1.0.2/grid/oraInst.loc -oh /u01/app/12.1.0.2/grid

Step 3 As the "oracle" user on the host "rac12sby1", please run the following commands:
  Action 3.1 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckComponents \
-ph /u01/patches/20132450/19769473 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

  Action 3.2 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckComponents \
-ph /u01/patches/20132450/19769479 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

  Action 3.3 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckComponents \
-ph /u01/patches/20132450/19769480 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

  Action 3.4 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckComponents \
-ph /u01/patches/20132450/19872484 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

Step 4 As the "oracle" user on the host "rac12sby1", please run the following commands:
  Action 4.1 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckConflictAgainstOH \
-ph /u01/patches/20132450/19769473 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

  Action 4.2 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckConflictAgainstOH \
-ph /u01/patches/20132450/19769479 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

  Action 4.3 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckConflictAgainstOH \
-ph /u01/patches/20132450/19769480 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

  Action 4.4 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch prereq CheckConflictAgainstOH \
-ph /u01/patches/20132450/19872484 -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc \
-oh /u01/app/12.1.0.2/grid

Step 5 As the "root" user on the host "rac12sby1", please run the following commands:
  Action 5.1 [root@rac12sby1]#
  /u01/app/12.1.0.2/grid/perl/bin/perl -I/u01/app/12.1.0.2/grid/perl/lib \
-I/u01/app/12.1.0.2/grid/crs/install /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -prepatch

Step 6 As the "oracle" user on the host "rac12sby1", please run the following commands:
  Action 6.1 [oracle@rac12sby1]$
  echo /u01/patches/20132450/19769473 > /tmp/OraGI12Home1_patchList

  Action 6.2 [oracle@rac12sby1]$
  echo /u01/patches/20132450/19769479 >> /tmp/OraGI12Home1_patchList

  Action 6.3 [oracle@rac12sby1]$
  echo /u01/patches/20132450/19769480 >> /tmp/OraGI12Home1_patchList

  Action 6.4 [oracle@rac12sby1]$
  echo /u01/patches/20132450/19872484 >> /tmp/OraGI12Home1_patchList

  Action 6.5 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch napply -phBaseFile /tmp/OraGI12Home1_patchList \
-local  -invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc -oh /u01/app/12.1.0.2/grid -silent

Step 7 As the "root" user on the host "rac12sby1", please run the following commands:
  Action 7.1 [root@rac12sby1]#
  /u01/app/12.1.0.2/grid/rdbms/install/rootadd_rdbms.sh

Step 8 As the "root" user on the host "rac12sby1", please run the following commands:
  Action 8.1 [root@rac12sby1]#
  /u01/app/12.1.0.2/grid/perl/bin/perl -I/u01/app/12.1.0.2/grid/perl/lib \
-I/u01/app/12.1.0.2/grid/crs/install /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -postpatch


Step 9 As the "oracle" user on the host "rac12sby1", please run the following commands:
  Action 9.1 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory \
-invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc -oh /u01/app/12.1.0.2/grid | grep 19769473

  Action 9.2 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory \
-invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc -oh /u01/app/12.1.0.2/grid | grep 19769479

  Action 9.3 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory \
-invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc -oh /u01/app/12.1.0.2/grid | grep 19769480

  Action 9.4 [oracle@rac12sby1]$
  /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory \
-invPtrLoc /u01/app/12.1.0.2/grid/oraInst.loc -oh /u01/app/12.1.0.2/grid | grep 19872484

You could of course use oplan, which is bundled in $ORACLE_HOME/OPatch/oplan/oplan, to generate a much more detailed profile for the patch application. For a first glance at the activity, though, opatchauto -generateSteps seems quite useful.

The steps all rang a bell from my 11.2.0.1 days when this was the procedure to patch Grid Infrastructure (thank god that’s over). What I didn’t recognise was rootadd_rdbms.sh. Looking at the file I can see it is changing ownership and permissions for oradism (important!) and some other executables in $ORACLE_HOME and fixes potential problems with “fs.file-max*” in /etc/sysctl.conf.

Cartesian join

Some time ago I pulled off the apocryphal “from 2 hours to 10 seconds” trick for a client using a technique that is conceptually very simple but, like my example from last week, falls outside the pattern of generic SQL. The problem (with some camouflage) is as follows: we have a data set with 8 “type” attributes which are all mandatory columns. We have a “types” table with the same 8 columns together with two more columns that are used to translate a combination of attributes into a specific category and “level of relevance”. The “type” columns in the types table are, however, allowed to be null although each row must have at least one column that is not null – i.e. there is no row where every “type” column is null.

The task is to match each row in the big data set with all “sufficiently similar” rows in the types table and then pick the most appropriate of the matches – i.e. the match with the largest “level of relevance”. The data table had 500,000 rows in it, the types table had 900 rows. Here’s a very small data set representing the problem client data (cut down from 8 type columns to just 4 type columns):


create table big_table(
	id		number(10,0)	primary key,
	v1		varchar2(30),
	att1		number(6,0),
	att2		number(6,0),
	att3		number(6,0),
	att4		number(6,0),
	padding		varchar2(4000)
);

create table types(
	att1		number(6,0),
	att2		number(6,0),
	att3		number(6,0),
	att4		number(6,0),
	category	varchar2(12)	not null,
	relevance	number(4,0)	not null
);

insert into big_table values(1, 'asdfllkj', 1, 1, 2, 1, rpad('x',4000));
insert into big_table values(2, 'rirweute', 1, 3, 1, 4, rpad('x',4000));

insert into types values(   1, null, null, null, 'XX',  10);
insert into types values(   1, null, null,    1, 'YY',  20);
insert into types values(   1, null,    1, null, 'ZZ',  20);

commit;

A row from the types table is similar to a source row if it matches on all the non-null columns. So if we look at the first row in big_table, it matches the first row in types because att1 = 1 and all the other attN columns are null; it matches the second row because att1 = 1 and att4 = 1 and the other attN columns are null, but it doesn’t match the third row because types.att3 = 1 and big_table.att3 = 2.

Similarly, if we look at the second row in big_table, it matches the first row in types, doesn’t match the second row because types.att4 = 1 and big_table.att4 = 4, but does match the third row. Here’s how we can express the matching requirement in SQL:


select
	bt.id, bt.v1,
	ty.category,
	ty.relevance
from
	big_table	bt,
	types		ty
where
	nvl(ty.att1(+), bt.att1) = bt.att1
and	nvl(ty.att2(+), bt.att2) = bt.att2
and	nvl(ty.att3(+), bt.att3) = bt.att3
and	nvl(ty.att4(+), bt.att4) = bt.att4
;

You’ll realise, of course, that essentially we have to do a Cartesian merge join between the two tables. Since there’s no guaranteed matching column that we could use to join the two tables we have to look at every row in types for every row in big_table … and we have 500,000 rows in big_table and 900 in types, leading to an intermediate workload of 450,000,000 rows (with, in the client case, 8 checks for each of those rows). Runtime for the client was about 2 hours, at 100% CPU.

When you have to do a Cartesian merge join there doesn’t seem to be much scope for reducing the workload. However, I didn’t actually know what the data really looked like, so I ran a couple of queries to analyse it. The first was a simple “select count (distinct)” query to see how many different combinations of the 8 attributes existed in the client’s data set. It turned out to be slightly less than 400.

Problem solved – get a list of the distinct combinations, join that to the types table to translate to categories, then join the intermediate result set back to the original table. This, of course, is just applying two principles that I’ve discussed before: (a) be selective about using a table twice to reduce the workload, (b) aggregate early if you can reduce the scale of the problem.

Here’s my solution:


with main_data as (
	select
		/*+ materialize */
		id, v1, att1, att2, att3, att4
	from
		big_table
),
distinct_data as (
	select
		/*+ materialize */
		distinct att1, att2, att3, att4
	from	main_data
)
select
	md.id, md.v1, ty.category, ty.relevance
from
	distinct_data	dd,
	types		ty,
	main_data	md
where
	nvl(ty.att1(+), dd.att1) = dd.att1
and	nvl(ty.att2(+), dd.att2) = dd.att2
and	nvl(ty.att3(+), dd.att3) = dd.att3
and	nvl(ty.att4(+), dd.att4) = dd.att4
and	md.att1 = dd.att1
and	md.att2 = dd.att2
and	md.att3 = dd.att3
and	md.att4 = dd.att4
;

And here’s the execution plan.


---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |    12 |  2484 |    11  (10)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D6619_8FE93F1 |       |       |            |          |
|   3 |    TABLE ACCESS FULL       | BIG_TABLE                  |     2 |   164 |     2   (0)| 00:00:01 |
|   4 |   LOAD AS SELECT           | SYS_TEMP_0FD9D661A_8FE93F1 |       |       |            |          |
|   5 |    HASH UNIQUE             |                            |     2 |   104 |     3  (34)| 00:00:01 |
|   6 |     VIEW                   |                            |     2 |   104 |     2   (0)| 00:00:01 |
|   7 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D6619_8FE93F1 |     2 |   164 |     2   (0)| 00:00:01 |
|*  8 |   HASH JOIN                |                            |    12 |  2484 |     6   (0)| 00:00:01 |
|   9 |    NESTED LOOPS OUTER      |                            |     6 |   750 |     4   (0)| 00:00:01 |
|  10 |     VIEW                   |                            |     2 |   104 |     2   (0)| 00:00:01 |
|  11 |      TABLE ACCESS FULL     | SYS_TEMP_0FD9D661A_8FE93F1 |     2 |   104 |     2   (0)| 00:00:01 |
|* 12 |     TABLE ACCESS FULL      | TYPES                      |     3 |   219 |     1   (0)| 00:00:01 |
|  13 |    VIEW                    |                            |     2 |   164 |     2   (0)| 00:00:01 |
|  14 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D6619_8FE93F1 |     2 |   164 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access("MD"."ATT1"="DD"."ATT1" AND "MD"."ATT2"="DD"."ATT2" AND
              "MD"."ATT3"="DD"."ATT3" AND "MD"."ATT4"="DD"."ATT4")
  12 - filter("DD"."ATT1"=NVL("TY"."ATT1"(+),"DD"."ATT1") AND
              "DD"."ATT2"=NVL("TY"."ATT2"(+),"DD"."ATT2") AND
              "DD"."ATT3"=NVL("TY"."ATT3"(+),"DD"."ATT3") AND
              "DD"."ATT4"=NVL("TY"."ATT4"(+),"DD"."ATT4"))

Critically I’ve taken a Cartesian join that had a source of 500,000 rows and a target of 900 possible matches, and reduced it to a join between the 400 distinct combinations and the 900 possible matches. Clearly we can expect this to take something like 400/500,000 (roughly one 1,250th) of the work of the original join – bringing 7,200 seconds down to roughly 6 seconds. Once this step is complete we have an intermediate result set which is the 4 non-null type columns combined with the matching category and relevance columns – and can use this in a simple and efficient hash join with the original data set.

Logic dictated that the old and new results would be the same – but we did run the two hour query to check that the results matched.

Footnote: I was a little surprised that the optimizer produced a nested loops outer join rather than a Cartesian merge in the plan above – but that’s probably an artefact of the very small data sizes in my test. There’s presumably little point in transferring the data into the PGA when the volume is so small.

Footnote 2: I haven’t included the extra steps in the SQL to reduce the intermediate result to just “the most relevant” match – but that’s just an inline view with an analytic function. (The original code actually selected the data with an order by clause and used a client-side filter to eliminate the excess!).
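As a sketch of that inline view (this is my own illustration, not the original client code – the choice of row_number() and the alias names are assumptions), the “most relevant” filter could be wrapped around the main query, reusing the main_data and distinct_data factored subqueries from above:

```sql
-- Hypothetical sketch: keep only the most relevant match per row.
-- row_number() is one possible choice; rank() would keep ties instead.
select  id, v1, category, relevance
from    (
        select
                md.id, md.v1, ty.category, ty.relevance,
                row_number() over (
                        partition by md.id
                        order by     ty.relevance desc
                )       rn
        from
                distinct_data   dd,
                types           ty,
                main_data       md
        where
                nvl(ty.att1(+), dd.att1) = dd.att1
        and     nvl(ty.att2(+), dd.att2) = dd.att2
        and     nvl(ty.att3(+), dd.att3) = dd.att3
        and     nvl(ty.att4(+), dd.att4) = dd.att4
        and     md.att1 = dd.att1
        and     md.att2 = dd.att2
        and     md.att3 = dd.att3
        and     md.att4 = dd.att4
        )
where   rn = 1
;
```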

Footnote 3: The application was a multi-company application – and one of the other companies had not yet gone live on the system because they had a data set of 5 million rows to process and this query had never managed to run to completion in the available time window.  I’ll have to get back to the client some day and see if the larger data set also collapsed to a very small number of distinct combinations and how long the rewrite took with that data set.

 

Demos do fail.

I am an ardent believer in the “show me how it works” principle, and I usually have demos in my presentations. So there I was, presenting “Tools for advanced debugging in Solaris and Linux” with demos at IOUG Collaborate 2015 in Las Vegas on April 13, and my souped-up laptop (32G of memory, SSD drives, a high-end video processor etc.) was not responding when I tried to open the folder with my presentation files.

Sometimes, demos do fail. At least, I managed to complete the demos with zero slides :-) Apologies to the audience for my R-rated rants about laptop issues.

You can download presentations files from the links below.

Session_145_advanced_debugging_using_UNIX_tools

Session_189_Riyaj_Inmemory_internals_files

Good Singapore Maths Students Would Likely Make Good Oracle DBAs (Problems)

An interesting mathematics based question from a Singapore high school exam has been doing the internet rounds in the past few days. Considering it’s aimed at high school students, it’s a tricky one and obviously designed to filter out the better students, in a country with a very good reputation for churning out mathematically gifted […]

Little things worth knowing: direct path inserts and referential integrity

This is another post to remind myself that Oracle evolves, and what I thought I knew might no longer be relevant. So double-checking instead of assuming should become a habit!

Today’s example: direct path inserts. I seemed to remember from Oracle 9i that a direct path insert ignores referential integrity. This is still confirmed in the 9i Release 2 Concepts Guide, chapter 19 “Direct Path Insert”. Quoting from there:

During direct-path INSERT operations, Oracle appends the inserted data after existing data in the table. Data is written directly into datafiles, bypassing the buffer cache. Free space in the existing data is not reused, and referential integrity constraints are ignored

That sounds a bit harsh in today’s times so it’s worth a test. On Oracle 12.1.0.2 I created a parent/child relationship, admittedly rather crude:

SQL> create table parent (id, vc) as
  2   select distinct data_object_id, subobject_name from dba_objects
  3   where data_object_id is not null;

Table created.

SQL> alter table parent add constraint pk_parent primary key (id);

Table altered.

SQL> create table child (id number, p_id number not null, vc varchar2(100),
  2  constraint pk_child primary key (id),
  3  constraint fk_parent_child foreign key (p_id) references parent (id));

Table created.

Now when I try to insert data using a direct path insert it fails:

SQL> insert /*+ append */ into child select s_child.nextval, -1, 'this will not work' from parent;
insert /*+ append */ into child select s_child.nextval, -1, 'this will not work' from parent
*
ERROR at line 1:
ORA-02291: integrity constraint (MARTIN.FK_PARENT_CHILD) violated - parent key not found

Which is brilliant and what I expected since it prevents me from writing a lot of garbage into my table. If you have developed with Oracle then you probably know about deferrable constraints. An existing constraint can’t be changed to a status of “initially deferrable”, which is why I have to drop it and then try the insert again:

SQL> alter table child drop constraint FK_PARENT_CHILD;

Table altered.

Elapsed: 00:00:00.04
SQL> alter table child add constraint FK_PARENT_CHILD foreign key (p_id) references parent (id)
  2  deferrable initially deferred;

Table altered.

Elapsed: 00:00:00.02
SQL> insert /*+ append */ into child select s_child.nextval, -1, 'this will not work' from parent;

7640 rows created.

Elapsed: 00:00:00.30
SQL> commit;
commit
*
ERROR at line 1:
ORA-02091: transaction rolled back
ORA-02291: integrity constraint (MARTIN.FK_PARENT_CHILD) violated - parent key not found

So this constraint fires as well! Good news for me. If it’s all ok then the insert will of course succeed.

Trouble is that Oracle “silently” ignored the direct path load! I didn’t initially put this into the post, but thanks to Oren for adding it in the comments section (make sure to have a look at it).

Back to the “hot” constraint definition: inserting into the table yields the expected result.

SQL> alter table child drop constraint FK_PARENT_CHILD;

Table altered.

SQL> alter table child add constraint FK_PARENT_CHILD foreign key (p_id) references parent (id);

Table altered.

SQL> insert /*+ append */ into child select s_child.nextval, 2, 'this should work however' from parent;

7640 rows created.

SQL> commit;

Commit complete.

Summary

Again – the insert wasn’t really a “direct path insert”; see the comments. It really pays off to verify and update “old knowledge” from time to time! I still prefer valid data over garbage, and when it comes to ILM I can “move” my table to get the required compression result.
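A simple way to check whether an insert really went direct path – a sketch of my own, reusing the parent/child tables from above – is that a genuine direct-path insert leaves the table unreadable within the same transaction, raising ORA-12838:

```sql
-- Sketch: check whether the APPEND hint was actually honoured.
insert /*+ append */ into child
select s_child.nextval, 2, 'direct path test' from parent;

-- A genuine direct-path insert makes the table unreadable within
-- the same transaction:
select count(*) from child;
-- ORA-12838: cannot read/modify an object after modifying it in parallel
--   => direct path was used
-- If the query simply returns a count, the hint was silently ignored
-- and a conventional insert took place.

rollback;
```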

Have fun!

Not Exists

The following requirement appeared recently on OTN:


=========================================================================================================
I have a following query and want to get rid of the "NOT EXISTS' clause without changing the end results.

SELECT   A.c,
         A.d,
         A.e,
         A.f
  FROM   A
WHERE   NOT EXISTS (SELECT   1
                       FROM   B
                      WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e);
===========================================================================================================

Inevitably this wasn’t the problem query, and almost inevitably the OP was asking us how to implement a solution which wasn’t appropriate for a problem that shouldn’t have existed. Despite this it’s worth spending a little time to take the request at its face value and look at the sort of thing that could be going on.

First, of course, you cannot get rid of the “not exists” clause, although you may be able to make it look different. If you want “all the rows in A that are not referenced in B” then you HAVE to examine all the rows in A, and you have to do some sort of check for each row to see whether or not it exists in B. The only option you’ve got for doing something about the “not exists” clause is to find a way of making it as cheap as possible to implement.

A couple of people came up with suggestions for rewriting the query to make it more efficient. One suggested writing it as a “NOT IN” subquery, but it’s worth remembering that the optimizer may cheerfully transform a “NOT IN” subquery to a “NOT EXISTS” subquery if it’s legal, and a manual rewrite may overlook the problem of NULLs; another suggested rewriting the query as an outer join, but again it’s worth remembering that the optimizer may transform a “NOT EXISTS” subquery to an “ANTI-JOIN” – which is a bit like an outer join with filter, only more efficient. So, before suggesting a rewrite, it’s worth looking at the execution plan to see what the optimizer is doing, just in case it’s doing something silly. There are two options – anti-join or filter subquery.
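To illustrate the NULL trap with a manual “NOT IN” rewrite – a minimal sketch with two throw-away tables of my own (demo_a, demo_b), not the OP’s data – if the subquery can return a NULL, the “NOT IN” predicate can never evaluate to true, so the query returns no rows at all:

```sql
-- Sketch of the NOT IN / NULL trap.
create table demo_a (c number);
create table demo_b (c number);

insert into demo_a values (1);
insert into demo_b values (2);
insert into demo_b values (null);

-- "c not in (2, null)" is never true because "c != null" is unknown:
select * from demo_a
where  c not in (select c from demo_b);          -- no rows selected

-- NOT EXISTS is unaffected by the NULL and returns the row with c = 1:
select * from demo_a a
where  not exists (select null from demo_b b where b.c = a.c);
```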

Here, with code I’ve run under 10.2.0.5 to match the OP, is a demonstration data set, with the two plans you might expect to see – first, the data:


execute dbms_random.seed(0)

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(0,4))           c,
        trunc(dbms_random.value(0,5))           d,
        trunc(dbms_random.value(0,300))         e,
        rownum                                  f,
        rpad('x',100)                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

create table t2
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(0,4))           c,
        trunc(dbms_random.value(0,5))           d,
        trunc(dbms_random.value(0,300))         e,
        rownum                                  f,
        rpad('x',100)                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 24000
;

create index t1_i1 on t1(c,d,e);
create index t2_i1 on t2(c,d,e);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

The OP had followed up their original query with a claim that “Table A holds 100 million rows and table B holds 24,000” – that’s a lot of checks (if true) and you ought to be asking how quickly the OP expects the query to run and how many of the 100M rows are going to survive the check. I’ve set up just 1M rows with 6,000 distinct values for the column combination (c,d,e), and a reference table with 24,000 rows which are likely to include most, but not all, of those 6,000 combinations.

Rather than generate a very large output, I’ve written a query that generates the required data set, then counts it:


select
        max(f), count(*)
from (
        SELECT   /*+ no_merge */
                 A.c,
                 A.d,
                 A.e,
                 A.f
          FROM   t1 A
        WHERE   NOT EXISTS (SELECT   /* no_unnest */
                                      1
                               FROM   t2 B
                              WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e)
)
;

This took about 0.35 seconds to run – aggregating roughly 14,500 rows from 1M. The plan was (as I had expected) based on a (right) hash anti join:


---------------------------------------------------------------------------------
| Id  | Operation               | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |       |     1 |    13 |  2183   (5)| 00:00:11 |
|   1 |  SORT AGGREGATE         |       |     1 |    13 |            |          |
|   2 |   VIEW                  |       |   999K|    12M|  2183   (5)| 00:00:11 |
|*  3 |    HASH JOIN RIGHT ANTI |       |   999K|    23M|  2183   (5)| 00:00:11 |
|   4 |     INDEX FAST FULL SCAN| T2_I1 | 24000 |   234K|    11  (10)| 00:00:01 |
|   5 |     TABLE ACCESS FULL   | T1    |  1000K|    14M|  2151   (4)| 00:00:11 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("B"."C"="A"."C" AND "B"."D"="A"."D" AND "B"."E"="A"."E")

Oracle has built an in-memory hash table from the 24,000 rows in t2, then scanned the t1 table, probing the hash table with each row in turn. That’s 1M probes in less than 0.35 seconds. You ought to infer from this that most of the time spent in the original query should have been spent scanning the 100M rows, and that only a relatively small increment appears due to the “not exists” clause.
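The build/probe mechanics of that plan can be sketched in a few lines of Python (just an illustration of the idea; the row counts and value ranges are copied from the test data above, everything else is invented):

```python
# Minimal sketch of a hash anti-join: build a hash table (a Python set)
# on the small row source, stream the large one past it, keep the misses.
import random

random.seed(0)

# Build phase: hash table on t2(c,d,e) - 24,000 rows, ~6,000 distinct keys.
t2_keys = {(random.randrange(4), random.randrange(5), random.randrange(300))
           for _ in range(24000)}

# Probe phase: one cheap lookup per t1 row; anti-join keeps only the misses.
survivors = 0
for _ in range(1_000_000):
    key = (random.randrange(4), random.randrange(5), random.randrange(300))
    if key not in t2_keys:
        survivors += 1

print(survivors)   # a small fraction of the 1M probes survive
```

The point is that each probe is a single hash lookup, which is why adding the anti-join to a scan of the big table costs so little.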

You’ll notice, though, that there was a comment in my subquery with the /* no_unnest */ hint embedded – if I change this from a comment to a hint (/*+ */) I should get a plan with a filter subquery, and maybe that’s what’s happening to the OP for some odd reason. Here’s the plan:


------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |     1 |    13 | 15166   (1)| 00:01:16 |
|   1 |  SORT AGGREGATE      |       |     1 |    13 |            |          |
|   2 |   VIEW               |       |   999K|    12M| 15166   (1)| 00:01:16 |
|*  3 |    FILTER            |       |       |       |            |          |
|   4 |     TABLE ACCESS FULL| T1    |  1000K|    14M|  2155   (4)| 00:00:11 |
|*  5 |     INDEX RANGE SCAN | T2_I1 |     4 |    40 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "B"
              WHERE "B"."E"=:B1 AND "B"."D"=:B2 AND "B"."C"=:B3))
   5 - access("B"."C"=:B1 AND "B"."D"=:B2 AND "B"."E"=:B3)

The query took 1.65 seconds to complete. (And re-running with rowsource execution statistics enabled, I found that the subquery had executed roughly 914,000 times in that 1.65 seconds). Even if the original query had used the filter subquery plan the subquery shouldn’t have made much difference to the overall performance. Of course if T2 didn’t have that index on (c,d,e) then the filter subquery plan would have been much more expensive – but then, we would really have expected to see the hash anti-join.

If you’re wondering why the subquery ran 914,000 times instead of 1M times, you’ve forgotten “scalar subquery caching”.  The session caches a limited number of results from subquery execution as a query runs and may be able to use cached results (or simply a special “previous-execution” result) to minimise the number of executions of the subquery.
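If you want to see the effect of that caching in isolation, here’s a toy sketch in Python (a deliberate simplification: this cache is unbounded, whereas Oracle’s is a small hash table plus the most recent result; all names are invented):

```python
# Sketch of scalar subquery caching: memoise the subquery result per
# distinct correlation key, so repeated keys don't re-run the subquery.
executions = 0
cache = {}

def not_exists_in_t2(key, t2_keys):
    """Correlated NOT EXISTS check, with its result cached per key."""
    global executions
    if key not in cache:
        executions += 1                 # a real execution of the subquery
        cache[key] = key not in t2_keys
    return cache[key]

t2_keys = {(c, d) for c in range(2) for d in range(3)}        # toy t2(c,d)
t1_rows = [(c % 3, d % 3) for c in range(100) for d in range(100)]

survivors = sum(1 for row in t1_rows if not_exists_in_t2(row, t2_keys))
print(executions, len(t1_rows))   # 9 subquery executions for 10,000 probes
```

With only nine distinct keys the cache absorbs almost every probe; with many distinct keys arriving in random order a bounded cache keeps missing, which is why Oracle’s session still executed the subquery roughly 914,000 times.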

Did you notice the index I created on t1(c,d,e) ? If I drive the query through this index I’ll access all the rows for a given combination of (c,d,e) one after the other and only have to run the subquery once for the set. To make this happen, though, I’ll have to declare one of the columns to be NOT NULL, or add a suitable “column is not null” predicate to the query; and then I’ll probably have to hint the query anyway:


select
        max(f)
from (
        SELECT   /*+ no_merge index(a) */
                 A.c,
                 A.d,
                 A.e,
                 A.f
          FROM   t1 A
        WHERE   NOT EXISTS (SELECT   /*+ no_unnest */
                                      1
                               FROM   t2 B
                              WHERE   B.c = A.c AND B.d = A.d AND B.e = A.e)
        and     c is not null
)
;

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |     1 |    13 | 65706   (1)| 00:05:29 |
|   1 |  SORT AGGREGATE               |       |     1 |    13 |            |          |
|   2 |   VIEW                        |       |   999K|    12M| 65706   (1)| 00:05:29 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1    | 50000 |   732K| 52694   (1)| 00:04:24 |
|*  4 |     INDEX FULL SCAN           | T1_I1 | 50000 |       |  2869   (2)| 00:00:15 |
|*  5 |      INDEX RANGE SCAN         | T2_I1 |     4 |    40 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("C" IS NOT NULL AND  NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM
              "T2" "B" WHERE "B"."E"=:B1 AND "B"."D"=:B2 AND "B"."C"=:B3))
   5 - access("B"."C"=:B1 AND "B"."D"=:B2 AND "B"."E"=:B3)

Re-running this code with rowsource execution statistics enabled showed that the subquery ran just 6,000 times (as expected) – for a total run time that was slightly faster than the hash anti-join method (0.17 seconds – but I do have a new laptop using SSD only, with a 3.5GHz CPU and lots of memory).
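The reason the ordered access needs so few subquery executions can be sketched like this (a toy illustration in Python; the grouping stands in for the index delivering rows in key order, and the names are invented):

```python
# With rows arriving ordered by the key, the existence check only needs
# to run once per distinct key, not once per row - no cache required.
from itertools import groupby

t2_keys = {(0, 0), (1, 1)}                                # toy reference set
t1_rows = sorted((i % 3, i % 3, i) for i in range(30))    # ordered by (c, d)

checks = 0
survivors = []
for key, rows in groupby(t1_rows, key=lambda r: (r[0], r[1])):
    checks += 1                         # one "subquery execution" per key
    if key not in t2_keys:
        survivors.extend(rows)

print(checks, len(survivors))   # 3 checks for 30 rows; one key survives
```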

Every which way, if we can get reasonable performance from the underlying table access there’s no way that introducing a “NOT EXISTS” ought to be a disaster. The worst case scenario is that, for some reason, Oracle chooses to run a filter subquery plan and the appropriate index hasn’t been created to support it.

Footnote:

Of course, table A didn’t really exist, it was a three table join; and it didn’t produce 100M rows, it produced anything between zero and 5 million rows, and the effect of the subquery (which correlated back to two of the joined tables) was to leave anything between 0 and 5 million rows. And (apparently) the query was quick enough in the absence of the subquery (producing, for example, 1 million rows in only 5 minutes), but too slow with the subquery in place.

But that’s okay. Because of our tests we know that once we’ve produced a few million rows it takes fractions of a second more to pass them through a hash table with an anti-join to deal with the “not exists” subquery; and I doubt if we have to play silly games to push the data through a filter subquery plan in the right order to squeeze a few extra hundredths of a second from the query.

If the OP is happy with the basic select statement before the “not exists” subquery, all he has to do is take advantage of a no_merge hint:


select  {list of columns}
from
        (
        select /*+ no_merge */ .... rest of original query
        )    v1
where
        not exists (
                select  null
                from    b
                where   b.c = v1.c and b.d = v1.d and b.e = v1.e
        )
;

You’re probably wondering why the OP currently sees a performance problem as the subquery is added. The best guess is that the subquery has introduced a “magic 5% fudge factor” to the arithmetic (did you notice the cardinality of t1 dropping from 1M to 50,000 in the plan above?) and made the optimizer pick a worse execution plan for the rest of the query. We can’t tell, though, since the OP hasn’t yet given us the information that would allow us to see what’s going wrong.

Combined ACCESS And FILTER Predicates - Excessive Throw-Away

Catchy title... Let's assume the following data setup:


create table t1
as
select
        rownum as id
      , 1 as id2
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1e4
;

create table t2
as
select
        rownum as id
      , 1 as id2
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1e4
;

create table t3
as
select
        rownum as id
      , 1 as id2
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1e4
;

exec dbms_stats.gather_table_stats(null, 't1')

exec dbms_stats.gather_table_stats(null, 't2')

exec dbms_stats.gather_table_stats(null, 't3')

-- Deliberately wrong order (FBI after gather stats) - the virtual columns created for this FBI don't have statistics, see below
create index t2_idx on t2 (case when id2 = 1 then id2 else 1 end, case when id2 = 2 then id2 else 1 end, filler, id);

create index t3_idx on t3 (id, filler, id2);

And the following execution plan (all results are from 12.1.0.2 but should be applicable to other versions, too):


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      | 10000 |  1416K|   132   (0)| 00:00:01 |
|*  1 |  HASH JOIN           |      | 10000 |  1416K|   132   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL  | T2   | 10000 |   292K|    44   (0)| 00:00:01 |
|*  3 |   HASH JOIN          |      | 10000 |  1123K|    88   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL | T3   | 10000 | 70000 |    44   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL | T1   | 10000 |  1054K|    44   (0)| 00:00:01 |
-----------------------------------------------------------------------------

How long would you expect it to run to return all rows (no tricks like expensive regular expressions or user-defined PL/SQL functions)? It should probably take just a blink, given the tiny tables with just 10,000 rows each. However, these are the runtime statistics for a corresponding execution:


|            |            |            |
|            |DATABASE    |CPU         |
|DURATION    |TIME        |TIME        |
|------------|------------|------------|
|+0 00:00:23 |+0 00:00:23 |+0 00:00:23 |
|            |            |            |

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Ord | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Execs | A-Rows| PGA | Start | Dur(T)| Dur(A)| Time Active Graph | Activity Graph ASH | Top 5 Activity ASH |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | 6 | SELECT STATEMENT | | | | 132 (100)| | 1 | 1 | 0 | | | | | | |
|* 1 | 0 | 5 | HASH JOIN | | 10000 | 1191K| 132 (0)| 00:00:01 | 1 | 1 | 1401K | 2 | 23 | 22 | ##### ############## | @@@@@@@@@@@@@@@@@@@ (100%) | ON CPU(22) |
| 2 | 1 | 1 | TABLE ACCESS FULL | T2 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
|* 3 | 1 | 4 | HASH JOIN | | 10000 | 1123K| 88 (0)| 00:00:01 | 1 | 10K | 1930K | | | | | | |
| 4 | 3 | 2 | TABLE ACCESS FULL| T3 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
| 5 | 3 | 3 | TABLE ACCESS FULL| T1 | 10000 | 1054K| 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

How is it possible to burn more than 20 seconds of CPU time with that execution plan? The actual rows produced correspond pretty much to the estimated cardinalities (except for the final hash join), so that doesn't look suspect at first glance. What becomes obvious from the SQL Monitoring output is that all the time is spent on the hash join operation Id = 1. Of course, at that point (at the latest) you should tell me off for not having shown you the predicate section of the plan and the corresponding query in the first place. So here are the predicate section and the corresponding query:


Predicate Information (identified by operation id):
---------------------------------------------------

1 - access(CASE "T1"."ID2" WHEN 1 THEN "T1"."ID2" ELSE 1 END =CASE
"T2"."ID2" WHEN 1 THEN "T2"."ID2" ELSE 1 END AND CASE "T1"."ID2" WHEN
2 THEN "T1"."ID2" ELSE 1 END =CASE "T2"."ID2" WHEN 2 THEN "T2"."ID2"
ELSE 1 END )
filter("T3"."ID2"=CASE WHEN ("T1"."ID">"T2"."ID") THEN
"T1"."ID" ELSE "T2"."ID" END )

3 - access("T3"."ID"="T1"."ID")


select  /*+
                leading(t1 t3 t2)
                full(t1)
                full(t3)
                use_hash(t3)
                swap_join_inputs(t3)
                full(t2)
                use_hash(t2)
                swap_join_inputs(t2)
        */
        t1.*
--      , t3.id2
--      , case when t1.id > t2.id then t1.id else t2.id end
from
        t1
      , t2
      , t3
where
        1 = 1
--
and     case when t1.id2 = 1 then t1.id2 else 1 end = case when t2.id2 = 1 then t2.id2 else 1 end
and     case when t1.id2 = 2 then t1.id2 else 1 end = case when t2.id2 = 2 then t2.id2 else 1 end
--
and     t3.id = t1.id
and     t3.id2 = case when t1.id > t2.id then t1.id else t2.id end
;

There are two important aspectss to this query and the plan. First, the join expression (without corresponding expression statistics) between T1 and T2 is sufficiently deceptive to hide from the optimizer that it in fact produces a cartesian product (mimicking real-life multi-table join expressions that lead to bad estimates); second, table T3 is joined to both T1 and an expression based on T1 and T2, which means that this expression can only be evaluated after the join between T1 and T2.

With the execution plan shape enforced via my hints (but it could just as well be a real-life execution plan shape preferred by the optimizer), T3 and T1 are joined first, producing an innocent 10K-row row source, which is then joined to T2. And here the accident happens, inside the hash join operation: if you look closely at the predicate section you'll notice that the hash join operation has both an ACCESS predicate and a FILTER predicate. The ACCESS part performs a lookup into the hash table based on the join between T1 and T2, which happens to be a cartesian product, so it produces 10K times 10K rows; only afterwards is the FILTER (representing the T3 to T1/T2 join expression) applied to these 100M rows, and it matches only a single row in my example here, which is what A-Rows shows for this operation.

So the point is that this excessive work and FILTER throw-away isn't very well represented in the rowsource statistics. Ideally you would need one of the following two modifications to get a better picture of what is going on. Either the FILTER operator should be a separate step in the plan, which in theory would then look like this:


---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Execs | A-Rows|
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 132 (100)| | 1 | 1 |
|* 1a| FILTER | | 10000 | 1191K| 132 (0)| 00:00:01 | 1 | 1 |
|* 1b| HASH JOIN | | 10000 | 1191K| 132 (0)| 00:00:01 | 1 | 100M |
| 2 | TABLE ACCESS FULL | T2 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K |
|* 3 | HASH JOIN | | 10000 | 1123K| 88 (0)| 00:00:01 | 1 | 10K |
| 4 | TABLE ACCESS FULL| T3 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K |
| 5 | TABLE ACCESS FULL| T1 | 10000 | 1054K| 44 (0)| 00:00:01 | 1 | 10K |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1a- filter("T3"."ID2"=CASE WHEN ("T1"."ID">"T2"."ID") THEN
"T1"."ID" ELSE "T2"."ID" END )
1b- access(CASE "T1"."ID2" WHEN 1 THEN "T1"."ID2" ELSE 1 END =CASE
"T2"."ID2" WHEN 1 THEN "T2"."ID2" ELSE 1 END AND CASE "T1"."ID2" WHEN
2 THEN "T1"."ID2" ELSE 1 END =CASE "T2"."ID2" WHEN 2 THEN "T2"."ID2"
ELSE 1 END )
3 - access("T3"."ID"="T1"."ID")

Which would make the excess rows produced by the ACCESS part of the hash join very obvious, but is probably not a good solution for performance reasons, because the data would then have to flow from one operation to another rather than being processed within the HASH JOIN operator, which means increased overhead. Or an additional rowsource statistic could be made available:


----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Execs | A-Rows|AE-Rows|
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 132 (100)| | 1 | 1 | 1 |
|* 1 | HASH JOIN | | 10000 | 1191K| 132 (0)| 00:00:01 | 1 | 1 | 100M |
| 2 | TABLE ACCESS FULL | T2 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K | 10K |
|* 3 | HASH JOIN | | 10000 | 1123K| 88 (0)| 00:00:01 | 1 | 10K | 10K |
| 4 | TABLE ACCESS FULL| T3 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K | 10K |
| 5 | TABLE ACCESS FULL| T1 | 10000 | 1054K| 44 (0)| 00:00:01 | 1 | 10K | 10K |
----------------------------------------------------------------------------------------------------

Which I have called "actually evaluated rows (AE-Rows)" here. In addition to this case of combined ACCESS and FILTER operations, such a statistic could also be helpful for other FILTER cases, for example even for a simple full table scan, to see how many rows were evaluated and not only how many rows matched a possible filter (which is what A-Rows currently shows).

In a recent OTN thread this topic came up again, and since I have also come across this phenomenon a couple of times recently I thought I would put this note together. Note that Martin Preiss has submitted a corresponding database idea on the OTN forum.

Expanding on this idea a bit further, it could be useful to have an additional "estimated evaluated rows (EE-Rows)" calculated by the optimizer and shown in the plan. This could also be used to improve the optimizer's cost model for such cases, because at present it looks like the optimizer doesn't consider additional FILTER predicates on top of ACCESS predicates when calculating the CPU cost of operations like HASH JOINs.

Note that this problem isn't specific to HASH JOIN operations; you can get similar effects with other join methods, like NESTED LOOP joins, or even simple INDEX lookup operations, where again the ACCESS part isn't very selective and only the FILTER applied afterwards picks out the matching rows. Here are some examples with the given setup:


select  /*+
                leading(t1 t3 t2)
                full(t1)
                full(t3)
                use_hash(t3)
                swap_join_inputs(t3)
                index(t2)
                use_nl(t2)
        */
        t1.*
--      , t3.id2
--      , case when t1.id > t2.id then t1.id else t2.id end
from
        t1
      , t2
      , t3
where
        1 = 1
--
and     case when t1.id2 = 1 then t1.id2 else 1 end = case when t2.id2 = 1 then t2.id2 else 1 end
and     case when t1.id2 = 2 then t1.id2 else 1 end = case when t2.id2 = 2 then t2.id2 else 1 end
--
and     t3.id = t1.id
and     t3.id2 = case when t1.id > t2.id then t1.id else t2.id end
;

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Ord | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Execs | A-Rows| PGA | Start | Dur(T)| Dur(A)| Time Active Graph | Activity Graph ASH | Top 5 Activity ASH |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | 6 | SELECT STATEMENT | | | | 10090 (100)| | 1 | 1 | | | | | | | |
| 1 | 0 | 5 | NESTED LOOPS | | 10000 | 1416K| 10090 (1)| 00:00:01 | 1 | 1 | | | | | | | |
|* 2 | 1 | 3 | HASH JOIN | | 10000 | 1123K| 88 (0)| 00:00:01 | 1 | 10K | 1890K | | | | | | |
| 3 | 2 | 1 | TABLE ACCESS FULL| T3 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
| 4 | 2 | 2 | TABLE ACCESS FULL| T1 | 10000 | 1054K| 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
|* 5 | 1 | 4 | INDEX RANGE SCAN | T2_IDX | 1 | 30 | 1 (0)| 00:00:01 | 10K | 1 | | 3 | 33 | 32 | ################### | @@@@@@@@@@@@@@@@@@@ (100%) | ON CPU(32) |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T3"."ID"="T1"."ID")
5 - access(CASE "T1"."ID2" WHEN 1 THEN "T1"."ID2" ELSE 1 END
="T2"."SYS_NC00004$" AND CASE "T1"."ID2" WHEN 2 THEN "T1"."ID2" ELSE 1
END ="T2"."SYS_NC00005$")
filter("T3"."ID2"=CASE WHEN ("T1"."ID">"T2"."ID") THEN
"T1"."ID" ELSE "T2"."ID" END )



select  /*+
                leading(t1 t3 t2)
                full(t1)
                full(t3)
                use_hash(t3)
                swap_join_inputs(t3)
                index(t2)
                use_nl(t2)
        */
        max(t1.filler)
--      , t3.id2
--      , case when t1.id > t2.id then t1.id else t2.id end
from
        t1
      , t2
      , t3
where
        1 = 1
--
and     case when t1.id2 = 1 then t1.id2 else 1 end = case when t2.id2 = 1 then t2.id2 else 1 end
and     case when t1.id2 = 2 then t1.id2 else 1 end = case when t2.id2 = 2 then t2.id2 else 1 end
--
and     t3.id = t1.id
and     t2.filler >= t1.filler
and     t2.id = case when t1.id2 > t3.id2 then t1.id2 else t3.id2 end
;

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Pid | Ord | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Execs | A-Rows| PGA | Start | Dur(T)| Dur(A)| Time Active Graph | Activity Graph ASH | Top 5 Activity ASH |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | 7 | SELECT STATEMENT | | | | 20092 (100)| | 1 | 1 | | | | | | | |
| 1 | 0 | 6 | SORT AGGREGATE | | 1 | 223 | | | 1 | 1 | | | | | | | |
| 2 | 1 | 5 | NESTED LOOPS | | 1 | 223 | 20092 (1)| 00:00:01 | 1 | 10K | | | | | | | |
|* 3 | 2 | 3 | HASH JOIN | | 10000 | 1123K| 88 (0)| 00:00:01 | 1 | 10K | 1900K | | | | | | |
| 4 | 3 | 1 | TABLE ACCESS FULL| T3 | 10000 | 70000 | 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
| 5 | 3 | 2 | TABLE ACCESS FULL| T1 | 10000 | 1054K| 44 (0)| 00:00:01 | 1 | 10K | | | | | | | |
|* 6 | 2 | 4 | INDEX RANGE SCAN | T2_IDX | 1 | 108 | 2 (0)| 00:00:01 | 10K | 10K | | 2 | 34 | 34 | #################### | @@@@@@@@@@@@@@@@@@@ (100%) | ON CPU(34) |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("T3"."ID"="T1"."ID")
6 - access(CASE "T1"."ID2" WHEN 1 THEN "T1"."ID2" ELSE 1 END
="T2"."SYS_NC00004$" AND CASE "T1"."ID2" WHEN 2 THEN "T1"."ID2" ELSE 1
END ="T2"."SYS_NC00005$" AND "T2"."FILLER">="T1"."FILLER" AND
"T2"."ID"=CASE WHEN ("T1"."ID2">"T3"."ID2") THEN "T1"."ID2" ELSE
"T3"."ID2" END AND "T2"."FILLER" IS NOT NULL)
filter("T2"."ID"=CASE WHEN ("T1"."ID2">"T3"."ID2") THEN
"T1"."ID2" ELSE "T3"."ID2" END )

The former exhibits exactly the same problem as the HASH JOIN example, except that the FILTER is evaluated in the inner row source of a NESTED LOOP join, after the index access operation. The latter shows, as a variation, the classic partial "index access" caused by a range comparison in between: although the entire expression can be evaluated at index level, the access part matches every index entry, so the range scan actually needs to walk the entire index on each loop iteration, and the FILTER is then applied to all the index entries evaluated.
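To see why A-Rows alone hides the work in all of these cases, here is a scaled-down sketch (in Python, with invented row counts) of a join whose ACCESS predicate matches everything and whose FILTER keeps almost nothing:

```python
# Sketch of a join with both an ACCESS and a FILTER predicate: the access
# key is non-selective (every probe row matches every build row), so the
# join evaluates n*n candidate rows before the filter throws almost all of
# them away. A-Rows only ever shows the survivors.
n = 1000                                    # scaled-down row counts

build = [(1, i) for i in range(n)]          # (access_key, id): all the same key
probe = [(1, i) for i in range(n)]

hash_table = {}
for key, build_id in build:
    hash_table.setdefault(key, []).append(build_id)

evaluated = produced = 0
for key, probe_id in probe:                 # ACCESS: probe on the join key
    for build_id in hash_table.get(key, []):
        evaluated += 1                      # candidate row seen by the join
        if probe_id == build_id == 0:       # FILTER: matches a single pair
            produced += 1

print(evaluated, produced)   # 1000000 candidates evaluated, 1 row produced
```

An "AE-Rows" column would surface the 1,000,000 where A-Rows reports only the 1.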

No Data Loss without Synchronous Network

I don’t usually write about specific products; but once in a while I encounter something worth talking about, especially if it addresses a very common issue anyone with datacenter management responsibilities will understand: avoiding the otherwise unavoidable loss of last-minute data changes in the database after a disaster, without expensive synchronous replication. This product solves that problem with an innovative out-of-the-box approach. The product is Phoenix Systems from Axxana.

Replication and Last Mile Transactions

Imagine this scenario: the datacenter is on fire. All components—servers, storage, network—are destroyed. You can’t salvage anything of value. What happens to the data you need to continue your business? Fortunately you have a disaster recovery plan. You replicate the database to a remote datacenter using either Data Guard (from Oracle) or some storage-level technology such as EMC’s SRDF. So the database comes up on the remote site; but there is a problem. Since you used asynchronous replication (such as maximum performance mode in Data Guard), not all changes have made it to the remote site. The final few changes, usually known as last mile transactions, were yet to be shipped to the remote site when the disaster struck. These are lost forever in the fire. While you do have a database at the remote datacenter, and that database is almost complete—the operative word is “almost”; not 100%, mind you—some data is inevitably lost. What if you can’t afford to be “almost” complete? For organizations such as financial institutions, hospitals, manufacturing concerns and airlines, losing data is simply not an option; this is not a luxury they can afford. Even in other organizations, where data loss may not be quite as unacceptable, the loss of data creates a sense of uncertainty, since you don’t know exactly what was lost.

If you need to be 100% certain that all data is available at the remote site, what would you do?
It’s easy, using synchronous replication. All changes on the primary database are reflected on the remote database site in real time. If you used Data Guard, you would have run it in maximum protection mode. In that case, Oracle would have committed the changes in the remote database before confirming the commit at the local site. When the primary site is gone, the remote site is available with 100% of the data. Bye bye, “almost”.
Brilliant. So, why didn’t you do it? Better yet, why doesn’t everyone do it, since it is that simple? Data loss could be unacceptable in the worst case and unsettling at best. Why accept it?

Synchronous Replication

Well, there is a pesky little detail. Synchronous replication means the network connectivity has to be rock solid. Networks have three general characteristics: throughput (how much data they can pump; imagine a three-lane highway compared to a single-lane local road), latency (how much time passes to process the data, not the actual speed; for instance, the car on the highway may travel at 70 miles per hour, but a few minutes are spent on the ramp to the highway) and reliability (does the data move from point A to point B with 100% accuracy, or does it have to be re-transmitted a few times?).

Synchronous replication requires the throughput to be very high (to support the large amount of change, at least during bursts of activity), the latency to be very low (otherwise the remote site will get the data late and respond to the primary even later, causing the primary to hold off the commit) and the network to be extremely reliable (otherwise the primary may think the remote site is not accessible and therefore shut itself down to protect the data). If you run Data Guard in maximum protection mode, you are essentially telling Oracle to make sure that the remote site has absolutely, positively, undoubtedly (add any other adverb you can think of) received the change and has committed it.

If Oracle can’t ensure that for any reason, such as not getting the response in time due to a less reliable network, what choices does it have? If it allows further data changes, those changes are not yet at the remote site. If the primary database fails at that point, the data is gone. Therefore Oracle has no choice but to shut down the primary database to stop any changes coming in. It’s maximum protection, after all, and that’s what you have instructed it to do.
So, if you decide to use maximum protection, you have to use a high-throughput, low-latency and extremely reliable network infrastructure. Most public commercial network infrastructures are not up to that standard. Either you have to contract a commercial carrier to provide this elevated level of service, or you have to build your own, for example using dark fiber. The costs of the network infrastructure become exponentially high, especially when the remote site is farther away from the primary. In many cases, the cost of the network itself may be several times that of the database infrastructure it protects. Owing to the cost limitations, you may be forced to locate the remote site close by, e.g. in New York City and Hoboken, NJ. It will still be exceedingly expensive; and it may not offer the degree of protection that you expect, since these two cities are close enough to be in the same exposure area of disasters such as floods, hurricanes, war and so on. The farther away the remote site is, the more protected your data is; but the higher your cost. Could you accomplish no data loss without this expensive proposition?
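To put the latency point in rough numbers: a commit in maximum protection mode can’t complete faster than the round trip to the standby, and light in fibre travels at roughly 5 microseconds per kilometre one way (a back-of-envelope figure; real networks add switching and processing delays on top):

```python
# Back-of-envelope: propagation delay alone puts a floor under commit
# latency in synchronous replication. The 5 us/km figure is approximate.
FIBER_US_PER_KM = 5            # ~5 microseconds per km one-way in fibre

def min_commit_ms(distance_km):
    """Lower bound on commit latency (ms) from round-trip propagation."""
    return 2 * distance_km * FIBER_US_PER_KM / 1000

for km in (10, 100, 1000):
    floor = min_commit_ms(km)
    print(f"{km} km -> at least {floor} ms per commit, "
          f"at most {int(1000 / floor)} serial commits/sec per session")
```

This is why the remote site tends to end up close by (New York City and Hoboken), and why a distant, genuinely disaster-isolated standby makes synchronous commits painful.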
Until now, the decision was really black and white. If you want no data loss at all, you have no choice but to go for a super expensive network solution. Many companies can’t justify that high sticker price and therefore settle for potential data loss. Many, in fact, make detailed plans, as part of their business continuity efforts, for handling this lost data.
It's assumed that zero data loss is synonymous with an expensive network: if you don't have a high-throughput, low-latency, highly reliable network, you have to live with some data loss.
Here is the good news: that's no longer true. It's now possible to run over cheaper, commoditized public network infrastructure and still have complete data protection. Allow me to explain.

Data Protection in Oracle

In Oracle, the data files are written asynchronously, at intervals unrelated to the data changes and commits. In other words, when you commit a change, the data files may not yet contain that changed data. In fact, the change initially exists only in memory (in an area called the buffer cache) and may not reach the data files for hours afterwards. Conversely, when you make a change but do not commit it, the data can still be persisted to the data files. Let me repeat that: the data files are updated with the changed data even if you haven't committed yet. This is why, if you have a storage- or operating-system-level replication solution (even a synchronous one) replicating the data files, the remote site may or may not have the data, even hours after the change.
How does Oracle protect data that was changed and committed but lives only in memory, if the data files don't have it? It captures the pre- and post-change data and packages them into structures called redo records. Remember, these have nothing to do with data blocks; they merely describe the changes made to the database. This redo data, made up of change vectors, is written to a special area of memory called the log buffer. When you commit, the relevant redo from the log buffer is written to special files in the database called redo log files, also known as online redo log files. The commit waits until this write, known as a redo flush, has finished; you can see Oracle sessions waiting for it under the wait event "log file sync". Since the changes, most importantly the committed ones, are recorded in the redo log files, Oracle does not need to rely on memory alone to know which changes are committed and which are not. In case of a failure, Oracle examines the redo logs to find these changes and updates the data files accordingly. Redo logs are very small compared to the data files.
By the way, a redo flush also occurs at other times: every three seconds, when 1 MB of redo accumulates, when the log buffer is one-third full, and on a few other events. Those additional flushes simply keep the redo log file up to date even when there is no commit.
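The mechanism described above can be caricatured in a few lines of Python. This is a toy sketch, not Oracle's implementation (all names here are made up for illustration); it just shows that a commit waits only for the redo flush, while the data files are written lazily by checkpoints and may even pick up uncommitted changes:

```python
# Toy write-ahead-logging sketch (illustrative only; not Oracle code).
class ToyDatabase:
    def __init__(self):
        self.buffer_cache = {}  # changed blocks, in memory only
        self.data_files = {}    # written lazily, at checkpoints
        self.log_buffer = []    # redo records not yet on disk
        self.redo_log = []      # redo records safely on disk

    def change(self, key, value):
        self.buffer_cache[key] = value
        self.log_buffer.append(("change", key, value))

    def commit(self):
        # "log file sync": the commit returns only after the redo flush
        self.log_buffer.append(("commit",))
        self.redo_log.extend(self.log_buffer)
        self.log_buffer.clear()

    def checkpoint(self):
        # data files catch up on their own schedule, regardless of commits
        self.data_files.update(self.buffer_cache)

db = ToyDatabase()
db.change("acct_1", 100)
db.commit()                 # redo for acct_1 is now on disk...
print(db.data_files)        # ...but the data files are still empty: {}

db.change("acct_2", 200)    # changed, NOT committed
db.checkpoint()
print(db.data_files)        # uncommitted acct_2 now sits in the data files
```

Notice the asymmetry: after the commit the data files still held nothing, yet after the checkpoint they held an uncommitted change. Only the redo log knows what was actually committed, which is exactly why it is the key to recovery.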
As you can see, the redo log file becomes the most important thing in the data recovery process. When a disaster occurs, you may have copies of the data files at the remote site (thanks to the replication), but as you learned in the previous section, those copies are not immediately usable: they may be missing committed changes and may even contain uncommitted ones. In other words, the copy is not considered "consistent" by Oracle. After a disaster, Oracle needs to read the redo log files and apply the changes to make the data files consistent. This is known as "media recovery", and you have to initiate it.

With synchronous replication at the storage or operating system level, the redo logs at the remote site are perfectly in sync with the primary, and Oracle has no trouble getting to the last committed transaction just before the failure; no data is lost in the recovery. With Data Guard in maximum protection mode, this is not even required, since the changes are already applied at the remote site. But what about your cheaper commodity network with asynchronous replication? The redo logs at the remote site will lag the primary's. When you perform media recovery, you can't get to the very last change before the failure, simply because you may not have it; you can only recover up to the last information available in the redo at the remote site. This is an "incomplete" media recovery, as opposed to the "complete" media recovery described earlier. To complete the recovery you would need the information in the redo log files at the primary site; but remember, the primary site is destroyed. That information is gone, and you end up with data loss. Perhaps worse, you won't even know exactly how much you lost, since you don't have access to the primary redo log files.
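Under the simplifying assumption that redo is just an ordered list of (change number, key, value) records, the difference between complete and incomplete media recovery can be sketched as follows (illustrative Python only, not Oracle's actual recovery code):

```python
# Media-recovery sketch: replay whatever redo survived onto a restored
# copy of the data files. If the tail of the redo is missing (as with
# asynchronous replication), recovery stops early: that is an
# "incomplete" recovery, and the trailing changes are lost.

def media_recovery(restored_data_files, available_redo):
    recovered = dict(restored_data_files)
    last_applied = None
    for change_no, key, value in available_redo:
        recovered[key] = value
        last_applied = change_no
    return recovered, last_applied

primary_redo = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]

# Complete recovery: every redo record is available
complete, last1 = media_recovery({}, primary_redo)

# Incomplete recovery: the async remote site never received record 3
incomplete, last2 = media_recovery({}, primary_redo[:2])

print(last1, complete)    # 3 {'a': 10, 'b': 20, 'c': 30}
print(last2, incomplete)  # 2 {'a': 10, 'b': 20} -- change 3 is gone
```

The sketch makes the key point concrete: the quality of the recovery is determined entirely by how much of the redo tail survives the disaster.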
Now, consider this situation carefully. All you need to complete the recovery is the last redo log file from the primary database. Unfortunately that file is unavailable because it was destroyed, or is unreachable because the site itself is. This tiny little file is the only thing standing between you and a complete recovery. What if you somehow, magically, had access to it even though the rest of the data center is gone? You would be able to complete the recovery with no data loss and look like a hero, and all without a synchronous replication solution on a super expensive network.

Enter the Black Box

Oracle's redo data can be written to multiple files at the same time. Such a set is called a redo log group and the files are called members. A log flush writes to all members before confirming the flush, so all members of a group contain the same data; multiple members exist purely for redundancy. As long as one member of a group is available, Oracle can use it to perform recovery. This is where the new tool from Axxana comes in. It is a storage device, named Phoenix, on which you create the second member of each redo log group. The first member of the group stays on your normal storage as usual. When disaster strikes and nukes the primary site, you still have a copy of the all-important redo logs on the Phoenix system.
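The multiplexing idea is simple enough to sketch in Python (a hypothetical model; `RedoGroup` and the member names are invented for illustration and are not Oracle or Axxana APIs):

```python
# Redo log group with multiplexed members: every flush writes each
# record to ALL members, so any single surviving member is enough
# for recovery.

class RedoGroup:
    def __init__(self, member_names):
        self.members = {name: [] for name in member_names}

    def flush(self, record):
        # the flush is confirmed only after every member has the record
        for log in self.members.values():
            log.append(record)

    def surviving(self, destroyed):
        return {n: log for n, log in self.members.items()
                if n not in destroyed}

group = RedoGroup(["primary_storage", "phoenix_box"])
group.flush("change #1")
group.flush("change #2")

# Disaster destroys the primary storage; the Phoenix member still
# holds identical contents, so recovery can proceed from it alone.
print(group.surviving({"primary_storage"}))
```

Because every member receives every record before the flush is confirmed, losing the primary member costs nothing as long as the Phoenix member survives.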
This is where the first benefit of the Phoenix system comes in. The storage is not just any ordinary one: it's encased in a bomb-proof, fire-proof and water-proof container that protects the internal storage from many calamities. The unit has normal connectivity, such as a network port for attaching it as NAS storage and a fiber port for connecting it to a fiber switch as SAN storage. Under ordinary circumstances you use these ports to connect it to your infrastructure. After a disaster, thanks to the nearly indestructible enclosure, this storage will most likely be intact. All you have to do is access the data on it and perform a complete recovery. No data needs to be lost.
But how do you get to the information on the Phoenix system? This is where the second benefit comes in. The Phoenix system at the primary site transfers its data to a companion Phoenix component at the remote site, creating a replica of the required redo logs there. Since the only data on it is the small set of redo log files, the amount to transfer is tiny and does not take a toll on your other network infrastructure.
Note a very important point: the Phoenix system does not transfer any data during normal operation. It's only during a disaster that it transports the required redo log files, allowing a 100% recovery at the remote site.


But all of this still depends on getting the data out of the Phoenix system that was at the primary site. What if the disaster site is physically inaccessible, or it is infeasible to physically transport the Phoenix system to a location where it can be connected to your network? That's quite possible with floods, hurricanes and other natural disasters, or manmade ones like war or strikes. The network cables are also likely to be out of commission after a disaster. Without that access, how can the Phoenix system extract those last-mile transactions from the primary site, you might ask.

No worries: a cellular modem is built into the Phoenix system, allowing you to connect to it from the remote site and transfer the data wirelessly over a cellular network. The system also has its own battery, so it stays operational even when external power is gone, a common occurrence in almost any calamity. What's more, the post-disaster data transfer can use this cellular connectivity too, so you may never need to physically connect the storage at the (now defunct) primary site to your network; the data you need to perform the complete, no-data-loss recovery may already have been transferred to the system at the remote site and be waiting for you. Either way, you have access to the data. And all this comes without investing in synchronous replication and expensive network infrastructure.

In summary: assuming the primary site is gone and you are using a cheaper asynchronous network, the remote site will be only partially up to date, and the last-mile transactions would normally be lost along with the primary database. However, since you placed an additional member of each redo log group on the disaster-proof Phoenix system, those last-mile transactions are intact there, even though the remote site doesn't have them yet. At this point you have multiple options:

1) If the network infrastructure between the primary and remote sites is still operational (rare, but possible), the Phoenix data transfer creates the copy of the redo logs at the remote site automatically. With this in place, you can perform a complete database recovery. Data loss: 0%.

2) If the network infrastructure is not operational, Phoenix will automatically engage the built-in cellular modem and initiate the data transfer to the remote site. It will be slow, but the amount of data to be transferred is small, so it won't matter much. Once that is complete, your recovery will be complete with 0% data loss.

3) If you don't have cellular access either, but you do have access to the physical Phoenix device, you can mount it on your network and perform a complete recovery.

As I said at the start, this is a very simple and elegant out-of-the-box solution to a complex problem. It opens up the possibility of a no-data-loss recovery after a disaster where that option didn't previously exist. In a future post I will describe my hands-on experience with the Phoenix system in detail, with screenshots, scripts, etc. Stay tuned.
More about Axxana: www.axxana.com
