Who's online

There are currently 0 users and 31 guests online.

Recent comments


Oakies Blog Aggregator

Planet Earth

I know it’s another post that’s not about Oracle, but someone sent me this video link a couple of days ago and it’s too wonderful not to share. (I’ve just got back from Iceland, so the Aurora Borealis at 1:05 is particularly relevant)

Coming Soon

Just a quick reminder if you’re in Northern Europe: I’ll be in Denmark on 13th December presenting a new tutorial event Beating the Oracle Optimizer.

If you’ve been on my “Designing Optimal SQL” course, this is a follow up that takes you in the realm of designing “strange” SQL for those occasions when the Optimizer isnt going to do well enough with “normal” SQL.

In other news:

Don’t forget the UKOUG annual conference in Birmingham from 5th – 7th Dec.

Anyone attending the full event can register for a free seat at the Oak Table Day on Sunday 4th Dec. where I’ll be presenting a geeky session about Redo.

If your abstract wasn’t selected this year, then I see there’s a series of “unconference” slots set aside in the exhibition gallery on the Monday to Wednesday.



As I’ve often pointed out, this blog isn’t AskTom, or the OTN forum, so I don’t expect to have people asking me to solve their problems; neither do I answer email questions about specific problems. Occasionally, though, questions do appear that are worth a little public airing, and one of these came in by email a couple of weeks ago. The question is longer than the answer I sent, my contribution to the exchange doesn’t start until the heading: “My Reply”.

Last week I find a very interesting thing about use_hash hint accidentally. That is when you have join two tables using unique column from one table and you have a equal predicate on that column, you cannot use hint to make them using hash join.  I know that it does not make sense to use hash join in this case because nested loop is the best way to do it. The point is why Oracle ignore the hint here.

Here is the test case.

--Table creation

create table t1 as
select  rownum id,
        object_name name
from    all_objects
where   rownum <= 1000

create table t2 as
        mod(rownum,20)+1 id
from    dual
connect by
        rownum <= 1000;

-- Index creation (for table T1, we can create a primary key index or unique index)

alter table t1 add constraint t1_pk primary key(id);
create index ind_t2 on t2(id);

-- Gather table statistics here

select    /*+ ordered use_hash(t2) */
from    t1, t2
where =
and = 1

Here, the use_hash hint will be ignored. Without rewriting the query, oracle only uses nested loop (which is the best thing, other join method are completely no make sense).

In your article, you said:

Why do people think that Oracle “ignores” hints ? There are two main reasons.

    1.  The available hints are not properly documented
    2.  the hints are rarely used properly – because they are not documented properly.
    3.  (there are a few bugs that make things go really wrong anyway) – and yes, I know that’s the third reason of two.

In the above test case, reasons 1 and 2 do not apply. So is it an Oracle feature or a bug?

We can argue that this is a feature since in this case oracle really know nested loop is the best thing. Then for table T1

select  /*+ full(t1) */
from    t1
where = 1

The above query should also ignore the FULL hint, but it does not.

Now, if we change the the query to:

select  /*+ ordered use_hash(t2) */
from    t1,t2
where =
and in (1,2)

The hint works and oracle use hash join here.

Since oracle can use transitive closure when generating query plan, now, we rewrite the first query to an equivalent one:

select  /*+ ordered use_hash(t2) */
from    t1,t2
where =
and = 1

Oracle will use the hint here even though oracle knows we want = 1.
So, it looks this is more like a bug than a feature. What do you think?

My reply

If you read chapter 6 of Cost Based Oracle – Fundamentals, somewhere around page 142, you will see what’s going on. There is an inconsistency in this part of the optimizer code here which could be addressed but might be hard to change. (So the answer to your question is that this is more like a bug than a feature – but it’s a side effect of a more significant defect in the code, rather than a very local bug.)

In the first case (predicate on t1), Oracle uses transitive closure to generate a predicate on t2 - and as it does so it drops the join predicate. This makes the hash join impossible and puts the hint out of context.

In the second case, Oracle keeps the join predicate (and that’s the inconsistency in behaviour), so the hash join is still legal and therefore the hint has to be obeyed.

I would guess that the logic works like this:

Predciate on t1: We generate a transitive predicate, but the source predicate is strong>equality on a , so the join predicate is redundant and can be dropped.

 Predicate on t2: We generate a transitive predicate, but the source predicate is not a guaranteed to be single row predicate, so the join predicate has to be kept and checked. (The “not unique” requirement is also why the join on the in-list behaves the way it does).

Arguably Oracle could introduce a second pass in the optimizer code which could note that the generated predicate has resulted in a predicate with equality on a unique key, and with this change in place the optimizer could decide to drop the join predicate and we would be back to consistent behaviour. In fact, to my mind, the code should never drop predicates – but it needs to change so that it recognises “redundant” predicates properly and doesn’t double count them in the calculation of join selectivity.

Friday Philosophy – OK, so I am on Twitter Now

Not a very exciting Friday Philosophy this week I’m afraid, just a self-publicising announcement that I am now on Twitter. I’ve put the wordpress widget on the blog for a while (days or weeks, I don’t know), my twitter name is MDWidlake. {I was a little surprised mwidlake had gone already but that says more about how rare I consider my surname to be than naivety, I hope}. It seems you can click on a part of the widget to follow me, which is a pretty safe thing to do as I am not very verbal as yet.

As I said, I’m not very active at the moment, I’m more following just a few friends and seeing what people use Twitter for. So far it mostly seems to be about:

  • Random stuff posted when bored
  • Complaining about work or, more specifically, tasks that are proving trickier than hoped
  • Drinking
  • Random stuff posted when bored
  • Articles that have caught someone’s eye
  • …or more often, about tweets about articles that have caught someone’s eye
  • Chatty stuff that only makes sense between social peers (and isn’t that one of the main points of something like Twitter?)
  • Random stuff posted when bored
  • Cuddly toys. I think that is a result of low sample size and that Doug Burns is away at a conference. I worry about his sanity sometimes.

Niall Litchfield, Neil Chandler and Doug Burns were right {thanks again for your advice, gents}, there is some nice stuff on there and I’ve already seen some articles and web pages I found interesting via it – but I have also failed to get on with proper work-like stuff I should have been doing as a result.

I also like the chatty extension to real social engagement that Twitter gives but I hold out on my final decision as to whether this makes up for the negative impact it seems to have on real, meeting-in-person socialising.

The interface to Twitter seems a bit, well, rubbish to me. I know, I’ve been on there for all of a week and I am probably missing the Bleedin’ Obvious  but it seems the stuff I see in Timeline, the default view, is just a subset of what people I follow say. I suspect that it’s got something to do with whether the person the tweet is replying to is on my follow list. To understand half the social stuff you have to go clicking around on people’s full tweet history and follow the thread back. Surely there is an easier way than this, maybe some connect-by tree-walk SQL could be invoked…

I’ve already dropped one person off my “following” list. I only followed one celebrity and I decided I could live without the random musings of Simon Pegg. I can imagine people get addicted to following several dozen b to z level celebs, maybe it’s like constantly living in some sort of poor quality reality tv show {Personally I tend to avoid all reality TV, I prefer reality. Except that I am forced to watch that dancing thing on BBC by my wife. And like most men who make that sort of defence, I can’t quite explain away why I still watch it if she is away…}.

So, don’t expect too much in the way of interesting, witty, insightful or even existing tweets from me as yet, but if you want to follow, heck you can always drop me like a sack of manure any time you like :-) .

Little Things Doth Crabby Make – Part XVII. I See xfs_mkfile(8) Making Fragmented Files.

BLOG UPDATE 21-NOV-2011: The comment thread for this post is extremely relevant.


I recently had an “exchange of ideas” with an individual. It was this individual’s assertion that modern systems exhibit memory latencies measured in microseconds.

Since I haven’t worked on a system with microsecond-memory since late in the last millennium I sort of let the conversation languish.

The topic of systems speeds and feeds was fresh on my mind from that conversation when I encountered something that motivated me to produce this installment in the Little Things Doth Crabby Make series.

This installment in the series has to do with disk scan throughput and file system fragmentation. But what does that have to do with modern systems’ memory latency? Well, I’ll try to explain.

Even though I haven’t had the displeasure of dealing with microsecond memory, this century, I do recall such ancient systems were routinely fed (and swamped) by just a few hundred megabytes per second disk scan throughput.

I try to keep things like that in perspective when I’m fretting over the loss of 126MB/s like I was the other day. Especially when the 126MB/s is a paltry 13% degradation in the systems I was analyzing! Modern systems are a modern marvel!

But what does any of that have to do with XFS and fragmentation? Please allow me to explain. I had a bit of testing going where 13% (for 126MB/s) did make me crabby (it’s Little Things Doth Crabby Make after all).

The synopsis of the test, and thus the central topic of this post, was:

  1. Create and initialize a 32GB file whilst the server is otherwise idle
  2. Flush the Linux page cache
  3. Use dd(1) to scan the file with 64KB reads — measure performance
  4. Use xfs_bmap(8) to report on file extent allocation and fragmentation

Step number 1 in the test varied the file creation/initialization method between the following three techniques/tools:

  1. xfs_mkfile(8)
  2. dd(1) with 1GB writes (yes, this works if you have sufficient memory)
  3. dd(1) with 64KB writes

The following screen-scrape shows that the xfs_mkfile(8) case rendered a file that delivered scan performance significantly worse than the two dd(1) cases. The degradation was 13%:

# xfs_mkfile 32g testfile
 # sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
 # dd if=testfile of=/dev/null bs=64k
 524288+0 records in
 524288+0 records out
 34359738368 bytes (34 GB) copied, 40.8091 seconds, 842 MB/s
 # xfs_bmap -v testfile > frag.xfs_mkfile.out 2>&1
 # rm -f testfile
 # dd if=/dev/zero of=testfile bs=1024M count=32
 32+0 records in
 32+0 records out
 34359738368 bytes (34 GB) copied, 22.1434 seconds, 1.6 GB/s
 # sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
 # dd if=testfile of=/dev/null bs=64k
 524288+0 records in
 524288+0 records out
 34359738368 bytes (34 GB) copied, 35.5057 seconds, 968 MB/s
 # xfs_bmap -v testfile > frag.ddLargeWrites.out 2>&1
 # rm testfile
 # df -h .
 Filesystem Size Used Avail Use% Mounted on
 /dev/sdb 2.7T 373G 2.4T 14% /data1
 # dd if=/dev/zero of=testfile bs=1M count=32678
 32678+0 records in
 32678+0 records out
 34265366528 bytes (34 GB) copied, 21.6339 seconds, 1.6 GB/s
 # sync;sync;sync;echo "3" > /proc/sys/vm/drop_caches
 # dd if=testfile of=/dev/null bs=64k
 522848+0 records in
 522848+0 records out
 34265366528 bytes (34 GB) copied, 35.3932 seconds, 968 MB/s
 # xfs_bmap -v testfile > frag.ddSmallWrites.out 2>&1

I was surprised by the xfs_mkfile(8) case. Let’s take a look at the xfs_bmap(8) output.

First, the two maps from the dd(1) files:

# cat frag.ddSmallWrites.out
 0: [0..9961471]: 1245119816..1255081287 6 (166187576..176149047) 9961472
 1: [9961472..26705919]: 1342791800..1359536247 7 (84037520..100781967) 16744448
 2: [26705920..43450367]: 1480316192..1497060639 8 (41739872..58484319) 16744448
 3: [43450368..66924543]: 1509826928..1533301103 8 (71250608..94724783) 23474176
 # cat frag.ddLargeWrites.out
 0: [0..9928703]: 1245119816..1255048519 6 (166187576..176116279) 9928704
 1: [9928704..26673151]: 1342791800..1359536247 7 (84037520..100781967) 16744448
 2: [26673152..43417599]: 1480316192..1497060639 8 (41739872..58484319) 16744448
 3: [43417600..67108863]: 1509826928..1533518191 8 (71250608..94941871) 23691264

The mapping of file offsets to extents is quite close in the dd(1) file cases. Moreover, XFS gave me 4 extents for my 32GB file. I like that..but…

So what about the xfs_mkfile(8) case? Well, not so good.

I’ll post a blog update when I figure out more about what’s going on. In the meantime, I’ll just paste it and that will be the end of this post for the time being:

# cat frag.xfs_mkfile.out
 0: [0..10239]: 719289592..719299831 4 (1432..11671) 10240
 1: [10240..14335]: 719300664..719304759 4 (12504..16599) 4096
 2: [14336..46591]: 719329072..719361327 4 (40912..73167) 32256
 3: [46592..78847]: 719361840..719394095 4 (73680..105935) 32256
 4: [78848..111103]: 719394608..719426863 4 (106448..138703) 32256
 5: [111104..143359]: 719427376..719459631 4 (139216..171471) 32256
 6: [143360..175615]: 719460144..719492399 4 (171984..204239) 32256
 7: [175616..207871]: 719492912..719525167 4 (204752..237007) 32256
 8: [207872..240127]: 719525680..719557935 4 (237520..269775) 32256
 [...3,964 lines deleted...]
 3972: [51041280..51073535]: 1115787376..1115819631 6 (36855136..36887391) 32256
 3973: [51073536..51083775]: 1115842464..1115852703 6 (36910224..36920463) 10240
 3974: [51083776..51116031]: 1115852912..1115885167 6 (36920672..36952927) 32256
 3975: [51116032..54897663]: 1142259368..1146040999 6 (63327128..67108759) 3781632
 3976: [54897664..55078911]: 1146077440..1146258687 6 (67145200..67326447) 181248
 3977: [55078912..56094207]: 1195607400..1196622695 6 (116675160..117690455) 1015296
 3978: [56094208..67108863]: 1245119816..1256134471 6 (166187576..177202231) 11014656

Filed under: oracle

Procedural SQL*Plus and Password Encryption

One  small but bothersome  issue I’ve had for 20 years is  how to drive a program like SQL*Plus with a shell script to make it procedural.  One approach is to just run single commands to SQL*Plus causing a connection and exit for every SQL statement. Connecting and disconnecting is costly. Ideally, I just want to open up a  connection and send commands to the connection and get the response. Of course languages like java, perl, python all have Oracle connection APIs but what if I just want a simple shell script and don’t have access to perl or python with the Oracle APIs or access to java? Can’t I just do it in shell?  Can’t I just connect SQL*Plus to a named pipe and echo my commands into the pipe? Well yes but there is an annoying obstacle. After echoing the first command the pipe, which SQL*Plus dutifully executes, SQL*Plus then exits. I first ran into this problem about 20 years ago and didn’t solve it at the time.  A few years later, I figured out a trick to make it work and have been using it ever since. The trick is to have a process run a “tail -f” of an empty file into the pipe. With this second process tailing, SQL*Plus doesn’t exit when reading from the pipe. Since I first started using this I’ve never looked into exactly why. It think that when SQL*Plus tries to read from the pipe after the first command has been sent, the OS says, no more data , and SQL*Plus exits. With the second process doing the “tail -f”, then the OS tells SQL*Plus, waiting for more data to send you, and SQL*Plus waits. Would love a more detailed explanation.

function usage
       echo "Usage: $(basename $0)    [sid] [port]"
       echo "  username        database username"
       echo "  username        database password"
       echo "  host            hostname or IP address"
       echo "  sid             optional database sid (default: orcl)"
       echo "  port            optional database port (default: 1521)"
       exit 2
[[ $# -lt 3 ]] && usage
[[ $# -gt 5 ]] && usage
[[ $# -gt 0 ]] && UN=$1
[[ $# -gt 1 ]] && PW=$2
[[ $# -gt 2 ]] && HOST=$3
[[ $# -gt 3 ]] && SID=$4
[[ $# -gt 4 ]] && PORT=$5
  sh ./$CLEAN > /dev/null 2>&1
  echo "" > $CLEAN
  echo "rm $OPEN" >> $CLEAN
  echo "rm $PIPE" >> $CLEAN
  rm $OPEN $PIPE > /dev/null 2>&1
  touch  $OPEN
  cmd="$MKNOD $PIPE p"
  eval $cmd
  tail -f $OPEN >> $PIPE &
  echo "kill -9 $OPENID" >> $CLEAN
  # run SQLPLUS silent unless DEBUG is 2 or higher
  if [ $DEBUG -lt 2 ]; then
                     (ADDRESS_LIST=             \
                         (ADDRESS=              \
                             (PROTOCOL=TCP)     \
                             (HOST=$HOST)       \
                             (PORT=$PORT)))     \
                      (CONNECT_DATA=            \
                             (SERVER=DEDICATED) \
   cmd="sqlplus $SILENT \"$CONNECT\" < $PIPE > /dev/null &"
   cmd="sqlplus $SILENT \"$CONNECT\" < $PIPE  &"
   echo "$cmd"
   echo "PIPE $PIPE"
   eval $cmd
   echo "kill -9 $SQLID" >> $CLEAN

Example execution

$ ./  scott tiger
PIPE sqlplus.pipe
$ echo 'select * from dual;' > sqlplus.pipe
$ echo 'exit' > sqlplus.pipe
$ sh sqlplus.clean

The above code will create a SQL*Plus connection reading it’s input from a pipe. It also creates a cleanup file to remove the pipe file and kill the tail process. If creating multiple connections then the file names will have to be pre/post-pended with some string to keep them separate. could be a timestamp. Could be info about which target.

Now what does this have to do with password encryption?
Well there are packages that handle password encryption. Oracle has as cool thing called wallets, that can be setup so that SQL*Plus can connect without a password.
Oracle’s Doc
Here is a quick setup:
Unfortunately Oracle’s wallet method requires files that can’t be redistributed. These file dependencies legally have to be installed by the end user,  where as Oracle instant client can be redistributed. So what if I’m redistributing a tool that uses instant client? Then the wallet solution is out of the question, at least for easy installs.
Now what if I create my own binary to handle password encryption, like “Oracle Password Repository”
This doesn’t help that much for protection, because a user can actually get the text password. What I want is a way to encrypt passwords and hide  the unencrypted passwords from the user. Sure I want  them to connect, then they have access to the database,  but I want to prevent them from walking off with a file full of clear text passwords.  One solution, like the Oracle wallet with the “cwallet.sso” files, is to have a binary that creates the connections  for SQL*Plus over a pipe and then pass the pipe to the users. Bit of a hack, especially for an interactive users, but for scripts that run SQL*Plus it not only centralizes the passwords, but it encrypts and  helps prevent the user from getting access to and walking away with a set of clear text passwords.


Reguarding the beginning of this blog post and the problem of SQL*Plus exiting after receiving the first command via a named pipe, here is what truss looks like echoing ‘select * from dual’ into the pipe with and without having a second process tailing (nothing) into the pipe. First example has a second process doing a “tail -f ” of an empty file into the pipe while echoing ‘select * from dual’ into the pipe which SQL*Plus reads, executes and stays connected

fstat64(1, 0x08044440)                          = 0
write(1, "\n", 1)                               = 1
write(1, " D\n", 2)                             = 2
write(1, " -\n", 2)                             = 2
write(1, " X\n", 2)                             = 2
write(1, "\n", 1)                               = 1
read(0, 0x0813FAD4, 5120)       (sleeping...)

Second example, there is no “tail -f” and we just do “echo ‘select * from dual;’” into the pipe which SQL*Plus executes then exits:

write(1, "\n", 1)                               = 1
write(1, " D\n", 2)                             = 2
write(1, " -\n", 2)                             = 2
write(1, " X\n", 2)                             = 2
write(1, "\n", 1)                               = 1
write(1, " S Q L >  ", 5)                       = 5
read(0, 0x0813F714, 5120)                       = 0
write(4, "\0\r\0\006\0\0\0\0\003\t".., 13)      = 13
read(4, "\011\0\006\0\0\0\0\0\t01".., 2064)     = 17
write(1, " D i s c o n n e c t e d".., 95)      = 95
write(1, " W i t h   t h e   P a r".., 78)      = 78
write(4, "\0\n\0\006\0\0\0\0 @", 10)            = 10

Here is a good explanation

If all file descriptors referring to the write end of a pipe have been closed, then an attempt to read(2)
from the pipe will see end-of-file (read(2) will return 0).

The part that isn’t explained, for me, is that a reader will wait until at the write end has been opened.  So the EOF doesn’t happened until there is an open and a close and all open write file descriptors have to be closed, thus adding a never finishing write will keep the reader from reading an EOF.

Here is the code for OpenSolaris (thanks to Erik Schrock for this links)


16K CR gets for UQ index scan? Easy. Kind of.

I planned to write on this for quite some time, but failed to do so. Sorry about that. Today I finally got time and desire to describe a situation from the title. It was observed on an Oracle running Solaris SPARC; manifested itself as a severe CPU burning at 100% utilization with ‘latch free’ on the first place of the Top 5 wait events in Statspack report.

Here is part of a 1 hour report:

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:            -72,280.51            -18,601.83
              Logical reads:            411,881.06            106,000.09
              Block changes:              2,515.53                647.39
             Physical reads:                665.17                171.18
            Physical writes:                218.18                 56.15
                 User calls:                394.87                101.62
                     Parses:                 79.46                 20.45
                Hard parses:                  0.94                  0.24
                      Sorts:                885.81                227.97
                     Logons:                  0.05                  0.01
                   Executes:              1,160.92                298.77
               Transactions:                  3.89

  % Blocks changed per Read:    0.61    Recursive Call %:     75.88
 Rollback per transaction %:    5.17       Rows per Sort:      5.95

Instance Efficiency Percentages (Target 100%)
            Buffer Nowait %:   99.98       Redo NoWait %:    100.00
            Buffer  Hit   %:   99.84    In-memory Sort %:    100.00
            Library Hit   %:   99.93        Soft Parse %:     98.82
         Execute to Parse %:   93.16         Latch Hit %:     89.32
Parse CPU to Parse Elapsd %:   84.20     % Non-Parse CPU:     99.57

 Shared Pool Statistics        Begin   End
                               ------  ------
             Memory Usage %:   90.41   91.42
    % SQL with executions>1:   31.20   35.28
  % Memory for SQL w/exec>1:   30.16   30.05

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                      2,529,903      49,710    48.22
CPU time                                                       31,694    30.75
db file sequential read                         1,228,775      13,164    12.77
buffer busy waits                                 269,229       3,076     2.98
enqueue                                             2,276       2,567     2.49 

And this is Latch sleep breakdown:

                                      Get                            Spin &
Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
cache buffers chains        4,190,999,464 ###########   2,524,798 467719813/20
library cache                  27,510,901      44,028         931 43116/895/15
row cache enqueue latch        10,692,790      27,738          16 27722/16/0/0
cache buffer handles            6,478,263       7,240          66 7175/64/1/0/
shared pool                     2,657,115       5,706         412 5300/400/6/0

As you can see, ‘latch free’ waits are almost entirely accounted to the infamous ‘cache buffers chains’ latch gets. You can easily find information on what it is and why it could happen, and how to diagnose such contention. Here I’ll just show you what I’ve found by accident (just staring at the report). I noticed that two queries – #2 and #3 in the SQL Ordered by Gets section – which are nothing but a unique index access to a table, had ~16K gets per execution:

                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    333,759,439       21,538       15,496.3   22.4  7028.63  24546.46 4091048759
Module: JDBC Thin Client

    259,164,764       16,137       16,060.3   17.4  5391.01  18934.74  307775861
Module: JDBC Thin Client
select ID from T where ID=:1

How that could possibly happen? At first I thought the index used by queries was missing, but that was not the case. Second thought – the index has exploded due to modifications (I found one query that probably had inserted lots of rows to the table, but rows_processed for the statement was zero) and its height was increased, which caused other sessions to do much more work when accessing both index and table blocks. That was just an idea which I didn’t know how to confirm or dismiss so I asked wise guys and they were kind enough to point out to me that index height increase is not a necessity to have a spike in CR gets for concurrent SQL access. All that is needed is just an uncommitted transaction that inserts rows to the table somewhere nearby the normally accessed data. Yep, it sounded reasonable to me and I was OK with this explanation. I did confirm that the long running INSERT statement was indeed adding millions of rows to the table due to an application bug (missing condition), but failed to do so because of transaction timeout on the WebLogic set to half an hour. So the statement actually was rolled back and that’s why rows_processed for the query was zero.
But when I tried to reproduce this situation to prepare test case for this blog entry, I was not able to do so. A simple access by index to the table with an in-flight transaction inserting huge amount of data is definitely not enough to reproduce such situation. So it must have been one of those scenarios when a (table or index) block has been updated many times, and subsequent transactions had to apply undo to reverse changes to a block made by other transactions. It’s very easy to setup so I won’t do that here. Also I think it’s impossible to identify which block caused the issue having just a standard Statspack report in hands. Anyway, here is statistics comparison that shows undo access was the cause for increased LIO and ‘cache buffers chain’ latch contention:

Statname Normal R1 R2 Comments
branch node splits




consistent changes




number of times undo was applied to construct a consistent block version
consistent gets




consistent gets – examination




number of times a block (such as undo or index root) was accessed under protection of ‘cache buffers chain’ latch
data blocks consistent reads – undo records applied




number of undo records applied while constructing CR block copy
leaf node splits




transaction tables consistent read – undo records applied




user calls




Filed under: Oracle, Performance Tagged: indexes, read consistency

Simplified GNS setup for RAC and newer

One of the main problems I have seen with GNS (Grid Naming Service) installations was that you couldn’t really see if your DNS and DHCP configuration was correct until it’s been too late. This has been addressed, but it’s little known. There are a number of checks you can run before starting Oracle Universal Installer, and this post is about them.

What is the Grid Naming System?

I was initially drawn towards the GNS when it was initially released with It is aimed at environments where the Oracle DBAs take on (yet another) piece of work, namely the DNS administration. By virtue of “subdomain delegation”, the master DNS server responsible for “” hands off requests for a subdomain to this – – to an Oracle managed process. This was quite poorly documented initially, prompting me to figure it out myself in an earlier post:

The problem with GNS in was that you couldn’t really test if the DNS setup was sufficient for Oracle Installer to work, and I had a few attempts at the installation (the discussion here takes into account that I might not have been able to perform sufficient checking!)

Implementing DNS and DHCP

As a quick reminder, here is the subdomain delegation bit you add to your forward resolution zone file (with bind as the example):

$ORIGIN rac.localdomain.
@       IN      NS      gns.rac.localdomain.
gns.rac.localdomain.    IN      A

Here, I have a domain called “localdomain” for which is the primary NS. Anything for “*.rac.localdomain” will be handed off to gns.rac.localdomain with IP address This IP address will later be defined as the GNS VIP and as it seems, it’s the only address that must be registered in DNS. This is one of the biggest changes-with “manual” name resolution you’d add the SCAN, VIP, and public names for all cluster nodes in DNS before installation.

DHCP is very simple to set up and my example in the previously mentioned post are perfectly sufficient to get started.

Implementing on the host

The below is a sample “/etc/resolv.conf” I used – ensure that your /etc/nsswitch.conf sets “hosts” to files then DNS, which is the documented way.

options attempts: 2
options timeout: 1

search rac.localdomain localdomain

The “attempts” and “timeout” options were new to my builds, and the search order must include the GNS domain, before it bubbles up to the corporate domain. This doesn’t have to be a top level domain by the way, I just kept it simple.

The hostname, as defined in /etc/hosts (and NOT in DNS like I said), has to be in the DNS subdomain. I opted for rac11203gnsnode1.rac.localdomain, with IP address Nothing else is defined in /etc/hosts, optionally you could define the private interconnect addresses in there as well.

Testing and validating the setup

The cluvfy utility has been enhanced to test GNS before you go through the installation process and finding out it all failed. Before you install, you invoke runcluvfy comp gns -precrsinst as in this example:

[oracle@rac11203gnsnode1 ~]$ runcluvfy comp gns -precrsinst -domain rac.localdomain -vip -verbose -n rac11203gnsnode1

Verifying GNS integrity

Checking GNS integrity...
Checking if the GNS subdomain name is valid...
The GNS subdomain name "rac.localdomain" is a valid domain name
Checking if the GNS VIP is a valid address...
GNS VIP "" resolves to a valid IP address
Checking the status of GNS VIP...

GNS integrity check passed

Verification of GNS integrity was successful.

The command is fairly self explanatory: pass the subdomain you want GNS to manage and the GNS VIP, and off it goes. With my settings in the zone file (the reverse file doesn’t have any entries) it completed successfully.

The GNS integrity can also be verified post installation as in this example:

[oracle@rac11203gnsnode1 ~]$ cluvfy comp gns -postcrsinst -verbose

Verifying GNS integrity

Checking GNS integrity...
Checking if the GNS subdomain name is valid...
The GNS subdomain name "rac.localdomain" is a valid domain name
Checking if the GNS VIP belongs to same subnet as the public network...
Public network subnets "" match with the GNS VIP ""
Checking if the GNS VIP is a valid address...
GNS VIP "" resolves to a valid IP address
Checking the status of GNS VIP...
Checking if FDQN names for domain "rac.localdomain" are reachable

GNS resolved IP addresses are reachable

GNS resolved IP addresses are reachable

GNS resolved IP addresses are reachable
Checking status of GNS resource...
Node          Running?                  Enabled?
------------  ------------------------  ------------------------
rac11203gnsnode1  yes                       yes

GNS resource configuration check passed
Checking status of GNS VIP resource...
Node          Running?                  Enabled?
------------  ------------------------  ------------------------
rac11203gnsnode1  yes                       yes

GNS VIP resource configuration check passed.

GNS integrity check passed

Verification of GNS integrity was successful.
[oracle@rac11203gnsnode1 ~]$

Post installation

The srvctl utility has a few more options as well to query the GNS state. The most comprehensive one is the “-a” flag:

[oracle@rac11203gnsnode1 ~]$ srvctl config gns -a
GNS is enabled.
GNS is listening for DNS server requests on port 53
GNS is using port 5353 to connect to mDNS
GNS status: OK
Domain served by GNS: rac.localdomain
GNS version:
GNS VIP network:

There are more granular options returning version, status etc. The most useful of these options seems to be the “-l”, listing all allocated IPs (note that is NOT supplied by GNS, but defined in the host file). Since we only connect via VIP or SCAN this is not a problem.

For my 1 node experimental systems, these IP addresses were used:

[oracle@rac11203gnsnode1 ~]$ srvctl config gns -l
Name                 Type Value
gnsclu-scan1-vip     A
gnsclu-scan2-vip     A
gnsclu-scan3-vip     A
rac11203gnsnode1-vip A
scan                 A
scan                 A
scan                 A
[oracle@rac11203gnsnode1 ~]$

Finally {}

If found the following addition to crsctl option very useful to quickly set up a DNS test “server”-surely not something to be done in production but interesting nevertheless.

[oracle@rac11203gnsnode1 ~]$ crsctl start testdns -h


 crsctl start testdns [-address ] [-port ][-domain ] [-once][-v]
 Start a test DNS listener that listens on the given address at the given port and for specified domain
 IP_address IP address to be used by the listener (defaults to hostname)
 port The port on which the listener will listen. Default value is 53.
 domain The domain query for which to listen. By default, all domain queries are processed.

-once Flag indicating that DNS listener should exit after one DNS query packet is received
 -v Verbose output
[oracle@rac11203gnsnode1 ~]$

You might also consider this interesting DNS query tool:

[oracle@rac11203gnsnode1 ~]$ crsctl query dns -servers
CRS-10018: the following configuration was found on the system:
CRS-10019: There are 2 domains in search order. They are:
CRS-10022: There are 2 name servers. They are:
CRS-10020: number of retry attempts for name lookup is: 4
CRS-10021: timeout for each name lookup is: 5
[oracle@rac11203gnsnode1 ~]$

And if you don’t like the manual page for nslookup, you could use this little command instead to interrogate your DNS setup:

[oracle@rac11203gnsnode1 ~]$ crsctl query dns -h
 crsctl query dns -servers
 Lists the system configured DNS server, search paths, attempt and timeout values

crsctl query dns -name  [-dnsserver ] [-port ] [-attempts ] [-timeout ] [-v]
 Returns a list of addresses returned by DNS lookup of the name with the specified DNS server
 name Fully qualified domain name to lookup
 DNS_server_address Address of the DNS server on which name needs to be looked up
 port Port on which DNS server is listening
 attempts Number of retry attempts
 timeout Timeout in seconds


The handling of GNS has improved greatly, and a lot more information is available about what’s happening under the covers. Unknown to many, the main Clusterware tools undergo revisions all the time, so it’s worth running crsctl commands on the various objects available with the “-h” flag to stay up to date.

Spam Poetry

Here is a spam comment that I thought deserved publishing in its entirety but without the spam links. I hope you all enjoy the lyricism as much as I did. Incomparable Blog. I tot up this Blog to my bookmarks.Thanks for alluring the in the nick of time b soon to examine this, I lean [...]

New Defaults, Old Side Effects

When 11.2 came out I posted about deferred segment creation at and a couple of odd side effects. Oracle published a Note  1050193.1 that makes the quite extraordinary claim that Sequences are not guaranteed to generate all consecutive values starting with the ‘START WITH’ value. It’s absolutely true that sequences don’t guarantee no gaps – but [...]