Bind Variable Peeking – execution plan inefficiency

In my Beyond Oracle Wait interface article I troubleshooted a test case where an execution plan somehow went “crazy” and started burning CPU, lots of logical IOs and the query never completed.

I have uploaded the test case I used to my new website, to a section where I will upload some of my demo scripts which I show at my seminars (and people can download & test these themselves too):

Basically what I do is this:

  1. I run the query with bind variable values where only a handful of rows match the filter condition. Thus Oracle picks nested loop join (and indexed access path)
  2. Then I run the same query with different bind values, where a lot of rows match the filter condition. Oracle reuses existing execution plan (with nested loops!!!). Oracle ends up looping through a lot of blocks again and again (because nested loop visits the “right” side of the join once for every row coming from the “left” side of the join).

Using nested loops over lots of rows is a sure way to kill your performance.

And an interesting thing with my script is that the problem still happens in Oracle 11.1 and 11.2 too!

Oracle 11g has Adaptive Cursor Sharing, right? This should take care of such a problem, right? Well no, adaptive bind variable peeking is a reactive technique – it only kicks in after the problem has happened!

How to setup a private DNS for your virtual cluster

One of the challenges I faced recently was building a virtual cluster based on Oracle 11g Release 2 on top of Oracle Enterprise Linux (OEL) running inside VMware server. Although I have an existing virtual Oracle 11g Release 1 cluster, I decided to build a new one in order to be able to teach both […]

The Core Performance Fundamentals Of Oracle Data Warehousing – Partitioning

[back to Introduction] Partitioning is an essential performance feature for an Oracle data warehouse because partition elimination (or partition pruning) generally results in the elimination of a significant amount of table data to be scanned. This results in a need for less system resources and improved query performance. Someone once told me “the fastest I/O is the one that never happens.” This is precisely the reason that partitioning is a must for Oracle data warehouses – it’s a huge I/O eliminator. I frequently refer to partition elimination as the anti-index. An index is used to find a small amount data that is required; partitioning is used to eliminate vasts amounts of data that is not required. Main Uses For Partitioning I would classify the main reasons to use partitioning in your Oracle data warehouse into these four areas: Data Elimination Partition-Wise Joins Manageability (Partition Exchange Load, Local Indexes, etc.) Information Lifecycle Management (ILM) Partitioning Basics The most common partitioning design pattern found in Oracle data warehouses is to partition the fact tables by range (or interval) on the event date/time column. This allows for partition elimination of all the data not in the desired time window in queries. For example: If I have a [...]

The Core Performance Fundamentals Of Oracle Data Warehousing – Table Compression

[back to Introduction] Editor’s note: This blog post does not cover Exadata Hybrid Columnar Compression. The first thing that comes to most people’s mind when database table compression is mentioned is the savings it yields in terms of disk space. While reducing the footprint of data on disk is relevant, I would argue it is the lesser of the benefits for data warehouses. Disk capacity is very cheap and generally plentiful, however, disk bandwidth (scan speed) is proportional to the number of spindles, no mater what the disk capacity and thus is more expensive. Table compression reduces the footprint on the disk drives that a given data set occupies so the amount of physical data that must be read off the disk platters is reduced when compared to the uncompressed version. For example, if 4000 GB of raw data can compress to 1000 GB, it can be read off the same disk drives 4X as fast because it is reading and transferring 1/4 of the data off the spindles (relative to the uncompressed size). Likewise, table compression allows for the database buffer cache to contain more data without having to increase the memory allocation because more rows can be stored [...]

The Oracle Wait Interface Is Useless (sometimes) – part 3a

OK, here it is, the ‘first part of the last part’, though the topics discussed in these articles will be discussed more over time in my blog and in Tanel’s. I’ve split it into two subparts, because it was just getting insanely long as single posting.

Before I get going on this one, let’s just clear up a misunderstanding from the first part of this series. The first part uses a specific example which, for clarity reasons, will continue to be used for the remainder of the series. The example shown is some kind of query-related issue in this case, but the approach shown here is a more general one that will locate the root cause in many, many other cases than just the case of a bad query. This SQL example is still a good one, though, because there are no waits, lots of CPU, and no system calls. But the underlying problem might not be caused by poor SQL, and diving straight into SQL tuning could waste you 2 days fixing SQL when that is not the root cause. Just because we happen to be running SQL, it does not mean that it is a SQL problem. What we are trying to achieve here is a more robust approach to root cause diagnosis of Oracle performance problems, one where we don’t guess, we don’t assume, we don’t stick with things we’ve seen before: We quantitatively find the real problem. That might be the smart way, it might be the lazy way – those might be the same thing! But most crucially, it’s the fast way to fixing the real problem.

So, back to the main topic. In the first part of this blog, I walked through the example problem and showed that we have a problem that has the following attributes:

  • The user is reporting a server response time problem
  • Zero Oracle waits
  • 100% CPU
  • Zero system calls
  • DB Time is showing all the CPU and elapsed time is spent executing SQL (in this example…)

Note to Kevin: Though I stated the query ‘never’ returns in part one, I was, of cause, using my usual exaggerative vernacular. I meant, of course, that it was taking a very long time to return, but still returning correct results. Sorry about that – it will ‘never’ happen again.

At this point, we might jump into a prolonged SQL tuning exercise. But let’s not for now, because that is still guesswork in my opinion. Let’s make the following reasonable assumption: The whole system is running slowly, using huge quantities of CPU, and this is a mission-critical system where you cannot simply start killing queries and re-executing them in your SQL*Plus session. That’s a reasonable assumption – I have worked on many, many, production systems where this is the case. So let’s carry on diagnosing without actually interfering with the production system in any way.

This is where we might deviate from time-based diagnostics. Not because we don’t think that time is the single most valuable metric on which to determine what ‘fast’ and ‘slow’ actually mean in quantitative terms, but because there is a bit of a shortcut available to us here that might just nail the cause of the problem. That shortcut is session-based statistics, and they have nothing to do with time, being simple counters.

This is where it makes no sense at all to re-invent the wheel, and was why Tanel is on the hook for this part: Tanel has already done some significant work on session-based statitics analysis and has written a great tool, snapper, for just this purpose. If you have not already read part two of this series, head over to Tanel’s blog now to catch up before proceeding!

OK, so we’ve abandoned the wait interface, and let’s assume that the sessions statistics did not yield anything particularly obvious. Or, maybe you just want to dig a little deeper, or possibly take yet another short cut to the truth? Now it’s time to look beyond Oracle and into the Operating System. The Operating System has a completely different viewpoint of the operation of Oracle, notably through the use of execution stack profiling. Don’t run away just yet, it’s not as difficult as you might imagine.

Various methods exist to probe the execution stack across the various Operating Systems. All of these should be used with caution, as they have varying degrees of intrusiveness on the operation of Oracle. The more intrusive methods, such as gdb, and the (possibly) less tested methods (oradebug short_stack) have very real potential to crash the process that you are attaching to, or worse. Don’t say I didn’t warn you! However, other methods for profiling the stack are less intrusive, and are already very robust: I’m talking here about Dtrace and variants.

I have held off from getting too excited by Dtrace in the past, even though the technology always looked amazing. The reason for holding back was that it was a single-OS tool, and without anything comparable on the Linux platform I didn’t see it being useful to >75% of my customers. That’s changing now, though, with the production release of Systemtap (partial, at least) in RHEL 5.4 and there is even a similar tool for AIX named Probevue. So now I’m excited about this technology!

Before digging into Dtrace and friends, I think it’s important that we take a little diversionary refresh into what a stack actually is, and why it is important. For those with a good coding background, look away now, and rejoin the group for part 3b. Just to keep your interest level, though, would it be interesting to get a second-by-second view of where the time goes for a process, down to the following levels of detail?

  • Which line of the plan did I spend most time in?
  • How much time did I spend waiting for page faults?
  • Which wait events did I spend most time in?

That’s going to be part 3b… so please check back periodically!

Back to the “Stack Primer for DBAs”: Consider the following elementary piece of C code:


int i;

void my_function();

main() {



Don’t worry about the details of the language if you are unfamiliar with C. The only things to observe here are that we have one “function”, named ‘main()’, which is making a call to another function ‘my_function()’. The main() function is a special one – all standalone C programs have a main() function, which is the initial point of execution. A C program is just a set of compiled machine code instructions at execute time, just like any other computer program. These instructions occupy a chunk of memory, where each memory location is either a machine code instruction or a piece of data. When the code gets down to the call to my_function() in main(), the execution must jump (or branch) to the address that my_function()’s machine code instructions reside at. Before this can happen, an amount of context must be stored, namely:

  • The ‘return’ address, so that execution can resume in the main() function after my_function() is complete
  • Any ‘data’ passed to my_function() (the function arguments)

In C, this information is pushed onto the stack and then the execution branches to the entry point of my_function(). As a sidenote, the sharing of data and execution context in this way is a common way to exploit security holes by overwriting execution context with oversized arguments from the data portion. We’re not interested in all that stuff for this – we are only interested in observing the stack contents. If we took a stack dump while the program were in the my_function() function, it would look like this:


Pretty simple, huh? If there were arguments for main() and my_function(), they would also be shown in some way. Most stack trace facilities can only show primitive datatypes (such as integers) and all other more complex arguments (such as structs) are shown as pointers (addresses) to the actual data without decoding the actual content, as follows:

my_function(42, 0x12345678)

Some tools will print a stack trace (known as a backtrace, because it is unwound backwards from the current position in the stack) with some known pointers decoded into human readable form, such as simple struct datatypes. Anyway, let’s have a look at something a bit more complicated – an Oracle backtrace taken from an errorstack trace file:

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   7FFFB345A778 ? 000000000 ?
ksedst1()+98         call     skdstdst()           000000000 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedst()+34          call     ksedst1()            000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
dbkedDefDump()+2736  call     ksedst()             000000001 ? 000000001 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksedmp()+36          call     dbkedDefDump()       000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxfdmp()+1837      call     ksedmp()             000000001 ? 000000000 ?
                                                   7FFFB3456278 ? 000000001 ?
                                                   000000000 ? 000000000 ?
ksdxcb()+1782        call     ksdxfdmp()           7FFFB345BAE0 ? 000000011 ?
                                                   000000003 ? 7FFFB345BA40 ?
                                                   7FFFB345B9A0 ? 000000000 ?
sspuser()+112        call     ksdxcb()             000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
__restore_rt()       call     sspuser()            000000001 ? 000000011 ?
                                                   000000001 ? 000000001 ?
                                                   7FFFB345B9A0 ? 000000000 ?
semtimedop()+10      signal   __restore_rt()       000018000 ? 7FFFB345C730 ?
                                                   000000001 ?
                                                   FFFFFFFFFFFFFFFF ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
sskgpwwait()+259     call     semtimedop()         000018000 ? 7FFFB345C730 ?
                                                   000000001 ? 7FFFB345C6D8 ?
                                                   FFFFFFFFFFD23940 ?
                                                   000000000 ?
skgpwwait()+151      call     sskgpwwait()         7FFFB345CB94 ? 00A9A15C0 ?
                                                   07848C4D8 ? 0002DC6C0 ?
                                                   7FFF00000000 ? 000000000 ?
ksliwat()+1816       call     skgpwwait()          7FFFB345CB94 ? 00A9A15C0 ?
                                                   000000000 ? 0002DC6C0 ?
                                                   000000000 ? 000000000 ?
kslwaitctx()+157     call     ksliwat()            078666348 ? 078666348 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
kslwait()+136        call     kslwaitctx()         7FFFB345CE30 ? 000000000 ?
                                                   005F5DFA7 ? 000000000 ?
                                                   100000000 ? 000000000 ?
psdwat()+107         call     kslwait()            005F5DFA7 ? 000000167 ?
                                                   000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pevm_icd_call_commo  call     psdwat()             005F5DFA7 ? 000000167 ?
n()+421                                            000000000 ? 005F5DFA7 ?
                                                   000000000 ? 000000000 ?
pfrinstr_ICAL()+164  call     pevm_icd_call_commo  7FFFB345E0F0 ? 000000000 ?
                              n()                  000000001 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun_no_tool()+63  call     pfrinstr_ICAL()      7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
pfrrun()+1025        call     pfrrun_no_tool()     7FAF7E8A3500 ? 06A03B9AA ?
                                                   7FAF7E8A3570 ? 000000004 ?
                                                   7FAF00000001 ? 000000000 ?
plsql_run()+769      call     pfrrun()             7FAF7E8A3500 ? 000000000 ?
                                                   7FAF7E8A3570 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 070C18646 ?
peicnt()+296         call     plsql_run()          7FAF7E8A3500 ? 000000001 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF00000001 ? 000000000 ?
kkxexe()+520         call     peicnt()             7FFFB345E0F0 ? 7FAF7E8A3500 ?
                                                   7FAF7E8C4028 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiexe()+14796       call     kkxexe()             7FAF7E8A5128 ? 7FAF7E8A3500 ?
                                                   000000000 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
kpoal8()+2283        call     opiexe()             000000049 ? 000000003 ?
                                                   7FFFB345F678 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 000000000 ?
opiodr()+1149        call     kpoal8()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 7FFFB345E0F0 ?
                                                   7FAF7E8C1FD0 ? 5E00000001 ?
ttcpip()+1251        call     opiodr()             00000005E ? 00000001C ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   008C5E7F0 ? 5E00000001 ?
opitsk()+1628        call     ttcpip()             00A9B0890 ? 0086BA768 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiino()+953         call     opitsk()             00A9B0890 ? 000000000 ?
                                                   7FFFB3462750 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opiodr()+1149        call     opiino()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   7FFFB34621B0 ? 7FFFB3462954 ?
opidrv()+565         call     opiodr()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
sou2o()+98           call     opidrv()             00000003C ? 000000004 ?
                                                   7FFFB3463E48 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
opimai_real()+128    call     sou2o()              7FFFB3463E20 ? 00000003C ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
ssthrdmain()+209     call     opimai_real()        000000002 ? 7FFFB3464010 ?
                                                   000000004 ? 7FFFB3463E48 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
main()+196           call     ssthrdmain()         000000002 ? 7FFFB3464010 ?
                                                   000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
__libc_start_main()  call     main()               000000002 ? 7FFFB34641B8 ?
+253                                               000000001 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?
_start()+36          call     __libc_start_main()  0009D3D74 ? 000000002 ?
                                                   7FFFB34641A8 ? 000000000 ?
                                                   008C5E2A0 ? 7FFFB3462954 ?

I know, it’s a bit much to deal with at first glance. But take it a piece at a time, reading in conjunction with Metalink note 175982.1, and it all begins to make sense. Actually, I’m guessing some of these, because the Metalink note doesn’t give all the answers, but it seems to make sense.

Let’s start at the bottom, which is the beginning of the execution of the C program named ‘oracle’. The first two lines are the GLIBC execution wrappers that eventually call Oracle’s main() statement which starts at line 110 of the listing. The next several lines above are the initialisation of Oracle and the (O)racle (P)rogramatic (I)nterface, which is the internal API used in Oracle. TTCPIP is the Two-Task-Communication interface with my SQL*Plus process, but the first interesting line is at line 74, kkxexe(), which is the (k)ernel (k)ompile e(x)ecute) (exe)cute statement call, which is the start of the actual SQL processing. In this case it is actually a simple PL/SQL call to DBMS_LOCK.SLEEP(100000), and the entry to the Oracle Wait interface can be seen at line 50 with the kslwait() call, eventually ending up with a semtimedop() call, which is the way the sleep call has been implemented. Incidentally, the preceding sskgpwwait() call at line 37 is Oracle’s platform-specific code which, in this case, is Red Hat Linux.  At line 11 we see the call to ksedst(), which probably stands for (k)ernel (s)ervices (e)rrorstack (d)ump (s)tack (t)race, which is the part of code actually producing the stack trace we are reading.

So what does all this actually mean in terms of seeing what our Oracle session is up to? Well, we can see what is called, by whom, and how long each call took. We don’t need to know all the function names and what they are responsible for, but we can come up with a selection of interesting ones that we would like to pay attention to. Thinking in terms of SQL execution we can see, for example, a nested loop rowsource function calling out to an index fetch rowsource function, which in turn calls out to a table fetch function.

Even just a few consecutive stack traces of a running query will make it very plain which rowsource Oracle is spending the most time within, simply because of the laws of probability. If you look once a second for a minute, and every stack you see has the ‘index fetch’ rowsource as the current function, then you probably spent most of the last minute doing that kind of processing. But we can do better than that… we can start to look at quantifying where the time is spent, and also to supplement that information with other interesting aspects of the operating system operation, aspects which are somewhat  concealed from Oracle and yet can detrimentally affect its performance.

Join me back for part 3b for the conclusion of this series of articles!

I’m now an Oracle ACE Director

Just a quick post in celebration of my recent appointment as an Oracle ACE Director! I’m very happy to be invited to this group, and look forward to getting involved in some of the discussions.

The Oracle Wait Interface Is Useless – News About Part Two

As many have probably noticed, the blogging experiment with Tanel hasn’t been quite as productive as it might optimally have been. Tanel doesn’t currently have the time to write part two of this article, so I’ll I’m going to pick up the baton and write the next part. Watch this space!

Dropping interval partitions

One of the nice Oracle11g new features is interval partitioning which is an extension to range partitioning. The advantage of interval partitioning over range partitioning is that new partitions are created automatically when new rows are inserted which don’t belong in an existing partition. The question, however, is how to get rid of old partitions? […]

Partway Researched With A Chance Of FUD

I tend to keep the content of this blog fairly technical and engineering focused, but every now and then I have to venture off and do an editorial post.  Recently some of the ParAccel management decided to fire up the FUD machine on the ParAccel blog and take aim at Oracle’s Exadata making the following claims: “There are 12 SAS disks in the storage server with a speed of about 75 MB/s [The SUN Oracle Exadata Storage Server datasheet claims 125 MB/s but we think that is far-fetched.]” -Rick Glick, Vice President of Technology and Architecture (link) “We stand by the 75MB/sec as a conservative, reliable number. We see higher numbers in disk tests, but never anywhere near 125MB/sec.” -Barry Zane, Chief Technology Officer (link) Far Fetched Or Fact? As a database performance engineer, I strive to be extremely detailed and well researched with my work. Clearly, these comments from Rick and Barry were not well researched as is evident from information publicly available on the Internet. The first bit of documentation I would research before making such comments would be the hard disk drive specification sheet. The 12 drives in the Exadata Storage Server, a Sun Fire X4275, are [...]

2009 Year-End Zeitgeist

Another year in the books and another year on the Structured Data blog.  Hopefully 2009 treated you well and 2010 will bring good things in addition.  I thought I’d throw a few Top 5 lists together to reminisce about 2009.  Enjoy! Top 5 Most Visited Blog Posts of 2009 DBMS_STATS, METHOD_OPT and FOR ALL INDEXED COLUMNS Choosing An Optimal Stats Gathering Strategy Top 10 Oracle 11gR2 New Features Troubleshooting Bad Execution Plans Oracle 11g: Real-Time SQL Monitoring Using DBMS_SQLTUNE.REPORT_SQL_MONITOR Top 5 Most Popular Search Queries of 2009 structured data oracle 11gr2 new features db_file_multiblock_read_count oracle analytic functions dbms_stats method_opt