Cool Stuff

COUNT STOPKEY operation (the WHERE ROWNUM <= N predicate) can silently return wrong results

I was running tests on some very large tables on an Exadata cluster and found an interesting bug.

Instead of having to query 4-billion-row tables, I have reproduced this case with a Cartesian join…

Check this. I’m generating 8 billion rows using a 3-way Cartesian join of a set of 2000 rows. So, this results in 2000 * 2000 * 2000 rows, which is 8 billion rows.

SQL>  with sq as (select null from dual connect by level <= 2000)
      select count(*)
      from sq a, sq b, sq c;

COUNT(*)
----------
8000000000

Everything worked well as you see. All 8 billion rows were nicely counted. Let’s modify this query a bit, by adding a WHERE rownum <= 8000000000 predicate, which shouldn’t change the outcome of my query, as 8 billion rows is exactly what I have:

SQL> with sq as (select null from dual connect by level <= 2000)
     select count(*)
     from sq a, sq b, sq c
     where rownum <= 8000000000;

COUNT(*)
----------
4294967295

Ding! (as I’m unable to come up with a better word to present this problem :)

Seems like we have a problem! The COUNT operation thinks that we have only a bit over 4 billion rows returned from the query plan branches below it.

We clearly have an (easily reproducible) bug here!

What happens here is that whenever you use a ROWNUM < N operator in a query predicate like I have done, you introduce an additional step into the query plan (COUNT STOPKEY).

Check this simple query:

SELECT * FROM t WHERE ROWNUM <= 1000

Plan hash value: 508354683

--------------------------------------------
| Id  | Operation          | Name | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |        |
|*  1 |  COUNT STOPKEY     |      |        |
|   2 |   TABLE ACCESS FULL| T    |     41 |
--------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=1000)

As the above plan explains, the “WHERE rownum <” predicate is applied in the execution plan line (row source) 1 – COUNT STOPKEY. You won’t see this line when you are just counting rows without any “ROWNUM <” predicate.

How this works is that the COUNT STOPKEY operation knows how many rows you want and will just keep calling its child function under it in the execution plan tree to get more and more rows, until the required number of rows has been returned. Then the COUNT STOPKEY function just stops calling its child row sources and declares the end-of-data condition.
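
To illustrate the mechanics, here’s a rough PL/SQL model of a stopkey-style fetch loop (just an illustration of the behaviour described above, not Oracle’s actual kernel code):

declare
  -- stand-in for the child row source under COUNT STOPKEY
  cursor child is select level from dual connect by level <= 1000000;
  n       pls_integer;
  stopkey constant pls_integer := 1000;  -- the ROWNUM <= N limit
  fetched pls_integer := 0;              -- the row counter (4 bytes in the buggy kernel code)
begin
  open child;
  loop
    fetch child into n;
    exit when child%notfound;        -- child ran out of rows: genuine end-of-data
    fetched := fetched + 1;
    exit when fetched >= stopkey;    -- stopkey reached: stop calling the child
  end loop;
  close child;
  dbms_output.put_line('rows passed up: ' || fetched);
end;
/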

And here’s the catch – due to a bug, the variable used to keep track of the number of rows processed by COUNT STOPKEY is only 4 bytes, even on 64-bit platforms. So, it can hold values only up to 4294967295 (the count returned above), which is 2^32 – 1. That wouldn’t be so much of a problem in practical real-world applications, but what I don’t like is that the operation silently returns wrong data – it just stops fetching more rows, even though we haven’t reached the ROWNUM <= 8000000000 limit yet and there are still rows to fetch from the child row source, but COUNT STOPKEY declares the end-of-data condition and returns ~4 billion as the result.
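
The returned number itself confirms the 32-bit theory – it is exactly the largest value an unsigned 4-byte integer can hold:

SQL> select power(2, 32) - 1 as max_unsigned_32bit from dual;

MAX_UNSIGNED_32BIT
------------------
        4294967295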

This is (obviously) a bug, and now there’s a bug number for it as well (thanks to Greg Rahn for letting me know) – Bug# 10214991 (unfortunately it’s not public in MOS).

Now, there’s a reason why this bug has gone unnoticed for so long, despite 4+ billion-row tables having existed for a long time (I first worked with such an Oracle database in the year 2000 – on Windows NT :).

  1. A real-life business query with a construct of WHERE ROWNUM <= X makes sense only when the data is ordered by some meaningful business attribute (a column). Otherwise you’ll get the query results in quite random order. Note that I’m talking about real, business queries here.
  2. The only right way to order data in SQL is via an ORDER BY clause. There is no other right way to do it, even though some hacks sometimes work (and will stop working after the next upgrade)
  3. Nobody (or hardly anybody) has written queries like: give me the sum of the 5 billion biggest purchases ordered by selling price or give me the sum of the last 5 billion purchases up to this date. If you replace the billion with just ten, a hundred, or a thousand, then hell yeah, such types of queries are being executed every day (or should I say every minute, every second).
    Whether using ROWNUM in your business queries (instead of actual date range or purchase price in $$$) is a good or bad design is a completely different question – the fact is that ROWNUM is used in such SQLs.

Now we’ll just need to wait until the data volumes (and reports) get so big that someone actually hits this problem in real life!

So, better patch your database before the end of this decade! :-)


Announcing the first E2SN Virtual Conference with Jonathan Lewis, Cary Millsap, Kerry Osborne and me – 18-19 Nov!

Yes, it’s official. I’m organizing a virtual conference with some of THE top speakers in the world. The topic is Systematic Oracle SQL Optimization (in the real world).

The dates are 18-19 November; the conference lasts for 4 hours on both days, so you’ll still be able to get some work done as well (and immediately apply the knowledge acquired!).

Well, none of the speakers need an introduction, but just in case you’ve lived in space for the last 20 years, here are the links to their blogs :)

I can tell you, (at least the first 3) people in the above list ROCK!!!

And all of them are OakTable members too :)

This conference will have 4 x 1.5 hour sessions, each delivered by a separate speaker. We aim to systematically cover the path of:

  1. Finding out where the performance problem is (and which SQLs cause it)
  2. Finding out what the problem SQL’s execution plan is doing and which part of it is slow
  3. How to write and fix your code so that the optimizer wouldn’t hate your SQL
  4. How to fix the SQL execution plan performance problem when you can’t touch the application code!

And as this is the first (pilot) virtual conference, the price is awesome, especially if you get the early bird rate by signing up before 1 November!

So, check out the abstracts, details, agenda and sign up here!

http://tech.e2sn.com/virtual-conferences

P.S. I expect this event to be awesome!


A: The most fundamental difference between hash and nested loop joins

Ok guys, thanks for waiting!

I ended up expanding the article quite a lot compared to what I had originally planned. In fact I only wrote 50% of what I plan to write, I’ll update the rest… um… later… Instead of just stating the difference between the joins, I took a step back and elaborated on something I often see people doing (and talking about in newsgroups and lists too).

Basically the most fundamental (or biggest or most important) difference between nested loop and hash joins is that:

  • Hash joins cannot look up rows from the inner (probed) row source based on values retrieved from the outer (driving) row source; nested loops can.

In other words, when joining tables A and B (A is the driving table, B is the probed table), a nested loops join can take the 1st row from A and perform a lookup into B using that value (of the column(s) you join by). Then the nested loop takes the next row from A and performs another lookup into table B using the new value. And so on and so on and so on.

This opens up additional access paths to table B. For example, when joining ORDERS and ORDER_ITEMS by ORDER_ID (and ORDER_ID is the leading column of the PK in the ORDER_ITEMS table), then for whatever orders are taken from the ORDERS table, we can perform a focused, narrow index range scan on ORDER_ITEMS for every ORDER_ID retrieved from the driving ORDERS table. A hash join can’t do that.
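
For illustration, a query shaped roughly like this lets a nested loops join probe the ORDER_ITEMS PK index once for each driving row (the ORDER_DATE, LINE_NO and AMOUNT columns are made up for this sketch, and the hints merely force the plan shape):

select /*+ leading(o) use_nl(oi) */
       o.order_id, oi.line_no, oi.amount
from   orders o, order_items oi
where  oi.order_id = o.order_id      -- the lookup value comes from each driving row
and    o.order_date >= date '2010-01-01';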

Of course this doesn’t mean that hash joins can’t use any indexes for the tables they read – index range scans and unique lookups can still be used under a hash join, but only if there are constant values in the query text (in the form of literals or bind variables). If there are no such constant (filter) conditions under a hash join, then the other options for using that index would be an INDEX FULL SCAN (which is a range scan from one end of the index to the other) or an INDEX FAST FULL SCAN (which is like a full table scan through the entire index segment). However, none of these options gives the same benefit as nested loops looking up rows from row source B dynamically, based on what was retrieved from A at runtime.
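
To make that concrete, here’s a sketch where the index on ORDER_ITEMS(ORDER_ID) is still usable under a hash join, but only because the query itself supplies a constant filter on the indexed column (the :low/:high binds are made up for this illustration):

select /*+ use_hash(oi) */
       o.order_id, oi.line_no
from   orders o, order_items oi
where  oi.order_id = o.order_id
and    oi.order_id between :low and :high;   -- constant (bind) filter, independent of the rows from ORDERS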

Note that this nested loops benefit isn’t limited to indexes on table B; the difference is more fundamental than just a specific access path. For example, if table B happens to be a single-table hash cluster or an indexed X$ table, then the nested loop is also able to do “optimized” lookups from these row sources, based on the values retrieved from table A.

So, my article with a lot of (loosely) related details is here:

In the comments section of my question, Tom, Bernard Polarski, Christian Antognini and Marc Musette got the closest to what I had in my mind when I asked the question. However, of course your mileage may vary somewhat depending on what kind of problems you have experienced the most over all the years. Also, Jonathan Lewis had a valid comment that the answer depends on what exactly one means by “fundamental”, and yeah, this was open to interpretation.

Nevertheless, I wanted to emphasize that there’s a more important difference between NL and hash joins than the usual stuff you see in training materials, which talks about implementation details like hash tables and memory allocation…

Some day I will complete that article; I plan to add some design advice in there, like denormalization opportunities for getting the best of both worlds etc. But now I’m gonna get a beer instead.

Thanks for reading and answering my blog, I was quite impressed by the volume of comments & answers to my question. I must do this more often!


A million kilometers in two years…

I’ve been tracking my business travel with Tripit.com‘s awesome service for about 2 years now.

After getting back from my Tallinn->Helsinki->New York->Detroit->New York->San Francisco->New York->Helsinki->Tallinn trip yesterday, Tripit reported that I have flown 1 007 509 km during my business trips (American readers, that’s about 42 miles ;)

Check yourself below :)

Tripit says I’ve visited 71 different cities in 27 countries within the last two years.

Here’s the map of places where I’ve visited my clients, done training or spoken at conferences:

Actually, there are probably a couple more cities I’ve been to in the last two years; for some reason Tripit doesn’t recognize my trip to Melbourne (but it does show the visit to Sydney, which I did during the same trip).

Anyway, the conclusion here is that I think I’ve done enough flying for now. Now I plan to stay at home for a loooong time (I mean at least 3-4, maybe even 5 weeks in a row!!! ;)

But seriously, what I’ve decided is that:

  1. I won’t do any more public on-site seminars (with only a few exceptions).
  2. I will move all my public seminar offering to web-based online seminars (using Citrix gotomeeting.com’s service), which I’ll deliver in person.
  3. I will still do private corporate training on-site occasionally, which offers the flexibility of customizing the content and focus areas of the seminar to match the customer’s needs.
  4. I will also offer private corporate online training, which gives much greater flexibility for choosing the seminar duration and times etc (it’s possible to spread a seminar into 1-day sections, each delivered in a separate week, to reduce the impact of people being away from their work).
  5. I will still do consulting & advanced troubleshooting, where I usually solve even the most complex problems in a matter of a couple of days (like explained here, for example).

Ok, enough of self-promotion and advertising, back to work ;-)

P.S. I will publish my online seminar schedule “very soon now”!!!

P.P.S. I’m not affiliated with Tripit.com business-wise, but if you travel frequently, then I recommend you check out their awesome service (and iPhone app). The basic version is free, but I just decided to upgrade to Pro after a couple of years of using it!


Oracle Closed World presentation links

Thanks to everybody who attended my OCW hacking session!

Sorry to the guys who attended via webinar – I’ll do the session again in a few weeks, hopefully with audio from end to end! And I will get someone to assist me with monitoring the transmission quality and attendee questions etc.

Note that this stuff is mostly for hacking and fun – don’t use the undocumented stuff in production!

The links are below:

Download scripts & Tools:

Rlwrap links:

Diagnostic events:

X$TRACE
I haven’t written any articles on X$TRACE yet, but you can find some stuff from one of my very old presentations:

Or just type:

ALTER TRACING ENABLE "10704:4:ALL"

Where 10704 is the KST event number, 4 is the level and ALL means all Oracle PIDs (Oracle PIDs, not OS PIDs).
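
Since ALL is just the PID specifier, you should be able to target a single Oracle PID instead, and there is a matching DISABLE command – I’m writing both from memory here, so treat the exact syntax as an assumption and verify on a test instance first:

ALTER TRACING ENABLE "10704:4:42"    -- trace Oracle PID 42 only (assumed, based on the ALL specifier above)
ALTER TRACING DISABLE "10704:4:ALL"  -- switch the trace off again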



Hot stuff! Oracle Closed World Secret presentation webcast!

After hours of careful planning (as you’ll see from the images) I decided to webcast tomorrow’s Oracle Closed World hacking session…

This session isn’t necessarily going to be useful, but it definitely should be fun!!!

You can register here:

http://tech.e2sn.com/secret


Public appearances 2010

Here’s the list of events where I’ll speak this year:

Michigan OakTable Symposium 2010
Ann Arbor, MI
16-17 September 2010

Considering the concentration of OakTable members there, this will be an awesome event!

I will be delivering my “Back to Basics: Choosing The Entry Point to Performance Troubleshooting Wisely” and “Understanding LGWR, log file sync waits and commit performance” sessions there.

Promo video:
http://www.oaktable.net/media/michigan-oaktable-symposium-2010-promo

Agenda & Registration:
http://michigan.oaktable.net/

Oracle Open Closed World
San Francisco, CA
19-22 September 2010

Note that I won’t be speaking at the official Oracle Open World conference, but I will be speaking at a secret underground event there, about some really fun stuff like deep internals, hacking, kernel tracing and of course advanced troubleshooting ;-) And the rest of the time I’ll be in some bar.

NYOUG Fall 2010 Training Session
Manhattan, NYC, NY
16 November 2010

This is a full day seminar organized by NYOUG. I will be delivering my “Scripts and Tools for Oracle Troubleshooting and Advanced Performance Analysis” session there. It’s an updated version of the material I delivered at the Hotsos Symposium Training Day this year.

Agenda & Registration:
http://www.nyoug.org/upcoming_events.htm#NYOUG_Training_Days

UKOUG Tech & EBS Conference (to be confirmed)
Birmingham, UK
29 November – 1 December 2010

I submitted four papers to the UKOUG Tech&EBS conference, so if all goes well, I’ll be there at the end of Nov/beginning of Dec too.

http://techandebs.ukoug.org/



The full power of Oracle’s diagnostic events, part 2: ORADEBUG DOC and 11g improvements

I haven’t written any blog entries for a while, so here’s a very sweet treat for low-level Oracle troubleshooters and internals geeks out there :)

Over a year ago I wrote that Oracle 11g has a completely new low-level kernel diagnostics & tracing infrastructure built into it. I wanted to write a longer article about it with comprehensive examples and use cases, but by now I realize I won’t ever have time for this, so I’ll just point you in the right direction :)

Basically, since 11g, you can use SQL_Trace, undocumented kernel traces, various dumps and other actions with much better granularity than before.

For example, you can enable SQL_Trace for a specific SQL_ID only:

SQL> alter session set events 'sql_trace[SQL: 32cqz71gd8wy3] {pgadep: exactdepth 0} {callstack: fname opiexe}
plan_stat=all_executions,wait=true,bind=true';


Session altered.

Actually, I have done more in the above example: I have also said to trace only when the PGA depth (the dep= in the tracefile) is zero. This means tracing only top-level calls, issued directly by the client application and not recursively by some PL/SQL or by the dictionary cache layer. Additionally, I have added a check for whether we are currently servicing the opiexe function (whether the current call stack contains opiexe as a (grand)parent function) – this allows tracing & dumping only in specific cases of interest!
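
And, fitting the title of this post, the 11g framework is self-documenting: you can browse the available events, their filters and actions with ORADEBUG DOC as SYSDBA (the output is lengthy, so I won’t reproduce it here):

SQL> oradebug doc
SQL> oradebug doc event
SQL> oradebug doc event name sql_trace
SQL> oradebug doc component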

Advanced Oracle Troubleshooting Session - PGA/UGA memory fragmentation

A troubleshooting session at one of my clients started with the following description of the problem:

In some specific database environments, sometimes (but not always), a particular batch process took significantly longer than expected - which in this case meant hours instead of a few minutes.

It could also be observed that, when monitoring the process in that state, particular SQL statements showed up as the most prominent ones.

So the first thing was to be able to reproduce the issue at will, which had not been possible so far.

It was quite quickly possible to narrow the problem down to a single call to a complex PL/SQL stored procedure (that was part of a package and called a lot of other stored procedures / packages) that dealt with handling of LOBs representing XML data.

It turned out that by simply calling this stored procedure multiple times in a loop the issue could be reproduced at will and also in different database environments as initially reported.

The rather unique aspect of that stored procedure was that it got executed in an "exclusive" mode, which means that it ran on a quite large IBM pSeries server (AIX 5.3, 11.1.0.7, 240 GB RAM, numerous P6 multi-cores, expensive storage) with no other business processes active at the same time. It was mostly a serial execution, and when it was monitored, it spent most of its time in two SQL statements that were very simple and straightforward. They were actually static SQLs embedded in PL/SQL that queried a single, very small table using a unique index access plus a table access by ROWID. So we had two SQL statements that under normal circumstances could generate at most two logical I/Os per execution, since the index/table queried was so small and therefore was completely held in the buffer cache.

When monitoring the executions of these SQLs while running the above mentioned loop, it became obvious that due to the design of the logic these SQL statements were called numerous times, actually thousands of times (depending on the content of the XML LOB), per procedure call.

It also became obvious that, because it was static SQL, the PL/SQL cursor cache (and the database buffer cache, too) did a very good job - meaning that the SQLs were parsed only once and from then on only executed, generating only logical I/O and no physical I/O.

Furthermore, due to the logic implemented, most of the time these SQLs actually didn't find the value looked up in the small table, which meant that on average each execution required only a single logical I/O - the index unique scan - and no further activity, as the value could not be found in the index.

When checking the SQL execution statistics in different environments (Linux instead of AIX) it also became obvious that the same SQL statements were executed a similar number of times, but these environments didn't experience the same long runtime of the batch job, so the sheer number of SQL executions looked odd but didn't seem to be the root cause of the problem.

So we ended up with the following initial summary:

- We identified a stored procedure that deals with LOBs representing XML as the culprit
- Due to the logic most of the time was spent in two static SQLs that got executed thousands of times per execution of the stored procedure
- However these SQLs used the "optimal" execution plan and actually generated only a single logical I/O per execution
- Furthermore due to the "exclusive" manner of the batch job no contention with other processes was possible
- So there are no usual suspects like "inefficient execution plans", "contention for resources", "serialization" etc.
- Therefore the usual performance monitoring methods like wait interface, extended SQL trace, active session history, session statistics / system statistics, ASH / AWR / ADDM reports didn't reveal any obvious culprit
- Executing the SQL statements identified in a separate process showed that they took on average 10 microseconds (!) per execution - which is quite fast and raised the question of how it was possible that these statements showed up as the problematic ones - executed in a simple loop, tens of thousands of executions were possible within a single second.

So we were at a point where we basically had the classic "The Wait Interface Is Useless (Sometimes)" situation. All we could see was that most of the time was spent executing these SQLs, and that this time was spent on CPU by the single process executing these statements.

But wait, there was an interesting point showing up when monitoring the session statistics delta (e.g. using Tanel Poder's "snapper" utility): The process required more and more PGA/UGA memory while executing.

So it was time for advanced troubleshooting techniques - and for instance OakTable fellow Tanel Poder has done a lot of research in this area and provides numerous helpful tools.

Further checks revealed a few anomalies that allowed us to get an idea in which direction to investigate further:

- The PGA/UGA consumption of the process was increasing with every iteration of the loop executing the PL/SQL stored procedure. So there was a memory problem hidden somewhere. A good idea therefore seemed to be to take a PGA/UGA heapdump of the process to see what kind of memory chunks could be identified.

Caution: Taking heapdumps may crash your system or process, therefore be very cautious when doing so in a production-like environment. Taking PGA heapdumps usually only affects a single process and is therefore not that critical (unless it was a background process for instance...).

This is how you can take a heapdump using EVENTS:

ALTER SESSION SET EVENTS 'immediate trace name heapdump level <level>';

This is how you can request a heapdump using ORADEBUG:
As SYSDBA, after identifying the process using SETOSPID, SETORAPID or SETMYPID:

ORADEBUG DUMP HEAPDUMP <level>

See also Julian Dyke's website for an overview of the available heapdump levels.

In this case a level 4 + 1 = level 5 heapdump (PGA + UGA summary) seemed to be sufficient (although the current/user call heap might also be of interest, which is level 8 + 16 + 4 + 1 = 29).
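
So, for the PGA + UGA summary dump used here, the concrete commands are:

ALTER SESSION SET EVENTS 'immediate trace name heapdump level 5';

or, via ORADEBUG:

ORADEBUG DUMP HEAPDUMP 5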

In chapter 8 of the OakTable book Expert Oracle Practices, Charles and I also describe the most important heapdump variants and the other performance monitoring methods applied here.

The trace file written can then be analyzed, for example, using Tanel's heapdump_analyzer shell script to get an initial overview - in this case, since it was quite a huge trace file, I simply used a slightly modified version of the awk program used by heapdump_analyzer to generate an input file for the Oracle external table feature, in order to run more sophisticated reports on the file.

This is the Unix command that can be used to transform the raw trace into a suitable input for an Oracle external table:

cat $TRACE_FILE | awk '
# heap header: remember the heap name (PGA, UGA etc.) and start printing chunks
/^HEAP DUMP heap name=/ { split($0,ht,"\""); HTYPE=ht[2]; doPrintOut = 1; }
# one line per memory chunk: emit chunk size, heap type, chunk type and allocation reason as CSV
/Chunk/{ if ( doPrintOut == 1 ) {
split($0,sf,"\"");
printf "%10d , %16s, %16s, %16s\n", $4, HTYPE, $5, sf[2];
}
}
# heap trailer: stop printing until the next heap header
/Total heap size/ {
doPrintOut=0;
}
' > $EXT_TAB_DIR/heapdump.txt

This is what the external table definition could look like:

create table heapdump
(
chunk_size integer
, heap_type varchar2(16)
, chunk_type varchar2(16)
, alloc_reason varchar2(16)
)
organization external
(
type oracle_loader
default directory ext_tab_dir
access parameters
(
records delimited by newline
fields terminated by ','
lrtrim
)
location ('heapdump.txt')
);
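
With the external table in place, running ad-hoc reports on the dump becomes straightforward - for example, grouping the chunks by allocation reason to see which allocations dominate (just one possible report):

select
alloc_reason
, count(*) as num_chunks
, sum(chunk_size) as total_bytes
from
heapdump
group by
alloc_reason
order by
total_bytes desc;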

- The process "leaked" temporary LOBs - in V$TEMPORARY_LOBS an ever-increasing number of LOBs could be observed; however, the increase was slow (one per execution of the stored procedure) and the space consumption in the temporary tablespace was low

- When calculating the average execution time of the two SQL statements getting executed many times, it became obvious that these actually started as quickly as observed in the separate execution but then continuously slowed down - they became slower and slower over time

- After a couple of minutes, these statements that initially took a few microseconds already took milliseconds (!) to execute - it was no wonder that the batch job took hours, since it still had to execute these statements numerous times, but now each execution took more than a thousand times longer than at the beginning

- Looking closer, it became obvious that every SQL statement that got executed was affected by that slowdown, but since many of these statements took several milliseconds anyway and were not executed that many times, an "overhead" of a few milliseconds didn't really make a significant difference. It was those statements that were executed that often that incurred the biggest penalty

- Eventually, when the loop executing the PL/SQL stored procedure completed, it became obvious that basically everything done within that session was affected: for example, a simple SELECT * FROM DUAL or a null PL/SQL block like BEGIN NULL; END; took something like 30 centiseconds (0.3 secs!) to complete - doing the same in a fresh session (while the other session was still open) completed instantly (in less than the 0.01 seconds measurable by the SQL*Plus timing facility) - so it was something specific to the session and not an instance-wide effect.

So there was something that obviously took more and more CPU time that added an overhead to basically "every" operation, no matter what was performed.

Time to go one level deeper and take traces on the O/S level to find out where a process was spending most of its time. Unfortunately this was AIX 5.3, so the range of available tools to perform such a measurement was quite limited. There is no DTrace on AIX, and the new probevue tool is only available from AIX 6 on. So we were left with Oracle's built-in oradebug short_stack and AIX's procstack tool to take samples of the call stack of the foreground process.

But Tanel is there again to the rescue - using his "OStackProf" tool set we were able to take samples of the call stack, only to find out that the process seemed to spend most of its time in memory management calls - though that was only an assumption, since no one could tell us what these sampled function names exactly meant. According to MOS document 175982.1 they were related to memory management of ADTs (Oracle objects).

This is what such a stack trace sample looked like:

Below is the stack prefix common to all samples:
------------------------------------------------------------------------
Frame->function()
------------------------------------------------------------------------
# 36 ->main()
# 35 ->opimai_real()b0
# 34 ->sou2o()
# 33 ->opidrv()
# 32 ->opiodr()b98
# 31 ->opiino()f0
# 30 ->opitsk()
# 29 ->ttcpip()c
# 28 ->opiodr()b98
# 27 ->kpoal8()c
# 26 ->opiexe()
# 25 ->kkxexe()c
# 24 ->peicnt()
# 23 ->plsql_run()
# 22 ->pfrrun()
# 21 ->pfrrun_no_tool()c
# 20 ->pfrinstr_EXECC()c
# 19 ->pevm_EXECC()e4
# 18 ->psdnal()c
# 17 ->psddr0()
# 16 ->rpidrv()
# ...(see call profile below)
#
# -#--------------------------------------------------------------------
# - Num.Samples -> in call stack()
# ----------------------------------------------------------------------
76 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()e4->auddft()b8->audbeg()->ktcxbr0()a0->kocbeg()->koctxbg()->kohalc()->kohalmc()->kghfru()c->44c0->->
20 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()b8c->kksfbc()ec->ktcsna()c->ktucloGetGlobalMinScn()->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()->ksuvrl()c->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()b98->opipls()->opiexe()->44c0->->
1 ->rpiswu2()c0->rpidru()->skgmstack()c4->rpidrus()c8->opiodr()->ksuprc()c->44c0->->

The "auddft"/"audbeg" function names might indicate something related to auditing, but the database didn't have any auditing enabled, apart from the "audit_trail = DB" setting. Alex Fatkulin recently reported an issue with auditing enabled slowing down 11.2.0.1 and indeed when setting "audit_trail = NONE" in 11.1.07 or 11.2.0.1 the slow-down experienced was less which indicates that there are actually some code paths related to auditing involved, however it didn't solve the issue - there was still a significant slow-down observed.

The crucial hint came from the analysis of the PGA/UGA heapdump:
The PGA/UGA heapdump showed hundreds of thousands of very small chunks that seemed to relate to XML handling ("qmxdpls_subhea") - hence the obvious assumption was that something related to XML object memory management was going wrong.

Therefore we focused on the code parts that performed the XML processing - and very quickly a crucial bug was identified: The DBMS_XMLDOM.FREEDOCUMENT call was missing from the code.

Apparently this bug had been in the code right from the beginning, but the application was initially developed on 10.2, which interestingly does not show the same slow-down. It shows the same symptoms, however, like increasing PGA/UGA memory consumption and leaked temporary LOBs, but it doesn't experience the slow-down. Now that the database had been upgraded to 11.1.0.7 some time ago, the slow-down problems obviously started to surface.

So by simply adding or removing the freedocument call from the code, the issue could be reproduced / fixed.

This enabled us to come up with a generic testcase that allows reproducing the issue at will in most environments running 11.1.0.7 or later.

Note that the effect seems to depend on the port of Oracle - Oracle 11.1.0.7 on Linux x64 showed slightly different behaviour: only some SQL statements were affected by the slow-down, not every statement, as seems to be the case with the AIX port.

drop table t_testloop;

purge table t_testloop;

create table t_testloop (
id integer not null
, vc varchar2(255)
, constraint t_testloop_pk primary key (id)
)
;

insert into
t_testloop
(
id
, vc
)
select
level as id
, rpad('x', 100, 'x') as vc
from
dual
connect by
level <= 100;

commit;

exec dbms_stats.gather_table_stats(null, 'T_TESTLOOP')

-- This is supposed to be an INDEX UNIQUE SCAN + TABLE ACCESS BY ROWID
explain plan for
select
id
, vc
from
t_testloop
where
id = to_number(:x);

set linesize 160
set pagesize 999

select * from table(dbms_xplan.display);

set timing on echo on serveroutput on

-- This is the normal (reference) execution time for running
-- the simple statement a thousand times
declare
procedure check_key
as
x integer;
n_id integer;
s_vc varchar2(255);
begin
x := 42 * 3;
select
id
, vc
into
n_id
, s_vc
from
t_testloop
where
id = x;
exception
when NO_DATA_FOUND then
null;
end;
begin
for i in 1..1000 loop
check_key;
end loop;
end;
/

-- "Deterministic" randomness :-))
exec dbms_random.seed(0)

declare
start_time number;
end_time number;

-- Generate some CLOB containing XML
-- Note that it looks like the CLOB needs
-- to be different for every iteration
-- otherwise the issue couldn't be reproduced
function return_clob return clob
as
the_lob clob;
s_name varchar2(20);
n_sal integer;
s_job varchar2(20);
begin
-- note: the XML element names below are a reconstruction - the original
-- markup was swallowed by the blog's HTML rendering, so treat the tag
-- names (emplist/emp/name/sal/job) as illustrative
the_lob := '<?xml version="1.0"?>
<emplist>';
for i in 1..20 loop
s_name := dbms_random.string('U', trunc(dbms_random.value(1, 21)));
n_sal := trunc(dbms_random.value(1, 1001));
s_job := dbms_random.string('U', trunc(dbms_random.value(1, 21)));
the_lob := the_lob || '
<emp id="' || to_char(i, 'TM') || '">
<name>' || s_name || '</name>
<sal>' || to_char(n_sal, 'TM') || '</sal>
<job>' || s_job || '</job>
</emp>';
end loop;
the_lob := the_lob || '
</emplist>';

return the_lob;
end return_clob;

-- Some usage of the PL/SQL XML DOM API
-- Some dummy processing of the attributes of the given node
procedure process_attributes
(
in_node dbms_xmldom.DOMNode
)
is
len number;
n dbms_xmldom.DOMNode;
nnm dbms_xmldom.DOMNamedNodeMap;
key varchar2(1000);
val varchar2(32767);
BEGIN
nnm := dbms_xmldom.getAttributes(in_node);

if (dbms_xmldom.isNull(nnm) = FALSE) then
len := dbms_xmldom.getLength(nnm);

-- loop through attributes
for i in 0..len-1 loop
n := dbms_xmldom.item(nnm, i);
key := dbms_xmldom.getNodeName(n);
val := dbms_xmldom.getNodeValue(n);
end loop;
end if;

end process_attributes;

-- Some usage of the PL/SQL XML DOM API
-- Recursively walk the nodes of the DOM
-- and call the attribute processing per node
procedure walk_node
(
in_node dbms_xmldom.DOMNode
)
is
nl dbms_xmldom.DOMNodeList;
len number;
n dbms_xmldom.DOMNode;
node_name varchar2(100);
begin
-- loop through elements
node_name:=dbms_xmldom.getNodeName(in_node);

process_attributes(in_node);

nl := dbms_xmldom.getChildNodes(in_node);
len := dbms_xmldom.getLength(nl);
for i in 0..len-1 loop
n := dbms_xmldom.item(nl, i);
node_name := dbms_xmldom.getNodeName(n);
walk_node(n);
end loop;
end walk_node;

-- The main procedure
procedure process_xml_clob
as
the_lob clob;
var XMLType;
doc dbms_xmldom.DOMDocument;
root dbms_xmldom.DOMNode;
root_tag varchar2(100);
begin
-- Get the CLOB with the XML
the_lob := return_clob;

-- Instantiate an XMLTYPE
var := xmltype(the_lob);

-- Generate a new DOM document from the XMLType
-- This seems to allocate a temporary LOB under the covers
doc := dbms_xmldom.newDOMDocument(var);

-- Some rudimentary XML DOM processing
root := dbms_xmldom.makeNode(dbms_xmldom.getDocumentElement(doc));

root_tag := dbms_xmldom.getNodeName(root);

-- If you want to burn more CPU to exaggerate the effect
-- uncomment this
-- walk_node(root);

-- This omission here causes a significant PGA/UGA memory leak
-- and causes version 11.1 and 11.2 to slow down everything
-- in this session
-- Version 10.2 suffers from the same symptoms but doesn't slow down
--DBMS_XMLDOM.freeDocument(doc);
end;
begin
-- Run this a thousand times and measure / output the runtime per execution
for i in 1..1000 loop
start_time := dbms_utility.get_time;
process_xml_clob;
end_time := dbms_utility.get_time;
dbms_output.put_line('Run ' || to_char(i, 'TM') || ': Time (in seconds)= ' || ((end_time - start_time)/100));
end loop;
end;
/

-- Do the simple statement again a thousand times
-- Notice the difference in runtime when using 11.1.0.7 or 11.2.0.1
declare
procedure check_key
as
x integer;
n_id integer;
s_vc varchar2(255);
begin
x := 42 * 3;
select
id
, vc
into
n_id
, s_vc
from
t_testloop
where
id = x;
exception
when NO_DATA_FOUND then
null;
end;
begin
for i in 1..1000 loop
check_key;
end loop;
end;
/

The crucial line is the one above containing "DBMS_XMLDOM.FREEDOCUMENT". If you run this script without the FREEDOCUMENT call on a version that is affected by the slow-down, like 11.1.0.7 or 11.2.0.1, then a typical output might look like the following:

SQL> declare
2 procedure check_key
3 as
4 x integer;
5 n_id integer;
6 s_vc varchar2(255);
7 begin
8 x := 42 * 3;
9 select
10 id
11 , vc
12 into
13 n_id
14 , s_vc
15 from
16 t_testloop
17 where
18 id = x;
19 exception
20 when NO_DATA_FOUND then
21 null;
22 end;
23 begin
24 for i in 1..1000 loop
25 check_key;
26 end loop;
27 end;
28 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.94

.
.
.

Run 1: Time (in seconds)= .49
Run 2: Time (in seconds)= .08
Run 3: Time (in seconds)= .08
Run 4: Time (in seconds)= .08
Run 5: Time (in seconds)= .05
Run 6: Time (in seconds)= .03
Run 7: Time (in seconds)= .03
Run 8: Time (in seconds)= .03
Run 9: Time (in seconds)= .03
Run 10: Time (in seconds)= .02
Run 11: Time (in seconds)= .03
Run 12: Time (in seconds)= .03
Run 13: Time (in seconds)= .03
Run 14: Time (in seconds)= .03
Run 15: Time (in seconds)= .03
Run 16: Time (in seconds)= .03
Run 17: Time (in seconds)= .02
Run 18: Time (in seconds)= .03
Run 19: Time (in seconds)= .03
Run 20: Time (in seconds)= .03
Run 21: Time (in seconds)= .03
Run 22: Time (in seconds)= .03
Run 23: Time (in seconds)= .03
Run 24: Time (in seconds)= .03
Run 25: Time (in seconds)= .03
.
.
.
Run 287: Time (in seconds)= .03
Run 288: Time (in seconds)= .03
Run 289: Time (in seconds)= .03
Run 290: Time (in seconds)= .03
Run 291: Time (in seconds)= .02
Run 292: Time (in seconds)= .03
Run 293: Time (in seconds)= .03
Run 294: Time (in seconds)= .03
Run 295: Time (in seconds)= .03
Run 296: Time (in seconds)= .03
Run 297: Time (in seconds)= .03
Run 298: Time (in seconds)= .02
Run 299: Time (in seconds)= .03
Run 300: Time (in seconds)= .03
Run 301: Time (in seconds)= .03
Run 302: Time (in seconds)= .03
Run 303: Time (in seconds)= .17
Run 304: Time (in seconds)= .17
Run 305: Time (in seconds)= .17
Run 306: Time (in seconds)= .17
Run 307: Time (in seconds)= .17
Run 308: Time (in seconds)= .17
Run 309: Time (in seconds)= .17
Run 310: Time (in seconds)= .17
Run 311: Time (in seconds)= .18
Run 312: Time (in seconds)= .17
Run 313: Time (in seconds)= .18
Run 314: Time (in seconds)= .18
Run 315: Time (in seconds)= .18
Run 316: Time (in seconds)= .17
Run 317: Time (in seconds)= .19
Run 318: Time (in seconds)= .18
Run 319: Time (in seconds)= .18
Run 320: Time (in seconds)= .19
Run 321: Time (in seconds)= .18
Run 322: Time (in seconds)= .19
.
.
.
Run 973: Time (in seconds)= .82
Run 974: Time (in seconds)= .83
Run 975: Time (in seconds)= .83
Run 976: Time (in seconds)= .82
Run 977: Time (in seconds)= .83
Run 978: Time (in seconds)= .83
Run 979: Time (in seconds)= .82
Run 980: Time (in seconds)= .82
Run 981: Time (in seconds)= .83
Run 982: Time (in seconds)= .82
Run 983: Time (in seconds)= .83
Run 984: Time (in seconds)= .83
Run 985: Time (in seconds)= .82
Run 986: Time (in seconds)= .84
Run 987: Time (in seconds)= .83
Run 988: Time (in seconds)= .86
Run 989: Time (in seconds)= .84
Run 990: Time (in seconds)= .83
Run 991: Time (in seconds)= .85
Run 992: Time (in seconds)= .84
Run 993: Time (in seconds)= .84
Run 994: Time (in seconds)= .85
Run 995: Time (in seconds)= .84
Run 996: Time (in seconds)= .85
Run 997: Time (in seconds)= .84
Run 998: Time (in seconds)= .87
Run 999: Time (in seconds)= .84
Run 1000: Time (in seconds)= .85

PL/SQL procedure successfully completed.

Elapsed: 00:06:00.49
SQL>
SQL> declare
2 procedure check_key
3 as
4 x integer;
5 n_id integer;
6 s_vc varchar2(255);
7 begin
8 x := 42 * 3;
9 select
10 id
11 , vc
12 into
13 n_id
14 , s_vc
15 from
16 t_testloop
17 where
18 id = x;
19 exception
20 when NO_DATA_FOUND then
21 null;
22 end;
23 begin
24 for i in 1..1000 loop
25 check_key;
26 end loop;
27 end;
28 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:03.02
SQL>

Notice how after a certain number of iterations the execution gets slower and slower. And what is even more interesting is that the simple loop of 1,000 SQL statement executions afterwards takes significantly longer than before the code block got executed.

You might want to monitor the PGA/UGA consumption in the Session Statistics or V$PROCESS and V$TEMPORARY_LOBS while the block is executing.
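
One simple way to do that from a second session is to poll the session statistics of the session running the block (substitute its SID for :sid):

select
sn.name
, st.value
from
v$statname sn
, v$sesstat st
where
st.statistic# = sn.statistic#
and st.sid = :sid
and sn.name in ('session pga memory', 'session uga memory');

select * from v$temporary_lobs;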

If you run the same in 10.2.0.4, you'll observe the same symptoms, but the execution time will stay consistent, at least up to 1,000 iterations - obviously there will be a point where a system might get into trouble with the ever-increasing memory consumption; however, that is not the point here.

If DBMS_XMLDOM.FREEDOCUMENT is called correctly then the same code block in 11.1.0.7 and 11.2.0.1 will also show consistent execution times (and no increased PGA/UGA memory or temporary LOBs).

Interestingly, Nigel Noble has also recently published a blog note about PL/SQL memory issues that are only reproducible in 11.1.0.7 but don't show up in 10.2.0.4.

So although the issues reported by Alex Fatkulin and Nigel Noble aren't necessarily related to this one, all this seems to suggest that Oracle release 11 introduced some changes to the way memory is managed (maybe due to the introduction of Automatic Memory Management (AMM) and the memory_target parameter) that behave differently than in previous releases under certain circumstances.

References:
Tanel Poder's Advanced Oracle Troubleshooting series
Advanced Oracle Troubleshooting
Tanel's posts on troubleshooting

Expert Oracle Practices from the OakTable, Chapter 8

Flexible Sqlplus command line history with RLWRAP

At the Hotsos Symposium Training Day I used rlwrap with sqlplus – which gives nice command line editing and history capabilities to tools like sqlplus. Additionally, I pre-generated commonly used Oracle keywords, data dictionary view names and package names into an rlwrap wordfile, so I got nice tab-completion too. Sqlplus sucks much less with rlwrap ;-)

It’s relatively easy to install rlwrap on Unix (there are rlwrap RPMs out there, Solaris freeware packages and I installed it on Mac via macports.org). Just google around…

You can have rlwrap on Windows too – as rlwrap has been coded for Unix flavors, on Windows you need to run it in a Unix library environment emulator, like Cygwin.

Dave Herring and Michael Paddock have both written articles about how to get rlwrap & sqlplus running on Windows; check them out below. It’s worth reading both, as they have different additions…

So, if you want command line history, search and tab completion for sqlplus on Unix flavors or Windows, check these articles out!

http://daveherringsdbablog.blogspot.com/2010/03/sqlplus-command-history-with-cygwin.html

http://blog.enkitec.com/2010/04/29/using-sqlplus-with-rlwrap-on-ms-windows/