Search

Top 60 Oracle Blogs

Recent comments

Oakies Blog Aggregator

Myth: Bitmap Indexes With High Distinct Columns (Blow Out)

I just couldn’t resist.   One of the great myths in Oracle is that bitmap indexes are only suitable and should only be used with columns that have so-called low cardinality (few distinct) values. A classic example of this myth being propagated is in this latest Burleson news item, “Oracle bitmap index maximum distinct values“, from Feb 16 [...]

How to CANCEL a query running in another session?

Here’s a treat for Oracle geeks, hackers and maniacs out there…

Update: As the beginning says, this article was meant as something interesting about Oracle’s internals and CTRL+C / OCICancel() handling. There’s a more practical way for canceling session calls if you are running Oracle Enterprise Edition and are currently using resource manager:

You can set the consumer group for a session to CANCEL_SQL to cancel its current call:

DBMS_RESOURCE_MANAGER.SWITCH_CONSUMER_GROUP_FOR_SESS (
session_id IN NUMBER,
session_serial IN NUMBER,
consumer_group IN VARCHAR2);

Thanks to commenter “null” for this info. Note that I haven’t tested how/whether this feature works correctly so there’s homework for you ;-)

I recently received a question about how to cancel queries running in another Oracle session, so that the session would not be killed, but would remain alive.

Well, there’s no supported way I can tell you, but thanks to how Oracle handles out-of-band breaks on Unix platforms, you can cancel database calls using an OS tool – kill.

Before we go on, here’s how query cancellation (pressing CTRL+C in sqlplus for example) works in Oracle:

  1. The user presses CTRL+C or clicks some button in App which runs OCICancel()
  2. The client process sends an urgent TCP packet (which is a regular TCP packet with URG bit set) to the socket in the database server it is connected to
  3. The server OS TCP stack receives the urgent TCP packet and sends URGENT signal to the process which owns the socket (the Oracle server process)
  4. Unix interrupts whatever the current process was doing and calls the URGENT signal handler function (which Oracle has registered during process startup)
  5. The urgent signal handler blindly assumes that the urgent signal has been received because user wants to cancel the query, stops the execution and returns back with an error: ORA-01013: user requested cancel of current operation

So, if we can’t make our application send the break packet, OCICancel() then we can just send the SIGURG signal to the Oracle process just like the OS TCP stack would do when it receives the packet with urgent bit set.

Here’s an example:

In one session I’m running a DBMS_STATS call:

SQL> exec dbms_stats.gather_database_stats;

I identify the SPID of that session’s process and send an URG signal to that process:

kill -URG 4476

And the call gets cancelled in the other session:

SQL> exec dbms_stats.gather_database_stats;
BEGIN dbms_stats.gather_database_stats; END;

*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01013: user requested cancel of current operation
ORA-06512: at "SYS.DBMS_STATS", line 13336
ORA-06512: at "SYS.DBMS_STATS", line 13682
ORA-06512: at "SYS.DBMS_STATS", line 13826
ORA-06512: at "SYS.DBMS_STATS", line 13790
ORA-06512: at line 1

My session was not killed – I still can run queries in it:

SQL> select * from dual;

D
-
X

SQL>

This works only on Unix platforms. Also this does not work when your client application is Windows sqlplus! This is because Windows sqlplus does not set up the out-of-band break checking properly when connecting. Maybe this is because old Windows versions TCP stacks didn’t know anything about urgent TCP packets! :)

A word of warning – this stuff is not for your everyday production usage! While it works and we know how and why it works, it’s not a good idea to send “random” signals to Oracle processes at your will. So the better way is to make your application able to cancel its database calls when you want it, but well in real world its not always (or should I even say rarely) possible.

Another thing to consider is when you run Oracle with Direct NFS, there will be network connections to the NFS server used by your server process, in addition to the client-server communication. I haven’t tested what happens when you send URG packet to a process in the DNFS case.

So try this out at your own risk ;-)

If you want to know more about query cancelling and what the in-band and out-of-band break checking is then you can read one of my old blog posts about it:

Share/Bookmark

How to CANCEL a query running in another session?

Here’s an old article of mine, explaining some of the Oracle internals and the OS touch point. It actually goes beyond explaining just commands for canceling SQL. I have written a couple of updates into the begginning of this article about SQL cancellation commands, but if you want to learn the internals, scroll down to The original article section.
Update 1: As the beginning says, this article was meant as something interesting about Oracle internals and CTRL+C / OCICancel() handling.

How to CANCEL a query running in another session?

Here’s an old article of mine, explaining some of the Oracle internals and the OS touch point. It actually goes beyond explaining just commands for canceling SQL. I have written a couple of updates into the begginning of this article about SQL cancellation commands, but if you want to learn the internals, scroll down to The original article section.
Update 1: As the beginning says, this article was meant as something interesting about Oracle internals and CTRL+C / OCICancel() handling.

How to CANCEL a query running in another session?

Here’s an old article of mine, explaining some of the Oracle internals and the OS touch point. It actually goes beyond explaining just commands for canceling SQL. I have written a couple of updates into the begginning of this article about SQL cancellation commands, but if you want to learn the internals, scroll down to The original article section.
Update 1: As the beginning says, this article was meant as something interesting about Oracle internals and CTRL+C / OCICancel() handling.

How to CANCEL a query running in another session?

Here’s an old article of mine, explaining some of the Oracle internals and the OS touch point. It actually goes beyond explaining just commands for canceling SQL. I have written a couple of updates into the begginning of this article about SQL cancellation commands, but if you want to learn the internals, scroll down to The original article section.
Update 1: As the beginning says, this article was meant as something interesting about Oracle internals and CTRL+C / OCICancel() handling.

RMOUG 2010: My presentations

It is very disappointing to me that I had to cancel my trip to RMOUG training days. I am sick and was not able to catch the flight due to that.

But, I can always share my presentations here. I had two presentations planned in this training day and can be accessed as below:

Advanced RAC troubleshooting
Riyaj_Advanced_rac_troubleshooting_RMOUG_2010_doc
Riyaj_Advanced_rac_troubleshooting_RMOUG_2010_ppt

Why optimizer hates my sql
Riyaj_Why_optimizer_hates_my_sql_2010

RMOUG training days audience: Please accept my sincere apologies.

A Free afternoon seminar in Singapore (24th Feb)

If you are in Singapore and have 24th Feb afternoon available then you can register and join a free Oracle performance troubleshooting seminar I’m doing in Singapore Management University’s (SMU) campus.

The seminar will be about:

  • Systematic Oracle Performance Troubleshooting
  • Identifying performance troublemakers
  • Understanding execution plans

The date is Wednesday, 24th Feb

The seminar time is from 15:30-19:00 (don’t be late)

Registration and more details are here:

Share/Bookmark

The Oracle Wait Interface Is Useless (sometimes) – part 3b

Welcome back for the concluding part of this series of blogs.
In the last part I went through a brief primer about code execution and stack frames in preparation for this posting. I hope that wasn’t too boring for everyone – I think it’s really important to make sure all the preliminary knowledge is covered and not make too many assumptions about the reader. In this part I will finally get to the point, and talk about some alternative techniques for determining the reasons for poor performance for our example user session.
To recap very quickly, we have a user session taking 100% CPU in top, nothing is emitted in SQL trace files, Oracle’s Time Model shows it is executing SQL, and that’s about all the information we have. As Tanel stated in his posting for Part Two, this is the stage where it makes good sense to dig into session-level statistics (not system-level statistics) to see if any of the counters are incrementing in a conspicuous way. If you don’t already have a copy of Tanel’s Snapper code, I suggest you grab it and get familiar with it – it’s a nice implementation of an essential tool that many DBAs already have some kind of script for.

Tanel found in Part Two that the SQL execution was revisiting the same small number of buffers very frequently, indicating that there might be a problem with one of the Nested Loop join methods used by the optimiser. Perhaps there is something wrong with that part of the plan (or with the whole plan)? Using counters can often reveal a potential problem this way, but they don’t necessarily  put a neon sign reading “Here’s Where All The Time Went” on top of the offending item. This is where we might want to dig further using some of our knowledge of code execution and stack traces, in combination with a dynamic tracing infrastructure such as Dtrace or Systemtap.

The power of Dtrace and friends lies in a few key areas:

  1. It can report time-based information
  2. It can show user and kernel information
  3. It is dynamic – unlike gprof and other code profilers, it is not necessary to run everything with a profiled binary

These are really key features which, when combined, allow for tremendously powerful on-demand tracing.

Consider the following sample output from a very small Dtrace script to look into what happens when a single printf() library call is made from a C program:

CPU FUNCTION
  0  -> my_function                            USER CODE
    (57 lines of dynamic library loading omitted)
  0    -> printf                               USER CODE
  0      -> _setorientation                    USER CODE
  0      <- _setorientation                    USER CODE
  0      -> _ndoprnt                           USER CODE
  0        -> _findbuf                         USER CODE
  0          -> isatty                         USER CODE
  0            -> ___errno                     USER CODE
  0            <- ___errno                     USER CODE
  0            -> ioctl                        USER CODE
  0              -> syscall_mstate             KERN
  0                -> gethrtime_unscaled       KERN
  0                  -> tsc_gethrtimeunscaled_delta  KERN
  0                    -> tsc_gethrtimeunscaled  KERN
  0                    <- tsc_gethrtimeunscaled  KERN
  0                  <- tsc_gethrtimeunscaled_delta  KERN
    (180 more lines of kernel functions for ioctl call omitted)
 0            <= ioctl                         SYSCALL: ioctl 1743296 ns
    (34 lines of kernel function (probably DTrace intrumentation!) omitted)
 0        <- ioctl                             USER CODE
 0  <= fstat64                                 SYSCALL: fstat64 352699 ns
    (~300 lines of kernel  and dynamic library loading omitted)
 0    -> _write                                USER CODE
    (~200 lines of kernel  and dynamic library loading omitted)
 0  <= write                                   SYSCALL: write 2154427 ns
    (123 lines of kernel code omitted)
 0  <- printf                                  USER CODE

In the above output, there are around 1000 functions called in User and Kernel domains. The output gives you some idea of the level of detail possible with Dtrace; we can see all of the following attributes:

  • All User mode function calls are logged
  • All Kernel mode function calls are logged
  • All system calls are timed

Note: The printf() itself is very much a user-mode call,part of the standard C library (libc). It actually makes a couple of system calls under the bonnet: Initially an ioctl() and eventually the write(2) system call.

Let’s look at something else – time spent in the Virtual Memory subsystem of the kernel doing Page Ins (page faults). Again, this is a very simple script, and produces output similar to the following:

WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 0 ms
WAIT: PAGE IN: 90 ms
WAIT: PAGE IN: 836 ms
WAIT: PAGE IN: 1098 ms
WAIT: PAGE IN: 853 ms
WAIT: PAGE IN: 816 ms
FINAL WAIT: PAGE IN: 263 ms

So now we can quantify how much time is spent servicing page faults, in this case showing approximately how much per second. Here we have very specific information about how much time our particular session is spending servicing page faults. Don’t assume that soft page faults are cheap/fast, either – with increasing virtualisation trends, soft page faults as observed by the guest operating system could actually be hard page faults in the underlying hypervisor.

These are just a few of the things that can be monitored in Dtrace. Now let’s combine the Oracle stack tracing that we introduced in part 3a.  In that posting I concluded that we could dump the stack trace once in a while, observe the name of the current function, and determine where we are spending the most time. With Dtrace we can make this much more concrete, and can quantitatively report the time spent in each function, and combine other contributing factors from the Operating System! I have put together a Dtrace script to demonstrate this concept. I won’t post the script here (yet) because it is still very much in development and still has a number of deficiencies. What I’m trying to get to here is a really quick way to see exactly what an Oracle session is doing by only looking at the running process using Dtrace. The script currently has the following features:

  • Shows all OWI wait events and durations
  • Shows all system calls and durations
  • Shows timing stats every second for: OWI wait times, page faults, latch acquisition, and a selection of common rowsource fetch operators

The rowsource operators are taken from tracing the function calls to the various rowsource operators defined in Metalink note 175982.1. Specifically, the script currently traces calls to the following rowsource operators:

  • GS – Group By Sort
  • JO – Join (Nested Loop)
  • UA – Union All
  • VW – View
  • HJ – Hash Join
  • TB – Table
  • JOT – Unknown. Possible relation to Join, observed frequently during stack tracing
  • SO – Sort
  • IXT – Index
  • PF – Prefetch

Running the same misbehaving SQL as Tanel in his last posting, we get the following output (snippet only):

LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 464:JO 0:UA 0:VW 0:HJ 0:TB 21:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=1 rsrc=(0:GS 430:JO 0:UA 0:VW 0:HJ 0:TB 9:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 471:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 467:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
SYSCALL: times 7104 ns
SYSCALL: times 1749 ns
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 466:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 468:JO 0:UA 0:VW 0:HJ 0:TB 20:JOT 0:SO)
LAST1S (ms): wt=0 pgtm=0 latch=2 rsrc=(0:GS 469:JO 0:UA 0:VW 0:HJ 0:TB 22:JOT 0:SO)

Remember that this SQL example precisely fits our profile where the session is using 100% usermode CPU and emitting no wait information, so we cannot see any OWI waits detailed here – just summaries of where the time was spent every second and a couple of system calls. We also have no page faults taking place on this test, so no time is accounted for there. The big user of CPU (468ms per second) is the nested loop join rowsource operator, which agrees with Tanel’s findings using session-statistics. We are also spending a little time acquiring latches, but it’s not very significant. There are clearly some rowsource functions missing, given that we can only account for half the CPU time, so I did a bit more function tracing to find the names of the missing functions, and added them to the script. The output now looks as follows (scroll right to see the new functions):

LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 461:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 163:IXT 759:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 691:JO 0:UA 0:VW 0:HJ 0:TB 455:JOT 0:SO 304:IXT 843:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 460:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 162:IXT 759:PF)
SYSCALL: times 7810 ns
SYSCALL: times 1750 ns
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 686:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 301:IXT 842:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 436:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 177:IXT 721:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 708:JO 0:UA 0:VW 0:HJ 0:TB 444:JOT 0:SO 286:IXT 880:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 357:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 228:IXT 588:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 788:JO 0:UA 0:VW 0:HJ 0:TB 445:JOT 0:SO 235:IXT 1013:PF)
LAST1S (ms): wt=0 pgtm=0 latch=0 rsrc=(0:GS 277:JO 0:UA 0:VW 0:HJ 0:TB 0:JOT 0:SO 276:IXT 458:PF)

Looks like we’ve found the missing time, and then some (the timers add up to more than 1000ms each second). The reason for this timing exaggeration is that there is a good deal of double accounting going on with the current version of the script – child function times are also accounted for in the parent. This is a known bug, hopefully I’ll get a chance to fix that one soon.

Well that’s about it for now. I hope that this journey has been an interesting one, I know it’s been a lot of reading to take in all the parts. We started with a session that was emitting no wait information, and now we have a simple Dtrace script that is easy to execute with no impact to other sessions, and easy to stop once the underlying cause is determined. The query does not need to be stopped and restarted, and no additional overhead is imposed from enabling additional Oracle diagnostics such as setting statistics_level=all. I intend to expand on this approach over time, with a particular focus on:

  • Multiplatform support – Systemtap version is required
  • Better design to support more function tracing
  • Fix double counting issue
  • Add identifiers for optimiser plan line number rather than rolling up by function name

How Does An Execution Plan Suddenly Change When The Statistics (And Everything Else) Remains The Same ? (In Limbo)

I’ve slipped this post in as there have been a number of discussions recently on how execution plans have changed while nothing else appears to have changed in the database. How can an execution plan suddenly change when no one has made any changes to the database ?   By no changes, it means that there [...]