Search

Top 60 Oracle Blogs

Recent comments

Getting to know Oracle wait events in Linux.

May 4th: some updates after discussion with Jeff Holt, Cary Millsap and Ron Christo of Method-R.

There’s all the documentation, and there all the terrific blogs about performance and Oracle wait events. But if you more or less start with this stuff, or encounter a wait event that is not (extensive enough) documented, or an event turns up and gives a response time you don’t expect, you need to understand what that wait event means. If you don’t know, it’s easy to get stuck at this point.

If you are familiar with waits, and just want to dig further, progress to “Get to know wait events”, if you want to get up speed with waits, read on.

What is a wait event?

This is the definition of the performance tuning guide in the Oracle documentation:

Statistics that are incremented by a server process/thread to indicate that it had to wait for an event to complete before being able to continue processing. Wait events are one of the first places for investigation when performing reactive performance tuning.

This description is quite good I think. If I would have to make a better description, I would say a wait event is essentially the registration of time spend on executing a part of the Oracle database server code, mostly doing something elementary. The granularity of wait events can differ between Oracle versions. Wait events can describe interaction with the operating system (reading or writing are good examples of this), but also interaction between processes (enqueue’s and ipc messaging are good examples of that). Needless to say, both interactions require operating system calls.

Also, a wait event is not necessary a bad thing. Every Oracle database process is either running on CPU or waiting. If your SQL*Plus session is not running, it is waiting for ‘SQL*Net message from client’, which means the database tries to tell you that it waits for the next instruction. Another interesting thing to note about wait events is that the execution of certain instrumented (covered by a wait event) parts if Oracle server code can not report a wait per kernel developer decision. This means, for example, that if doing a latch get read went efficiently, you will not get a ‘latch free’ or ‘latch: latch name‘ wait event, because the wait registration would take exponentially more time than the actual latch get.

How to get the wait events during execution

(skip this section if you are familiar with enabling and disabling sql trace at level 8)

The way to get all the waits of the execution of one or more SQL statements for a given process is to enable SQL trace, alias event 10046, at level 8. This can be done at the system level (meaning all processes) too, but unless you want to reverse engineer the interaction between ALL the process, this is not very useful. There much written about all the ways you can enable SQL trace at level 8. For the sake of simplicity, I use setting event 10046 at level 8 in the session itself. The session requires ‘alter session’ rights to be able to set it. This is how it’s done:

SQL> alter session set events '10046 trace name context forever, level 8';
Session altered.

And can be turned off the following way:

SQL> alter session set events '10046 trace name context off';
Session altered.

Get to know wait events: strace

At this point we know how to enable tracing, but want to dig deeper and understand the operating system interaction inside the waits.

To see what a process (for example your Oracle foreground process) is doing from the perspective of the Operating System, there’s the ‘strace’ executable. The ‘strace’ executable can display all the system calls an operating system process is executing if it has enough privileges to access it. It resembles SQL trace a bit, but at the Operating System/Linux layer. First, you determine the process ID of the session you want to use ‘strace’ on, then use another process to execute ‘strace’. This is what it looks like:

1. Setup a sqlplus session
$ sqlplus ts/ts@v11203

SQL*Plus: Release 11.2.0.3.0 Production on Wed Apr 25 23:03:22 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options

TS@v11203 >

2. Attach to the sqlplus process with strace in a session on the linux host where the database is running:
$ strace -p 14128
Process 14128 attached - interrupt to quit
read(14,

3. Now execute something in the sqlplus session, for example ‘select * from dual’:
TS@v11203 > select * from dual;

D
-
X

TS@v11203 >

4. This results in the following list of system calls:

read(14, "\1/\6\21i \376\377\377\377\377\377\377\377\2\2"..., 8208) = 303
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372954
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 19996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\1r\6\20\27\2\374T\rD@\255\262t\t\313\242\1\247-8x"..., 370) = 370
read(14, "\25\6\3\5\"\2\17", 8208) = 21
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "\254\6\4\1 \1\1{\5\2"..., 172) = 172
read(14, "\1\337\6\3^#)\4\4\376\377\377\377\377\377\377\3774"..., 8208) = 479
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 20996}, ru_stime={0, 7998}, ...}) = 0
times({tms_utime=2, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 430372955
write(14, "010\6\v\1\5\314\2\1p\5\207y"..., 304) = 304
read(14,

Wow! Nice! But what does it mean? Well, these are all the system calls used by the Oracle foreground process in order to execute the (simple) SQL. It’s hard to tell something about the execution of the SQL with this list. And be aware, this is only a ‘select * from dual’, so a ‘real’ SQL will probably give a list so long you do not dare to begin studying the list to get something useful out of it.

Well, how about the next chapter:

Get to know wait events: combining waits and strace

How about combining the sql trace output with the system calls? The ‘strace’ executable has quite some options, which can be very helpful:

- list the data written by the process (‘-e write=all’)
This option lists the data written by the session. The trace file is written by the server process, so it will list the contents of that.
- list a timestamp for every system call (‘-tt’)
- list the time spend doing the system call (‘-T’)
- save the output to a file (‘-o /path/to/file.txt’)

These are all helpful to get more information the system calls.

Now setup a session in the database, enable SQL trace at level 8, fetch the process ID of session and run ‘strace -ttT -e write=all -p PID’ in another session and run something. Yes, this generates a massive amount of output. But, as I use it, it’s to investigate what certain waits are doing. Here are some examples:


23:46:20.145793 pread(257, "\6\242\202\3\1\356O\6\2049\245\2\3226\264O\6"..., 1032192, 1996505088) = 1032192
23:46:20.148986 lseek(8, 0, SEEK_CUR) = 46126
23:46:20.149030 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 31 39 38 20 66 69 ad' ela= 3198 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 34 le#=5 bl ock#=664 |
| 00050 33 34 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 34 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 34 38 39 36 31 53903801 48961 |
23:46:20.149121 write(9, "J?e83~1z1\n", 10) = 10
| 00000 4a 3f 65 38 33 7e 31 7a 31 0a J?e83~1z 1. |
23:46:20.149163 write(8, "\n", 1) = 1
| 00000 0a . |
23:46:20.157813 pread(256, "\6\242\2\4\1\360O\6\2\4\217T\2\3226\264O\6"..., 1032192, 2000699392) = 1032192
23:46:20.161124 lseek(8, 0, SEEK_CUR) = 46252
23:46:20.161169 write(8, "WAIT #139830933048328: nam='db f"..., 125) = 125
| 00000 57 41 49 54 20 23 31 33 39 38 33 30 39 33 33 30 WAIT #13 98309330 |
| 00010 34 38 33 32 38 3a 20 6e 61 6d 3d 27 64 62 20 66 48328: n am='db f |
| 00020 69 6c 65 20 73 63 61 74 74 65 72 65 64 20 72 65 ile scat tered re |
| 00030 61 64 27 20 65 6c 61 3d 20 33 33 31 37 20 66 69 ad' ela= 3317 fi |
| 00040 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 36 36 35 le#=5 bl ock#=665 |
| 00050 36 32 20 62 6c 6f 63 6b 73 3d 31 32 36 20 6f 62 62 block s=126 ob |
| 00060 6a 23 3d 31 34 30 33 34 20 74 69 6d 3d 31 33 33 j#=14034 tim=133 |
| 00070 35 33 39 30 33 38 30 31 36 31 31 30 30 53903801 61100 |
23:46:20.161260 write(9, "J?Hv2~1z1\n", 10) = 10
| 00000 4a 3f 48 76 32 7e 31 7a 31 0a J?Hv2~1z 1. |
23:46:20.161624 lseek(8, 0, SEEK_CUR) = 46377
23:46:20.161659 write(8, "\n", 1) = 1
| 00000 0a

The pread() call is the actual IO call done by the database.
The lseek() call is the seek towards the end of the trace file, and sets the pointer for the upcoming write.
The write() call is the write in the trace file. You see the multiblock read (scattered read in the trace file) was done using a pread() call, and the time measured by strace () is line with the time measured by Oracle (ela= 3317), and the number of blocks in the trace (126) matches the number of bytes read 1032192 (126*8192).
The time spend in the system calls is between the ‘smaller’ and ‘bigger’ signs, and get removed by wordpress, so if you where searching for it in the above example, you wouldn’t find it
The write() to file descriptor 9 with the garbage in it is the write to the .trm file.
The lseek() sets the pointer in the file after the writes, and a write() is done to enter a carriage return (\n) to the tracefile.
This is a simple example, to show how a synchronous multiblock read is reported, let’s see a a-synchronous one!


00:17:27.557053 pread(256, "#\242\202\251\243O\6\1\4[\4"..., 8192, 1190150144) = 8192
00:17:27.558973 lseek(8, 0, SEEK_CUR) = 2659
00:17:27.559009 write(8, "WAIT #140244789954888: nam='db f"..., 124) = 124
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 62 20 66 54888: n am='db f |
| 00020 69 6c 65 20 73 65 71 75 65 6e 74 69 61 6c 20 72 ile sequ ential r |
| 00030 65 61 64 27 20 65 6c 61 3d 20 31 39 30 33 20 66 ead' ela = 1903 f |
| 00040 69 6c 65 23 3d 35 20 62 6c 6f 63 6b 23 3d 34 33 ile#=5 b lock#=43 |
| 00050 33 39 34 20 62 6c 6f 63 6b 73 3d 31 20 6f 62 6a 394 bloc ks=1 obj |
| 00060 23 3d 31 34 30 33 33 20 74 69 6d 3d 31 33 33 35 #=14033 tim=1335 |
| 00070 33 39 32 32 34 37 35 35 38 39 34 33 39224755 8943 |
00:17:27.559088 write(9, "N?Rb~22y1\n", 10) = 10
| 00000 4e 3f 52 62 7e 32 32 79 31 0a N?Rb~22y 1. |
00:17:27.559128 write(8, "\n", 1) = 1
| 00000 0a . |
00:17:27.559334 mmap(0x7f8d48e41000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48e41000
00:17:27.559450 mmap(NULL, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x1ec000) = 0x7f8d48c81000
00:17:27.559491 mmap(0x7f8d48c81000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48c81000
00:17:27.559576 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, 7, 0x2fc000) = 0x7f8d48a81000
00:17:27.559614 mmap(0x7f8d48a81000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48a81000
00:17:27.559834 io_submit(140244822339584, 1, {{0x7f8d48a861f8, 0, 0, 0, 256}}) = 1
00:17:27.560008 mmap(0x7f8d48ad1000, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 7, 0) = 0x7f8d48ad1000
00:17:27.560080 io_submit(140244822339584, 1, {{0x7f8d48a86450, 0, 0, 0, 256}}) = 1
00:17:27.560272 io_getevents(140244822339584, 1, 128, {{0x7f8d48a86450, 0x7f8d48a86450, 122880, 0}}, {600, 0}) = 1
00:17:27.560363 io_getevents(140244822339584, 1, 128, {{0x7f8d48a861f8, 0x7f8d48a861f8, 106496, 0}}, {600, 0}) = 1
00:17:27.560865 lseek(8, 0, SEEK_CUR) = 2784
00:17:27.560902 write(8, "WAIT #140244789954888: nam='dire"..., 129) = 129
| 00000 57 41 49 54 20 23 31 34 30 32 34 34 37 38 39 39 WAIT #14 02447899 |
| 00010 35 34 38 38 38 3a 20 6e 61 6d 3d 27 64 69 72 65 54888: n am='dire |
| 00020 63 74 20 70 61 74 68 20 72 65 61 64 27 20 65 6c ct path read' el |
| 00030 61 3d 20 35 38 36 20 66 69 6c 65 20 6e 75 6d 62 a= 586 f ile numb |
| 00040 65 72 3d 35 20 66 69 72 73 74 20 64 62 61 3d 34 er=5 fir st dba=4 |
| 00050 33 33 39 35 20 62 6c 6f 63 6b 20 63 6e 74 3d 31 3395 blo ck cnt=1 |
| 00060 33 20 6f 62 6a 23 3d 31 34 30 33 33 20 74 69 6d 3 obj#=1 4033 tim |
| 00070 3d 31 33 33 35 33 39 32 32 34 37 35 36 30 38 34 =1335392 24756084 |
| 00080 35 5 |

This advanced example shows how first the segment header is fetched using the ‘db file sequential read’ wait. Despite being on ASM (this is ASM!), this clearly shows the segment header is read synchronously using the pread() call.
The next system calls and waits are even more interesting!
First, memory is allocated using the mmap() call. This is the beginning of a full scan, and the stuff read from disk need to be buffered somewhere. Please mind the memory allocation is not part of any wait. That is no problem, the allocation of memory should be very fast.
Next, TWO asynchronous reads are initiated using the two io_submit() calls, mind the iocb (IO control block) addresses, 0x7f8d48a861f8 and 0x7f8d48a86450.
See the man page of the io_submit call on your linux system using ‘man io_submit’, also use the man page of io_getevents to learn more about that call
Next, the asynchronous reads are reaped using the io_getevents() calls. What is noteworthy here is, is that the two requests are reaped in REVERSE order (this is consistent for the first two asynchronous call pairs for every segment, this example is Oracle version 11.2.0.3). So first request 0x7f8d48a86450 is fetched, then 0x7f8d48a861f8.
After 0x7f8d48a861f8 is reaped, the wait is reported for the FIRST IO call (io_submit() & io_getevents() combination). The second request (0x7f8d48a86450) of the start of a full segment scan is not reported. Never. Further requests do not reverse reaping/’encapsulate’ IO requests (mostly, it can happen a few times more).

I hope you get the idea of digging deeper into waits by examining the system calls.

I am not sure if the above ‘encapsulation’ of IO calls is a bug. But think of this: if you issue multiple IO requests (like we saw above, multiple IO requests are issued!) you cannot report the waits of all these distinct IO requests, because you would get more time reported than has passed. This has some consequences for the above ‘direct path read’ waits: these do not report individual IO times (like ‘db file sequential read’ does). This is just an indication on time spend in the process it had to wait for results from its IO requests, rather than disk IO latency.

The above conclusions are the result of observation and logical reasoning from my side. If you have anything to add, please respond!

Tagged: oracle waits performance events waitevents