The curious case of the missing semctl call

This article is about the internals of how the Oracle database handles transactions. In this case the communication mechanism of foreground sessions to the logwriter process is examined. The tests in this article have been executed using the following versions:
– Oracle database 12.1.0.2.161018
– Oracle linux 7.2, kernel 4.1.12-61.1.14.el7uek.x86_64 (UEK4)

In my previous article, ‘Transactions and SCNs’, I talked about redo generation, among other things. When no changes have been made to the way a commit works (parameters commit_logging, commit_wait and commit_write), a commit waits for its change vectors to be written to disk by either the logwriter or the log writer worker processes. In order for the logwriter or its worker processes to write these change vectors as quickly as possible, a foreground process needs to signal the logwriter. This is independent of the two modes of the logwriter, post/wait and polling, because these describe how the logwriter (or the helper processes) signal the foreground process after writing.

Whenever there is a need for a foreground process, it can signal the logwriter to write out the change vectors. This can not only happen during commit, but also during doing DML. Most people are aware that the technical implementation to trigger the logwriter is using the ‘semctl()’ call. Typically, an idle logwriter process is waiting for the wait event ‘rdbms ipc message’ on the Oracle level, which is the instrumentation of the process waiting for the system call ‘semtimedop()’. This way, the logwriter process is off cpu until the semaphore is triggered or semtimedop() times out.

By executing semctl() against the semaphore identifier and semaphore number the logwriter process is sleeping on, the logwriter is made runnable and starts working.

In order to prove this, this small gdb script can be used:

break kcrfw_copy_cv
commands
silent
print $rip
c
end
break kcrf_commit_force_int
commands
silent
print $rip
c
end
break semctl
commands
silent
print $rip
c
end

When this macro is run in a gdb session attached to an Oracle foreground process, and the following commands are executed:

insert into t values ('a');
commit;

The following output is generated:

$1 = (void (*)()) 0xccba840 
$2 = (void (*)()) 0xccba840 
$3 = (void (*)()) 0xccbb770 
$4 = (void (*)()) 0x7fc43b2189d0 
$5 = (void (*)()) 0xccbb770 

Row 1: this is the redo generated by the insert statement copied into the public logbuffer.
Row 2: this is the redo generated by the commit statement copied into the public logbuffer.
Row 3/4: entering the first ‘kcrf_commit_force_int’ phase, where the logwriter is semctl’ed.
Row 5: entering the second ‘kcrf_commit_force_int’ phase, where is waited for the logwriter to finish writing.

Sounds logical, right?

However, when testing this thesis with two sessions doing the insert and commit at exactly the same time (using the broadcast input option in iTerm2), I gotten the following output:

(first session)
$6 = (void (*)()) 0xccba840 
$7 = (void (*)()) 0xccba840 
$8 = (void (*)()) 0xccbb770 
$9 = (void (*)()) 0x7fc43b2189d0 
$10 = (void (*)()) 0xccbb770 

(second session)
$1 = (void (*)()) 0xccba840 
$2 = (void (*)()) 0xccba840 
$3 = (void (*)()) 0xccbb770 
$4 = (void (*)()) 0xccbb770 

I guess it’s easy to spot the call to semctl being missing for the second session. What is happening here?

First thing to do is investigate how semctl is called. The way to do that, is by changing the commands to execute when gdb breaks on semctl and get a backtrace using the ‘bt’ command:

(gdb) info break 3
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x00007fc43b2189d0 
	breakpoint already hit 2 times
        silent
        print $rip
        c
(gdb) commands 3
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>silent
>print $rip
>bt
>c
>end
(gdb) c
Continuing.

Now execute the insert and the commit again, look at the backtrace:

$11 = (void (*)()) 0xccba840 
$12 = (void (*)()) 0xccba840 
$13 = (void (*)()) 0xccbb770 
$14 = (void (*)()) 0x7fc43b2189d0 
#0  0x00007fc43b2189d0 in semctl () from /lib64/libc.so.6
#1  0x000000000cdf00a5 in sskgpwpost ()
#2  0x000000000cdedcff in skgpwpost ()
#3  0x000000000c94466f in kslpsprns ()
#4  0x000000000c936553 in kskpthr ()
#5  0x000000000c9442af in ksl_post_proc ()
#6  0x000000000ccc0b28 in ksasnd ()
#7  0x000000000ccc17ff in ksbasend ()
#8  0x000000000ccbc13f in kcrf_commit_force_int ()
#9  0x000000000ccb730a in kcrfw_redo_gen_ext ()
#10 0x000000000c9df1c8 in kcbchg1_main ()
#11 0x000000000c9ddc05 in kcbchg1 ()
#12 0x000000000c99421f in ktucmt ()
#13 0x000000000c975dce in ktcCommitTxn_new ()
#14 0x000000000c97c147 in ktcCommitTxn ()
#15 0x000000000ce504f9 in ktdcmt ()
#16 0x000000000cc44917 in k2lcom ()
#17 0x00000000029cacda in k2send ()
#18 0x000000000c97f87c in xctCommitTxn ()
#19 0x000000000270191f in kksExecuteCommand ()
#20 0x000000000cb6d678 in opiexe ()
#21 0x00000000021cf90e in kpoal8 ()
#22 0x000000000cb60d1d in opiodr ()
#23 0x000000000cd8f4db in ttcpip ()
#24 0x0000000001b96d5c in opitsk ()
#25 0x0000000001b9b721 in opiino ()
#26 0x000000000cb60d1d in opiodr ()
#27 0x0000000001b92eab in opidrv ()
#28 0x000000000269e4d1 in sou2o ()
#29 0x0000000000ba8d9a in opimai_real ()
#30 0x00000000026a8c1c in ssthrdmain ()
#31 0x0000000000ba8cdc in main ()
$15 = (void (*)()) 0xccbb770 

This shows that inside kcrf_commit_force_int, these functions are called which lead up to semctl:

#0  0x00007fc43b2189d0 in semctl () from /lib64/libc.so.6
#1  0x000000000cdf00a5 in sskgpwpost ()
#2  0x000000000cdedcff in skgpwpost ()
#3  0x000000000c94466f in kslpsprns ()
#4  0x000000000c936553 in kskpthr ()
#5  0x000000000c9442af in ksl_post_proc ()
#6  0x000000000ccc0b28 in ksasnd ()
#7  0x000000000ccc17ff in ksbasend ()

This is starting to make sense, ksb is kernel service background processes, ksa is kernel service asynchronous messages. The next thing to do is execute the simultaneous commit again, break on these functions, and see how many of these are called:

(gdb) break sskgpwpost
Breakpoint 4 at 0xcdf0050
(gdb) break skgpwpost
Breakpoint 5 at 0xcdedce0
(gdb) break kslpsprns
Breakpoint 6 at 0xc944310
(gdb) break kskpthr
Breakpoint 7 at 0xc9364d0
(gdb) break ksl_post_proc
Breakpoint 8 at 0xc944270
(gdb) break ksasnd
Breakpoint 9 at 0xccc0a40
(gdb) break ksbasend
Breakpoint 10 at 0xccc1760
(gdb) commands 4-10
Type commands for breakpoint(s) 4-10, one per line.
End with a line saying just "end".
>silent
>print $rip
>c
>end
(gdb) c
Continuing.

This is how the output looks like:

(first session)
$1 = (void (*)()) 0xccba840 
$2 = (void (*)()) 0xccba840 
$3 = (void (*)()) 0xccbb770 
$4 = (void (*)()) 0xccc1760 
$5 = (void (*)()) 0xccc0a40 
$6 = (void (*)()) 0xc944270 
$7 = (void (*)()) 0xc9364d0 
$8 = (void (*)()) 0xc944310 
$9 = (void (*)()) 0xcdedce0 
$10 = (void (*)()) 0xcdf0050 
$11 = (void (*)()) 0x7f06f6f4a9d0 
$12 = (void (*)()) 0xccbb770 
(second session)
$1 = (void (*)()) 0xccba840 
$2 = (void (*)()) 0xccba840 
$3 = (void (*)()) 0xccbb770 
$4 = (void (*)()) 0xccc1760 
$5 = (void (*)()) 0xccc0a40 
$6 = (void (*)()) 0xccbb770 
$7 = (void (*)()) 0xccc1760 
$8 = (void (*)()) 0xccc0a40 

What this shows, is that the first session starts the ‘dive’ from the kcrf_commit_force_int function towards the semctl system call, and the second session is starting the same dive, but stops in the ksasnd function. After that the second session starts the second phase of kcrf_commit_force_int, and starts the same sequence of calls for semctl’ing the logwriter, and ends in ksasnd once again. The first session did do the entire dive to the semctl call, and remembers that when kcrf_commit_force_int is called the second time.

So how does this actually work in the functions ksbasend and ksasnd? The next step is to look how the function ksasnd is called. For that, change the breakpoint for the ksasnd function to print the arguments for the functions. Please mind I do not know the number of arguments, I normally print the first four arguments (which are in the cpu registers RDI, RSI, RDX and RCX in that order) as a starting point. The reason for picking the ksasnd function is there is a high probability (although it truly is a guess) that ksasnd means kernel service asynchronous messages send, and thus gives some indication why it did continue calling the functions up to semctl. This is how that is done:

(gdb) info break 9
Num     Type           Disp Enb Address            What
9       breakpoint     keep y   0x000000000ccc0a40 
	breakpoint already hit 1 time
        silent
        print $rip
        c
(gdb) commands 9
Type commands for breakpoint(s) 9, one per line.
End with a line saying just "end".
>printf "ksasnd %x, %x, %x, %x\n", $rdi, $rsi, $rdx, $rcx
>silent
>c
>end

When another insert/commit is executed, this gives the following output:

$29 = (void (*)()) 0xccba840 
$30 = (void (*)()) 0xccba840 
$31 = (void (*)()) 0xccbb770 
$32 = (void (*)()) 0xccc1760 
ksasnd 95dafd00, 2, 85, 18
$33 = (void (*)()) 0xc944270 
$34 = (void (*)()) 0xc9364d0 
$35 = (void (*)()) 0xc944310 
$36 = (void (*)()) 0xcdedce0 
$37 = (void (*)()) 0xcdf0050 
$38 = (void (*)()) 0x7f06f6f4a9d0 
$39 = (void (*)()) 0xccbb770 

Since KSA probably is kernel service asynchronous messages, one of the numbers could be related to the messages listed in x$messages. Upon some investigation it turns out 0x85 (decimal 133) is quite likely a x$messages number:

SYS@testdb AS SYSDBA> select indx, description, dest from x$messages where indx = 133;

      INDX DESCRIPTION               DEST
---------- ------------------------- ----------
       133 Redo writer post action   LGWR

That is what the session is doing, posting the logwriter for action! Just to settle what this means: activating the logwriter using the semctl systemcall is not the only thing that happens. Actually, the main thing happening is that the foreground process sends a specific request (message) to the logwriter, which is asking it to take action. The semctl call is just the way to make the process active.

At this point I got stuck on the function calls. However, there are more ways to see what is going on. I started to do some sampling using ‘perf’. Investigating the perf sampling resulted in finding an additional function that is called during insert and commit: ksaamb (which probably means kernel service asynchronous (messaging) allocate message buffer). Also memory profiling resulted in a memory location that is read during dive towards semctl: 0x60027e20.

The address 0x60027e20 lies within the fixed SGA variable kcrfsg_:

SQL> select ksmfsnam, ksmfsadr, ksmfssiz from x$ksmfsv
  2 where to_number('60027e20','XXXXXXXX')
  3 between to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')
  4 and to_number(ksmfsadr,'XXXXXXXXXXXXXXXX')+ksmfssiz-1;

KSMFSNAM	     KSMFSADR	      KSMFSSIZ
-------------------- ---------------- --------
kcrfsg_ 	     0000000060027C30	  1608

However, it is not externalised in x$kcrfws; running an access watch in gdb and selecting the entire x$kcrfws table does not trigger the watchpoint.

With these two new found items, let’s continue the investigation:
– revert the breakpoint at ksasnd to just print it was triggered (print $rip).
– add a breakpoint for ksaamb in the same fashion as the other breakpoints.
– add an access watchpoint on address 0x60027e20 that continues when triggered:

(gdb) awatch *0x60027e20
Hardware access (read/write) watchpoint 1: *0x60027e20
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>c
>end

When this is run with two sessions at the same time (doing insert into and commit), the pieces fall into place:

(session 1)
$33 = (void (*)()) 0xccba840 
$34 = (void (*)()) 0xccba840 
$35 = (void (*)()) 0xccbb770 
$36 = (void (*)()) 0xccc1760 
$37 = (void (*)()) 0xccc0a40 
$38 = (void (*)()) 0xccc0bf0 
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()
$39 = (void (*)()) 0xccbb770 
$40 = (void (*)()) 0xccc1760 
$41 = (void (*)()) 0xccc0a40 
$42 = (void (*)()) 0xccc0bf0 
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()

(session 2)
$9 = (void (*)()) 0xccba840 
$10 = (void (*)()) 0xccba840 
$11 = (void (*)()) 0xccbb770 
$12 = (void (*)()) 0xccc1760 
$13 = (void (*)()) 0xccc0a40 
$14 = (void (*)()) 0xccc0bf0 
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 1
New value = 0
0x000000000ccc0c50 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0ca1 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0d68 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 0
New value = 1
0x000000000ccc0d7c in ksaamb ()
$15 = (void (*)()) 0xc944270 
$16 = (void (*)()) 0xc9364d0 
$17 = (void (*)()) 0xc944310 
$18 = (void (*)()) 0xcdedce0 
$19 = (void (*)()) 0xcdf0050 
$20 = (void (*)()) 0x7f7320c8c9d0 
$21 = (void (*)()) 0xccbb770 

In this case, both session act completely the same, until the ksaamb function. In the ksaamb function, both sessions read memory address 0x60027e20. The first session reads the value 1, the second session reads 0 (which has been a 1 previously, which is the reason for printing old/new value).

It seems that 0 indicates there are no messages pending for the logwriter, which is why the second session continues to run in the ksaamb function, and the first session actually cancelling the action to execute semctl. In the second session the address 0x60027e20 is read a few more times, until it actually writes the value 1 into it.

To expand a bit on how this works, you can add latch gets and frees:

(gdb) break kslgetl
Breakpoint 13 at 0xc938af0
(gdb) commands
Type commands for breakpoint(s) 13, one per line.
End with a line saying just "end".
>silent
>printf "kslgetl %x\n", $rdi
>c
>end
(gdb) break kslfre
Breakpoint 14 at 0xc9397a0
(gdb) commands
Type commands for breakpoint(s) 14, one per line.
End with a line saying just "end".
>silent
>printf "kslfre %x\n", $rdi
>c
>end

This will show all the latch gets during both insert into and commit. However the interesting bit is:

(session 1)
$61 = (void (*)()) 0xccc0bf0 
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 1
0x000000000ccc0c50 in ksaamb ()

(session 2)
$6 = (void (*)()) 0xccc0bf0 
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0c50 in ksaamb ()
kslgetl 6000e660
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0ca1 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Value = 0
0x000000000ccc0d68 in ksaamb ()
Hardware access (read/write) watchpoint 12: *0x60027e20

Old value = 0
New value = 1
0x000000000ccc0d7c in ksaamb ()
kslfre 6000e660

The first session reads the value 1 and bails out. The second session reads the value 0 and continues. It then takes a latch at address 0x6000e660, and eventually sets the value to 1, and frees the latch at address 0x6000e660.

Probably most readers of this blog can already guess the latch name, but for completeness sake:

SYS@testdb AS SYSDBA> select addr, name from v$latch where to_number(addr,'XXXXXXXXXXXXXXXX') = to_number('6000e660','XXXXXXXX');

ADDR		 NAME
---------------- ----------------------------------------------------------------
000000006000E660 messages

The messages latch!

Conclusion
Whenever a session needs its redo to be written by the log writer, it makes sure the redo is available for the logwriter using the function kcrfw_copy_cv, and then triggers the logwriter. When profiling system calls, it looks like the trigger simply is waking up the logwriter using the system call semctl. Upon deeper investigation, the way it works is that actually a message is send to the logwriter explicitly asking the logwriter to take action on the Oracle level. The construction of the logwriter sleeping in semtimedop and a process triggering it using semctl is an implementation detail.

A session that needs to trigger the logwriter uses the Oracle ksa (kernel service asynchronous messages) layer of functions. The function ksaamb (kernel service asynchronous messages allocate message buffer) reads a variable (0x60027e20) in the fixed SGA to make sure this work is not already currently being done. 0 means no other process is doing that, a value of 1 means another session is already working on signalling the logwriter. If 1 is encountered the session stops trying to signal the logwriter, and “hopes” the other session triggers the logwriter and that the logwriter will write it’s redo. However, to be absolutely sure, it will try to signal the logwriter once again when the second “phase” of the kcrf_commit_force_int execution (which is waiting for the logwriter to finish writing the session’s redo).

Setting the fixed SGA variable is protected by a messages latch.

The essence of the article is you might not always see a semctl call where you expect it. This is actually an optimisation which makes sense once you understand the implementation details.