Oakies Blog Aggregator

MERGE vs UPDATE/INSERT revisited

I wrote a few years back that for single row operations, MERGE might in fact have a larger overhead than the do-it-yourself approach (i.e., attempt an update, and if it fails, then do an insert).

Just to show that it’s always good to revisit things as versions change, here’s the same demo (scaled up now because my laptop is faster).

As you can see, there is still a little difference between the two operations.  But even so, unless you need that absolute last little percentage of performance, and you know the data distribution (and hence the probability of updates versus inserts) extremely well (in particular, you’re expecting near-always updates), perhaps MERGE is the way to go, in that it more accurately reflects the operation being performed, and is hence closer to the concept of “self-documenting” code.




SQL> drop table t1 purge;

Table dropped.

SQL>
SQL> create table t1
  2   ( x int primary key,
  3     y int );

Table created.

SQL> -- all inserts
SQL>
SQL> set timing on
SQL> begin
  2  for i in 1 .. 500000 loop
  3    merge into t1
  4    using ( select i x, i y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:19.99

SQL> -- 50/50 updates and inserts
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    merge into t1
  4    using ( select i x, i y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:19.19
SQL> -- all updates
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    merge into t1
  4    using ( select i x, i+1 y
  5            from dual ) m
  6    on ( t1.x = m.x )
  7    when matched then update
  8    set t1.y = m.y
  9    when not matched then
 10    insert values (m.x, m.y );
 11  end loop;
 12  end;
 13  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:17.48
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:04.94
SQL>
SQL> create table t1
  2   ( x int primary key,
  3     y int );

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> set timing on
SQL> begin
  2  for i in 1 .. 500000 loop
  3    update t1 set y = i where x = i;
  4    if sql%notfound then insert into t1 values (i,i); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:22.79
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    update t1 set y = i where x = i;
  4    if sql%notfound then insert into t1 values (i,i); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:18.08
SQL>
SQL> set timing on
SQL> begin
  2  for i in 250000 .. 750000 loop
  3    update t1 set y = i+1 where x = i;
  4    if sql%notfound then insert into t1 values (i,i+1); end if;
  5  end loop;
  6  end;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:13.88
SQL>

Datatypes for DATES

Richard Foote has written a post about not using the DATE datatype for storing dates.

So I’ve come up with a revolutionary system to store dates as well…using the very efficient RAW datatype.

Here’s a demo


SQL> create table t ( x raw(7) );

Table created.

SQL>
SQL> create or replace
  2  procedure store_date(p_yyyymmddhh24miss varchar2) is
  3  begin
  4    insert into t
  5    values
  6      (
  7        hextoraw(
  8         to_char(to_number(substr(p_yyyymmddhh24miss,1,2))+100,'FM0X')||
  9         to_char(to_number(substr(p_yyyymmddhh24miss,3,2))+100,'FM0X')||
 10         to_char(to_number(substr(p_yyyymmddhh24miss,5,2)),'FM0X')||
 11         to_char(to_number(substr(p_yyyymmddhh24miss,7,2)),'FM0X')||
 12         to_char(to_number(substr(p_yyyymmddhh24miss,9,2))+1,'FM0X')||
 13         to_char(to_number(substr(p_yyyymmddhh24miss,11,2))+1,'FM0X')||
 14         to_char(to_number(substr(p_yyyymmddhh24miss,13,2))+1,'FM0X')
 15        )
 16      );
 17  end;
 18  /

Procedure created.

SQL>
SQL> exec store_date('20160528211212')

PL/SQL procedure successfully completed.

SQL> select * from t;

X
--------------
7874051C160D0D

As you can see, the dates are stored in a compact 7-byte format. I’ve added 100 to the century and the year so we can also store negative dates (before 0AD) without any dramas. I’m quite impressed with my ingenuity here. I’m not going to have any of those “number of seconds since 1970” issues, where a 32-bit number might overflow etc etc.
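
Just to prove the encoding round-trips, here is a possible decoder to go with it. (A sketch added for illustration only: the function name retrieve_date is made up, and it assumes the input always holds the 7-byte encoding produced by store_date, i.e. no negative years.)

create or replace
function retrieve_date(p_raw raw) return varchar2 is
  l_hex varchar2(14) := rawtohex(p_raw);
begin
  return
    to_char(to_number(substr(l_hex, 1,2),'XX')-100,'FM00')||  -- century
    to_char(to_number(substr(l_hex, 3,2),'XX')-100,'FM00')||  -- year
    to_char(to_number(substr(l_hex, 5,2),'XX'),'FM00')||      -- month
    to_char(to_number(substr(l_hex, 7,2),'XX'),'FM00')||      -- day
    to_char(to_number(substr(l_hex, 9,2),'XX')-1,'FM00')||    -- hour
    to_char(to_number(substr(l_hex,11,2),'XX')-1,'FM00')||    -- minute
    to_char(to_number(substr(l_hex,13,2),'XX')-1,'FM00');     -- second
end;
/

Feeding the stored value back through this returns ‘20160528211212’, the string we started with.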

So let us compare that to the DATE datatype.


SQL> create table t1 ( x date );

Table created.

SQL> insert into t1 values ( to_date('20160528211212','yyyymmddhh24miss'));

1 row created.

SQL> select dump(x) from t1;

DUMP(X)
--------------------------------------------------------------------------------
Typ=12 Len=7: 120,116,5,28,22,13,13

Let me convert that to plain old bytes so we can compare


SQL> create or replace
  2  function dump_to_hex(p_str varchar2) return varchar2 is
  3    l_str varchar2(100) := p_str;
  4    l_elem varchar2(10);
  5    l_hex varchar2(100);
  6  begin
  7    l_str := substr(l_str,instr(l_str,':')+2);
  8    loop
  9      l_elem := substr(l_str,1,instr(l_str||',',',')-1);
 10      exit when l_elem is null;
 11      l_hex := l_hex || to_char(to_number(l_elem),'FM0X');
 12      l_str := substr(l_str,instr(l_str||',',',')+1);
 13    end loop;
 14    return l_hex;
 15  end;
 16  /

Function created.

SQL>
SQL> select dump_to_hex(dump(x)) from t1;

DUMP_TO_HEX(DUMP(X))
-----------------------------------------------------------------
7874051C160D0D

Oh… Looks like someone beat me to it.

So if you’re thinking about re-inventing your own datatype for dates, perhaps just stick with the one that’s provided for you … it works just fine.

Oracle SGA memory allocation on startup

Recently I have been presenting on what running Oracle on a large Intel-based NUMA system looks like (the OTN EMEA tour in Düsseldorf and Milan, and I will be presenting about this at the Dutch AMIS 25th anniversary event in June). The investigation for this presentation was done on an SGI UV 300 machine with 24 terabytes of memory, 32 sockets (= NUMA nodes), 480 cores and 960 threads.

Recently I have been given access to a new version of the UV 300, the UV 300 RL, for which the CPU has moved from Ivy Bridge to Haswell, and which now has 18 cores per socket instead of 15; that means a fully equipped system has 576 cores and 1152 threads.

Now to get back to the actual purpose of this blogpost: SGA memory allocation on startup. One of the things you can do on such a system is allocate an extremely large amount of memory. In my case I chose to allocate 10 terabytes for the Oracle SGA. When you go (excessively) outside of normal numbers, you run into things simply because there is a (very) low chance that this has actually been tested, and very few people (or none) have actually done it before.

First let me show you the other technical details:
Operating system: Oracle Linux version 7.1
Kernel: kernel-uek-3.8.13-98.4.1.el7uek.x86_64
Oracle grid infrastructure version: 12.1.0.2.160419
Oracle database version: 12.1.0.2.160419
For the database instance the use_large_pages parameter is set to ONLY. It simply does not make sense not to use it in a normal database; in my opinion it really is mandatory if you work with large memory databases.
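
For reference, the two settings mentioned above would look something like this in a pfile (a sketch of just these two parameters, not the complete configuration):

*.sga_target=10737418240000       # 10 terabyte, as used for this test
*.use_large_pages=ONLY            # the instance refuses to start without huge pages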

Firing up the instance…
At first I set sga_target to 10737418240000, aka 10 terabytes, and started up the instance. My sqlplus session got busy after pressing enter, and it remained busy. I went downstairs for coffee, got back to my office, and it was still busy. Long story short: after 30 minutes it was still busy and I wondered if something was wrong.

What can you do? I first checked the alert.log of the instance. It told me:

Tue May 24 07:52:35 2016
Starting ORACLE instance (normal) (OS id: 535803)
Tue May 24 07:52:35 2016
CLI notifier numLatches:131 maxDescs:3801

Not a lot of information, but also no indication anything was wrong.

Then I looked with the ‘top’ utility what was active at my system:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 535803 oracle    20   0  9.765t  22492  17396 R  99.4  0.0  11:10.56 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

I see my local bequeath session running at or near 100% CPU. It’s working, and that is the correct state when you fire up an Oracle database instance. But what is it doing?

Next logical thing: strace:

[oracle@skynet5 frits]$ strace -fp 535803
Process 535803 attached

Yes, this is meant to show that strace does not show any output. This means the process is not making any system calls. It does NOT mean the process is doing nothing; we saw with ‘top’ that it’s busy, running close to 100% on a CPU (thread).

When strace does not show anything, the next layer to dive into is the C function layer, with perf. The first thing I tried, and normally try first, is ‘perf top’; ‘perf top -p’ zooms in on one process:

[oracle@skynet5 frits]$ perf top -p 535803
   PerfTop:    4003 irqs/sec  kernel:99.9%  exact:  0.0% [4000Hz cycles],  (target_pid: 535803)
-----------------------------------------------------------------------------------------------------------------------------------

    95.86%  [kernel]   [k] clear_page_c_e
     1.58%  [kernel]   [k] clear_huge_page
     0.41%  [kernel]   [k] __ticket_spin_lock
     0.32%  [kernel]   [k] clear_page
     0.29%  [kernel]   [k] _cond_resched
     0.28%  [kernel]   [k] mutex_lock
     0.11%  [kernel]   [k] apic_timer_interrupt
     0.11%  [kernel]   [k] hugetlb_fault

Ah! This shows what the process is actually doing! We can see it’s executing a function called ‘clear_page_c_e’ in kernel mode in 96% of the samples perf took of the running process. The name of the function describes what it does quite well: it is a kernel function to clear a page.

So this tells us that the process starting up the instance is actually spending all of its time clearing memory, but it does not show how this relates to what the Oracle database is doing. In order to dig further, we can use perf in another way: record backtraces, and report on them later. ‘perf record’ samples full backtraces into a file called ‘perf.data’. When ‘perf report’ is then run against that file, it shows the function that was sampled most, plus all the backtraces leading to that function (when perf record is called with -g, “call-graph”).

[oracle@skynet5 frits]$ perf record -g -p 535803
...after some time: ^C, then:
[oracle@skynet5 frits]$ perf report
# Overhead  Command       Shared Object                                      Symbol
# ........  .......  ..................  ..........................................
#
    97.14%   oracle  [kernel.kallsyms]   [k] clear_page_c_e
             |
             --- clear_page_c_e
                |
                |--100.00%-- hugetlb_fault
                |          handle_mm_fault
                |          __do_page_fault
                |          do_page_fault
                |          page_fault
                |          slaac_int
                |          slrac
                |          sskgm_segment_notify_action
                |          skgmcrone
                |          skgm_allocate_areas
                |          skgmcreate
                |          ksmcrealm
                |          ksmcsg
                |          opistr_real
                |          opistr
                |          opiodr
                |          ttcpip
                |          opitsk
                |          opiino
                |          opiodr
                |          opidrv
                |          sou2o
                |          opimai_real
                |          ssthrdmain
                |          main
                |          __libc_start_main
                 --0.00%-- [...]

What this shows is again the kernel function ‘clear_page_c_e’ (the [k] also indicates a kernel function), and when we follow the backtrace down, we see:
* hugetlb_fault: this is a page fault handling function, which shows that I am using huge pages.
* handle_mm_fault, __do_page_fault, do_page_fault, page_fault: these are Linux kernel functions indicating it’s handling a page fault.
* slaac_int, slrac, sskgm_segment_notify_action: these are Oracle database functions. Obviously, slaac_int is the function actually ‘touching’ the memory, because it results in a page fault.

So what is a page fault? A page fault is an interrupt that is raised when a program tries to access memory that has been allocated into the process’ virtual address space, but not actually loaded into main memory. The interrupt causes the kernel to handle the page fault and make the page available. Essentially Linux, like most other operating systems, makes pages available once they are actually used, instead of when they are ‘just’ allocated. (This is a summary; there is a lot to tell about page faults alone, but that’s beyond the scope of this article.)
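
As an aside: a simple way to watch this demand paging happening is to look at the page fault counters of the process; procps’ ps exposes these as the min_flt/maj_flt columns (a convenience trick, not part of the measurements above):

[oracle@skynet5 frits]$ watch -n 1 'ps -o pid,min_flt,maj_flt,comm -p 535803'

The minor fault counter keeps climbing for as long as the process is touching pages for the first time.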

It’s important to realise there is no system call visible that allocates memory (think of mmap() or brk(), the calls that malloc() uses under the hood). If you want to know more about these calls, please read up on them using ‘man SYSCALLNAME’ (on Linux, obviously). So, the above backtrace looks like code that deliberately touches the Oracle SGA memory in order to get it truly allocated!
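
A related aside: if memory-allocating system calls were being made, strace could show them selectively, since strace understands a ‘memory’ class of system calls (mmap(), brk(), munmap() and friends):

[oracle@skynet5 frits]$ strace -e trace=memory -p 535803

During this phase of startup that stays just as silent as the plain strace shown earlier, underlining that the memory has already been allocated and is merely being faulted in.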

Actually, in my case, with NUMA enabled and a large amount of memory to be allocated, the above process of getting 10 terabytes ‘touched’ or ‘page faulted’ can be watched in progress in the proc ‘numa_maps’ file of the ‘bequeathing’ process (the process that is starting up the instance). The Linux ‘watch’ utility is handy for this:

[oracle@skynet5 frits]$ watch -n 1 -d cat /proc/535803/numa_maps
00400000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:1 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:1 anon=28 dirty=28 N1=28
129b4000 prefer:1 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=360268 N0=11259 N1=11259 N2=11259 N3=11259 N4=11259 N5=11259 N6=11259 N7=11259 N8
=11259 N9=11259 N10=11259 N11=11259 N12=11258 N13=11258 N14=11258 N15=11258 N16=11258 N17=11258 N18=11258 N19=11258 N20=11258 N21=11258 N22=11258 N23
=11258 N24=11258 N25=11258 N26=11258 N27=11258 N28=11258 N29=11258 N30=11258 N31=11258
fca0000000 prefer:1 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:1 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:1 anon=534 dirty=534 N1=73 N2=461
...

With ‘watch’, ‘-n 1’ means a 1 second interval, ‘-d’ means highlight differences.
When the instance is starting (with NUMA enabled, and with a huge amount of memory allocated for the SGA), first you will see it’s busy in the line starting with ‘80000000’ (which is a memory address). In my case it says ‘interleave:0-31’, which means the memory allocation is interleaved over all the indicated NUMA nodes. A field ‘Nx=y’ means an allocation of y pages on NUMA node x. In my case I see the numbers growing per NUMA node, jumping from one NUMA node to another, for some time. This is the shared pool allocation, which is spread out over all NUMA nodes.

However, after some time, you’ll see this:

00400000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle mapped=3124 N21=3124
10e74000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=1 dirty=1 mapped=27 N1=1 N21=26
10e96000 prefer:0 file=/u01/app/oracle/product/12.1.0.2/dbhome_1/bin/oracle anon=17 dirty=17 mapped=79 N1=17 N21=62
110f1000 prefer:0 anon=28 dirty=28 N1=28
129b4000 prefer:0 heap anon=162 dirty=162 N1=80 N2=82
60000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge
80000000 interleave:0-31 file=/SYSV00000000\040(deleted) huge dirty=516352 N0=16136 N1=16136 N2=16136 N3=16136 N4=16136 N5=16136 N6=16136 N7=16136 N8
=16136 N9=16136 N10=16136 N11=16136 N12=16136 N13=16136 N14=16136 N15=16136 N16=16136 N17=16136 N18=16136 N19=16136 N20=16136 N21=16136 N22=16136 N23
=16136 N24=16136 N25=16136 N26=16136 N27=16136 N28=16136 N29=16136 N30=16136 N31=16136
fca0000000 prefer:25 file=/SYSV00000000\040(deleted) huge dirty=144384 N25=144384
14320000000 prefer:26 file=/SYSV00000000\040(deleted) huge dirty=143872 N26=143872
18960000000 prefer:27 file=/SYSV00000000\040(deleted) huge dirty=12232 N27=12232
1cfa0000000 prefer:0 file=/SYSV00000000\040(deleted) huge
7fcda7c4f000 prefer:0 anon=64 dirty=64 N2=64
7fcda7ccf000 prefer:0 anon=534 dirty=534 N1=73 N2=461

Once the process is done allocating the shared pool, it starts allocating buffer cache memory. It does this per NUMA node. You can see it here in the lines fca0000000, 14320000000 and 18960000000. Actually, I copied these figures while the instance was still allocating memory at memory address 18960000000. You can see it allocates memory local to a specific NUMA node, indicated by ‘prefer:nr’.

At the end, when all memory was allocated, the sqlplus session showed the sizes of the different memory pools and opened the database. This took approximately 90 minutes. At first, I thought this had to do with the parameter ‘PRE_PAGE_SGA’. However, this parameter was set at its default value, FALSE.

Some investigating and some testing revealed this behaviour (touching the entire SGA) was governed by the hidden parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION, which defaults to TRUE in 12.1.0.2 on Linux.

Next I tested setting it to FALSE, and my startup time was reduced to 3.5 minutes (!).
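
For completeness, this is how such a hidden parameter can be inspected (as SYS) and set; the usual caveat applies that underscore parameters should only be changed after checking with Oracle support:

SQL> select a.ksppinm name, b.ksppstvl value
  2  from   x$ksppi a, x$ksppcv b
  3  where  a.indx = b.indx
  4  and    a.ksppinm = '_touch_sga_pages_during_allocation';

SQL> alter system set "_touch_sga_pages_during_allocation" = false scope=spfile;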

I used perf record again to look at where the time is spent when _TOUCH_SGA_PAGES_DURING_ALLOCATION is set to FALSE:

# Overhead          Command         Shared Object                                       Symbol
# ........  ...............  ....................  ...........................................
#
    41.25%  oracle_563318_s  [kernel.kallsyms]     [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--93.26%-- __intel_new_memset
               |          |          |
               |          |          |--79.48%-- ksmnsin
               |          |          |          |
               |          |          |          |--95.98%-- kcbnfy
               |          |          |          |          kscnfy
               |          |          |          |          ksmcsg
               |          |          |          |          opistr_real
               |          |          |          |          opistr
               |          |          |          |          opiodr
               |          |          |          |          ttcpip
               |          |          |          |          opitsk
               |          |          |          |          opiino
               |          |          |          |          opiodr
               |          |          |          |          opidrv
               |          |          |          |          sou2o
               |          |          |          |          opimai_real
               |          |          |          |          ssthrdmain
               |          |          |          |          main
               |          |          |          |          __libc_start_main
               |          |          |          |
               |          |          |          |--3.32%-- kcbw_setup

This is a new backtrace, whose top part (between clear_page_c_e and page_fault) looks like the former backtrace, simply because it’s doing exactly the same thing: handling a page fault. However, the Oracle functions are different. What is happening here is that Oracle calls memset() (a C library function, not a system call) to wipe a memory area. If we go further down the stack trace, we see (full speculation!):
* ksmnsin: kernel system memory numa segment initialisation?
* kcbnfy/kscnfy: kernel cache buffers numa FY?/kernel system cache numa FY?
* ksmcsg: kernel system memory create sga?

When looking at the memory allocations in numa_maps during startup of the bequeathing process with _TOUCH_SGA_PAGES_DURING_ALLOCATION set to FALSE, far fewer memory pages are touched; numa_maps only shows truly allocated pages, not the logical allocations. The same allocations are done per NUMA node for the shared pool (passing over the NUMA nodes several times), followed by the buffer cache allocations per NUMA node, again passing over the NUMA nodes several times. But because of the much smaller number of pages actually touched, the startup time of the instance is greatly reduced.

So, is this a solution to the long startup time?
On one hand yes, on the other hand no. You need to page the memory in, no matter what you do; if you choose not to page it during startup (_TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE), you need to do it later. Sounds logical, right: pay now, or pay later? But what does that mean?

After startup of the instance with _TOUCH_SGA_PAGES_DURING_ALLOCATION=FALSE, a lot of the memory is not really allocated yet. If you start to use the pristine SGA with a normal session, you pay the price of the paging that the bequeathing session otherwise took. Let me show you a ‘perf record -g’ example of a full table scan (one that reads the Oracle blocks into the buffer cache, rather than doing a direct path read):

# Overhead          Command       Shared Object                                Symbol
# ........  ...............  ..................  ....................................
#
    61.25%  oracle_564482_s  [kernel.kallsyms]   [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--99.99%-- hugetlb_fault
               |          |
               |          |--99.86%-- handle_mm_fault
               |          |          __do_page_fault
               |          |          do_page_fault
               |          |          page_fault
               |          |          __intel_ssse3_rep_memcpy
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

If we follow the call stack from bottom to top:
* qergsFetch, qertbFetch: query execute row source code.
* kdsttgr: kernel data scan table get row.
* kdstf00000010000100kmP: kernel data scan table full, the ultra fast full table scan.
* kdst: other functions in kernel data scan table functions.
* ktrget2: kernel transaction layer.
* kcbgtcr: kernel cache buffers get consistent row.
* kcbz: kernel cache buffers Z, physical IO helper functions.
* __intel_ssse3_rep_memcpy: this is actually the function memcpy, but replaced by an optimised version for intel CPUs.
* page_fault, do_page_fault, __do_page_fault, handle_mm_fault, hugetlb_fault, clear_page_c_e: these are the same page fault kernel functions we saw earlier.

This clearly shows my full table scan now needs to do the paging!

If I set _TOUCH_SGA_PAGES_DURING_ALLOCATION to TRUE, start up the instance (during which the pages are touched and thus paged in), and profile a full table scan, I see:

# Overhead          Command         Shared Object                                    Symbol
# ........  ...............  ....................  ........................................
#
    24.88%  oracle_577611_s  oracle                [.] ksl_get_shared_latch
            |
            --- ksl_get_shared_latch
               |
               |--78.46%-- kcbzfb
               |          kcbzgb
               |          |
               |          |--99.63%-- kcbzgm
               |          |          kcbzibmlt
               |          |          kcbzib
               |          |          kcbgtcr
               |          |          ktrget2
               |          |          kdst_fetch0
               |          |          kdst_fetch
               |          |          kdstf00000010000100kmP
               |          |          kdsttgr
               |          |          qertbFetch
               |          |          qergsFetch

This shows that the most prominent function called is ‘ksl_get_shared_latch’. No paging to be seen.

At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had at this point: what about PRE_PAGE_SGA? In essence, this parameter is supposed to more or less solve the same issue: having the SGA pages touched at startup to prevent paging for foreground sessions.

BTW, if you read about PRE_PAGE_SGA in the online documentation, it gives a reason for using PRE_PAGE_SGA which is not true (that page table entries are prebuilt for the SGA pages), and it indicates the paging (= page faults) is done at startup, which also is not true. It also claims ‘every process that starts must access every page in the SGA’, which again is not true.
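
To see the actual behaviour, I set the parameter and restarted the instance (shown as a sketch; any of the usual ways of setting an instance parameter will do):

SQL> alter system set pre_page_sga = true scope=spfile;
SQL> shutdown immediate
SQL> startup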

From what I can see, what happens when PRE_PAGE_SGA is set to TRUE is that a background process is started which starts touching all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is ‘sa00’. When recording the backtraces of that process, I see:

# Overhead        Command      Shared Object                                      Symbol
# ........  .............  .................  ..........................................
#
    97.57%  ora_sa00_slob  [kernel.kallsyms]  [k] clear_page_c_e
            |
            --- clear_page_c_e
               |
               |--100.00%-- hugetlb_fault
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          ksmprepage_memory
               |          ksm_prepage_sga_seg
               |          skgmapply
               |          ksmprepage
               |          ksm_sslv_exec_cbk
               |          ksvrdp
               |          opirip
               |          opidrv
               |          sou2o
               |          opimai_real
               |          ssthrdmain
               |          main
               |          __libc_start_main
                --0.00%-- [...]

The kernel paging functions are exactly the same as we have seen several times now. It’s clear the functions executed by this process are specifically for the prepage functionality. The pre-paging done on behalf of _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen from the Oracle function names). PRE_PAGE_SGA seems to be a ‘workaround’ if you don’t want to spend the long time paging at startup, but still want to page the memory as soon as possible after startup. Needless to say, this is not the same as _TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE: with PRE_PAGE_SGA the paging is done serially, by a single process, after startup, when the database is already open for usage. So normal foreground processes that encounter non-paged memory (which means they use it before the sa00 process has paged it) still need to do the paging themselves.

Conclusion
If you want to allocate a large SGA with Oracle 12.1.0.2 (this may apply to earlier versions too), the startup time can be significant. The reason is that the bequeathing session pages the memory in on startup. This can be turned off by setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE; as a result, foreground (normal user) sessions need to do the paging. You can set the PRE_PAGE_SGA parameter to TRUE to do the paging, however the paging is then done by a single process (sa00) that serially pages the memory after startup. Foreground processes that encounter non-paged memory, which means they use it before the sa00 process could page it, still need to page it themselves.

Thanks to: Klaas-Jan Jongsma for proofreading.

Friday Philosophy – Half Million Views

I only noticed yesterday that I’ve had just over half a million views on my blog since I started it back in 2009.


I think that is direct views on the blog and does not include the odd syndicated place such as the Oaktable website. Though in my case, I don’t think syndicated views make much difference:-)

I know that there are many Oracle blogs and sites that get this volume of traffic in a couple of months, maybe even a few weeks for the top 2 or 3 (I should get Tim Hall drunk and ask him about OracleBase), but I’m still proud of keeping this blog going, and that a small number of hundreds pop by each day. Except weekends and holidays, that is, when I am relieved to say most people find something way better to do than look at my blog! I also do not know how the number of views relates to the number of individuals who have at some point (whether once or a hundred times) looked at something on my blog. If 20 visits is the average, that would mean 25 thousand people have been to my blog. That’s a small town!!!

Admittedly my blog is more my Friday Philosophies rather than deep technical content these days and it is the technical content that accounts for the bulk of visits to my blog, so I am slowly aging out!

Maybe this will be the prompt to do what I keep saying I’ll do (for about 4 years now) and do more actual technical content – you know, things with a SQL statement in it or a chunk of PL/SQL…

So to everyone who has popped by – thank you very much.

Let me START WITH sequences

It’s always cool that you can learn stuff every single day, even on the simplest of topics.  This one came from an AskTom question.  We define a sequence to start with 100, and then alter it to modify the INCREMENT BY.



SQL> create sequence test_seq INCREMENT BY 25 START WITH 100 MAXVALUE 1000 NOCACHE NOCYCLE;

Sequence created.

SQL> @pt "select * From user_sequences where sequence_name = 'TEST_SEQ'"
SEQUENCE_NAME                 : TEST_SEQ
MIN_VALUE                     : 1
MAX_VALUE                     : 1000
INCREMENT_BY                  : 25
CYCLE_FLAG                    : N
ORDER_FLAG                    : N
CACHE_SIZE                    : 0
LAST_NUMBER                   : 100
PARTITION_COUNT               :
SESSION_FLAG                  : N
KEEP_VALUE                    : N
-----------------

PL/SQL procedure successfully completed.

SQL> alter sequence test_seq INCREMENT BY 1 MAXVALUE 999999 NOCACHE NOCYCLE;

Sequence altered.

SQL> select test_seq.nextval from dual;

So the question is – what will the NEXTVAL query above return?

Well, the following might come as a surprise.



SQL> select test_seq.nextval from dual;

   NEXTVAL
----------
        76

Yes, even though the sequence was defined as START WITH 100, the combination of START WITH 100 and INCREMENT BY 25 had an impact on the stored LAST_NUMBER value when the ALTER command was issued: in effect, the pending value was adjusted by the difference between the old and the new increment, i.e. 100 - 25 + 1 = 76.


SQL> @pt "select * From user_sequences where sequence_name = 'TEST_SEQ'"
SEQUENCE_NAME                 : TEST_SEQ
MIN_VALUE                     : 1
MAX_VALUE                     : 999999
INCREMENT_BY                  : 1
CYCLE_FLAG                    : N
ORDER_FLAG                    : N
CACHE_SIZE                    : 0
LAST_NUMBER                   : 76
PARTITION_COUNT               :
SESSION_FLAG                  : N
KEEP_VALUE                    : N
-----------------

PL/SQL procedure successfully completed.

SQL>

This is alluded to in the documentation for ALTER SEQUENCE

“If you change the INCREMENT BY value before the first invocation of NEXTVAL, then some sequence numbers will be skipped. Therefore, if you want to retain the original START WITH value, you must drop the sequence and re-create it with the original START WITH value and the new INCREMENT BY value. “
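
So, following the documentation: if you want the next value handed out to be 100 after changing the increment, drop and re-create the sequence along these lines:

SQL> drop sequence test_seq;

Sequence dropped.

SQL> create sequence test_seq INCREMENT BY 1 START WITH 100 MAXVALUE 999999 NOCACHE NOCYCLE;

Sequence created.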

Options For Pre-EM13c Target Software and Non-Supported OS in EM13c

style="display:inline-block;width:320px;height:100px"
data-ad-client="ca-pub-5103295461547706"
data-ad-slot="5182487270">

Per Oracle documentation, you’re able to upgrade from EM12c to EM13c, including the OMR (Oracle Management Repository) to DB12c and the OMS (Oracle Management Service) to EM13c, while leaving the agent software at version 12.1.0.3 or higher; upgrades of the agents can then be performed later on using the Gold Agent Image, removing a lot of work for the administrator.


This is a great option for many customers, but what if your existing EM environment isn’t healthy and you need to start fresh?  What options are left for you then?

I’ve covered (and no, it isn’t supported by Oracle) the ability to discover a target by doing a silent deployment push from the target host back to the OMS using a 12.1.0.5 software image, simply by adding the -ignoreprereqs argument.  This does work, and will deploy at least the last version of the EM12c agent to the EM13c environment.  What I don’t know is whether this works for an installation of software that was upgraded to a supported version in the past (i.e. 12.1.0.2 upgraded to 12.1.0.5, etc.); I haven’t tested this and I can’t guarantee it, but it’s worth a try.  The same goes for an unsupported OS version, but I think if you push the deploy from the target and choose to ignore the prerequisite checks, it may successfully add the target.

If you have an earlier version of the EM12c agent, 12.1.0.3 to 12.1.0.5, that can’t be updated to EM13c, there is still hope outside of what I’ve proposed.  The word on the street is that with the release of 13.2 there will be backward support for earlier versions of the agent software, and that WILL be fully supported by Oracle.

That also offers a silver lining for those that may be considering going to EM13c, won’t be upgrading their agents, and want to take advantage of redirecting existing targets with EM12c agent software to the new installation.  I’m assuming they’ll simply run the following command to redirect those targets (or something close!):

emctl secure agent 

I have high hopes for this option being available in 13.2 and will cross my fingers along with you.



Copyright © DBA Kevlar [Options For Pre-EM13c Target Software and Non-Supported OS in EM13c], All Rights Reserved. 2016.

BI Publisher will not start (using emctl oms start and emctl oms start -bip_only)

*Customer names obfuscated to protect the innocent (and the guilty!)

This post covers a particular customer’s recovery scenario. They had reverted their config.xml to a previous version due to a failed attempt to integrate LDAP into the WLS myrealm, and ever since, they have been unable to start BI Publisher via the emctl command. Although they can start it from the WLS Admin Console, ‘emctl status oms -details’ shows the BI Publisher Server is Down, and they can’t even connect to it via https://em12c.example.com:9702/xmlpserver/. They then attempted to run ‘emcli setup_bipublisher’ to re-register it with EM, but get:

$ ./emcli setup_bipublisher -proto=https -host=em12c.example.com -port=9702 -uri=xmlpserver -force
Error: BI Publisher could not be setup.:
Error inserting BIPInstance row for :https://em12c.example.com:9702/xmlpserver: javax.management.MBeanException: JPS-01051:
Credential audit events cannot be logged. Reason oracle.security.jps.service.audit.AuditException: JPS-00054: Failed to
create the auditor for JPS.

And the BIP.out file shows:
oracle.webservices.provider.ProviderException: java.lang.RuntimeException: java.security.AccessControlException: access denied (oracle.security.jps.JpsPermission IdentityAssertion)
at oracle.j2ee.ws.server.provider.ProviderProcessor.configureProviderManagement(ProviderProcessor.java:367)
at oracle.j2ee.ws.server.provider.ProviderProcessor.init(ProviderProcessor.java:198)

So what went wrong?

Well, we don’t have the details to understand what went wrong with their attempt to integrate LDAP into WLS, but what we can tell is that the config.xml they recovered was from an intermediate stage of running ‘configureBIP’. This can be seen in the following diff between config.xml versions:

132c158,159
< 9700
---
> 9701
> false

You would only ever see a listen port of 9700 after the WebLogic domain has been extended with BIP, but before further configuration occurs.

I see no viable approach to any sort of ‘resumption’ of configureBIP from where it left off, as shown in the above config.xml. The inherent problem with manipulating config.xml files directly is that there are many other WebLogic artifacts tied to config.xml, some of which are utilized by BIP. For example (there are numerous others):

  • JMS Queues
  • JNDI Datasources
  • Encrypted WebLogic files that are part of WebLogic, which require a private, embedded key.

Even if it were possible to somehow hand-rebuild their environment, it would set this customer up for all kinds of potential issues down the road, e.g. when it is time to perform some of these follow-on tasks:

  • Adding additional OMSs (including BIPs)
  • Upgrading EM12c to EM13c
  • Upgrading the repository database
  • etc.

How does this customer recover from this scenario?

For all intents and purposes, this needs to be treated as a corruption of the EM middleware and WebLogic domain. As with filesystem corruption or hardware failure, we have a whole section of the Enterprise Manager Cloud Control Advanced Installation and Configuration Guide on backing up and restoring Enterprise Manager. Thankfully, they have a backup of the OMS that was taken prior to running configureBIP. This was made with:
$ emctl exportconfig oms

which we perform automatically for them. They should use this as the restore point to rebuild EM, following the relevant parts of section 21.6.3 of the guide.

The post BI Publisher will not start (using emctl oms start and emctl oms start -bip_only) appeared first on PeteWhoDidNotTweet.com.

Amis Conference June 2nd and 3rd

I will be at the AMIS conference next Friday in Leiden, not far from Amsterdam in Holland. The conference is held over two days, June 2nd and 3rd, but I will be there just on the Friday due to other....[Read More]

Posted by Pete On 26/05/16 At 11:28 AM

Storing Date Values As Characters (What’s Really Happening)

For something that’s generally considered an extremely bad idea, I’ve lost count of the number of times I’ve come across applications that insist on storing date values as characters within the database. We’ve all seen them … I recently got called in to assist a customer who was having issues with a POC in relation […]

CBO++

While browsing the web recently for articles on the HyperLogLog algorithm that Oracle uses for some of its approximate functions, I came upon a blog post written in Jan 2014 with the title Use Subqueries to Count Distinct 50X Faster. There are various ways that subqueries can be used to rewrite queries for improved performance, but when the title caught my eye I couldn’t think of a way in which they could improve “count distinct”.  It turned out that the word “subquery” was being used (quite correctly) in the sense of “inline view” while my mind had immediately turned to subqueries in the select list or where clause.

The article started by pointing out that if you have a query that does a join then aggregates the result you might be able to improve performance by finding a way of rewriting the query to aggregate before doing the join. (See this note from 2008). The article then went one step further to optimise a “count distinct” by wrapping a “select count” around a “select distinct” inline view as follows:
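
To make that first idea concrete before we get to the “count distinct” example, here is a sketch of the shape of such a rewrite, using hypothetical orders and customers tables (mine, not the article’s):

-- join first, then aggregate
select   c.name, sum(o.amount) total
from     customers c, orders o
where    o.cust_id = c.id
group by c.name;

-- aggregate the detail table first, then join the (much smaller) result
select c.name, t.total
from   customers c,
       ( select   cust_id, sum(amount) total
         from     orders
         group by cust_id
       ) t
where  t.cust_id = c.id;

(The two forms are equivalent only under suitable assumptions – here, that customers.id is unique and each id maps to a single name.)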

Original
--------
  select
    dashboard_id,
    count(distinct user_id) as ct
  from time_on_site_logs 
  group by dashboard_id

Rewrite
-------
select 
    inline.dashboard_id, 
    count(1) as ct
  from (
    select distinct dashboard_id, user_id
    from time_on_site_logs
  ) as inline
  group by inline.dashboard_id

(I’ve reproduced only the central part of the query being examined and I’ve changed the name of the inline view to eliminate the potential visual confusion due to the word “distinct” appearing in its name in the original).

The article was written using Postgres SQL, with the comment that the technique was universal; and this brings me to the point of this post. The technique can indeed be applied to Oracle’s dialect of SQL. Both ideas are good ideas whose effectiveness depends on the data patterns, data volume, and (potentially) indexing; but you may not need to rewrite the code, because the optimizer is programmed to know that the ideas are good and it can transform your query to the appropriate form internally. The “place group by” transformation appeared in 11.1.0.6 in 2007, and the “transform distinct aggregation” transformation appeared in 11.2.0.1 in 2009.

Here’s a little demo of Oracle handling a variation of the query I’ve shown above:


rem     Script: transform_distinct_agg.sql
rem     Dated:  May 2016
rem     Author: J.P.Lewis

create table t1 nologging 
as 
select  * 
from    all_objects 
where   rownum <= 60000
;
execute dbms_stats.gather_table_stats(user,'t1', method_opt=>'for all columns size 1')

alter session set statistics_level = all;

select owner, count(distinct object_type) from t1 group by owner;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));

prompt  ===============
prompt  Rewritten query
prompt  ===============

select  owner, count(1)
from    (
         select distinct owner, object_type
         from   t1
        ) distinct_types
group by
        owner
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));

Here are the two execution plans, pulled from memory – with the outline and some other peripheral lines deleted:


-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |      1 |        |      5 |00:00:00.23 |     865 |       |       |          |
|   1 |  HASH GROUP BY       |           |      1 |      5 |      5 |00:00:00.23 |     865 |  1452K|  1452K|  728K (0)|
|   2 |   VIEW               | VM_NWVW_1 |      1 |     78 |     30 |00:00:00.23 |     865 |       |       |          |
|   3 |    HASH GROUP BY     |           |      1 |     78 |     30 |00:00:00.23 |     865 |  4588K|  1708K| 2497K (0)|
|   4 |     TABLE ACCESS FULL| T1        |      1 |  60000 |  60000 |00:00:00.12 |     865 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------

===============
Rewritten query
===============

------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |      5 |00:00:00.23 |     865 |       |       |          |
|   1 |  HASH GROUP BY       |      |      1 |      5 |      5 |00:00:00.23 |     865 |  1452K|  1452K|  735K (0)|
|   2 |   VIEW               |      |      1 |     78 |     30 |00:00:00.23 |     865 |       |       |          |
|   3 |    HASH UNIQUE       |      |      1 |     78 |     30 |00:00:00.23 |     865 |  4588K|  1708K| 1345K (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |  60000 |  60000 |00:00:00.12 |     865 |       |       |          |
------------------------------------------------------------------------------------------------------------------

Apart from the change from “HASH UNIQUE” to “HASH GROUP BY” the two plans are the same, using the same resources – the UNIQUE being a special case of the algorithm for the GROUP BY. Here (with some cosmetic editing) is the SQL of the “unparsed query” taken from the 10053 (CBO) trace file – notice how similar it is to the text suggested by the original article, in particular the inline view to get the distinct list of owner and object_type (using a group by with no aggregated columns, rather than a distinct):

SELECT 
        VM_NWVW_1.$vm_col_2 OWNER,
        COUNT(VM_NWVW_1.$vm_col_1) COUNT(DISTINCTOBJECT_TYPE)
FROM    (
                SELECT
                        T1.OBJECT_TYPE $vm_col_1,
                        T1.OWNER $vm_col_2
                FROM    TEST_USER.T1 T1
                GROUP BY 
                        T1.OWNER,T1.OBJECT_TYPE
        ) VM_NWVW_1
GROUP BY
        VM_NWVW_1.$vm_col_2
;

The Oracle optimizer is pretty good at finding efficient transformations for the query you wrote, so rather than rewriting a query (with the chance of making a mistake as you do so), you may only need to add a couple of hints to generate a suitable SQL Plan Baseline that you can attach to the original query.
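
For example, the positive form of the hint that appears in the footnote below ought to be enough to force the transformation on the original text – a sketch, so check the exact hint name against V$SQL_HINT in your version:

select  /*+ qb_name(main) transform_distinct_agg(main) */
        owner, count(distinct object_type)
from    t1
group by
        owner
;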

Footnote:

Sometimes the optimizer will decide not to transform when it should, or decide to transform when it shouldn’t, so it’s nice to know that there are hints to block transformations – here’s the effect of adding /*+ qb_name(main) no_transform_distinct_agg(main) */ to my query:


----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      5 |00:00:00.25 |     865 |       |       |          |
|   1 |  SORT GROUP BY     |      |      1 |      5 |      5 |00:00:00.25 |     865 |  4096 |  4096 | 4096  (0)|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  60000 |  60000 |00:00:00.12 |     865 |       |       |          |
----------------------------------------------------------------------------------------------------------------

The interesting thing to note here is that even though the query took a little longer to complete, the amount of memory allocated to run it was only 4K, compared to the 2M needed by the transformed query. (In this example both workareas would have been in existence at the same time – that won’t be true of every query using multiple workareas.) This isn’t significant in this trivial case, but it demonstrates the point that sometimes there is no single best path – you can choose the path that protects the resource that’s under most pressure.