Oakies Blog Aggregator

Flamegraph with function annotation

Recently, Franck Pachot tweeted the following:
(screenshot of Franck Pachot's tweet)
This is a very clever way of combining Brendan Gregg's flame graphs with the function annotations from my ora_functions project database to annotate Oracle database process flame graphs.

However, the awk script uses a full match using f[$2]:
– f is an array with the function name annotations, filled using {f[$1]=$2;next}.
– /<title>/ means to only work on lines that have <title> in them.
– In those lines the oracle function is stuck to <title>, for example: <title>kcbgtcr.
– The first sub() function (substitute) places a space after "<title>" so it looks like: <title> kcbgtcr; in the replacement, "&" stands for the text matched by the pattern.
– FS=" " means the field separator is a space.
– The next sub() function substitutes on the <title> pattern and inserts the annotation from the f array, looked up with the second field ($2) of the input, which now is the oracle function thanks to the field separator and the previous sub() function.
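To make the mechanism concrete, here is a minimal, self-contained sketch of the same idea. The file contents and the annotation string are hypothetical, and this is not the exact one-liner from the tweet; it only demonstrates the two-pass awk pattern described above:

```shell
#!/bin/sh
# Minimal sketch of the annotation idea (hypothetical file contents; not the
# exact one-liner from the tweet). functions.csv maps an oracle function name
# to an annotation; the SVG <title> lines carry the function names.
cat > functions.csv <<'EOF'
kcbgtcr kcb:get_consistent_read
EOF

cat > flamegraph.svg <<'EOF'
<title>kcbgtcr (10 samples)</title>
EOF

# Pass 1 (NR==FNR) loads the csv into array f.
# Pass 2 puts a space after <title> so the function name becomes field $2,
# then splices the matching annotation in front of it.
awk 'NR==FNR { f[$1]=$2; next }
     /<title>/ { sub("<title>","& "); if ($2 in f) sub(" "," "f[$2]" ") }
     { print }' functions.csv flamegraph.svg > annotated.svg

cat annotated.svg
```

The annotated line then reads `<title> kcb:get_consistent_read kcbgtcr (10 samples)</title>`, so the browser tooltip shows the annotation next to the function name.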

But…the function names in the database/csv file are built up in layers. I did this deliberately to avoid having to type all the layer names for a function when entering new ones, but more importantly so I can actually identify the code layers in the oracle executable. Because of the layers, I don't need the full function descriptions; I can still get an understanding of what is going on by looking at the layer in which a function is called.

To get the layers into the function names, I added a script to my ora_functions repository called 'annotate_flamegraph.awk', which takes the function name csv file exactly like Franck did, but builds up the annotation of every function in an existing flamegraph layer by layer. It simply takes an existing flamegraph SVG file with oracle database functions as input and writes the annotated version to STDOUT, so this is how it's run to get an annotated SVG:

$ ./annotate_flamegraph.awk dt_11856.txt_1.fold.svg > dt_11856.txt_1.fold.annotated.svg

This is what it looks like (please mind these are snapshots of flamegraph output in a browser):
(annotated flamegraph screenshot)
(click on the graph to see it in the original size)
If you hover over a row in the flamegraph, it will show the full information in the bottom row; in this case I hovered over ttcpip.
If you click a row, it will zoom in to that row. The next picture is a good way to show that the function layers and the stack make it obvious what is going on, even while most of the functions do not have the full annotation:
(zoomed annotated flamegraph screenshot)
(click on the graph to see it in the original size)

Again, thanks to Franck for the inspiration. I hope this will be useful for investigating what oracle is doing!

You don’t need the PLAN_TABLE table

This post is about the following error you may get when looking at an execution plan after setting the current_schema:

Error: cannot fetch last explain plan from PLAN_TABLE

It is related to old versions and to relics from even older versions.

In the old times, PLAN_TABLE was a permanent shared regular table created by utlxplan.sql. Since Oracle 8i, which introduced Global Temporary Tables, the PLAN_TABLE public synonym has referred to SYS.PLAN_TABLE$, a GTT that is not shared and is emptied at the end of your session.

When I want to tune a query, I usually connect with my DBA user and change my session schema to the application one, so that I can explain or run the user query without having to prefix all tables. But when there is a PLAN_TABLE in the current schema, DBMS_XPLAN.DISPLAY may fail:

SQL> alter session set current_schema=SCOTT;
Session altered.
SQL> explain plan for select * from dual;
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT                                                  
--------------------------------------------------------------------
Error: cannot fetch last explain plan from PLAN_TABLE

I can check that, in addition to the SYS GTT and the PUBLIC synonym, there's a PLAN_TABLE in this schema:

SQL> select owner,object_name,object_type,created from all_objects where object_name like 'PLAN_TABLE%' and owner in (sys_context('userenv','current_schema'),'PUBLIC','SYS');
OWNER    OBJECT_NAME  OBJECT_TYPE  CREATED
-------- ------------ ------------ -----------------
SCOTT    PLAN_TABLE   TABLE        07-MAR-2014 09:15
PUBLIC   PLAN_TABLE   SYNONYM      29-JAN-2014 21:47
SYS      PLAN_TABLE$  TABLE        29-JAN-2014 21:47

This one is annoying and not needed, so I drop it:

SQL> drop table PLAN_TABLE;
Table PLAN_TABLE dropped.

Of course, you must be sure that this is not an application table that happens to have the same name. Just select from it to see what is there, and in case of doubt, rename it instead of dropping it:

SQL> rename PLAN_TABLE to "old plan table, delete me";
Table renamed.

Now EXPLAIN PLAN and DBMS_XPLAN.DISPLAY work as expected:

SQL> select * from table(dbms_xplan.display);
Plan hash value: 272002086

-------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Time     |
-------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |     2 | 00:00:01 |
|   1 |  TABLE ACCESS FULL| DUAL |     1 |     2 | 00:00:01 |
-------------------------------------------------------------

Actually, I think the problem does not appear in recent versions: I checked on 18c, where EXPLAIN PLAN inserts into the current schema (in 11g it inserted into the connected user's schema).

Making some more sense of direct path reads during primary key lookups

After having published my first article of 2019 I have received feedback I felt like including. With a nod to @fritshoogland, @ChrisAntognini and @FranckPachot.

In the previous post I showed you output of Tanel Poder’s ashtop.sql as proof that direct path reads can occur even if all you do is look up data by primary key. This script touches v$active_session_history, and I’m not getting tired of mentioning that you need to license the system in scope for Enterprise Edition and the Diagnostics Pack to do so.

What I haven't shown you in more detail in the previous article is what causes these direct path reads. This is what I'll try to do next. I have started by repeating my test case, but added the gather_plan_statistics hint to the statement, resulting in a different SQL ID. Yesterday I noticed that Safari doesn't show the scroll bar at the bottom of this code snippet that other browsers show; I hope the output isn't garbled. I'm using the universal excuse here: "it works on my laptop".

SQL> @xia a66axyw2g6yjb
Display Execution plan in advanced format for sqlid a66axyw2g6yjb

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  a66axyw2g6yjb, child number 0
-------------------------------------
select /*+ gather_plan_statistics CLOB standalone */ *  from
martin.orders_clob  where order_id = :1

Plan hash value: 3284193906

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |      1 |        |       |     3 (100)|          |      0 |00:00:00.01 |       3 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ORDERS_CLOB    |      1 |      1 |   137 |     3   (0)| 00:00:01 |      0 |00:00:00.01 |       3 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ORDERS_CLOB |      1 |      1 |       |     2   (0)| 00:00:01 |      0 |00:00:00.01 |       3 |
----------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1 / ORDERS_CLOB@SEL$1
   2 - SEL$1 / ORDERS_CLOB@SEL$1

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('18.1.0')
      DB_VERSION('18.1.0')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      INDEX_RS_ASC(@"SEL$1" "ORDERS_CLOB"@"SEL$1" ("ORDERS_CLOB"."ORDER_ID"))
      END_OUTLINE_DATA
  */

Peeked Binds (identified by position):
--------------------------------------

   1 - :1 (NUMBER): 519990

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ORDER_ID"=:1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - "ORDER_ID"[NUMBER,22], "ORDERS_CLOB"."ORDER_CLOB"[LOB,4000], "ORDERS_CLOB"."ORDER_DATE"[TIMESTAMP WITH LOCAL TIME
       ZONE,11], "ORDERS_CLOB"."ORDER_MODE"[VARCHAR2,8], "ORDERS_CLOB"."CUSTOMER_ID"[NUMBER,22],
       "ORDERS_CLOB"."ORDER_STATUS"[NUMBER,22], "ORDERS_CLOB"."ORDER_TOTAL"[NUMBER,22], "ORDERS_CLOB"."SALES_REP_ID"[NUMBER,22],
       "ORDERS_CLOB"."PROMOTION_ID"[NUMBER,22], "ORDERS_CLOB"."WAREHOUSE_ID"[NUMBER,22], "ORDERS_CLOB"."DELIVERY_TYPE"[VARCHAR2,15],
       "ORDERS_CLOB"."COST_OF_DELIVERY"[NUMBER,22], "ORDERS_CLOB"."WAIT_TILL_ALL_AVAILABLE"[VARCHAR2,15],
       "ORDERS_CLOB"."DELIVERY_ADDRESS_ID"[NUMBER,22], "ORDERS_CLOB"."CUSTOMER_CLASS"[VARCHAR2,30],
       "ORDERS_CLOB"."CARD_ID"[NUMBER,22], "ORDERS_CLOB"."INVOICE_ADDRESS_ID"[NUMBER,22]
   2 - "ORDERS_CLOB".ROWID[ROWID,10], "ORDER_ID"[NUMBER,22]


57 rows selected.

After a few minutes of executing the statement repeatedly, direct path reads are clearly visible again:

SELECT
    COUNT(*),
    event,
    session_state
FROM
    v$active_session_history ash
WHERE
    sql_id = 'a66axyw2g6yjb'
    AND sample_time > SYSDATE - 15 / 1440
GROUP BY
    event,
    session_state
ORDER BY
    1 DESC;

  COUNT(*) EVENT                                    SESSION
---------- ---------------------------------------- -------
      1592                                          ON CPU
        24 direct path read                         WAITING
         6 cursor: pin S                            WAITING
         4 SQL*Net more data to client              WAITING

So far so good. Getting back to the original question: what’s causing these? ASH provides the answer to that question as well.

SELECT
    COUNT(*),
    event,
    session_state,
    sql_plan_operation
    || ' '
    || sql_plan_options AS what,
    CASE
        WHEN wait_class IN(
            'Application',
            'Cluster',
            'Concurrency',
            'User I/O'
        )THEN object_name
        ELSE 'undef'
    END AS obj
FROM
    v$active_session_history ash,
    dba_objects o
WHERE
    ash.current_obj# = o.object_id
    AND sql_id = 'a66axyw2g6yjb'
GROUP BY
    event,
    session_state,
    sql_plan_operation
    || ' '
    || sql_plan_options,
    CASE
        WHEN wait_class IN(
            'Application',
            'Cluster',
            'Concurrency',
            'User I/O'
        )THEN object_name
        ELSE 'undef'
    END
ORDER BY 1 desc;

  COUNT(*) EVENT                                    SESSION WHAT                           OBJ
---------- ---------------------------------------- ------- ------------------------------ ------------------------------
      3006                                          ON CPU  SELECT STATEMENT               undef
       223                                          ON CPU  INDEX UNIQUE SCAN              undef
       126                                          ON CPU  TABLE ACCESS BY INDEX ROWID    undef
        50 direct path read                         WAITING SELECT STATEMENT               SYS_LOB0000081921C00002$$
         8 cursor: pin S                            WAITING SELECT STATEMENT               SYS_LOB0000081921C00002$$
         8 SQL*Net more data to client              WAITING SELECT STATEMENT               undef
         1                                          ON CPU                                 undef

7 rows selected.

I think you might be getting the idea by now … I can link the segment back to DBA_LOBS, and … it’s the LOB column of ORDERS_CLOB.

SELECT
    table_name,
    column_name,
    segment_name,
    cache,
    securefile
FROM
    dba_lobs
WHERE
    segment_name = 'SYS_LOB0000081921C00002$$';

TABLE_NAME      COLUMN_NAME     SEGMENT_NAME                   CACHE      SEC
--------------- --------------- ------------------------------ ---------- ---
ORDERS_CLOB     ORDER_CLOB      SYS_LOB0000081921C00002$$      NO         YES

QED.

Another option is to query session statistics. I looked at one of the sessions using snapper4.sql but the output was too wide to paste it as text. I’ll resort to the traditional way of doing this:

SELECT
    name,
    value
FROM
    v$sesstat st
    NATURAL JOIN v$statname sn
WHERE
    st.sid = 128
    AND st.value <> 0
    AND REGEXP_LIKE(name, '(physical|securefile).*read.*')
ORDER BY
    sn.name;

NAME                                           VALUE
----------------------------------------- ----------
physical read IO requests                     106836
physical read bytes                        875200512
physical read total IO requests               106836
physical read total bytes                  875200512
physical reads                                106836
physical reads direct                         106836
physical reads direct (lob)                   106836
securefile direct read bytes               875200512
securefile direct read ops                    106836

9 rows selected.

I admit this was a rather manual way of extracting this piece of information; you might be able to get the same data from an ASH report (provided the number of I/O samples against that segment is significant enough to show up). In this context I'd like to recommend planx.sql, which you can find in Carlos Sierra's and Mauro Pagano's github repository. It's small, lightweight, and incredibly useful.

Oracle global vs. partition level statistics CBO usage

Global statistics are complex to gather. Gathering on the whole table can take very long and doesn't 'scale', because the duration increases with the volume. Incremental gathering can be a solution but has its side effects (such as the size of the synopses). But having stale global statistics can be dangerous. Do you know when the optimizer bases its estimation on global statistics and when on partition level statistics? The partition level statistics are used only:

  • when partition pruning iterates on only one single partition
  • when this partition is known at optimization time during the parse phase

This is clearly explained in Jonathan Lewis' 'Cost-Based Oracle Fundamentals'.

I also tested on the current release (18c) to verify this. Here is the test-case.

I create a table partitioned by year, fill 100 rows in one partition and 5000 rows in another. Then I gather the statistics:

SQL> create table DEMO (day date) partition by range(day) (
partition P2018 values less than (date '2019-01-01'),
partition P2019 values less than (date '2020-01-01'),
partition P2020 values less than (date '2021-01-01'),
partition P2021 values less than (date '2022-01-01')
);
Table DEMO created.
SQL> insert into DEMO 
select date '2019-01-01'+rownum from xmltable('1 to 100');
100 rows inserted.
SQL> insert into DEMO 
select date '2018-01-01'-rownum/24 from xmltable('1 to 5000');
5,000 rows inserted.
SQL> exec dbms_stats.gather_table_stats(user,'DEMO');
PL/SQL procedure successfully completed.

My goal is to show when the partition or the global statistics are used.

Here are the statistics:

SQL> select partition_name,object_type,num_rows,global_stats 
,to_date((regexp_replace(
extract(dbms_xmlgen.getxmltype(
'select high_value from dba_tab_partitions
where table_owner='''||owner||'''
and table_name='''||table_name||'''
and partition_name='''||partition_name||''''
),'/ROWSET/ROW/HIGH_VALUE/text()').getstringval()
,'[^;]*apos; *([^;]*) *[^;]apos;.*','\1'))
,'yyyy-mm-dd hh24:mi:ss') high_value
from dba_tab_statistics
where owner=user and table_name='DEMO' order by 1 nulls first;
PARTITION_ OBJECT_TYPE    NUM_ROWS GLO HIGH_VALUE
---------- ------------ ---------- --- -----------
           TABLE              5100 YES
P2018      PARTITION          5000 YES 01-JAN-2019
P2019      PARTITION           100 YES 01-JAN-2020
P2020      PARTITION             0 YES 01-JAN-2021
P2021      PARTITION             0 YES 01-JAN-2022

I'll query dates in 2019, where I have 100 rows. But the global statistics count 5100 rows. Let's see which statistics are used.

Pstart = Pstop (PARTITION RANGE SINGLE) -> partition

This first query reads the rows from only one partition, and the optimizer knows it at parse time because I use literals:

SQL> select count(*) from DEMO where day between to_date( '2019-01-08','yyyy-mm-dd' ) and to_date( '2019-02-08','yyyy-mm-dd' ) ;
COUNT(*)
----------
32
SQL> select * from dbms_xplan.display_cursor(
format=>'basic +rows +outline +peeked_binds +partition');
PLAN_TABLE_OUTPUT                                                   
--------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(*) from DEMO
where day between to_date( '2019-01-08','yyyy-mm-dd' )
and to_date( '2019-02-08','yyyy-mm-dd' )

Plan hash value: 849908795

----------------------------------------------------------------
| Id  | Operation               | Name | Rows  | Pstart| Pstop |
----------------------------------------------------------------
|   0 | SELECT STATEMENT        |      |       |       |       |
|   1 |  SORT AGGREGATE         |      |     1 |       |       |
|   2 |   PARTITION RANGE SINGLE|      |    33 |     2 |     2 |
|   3 |    TABLE ACCESS FULL    | DEMO |    33 |     2 |     2 |
----------------------------------------------------------------

  • Partition-level statistics are used when only one partition is accessed and it is known at parse time. You see that with Pstart = Pstop = a partition number.

Pstart <> Pstop (PARTITION RANGE ITERATOR/FULL) -> global

The second query still knows the partitions at parse time, but multiple partitions are concerned:

SQL> select count(*) from DEMO where day between to_date( '2019-01-08','yyyy-mm-dd' ) and to_date( '2020-02-08','yyyy-mm-dd' );
COUNT(*)
----------
94
SQL> select * from dbms_xplan.display_cursor(
format=>'basic +rows +outline +peeked_binds +partition');
PLAN_TABLE_OUTPUT                                                   
--------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(*) from DEMO
where day between to_date( '2019-01-08','yyyy-mm-dd' )
and to_date( '2020-02-08','yyyy-mm-dd' )

Plan hash value: 1988821877

------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Pstart| Pstop |
------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |       |       |       |
|   1 |  SORT AGGREGATE           |      |     1 |       |       |
|   2 |   PARTITION RANGE ITERATOR|      |   705 |     2 |     3 |
|   3 |    TABLE ACCESS FULL      | DEMO |   705 |     2 |     3 |
------------------------------------------------------------------

This high estimation comes from the high row count known at the table level.

  • Global table level statistics are used when more than one partition is accessed, even when they are known at the time of parsing. You see that with Pstart <> Pstop.

KEY — KEY (PARTITION RANGE ITERATOR with peeked binds) -> partition

For the third query, I use bind variables and set them to access only one partition:

SQL> variable d1 varchar2(10)
SQL> variable d2 varchar2(10)
SQL> exec :d1:='2019-01-08' ; :d2:='2019-02-08'
PL/SQL procedure successfully completed.
SQL> select count(*) from DEMO where day between to_date( :d1,'yyyy-mm-dd' ) and to_date( :d2,'yyyy-mm-dd' );
COUNT(*)
----------
32
SQL> select * from dbms_xplan.display_cursor(
format=>'basic +rows +outline +peeked_binds +partition');
PLAN_TABLE_OUTPUT                                                   
--------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(*) from DEMO
where day between to_date( :d1,'yyyy-mm-dd' )
and to_date( :d2,'yyyy-mm-dd' )

Plan hash value: 203823535

--------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Pstart| Pstop |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |       |       |       |
|   1 |  SORT AGGREGATE            |      |     1 |       |       |
|   2 |   FILTER                   |      |       |       |       |
|   3 |    PARTITION RANGE ITERATOR|      |    33 |  KEY  |  KEY  |
|   4 |     TABLE ACCESS FULL      | DEMO |    33 |  KEY  |  KEY  |
--------------------------------------------------------------------


Peeked Binds (identified by position):
--------------------------------------

1 - :D1 (VARCHAR2(30), CSID=873): '2019-01-08'
2 - :D2 (VARCHAR2(30), CSID=873): '2019-02-08'

The cardinality is estimated correctly: the partition statistics were used.

  • Partition-level statistics are used when only one partition is accessed and it is known at parse time, even when the value is known through bind peeking. You see that with KEY/KEY for Pstart/Pstop, and the bind values are listed by the +peeked_binds format option.

Of course, this accurate estimation has a drawback: executing the query again for another range of dates may re-use the same execution plan. You should code a different statement when you know that it covers something different.

Without bind peeking -> global

It is not a recommendation, but to avoid the risk mentioned above, some prefer to disable bind variable peeking. When bind peeking is disabled, the optimization is done before any value is known, so there is no way to know which partition is concerned.

SQL> alter session set "_optim_peek_user_binds"=false;
Session altered.
SQL> select count(*) from DEMO where day between to_date( :d1,'yyyy-mm-dd' ) and to_date( :d2,'yyyy-mm-dd' );
COUNT(*)
----------
32
SQL> select * from dbms_xplan.display_cursor(
format=>'basic +rows +outline +peeked_binds +partition');
PLAN_TABLE_OUTPUT                                                   
--------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(*) from DEMO
where day between to_date( :d1,'yyyy-mm-dd' )
and to_date( :d2,'yyyy-mm-dd' )

Plan hash value: 203823535

--------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Pstart| Pstop |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |       |       |       |
|   1 |  SORT AGGREGATE            |      |     1 |       |       |
|   2 |   FILTER                   |      |       |       |       |
|   3 |    PARTITION RANGE ITERATOR|      |    13 |  KEY  |  KEY  |
|   4 |     TABLE ACCESS FULL      | DEMO |    13 |  KEY  |  KEY  |
--------------------------------------------------------------------

Outline Data
-------------

/*+
BEGIN_OUTLINE_DATA
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('19.1.0')
DB_VERSION('19.1.0')
OPT_PARAM('_optim_peek_user_binds' 'false')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
FULL(@"SEL$1" "DEMO"@"SEL$1")
END_OUTLINE_DATA
*/

Here the optimizer estimates 13 rows out of 5100, i.e. about 0.25% of the global rows: the standard 5% x 5% guess for a range (BETWEEN) predicate with unknown bounds.

In conclusion, there is only one case where the partition level statistics can be used: partition pruning down to one single partition, known at parse time from a literal value or a peeked bind. In all other cases, you cannot ignore global statistics.

By the way, why did I use a range between 2019-01-08 and 2019-02-08? Probably to remind you that there's only one month left until the #KaaS19 Belgium Tech Days …


Making sense of direct path reads during primary key lookups

I recently made an interesting observation while monitoring database performance on an Oracle Enterprise Edition system. While looking at some ASH data (for which you must be licensed appropriately!) I came across direct path reads attributed to a select statement performing a primary key lookup. At first, this didn’t make much sense to me, but it’s actually intended behaviour and not a bug.

In this post I'm reproducing what I observed. I am using Oracle 18.4.0 for this experiment, running in my Linux lab environment. The hardware uses 1s8c16t (1 socket, 8 cores, 16 threads) and 64 GB of memory.

Direct path reads are multi-block reads you often find in decision support systems when a query goes over large amounts of data. They are un-buffered: they read into the session's private memory, not the buffer cache. That is not exactly what you would expect from a primary key lookup using the index. It should look more like this tkprof'd SQL trace:

SQL ID: b5dxjj3wm4yz8 Plan Hash: 4043159647
select *
from
orders where order_id = 100

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.01          0        378          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.00          0          4          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.01       0.01          0        382          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 82
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  TABLE ACCESS BY INDEX ROWID ORDERS (cr=4 pr=0 pw=0 time=32 us starts=1 cost=3 size=95 card=1)
         1          1          1   INDEX UNIQUE SCAN ORDER_PK (cr=3 pr=0 pw=0 time=22 us starts=1 cost=2 size=0 card=1)(object id 81853)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2       11.40         11.40

No physical I/O anywhere to be seen.

During my original investigation I noticed that direct path reads were attributed to a table containing a CLOB column. The query was something along the lines of “select * from table where primaryKeyColumn = :bindVariable”. The primaryKeyColumn was defined as a number. This should be easy to recreate.

The test case

Based on the Swingbench “ORDERS” table I created a new table using the following DDL:

CREATE TABLE martin.orders_clob (
    order_id                  NUMBER(12,0)
        NOT NULL ENABLE,
    order_clob                CLOB
        NOT NULL ENABLE,
    order_date                TIMESTAMP(6) WITH LOCAL TIME ZONE
        NOT NULL ENABLE,
    order_mode                VARCHAR2(8 BYTE),
    customer_id               NUMBER(12,0)
        NOT NULL ENABLE,
    order_status              NUMBER(2,0),
    order_total               NUMBER(8,2),
    sales_rep_id              NUMBER(6,0),
    promotion_id              NUMBER(6,0),
    warehouse_id              NUMBER(6,0),
    delivery_type             VARCHAR2(15 BYTE),
    cost_of_delivery          NUMBER(6,0),
    wait_till_all_available   VARCHAR2(15 BYTE),
    delivery_address_id       NUMBER(12,0),
    customer_class            VARCHAR2(30 BYTE),
    card_id                   NUMBER(12,0),
    invoice_address_id        NUMBER(12,0),
    constraint pk_orders_clob primary key ( order_id )
) tablespace bigfile_tbs
lob ( order_clob ) store as securefile (enable storage in row);

Please note that the clob is stored in row.

Next I inserted a large number of rows into the table, based again on soe.orders:

INSERT/*+ enable_parallel_dml append parallel(4) */ INTO martin.orders_clob
    SELECT /*+ parallel(4) */
        order_id,
        case
          when mod(rownum,10) = 0 then rpad('X',5000,'Y')
          else rpad('Y', 50, 'Z') 
        end,
        order_date,
        order_mode,
        customer_id,
        order_status,
        order_total,
        sales_rep_id,
        promotion_id,
        warehouse_id,
        delivery_type,
        cost_of_delivery,
        wait_till_all_available,
        delivery_address_id,
        customer_class,
        card_id,
        invoice_address_id
    FROM
        soe.orders
    WHERE    
        ROWNUM <= 1E6;

The case statement ensures that every 10th row has a LOB exceeding the size limit for an inline LOB; apparently that's 4000 bytes minus a little bit of overhead. As a result, that LOB data should be stored outside the table.

Load!

I am now generating some load against the system, being careful not to overload my lab system. I limited myself to 8 sessions:

SQL> select count(*) from v$session where username = 'MARTIN' and program like 'JDBC%';

  COUNT(*)
----------
         8

After a few minutes I can see these direct path reads show up. This is really easy if you have the right tools – I use Tanel Poder’s scripts a lot. One particularly useful script is ashtop.sql (remember the license thing again!) that you can see here:

SQL> @ashtop sql_id,event,session_state username='MARTIN' "to_date('19.12.2018 15:35:00', 'dd.mm.yyyy hh24:mi:ss')" "to_date('19.12.2018 15:45:00','dd.mm.yyyy hh24:mi:ss')"

    Total
  Seconds     AAS %This   SQL_ID        EVENT                                    SESSION FIRST_SEEN          LAST_SEEN           DIST_SQLEXEC_SEEN
--------- ------- ------- ------------- ---------------------------------------- ------- ------------------- ------------------- -----------------
     1380     2.3   74% | 7hth4y8d9h7q8                                          ON CPU  2018-12-19 15:35:02 2018-12-19 15:44:58              1232
      225      .4   12% |                                                        ON CPU  2018-12-19 15:35:04 2018-12-19 15:44:59                 2
      175      .3    9% | 7hth4y8d9h7q8 library cache: mutex X                   WAITING 2018-12-19 15:35:15 2018-12-19 15:44:49               175
       65      .1    3% | 7hth4y8d9h7q8 direct path read                         WAITING 2018-12-19 15:35:25 2018-12-19 15:44:41                65
        8      .0    0% | 7hth4y8d9h7q8 SQL*Net more data to client              WAITING 2018-12-19 15:38:52 2018-12-19 15:44:29                 8
        3      .0    0% | 7hth4y8d9h7q8 cursor: pin S                            WAITING 2018-12-19 15:36:59 2018-12-19 15:42:08                 1
        3      .0    0% |               cursor: pin S                            WAITING 2018-12-19 15:35:57 2018-12-19 15:42:08                 1

7 rows selected.

I’ll also show you the execution plan to confirm I’m not actually performing an unintentional full table scan:

SQL> select * from dbms_xplan.display_cursor('7hth4y8d9h7q8');

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------
SQL_ID  7hth4y8d9h7q8, child number 0
-------------------------------------
select /* CLOB */ * from martin.orders_clob where order_id =
trunc(dbms_random.value(1,1000000))

Plan hash value: 3284193906

----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |       |       |     3 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| ORDERS_CLOB    |     1 |   227 |     3   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | PK_ORDERS_CLOB |     1 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ORDER_ID"=TRUNC("DBMS_RANDOM"."VALUE"(1,1000000)))


20 rows selected.

I didn't want to use Java's random number generator, hence the call to dbms_random.value. So there you have it: direct path reads when performing index lookups.

Thanks and big shout out to Tanel Poder for his scripts, they are awesome.

New “Let’s Talk Database” events in Australia/NZ in February/March 2019 !!

Very excited to announce some new “Let’s Talk Database” events scheduled for February 2019 in various locations in Australia/NZ: Canberra: Wednesday, 20 February 2019: Registration Link Sydney: Thursday, 21 February 2019: Registration Link Brisbane: Friday, 22 February 2019: Registration Link Melbourne: Wednesday, 27 February 2019: Registration Link Wellington: Friday, 1 March 2019: Registration Link   […]

Oracle literal vs bind-variable in partition pruning and Top-N queries

Here is a query that retrieves the most recent value for a variable within a time range. The table has the variable ID, the timestamp and the value. Everything is properly indexed (it is actually an IOT): an index on (VARIABLE_ID, TIMESTAMP, VALUE), locally partitioned on TIMESTAMP.

For whatever reason, the optimizer chooses an index fast full scan. With small partitions, this should not be a problem as we do not need to read all partitions: we want only the last value (rownum=1 after order by desc).
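The query text itself is not reproduced above, but from the description it has roughly this shape; the table and column names here are illustrative assumptions, not taken from the post:

```sql
-- literal version: partitions are known at parse time,
-- so the iterator can stop after the first matching partition
select * from (
  select value from demo_measures
  where  variable_id = 42
    and  timestamp between timestamp '2018-12-01 00:00:00'
                       and timestamp '2018-12-31 23:59:59'
  order by timestamp desc
) where rownum = 1;

-- bind version: Pstart/Pstop show KEY/KEY and, as shown below,
-- all candidate partitions are read before the STOPKEY applies
select * from (
  select value from demo_measures
  where  variable_id = :id and timestamp between :t1 and :t2
  order by timestamp desc
) where rownum = 1;
```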

literals

Here is the execution plan with literals:

The interesting information is in the ‘Starts’ column: only two partitions have been read. That’s because we found our row and there was no need to continue. Line 5 implements this with a STOPKEY on ROWNUM=1. This is possible because the partitions are read, by PARTITION RANGE ITERATOR, in the same order as the ORDER BY requires: descending on TIMESTAMP. There’s a clue in Pstart/Pstop, but the best way to see it is with partition pruning event 10128 level 2 (see https://support.oracle.com/epmos/faces/DocContentDisplay?id=166118.1)
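The event can be set at session level before running the query; a minimal sketch (the exact level-2 prerequisites, including creating the KKPAP_PRUNING table, are described in the MOS note above):

```sql
-- dump partition pruning decisions for subsequent executions in this session;
-- level 2 additionally records each accessed partition in KKPAP_PRUNING
alter session set events '10128 trace name context forever, level 2';
-- ... run the query under test ...
alter session set events '10128 trace name context off';
```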

The dump mentions a DESCENDING RANGE SCAN:

Partition Iterator Information:
partition level = PARTITION
call time = RUN
order = DESCENDING
Partition iterator for level 1:
iterator = RANGE [3954, 4266]
index = 4266
current partition: part# = 4266, subp# = 1048576, abs# = 4266

and the KKPAP_PRUNING table lists which ones were actually read:

SQL> select ID,TYPE,LEVEL,ORDER,CALL,PARTNUM,APNUM 
from KKPAP_PRUNING;
ID TYPE  LEVEL     ORDER      CALL PARTNUM APNUM
-- ----- --------- ---------- ---- ------- -----
3 RANGE PARTITION DESCENDING RUN 4267 4266
3 RANGE PARTITION DESCENDING RUN 4266 4265

These are the two partitions counted by ‘Starts’.

bind variables

Here is the same query with bind variables for the timestamp range:

The big difference is that the STOPKEY is above the partition iterator here. This means that we cannot stop before having read all partitions. And that was the case: 313 ‘Starts’ — one for each partition — and 513 rows read before deciding which was the last one in the timestamp order.

It is different because the partitions are not known at parse time, because of the bind variables. This is why we see KEY/KEY rather than partition identifiers in Pstart/Pstop.

The event 10128 level 2 confirms that we have read all partitions:

SQL> select ID,TYPE,LEVEL,ORDER,CALL,PARTNUM,APNUM 
from KKPAP_PRUNING;
ID TYPE  LEVEL     ORDER      CALL PARTNUM APNUM
-- ----- --------- ---------- ---- ------- -----
6 RANGE PARTITION DESCENDING RUN 4267 4266
6 RANGE PARTITION DESCENDING RUN 4266 4265
6 RANGE PARTITION DESCENDING RUN 4265 4264
6 RANGE PARTITION DESCENDING RUN 4264 4263
...

However, since PARTITION DESCENDING is mentioned, I would have expected the same optimization to be possible as with literals. That makes a big difference in execution time when the predicate covers many partitions, and we can see above that this is not considered by the optimizer: the cost is the same.

Note that Mauro Pagano has a note about the bind variables case where he mentions that the SORT step above the partition iterator happens even with RANGE SCAN:

What I forgot and had to relearn about DESC indexes

In summary, whether it is necessary or not, the optimizer may be more clever with partition iterations when it knows the partitions at optimization time, which means when parameters are passed as literals rather than bind variables. Of course, you must also consider the execution frequency because of the parsing (CPU and latch) overhead.

2018-what grabbed your attention

Here are the blog posts that you hit on most this year, with the most viewed entry on top. Unsurprisingly it is about the release of 18c, but interestingly the ORA-14758 and the Active Sessions posts have come up again from last year’s list, so it appears they are still common issues for the modern Oracle professional. And of course, it is nice to see that my Openworld Mega-download is helping the community.

Thanks for supporting the blog, and as always, there will be more content next year !

Data Guard gap history

V$ARCHIVED_LOG has a lot of information, and one column that is interesting in a Data Guard configuration is APPLIED, which is a boolean (or rather a VARCHAR2(3) YES/NO, as there are no booleans in Oracle SQL). But I would love to see a DATE or TIMESTAMP for it. As a workaround, here is a little AWK script that parses the standby alert.log to get this information and joins it with V$ARCHIVED_LOG.COMPLETION_TIME, to see whether there were gaps in the past between the archived logs and the applied ones.

awk '
BEGIN {
print "set linesize 200 pagesize 1000"
print "column completion_time format a32"
print "column applied_time format a38"
print "column gap format a30"
s="with a as("
}
/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9]*[+][0-9]*:[0-9]*/ {
sub(/T/," ");ts=$0
}
/Media Recovery Log/{
print s" select timestamp" q ts q "applied_time,"q $NF q "name from dual "
s=" union all"
}
END{
print ") select thread#,sequence#,cast(completion_time as timestamp) completion_time,applied_time,applied_time-completion_time gap"
print " from a right outer join v$archived_log using(name) order by completion_time, applied_time;"
}
' q="'" $ORACLE_BASE/diag/rdbms/?*/?*/trace/alert_*.log | sqlplus -s / as sysdba
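To see what the awk parser emits before it reaches sqlplus, you can feed it a couple of fabricated alert.log excerpts (the timestamps and archive log names below are made up, and the generated SQL is printed instead of being piped to sqlplus):

```shell
# Rebuild the SQL that the parser above would generate, using two
# fabricated alert.log excerpts. Same matching logic as the script above.
gen_sql=$(awk -v q="'" -v s="with a as(" '
/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9]*[+][0-9]*:[0-9]*/ {
  sub(/T/," "); ts=$0          # remember the last timestamp line seen
}
/Media Recovery Log/ {
  print s " select timestamp" q ts q " applied_time, " q $NF q " name from dual"
  s=" union all"               # subsequent rows are appended with UNION ALL
}
END { print ") select * from a;" }
' <<'EOF'
2019-01-05T09:37:13.802072+00:00
Media Recovery Log /arch/o1_mf_1_8_fake1_.arc
2019-01-05T09:38:39.837743+00:00
Media Recovery Log /arch/o1_mf_1_9_fake2_.arc
EOF
)
printf '%s\n' "$gen_sql"
```

Each `Media Recovery Log` line becomes one row of an inline view, carrying the timestamp logged just before it, ready to be joined to V$ARCHIVED_LOG by NAME as in the full script.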

The result is something like this:

This tells me that I had a two-hour gap around 12:15 that was resolved quickly.

Another example (RAC):

Here is an example in MAA (Data Guard + RAC) which encountered some problems. On one node I can see 3 recovery attempts for Thread 1 Sequence 8:

Actually, the recovery failed 3 times on this instance. Here is the detail in alert.log:

[oracle@cdb1-dg02-2 ~]$ grep -A 2 o1_mf_1_8_g1sqm9kw $ORACLE_BASE/diag/rdbms/?*/?*/trace/alert_*.log
PR00 (PID:30983): Media Recovery Log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2018-12-21T12:37:05.588750+00:00
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT NODELAY
--
Media Recovery Log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T09:37:13.802072+00:00
Starting background process NSV0
--
Errors with log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T09:38:12.389363+00:00
Media Recovery user canceled with status 16016
--
Media Recovery Log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T09:38:39.837743+00:00
Starting background process NSV0
--
Errors with log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T09:39:37.412067+00:00
Media Recovery user canceled with status 16016
--
Media Recovery Log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T09:40:04.846300+00:00
Starting background process NSV0
--
Errors with log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T09:41:02.277054+00:00
Media Recovery user canceled with status 16016
--
Media Recovery Log /u03/app/oracle/fast_recovery_area/ORCLDG02/archivelog/2018_12_21/o1_mf_1_8_g1sqm9kw_.arc
2019-01-05T14:32:06.257452+00:00
Starting background process RSM0

Then the other node attempted the recovery and failed 4 times before succeeding:

I used this only once. Usually, the apply gap is monitored in real time from V$DATAGUARD_STATS (APPLY_LAG from TIME_COMPUTED). If you have a better idea for verifying past gaps, please add a comment here or on the Twitter thread:

Of course, the best would be an Enhancement Request to get the APPLIED_TIME in V$ARCHIVED_LOG. Anyway, if you open an Active Data Guard for reporting, it is not a bad idea to have a heartbeat table with a scheduled timestamp update, so that you can include the ‘as of’ date in your report.
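A minimal sketch of such a heartbeat table, with illustrative names (the table name, job name and refresh interval are my assumptions, not from the post): a one-row table refreshed on the primary, so the standby report can state how fresh its data is.

```sql
-- illustrative heartbeat: one row, refreshed every minute on the primary
create table heartbeat (ts timestamp);
insert into heartbeat values (systimestamp);
commit;

begin
  dbms_scheduler.create_job(
    job_name        => 'HEARTBEAT_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'update heartbeat set ts = systimestamp; commit;',
    repeat_interval => 'FREQ=MINUTELY',
    enabled         => true);
end;
/

-- on the Active Data Guard standby, a report can include its freshness:
select ts as data_as_of from heartbeat;
```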

Oracle listener static service hi-jacking

We must be careful about the services that are registered to a listener, because the user connects to them with a good idea of the database she wants to connect to, but another database or PDB can dynamically register a service with the same name and then receive connections that were intended for another destination. Of course, as a security best practice, user/password should not be the same in different databases, but what if the connection is done by a common user in multitenant? By creating a service, you can hi-jack the connections to CDB$ROOT and have them connected to your PDB.

You may think that static registration (the SID_LIST_LISTENER in listener.ora) is a solution, especially with the (STATIC_SERVICE=TRUE) introduced in 12cR2, but this defines only the target instance. The PDB is resolved dynamically.

This post is there only to raise awareness. I’ll not expose the possible exploits. But you can imagine: something that is scheduled to run as SYSDBA on CDB$ROOT can be hi-jacked to run on a PDB where you have more privileges to attempt some privilege escalation.

Here is my CDB1 database with PDB1 pluggable database. In addition to the default services, I’ve defined a static service CDB1_DGMGRL.

# lsnrctl status
LSNRCTL for Linux: Version 19.0.0.0.0 - Development on 05-JAN-2019 18:06:26
Copyright (c) 1991, 2018, Oracle.  All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=0.0.0.0)(PORT=1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 19.0.0.0.0 - Development
Start Date 16-DEC-2018 17:33:58
Uptime 20 days 0 hr. 32 min. 28 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/oracle/product/19.0.0/network/admin/listener.ora
Listener Log File /u01/app/oracle/diag/tnslsnr/VM190/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=VM190)(PORT=1521)))
Services Summary...
Service "7cd707c6f0a7108ce053be38a8c0058b" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1XDB" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1_DGMGRL" has 1 instance(s).
Instance "CDB1", status UNKNOWN, has 1 handler(s) for this service...
Service "pdb1" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
The command completed successfully

I am the PDB administrator, connecting with the local user DEMO to PDB1 and create a service here, with the same name CDB1_DGMGRL

SQL> connect demo/demo@//localhost/PDB1
Connected.
SQL> show con_name
CON_NAME
------------------------------
PDB1
SQL> exec dbms_service.create_service(service_name=>'CDB1_DGMGRL', network_name=>'CDB1_DGMGRL');
PL/SQL procedure successfully completed.
SQL> exec dbms_service.start_service('CDB1_DGMGRL');
PL/SQL procedure successfully completed.

As I have started the service, it is registered by the listener, in addition to the static one.

Services Summary...
Service "7cd707c6f0a7108ce053be38a8c0058b" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1XDB" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1_DGMGRL" has 2 instance(s).
Instance "CDB1", status UNKNOWN, has 1 handler(s) for this service...
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "pdb1" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
The command completed successfully

The registration is not a problem, as it goes to the same instance. The problem is that the instance knows that service as belonging to PDB1. Then, when a common user connects with this service, the session switches to the PDB1 container:

# sqlplus -s sys/oracle@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=CDB1_DGMGRL))(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=1521)))" as sysdba <<<'show con_name'
CON_NAME
------------------------------
PDB1

Do not expect (STATIC_SERVICE=TRUE) to change anything, because the static registration has no information about the PDB:

# sqlplus -s sys/oracle@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=CDB1_DGMGRL)(INSTANCE_NAME=CDB1)(STATIC_SERVICE=TRUE))(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521)))" as sysdba <<<'show con_name'
CON_NAME
------------------------------
PDB1

By the way, even when the service is stopped:

SQL> connect demo/demo@//localhost/PDB1
Connected.
SQL> exec dbms_service.stop_service('CDB1_DGMGRL');
PL/SQL procedure successfully completed.

and then the listener knows only the static declaration:

Services Summary...
Service "7cd707c6f0a7108ce053be38a8c0058b" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1XDB" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
Service "CDB1_DGMGRL" has 1 instance(s).
Instance "CDB1", status UNKNOWN, has 1 handler(s) for this service...
Service "pdb1" has 1 instance(s).
Instance "CDB1", status READY, has 1 handler(s) for this service...
The command completed successfully

the instance will still redirect to PDB1 even if the session is not connected with a service:

SQL> connect sys/oracle@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=CDB1_DGMGRL))(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=1521)))" as sysdba
Connected.

SQL> select sys_context('userenv','service_name') from dual;
SYS_CONTEXT('USERENV','SERVICE_NAME')
--------------------------------------------------------------------
SYS$USERS
SQL> show con_name
CON_NAME
------------------------------
PDB1

Only when the service is deleted do the connections stay in CDB$ROOT:

SQL> exec dbms_service.delete_service('CDB1_DGMGRL');
PL/SQL procedure successfully completed.
# sqlplus -s sys/oracle@"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=CDB1_DGMGRL)(INSTANCE_NAME=CDB1)(STATIC_SERVICE=TRUE))(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521)))" as sysdba <<<'show con_name'
CON_NAME
------------------------------
CDB$ROOT

Solution(s)

Basics: be careful with powerful privileges, such as INHERIT PRIVILEGES, which can allow a PDB DBA to overwrite a common user procedure with authid current_user.

Safe attitude: when you expect your connection to run in CDB$ROOT, be sure to do it explicitly with ALTER SESSION SET CONTAINER=CDB$ROOT.

Isolation: dedicate a listener for your system administration. Another listener will be the target of REMOTE_LISTENER for dynamic registration.
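A hedged sketch of such an isolated setup in listener.ora (the listener name, port, service name and paths are illustrative assumptions): the admin listener carries only a static entry, while dynamic registration goes to a different listener via REMOTE_LISTENER.

```
# illustrative listener.ora fragment: a dedicated admin listener on its own
# port, with only a static SID_LIST entry; the instance's REMOTE_LISTENER
# points at a separate listener for dynamic service registration
LISTENER_ADMIN =
  (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1522)))

SID_LIST_LISTENER_ADMIN =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = CDB1_ADMIN)
      (SID_NAME = CDB1)
      (ORACLE_HOME = /u01/app/oracle/product/19.0.0)
    )
  )
```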

Least privileges: when you only need access to CDB$ROOT, connect with a user that has no CONNECT privilege on the PDBs.

DBaaS: if you are using multitenant for database on-demand provisioning where the user can choose the name of the PDB, then be sure that the name cannot be the same as a static service you use. Naming conventions may help there.
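As a quick sanity check for the duplicate-registration situation shown above, you can count how many instances serve each listener service; anything above one deserves a look. This is a hedged sketch that parses the `lsnrctl status` output format shown earlier (the sample output is embedded here for illustration; in real use you would substitute the live command):

```shell
# flag listener services registered more than once, by parsing the
# 'Service "X" has N instance(s).' lines of lsnrctl status output
lsnrctl_output='Service "CDB1" has 1 instance(s).
Service "CDB1_DGMGRL" has 2 instance(s).'   # in real use: $(lsnrctl status)

result=$(printf '%s\n' "$lsnrctl_output" | awk -F'"' '
/^Service ".*" has [0-9]+ instance\(s\)\./ {
  n=$3; gsub(/[^0-9]/, "", n)               # extract the instance count
  if (n+0 > 1) print "suspect service: " $2 " (" n " instances)"
}')
printf '%s\n' "$result"
```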