Search

Top 60 Oracle Blogs

Recent comments

IOT Hash

It’s another of my double-entendre titles. The optimizer can turn a hash join involving an index-organized table into a real performance disaster (though you may have to help it along the way by using a silly definition for your primary key columns). This post was inspired by a question posted on the Oracle Developer Community forum recently so the table and column names I’ve used in my model reflect (almost, with a few corrections) the names used in the post.

We start with a simple requirement expressed through the following SQL:


rem
rem     Script:         iot_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

insert
        /*+
                qb_name(insert)
        */
into t_iot(
        id, inst_id, nr_time,
        o_time, st, null_col, svname
)
select
        /*+
                qb_name(main)
                unnest(@subq)
                leading(@sel$a93afaed apar@main ob@subq)
                use_hash(@sel$a93afaed ob@subq)
                swap_join_inputs(@sel$a93afaed ob@subq)
                index_ss_asc(@sel$a93afaed ob@subq (t_iot.st t_iot.inst_id t_iot.svname))a
        */
        apar.id,
        'UP',
        to_date('2019-10-24','yyyy-mm-dd'),
        to_date('1969-12-31','yyyy-mm-dd'),
        'IDLE',
        null,
        'tkt007.jj.bb.com'
from
        t_base apar
where
        apar.id not in (
                select
                        /*+
                                qb_name(subq)
                        */
                        id
                from
                        t_iot ob
                where
                        inst_id = 'UP'
        )
and     nvl(apar.gp_nm,'UA') = 'UA'
and     rownum <= 5000
/

The requirement is simple – insert into table t_iot a set of values dictated by a subset of the rows in table t_base if they do not already exist in t_iot. To model the issue that appeared I’ve had to hint the SQL to get the following plan (which I pulled from memory after enabling rowsource execution stats):


---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |             |      1 |        |   296 (100)|      0 |00:00:00.03 |     788 |    148 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL | T_IOT       |      1 |        |            |      0 |00:00:00.03 |     788 |    148 |       |       |          |
|*  2 |   COUNT STOPKEY          |             |      1 |        |            |    100 |00:00:00.03 |      99 |     91 |       |       |          |
|*  3 |    HASH JOIN RIGHT ANTI  |             |      1 |    100 |   296   (2)|    100 |00:00:00.03 |      99 |     91 |    14M|  1843K|   15M (0)|
|*  4 |     INDEX SKIP SCAN      | T_IOT_STATE |      1 |  12614 |   102   (0)|  10000 |00:00:00.01 |      92 |     91 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | T_BASE      |      1 |    100 |     2   (0)|    100 |00:00:00.01 |       7 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=5000)
   3 - access("APAR"."ID"="ID")
   4 - access("INST_ID"='UP')
       filter("INST_ID"='UP')
   5 - filter(NVL("APAR"."GP_NM",'UA')='UA')

The optimizer has unnested (as hinted) the subquery and converted it to an anti-join using a right hash anti-join. Take a look at the Used-mem for the hash join – would it surprise you to learn that the total size of the (not compressed in any way) IOT, and all its indexes, and the t_base table together total less than 4 MB. Something dramatically awful has happened in the hash join to generated a requirement of 14MB. (In the case of the OP this appeared as an unexpected 5GB written to the temporary tablespace.)

Before I address the source of the high memory usage, take a close look at the Predicate Information, particularly operation 3, and ask yourself what the definition of index t_iot_state might be. The predicate joins t_base.id to t_iot.id, and here’s the code to create both tables and all the indexes.

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(200),
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/

insert into t_iot
select
        sysdate,
        dbms_random.string('l',10),
        'UP',
        sysdate,
        'IDLE',
        null,
        rpad('x',25,'x')
from
        all_objects
where
        rownum <= 1e4 -- > hint to avoid wordpress format issue
/

create index t_iot_state on t_iot(st, inst_id, svname); 
create index idx2        on t_iot(id, inst_id, svname);

create table t_base(
        id              varchar2(400) not null,
        gp_nm           varchar2(200)
)
/

insert into t_base
select
        dbms_random.string('l',10),
        'UA'
from
        all_objects
where
        rownum <= 100 -- > hint to avoid wordpress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_iot',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_base',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/


The index t_iot_state that Oracle has used in the hash join is defined on the columns (st, inst_id, svname) – so the predicate is doing a comparison with a column that’s not in the index! At least, it’s not visibly declared in the index; but this is a secondary index on an IOT, and IOTs don’t have “normal” rowids, the rowid in a secondary index is the value of the primary key (plus a “block guess”). So the columns in the index (even though not declared in the index) are: (st, inst_id, svname, {nr_time, id, inst_id, blockguess}). So this index does supply the required id column.

Side note: you’ll see in the list of columns above that inst_id appears twice. In fact (since Oracle 9, I think) the code to handle secondary indexes has been smart enough to avoid this duplication. If the secondary index contains columns from the primary key then the “rowid” doesn’t store those columns, the code knows how to construct the primaryh key value from the stored PK columns combined with the needed columns from the index entry. This can make IOTs a very nice choice of implementation for “intersection” tables that are used to represent many-to-many joins between two other tables.

Unfortunately this “rowid” is the explanation for the massive memory demand. Take a look at the “Column Projection Information” for my execution plan:


Column Projection Information (identified by operation id):
-----------------------------------------------------------
   2 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200], ROWNUM[8]
   3 - (#keys=1) "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]
   4 - "OB".ROWID[ROWID,1249], "NR_TIME"[TIMESTAMP,11], "ID"[VARCHAR2,1024], "INST_ID"[VARCHAR2,200], "OB".ROWID[ROWID,1249]
   5 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]

The interesting line is operation 4. A hash join takes the rowsource from its first child (the build table) and creates an in-memory hash table (which may spill to disc, of course), so if I see an unreasonable memory allocation (or unexpected spill to disc) a good starting point is to look at what the first child is supplying. In this case the first child seems to be saying that it’s supplying (or allowing for) nearly 3,700 bytes to be passed up to the hash join.

On closer inspection we can see it’s reporting the “rowid” twice, and also reporting the three columns that make up that rowid. I think it’s reasonable to assume that it’s only supplying the rowid once, and maybe it’s not even supplying the other three columns because they are embedded in the rowid. Doing a quick arithmetic check, let’s multiply the size of the rowid by the value of A-rows: 1,249 * 10,000 = 12,490,000. That’s pretty close to the 14MB reported by the hash join in operation 3.

Hypothesis – to get at the id column, Oracle has used this index (actually a very bad choice of those available) to extract the rowid and then passed the rowid up to the parent in a (length padded) fixed format. Oracle has then created a hash table by extracting the id column from the rowid and building the hash table on it but also carrying the length-padded rowid into the hash table.  Possible variants on this theme are that some or all of the other columns in the Column Projection Information are also passed upwards so that the id doesn’t have to be extracted, but if they are they are not padded to their maximum length.

A simple test that this is broadly the right assumption is to re-run the model making the declared length of the rowid much larger to see what happens to the memory allocation. Changing the inst_id declaration from 200 bytes to 1000 bytes (note the stored value is only the 2 bytes needed for the value ‘UP’) the Used-mem jumps to 23 MB (which is an increment very similar to 800 * 10,000).  You’ll note that I chose to experiment with a column that wasn’t the column used in the join. It was a column in the secondary index definition, though, so another test would be to change the nr_time column from a timestamp (11 bytes) to a large varchar2, so I re-ran the test declaring the nr_time as a varchar2(1000) – reverting the inst_id to varchar2(200) – and the Used-mem increased to 25MB.

Preliminary Conclusion

If Oracle uses the contents of the rowid of a secondary index on an IOT in a join then it constructs a fixed format version for the rowid by padding every column in the primary key to its maximum length and concatenating the results. This can have catastrophic side effects on performance if you’ve declared some very long columns “just in case”. Any time you use index organized tables you should remember to check the Column Projection Information in any execution plans that use secondary indexes in case they are passing a large, padded, primary key through the plan to a point where a blocking operation (such as a hash join or merge join) has to accumulate a large number of rows.

Footnote

In my test case I had to hint the query heavily to force Oracle into the path I wanted to demonstrate.

It’s surprising that the optimizer should have chosen this path in the OP’s system, given that there’s another secondary index that contains the necessary columns in its definition. (So one thought is that there’s a statistics problem to address, or possibly the “good” index is subject to updates that make it become very inefficient (large) very quickly.)

Another oddity of the OP’s system was that Oracle should have chosen to do a right hash anti-join when it looked as if joining the tables in the opposite order would produce a much smaller memory demand and lower cost – there was an explict swap_join_inputs() hint in the Outline Information (so copying the outline into the query and changing that to no_swap_join_inputs() might have been abother viable workaround.) In the end the OP hinted the query to use a nested loop (anti-)join from t_base to t_iot – which is another way to avoid the hash table threat with padded rowids.