Search

OakieTags

Who's online

There are currently 0 users and 49 guests online.

Recent comments

Affiliations

histograms

Column Groups - Edge Cases

Oracle 11g added Extended Statistics support for column groups in order to detect correlated columns for filter predicates using an equal comparison.

Note that Oracle 11g also added the ability to use the number of distinct keys of a composite index as an upper limit for the cardinality estimates for matching column predicates, which means that the optimizer is now capable of detecting correlated columns without the explicit addition of Extended Statistics / Column Groups.

NewDensity

A recent comment on a note I wrote some time ago about faking histograms asked about the calculations of selectivity in the latest versions of Oracle. As I read the question, I realised that I had originally supplied a formula for calculating cardinality, rather than selectivity, so I thought I’d supply a proper example.

We’ll start with a script to create some data and stats – and I’m going to start with a script I wrote in Jan 2001 (which is why it happens to use the analyze command rather than dbms_stats.gather_table_stats, even though this example comes from an instance of 11.2.0.2).

METHOD_OPT=> SIZE AUTO Quiz Solution (The Trickster)

I was going to leave it for a few days but there have already been so many comments and discussions on all this, I thought I better write something up. In case anyone was wondering, yes I probably am driving my colleagues at work mad with my “Question of the Day” !! Unfortunately, some might [...]

Implicit Datatype Conversion + Histograms = Bad Execution Plan?

Earlier today I exchanged some tweets with @martinberx about some optimizer questions and after posting more information on the ORACLE-L list, I was able to reproduce what he was observing.

The issue:

DB: 11.2.0.2.0 – 64bit /> I have a small query with a little error, which causes big troubles. /> The relevant part of the query is /> WHERE …. /> AND inst_prod_type=003 /> AND setid=’COM01′

but INST_PROD_TYPE is VARCHAR2.

this leads to filter[ (TO_NUMBER("INST_PROD_TYPE")=3 AND "SETID"='COM01') ]

based on this TO_NUMBER ( I guess!) the optimiser takes a fix selectivity of 1%.

Can someone tell me if this 1% is right? Jonathan Lewis “CBO Fundamentals” on page 133 is only talking about character expressions.

Unfortunately there are only 2 distinct values of INST_PROD_TYPE so this artificial [low] selectivity leads to my problem: /> An INDEX SKIP SCAN on PS0RF_INST_PROD is choosen. (columns of PS0RF_INST_PROD: INST_PROD_TYPE, SETID, INST_PROD_ID )

After fixing the statement to /> AND inst_prod_type=’003′ /> another index is used and the statement performs as expected.

Now I have no problem, but want to find the optimizers decisions in my 10053 traces.

The Important Bits of Information

From Martin’s email we need to pay close attention to:

  • Predicate of “inst_prod_type=003″ where INST_PROD_TYPE is VARCHAR2 (noting no single quotes around 003)
  • Implicite datatype conversion in predicate section of explain plan – TO_NUMBER(“INST_PROD_TYPE”)=3
  • only 2 distinct values of INST_PROD_TYPE

From this information I’ll construct the following test case:

create table foo (c1 varchar2(8));
insert into foo select '003' from dual connect by level <= 1000000;
insert into foo select '100' from dual connect by level <= 1000000;
commit;
exec dbms_stats.gather_table_stats(user,'foo');

And using the display_raw function we’ll look at the column stats.

col low_val     for a8
col high_val    for a8
col data_type   for a9
col column_name for a11

select
   a.column_name,
   display_raw(a.low_value,b.data_type) as low_val,
   display_raw(a.high_value,b.data_type) as high_val,
   b.data_type,
   a.density,
   a.histogram,
   a.num_buckets
from
   user_tab_col_statistics a, user_tab_cols b
where
   a.table_name='FOO' and
   a.table_name=b.table_name and
   a.column_name=b.column_name
/

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2          .5 NONE                      1

Take note of the lack of a histogram.

Now let’s see what the CBO estimates for a simple query with and without quotes (explicit cast and implicit cast).

SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=003)

14 rows selected.

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("C1"='003')

14 rows selected.

In this case the estimated number of rows is spot on – 1 million rows. Now lets regather stats and because of our queries using C1 predicates, it will become a candidate for a histogram. We can see this from sys.col_usage$.

select  oo.name owner,
        o.name table_name,
        c.name column_name,
        u.equality_preds,
        u.equijoin_preds,
        u.nonequijoin_preds,
        u.range_preds,
        u.like_preds,
        u.null_preds,
        u.timestamp
from    sys.col_usage$ u,
        sys.obj$ o,
        sys.user$ oo,
        sys.col$ c
where   o.obj#   = u.obj#
and     oo.user# = o.owner#
and     c.obj#   = u.obj#
and     c.col#   = u.intcol#
and     oo.name  = 'GRAHN'
and     o.name   = 'FOO'
/

OWNER TABLE_NAME COLUMN_NAME EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP
----- ---------- ----------- -------------- -------------- ----------------- ----------- ---------- ---------- -------------------
GRAHN FOO        C1                       1              0                 0           0          0          0 2011-06-08 22:29:59

Regather stats and re-check the column stats:

SQL> exec dbms_stats.gather_table_stats(user,'foo');

PL/SQL procedure successfully completed.

SQL> select
  2     a.column_name,
  3     display_raw(a.low_value,b.data_type) as low_val,
  4     display_raw(a.high_value,b.data_type) as high_val,
  5     b.data_type,
  6     a.density,
  7     a.histogram,
  8     a.num_buckets
  9  from
 10     user_tab_col_statistics a, user_tab_cols b
 11  where
 12     a.table_name='FOO' and
 13     a.table_name=b.table_name and
 14     a.column_name=b.column_name
 15  /

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2  2.5192E-07 FREQUENCY                 2

Note the presence of a frequency histogram. Now let’s re-explain:

SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |     1 |     4 |   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=003)

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1025K|  4006K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("C1"='003')

And whammy! Note that the implicit cast [ filter(TO_NUMBER("C1")=003) ] now has an estimate of 1 row (when we know there is 1 million). /> So what is going on here? Let’s dig into the optimizer trace for some insight:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for FOO[FOO]
  Column (#1):
    NewDensity:0.243587, OldDensity:0.000000 BktCnt:5458, PopBktCnt:5458, PopValCnt:2, NDV:2
  Column (#1): C1(
    AvgLen: 4 NDV: 2 Nulls: 0 Density: 0.243587
    Histogram: Freq  #Bkts: 2  UncompBkts: 5458  EndPtVals: 2
  Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred
  Table: FOO  Alias: FOO
    Card: Original: 2000000.000000  Rounded: 1  Computed: 0.50  Non Adjusted: 0.50
  Access Path: TableScan
    Cost:  875.41  Resp: 875.41  Degree: 0
      Cost_io: 853.00  Cost_cpu: 622375564
      Resp_io: 853.00  Resp_cpu: 622375564
  Best:: AccessPath: TableScan
         Cost: 875.41  Degree: 1  Resp: 875.41  Card: 0.50  Bytes: 0

As you can see from the line

Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred

The presence of the histogram and the implicit conversion of TO_NUMBER(“C1″)=003 causes the CBO to use a density of 0 because it thinks it’s a non-existent value. The reason for this is that TO_NUMBER(“C1″)=003 is the same as TO_NUMBER(“C1″)=3 and for the histogram the CBO uses TO_CHAR(C1)=’3′ and 3 is not present in the histogram only ’003′ and ’100′.

Dumb Luck?

So, what if the predicate contained a number that was not left padded with zeros, say 100, the other value we put in the table?

SQL> explain plan for select count(*) from foo where c1=100;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1009K|  3944K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=100)

While not exact, the CBO estimate is quite close to the 1 million rows with C1=’100′.

Summary

It’s quite clear that Martin’s issue came down to the following:

  • implicit casting
  • presences of histogram
  • zero left padded number/string

The combination of these created a scenario where the CBO thinks the value is out-of-range and uses a prorated density of 0 resulting in a cardinality of 1 when there are many more rows than 1.

The moral of the story here is always cast your predicates correctly. This includes explicit cast of date types as well – never rely on the nls settings.

All tests performed on 11.2.0.2.