
HAVING Cardinality

When performing aggregate GROUP BY operations, an additional filter on the aggregates can be applied using the HAVING clause. Usually aggregation is one of the last steps executed before the final result set is returned to the client. However, there are various reasons why a GROUP BY operation might end up somewhere in the middle of the execution plan: for example, it might be part of a view that cannot be merged (or was hinted not to be merged using the NO_MERGE hint), or, in more recent releases (11g+), the optimizer may have decided to use the GROUP BY PLACEMENT transformation, which deliberately moves the GROUP BY operation to a different step of the plan. In such cases, when the GROUP BY operation serves as input to some other operation, it becomes essential for the overall efficiency of the execution plan preferred by the optimizer that the cardinality estimates are in the right ballpark, as it will influence…
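A minimal sketch of the kind of construct the excerpt describes - an unmergeable view with a GROUP BY / HAVING feeding a join, so that the aggregate's cardinality estimate drives the rest of the plan (all object names here are made up for illustration):

explain plan for
select *
from (
       select /*+ no_merge */
              grp_col
            , count(*) as cnt
       from   detail_table
       group by grp_col
       having count(*) > 10
     ) v
   , other_table t
where t.grp_col = v.grp_col;

select * from table(dbms_xplan.display);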

Forced Cursor Sharing And Virtual Columns

So you have that application that cannot be changed but makes use of some weird expressions that screw up the cardinality estimates of the optimizer.

Consider this simple example:

create table t
as
select
        rownum as id
      , case when mod(rownum, 100000) = 0 then 'y' else 'n' end as flag
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns flag size 2', estimate_percent => 30)

select flag, count(*) from t group by flag;

explain plan for
select count(*) from t where flag = 'y';

set linesize 200 tab off pagesize 0

select * from table(dbms_xplan.display);

Column Groups - Edge Cases

Oracle 11g added Extended Statistics support for column groups in order to detect correlated columns in filter predicates that use equality comparisons.

Note that Oracle 11g also added the ability to use the number of distinct keys of a composite index as an upper limit for the cardinality estimates for matching column predicates, which means that the optimizer is now capable of detecting correlated columns without the explicit addition of Extended Statistics / Column Groups.
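As a reminder, a column group is created via DBMS_STATS.CREATE_EXTENDED_STATS and afterwards gets its statistics gathered like any other column. A minimal sketch (table and column names are illustrative):

-- Create the extension; the function returns the generated
-- SYS_STU... virtual column name
select dbms_stats.create_extended_stats(null, 't', '(col1, col2)')
from dual;

-- Gather statistics so the column group becomes usable
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')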

Dynamic Sampling On Multiple Partitions - Bugs

In a recent OTN thread I've been reminded of two facts about Dynamic Sampling that I already knew but had forgotten in the meantime:

1. The table-level dynamic sampling hint uses a different number of blocks for sampling than session / cursor level dynamic sampling. So even if, for example, level 5 is used in both cases, the number of sampled blocks will be different for most of the 10 available levels (obviously levels 0 and 10 are exceptions) - see the sketch below.

2. The Dynamic Sampling code uses a different approach for partitioned objects if it is faced with more partitions than there are blocks to sample according to the level (and type: table / cursor / session) of Dynamic Sampling

Note that all this here applies to the case where no statistics have been gathered for the table - I don't cover the case when Dynamic Sampling gets used on top of existing statistics.
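To illustrate the first point, these are the two hint flavours at the same nominal level (a sketch, using a placeholder table T):

-- Table-level dynamic sampling: applies to table T only
select /*+ dynamic_sampling(t 5) */ count(*) from t;

-- Cursor / statement-level dynamic sampling
select /*+ dynamic_sampling(5) */ count(*) from t;

Although both say "level 5", the number of blocks sampled differs between the two forms.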

Table Functions And Join Cardinality Estimates

If you consider the usage of Table Functions then you should be aware of some limitations to the optimizer calculations, in particular when considering a join between a Table Function and other row sources.

As outlined in one of my previous posts, you can and should help the optimizer to arrive at a reasonable cardinality estimate when dealing with table functions. However, doing so doesn't provide all the inputs to the join cardinality calculation that are useful and available from the statistics of regular tables.

Therefore, even when following the recommended practice regarding the cardinality estimates, it is possible to end up with some inaccuracies. This post explains why.
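The usual way to supply such an estimate is the (undocumented) CARDINALITY hint. A minimal sketch, assuming a pipelined function F and a join to a regular table T (names made up):

select /*+ cardinality(tf 100) */
       *
from   table(f()) tf
     , t
where  t.id = tf.id;

This tells the optimizer to expect 100 rows from the table function, but it still provides no column-level statistics (number of distinct values, low / high values) for TF's join column - which is exactly the input the join cardinality calculation would need.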

Join Cardinality Basics

Star Transformation And Cardinality Estimates

If you want to make use of Oracle's cunning Star Transformation feature then you need to be aware of the fact that the star transformation logic - as the name implies - assumes that you are using a proper star schema.

Here is a nice example of what can happen if you attempt to use star transformation but your model obviously doesn't really correspond to what Oracle expects:

drop table d;

purge table d;

drop table t;

purge table t;

create table t
as
select
        rownum as id
      , mod(rownum, 100) + 1 as fk1
      , 1000 + mod(rownum, 10) + 1 as fk2
      , 2000 + mod(rownum, 100) + 1 as fk3
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't')

create bitmap index t_fk1 on t (fk1);
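The excerpt stops here; presumably the setup continues along these lines (my assumption - star transformation requires bitmap indexes on the dimension keys and the feature to be enabled):

create bitmap index t_fk2 on t (fk2);

create bitmap index t_fk3 on t (fk3);

-- Star transformation is disabled by default
alter session set star_transformation_enabled = true;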

Multi-Column Joins, Expressions and 11g

Introduction

I've already outlined in one of my previous posts that getting a reasonable cardinality estimate for multi-column joins can be tricky, in particular when dealing with correlated column values in the join columns.

Since Oracle 10g, several "Multi-Column Join Cardinality" sanity checks have been introduced that prevent a multi-column join from producing overly low join cardinalities - this is controlled via the "_optimizer_join_sel_sanity_check" internal parameter, which defaults to true.
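For reference, the shape of join the sanity check applies to looks like this (made-up names): without the check, the selectivities of the two join predicates would simply be multiplied, which for perfectly correlated column values can be far too low.

select count(*)
from   t1
     , t2
where  t2.col1 = t1.col1
and    t2.col2 = t1.col2;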

Implicit Datatype Conversion + Histograms = Bad Execution Plan?

Earlier today I exchanged some tweets with @martinberx about some optimizer questions and after posting more information on the ORACLE-L list, I was able to reproduce what he was observing.

The issue:

DB: 11.2.0.2.0 - 64bit
I have a small query with a little error, which causes big troubles.
The relevant part of the query is
WHERE ....
AND inst_prod_type=003
AND setid='COM01'

but INST_PROD_TYPE is VARCHAR2.

this leads to filter[ (TO_NUMBER("INST_PROD_TYPE")=3 AND "SETID"='COM01') ]

Based on this TO_NUMBER (I guess!) the optimiser takes a fixed selectivity of 1%.

Can someone tell me if this 1% is right? Jonathan Lewis' "CBO Fundamentals" on page 133 only talks about character expressions.

Unfortunately there are only 2 distinct values of INST_PROD_TYPE, so this artificially low selectivity leads to my problem:
An INDEX SKIP SCAN on PS0RF_INST_PROD is chosen. (Columns of PS0RF_INST_PROD: INST_PROD_TYPE, SETID, INST_PROD_ID)

After fixing the statement to
AND inst_prod_type='003'
another index is used and the statement performs as expected.

Now I have no problem, but I want to find the optimizer's decisions in my 10053 traces.

The Important Bits of Information

From Martin’s email we need to pay close attention to:

  • Predicate of "inst_prod_type=003" where INST_PROD_TYPE is VARCHAR2 (noting no single quotes around 003)
  • Implicit datatype conversion in the predicate section of the explain plan - TO_NUMBER("INST_PROD_TYPE")=3
  • Only 2 distinct values of INST_PROD_TYPE

From this information I’ll construct the following test case:

create table foo (c1 varchar2(8));
insert into foo select '003' from dual connect by level <= 1000000;
insert into foo select '100' from dual connect by level <= 1000000;
commit;
exec dbms_stats.gather_table_stats(user,'foo');

And using the display_raw function we’ll look at the column stats.
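Note that DISPLAY_RAW is not a built-in - it's a small helper that decodes the RAW low/high values via DBMS_STATS.CONVERT_RAW_VALUE. A minimal sketch covering just the datatypes relevant here:

create or replace function display_raw (rawval raw, type varchar2)
return varchar2
is
   cn number;
   cv varchar2(32);
   cd date;
begin
   -- DBMS_STATS.CONVERT_RAW_VALUE is overloaded per datatype
   if type = 'NUMBER' then
      dbms_stats.convert_raw_value(rawval, cn);
      return to_char(cn);
   elsif type = 'VARCHAR2' then
      dbms_stats.convert_raw_value(rawval, cv);
      return cv;
   elsif type = 'DATE' then
      dbms_stats.convert_raw_value(rawval, cd);
      return to_char(cd);
   else
      return 'UNKNOWN DATATYPE';
   end if;
end;
/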

col low_val     for a8
col high_val    for a8
col data_type   for a9
col column_name for a11

select
   a.column_name,
   display_raw(a.low_value,b.data_type) as low_val,
   display_raw(a.high_value,b.data_type) as high_val,
   b.data_type,
   a.density,
   a.histogram,
   a.num_buckets
from
   user_tab_col_statistics a, user_tab_cols b
where
   a.table_name='FOO' and
   a.table_name=b.table_name and
   a.column_name=b.column_name
/

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2          .5 NONE                      1

Take note of the lack of a histogram.

Now let's see what the CBO estimates for a simple query with and without quotes (implicit conversion vs. a correctly typed literal).

SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=003)

14 rows selected.

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("C1"='003')

14 rows selected.

In this case the estimated number of rows is spot on - 1 million rows. Now let's regather stats; because our queries used predicates on C1, that column has become a candidate for a histogram. We can see this from sys.col_usage$.

select  oo.name owner,
        o.name table_name,
        c.name column_name,
        u.equality_preds,
        u.equijoin_preds,
        u.nonequijoin_preds,
        u.range_preds,
        u.like_preds,
        u.null_preds,
        u.timestamp
from    sys.col_usage$ u,
        sys.obj$ o,
        sys.user$ oo,
        sys.col$ c
where   o.obj#   = u.obj#
and     oo.user# = o.owner#
and     c.obj#   = u.obj#
and     c.col#   = u.intcol#
and     oo.name  = 'GRAHN'
and     o.name   = 'FOO'
/

OWNER TABLE_NAME COLUMN_NAME EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP
----- ---------- ----------- -------------- -------------- ----------------- ----------- ---------- ---------- -------------------
GRAHN FOO        C1                       1              0                 0           0          0          0 2011-06-08 22:29:59

Regather stats and re-check the column stats:

SQL> exec dbms_stats.gather_table_stats(user,'foo');

PL/SQL procedure successfully completed.

SQL> select
  2     a.column_name,
  3     display_raw(a.low_value,b.data_type) as low_val,
  4     display_raw(a.high_value,b.data_type) as high_val,
  5     b.data_type,
  6     a.density,
  7     a.histogram,
  8     a.num_buckets
  9  from
 10     user_tab_col_statistics a, user_tab_cols b
 11  where
 12     a.table_name='FOO' and
 13     a.table_name=b.table_name and
 14     a.column_name=b.column_name
 15  /

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2  2.5192E-07 FREQUENCY                 2

Note the presence of a frequency histogram. Now let’s re-explain:

SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |     1 |     4 |   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=003)

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1025K|  4006K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("C1"='003')

And whammy! Note that the implicit cast [ filter(TO_NUMBER("C1")=003) ] now has an estimate of 1 row (when we know there are 1 million).
So what is going on here? Let's dig into the optimizer trace for some insight:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for FOO[FOO]
  Column (#1):
    NewDensity:0.243587, OldDensity:0.000000 BktCnt:5458, PopBktCnt:5458, PopValCnt:2, NDV:2
  Column (#1): C1(
    AvgLen: 4 NDV: 2 Nulls: 0 Density: 0.243587
    Histogram: Freq  #Bkts: 2  UncompBkts: 5458  EndPtVals: 2
  Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred
  Table: FOO  Alias: FOO
    Card: Original: 2000000.000000  Rounded: 1  Computed: 0.50  Non Adjusted: 0.50
  Access Path: TableScan
    Cost:  875.41  Resp: 875.41  Degree: 0
      Cost_io: 853.00  Cost_cpu: 622375564
      Resp_io: 853.00  Resp_cpu: 622375564
  Best:: AccessPath: TableScan
         Cost: 875.41  Degree: 1  Resp: 875.41  Card: 0.50  Bytes: 0

As you can see from the line

Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred

The presence of the histogram and the implicit conversion of TO_NUMBER("C1")=003 cause the CBO to use a density of 0, because it thinks it is dealing with a non-existent value. The reason is that TO_NUMBER("C1")=003 is the same as TO_NUMBER("C1")=3, and for the histogram lookup the CBO uses TO_CHAR(C1)='3' - and '3' is not present in the histogram, only '003' and '100' are.
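You can see the mismatch with a one-liner - the number round trip normalizes away the leading zeros:

-- to_number('003') = 3, and 3 rendered back as a string is '3',
-- which matches neither histogram endpoint ('003', '100')
select to_char(to_number('003')) as val from dual;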

Dumb Luck?

So, what if the predicate contained a number that was not left padded with zeros, say 100, the other value we put in the table?

SQL> explain plan for select count(*) from foo where c1=100;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1009K|  3944K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=100)

While not exact, the CBO estimate is quite close to the 1 million rows with C1='100'.

Summary

It’s quite clear that Martin’s issue came down to the following:

  • implicit casting
  • presence of a histogram
  • a zero-left-padded number / string

The combination of these created a scenario where the CBO thinks the value is out of range and uses a prorated density of 0, resulting in a cardinality estimate of 1 when the real count is 1 million.

The moral of the story here: always type your predicates correctly. This includes explicit casts of date types as well - never rely on the NLS settings.
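For instance (a generic sketch, not from Martin's system):

-- Fragile: relies on the session's NLS_DATE_FORMAT
--   where order_date = '08-JUN-11'

-- Robust: explicit conversion or an ANSI date literal
select * from orders where order_date = to_date('2011-06-08', 'YYYY-MM-DD');
select * from orders where order_date = date '2011-06-08';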

All tests performed on 11.2.0.2.

Frequency histograms - edge cases

With the 10.2.0.4 patch set, Oracle introduced a significant change in how non-existing values are treated when there is a frequency histogram on a column / expression. See Jonathan Lewis' blog post, which is probably the most concise description of the issue. In a nutshell the change is the following (quoted from Jonathan's post): "If the value you supply does not appear in the histogram, but is inside the low/high range of the histogram then the cardinality will be half the cardinality of the least frequently occurring value that is in the histogram".

I'm still a bit puzzled why Oracle introduced such a significant optimizer change with a patch set, but one of the most obvious reasons might be that the change makes it possible to generate frequency histograms using a rather low sample size: missing one of the existing values in the histogram is no longer the threat it used to be (previously, a value that didn't appear in the histogram got an estimate of 1).

In fact, when using the default DBMS_STATS.AUTO_SAMPLE_SIZE this is exactly what Oracle does: by default it uses quite a low sample size for the additional passes required for each histogram - probably an attempt to minimize the additional work that histogram generation requires.

It is, however, quite interesting to see how exactly this behaviour, together with the new treatment of non-existing values, can turn into a threat, as a recent thread on OTN demonstrated.

Consider the following scenario: You have a column with a highly skewed data distribution; there is a single, very popular value, and a few other, very unpopular values.

Now you have a query type that filters on this column and frequently searches for non-existing values. In order to speed up the query an index has been created on the column, and in order to make the optimizer aware of the highly skewed data distribution a histogram is generated, so that the optimizer favors the index only for the unpopular (or non-existing) values.

The following test case (run on 11.1.0.7) emulates this scenario:

create table t1 (
  id number(*, 0)
, t_status number(*, 0)
, vc varchar2(100)
);

-- 1 million rows
-- One very popular value
-- Two very unpopular values
-- in column T_STATUS
insert /*+ append */ into t1 (id, t_status, vc)
with generator as (
  select /*+ materialize */
         level as id
  from
         dual
  connect by
         level <= 10000
)
select /*+ use_nl(v1, v2) */
       rownum as id
     , case
         when rownum <= 10
         then 1
         when rownum <= 20
         then 2
         else 0
       end as t_status
     , rpad('x', 100) as vc
from
       generator v1
     , generator v2
where
       rownum <= 1000000;

commit;

create index t1_idx on t1 (t_status);

exec dbms_stats.gather_table_stats(null, 'T1', method_opt => 'for all columns size 1 for columns t_status size 254')

The first interesting point is the generated column statistics:

SQL>
SQL> -- Note that the basic column statistics
SQL> -- are generated with a much higher sample size
SQL> -- The histogram however was created with a sample size
SQL> -- of 5,000 rows only
SQL> -- Therefore we get three distinct values but only a single bucket
SQL> -- But not always => Potential instability issue
SQL> select
  2         column_name
  3       , num_distinct
  4       , sample_size
  5  from
  6         user_tab_col_statistics
  7  where
  8         table_name = 'T1';

COLUMN_NAME     NUM_DISTINCT SAMPLE_SIZE
--------------- ------------ -----------
ID                   1000000     1000000
T_STATUS                   3        5499
VC                         1     1000000

SQL>
SQL> -- This results in a single bucket
SQL> -- But not always => Potential instability issue
SQL> select
  2         column_name
  3       , endpoint_number
  4       , endpoint_value
  5  from
  6         user_tab_histograms
  7  where
  8         table_name = 'T1'
  9  and    column_name = 'T_STATUS';

COLUMN_NAME     ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------- --------------
T_STATUS                   5499              0

Notice the inconsistency: The basic column statistics (number of distinct values, low value, high value) obviously have been generated with a much higher sample size (in fact a compute in this case) than the histogram on T_STATUS. The histogram therefore misses the unpopular values and consists only of a single value - the very popular one.

Now watch closely what happens to the cardinality estimates of the non-existing values:

SQL>
SQL> -- The popular value
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status = 0
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   999K|   102M|  4319   (2)| 00:00:52 |
|*  1 |  TABLE ACCESS FULL| T1   |   999K|   102M|  4319   (2)| 00:00:52 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T_STATUS"=0)

13 rows selected.

SQL>
SQL> -- A non-existing value
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status = 1000
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500K|    51M|  4316   (2)| 00:00:52 |
|*  1 |  TABLE ACCESS FULL| T1   |   500K|    51M|  4316   (2)| 00:00:52 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T_STATUS"=1000)

13 rows selected.

SQL>
SQL> -- Two non-existing values
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status in (1000, 2000)
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  1000K|   102M|  4325   (3)| 00:00:52 |
|*  1 |  TABLE ACCESS FULL| T1   |  1000K|   102M|  4325   (3)| 00:00:52 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T_STATUS"=1000 OR "T_STATUS"=2000)

13 rows selected.

SQL>

Ouch, ouch: whereas the estimate for the popular value is correct, the estimates for the non-existing values are totally off - in fact, using an IN clause with two non-existing values gets an estimate of all the rows contained in the table.

The explanation: since the single bucket - covering virtually all rows - is also the least popular value in the histogram, half of it is still 50% of the total cardinality. So two non-existing values in an IN clause end up with a combined selectivity of 0.5 + 0.5 = 1: the whole table.

Furthermore, the usual decay applied to values outside the low/high column range doesn't apply here either - no matter which non-existing values get used, the overestimate stays the same.

These overestimates obviously have a significant impact - here the suitable index doesn't get used; more complex plans might even turn into a complete disaster.

The default behaviour of generating histograms with a much lower sample size than the basic column statistics also introduces a kind of instability - try running the example several times. Sometimes one of the unpopular values gets caught by the default histogram generation, sometimes not. The effect is dramatic: if one of the unpopular values gets caught, the estimates will be reasonable again, since the least popular value in the histogram then has a very low cardinality - if not, you get exactly the opposite result, as just demonstrated.

If the old behaviour gets re-activated, the results are as expected with the same set of statistics:

SQL>
SQL> -- Switch off new behaviour
SQL> alter session set "_fix_control" = '5483301:off';

Session altered.

SQL>
SQL> -- The popular value
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status = 0
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   999K|   102M|  4319   (2)| 00:00:52 |
|*  1 |  TABLE ACCESS FULL| T1   |   999K|   102M|  4319   (2)| 00:00:52 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T_STATUS"=0)

13 rows selected.

SQL>
SQL> -- A non-existing value
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status = 1000
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 546753835

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |     1 |   107 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1     |     1 |   107 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IDX |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T_STATUS"=1000)

14 rows selected.

SQL>
SQL> -- Two non-existing values
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status in (1000, 2000)
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3743026710

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |     1 |   107 |     5   (0)| 00:00:01 |
|   1 |  INLIST ITERATOR             |        |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1     |     1 |   107 |     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_IDX |     1 |       |     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T_STATUS"=1000 OR "T_STATUS"=2000)

15 rows selected.

Switching back to the former treatment - non-existing values lead to an estimate of 1 - fixes this issue; however, since this potentially affects every execution plan, thorough regression testing would be required before using it as a global setting.

Increasing the sample size is another option, if it ensures that unpopular values get caught by the histogram generation, so that the least popular value of the histogram is one with a low cardinality. Note however that this will increase the amount of work necessary to gather the statistics for the histograms.

SQL>
SQL> -- Gather statistics with 100% sample size
SQL> exec dbms_stats.gather_table_stats(null, 'T1', estimate_percent => null, method_opt => 'for all columns size 1 for columns t_status size 254')

PL/SQL procedure successfully completed.

SQL>
SQL> -- Now the statistics are more representative
SQL> select
  2         column_name
  3       , num_distinct
  4       , sample_size
  5  from
  6         user_tab_col_statistics
  7  where
  8         table_name = 'T1';

COLUMN_NAME     NUM_DISTINCT SAMPLE_SIZE
--------------- ------------ -----------
ID                   1000000     1000000
T_STATUS                   3     1000000
VC                         1     1000000

SQL>
SQL> -- The histogram now covers also the unpopular values
SQL> select
  2         column_name
  3       , endpoint_number
  4       , endpoint_value
  5  from
  6         user_tab_histograms
  7  where
  8         table_name = 'T1'
  9  and    column_name = 'T_STATUS';

COLUMN_NAME     ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------- --------------
T_STATUS                 999980              0
T_STATUS                 999990              1
T_STATUS                1000000              2

SQL>
SQL> -- Switch back on new behaviour
SQL> alter session set "_fix_control" = '5483301:on';

Session altered.

SQL>
SQL> -- A non-existing value
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status = 1000
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 546753835

--------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |     1 |   107 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1     |     1 |   107 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IDX |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T_STATUS"=1000)

14 rows selected.

SQL>
SQL> -- Two non-existing values
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status in (1000, 2000)
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3743026710

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |     1 |   107 |     5   (0)| 00:00:01 |
|   1 |  INLIST ITERATOR             |        |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1     |     1 |   107 |     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_IDX |     1 |       |     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T_STATUS"=1000 OR "T_STATUS"=2000)

15 rows selected.

As suggested by Jonathan Lewis in the OTN thread, another elegant solution to the problem of searching for non-existing values is to add a virtual column that filters out the popular values. This approach has several advantages, if applicable: the index to maintain is very small, which minimizes maintenance effort (and might also address index-efficiency issues in case of frequent updates to the column), and it solves the histogram issue, since the histogram then only covers the unpopular values:

SQL>
SQL> -- Add a virtual column (the same could be achieved using a function-based index instead for pre-11g versions)
SQL> alter table t1 add (t_status_unpop as (case when t_status != 0 then t_status end));

Table altered.

SQL>
SQL> exec dbms_stats.gather_table_stats(null, 'T1', method_opt => 'for columns t_status_unpop size 254')

PL/SQL procedure successfully completed.

SQL>
SQL> -- Since the virtual column only covers
SQL> -- the unpopular values, the histogram
SQL> -- will be very precise even with a low sample size
SQL> select
  2         column_name
  3       , num_distinct
  4       , sample_size
  5  from
  6         user_tab_col_statistics
  7  where
  8         table_name = 'T1';

COLUMN_NAME     NUM_DISTINCT SAMPLE_SIZE
--------------- ------------ -----------
ID                   1000000     1000000
T_STATUS                   3     1000000
VC                         1     1000000
T_STATUS_UNPOP             2          20

SQL>
SQL> -- The histogram only covers the unpopular values
SQL> select
  2         column_name
  3       , endpoint_number
  4       , endpoint_value
  5  from
  6         user_tab_histograms
  7  where
  8         table_name = 'T1'
  9  and    column_name = 'T_STATUS_UNPOP';

COLUMN_NAME     ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------- --------------
T_STATUS_UNPOP               10              1
T_STATUS_UNPOP               20              2

SQL>
SQL> -- Two non-existing values
SQL> -- So we don't run into the same problem here
SQL> -- The estimate is reasonable for non-existing values
SQL> explain plan for
  2  select
  3         *
  4  from
  5         t1
  6  where
  7         t_status_unpop in (1000, 2000)
  8  ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |   108 |  4470   (6)| 00:00:54 |
|*  1 |  TABLE ACCESS FULL| T1   |     1 |   108 |  4470   (6)| 00:00:54 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("T_STATUS_UNPOP"=1000 OR "T_STATUS_UNPOP"=2000)

13 rows selected.

I haven't created an index on the virtual column, but the estimate is correct and a suitable index would get used if it existed.
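For completeness, the matching index would simply be the following (a sketch, not part of the original test). Because the popular value maps to NULL in the virtual column, and entirely NULL keys are not stored in a single-column B*Tree index, the index stays tiny:

create index t1_idx_unpop on t1 (t_status_unpop);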