Removing Unnecessary Indexes: 1. The Need for Extended Statistics

This is the first post in a series about unnecessary indexes and some of the challenges that they present.

I have always noted that it is remarkably easy to add indexes to systems, especially OLTP systems. However, removing them is generally more difficult.
The reward for removing an unused, or at least unnecessary, index is that you no longer have to maintain it during DML operations, thus reducing I/O and keeping index blocks out of the buffer cache. However, the risk is that performance will degrade somewhere because an execution plan changes.
I will start by looking at a class of indexes that are easily identified as candidates for removal. If the columns in a non-unique index are the same as the leading columns in another index, then it is reasonable to remove it, because any query that uses it could also use the other index. We cannot drop unique indexes, because they are needed to enforce a unique or primary key constraint.
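
As an illustration of how such candidates can be identified, here is a sketch of a query (my own illustration, not a definitive script; it ignores complications such as function-based and descending index columns) that lists non-unique indexes whose columns all match, position for position, the leading columns of another index on the same table:

SELECT i.table_name
,      i.index_name AS redundant_index
,      o.index_name AS covering_index
FROM   user_indexes i
       JOIN user_indexes o
         ON  o.table_name =  i.table_name
         AND o.index_name != i.index_name
WHERE  i.uniqueness = 'NONUNIQUE'
AND    NOT EXISTS ( /* no column of the candidate fails to match the same position in the other index */
         SELECT 1
         FROM   user_ind_columns ic
                LEFT JOIN user_ind_columns oc
                  ON  oc.index_name      = o.index_name
                  AND oc.column_position = ic.column_position
                  AND oc.column_name     = ic.column_name
         WHERE  ic.index_name = i.index_name
         AND    oc.index_name IS NULL)
ORDER BY 1, 2;
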
However, there is a catch. A multi-column index might be used by the optimizer to provide the number of distinct keys without being referenced in the execution plan. Here is a simple example.

CREATE TABLE t
(a NUMBER
,b NUMBER
,c NUMBER
,d NUMBER
,CONSTRAINT t_pk PRIMARY KEY (a)
);

CREATE INDEX t_ab ON t (a,b);
CREATE INDEX t_bc ON t (b,c);
CREATE INDEX t_bcd ON t (b,c,d);
  • T_BC is a subset of T_BCD, so it is redundant and could be dropped.
  • (correction 7.1.2017) The primary key index T_PK is a subset of T_AB.  It is also redundant because the primary key constraint, on A alone, could be altered to use index T_AB.
ALTER TABLE t MODIFY PRIMARY KEY USING INDEX t_ab;

However, I am going to leave the default primary key index in place for the rest of this demonstration.  I will now populate the table, and gather statistics but not histograms.

    INSERT /*+APPEND*/ INTO t
    WITH x AS (
    SELECT rownum-1 n FROM DUAL connect by level <= 1E5)
    SELECT n
    , MOD(n,100)
    , ROUND(MOD(n,100),-1)
    , dbms_random.value(1,100)
    FROM x;
    EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1');

    The table has 100,000 rows; column B has 100 distinct values and column C has 11. However, there is a close correlation between them: the value of C is the value of B rounded to the nearest 10.

    column column_name format a20
    SELECT column_name, num_distinct, num_buckets, num_nulls, histogram
    FROM user_tab_col_statistics
    WHERE table_name = 'T'
    ORDER BY 1
    /

    COLUMN_NAME          NUM_DISTINCT NUM_BUCKETS  NUM_NULLS HISTOGRAM
    -------------------- ------------ ----------- ---------- ---------------
    A                          100000           1          0 NONE
    B                             100           1          0 NONE
    C                              11           1          0 NONE
    D                          100000           1          0 NONE

    We can also see that index T_BC has 100 distinct keys.

    SELECT index_name, uniqueness, visibility, distinct_keys, clustering_factor
    FROM user_indexes WHERE table_name = 'T'
    /

    INDEX_NAME UNIQUENES VISIBILIT DISTINCT_KEYS CLUSTERING_FACTOR
    ---------- --------- --------- ------------- -----------------
    T_PK       UNIQUE    VISIBLE          100000               525
    T_AB       NONUNIQUE VISIBLE          100000               525
    T_BC       NONUNIQUE VISIBLE             100             52500
    T_BCD      NONUNIQUE VISIBLE          100000             99894

    Let's look at the execution plans for some simple queries. This first query uses index T_BC. It doesn't even need to look up the table, because column A (the primary key) is not null, so COUNT(A) can be satisfied from the index alone.

    SELECT COUNT(A) FROM t WHERE b = 42 AND c = 40

    Plan hash value: 2931606879
    --------------------------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
    --------------------------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | | | 3 (100)| | 1 |00:00:00.01 | 5 |
    | 1 | SORT AGGREGATE | | 1 | 1 | 6 | | | 1 |00:00:00.01 | 5 |
    |* 2 | INDEX RANGE SCAN| T_BC | 1 | 1000 | 6000 | 3 (0)| 00:00:01 | 1000 |00:00:00.01 | 5 |
    --------------------------------------------------------------------------------------------------------------------

    This query references column D, which is nullable, so it uses index T_BCD.

    SELECT COUNT(D) FROM t WHERE b = 42 AND c = 40

    Plan hash value: 2099617883
    ---------------------------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
    ---------------------------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | | | 8 (100)| | 1 |00:00:00.01 | 8 |
    | 1 | SORT AGGREGATE | | 1 | 1 | 28 | | | 1 |00:00:00.01 | 8 |
    |* 2 | INDEX RANGE SCAN| T_BCD | 1 | 1000 | 28000 | 8 (0)| 00:00:01 | 1000 |00:00:00.01 | 8 |
    ---------------------------------------------------------------------------------------------------------------------

    Note that in both cases the optimizer estimated it would receive 1000 rows, and it did.
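
    That estimate is consistent with the optimizer using the DISTINCT_KEYS statistic of index T_BC shown above: with equality predicates on both of its columns, 100,000 rows / 100 distinct keys = 1,000 rows.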

    Invisible Indexes 

    Rather than drop index T_BC, I am going to make it invisible and repeat the test. In a real production situation I would always do this before eventually dropping the index. If removing the index does have undesirable consequences, it can instantly be made visible again, whereas rebuilding a dropped index may take time and cause contention on a large and active table.

    ALTER INDEX t_bc INVISIBLE;
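
    Should this turn out to be a mistake, the index can be brought straight back into use:

    ALTER INDEX t_bc VISIBLE;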

    The execution plan for the first query has changed to use T_BCD (which is perfectly reasonable), but the estimated number of rows is now just 91, not the 1000 that were actually found. Although it didn't make a difference to this execution plan, it is this misestimate which could cause plan regressions in more complex cases.

    SELECT COUNT(A) FROM t WHERE b = 42 AND c = 40

    Plan hash value: 2099617883
    ---------------------------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
    ---------------------------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | | | 3 (100)| | 1 |00:00:00.01 | 8 |
    | 1 | SORT AGGREGATE | | 1 | 1 | 6 | | | 1 |00:00:00.01 | 8 |
    |* 2 | INDEX RANGE SCAN| T_BCD | 1 | 91 | 546 | 3 (0)| 00:00:01 | 1000 |00:00:00.01 | 8 |
    ---------------------------------------------------------------------------------------------------------------------

    The other query produces the same plan, with the same misestimate of the number of rows.

    SELECT COUNT(D) FROM t WHERE b = 42 AND c = 40

    Plan hash value: 2099617883
    ---------------------------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
    ---------------------------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | | | 3 (100)| | 1 |00:00:00.02 | 8 |
    | 1 | SORT AGGREGATE | | 1 | 1 | 28 | | | 1 |00:00:00.02 | 8 |
    |* 2 | INDEX RANGE SCAN| T_BCD | 1 | 91 | 2548 | 3 (0)| 00:00:01 | 1000 |00:00:00.01 | 8 |
    ---------------------------------------------------------------------------------------------------------------------

    The optimizer misestimates the number of rows because it can no longer use index T_BC to tell it that there are only 100 distinct values in the table for the combination of columns B and C.
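
    To put a number on it (my own arithmetic, using the statistics gathered above): treating the two predicates as independent gives 100,000 × 1/100 × 1/11 ≈ 91 rows, exactly the E-Rows figure in the plans above, whereas using the 100 distinct combinations of (B,C) gives 100,000 / 100 = 1,000 rows.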

    Extended Statistics

    Now I will create extended statistics on the table.

    select dbms_stats.create_extended_stats(null,'T','(B,C)') from dual;
    EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1');

    Alternatively, I could simply have done this; it comes to the same thing.

    EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 1 (B,C)');

    The extended statistics on the combination of columns B and C tell the optimizer that there are 100 distinct values for this pair of columns. Note that the extended statistics have only a single bucket, so there is no histogram on the column group. Re-running the earlier query against USER_TAB_COL_STATISTICS shows the new virtual column:

    COLUMN_NAME                    NUM_DISTINCT NUM_BUCKETS  NUM_NULLS HISTOGRAM
    ------------------------------ ------------ ----------- ---------- -----------
    A                                    100000           1          0 NONE
    B                                       100           1          0 NONE
    C                                        11           1          0 NONE
    D                                    100000           1          0 NONE
    SYS_STU3$58HONF9VK$$69P2OW4P4X          100           1          0 NONE

    SELECT * FROM user_stat_extensions WHERE table_name = 'T'

    TABLE_NAME EXTENSION_NAME                 EXTENSION            CREATO DRO
    ---------- ------------------------------ -------------------- ------ ---
    T          SYS_STU3$58HONF9VK$$69P2OW4P4X ("B","C")            USER   YES
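
    Incidentally, the system-generated name of a column group can also be obtained programmatically; a minimal sketch using DBMS_STATS.SHOW_EXTENDED_STATS_NAME:

    SELECT dbms_stats.show_extended_stats_name(user,'T','(B,C)') FROM dual;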

    Now the optimizer reverts to correctly estimating the number of rows as 1000.

    SELECT COUNT(A) FROM t WHERE b = 42 AND c = 40

    Plan hash value: 2099617883
    ---------------------------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
    ---------------------------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | | | 8 (100)| | 1 |00:00:00.01 | 8 |
    | 1 | SORT AGGREGATE | | 1 | 1 | 6 | | | 1 |00:00:00.01 | 8 |
    |* 2 | INDEX RANGE SCAN| T_BCD | 1 | 1000 | 6000 | 8 (0)| 00:00:01 | 1000 |00:00:00.01 | 8 |
    ---------------------------------------------------------------------------------------------------------------------

    SELECT COUNT(D) FROM t WHERE b = 42 AND c = 40

    Plan hash value: 2099617883
    ---------------------------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
    ---------------------------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | | | 8 (100)| | 1 |00:00:00.02 | 8 |
    | 1 | SORT AGGREGATE | | 1 | 1 | 28 | | | 1 |00:00:00.02 | 8 |
    |* 2 | INDEX RANGE SCAN| T_BCD | 1 | 1000 | 28000 | 8 (0)| 00:00:01 | 1000 |00:00:00.01 | 8 |
    ---------------------------------------------------------------------------------------------------------------------

    Conclusion

    • If you are going to drop a redundant multi-column index, at least replace it with extended statistics on that combination of columns (a sketch of the sequence follows this list).
    • If you are going to add columns to an existing index rather than create another, perhaps to satisfy a query without the need to visit the table, then again consider creating extended statistics on the original list of columns.
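
    Putting that together for this demonstration, the sequence might look like this (a sketch, not a production script): create the column-group statistics first, make the redundant index invisible, and only drop it once you are satisfied that no plans have regressed.

    EXEC dbms_stats.gather_table_stats(null,'T',cascade=>TRUE,method_opt=>'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 1 (B,C)');
    ALTER INDEX t_bc INVISIBLE;
    -- monitor the system, then, when satisfied that nothing has regressed:
    DROP INDEX t_bc;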

    Acknowledgement: my thanks to Jonathan Lewis for his blog: Index Usage, and for prompting the correction.