Search

Top 60 Oracle Blogs

Recent comments

Clustering_Factor

A few days ago I published a little note of a script I wrote some time ago to estimate the clustering_factor of an index before it had been built. At the time I pointed out that one of its limitations was that it would not handle cases where you were planning to set the table_cached_blocks preference, but a couple of days later I decided that I’d write another version of the code that would cater for the new feature – and that’s how I made an embarrassing discovery.

Having reviewed a number of notes I’ve published about the table_cached_blocks preference and its impact on the clustering_factor I’ve realised the what I’ve written has always been open to two interpretations – the one that I had in mind as I was writing, and the correct one.  I made this discovery because I had written a simple SQL statement – using the match_recognize() mechanism – to do what I considered to  be the appropriate calculation. After testing the query with a few sets of sample data that produced the correct results I emailed Stew Ashton (my “go-to” person for match_recognize() questions) asking if he would do a sanity check on the code because it was rather slow and I wondered if there was a better way of writing it.

His reply was roughly:

“I’ve read the notes you and Richard Foote have written about the clustering_factor and table_cached_blocks, and this isn’t doing what your description says it should.”

Then he explained what he had inferred from what I had written … and it made more sense than what I had been thinking when I wrote it. He also supplied some code to implement his interpretation – so I designed a couple of data models that would produce the wrong prediction for whichever piece of code implemented the wrong interpretation. His code gave the right answers, mine didn’t.

So here’s the difference in interpretation – the wrong one first – using 16 as a discussion value for the table_cached_blocks:

  • WRONG interpretation:  As you walk through index entries in order remember the last 16 rowids (that’s rowid for the rows in the table that the index is pointing to) you’ve seen. If the current rowid has a block id component that doesn’t match the block id from one of the remembered 16 rowids then increment the counter for the clustering_factor.
    • The simplicity of this algorithm means you can fix a “circular” array of 16 entries and keep walking around the circle overwriting the oldest entry each time you read a new one. It’s a pity that it’s the wrong idea because there’s a simple (though massively CPU -intensive match_recognize() strategy for implementing it – and if you were using an internal library mechanism during a proper gather_index_stats() it could be incredibly efficient.
  • RIGHT interpretation: set up an array for 16 block ids, each with an associated “row-number”. Walk through the index in order – giving each entry a row-number as you go. Extract the block id from the current entry and search through the array for a matching block id.  If you find a match then update its entry with the current row-number (so you can remembr how recently you saw the block id); if you don’t find a match then replace the entry that has the smallest (i.e. greatest distance into the past) row-number with the current block id and row-number and increment the counter for the clustering_factor.

The first piece of code that Stew Ashton sent me was an anonymous PL/SQL block that included some hard-coded fragments and embedded SQL to use a test table and index that I had defined, but he then sent a second piece of code that creates a generic function that uses dynamic SQL to construct a query against a table and an index definition that you want to test. The latter is the code I’ve published (with permission) below:


create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.

Input is the table name, the list of column names
and the intended value of TABLE_CACHED_BLOCKS.

The function collects the last N block ids (not the last N entries).
When there is no more room, it increments the clustering factor
and replaces the least recently used block id with the current one.

Note: here a "block id" is a rowid with the row_number portion set to 0.
It is effectively a "truncated" rowid.
*/
  p_table_name in varchar2,
  p_column_list in varchar2,
  p_table_cached_blocks in number
) return number authid current_user is

  rc sys_refcursor;
  type tt_rids is table of rowid;
  lt_rids tt_rids;
  
  type t_block_list is record(
    rid rowid,
    last_hit number
  );

  type tt_block_list is table of t_block_list;
  lt_block_list tt_block_list := new tt_block_list();

  l_rid rowid;
  l_clustering_factor number := 0;
  b_block_found boolean;
  l_rn number := 0;
  l_oldest_hit number;
  i_oldest_hit binary_integer := 0;
  
  function truncated_rid(p_rid in rowid) return rowid is
    rowid_type number;
    object_number NUMBER;
    relative_fno NUMBER;
    block_number NUMBER;
    row_number NUMBER;
    rid rowid;

  begin

    DBMS_ROWID.ROWID_INFO (
      p_rid,
      rowid_type,
      object_number,
      relative_fno,
      block_number,
      row_number
    );

    rid := DBMS_ROWID.ROWID_CREATE (
      rowid_type,
      object_number,
      relative_fno,
      block_number,
      0
    );

    return rid;

  end truncated_rid;
  
begin
  if p_table_cached_blocks != trunc(p_table_cached_blocks)
  or p_table_cached_blocks not between 1 and 255 then
    raise_application_error(
      -20001, 
      'input parameter p_table_cached_blocks must be an integer between 1 and 255'
    );
  end if;

  open rc for 'select rowid from '||p_table_name||' order by '||p_column_list||', rowid';
  loop
    fetch rc bulk collect into lt_rids limit 1000;

    for irid in 1..lt_rids.count loop
      l_rn := l_rn + 1;
      l_rid := truncated_rid(lt_rids(irid));
      b_block_found := false;
      l_oldest_hit := l_rn;

      if l_rn = 1 then
        l_clustering_factor := l_clustering_factor + 1;
        lt_block_list.extend;
        lt_block_list(1).rid := l_rid;
        lt_block_list(1).last_hit := l_rn;

      else

        for i in 1..lt_block_list.count loop
          if l_oldest_hit > lt_block_list(i).last_hit then
            l_oldest_hit := lt_block_list(i).last_hit;
            i_oldest_hit := i;
          end if;
          if lt_block_list(i).rid = l_rid then
            b_block_found := true;
            lt_block_list(i).last_hit := l_rn;
            exit;
          end if;
        end loop;

        if not b_block_found then
          l_clustering_factor := l_clustering_factor + 1;
          if lt_block_list.count < p_table_cached_blocks then
            lt_block_list.extend;
            lt_block_list(lt_block_list.count).rid := l_rid;
            lt_block_list(lt_block_list.count).last_hit := l_rn; 
          else         
            lt_block_list(i_oldest_hit).rid := l_rid;
            lt_block_list(i_oldest_hit).last_hit := l_rn;
          end if;
        end if;

      end if;

    end loop;
    exit when rc%notfound;
  end loop;

  close rc;
  return l_clustering_factor;

exception when others then
  if rc%isopen then
    close rc;
  end if;
  raise;

end predict_clustering_factor;
/

After executing the above to create the function, here’s an example of usage:

rem
rem     Script:         clustering_factor_est_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2019
rem
rem     Last tested
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        cast(rownum as varchar2(10))            v1,
        trunc(dbms_random.value(0,10000))       rand,
        rpad('x',100,'x')                       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
/

-- -------------------------------------------------------------------

SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ' || predict_clustering_factor('t1','rand, id',16))
Predicted cf for t1(rand, id): 997218
Elapsed: 00:00:07.54

SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ' || predict_clustering_factor('t1','rand, id',255))
Predicted cf for t1(rand, id): 985607
Elapsed: 00:00:50.61

You’ll notice that the larger the setting for the “table_cached_blocks” parameter the more time it takes to predict the clustering_factor – and it was all CPU time in my example. This isn;t surprising given the need to search through an array holding the previous history. In this example the table t1 holds 1,000,000 rows, and the number and scatter of distinct values is so arranged that the code will hardly ever find a cached block id – essentially it’s the sort of index that isn’t going to cause much of confusion to the optimizer and isn’t likely to need special attention to make the optimizer use it when it should and ignore it when it’s inappropriate.

Finally a cut-n-paste to show the accuracy of the two predictions:

SQL> create index t1_i on t1(rand, id);
Elapsed: 00:00:02.96

SQL> execute dbms_stats.set_table_prefs(null,'t1','table_cached_blocks',16)
Elapsed: 00:00:00.01

SQL> execute dbms_stats.gather_index_stats(null,'t1_i')
Elapsed: 00:00:09.55

SQL> select clustering_factor from user_indexes where index_name = 'T1_I';

CLUSTERING_FACTOR
-----------------
           997218

Elapsed: 00:00:00.11

SQL> execute dbms_stats.set_table_prefs(null,'t1','table_cached_blocks',255)
Elapsed: 00:00:00.01

SQL> execute dbms_stats.gather_index_stats(null,'t1_i')
Elapsed: 00:00:07.80

SQL> select clustering_factor from user_indexes where index_name = 'T1_I';

CLUSTERING_FACTOR
-----------------
           985607

Elapsed: 00:00:00.00

Both match perfectly – but you might notice that creating the index and gathering the stats was much faster than predicting the clustering factor for the case where we set table_cached_blocks = 255.

(If you’re wondering, my “simple but irrelevant” match_recognize() query took 370 CPU second to complete for table_cached_blocks = 200 – and a limit on march_recognize() meant that 200 was the maximum value I was allowed to use – so now you know why I emailed Stew Ashton (and just for lagniappe. he also told me about a simple workaround for the 200 limit)).