# statistics

## Spatial space

One thing you (ought to) learn very early on in an Oracle career is that there are always cases you haven’t previously considered. It’s a feature that is frequently the downfall of “I found it on the internet” SQL.  Here’s one (heavily paraphrased) example that appeared on the OTN database forum a few days ago:

select table_name,round((blocks*8),2)||’kb’ “size” from user_tables where table_name = ‘MYTABLE';

select table_name,round((num_rows*avg_row_len/1024),2)||’kb’ “size” from user_tables where table_name = ‘MYTABLE';

The result from the first query is 704 kb,  the result from the second is 25.4 kb … fragmentation, rebuild, CTAS etc. etc.

## Quantum Data

That’s data that isn’t there until you look for it, sort of, from the optimizer’s perspective.

Here’s some code to create a sample data set:

```
create table t1
as
with generator as (
select	--+ materialize
rownum id
from dual
connect by
level <= 1e4
)
select
rownum					id,
mod(rownum-1,200)			mod_200,
mod(rownum-1,10000)			mod_10000,
from
generator	v1,
generator	v2
where
rownum <= 1e6
;

begin
dbms_stats.gather_table_stats(
ownname		 => user,
tabname		 =>'T1',
method_opt 	 => 'for all columns size 1'
);
end;
/

```

Now derive the execution plans for a couple of queries noting, particularly, that we are using queries that are NOT CONSISTENT with the current state of the data (or more importantly the statistics about the data) – we’re querying outside the known range.

## Comparisons

“You can’t compare apples with oranges.”

Oh, yes you can! The answer is 72,731,533,037,581,000,000,000,000,000,000,000.

Here’s a question that appeared in my email a few days ago:

Based on the formula: “sreadtim = ioseektim + db_block_size/iotrfrspeed” sreadtim should always bigger than ioseektim.

But I just did a query on my system, find it otherwise, get confused,

This is a quick response to a question on an old blog post asking how you can adjust the high value if you’ve already got a height-balanced histogram in place. It’s possible that someone will come up with a tidier method, but this was just a quick sample I created and tested on 11.2.0.4 in a few minutes.  (Note – this is specifically for height-balanced histograms,  and it’s not appropriate for 12c which has introduced hybrid histograms that will require me to modify my “histogram faking” code a little).

## Extended stats

Like the recent article on deleting histograms this is another draft that I rediscovered while searching for some notes I had written on a different topic – so I’ve finally finished it off and published it.

Here’s a quirky little detail of extended stats that came up in an OTN thread earlier on this week [ed: actually 8th Jan 2014]. When you create column group stats, Oracle uses an undocumented function sys_op_combined_hash() to create a hash value, and if you gather simple stats on the column (i.e. no histogram) you can get some idea of the range of values that Oracle generates through the hash function. For example:

## Delete Histogram

Here’s a note which I drafted in Novemeber 2010, and then didn’t publish. I found it earlier on this morning while looking for another note I’d written about histograms so, even though it may not be something that people need so much these days, I thought: better late than never.

I’ve pointed out in the past that I’m not keen on seeing lots of histograms on a system and tend to delete them if I think they are not needed. Here’s an example of the type of code I use to delete a histogram.

## Auto Sample Size

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:

## Bitmap join indexes

Here’s another of my “draft” notes that needs some exapansion and, most importantly, proof.

I have a fact table with a “status id” column that shows a massive skew. Unfortunately I have a dimension table that holds the “status code”, so (in theory, at least) I have to do a join from the statuses table to the fact table to find rows of a given status. Unfortunately the join hides the skew:

```select  f.*
from    facts f, statuses s
where   s.code = 'C'
and     f.status_id = s.status_id
;
```

The optimizer knows that the status_id column on the facts table has a highly skewed distribution and will create a histogram on it, but it can’t know which status code corresponds to which status_id, so the histogram doesn’t help in calculating the join cardinality.

Will a bitmap join index help ? Answer – NO.

## Control Charts

Last week, while working on customer engagement, I learned a new method of quantifying behavior of time-series data. The method is called “Control Chart” and credit to Josh Wills, our director of data science, for pointing it out. I thought I’ll share it with my readers as its easy to understand, easy to implement, flexible and very useful in many situations.

The problem is ages old – you collect measurements over time and want to know when your measurements indicate abnormal behavior. “Abnormal” is not well defined, and thats on purpose – we want our method to be flexible enough to match what you define as an issue.

For example, lets say Facebook are interested in tracking usage trend for each user, catching those with decreasing use

There are few steps to the Control Chart method: