In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:
Here’s another of my “draft” notes that needs some exapansion and, most importantly, proof.
I have a fact table with a “status id” column that shows a massive skew. Unfortunately I have a dimension table that holds the “status code”, so (in theory, at least) I have to do a join from the statuses table to the fact table to find rows of a given status. Unfortunately the join hides the skew:
select f.* from facts f, statuses s where s.code = 'C' and f.status_id = s.status_id ;
The optimizer knows that the status_id column on the facts table has a highly skewed distribution and will create a histogram on it, but it can’t know which status code corresponds to which status_id, so the histogram doesn’t help in calculating the join cardinality.
Will a bitmap join index help ? Answer – NO.
Last week, while working on customer engagement, I learned a new method of quantifying behavior of time-series data. The method is called “Control Chart” and credit to Josh Wills, our director of data science, for pointing it out. I thought I’ll share it with my readers as its easy to understand, easy to implement, flexible and very useful in many situations.
The problem is ages old – you collect measurements over time and want to know when your measurements indicate abnormal behavior. “Abnormal” is not well defined, and thats on purpose – we want our method to be flexible enough to match what you define as an issue.
For example, lets say Facebook are interested in tracking usage trend for each user, catching those with decreasing use
There are few steps to the Control Chart method:
It has taken much longer than I anticipated to get around to writing part 3 of this mini-series on what Oracle has done about histograms in 12c.
In part 1 I gave a thumbnail sketch of the three types of histogram available in 12c
In part 2 I described in some detail the improvements in performance and accuracy for the frequency and top-frequency histograms
Or – to be more accurate – real statistics on a virtual column.
This is one of the “10 top tips” that I came up with for my session with Maria Colgan at OOW13. A method of giving more information that might improve execution plans when you can change the code. I’ll start with a small data set including a virtual column (running 184.108.40.206), and a couple of problem queries:
Here’s a little demo cut-n-pasted from a session running Oracle 220.127.116.11 (it works on 11g, too). All it does is create a table by copying from a well-known table, gather extended stats on a column group, then show you the resulting column names by querying view user_tab_cols.
Here’s a funny little problem I came across some time ago when setting up some materialized views. I have two tables, orders and order_lines, and I’ve set up materialized view logs for them that allow a join materialized view (called orders_join) to be fast refreshable. Watch what happens if I refresh this view just before gathering stats on the order_lines table.
I have a little script that start with “set echo on”, then calls two packaged procedures, one to refresh the join view, the other to collect stats on the order_lines table; here’s the output from that script:
I posted a couple of examples some time ago of oddities and boundary cases for frequency histograms on character columns. Part of the process of playing around with the 12c Beta was to re-run such cases to see if newer code made any difference. Looking at these examples, one was fixed (or improved, at least) the other wasn’t, so I’ve added a footnote to each and produced this little note to highlight the changes:
In part 2 of this mini-series I’ll be describing the new mechanism for the simple frequency histogram and the logic of the Top-N frequency histogram. In part 3 I’ll be looking at the new hybrid histogram. You need to know about the approximate NDV before you start – but there’s a thumbnail sketch at the end of the posting if you need a quick reminder.
To allow for collection of simple frequency histogram – record the first rowid for each hash value generated and count the number of times the hash value is generated. If, by the end of the table you have no more than the requested (default 254, max 2,000) distinct hash values you can look up the actual values with a query by rowid.