In part 2 of this mini-series I’ll be describing the new mechanism for the simple frequency histogram and the logic of the Top-N frequency histogram. In part 3 I’ll be looking at the new hybrid histogram. You need to know about the approximate NDV before you start – but there’s a thumbnail sketch at the end of the posting if you need a quick reminder.

To allow for collection of a simple frequency histogram – record the first rowid for each hash value generated and count the number of times the hash value is generated. If, by the end of the table, you have no more than the requested number of distinct hash values (default 254, max 2,048) you can look up the actual values with a query by rowid.
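The mechanism can be sketched in a few lines of Python. This is a toy illustration of the strategy described above, not Oracle's actual code: one pass records, per hash value, the first rowid and a running count; if the number of distinct hashes stays within the bucket limit, the real values are then fetched by rowid (simulated here by re-reading the sample rows).

```python
# Hypothetical sketch of one-pass frequency-histogram collection:
# hash each value, remember the first rowid per hash, count repeats.

def collect_frequency_histogram(rows, max_buckets=254):
    """rows: iterable of (rowid, value) pairs; returns {value: count} or None."""
    seen = {}                              # hash -> [first_rowid, count]
    for rowid, value in rows:
        h = hash(value)                    # stand-in for Oracle's internal hash
        if h in seen:
            seen[h][1] += 1
        else:
            seen[h] = [rowid, 1]
    if len(seen) > max_buckets:
        return None                        # too many distinct values for a simple histogram
    # Oracle would now fetch the actual values with a query by rowid;
    # here we just index the sample rows by rowid instead.
    by_rowid = {rowid: value for rowid, value in rows}
    return {by_rowid[first]: count for first, count in seen.values()}

rows = [(i, v) for i, v in enumerate("aabbbbc")]
print(collect_frequency_histogram(rows))   # {'a': 2, 'b': 4, 'c': 1}
```

Note that the second pass touches only one row per distinct value, which is why the marginal cost over the basic tablescan is so small.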

**Outstanding question** – what if two values hash to the same result? I can’t tell until I get two values that give the same hash, but I assume the code must check for collisions – perhaps Oracle keeps a linked list with each hash value showing the original value (with its first rowid) so that, in the extremely unlikely event of a collision, it can keep a sample rowid for the second (and subsequent) values and keep separate counts.

Note that the work done to generate a simple frequency histogram requires extra memory and a little more CPU than the basic approximate NDV – but the marginal cost is very small compared to the cost of the tablescan that is required to generate the approximate NDV.

If the number of hash values exceeds the number of requested buckets Oracle would traditionally have to fall back to a “height-balanced” histogram; but in 12c Oracle can recognise the special case where MOST of the data falls into a relatively small number of buckets, leaving a small amount of data that spans a large number of distinct values. If we call the small amount of data the “non-popular” data, then “small” means the non-popular data will fit into a single bucket.

To give a concrete example – assume you have 100,000 rows in a table, and in one column you have 5,000 distinct values, where 95 of those values account for more than 99,000 of the rows. If you ask for a histogram of 100 buckets, Oracle will check to see if the top 100 [corrected from 99 – see comment] popular values account for more than 99/100ths of the data; if you ask for a histogram of 200 buckets, Oracle will check to see if the top 200 popular values account for more than 199/200ths of the data.
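The acceptance test above reduces to a simple piece of arithmetic, which the following hypothetical sketch (names are mine, not Oracle's) makes explicit: with N buckets, the top N values must cover more than (N − 1)/N of the rows – equivalently, everything that's left over must fit into a single bucket.

```python
# Hypothetical illustration of the Top-N qualification test:
# with num_buckets buckets, the top num_buckets values must account
# for more than (num_buckets - 1)/num_buckets of the rows.

def topn_qualifies(value_counts, num_buckets, total_rows):
    top = sorted(value_counts, reverse=True)[:num_buckets]
    return sum(top) > total_rows * (num_buckets - 1) / num_buckets

# 100,000 rows, 1,000 distinct values: 100 popular values at 991 rows
# each (99,100 rows), plus 900 values appearing once each.
counts = [991] * 100 + [1] * 900
print(topn_qualifies(counts, 100, 100_000))  # True: 99,100 > 99,000
print(topn_qualifies(counts, 50, 100_000))   # False: 49,550 <= 98,000
```

With 100 buckets the 900 non-popular rows fit comfortably into the one "spare" bucket's worth of data (1,000 rows), so a Top-N histogram is possible; with 50 buckets it isn't.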

The low and high values are needed for “out of range” checks for predicates, and if they are non-popular values then Oracle does a little bit of fudging around the edges to include them. (They are available from the standard min/max parts of the stats gathering query – but Oracle seems to record them in the histogram with a count of 1 if they aren’t popular values).

If the total number of distinct values is less than the 16,384 that Oracle uses as its limit for the approximate NDV, then including the count for each hash value allows it to sort the hash values by frequency and decide whether a small enough number of hash values covers a large enough percentage of the data to allow for the production of a Top-N histogram.

**Outstanding Question** – If the total number of distinct values is greater than the 16,384 that Oracle uses as its limit, then the hash table will have split (at least once). But Oracle can still check the sorted counts of the hash values it has got to see if a small enough number of hash values covers a large enough percentage of the WHOLE data set, because the basic stats gathering query includes a count of the rows in the table. However, I think that something more sophisticated than this is going on because I ran a number of tests that should (probably) have discarded half the popular data – and yet Oracle managed to produce a Top-N histogram for every test where it was appropriate.

Again the time to produce the histogram is similar to the time needed to do the tablescan that generates the min, max, and count values for the column under the basic approximate NDV strategy. The important thing about this type of histogram is that in many cases it’s likely to be much more **accurate** and **stable** than the height-balanced histogram you would have got with earlier versions of Oracle.

Interestingly, I’ve been saying for many years that if you want stability when your data has a large number of distinct values then the best strategy is to code up (with dbms_stats.set_index_stats et al.) a frequency histogram that captures the most popular 252 values, the low value, the high value, and a fake density (or a fake for the lowest frequency you create that allows Oracle to model the non-popular data correctly).

Although simple frequency histograms can be more accurate because they allow up to 2,000 buckets (though you probably don’t need that many in most cases), and although Top-N frequency histograms are likely to be much better than height-balanced histograms – don’t forget that your choice of WHEN to create the histogram can be the most important part of histogram collection. There’s no point in collecting a histogram at 10:00 pm that tells you about a *status* column if the “interesting” data in that column appears between 9:00 am and 5:00 pm and has all disappeared by 6:00 pm – your histogram would be an accurate picture of the wrong data and could therefore lead to unsuitable execution plans during the day.

**Footnote:**

Approximate NDV for a column in a nutshell (for a full description see *Amit Poddar’s paper*): for each row, hash the column value to a 64-bit number and insert it into a hash table that can hold at most 16,384 entries. Each time the table fills, “split” it – discard every entry whose hash has the next high-order bit set (roughly halving the table) and, from then on, ignore any incoming hash value with that bit set. At the end of the scan the number of distinct values is approximately the number of entries left in the table multiplied by 2 raised to the power of the number of splits.
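The nutshell description can be turned into a toy simulation. This is an assumed sketch of the algorithm as described above (using SHA-256 as a stand-in for Oracle's internal hash function), not Oracle's implementation:

```python
# Toy sketch of approximate NDV: keep at most table_limit hash values;
# on overflow, split - keep only hashes whose top `splits` bits are zero,
# and ignore discarded hash classes from then on.
# Final estimate = (entries kept) * 2^splits.
import hashlib

def approx_ndv(values, table_limit=16384):
    table = set()
    splits = 0
    for v in values:
        # 64-bit hash derived from SHA-256 (stand-in for Oracle's hash).
        h = int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
        if splits and (h >> (64 - splits)):
            continue                       # this hash class was discarded earlier
        table.add(h)
        if len(table) > table_limit:       # table full: split again
            splits += 1
            table = {x for x in table if not (x >> (64 - splits))}
    return len(table) * 2 ** splits

print(approx_ndv(range(100)))              # 100: no split needed
print(approx_ndv(range(100_000), 1024))    # statistically close to 100,000
```

With a limit of 1,024 the second call has to split about seven times, yet the scaled-up count still lands within a few percent of the true 100,000 – which is the point of the technique: bounded memory, one pass, small statistical error.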
