The aim of this post is not to explain how the APPROX_MEDIAN function works (you find basic information in the documentation) but to show you the results of a test case I ran to assess how well it works.

Here’s what I did…

I started in the *Oracle Database Public Cloud* an instance of version 12.2.

Then I created a table with several numerical columns (the name of each column shows how many distinct values it contains), loaded 150 million rows into it (the size of the segment is 20 GB), and gathered the object statistics.

```
CREATE TABLE t AS
WITH
t1000 AS (SELECT /*+ materialize */ rownum AS n
FROM dual
CONNECT BY level <= 1E3)
SELECT /*+ monitor */ rownum AS id,
mod(rownum,2) AS n_2,
mod(rownum,4) AS n_4,
mod(rownum,8) AS n_8,
mod(rownum,16) AS n_16,
mod(rownum,32) AS n_32,
mod(rownum,64) AS n_64,
mod(rownum,128) AS n_128,
mod(rownum,256) AS n_256,
mod(rownum,512) AS n_512,
mod(rownum,1024) AS n_1024,
mod(rownum,2048) AS n_2048,
mod(rownum,4096) AS n_4096,
mod(rownum,8192) AS n_8192,
mod(rownum,16384) AS n_16384,
mod(rownum,32768) AS n_32768,
mod(rownum,65536) AS n_65536,
mod(rownum,131072) AS n_131072,
mod(rownum,262144) AS n_262144,
mod(rownum,524288) AS n_524288,
mod(rownum,1048576) AS n_1048576,
mod(rownum,2097152) AS n_2097152,
mod(rownum,4194304) AS n_4194304,
mod(rownum,8388608) AS n_8388608,
mod(rownum,16777216) AS n_16777216,
mod(rownum,33554432) AS n_33554432,
mod(rownum,67108864) AS n_67108864,
mod(rownum,134217728) AS n_134217728
FROM t1000, t1000, t1000
WHERE rownum <= 150E6;
execute dbms_stats.gather_table_stats(user,'t')
```

Once the test table was ready, I ran four queries for every column and measured the elapsed time, the maximum amount of PGA used by the query, and the precision of the result. Note that the test case was designed to avoid disk I/O operations. The buffer cache was large enough to store the whole test table and enough PGA was available to run the MEDIAN/APPROX_MEDIAN functions without having to spill data to a temporary tablespace. As a result, the processing was CPU bound. Also note that no parallel execution was used.

The following are the four test queries:

- The "regular" query using the MEDIAN function

```
SELECT median(n_m)
FROM t
```

- The MEDIAN function is replaced by the APPROX_MEDIAN function (non-deterministic processing)

```
SELECT approx_median(n_m)
FROM t
```

- The MEDIAN function is replaced by the APPROX_MEDIAN function (deterministic processing)

```
SELECT approx_median(n_m DETERMINISTIC)
FROM t
```

- Same approximate query as the previous one with, in addition, information about the error rate and its confidence

```
SELECT approx_median(n_m DETERMINISTIC),
approx_median(n_m DETERMINISTIC, 'ERROR_RATE'),
approx_median(n_m DETERMINISTIC, 'CONFIDENCE')
FROM t
```

Let’s have a look to three charts summarizing how well the APPROX_MEDIAN function works:

- The first chart shows that for both functions the elapsed time depends on the number of distinct values. That said, while for the non-deterministic processing with the APPROX_MEDIAN function the impact of the number of distinct values is minimal (the elapsed time goes between 24 and 33 seconds), for the deterministic processing with the APPROX_MEDIAN function and for the MEDIAN function it’s much higher (for the former, the elapsed time goes between 25 and 81 seconds; for the latter, the elapsed time goes between 97 and 182 seconds). Notice that I run the test a number of times and every time I got exactly the same results.

- The second chart (be careful that the y-axis scale is logarithmic) shows that for both functions the maximum amount of PGA doesn't depend on the number of distinct values. It is about 100MB for the MEDIAN function, and about 4MB for the APPROX_MEDIAN function.

- The aim of the third chart is threefold. First, it shows that the precision of the APPROX_MEDIAN function is better when it performs a deterministic processing. Second, that the precision of the APPROX_MEDIAN function is decent only when there are at least a dozen of distinct keys. Third, that the error rate returned by the APPROX_MEDIAN function isn't very reliable (note that for all values a confidence of 99% was given).

According to this test case, in my opinion, the performance and accuracy of the APPROX_MEDIAN function with numerical values (I still have to test other data types), provided that the number of distinct value isn't too low, are good. Only the error rate and confidence information aren't very reliable. It goes without saying that it would be interesting to run a similar test with different values/distributions to see whether the results would change.

- September 2018 (26)
- August 2018 (64)
- July 2018 (58)
- June 2018 (57)
- May 2018 (62)
- April 2018 (40)
- March 2018 (61)
- February 2018 (67)
- January 2018 (53)
- December 2017 (37)
- November 2017 (43)
- October 2017 (57)
- September 2017 (46)
- August 2017 (61)
- July 2017 (52)
- June 2017 (42)
- May 2017 (53)
- April 2017 (53)
- March 2017 (68)
- February 2017 (40)
- January 2017 (64)
- December 2016 (56)
- November 2016 (76)
- October 2016 (71)
- September 2016 (55)
- August 2016 (71)
- July 2016 (41)
- June 2016 (59)
- May 2016 (68)
- April 2016 (74)

## Recent comments

34 weeks 4 days ago

46 weeks 5 days ago

51 weeks 1 day ago

51 weeks 6 days ago

1 year 4 weeks ago

1 year 25 weeks ago

1 year 45 weeks ago

2 years 23 weeks ago

3 years 7 weeks ago

3 years 8 weeks ago