Search

OakieTags

Who's online

There are currently 0 users and 42 guests online.

Recent comments

CBO++

While browsing the web recently for articles on the HyperLogLog algorithm that Oracle uses for some of its approximate functions, I came upon a blog post written in Jan 2014 with the title Use Subqueries to Count Distinct 50X Faster. There are various ways that subqueries can be used to rewrite queries for improved performance, but when the title caught my eye I couldn’t think of a way in which they could improve “count distinct”.  It turned out that the word “subquery” was being used (quite correctly) in the sense of “inline view” while my mind had immediately turned to subqueries in the select list or where clause.

The article started by pointing out that if you have a query that does a join then aggregates the result you might be able to improve performance by finding a way of rewriting the query to aggregate before doing the join. (See this note from 2008). The article then went one step further to optimise a “count distinct” by wrapping a “select count” around a “select distinct” inline view as follows:

Original
--------
  select
    dashboard_id,
    count(distinct user_id) as ct
  from time_on_site_logs 
  group by dashboard_id

Rewrite
-------
select 
    inline.dashboard_id, 
    count(1) as ct
  from (
    select distinct dashboard_id, user_id
    from time_on_site_logs
  ) as inline
  group by inline.dashboard_id

(I’ve reproduced only the central part of the query being examined and I’ve changed the name of the inline view to eliminate the potential visual confusion due to the word “distinct” appearing in its name in the original).

The article was written using the Postgres SQL with the comment that the technique was universal; and this brings me to the point of the post. The technique can be applied to Oracle’s dialect of SQL. Both ideas are good ideas whose effectiveness depends on the data patterns, data volume, and (potentially) indexing; but you may not need to rewrite the code because the optimizer is programmed to know that the ideas are good and it can transform your query to the appropriate form internally. The “place group by” transformation appeared in 11.1.0.6 in 2007, and the “transform distinct aggregation” appeared in 11.2.0.1 in 2009.

Here’s a litte demo of Oracle handling a variation of the query I’ve shown above:


rem     Script: transform_distinct_agg.sql
rem     Dated:  May 2016
rem     Author: J.P.Lewis

create table t1 nologging 
as 
select  * 
from    all_objects 
where   rownum <= 60000
;
execute dbms_stats.gather_table_stats(user,'t1', method_opt=>'for all columns size 1')

alter session set statistics_level = all;

select owner, count(distinct object_type) from t1 group by owner;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));

prompt  ===============
prompt  Rewritten query
prompt  ===============

select  owner, count(1)
from    (
         select distinct owner, object_type
         from   t1
        ) distinct_types
group by
        owner
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));

Here are the two execution plans, pulled from memory – with the outline and some other peripheral lines deleted:


-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |      1 |        |      5 |00:00:00.23 |     865 |       |       |          |
|   1 |  HASH GROUP BY       |           |      1 |      5 |      5 |00:00:00.23 |     865 |  1452K|  1452K|  728K (0)|
|   2 |   VIEW               | VM_NWVW_1 |      1 |     78 |     30 |00:00:00.23 |     865 |       |       |          |
|   3 |    HASH GROUP BY     |           |      1 |     78 |     30 |00:00:00.23 |     865 |  4588K|  1708K| 2497K (0)|
|   4 |     TABLE ACCESS FULL| T1        |      1 |  60000 |  60000 |00:00:00.12 |     865 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------

===============
Rewritten query
===============

------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |      5 |00:00:00.23 |     865 |       |       |          |
|   1 |  HASH GROUP BY       |      |      1 |      5 |      5 |00:00:00.23 |     865 |  1452K|  1452K|  735K (0)|
|   2 |   VIEW               |      |      1 |     78 |     30 |00:00:00.23 |     865 |       |       |          |
|   3 |    HASH UNIQUE       |      |      1 |     78 |     30 |00:00:00.23 |     865 |  4588K|  1708K| 1345K (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |  60000 |  60000 |00:00:00.12 |     865 |       |       |          |
------------------------------------------------------------------------------------------------------------------

Apart from the change from “HASH UNIQUE” to “HASH GROUP BY” the two plans are the same, using the same resources – the UNIQUE being a special case of the algorithm for the GROUP BY. Here (with some cosmetic editing) is the SQL of the “unparsed query” taken from the 10053 (CBO) trace file – notice how similar it is to the text suggested by the original article, in particular the inline view to get the distinct list of owner and object_type (using a group by with no aggregated columns, rather than a distinct):

SELECT 
        VM_NWVW_1.$vm_col_2 OWNER,
        COUNT(VM_NWVW_1.$vm_col_1) COUNT(DISTINCTOBJECT_TYPE)
FROM    (
                SELECT
                        T1.OBJECT_TYPE $vm_col_1,
                        T1.OWNER $vm_col_2
                FROM    TEST_USER.T1 T1
                GROUP BY 
                        T1.OWNER,T1.OBJECT_TYPE
        ) VM_NWVW_1
GROUP BY
        VM_NWVW_1.$vm_col_2
;

The Oracle optimizer is pretty good at finding efficient transformations for the query you wrote so, rather than rewriting a query (with the option for making a mistake as you do so), you may only need to add a couple of hints to generate a suitable SQL Plan Baseline that you can attach to the original query.

Footnote:

Sometimes the optimizer will decide not to transform when it should, or decide to transform when it shouldn’t, so it’s nice to know that there are hints to block transformations – here’s the effect of adding /*+ qb_name(main) no_transform_distinct_agg(main) */ to my query:


----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      5 |00:00:00.25 |     865 |       |       |          |
|   1 |  SORT GROUP BY     |      |      1 |      5 |      5 |00:00:00.25 |     865 |  4096 |  4096 | 4096  (0)|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  60000 |  60000 |00:00:00.12 |     865 |       |       |          |
----------------------------------------------------------------------------------------------------------------

The interesting thing to note here is that even though the query took a little longer to complete the amount of memory allocated to run the query in memory was only 4K compared to the 2M needed by the transformed query (In this example both workareas would have been in existence at the same time – that won’t be true of every query using multiple workareas.) This isn’t significant in this trivial case, but it demonstrates the point that sometimes there is no one best path – you can choose the path that protects the resource that’s under most pressure.