I’ve just been prompted to complete and publish a draft I started a few years ago. It’s (ultimately) about a feature that appeared in 9i but doesn’t seem to show up very often at client sites or as a common solution to performance problems on the various Oracle forums – but maybe that’s not surprising given how slowly analytic functions have been taken up.
I want to work towards the feature by starting with a requirement, then examine several solutions. To supply a touch of realism I’ll create an orders table, which holds a customer id and an order date (including time), and then ask for a report of the most recent order for each customer. Here’s some starting data:
rem
rem     Script:         order_sample.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2006
rem     Purpose:
rem
rem     Last tested
rem             19.3.0.0
rem             12.2.0.1
rem             12.1.0.0        Costs are consistent
rem             11.2.0.4        Costs become consistent by 11.2.0.3
rem             11.1.0.7
rem             10.2.0.3
rem              9.2.0.8
rem

create table orders
as
with generator as (
        select
                rownum id
        from    dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        rownum                                                                  id,
        mod(rownum-1,200)                                                       customer,
        sysdate - 20 + dbms_random.value(0,20)                                  date_ordered,
        rpad('x' || to_char(trunc(dbms_random.value(0,1000)),'FM009'),100)      padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid wordpress format issue
;

alter table orders modify customer not null;
alter table orders modify date_ordered not null;
alter table orders add constraint ord_pk primary key(id);

create index ord_cus on orders(customer);
-- create unique index ord_cus_date on orders(customer, date_ordered);

begin
        dbms_stats.gather_table_stats(
                ownname         => user,
                tabname         => 'orders',
                method_opt      => 'for all columns size 1',
                cascade         => true
        );
end;
/
I’ve got 200 customers, at 50 orders per customer dating back over the last 20 days. There’s a primary key on the table and (as it stands) an obvious “foreign key index” on the customer column, though I’ve allowed for changing this to a more useful (customer, date_ordered) combination which I’ve decided could be declared as unique.
With this data, how do I report “the most recent order for each customer”? The first question to ask in response to this request is: “do you literally mean ‘THE’ most recent; what if the customer has placed two or more orders on the same day or, in my initial case, at the same time?” There’s a special case to think about the moment you start to turn the natural language request into a formal language specification.
In this case I’m going to run with the “customer-only” index and allow for the fact that two or more orders could be placed at the same time by the same customer, and report both (all) of them if any such simultaneously placed orders appear.
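As a quick sanity check – just a sketch you could run against the sample data – the following query reports any (customer, date_ordered) combinations that appear more than once, i.e. the “simultaneous order” special case:

select  customer, date_ordered, count(*)
from    orders
group by
        customer, date_ordered
having  count(*) > 1
;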
Start with a list showing the most recent order date for each customer and then report all orders that we can identify using that list of (customer, date_ordered). To do that I’ll start with a simple aggregate query and use the result it produces in an “IN” subquery:
prompt  =========================
prompt  Use IN subquery for max()
prompt  =========================

select  /*+ qb_name(main) */
        ord1.*
from    orders ord1
where   (ord1.customer, ord1.date_ordered) in (
                select  /*+ qb_name(subq) */
                        ord2.customer, max(ord2.date_ordered)
                from    orders ord2
                group by
                        ord2.customer
        )
order by
        ord1.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

Plan hash value: 1500776991

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |      1 |        |    54 (100)|    200 |00:00:00.01 |     344 |       |       |          |
|   1 |  SORT ORDER BY         |          |      1 |      1 |    54  (15)|    200 |00:00:00.01 |     344 |   115K|   115K|  102K (0)|
|*  2 |   HASH JOIN RIGHT SEMI |          |      1 |      1 |    53  (14)|    200 |00:00:00.01 |     344 |  1695K|  1695K| 1568K (0)|
|   3 |    VIEW                | VW_NSO_1 |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |       |       |          |
|   4 |     HASH GROUP BY      |          |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |  1484K|  1484K| 1421K (0)|
|   5 |      TABLE ACCESS FULL | ORDERS   |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
|   6 |    TABLE ACCESS FULL   | ORDERS   |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ORD1"."CUSTOMER"="CUSTOMER" AND "ORD1"."DATE_ORDERED"="MAX(ORD2.DATE_ORDERED)")
I’ve included the qb_name() hint in both query blocks here – it’s always a good idea as it gives you a little extra edge in interpreting the execution plan when the queries get more complicated.
The first thing you’ll notice about the resulting execution plan is that the optimizer has “unnested” the subquery to create an inline view (which it has named VW_NSO_1) and then used a simple join to get the final result. That’s an interesting observation, and it’s something that will often happen with an “IN” subquery – and that brings us to strategy 2.
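As a quick aside: if you want to see what the plan looks like when that transformation doesn’t happen, a simple sketch is to block it with the no_unnest hint in the subquery and compare plans – you would then expect to see a FILTER operation running the subquery once per driving row (or per distinct customer, thanks to subquery caching) instead of the hash join:

select  /*+ qb_name(main) */
        ord1.*
from    orders ord1
where   (ord1.customer, ord1.date_ordered) in (
                select  /*+ qb_name(subq) no_unnest */
                        ord2.customer, max(ord2.date_ordered)
                from    orders ord2
                group by
                        ord2.customer
        )
order by
        ord1.customer
;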
Some people will take as gospel the claim that the optimizer “cannot handle subqueries efficiently” and will prefer to write their own inline views (possibly using the “WITH subquery” a.k.a. “Common Table Expression (CTE)” mechanism). There will be occasions, even in the latest versions of Oracle, where you may need to do this but there will also be occasions where the optimizer hasn’t done it because it would produce the wrong results – and I have seen a couple of accidents go into production code where this variant has been written incorrectly.
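For reference, here’s a sketch of what the CTE variant of the next query might look like (the name max_orders is just my choice; it should behave exactly like the inline view below):

with max_orders as (
        select  /*+ qb_name(cte) */
                ord2.customer, max(ord2.date_ordered) date_ordered
        from    orders ord2
        group by
                ord2.customer
)
select  /*+ qb_name(main) */
        ord1.*
from    max_orders      mo,
        orders          ord1
where   ord1.customer = mo.customer
and     ord1.date_ordered = mo.date_ordered
order by
        ord1.customer
;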
prompt  ==============================
prompt  Introduce max() as inline view
prompt  ==============================

select  /*+ qb_name(main) */
        ord1.*
from    (
        select  /*+ qb_name(in_line) */
                ord2.customer, max(ord2.date_ordered) date_ordered
        from    orders ord2
        group by
                ord2.customer
        )       ordv,
        orders  ord1
where   ord1.customer = ordv.customer
and     ord1.date_ordered = ordv.date_ordered
order by
        ord1.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

Plan hash value: 2750501889

----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |      1 |        |    54 (100)|    200 |00:00:00.01 |     344 |       |       |          |
|   1 |  SORT ORDER BY        |        |      1 |    200 |    54  (15)|    200 |00:00:00.01 |     344 |   115K|   115K|  102K (0)|
|*  2 |   HASH JOIN           |        |      1 |    200 |    53  (14)|    200 |00:00:00.01 |     344 |  1695K|  1695K| 1531K (0)|
|   3 |    VIEW               |        |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |       |       |          |
|   4 |     HASH GROUP BY     |        |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |  1484K|  1484K| 1413K (0)|
|   5 |      TABLE ACCESS FULL| ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
|   6 |    TABLE ACCESS FULL  | ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ORD1"."CUSTOMER"="ORDV"."CUSTOMER" AND "ORD1"."DATE_ORDERED"="ORDV"."DATE_ORDERED")
You’ll notice, of course, the remarkable similarity between the previous plan and this one – the only significant difference being that the optimizer used a “plain” hash join here rather than the “hash join right semi” that appeared in the previous plan. The “right semi” is an indication that the optimizer first transformed your “IN” subquery into an equivalent “EXISTS” (“=ANY”) subquery. Don’t be misled by the “right”, by the way; this isn’t indicating any sort of outer join, it’s just trying to let you know which table is the one where Oracle should stop its probe after finding the first matching row. It is, however, unfortunate that it gets a little awkward trying to sort out left from right when Oracle can do a “swap join inputs” on you.
It would have been nice if the VIEW operation had reported the name of my inline view (to correspond to the generated VW_NSO_1 view name in the previous plan) – but if you include the ‘alias’ formatting option in the call to display_cursor() it will report the alias ordv@main at operation 3.
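For example, a sketch of the call with that extra formatting option included:

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));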
For strategy 3 we might have decided to check every row in the table to see if the date in that row is the most recent date for the customer in that row – which we can do by running a correlated subquery to do the check for each row.
prompt  ========================================
prompt  Orders with correlated EQUALITY subquery
prompt  ========================================

select  /*+ qb_name(main) */
        ord1.*
from    orders ord1
where   ord1.date_ordered = (
                select  /*+ qb_name(subq) */
                        max(ord2.date_ordered)
                from    orders ord2
                where   ord2.customer = ord1.customer
        )
order by
        ord1.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

Plan hash value: 1152467146

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |      1 |        |    54 (100)|    200 |00:00:00.01 |     344 |       |       |          |
|   1 |  SORT ORDER BY        |         |      1 |    200 |    54  (15)|    200 |00:00:00.01 |     344 |   115K|   115K|  102K (0)|
|*  2 |   HASH JOIN           |         |      1 |    200 |    53  (14)|    200 |00:00:00.01 |     344 |  1695K|  1695K| 1622K (0)|
|   3 |    VIEW               | VW_SQ_1 |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |       |       |          |
|   4 |     HASH GROUP BY     |         |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |  1484K|  1484K| 1435K (0)|
|   5 |      TABLE ACCESS FULL| ORDERS  |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
|   6 |    TABLE ACCESS FULL  | ORDERS  |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("ORD1"."DATE_ORDERED"="MAX(ORD2.DATE_ORDERED)" AND "ITEM_1"="ORD1"."CUSTOMER")
Yet again we end up with the same execution plan (barring the “right semi” issue) but with a different generated name for the unnested subquery. This is an interesting facet of Oracle (and SQL in general) – completely different ways of stating a requirement can end up doing the same work in the same way.
An important corollary to this observation is that the first thing you should do when you start writing an SQL statement is to write it in a way that clearly expresses the requirement and is easy for others to comprehend. Don’t (at the first stage) try to do anything clever because (a) you may do it wrong and (b) the optimizer might have taken your clear, simple code and done the clever bit behind the scenes for you.
However, we may have to move on to something new(-ish) and exciting.
An “obvious” defect in the three plans so far is that we have to visit the orders table twice. Is there a way we can avoid doing this? The answer is yes. Oracle 8.1.6 gave us the analytic functions:
prompt  =======================
prompt  Analytic max() function
prompt  =======================

column  padding         noprint
column  date_ordered    noprint

select  /*+ qb_name(main) */
        ordv.*
from    (
        select  /*+ qb_name(inline) */
                customer, id, date_ordered, padding,
                max(date_ordered) over (
                        partition by customer
                )       max_date
        from    orders ord2
        )       ordv
where   ordv.date_ordered = ordv.max_date
order by
        ordv.customer
;

select * from table(dbms_xplan.display_cursor(null,null,'cost outline allstats last'));

Plan hash value: 813391662

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |      1 |        |   262 (100)|    200 |00:00:00.01 |     172 |       |       |          |
|*  1 |  VIEW               |        |      1 |  10000 |   262   (3)|    200 |00:00:00.01 |     172 |       |       |          |
|   2 |   WINDOW SORT       |        |      1 |  10000 |   262   (3)|  10000 |00:00:00.01 |     172 |  1612K|   624K| 1432K (0)|
|   3 |    TABLE ACCESS FULL| ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("ORDV"."DATE_ORDERED"="ORDV"."MAX_DATE")
By adding the analytic max() function I can acquire the necessary “raw” data once and post-process it to find the max(date_ordered) for each customer before discarding all the rows where the row’s date doesn’t match the maximum date. The expression “max(date_ordered) over (partition by customer)” behaves like a virtual column that tells Oracle to partition the data by customer and find the maximum date within each customer. Imagine copying the original data into a spreadsheet, sorting it by customer, then using one of the spreadsheet functions to add an extra column that derives its value by looking at the rows that have the same customer as the current row – and you’ve got an exact replica of what Oracle is doing here.
So we’ve managed to produce the same result with a single tablescan of orders instead of the two tablescans we saw in every other plan. But there’s a drawback – to be able to partition by customer Oracle has had to fetch every row and column we’re interested in and sort the data before deriving values for the new column: the cost of this plan (262) is much higher than the cost of the plan (54) we got from the previous three queries.
In this case the variation in actual run-time for the two different plans was undetectable – and insignificant compared to the time spent getting the result set to the terminal and displaying it. In general, though, you need to consider the trade-off between the sorting that goes into the use of analytic functions and the “double visit” work of using subqueries.
There is (at least) one more possibility that I’ve used in the past when the data structure has allowed it to produce the right answers; and it’s the one that is the ultimate target of this blog. Consider the following SQL:
select
        customer,
        max(id)                 keep (dense_rank last order by date_ordered) as max_id,
        max(date_ordered)       keep (dense_rank last order by date_ordered) as max_date,
--      max(padding)            keep (dense_rank last order by date_ordered) as max_padding
        trim(
                max(padding) keep (dense_rank last order by date_ordered)
        )                       as max_padding
from
        orders
group by
        customer
;
(The trim() function on the padding column doesn’t change the fundamental behaviour of this query, it’s just there to avoid line-wrapping on my output.)
I’ve written a query that does an aggregate on customer – so “customer, max() … group by customer” – but it’s a variant of the max() function based on “keep (dense_rank last order by …)” rather than the more familiar analytic “over (partition by … order by …)” form.
Because of the “group by customer” the max() function is applied for each customer (i.e. behaving like “over (partition by customer)”), but we’re not simply looking for the maximum value of the referenced column: we first order by date_ordered (within customer) and apply the dense_rank mechanism, keep only the rows that have the highest (last) dense_rank, and then take the maximum of that subset of the data.
Here’s an example applying the combination of mechanisms to a tiny data set:
Raw data
========

   N1  V1
-----  ---
   93  'C'
   54  'Q'
   43  'A'
   13  'X'
   93  'G'
   54  'W'

Ordered by N1 and dense_rank() appended
=======================================

   N1  V1   dr()
-----  ---  ----
   13  'X'     1
   43  'A'     2
   54  'W'     3
   54  'Q'     3
   93  'G'     4
   93  'C'     4

Keep (dense rank last)
======================

   N1  V1   dr()
-----  ---  ----
   93  'G'     4
   93  'C'     4

max(v1) keep (dense rank last order by n1)
==========================================

V1
---
'G'
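If you want to recreate this walk-through, here’s a minimal sketch that generates the six rows on the fly (the name t is just my choice), first appending the dense_rank() and then applying the keep mechanism:

with t as (
        select 93 n1, 'C' v1 from dual union all
        select 54, 'Q'       from dual union all
        select 43, 'A'       from dual union all
        select 13, 'X'       from dual union all
        select 93, 'G'       from dual union all
        select 54, 'W'       from dual
)
select  n1, v1,
        dense_rank() over (order by n1) drnk
from    t
order by
        n1
;

with t as (
        select 93 n1, 'C' v1 from dual union all
        select 54, 'Q'       from dual union all
        select 43, 'A'       from dual union all
        select 13, 'X'       from dual union all
        select 93, 'G'       from dual union all
        select 54, 'W'       from dual
)
select  max(v1) keep (dense_rank last order by n1) v1
from    t
;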
In this tiny example we had cases where there were multiple rows for some of the rankings; but if we go back to our orders table and guarantee (by adding a unique constraint) that a customer will never have more than one row for any one value of date_ordered, then the expression max(id) keep (dense_rank last order by date_ordered) for any one customer will be the id of the row that has the maximum order date for that customer – and, similarly, max(date_ordered) keep (…) and max(padding) keep (…) will also be the values from that same row.
Given the (absolutely critical) uniqueness constraint, we can get all the data for the most recent order for each customer using this dense_rank() strategy.
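As a sketch of the declaration – echoing the index that was left commented out in the setup script, and dropping the single-column index it supersedes:

drop index ord_cus;
create unique index ord_cus_date on orders(customer, date_ordered);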
The question, of course, is why would we do something that may not be entirely intuitive and looks as if it could make Oracle do a lot of extra work to get the result. Here’s the answer – which is just the execution plan for the query on my orders table – with the unique constraint added:
-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |    28 (100)|    200 |00:00:00.01 |     172 |       |       |          |
|   1 |  SORT GROUP BY     |        |      1 |    200 |    28  (18)|    200 |00:00:00.01 |     172 |   142K|   142K|  126K (0)|
|   2 |   TABLE ACCESS FULL| ORDERS |      1 |  10000 |    24   (5)|  10000 |00:00:00.01 |     172 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------
The path uses a basic SORT GROUP BY that “sorts” only 200 rows (A-Rows) using only 126KB of memory. Compare that with the plan for the analytic max() over() in strategy 4 that takes 1.6MB of memory and sorts 10,000 rows, and you’ll appreciate that the keep (dense_rank last) mechanism is doing something much more efficient. For cases where the drop from “num_rows” to “num_distinct” for the aggregating column(s) is large, the benefit of using the somewhat non-intuitive dense_rank() approach may make a significant difference to memory, CPU, and even (if it avoids a spill to disk) I/O.
Apart from this form, in which dense_rank appears inside the KEEP LAST (and FIRST) mechanism, there are two major variations on how you can use the dense_rank() function – the aggregate and the analytic forms.
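As a sketch of the analytic form – which, unlike keep (dense_rank last), doesn’t depend on the uniqueness constraint and would report all rows that tie for the most recent date – you might write something like:

select  customer, id, date_ordered
from    (
        select
                ord2.*,
                dense_rank() over (
                        partition by customer
                        order by date_ordered desc
                )       drnk
        from    orders ord2
        )
where   drnk = 1
order by
        customer
;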
Remember the absolutely critical point that the “keep dense_rank last” strategy is only correct if there is a suitable unique constraint on the data set viz: unique on ({group by column(s)},{order by column(s)}).
There is another major option for getting the same “most recent” rows, which is to use the match_recognize() functionality, but I think I probably wrote this draft before the mechanism even existed – so it’s left as an exercise to the reader to work it out. A key reason why I’m not going to do it myself is that (like the analytic over() in strategy 4) it will require all 10,000 rows to be sorted, and is therefore likely to be less efficient than strategy 5.
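For anyone who wants a starting point for that exercise, here’s a rough sketch (not tested against the data above) of the shape such a match_recognize() query might take – like the keep (dense_rank last) approach it reports just one row per customer, so it also depends on the uniqueness guarantee:

select  customer, max_id, max_date
from    orders
match_recognize (
        partition by customer
        order by date_ordered
        measures
                last(id)                as max_id,
                last(date_ordered)      as max_date
        one row per match
        pattern (a+)
        define a as (1 = 1)     -- always true: a+ matches every row in the partition
)
order by
        customer
;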
Finally – I thought I’d written a note explaining why a “sort group by” can use far less memory and CPU than a basic “sort order by”, but if I have it wasn’t on this blog. I do have a note on how the mechanism for handling “rownum <= N” with a preceding “order by” minimises its use of memory, and that note should give you some idea of what the “sort group by” is doing to minimise memory usage. I’ll try to write a little note on the aggregate mechanism some time over the next few days.