Join Elimination bug

It is possible to take subquery factoring (common table expressions / CTEs) too far. The most important purpose of factoring is to make a complex query easier to understand – especially if you can identify a messy piece of text that is used in more than one part of the query – but I have seen a couple of patterns appearing that make the SQL harder to read.

• In one pattern each table is given its own factored subquery holding the non-join predicates (and, possibly, filter subqueries) and then the main query is just a simple join of the factored subqueries with nothing but join (ON) predicates.
• In another pattern each factored subquery consists of the previous subquery with one more table added to it, so every subquery is no more than a “two-table” query block and the final query is a simple “select from last_factored_subquery”.

Neither of these patterns is helpful – but today’s blog note is not going to be about going to extremes with subquery factoring; instead it’s an example of a fairly reasonable use of subquery factoring that ran into a “wrong results” bug.

Consider a system that collects data from some type of meters. Here’s a two-table definition for meters and meter readings:

```
create table meters (
        meter_id        number,
        meter_type      varchar2(10),  -- references meter_types
        date_installed  date,
        constraint met_pk primary key(meter_id)
)
;

create table meter_readings (
        meter_id        number,
        date_read       date,
        reading         number,
        constraint      mrd_pk primary key(meter_id, date_read),
        constraint      mrd_fk_met foreign key (meter_id) references meters
)
;

insert into meters
select
        1e6 + rownum,
        case mod(rownum,3)
                when 0 then 'A'
                when 1 then 'B'
                else 'C'
        end,
        trunc(sysdate) - mod(rownum,5)
from
        dual
connect by
        level <= 10
;

execute dbms_stats.gather_table_stats(null,'meters')

insert into meter_readings
select
        met.meter_id,
        met.date_installed - v.id + 2,
        dbms_random.value
from
        meters met,
        (select rownum id from dual connect by level <= 4) v
;

commit;

execute dbms_stats.gather_table_stats(null,'meter_readings')

```

I’ve picked the obvious primary keys for the two tables and defined the appropriate referential integrity constraint – which means the optimizer should be able to choose the best possible strategies for any query that joins the two tables.
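As a quick reminder of why those constraints matter – here’s a sketch (my own example, not part of the original test) of a query where join elimination is perfectly legal: the foreign key (on a column that is not null, being part of the readings primary key) guarantees every meter_readings row has exactly one matching meters row, and the query selects no columns from meters, so the optimizer can safely drop the join entirely:

```
select
        mrd.meter_id, mrd.date_read, mrd.reading
from
        meter_readings mrd
join
        meters met
on      met.meter_id = mrd.meter_id
;
```

That transformation is only valid for an inner join, or an outer join that cannot increase the row count – which is exactly the check that goes wrong in the bug described below.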

I’ve created a very small data set – a few meters installed in the last few days, and a few readings per meter over the last few days. So let’s report the readings for the last 48 hours, and include in the output any meters that haven’t logged a reading in that interval.

Here’s the query I wrote, with its output, running on a 19.3 instance on 17th Feb 2020:

```
with mrd_cte as (
select
        meter_id, date_read, reading
from
        meter_readings
where
        date_read > trunc(sysdate) - 1
)
select
        met.meter_id, met.date_installed, mrd_cte.date_read, mrd_cte.reading
from
        meters met
left join
        mrd_cte
on      mrd_cte.meter_id = met.meter_id
;

  METER_ID DATE_INST DATE_READ    READING
---------- --------- --------- ----------
   1000001 16-FEB-20 17-FEB-20       .063
   1000002 15-FEB-20
   1000003 14-FEB-20
   1000004 13-FEB-20
   1000005 17-FEB-20 18-FEB-20        .37
   1000005 17-FEB-20 17-FEB-20       .824
   1000006 16-FEB-20 17-FEB-20       .069
   1000007 15-FEB-20
   1000008 14-FEB-20
   1000009 13-FEB-20
   1000010 17-FEB-20 17-FEB-20       .161
   1000010 17-FEB-20 18-FEB-20       .818

12 rows selected.

```

The query returns 12 rows – a total SQL*Plus can report because it counts the rows as it fetches them and gives you the total at the end of the query.

Of course, sometimes people write preliminary queries to find out how big the result set would be before they run the query to acquire the result set itself. In cases like that (where they’re just going to select a count(*)) the optimizer may choose a different execution path from that of the base query – perhaps finding a way to do an index-only execution, and maybe eliminating a few table joins from the query. So let’s execute a count of the above query:

```
rem
rem     Script:         join_elimination_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem

with mrd_cte as (
select
        meter_id, date_read, reading
from
        meter_readings
where
        date_read > trunc(sysdate) - 1
)
select count(*) from (
select
        met.meter_id, met.date_installed, mrd_cte.date_read, mrd_cte.reading
from
        meters met
left join
        mrd_cte
on      mrd_cte.meter_id = met.meter_id
)
;

  COUNT(*)
----------
        10

1 row selected.

```

You’ll have to take my word for it, of course, but no-one else was using this database while I was running this test, and no-one else has access to the schema I was using anyway. Unfortunately, when I count the 12 rows instead of reporting them, Oracle thinks there are only 10. Oops!

Step 1 in investigating the problem – check the execution plans to see if there are any differences in the structure of the plan, the use of predicates, or the outline information. I won’t bother with the plan for the base query because it was very obvious from the count query where the problem lay.
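For reference, the plan and outline below were pulled from memory with a call along the following lines – the ‘outline’ format option is what makes dbms_xplan report the Outline Data section:

```
select * from table(
        dbms_xplan.display_cursor(null, null, 'outline')
);
```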

```
-------------------------------------------------------------------
| Id  | Operation        | Name   | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT |        |       |     1 (100)|          |
|   1 |  SORT AGGREGATE  |        |     1 |            |          |
|   2 |   INDEX FULL SCAN| MET_PK |    10 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------

Outline Data
-------------
/*+
BEGIN_OUTLINE_DATA
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('19.1.0')
DB_VERSION('19.1.0')
ALL_ROWS
OUTLINE_LEAF(@"SEL$69B21C86")
ELIMINATE_JOIN(@"SEL$00F67CF8" "METER_READINGS"@"SEL$1")
OUTLINE(@"SEL$00F67CF8")
MERGE(@"SEL$C43CA2CA" >"SEL$2")
OUTLINE(@"SEL$2")
OUTLINE(@"SEL$C43CA2CA")
MERGE(@"SEL$D28F6BD4" >"SEL$E6E74641")
OUTLINE(@"SEL$E6E74641")
ANSI_REARCH(@"SEL$4")
OUTLINE(@"SEL$D28F6BD4")
MERGE(@"SEL$1" >"SEL$006708EA")
OUTLINE(@"SEL$4")
OUTLINE(@"SEL$006708EA")
ANSI_REARCH(@"SEL$3")
OUTLINE(@"SEL$1")
OUTLINE(@"SEL$3")
INDEX(@"SEL$69B21C86" "MET"@"SEL$3" ("METERS"."METER_ID"))
END_OUTLINE_DATA
*/

```

This is the plan as pulled from memory by a call to dbms_xplan.display_cursor(). We note particularly the following: meter_readings doesn’t appear in the plan, there is no predicate section (and no asterisks against any of the operations to tell us there ought to be some predicate information), and there’s a very revealing ELIMINATE_JOIN(@"SEL$00F67CF8" "METER_READINGS"@"SEL$1") in the outline information.

For some reason the optimizer has decided that it’s okay to remove meter_readings from the query (even though there may be many meter readings for each meter), so it was inevitable that it produced the wrong result.

Despite my opening note, this is not an issue with subquery factoring – it just looked that way when I started poking at the problem. In fact, if you rewrite the query using an inline view you get the same error, if you turn the inline view into a stored view you get the error, and if you turn the whole query into a simple (left) join with the date predicate as part of the ON clause you still get the error.
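For example, the last of those rewrites – a sketch of the simple left join, with the CTE’s date filter (a stand-in predicate here, matching the one I used above) moved into the ON clause – shows the same wrong count:

```
select
        met.meter_id, met.date_installed, mrd.date_read, mrd.reading
from
        meters met
left join
        meter_readings mrd
on      mrd.meter_id = met.meter_id
and     mrd.date_read > trunc(sysdate) - 1   -- stand-in for the 48-hour filter
;
```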

The problem lies somewhere in the join elimination transformation. If you go back to the outline information from the bad plan you’ll see the line ELIMINATE_JOIN(@"SEL$00F67CF8" "METER_READINGS"@"SEL$1") – by changing this to NO_ELIMINATE_JOIN(…) and adding it to the main query block I got a suitable plan that joined the two tables and produced the right result.
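In SQL terms the workaround looks something like this – a sketch only: the query block and alias names come from the hint reported in the outline, and the date predicate is just a stand-in for the 48-hour filter:

```
with mrd_cte as (
select
        meter_id, date_read, reading
from
        meter_readings
where
        date_read > trunc(sysdate) - 1     -- stand-in for the 48-hour filter
)
select
        /*+ no_eliminate_join(@"SEL$00F67CF8" "METER_READINGS"@"SEL$1") */
        met.meter_id, met.date_installed, mrd_cte.date_read, mrd_cte.reading
from
        meters met
left join
        mrd_cte
on      mrd_cte.meter_id = met.meter_id
;
```

With the hint in place the optimizer has to keep the join, and the results come back correct.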

The problem appears in 12.2.0.1 and 19.5.0.0 (tested on livesql) – but does not appear in 12.1.0.2 or 11.2.0.4.

There is a known bug associated with this problem:

Bug: 29182901
Abstract: WRONG COUNT WAS RETURNED WHEN _OPTIMIZER_ENHANCED_JOIN_ELIMINATION=TRUE
Query with Outer Join Returned a Wrong Result due to Join Elimination (Doc ID 29182901.8)

The bug is fixed in 20.1, and the fix has been backported into the Jan 2020 patches for 19, 18, and 12.1.