Search

Top 60 Oracle Blogs

Recent comments

Join Elimination bug

It is possible to take subquery factoring (common table expressions / CTEs) too far. The most important purpose of factoring is to make a complex query easier to understand – especially if you can identify a messy piece of text that is used in more than one part of the query – but I have seen a couple of patterns appearing that make the SQL harder to read.

  • In one pattern each table is given its own factored subquery holding the non-join predicates (and, possibly, filter subqueries) and then the main query is just a simple join of the factored subqueries with nothing but join (ON) predicates.
  • In another pattern each factored subquery consists of the previous subquery with one more table added to it, so every subquery is no more that a “two-table” query block and the final subquery is a simple “select from last_factored_subquery”.

Neither of these patterns is helpful – but today’s blog note is not going to be about going to extremes with subquery factoring; instead it’s an example of a fairly reasonable use of subquery factoring that ran into a “wrong results” bug.

Consider a system that collects data from some type of meters. Here’s a two-table definition for meters and meter readings:


create table meters (
        meter_id        number,
        meter_type      varchar2(10),  -- references meter_types
        date_installed  date,
        padding         varchar2(100),
        constraint met_pk primary key(meter_id)
)
;

create table meter_readings (
        meter_id        number,
        date_read       date,
        reading         number(10,3),
        padding         varchar2(100),
        constraint      mrd_pk primary key(meter_id, date_read),
        constraint      mrd_fk_met foreign key (meter_id) references meters
)
;

insert into meters
select
        1e6 + rownum,
        case mod(rownum,3)
                when 0 then 'A'
                when 1 then 'B'
                       else 'C'
        end,
        trunc(sysdate) - mod(rownum,5),
        rpad('x',100,'x')
from
        dual 
connect by 
        level <= 10
;

execute dbms_stats.gather_table_stats(null,'meters')

insert into meter_readings 
select
        met.meter_id,
        met.date_installed - v.id + 2,
        dbms_random.value,
        rpad('x',100,'x')
from
        meters met,
        (select rownum id from dual connect by level <= 4) v
;

commit;

execute dbms_stats.gather_table_stats(null,'meter_readings')

I’ve picked the obvious primary keys for the two tables and defined the appropriate referential integrity constraint – which means the optimzer should be able to choose the best possible strategies for any query that joins the two tables.

I’ve created a very small data set – a few meters installed in the last few days, and a few readings per meters over the last few days. So lets report the readings for the last 48 hours, and include in the output any meters that haven’t logged a reading in that interval.

Here’s the query I wrote, with its output, running on a 19.3 instance on 17th Feb 2020:


with mrd_cte as (
        select 
                meter_id, date_read, reading
        from 
                meter_readings
        where 
                date_read in (trunc(sysdate), trunc(sysdate)+1)
)
select
        met.meter_id, met.date_installed, mrd_cte.date_read, reading
from
        meters met
left join 
        mrd_cte
on      mrd_cte.meter_id = met.meter_id
;

  METER_ID DATE_INST DATE_READ    READING
---------- --------- --------- ----------
   1000001 16-FEB-20 17-FEB-20       .063
   1000002 15-FEB-20
   1000003 14-FEB-20
   1000004 13-FEB-20
   1000005 17-FEB-20 18-FEB-20        .37
   1000005 17-FEB-20 17-FEB-20       .824
   1000006 16-FEB-20 17-FEB-20       .069
   1000007 15-FEB-20
   1000008 14-FEB-20
   1000009 13-FEB-20
   1000010 17-FEB-20 17-FEB-20       .161
   1000010 17-FEB-20 18-FEB-20       .818

12 rows selected.

The query returns 12 rows – which SQL*Plus can report because it counts them as it fetches them so it can give you the total at the end of the query.

Of course, sometimes people write preliminary queries to find out how big the result set would be before they run the query to acquire the result set itself. In cases like that (where they’re just going to select a “count(*)” the optimizer may a choose different execution path from the base query – perhaps finding a way to do an index-only execution, and maybe eliminating a few table joins from the query. So let’s execute a count of the above query:


rem
rem     Script:         join_elimination_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem

with mrd_cte as (
        select 
                meter_id, date_read, reading
        from 
                meter_readings
        where 
                date_read in (trunc(sysdate), trunc(sysdate)+1)
)
select count(*) from (
        select
                met.meter_id, met.date_installed, mrd_cte.date_read, mrd_cte.reading
        from
                meters met
        left join 
                mrd_cte
        on      mrd_cte.meter_id = met.meter_id
)
;

  COUNT(*)
----------
        10

1 row selected.

You’ll have to take my word for it, of course, but no-one else was using this database while I was running this test, and no-one else has access to the schema I was using anyway. Unfortunately when I count the 12 rows instead of reporting them Oracle thinks there are only 10 rows. Oops!

Step 1 in investigating the problem – check the execution plans to see if there are any differences in the structure of the plan, the use of predicates, or the outline information. I won’t bother with the plan for the base query because it was very obvious from the count query where the problem lay.


-------------------------------------------------------------------
| Id  | Operation        | Name   | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT |        |       |     1 (100)|          |
|   1 |  SORT AGGREGATE  |        |     1 |            |          |
|   2 |   INDEX FULL SCAN| MET_PK |    10 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$69B21C86")
      ELIMINATE_JOIN(@"SEL$00F67CF8" "METER_READINGS"@"SEL$1")
      OUTLINE(@"SEL$00F67CF8")
      MERGE(@"SEL$C43CA2CA" >"SEL$2")
      OUTLINE(@"SEL$2")
      OUTLINE(@"SEL$C43CA2CA")
      MERGE(@"SEL$D28F6BD4" >"SEL$E6E74641")
      OUTLINE(@"SEL$E6E74641")
      ANSI_REARCH(@"SEL$4")
      OUTLINE(@"SEL$D28F6BD4")
      MERGE(@"SEL$1" >"SEL$006708EA")
      OUTLINE(@"SEL$4")
      OUTLINE(@"SEL$006708EA")
      ANSI_REARCH(@"SEL$3")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SEL$3")
      INDEX(@"SEL$69B21C86" "MET"@"SEL$3" ("METERS"."METER_ID"))
      END_OUTLINE_DATA
  */

This is the plan as pulled from memory by a call to dbms_xplan.display_cursor(). We note particularly the following: meter_readings doesn’t appear in the plan, there is no predicate section (and no asterisks against any of the operations that would tell us there ought to be some predicate information), and there’s a very revealing ELIMINATE_JOIN(@”SEL$00F67CF8″ “METER_READINGS”@”SEL$1”) in the outline information.

For some reason the optimizer has decided that it’s okay to remove meter_readings from the query (even though there may be many meter readings for each meter), so it was inevitable that it produced the wrong result.

Despite my opening note, this is not an issue with subquery factoring – it just looked that way when I started poking at the problem. In fact, if you rewrite the query using an inline view you get the same error, if you turn the inline view into a stored view you get the error, and if you turn the whole query into a simple (left) join with the date predicate as part of the ON clause you still get the error.

The problem lies somewhere in the join elimination transformation. If you go back to the outline information from the bad plan you’ll see the line: ELIMINATE_JOIN(@”SEL$00F67CF8″ “METER_READINGS”@”SEL$1”) – by changing this to NO_ELIMINATE_JOIN(…) and adding it to the main query block I got a suitable plan joining the two tables and producing the right result.

The problem appears in 12.2.0.1 and 19.5.0.0 (tested on livesql) – but does not appear in 12.1.0.2 or 11.2.0.4

There is a known bug associated with this problem:

Bug: 29182901
Abstract: WRONG COUNT WAS RETURNED WHEN _OPTIMIZER_ENHANCED_JOIN_ELIMINATION=TRUE
Query with Outer Join Returned a Wrong Result due to Join Elimination (Doc ID 29182901.8)

The bug is fixed in 20.1, with a fix that has been backported into the Jan 2020 patches for 19, 18, and 12.1