Partition-wise join

So just what is a “partition-wise” join ?  We will use a metaphor to hopefully Smile explain the benefit.

image

Let’s say two people, Logan and Shannon, decide to move in together.  If each of them already have an existing residence, they will both have a lot of the common items that you find in any household.  So they have a decision to make – do they keep two of everything, or do they have a bit of a “cull” of things that they have in common.  In this imaginary scenario, we will focus on household items in the bathroom and the kitchen.  Logan grabs a set of kitchen knives a knife block, calls Shannon and asks: “Hey Shannon, do you already have a knife block?”

What do you think Shannon will do ? Search the entire house for an existing knife block ?  Of course not.  If there is a knife block, then the only place it will be located will be in the kitchen.  In fact, when matching up the items throughout the house, Shannon and Logan will restrict their investigation to the room that makes sense for the item in question.  That is just common sense – why would anyone search in the bathroom for (say) forks and spoons ?  It would just be a waste of effort.

(Editors Note:  Anyone with young children will of course dispute this metaphor, stating quite correctly that you can probably find every possible household item in every possible room, and probably outside as well Smile but we’ll omit that possibility for the sake of this discussion)

image

And that is exactly what a partition-wise join enables us to do in the database.  If two tables are partitioned with the same definition, and we are joining on the partition key, then that definition guarantees that for a row in one table with partition key “K” and hence partition “P”, we only need to seek that row in the same partition in the table we are joining to (where “same” is based on the partitioning definition).  It is the partitioning equivalent of “only searching in the kitchen and not the bathroom”.  We can see this via the execution plan when doing such a join.  Let’s create two tables with equal partition definitions and then join on the partition key.


SQL> --
SQL> -- Example 1
SQL> --
SQL>
SQL> drop table t1 purge;

Table dropped.

SQL> drop table t2 purge;

Table dropped.

SQL>
SQL> create table t1 ( x int, y int )
  2  partition by range ( x )
  3  (
  4  partition p_kitchen values less than (10000),
  5  partition p_bathroom values less than (20000),
  6  partition p_dining values less than (30000)
  7  );

Table created.

SQL>
SQL>
SQL> create table t2 ( x int, y int )
  2  partition by range ( x )
  3  (
  4  partition p_kitchen values less than (10000),
  5  partition p_bathroom values less than (20000),
  6  partition p_dining values less than (30000)
  7  );

Table created.

SQL>
SQL>
SQL> insert into t1 select rownum, rownum from dual connect by level < 30000;

29999 rows created.

SQL> insert into t2 select * from t1;

29999 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats('','t1')

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats('','t2')

PL/SQL procedure successfully completed.

SQL>
SQL> --
SQL> set autotrace traceonly explain
SQL> select count(t1.y), count(t2.y)
  2  from t1,t2
  3  where t1.x = t2.x;

Execution Plan
----------------------------------------------------------
Plan hash value: 3155849676

---------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     1 |    20 |  1641   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE      |      |     1 |    20 |            |          |       |       |
|   2 |   PARTITION RANGE ALL|      | 29999 |   585K|  1641   (1)| 00:00:01 |     1 |     3 |
|*  3 |    HASH JOIN         |      | 29999 |   585K|  1641   (1)| 00:00:01 |       |       |
|   4 |     TABLE ACCESS FULL| T1   | 29999 |   292K|   820   (1)| 00:00:01 |     1 |     3 |
|   5 |     TABLE ACCESS FULL| T2   | 29999 |   292K|   820   (1)| 00:00:01 |     1 |     3 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T1"."X"="T2"."X")

SQL> set autotrace off
SQL>

The key part of the execution plan here is that the HASH JOIN is occurring within (or “under”) the PARTITION RANGE ALL iteration.  This can be interpreted as: “Start with the first partition in each table, do a hash join on that partition.  Then move onto the next partition; do a hash join on that partition”, and so on.  This is efficient on resources because at no point are we trying (and obviously failing) to join a row from table T1 partition P_KITCHEN to table T2 partition P_BATHROOM or P_DINING.  Each hash join is a smaller operation and hence also more likely to be completed in the available PGA allocation for that session.  Also, when it comes to running such a query in parallel, then each parallel slave can tackle the job of handling a partition in isolation to the other slaves.

If the partitions do not align (see the Editors note above Smile), then our join will not be as efficient.


SQL> --
SQL> -- Example 2
SQL> --
SQL>
SQL>
SQL> drop table t2 purge;

Table dropped.

SQL> create table t2 ( x int, y int )
  2  partition by range ( x )
  3  (
  4  partition p1 values less than (15000),
  5  partition p3 values less than (30000)
  6  );

Table created.

SQL>
SQL> --
SQL> -- all partitions do NOT align, so we do NOT see partition-wise join
SQL> --
SQL>
SQL> insert into t2 select * from t1;

29999 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats('','t2')

PL/SQL procedure successfully completed.

SQL> set autotrace traceonly explain
SQL> select count(t1.y), count(t2.y)
  2  from t1,t2
  3  where t1.x = t2.x;

Execution Plan
----------------------------------------------------------
Plan hash value: 666786458

---------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |     1 |    20 |  1369   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE               |         |     1 |    20 |            |          |       |       |
|*  2 |   HASH JOIN                   |         | 29999 |   585K|  1369   (1)| 00:00:01 |       |       |
|   3 |    PART JOIN FILTER CREATE    | :BF0000 | 29999 |   585K|  1369   (1)| 00:00:01 |       |       |
|   4 |     PARTITION RANGE ALL       |         | 29999 |   292K|   820   (1)| 00:00:01 |     1 |     3 |
|   5 |      TABLE ACCESS FULL        | T1      | 29999 |   292K|   820   (1)| 00:00:01 |     1 |     3 |
|   6 |    PARTITION RANGE JOIN-FILTER|         | 29999 |   292K|   548   (1)| 00:00:01 |:BF0000|:BF0000|
|   7 |     TABLE ACCESS FULL         | T2      | 29999 |   292K|   548   (1)| 00:00:01 |:BF0000|:BF0000|
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."X"="T2"."X")

Note
-----
   - this is an adaptive plan

SQL> set autotrace off
SQL>
SQL>

The key element here is that the HASH JOIN now sits above the cycling through all of the partitions.  In earlier releases of Oracle, you would not see the line containing the :BF0000, so it would be a simple join across all the rows as if the tables were not partitioned at all.  But when the partitions do not align, things are slightly better in modern releases.  We use a “Bloom filter” (hence the :BF prefix) to reduce the overhead of joining the two tables.  Since I’m using metaphors in this post, think of “phoning ahead” to the cinema to see if there are seats available for your favourite movie.  If the cinema owner says the movie is sold out, you have saved yourself a car trip. But just because the owner says there are seats available, it is still possible you might drive there and find that the movie has sold out during that time.  A Bloom filter is like phoning ahead – there’s a good chance you can avoid some work, but it is not a guarantee.  You can read about Bloom filters here in a great whitepaper by Christian Antognini.

Note that all of the partitions must align. Here is an example where the first three partitions are in alignment, having boundaries are 10000, 20000 and 30000, but our second table T2 has an additional partition defined.  Once again, we fall back to the Bloom filter option.


SQL> --
SQL> -- Example 3
SQL> --
SQL> drop table t2 purge;

Table dropped.

SQL> create table t2 ( x int, y int )
  2  partition by range ( x )
  3  (
  4  partition p1 values less than (10000),
  5  partition p2 values less than (20000),
  6  partition p3 values less than (30000),
  7  partition p4 values less than (40000)
  8  );

Table created.

SQL>
SQL> --
SQL> -- all partitions do NOT align, so we do NOT see partition-wise join
SQL> --
SQL>
SQL> insert into t2 select rownum, rownum from dual connect by level < 40000;

39999 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats('','t2')

PL/SQL procedure successfully completed.

SQL> set autotrace traceonly explain
SQL> select count(t1.y), count(t2.y)
  2  from t1,t2
  3  where t1.x = t2.x;

Execution Plan
----------------------------------------------------------
Plan hash value: 666786458

---------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |     1 |    20 |  1913   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE               |         |     1 |    20 |            |          |       |       |
|*  2 |   HASH JOIN                   |         | 29999 |   585K|  1913   (1)| 00:00:01 |       |       |
|   3 |    PART JOIN FILTER CREATE    | :BF0000 | 29999 |   585K|  1913   (1)| 00:00:01 |       |       |
|   4 |     PARTITION RANGE ALL       |         | 29999 |   292K|   820   (1)| 00:00:01 |     1 |     3 |
|   5 |      TABLE ACCESS FULL        | T1      | 29999 |   292K|   820   (1)| 00:00:01 |     1 |     3 |
|   6 |    PARTITION RANGE JOIN-FILTER|         | 39999 |   390K|  1093   (1)| 00:00:01 |:BF0000|:BF0000|
|   7 |     TABLE ACCESS FULL         | T2      | 39999 |   390K|  1093   (1)| 00:00:01 |:BF0000|:BF0000|
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."X"="T2"."X")

Note
-----
   - this is an adaptive plan

SQL> set autotrace off
SQL>
SQL>
SQL>


So faster queries on partitioned tables is not just about partition pruning.  Partition-wise joins also can make a beneficial impact on query response times.