Hash Joins

I’ve written notes about the different joins in the past – but such things are always worth revisiting, so here’s an accumulated bundle of comments about hash joins.

A hash join takes two inputs that (in most of the Oracle literature) are referred to as the “build table” and the “probe table”. These rowsources may be extracts from real tables or indexes, or may be result sets from previous joins. Oracle uses the “build table” to build a hash table in memory, consuming the whole rowsource in a single call; it then consumes the “probe table” one row at a time, probing the in-memory hash table to find a match. Access to the hash table is made efficient by a hashing function applied to the join columns – rows with the same value in the join columns hash to the same place in the hash table. It is possible for different input values to produce the same hash value (a hash collision), so Oracle still has to compare the actual values once it has identified “probable” joins in the hash table. Because the comparison is based on a hash value, hash joins can be used only for equality comparisons, not for range-based comparisons.
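The build/probe mechanics can be sketched in a few lines of Python. This is purely a conceptual illustration (not Oracle's implementation, and the tables and column names are made up) but it shows the two phases and why the real join values must still be compared after a hash match:

```python
# Conceptual sketch of a hash join -- NOT Oracle's implementation, just an
# illustration of the build/probe mechanics described above.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Consume build_rows in full to create the hash table, then consume
    probe_rows one row at a time, probing the hash table for matches."""
    # Build phase: hash every build row on its join column.
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(hash(row[build_key]), []).append(row)

    # Probe phase: hash each probe row and examine candidate matches.
    results = []
    for row in probe_rows:
        for candidate in hash_table.get(hash(row[probe_key]), []):
            # A hash collision can put non-matching rows in the same
            # bucket, so the actual join values must still be compared.
            if candidate[build_key] == row[probe_key]:
                results.append({**candidate, **row})
    return results

# Hypothetical data standing in for two small rowsources.
t1 = [{"n1": 1, "a": "x"}, {"n1": 2, "a": "y"}]
t2 = [{"n1": 2, "b": "p"}, {"n1": 3, "b": "q"}]
print(hash_join(t1, t2, "n1", "n1"))   # one matching row, for n1 = 2
```

The equality test inside the probe loop is also why this structure cannot support a range predicate: the hash function scatters adjacent values, so only exact matches can be located.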

The hint to force a hash join is /*+ use_hash(rowsource_alias) */. This tells the optimizer that the join method to use when “rowsource_alias” is the next rowsource in the join order should be a hash join. Note, particularly, that it does not tell the optimizer whether that rowsource should be used as the build table or the probe table. To specify how the rowsource is used you need a second hint: no_swap_join_inputs(rowsource_alias) if you want Oracle to use the rowsource as the probe table, swap_join_inputs(rowsource_alias) if you want Oracle to use it as the build table; for example, with a little cosmetic editing:


select  /*+ leading(table_1 table_2) use_hash(table_2) no_swap_join_inputs(table_2) */ *
from    t1 table_1, t2 table_2
where   table_2.n1 = table_1.n1
;

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 45000 |    16M|    44 |
|*  1 |  HASH JOIN         |      | 45000 |    16M|    44 |
|   2 |   TABLE ACCESS FULL| T1   |  3000 |   547K|    14 |
|   3 |   TABLE ACCESS FULL| T2   |  3000 |   547K|    14 |
-----------------------------------------------------------

select  /*+ leading(table_1 table_2) use_hash(table_2) swap_join_inputs(table_2) */ *
from    t1 table_1, t2 table_2
where   table_2.n1 = table_1.n1
;

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 45000 |    16M|    44 |
|*  1 |  HASH JOIN         |      | 45000 |    16M|    44 |
|   2 |   TABLE ACCESS FULL| T2   |  3000 |   547K|    14 |
|   3 |   TABLE ACCESS FULL| T1   |  3000 |   547K|    14 |
-----------------------------------------------------------

If you looked at the second execution plan without analysing the optimizer trace file you would probably assume that the join order (very specifically, the thing that Oracle labels “join order” for the purposes of examining execution paths in a structured sequence) was (table_2, table_1). It is important to remember that the “internal” join order and the apparent join order are not necessarily the same, because the optimizer may have done some “side-swapping” for a plan using hash joins.

A common misunderstanding is that a hint of the form /*+ use_hash(table_1 table_2) */ is a directive to Oracle to do a hash join with table_1 as the build table and table_2 as the probe table. This is not the case; the hint in this form is simply a shorthand for a pair of single-table hints, viz: use_hash(table_1) use_hash(table_2). In the simplest case this translates as: “if table_1 is the second table in the join order then use a hash join to access it, if table_2 is the second table in the join order then use a hash join to access it”; given this interpretation it’s not too surprising that the hint appears to mean more than it really does.

There is a third hint that you can associate with hash joins, used to dictate how one set of slaves should distribute data to the next set of slaves when running parallel queries – the pq_distribute() hint. The manuals describe this hint as something like the following: pq_distribute(tablespec, inner_distribution, outer_distribution). I think this leaves some room for confusion – which table, for example, should be supplied in the tablespec, and how should you interpret inner and outer? (The manual helpfully tells us that the outer_distribution is the distribution for the outer table, but that doesn’t really explain anything.) The “outer” table is the one specified in the hint, the “inner” table is the previous rowsource – and this doesn’t change even if you’ve used the swap_join_inputs() hint. Basically, when you hint a hash join for a table in a parallel query you need three hints to describe the hash join, and for clarity you might as well make them three consecutive hints:


/*+
        use_hash(table_X)
        [no_]swap_join_inputs(table_X)
        pq_distribute(table_X {distribution for previous rowsource} {distribution for table_X})
*/

I won’t go into the details of the possible distribution methods and the effects they have, but here’s a little example demonstrating the hint. I’ve supplied an SQL statement (with the pq_distribute() hint commented out) and shown two plans – the first is the unhinted plan, the second shows the effect of my chosen distribution:

select  /*+
                leading(table_1 table_2)
                use_hash(table_2)
                swap_join_inputs(table_2)
--              pq_distribute(table_2 broadcast none)
        */ *
from    t1 table_1, t2 table_2
where   table_2.n1 = table_1.n1
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          | 45000 |    16M|    19 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 | 45000 |    16M|    19 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED   |          | 45000 |    16M|    19 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE          |          |  3000 |   547K|     8 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 |  3000 |   547K|     8 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          |  3000 |   547K|     8 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T2       |  3000 |   547K|     8 |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE          |          |  3000 |   547K|     8 |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH       | :TQ10001 |  3000 |   547K|     8 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR |          |  3000 |   547K|     8 |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL| T1       |  3000 |   547K|     8 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          | 45000 |    16M|    19 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 | 45000 |    16M|    19 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN             |          | 45000 |    16M|    19 |  Q1,01 | PCWP |            |
|   4 |     PX BLOCK ITERATOR    |          |  3000 |   547K|     8 |  Q1,01 | PCWC |            |
|   5 |      TABLE ACCESS FULL   | T2       |  3000 |   547K|     8 |  Q1,01 | PCWP |            |
|   6 |     BUFFER SORT          |          |       |       |       |  Q1,01 | PCWC |            |
|   7 |      PX RECEIVE          |          |  3000 |   547K|     8 |  Q1,01 | PCWP |            |
|   8 |       PX SEND BROADCAST  | :TQ10000 |  3000 |   547K|     8 |  Q1,00 | P->P | BROADCAST  |
|   9 |        PX BLOCK ITERATOR |          |  3000 |   547K|     8 |  Q1,00 | PCWC |            |
|  10 |         TABLE ACCESS FULL| T1       |  3000 |   547K|     8 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------

I’ve dictated the join order (t1, t2) and a hash join into t2, but I’ve also told the optimizer to swap the join inputs so when you look at the execution plan the join order appears to be (t2, t1).

By default – i.e. in the first plan – the optimizer used the (hash, hash) distribution: the first set of parallel execution slaves scanned t2 (sharing the work of the full tablescan fairly, we hope) and distributed the data “pseudo-randomly” (by the hashing algorithm on the join column) between the slaves in the second set; the first set then scanned t1 and distributed that data using the same hashing algorithm on the join column. The second set of slaves built and probed with the partial data sets they were given.
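The key property of the (hash, hash) distribution is that both rowsources are routed by the same hash function on the join column, so any two rows that could join are guaranteed to arrive at the same slave. A small sketch (hypothetical data, a toy stand-in for the real PX machinery):

```python
# Sketch of a (hash, hash) PX distribution: rows from both rowsources are
# routed to consumer slaves by hashing the join column. A toy illustration,
# not Oracle internals.

NUM_SLAVES = 4

def distribute_by_hash(rows, key, num_slaves=NUM_SLAVES):
    """Return one bucket per slave, routing each row by hash of its join column."""
    slaves = [[] for _ in range(num_slaves)]
    for row in rows:
        slaves[hash(row[key]) % num_slaves].append(row)
    return slaves

# Hypothetical rowsources sharing the join column n1.
t1 = [{"n1": n} for n in range(12)]
t2 = [{"n1": n} for n in range(12)]

build_sets = distribute_by_hash(t2, "n1")
probe_sets = distribute_by_hash(t1, "n1")

# Each slave can build and probe independently with its partial data set,
# because a given join value always routes to the same slave for both inputs.
for b, p in zip(build_sets, probe_sets):
    assert {r["n1"] for r in b} == {r["n1"] for r in p}
```

The cost of this guarantee is that every row of both rowsources has to be sent between slave sets, which is the trade-off the broadcast distribution avoids.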

In the second plan the first set of slaves scanned its selected portion of t2 and built a hash table from it, after which [see footnote] the second set of slaves started scanning t1 – but each slave in the second set sent every row it scanned to every slave in the first set. This example is a little unusual – I’ve chosen it simply to demonstrate the principle; normally you would expect to broadcast the build rowsource only if it were a “small” data set, as this tends to minimise the number of PX messages passed between slaves.
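A back-of-envelope comparison shows why broadcasting a small build rowsource can win. Counting one PX message per row sent between slave sets is a simplification (in reality rows are batched into messages), and the row counts here are invented for illustration:

```python
# Rough message-count comparison between (hash, hash) and (broadcast, none)
# distributions, counting one message per row sent between slave sets.
# Hypothetical row counts; real PX batches rows into larger messages.

num_slaves = 4
small_rows = 1_000      # the "small" build rowsource
big_rows   = 1_000_000  # the large probe rowsource

# (hash, hash): every row of BOTH rowsources crosses between slave sets once.
hash_hash_msgs = small_rows + big_rows

# (broadcast, none): every small row goes to every slave in the other set;
# the large rowsource is scanned by the consuming slaves and never redistributed.
broadcast_msgs = small_rows * num_slaves

print(hash_hash_msgs, broadcast_msgs)  # 1001000 vs 4000
```

As the sketch suggests, the balance tips the other way once the broadcast rowsource is large relative to the probe rowsource divided by the degree of parallelism, which is why my (broadcast none) example above is an unusual choice.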

It’s worth pointing out that there’s a lot more you can do to investigate parallelism and hash joins – especially if you’re working with the Exadata database machine – and I will be publishing further notes on the topic. I’ll leave you with one warning, though. If you change the join in the example to an outer join – “table_2.n1(+) = table_1.n1” – the (broadcast, none) distribution becomes invalid, so the pq_distribute() hint will be “ignored” even though the use_hash() and swap_join_inputs() hints will still be followed. You have to be very careful when trying to engineer parallel queries to follow a specific path.

Footnote: taking a closer look at the set of trace files generated in the broadcast test, I discovered that the set of slaves scanning t1 actually starts its parallel tablescan first, but stops after just one read per slave; the other set of slaves then scans t2 and builds the hash table before calling for further data from the t1 scan.