Search

Top 60 Oracle Blogs

Recent comments

Serial Bloom

Following the recent note I wrote about an enhancement to the optimizer’s use of Bloom filters, I received a question by email asking about the use of Bloom filters in serial execution plans:

I’m having difficulty understanding the point of a Bloom filter when used in conjunction with a hash join where everything happens within the same process.

I believe you mentioned in your book (Cost Based Oracle) that hash joins have a mechanism similar to a Bloom filter where a row from the probe table is checked against a bitmap, where each hash table bucket is indicated by a single bit. (You have a picture on page 327 of the hash join and bitmap, etc).

The way that bitmap looks and operates appears to be similar to a Bloom filter to me…. So it looks (to me) like hash joins have a sort of “Bloom filter” already built into them.

My question is… What is the advantage of adding a Bloom filter to a hash join if you already have a type of Bloom filter mechanism thingy built in to hash joins?

I can understand where it would make sense with parallel queries having to pass data from one process to another, but if everything happens within the same process I’m just curious where the benefit is.

 

The picture on page 327 of CBO-F is a variation on the following, which is the penultimate snapshot of the sequence of events in a multi-pass hash join. The key feature is the in-memory bitmap at the top of the image describing which buckets in the (partitioned and spilled) hash table hold rows from the build table. I believe that it is exactly this bitmap that is used as the Bloom filter.

#400040;margin-top:2px;margin-bottom:2px;" src="https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg" alt="" width="848" height="319" srcset="https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg?w=848&h=319 848w, https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg?w=1696&... 1696w, https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg?w=150&h=56 150w, https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg?w=300&h... 300w, https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg?w=768&h... 768w, https://jonathanlewis.files.wordpress.com/2020/09/hash_join2.jpg?w=1024&... 1024w" sizes="(max-width: 848px) 100vw, 848px" />

The question of why it might be worth creating and using a Bloom filter in a simple serial hash join is really a question of scale. What is the marginal benefit of the Bloom filter when the basic hash join mechanism is doing all the hash arithmetic and comparing with a bitmap anyway?

If the hash join is running on an Exadata machine then the bitmap can be passed as a predicate to the cell servers and the hash function can be used at the cell server to minimise the volume of data that has to be passed back to the database server – with various optimisations dependent on the version of the Exadata software. Clearly minimising traffic through the interconnect is going to have some benefit.

Similarly, as the email suggests, for a parallel query where (typically) one set of parallel processes will read the probe table and distribute the data to the second set of parallel processes which then do the hash join it’s clearly sensible to allow the first set of procsses to apply the hash function and discard as many rows as possible before distributing the survivors – minimising inter-process communication.

In both these cases, of course, there’s a break point to consider of how effective the Bloom filter needs to be before it’s worth taking advantage of the technology. If the Bloom filter allows 99 rows out of every hundred to be passed to the database server / second set of parallel processes then Oracle has executed the hash function and checked the bitmap 100 times to avoid sending one row (and it will (may) have to do the same hash function and bitmap check again to perform the hash join); on the other hand if the Bloom filter discards 99 rows and leaves only one row surviving then that’s a lot of traffic eliminated – and that’s likely to be a good thing. This is why there are a few hidden parameters defining the boundaries of when Bloom filters should be used – in particular there’s a parameter “_bloom_filter_ratio” which defaults to 35 and is, I suspect, a figure which says something like “use Bloom filtering only if it’s expected to reduce the probe data to 35% of the unfiltered size”.

So the question then becomes: “how could you benefit from a serial Bloom filter when it’s the same process doing everything and there’s no “long distance” traffic going on between processes?” The answer is simply that we’re operating at a much smaller scale. I’ve written blog notes in the past where the performance of a query depends largely on the number of rows that are passed up a query plan before being eliminated (for example here, where the volume of data moving results in a significant fraction of the total time).

If you consider a very simple hash join its plan is going to be shaped something like this:


-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |   720 |    31 |
|*  1 |  HASH JOIN         |      |    45 |   720 |    31 |
|*  2 |   TABLE ACCESS FULL| T2   |    15 |   120 |    15 |
|   3 |   TABLE ACCESS FULL| T1   |  3000 | 24000 |    15 |
-----------------------------------------------------------

If you read Tanel Poder’s article on execution plans as a tree of Oracle function calls you’ll appreciate that you could translate this into informal English along the lines of:

  • Operation 1 calls a function (at operation 2) to do a tablescan of t1 and return all the relevant rows, building an in-memory hash table by applying a hashing function to the join column(s) of each row returned by the call to the tablescan. As the hash table is populated the operation also constructs a bitmap to flag buckets in the hash table that have been populated.
  • Operation 1 then calls a function (at operation 3) to start a tablescan and then makes repeated calls for it to return one row (or, in newer versions, a small rowset) at a time from table t2. For each row returned operation 1 applies the same hash function to the join column(s) and checks the bitmap to see if there’s a potential matching row in the relevant bucket of the hash table, and if there’s a potential match Oracle examines the actual contents of the bucket (which will be stored as a linked list) to see if there’s an actual match.

Taking the figures above, let’s imagine that Oracle is using a rowset size of 30 rows. Operation 1 will have to make 100 calls to Operation 3 to get all the data, and call the hashing function 3,000 times.  A key CPU component of the work done is that the function represented by operation 3 is called 100 times and (somehow) allocates and fills an array of 30 entries each time it is called.

Now assume operation 1 passes the bitmap to operation 3 as an input and it happens to be a perfect bitmap. Operation 3 starts its tablescan and will call the hash function 3,000 times, but at most 45 rows will get past the bitmap. So operation 1 will only have to call operation 3 twice.  Admittedly operation 1 will (possibly) call the hash function again for each row – but maybe operation 3 will supply the hash value in the return array. Clearly there’s scope here for a trade-off between the reduction in work due to the smaller number of calls and the extra work needed to take advantage of the bitmap technology.

Here’s an example that shows the potential for savings – if you want to recreate this test you’ll need about 800MB of free space in the database, the first table takes about 300MB and the second about 450MB.


rem
rem     Script:         bloom_filter_serial_02.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,30,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7 -- > comment to avoid WordPress format issue
;

create table t2
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        round(rownum + 0.5,2)           id,
        mod(rownum,1e5)                 n1,
        lpad(rownum,10)                 v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7 -- > comment to avoid WordPress format issue
;


prompt  =================
prompt  With Bloom filter
prompt  =================

select 
        /*+ 
                px_join_filter(t1) 
                monitor
        */
        t1.v1, t2.v1
from 
        t2, t1
where 
        t2.n1 = 0
and 
        t1.id = t2.id
/

prompt  ===============
prompt  No Bloom filter
prompt  ===============

select 
        /*+
                monitor
        */
        t1.v1, t2.v1
from 
        t2, t1
where 
        t2.n1 = 0
and 
        t1.id = t2.id
/

I’ve created tables t1 and t2 with an id column that never quite matches, but the range of values is set so that the optimizer thinks the two tables might have a near-perfect 1 to 1 match. I’ve given t2 an extra column with 105 distinct values in its 107 rows, so it’s going to have 100 rows per distinct value. Then I’ve presented the optimizer with a query that looks as if it’s going to find 100 rows in t2 and needs to find a probable 100 rows of matches in t1. For my convenience, and to highlight a couple of details of Bloom filters, it’s not going to find any matches.

In both runs I’ve enabled the SQL Monitor feature with the /*+ monitor */ hint, and in the first run I’ve also hinted the use of a Bloom filter. Here are the resulting SQL Monitor outputs. Bear in mind we’re looking at a reasonably large scale query – volume of input data – with a small result set.

First without the Bloom filter:


Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|    3.00 |    2.24 |     0.77 |     1 |  96484 |  773 | 754MB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=2959412835)
==================================================================================================================================================
| Id |      Operation       | Name |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                      |      | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT     |      |         |       |         2 |     +2 |     1 |        0 |      |       |     . |          |                 |
|  1 |   HASH JOIN          |      |     100 | 14373 |         2 |     +2 |     1 |        0 |      |       |   2MB |          |                 |
|  2 |    TABLE ACCESS FULL | T2   |      99 |  5832 |         2 |     +1 |     1 |      100 |  310 | 301MB |     . |          |                 |
|  3 |    TABLE ACCESS FULL | T1   |     10M |  8140 |         2 |     +2 |     1 |      10M |  463 | 453MB |     . |          |                 |
==================================================================================================================================================

According to the Global Stats the query has taken 3 seconds to complete, of which 2.24 seconds is CPU. (The 750MB read in 0.77 second would be due to the fact that I’m running off SSD, and I’ve got a 1MB read size that helps). A very large fraction of the CPU appears because of the number of calls from operation 1 to operation 3 (the projection information pulled from memory reports a rowset size of 256 rows, so that’s roughly 40,000 calls to the function.

When we force the use of a Bloom filter the plan doesn’t change much (though the creation and use of the Bloom filter has to be reported) – but the numbers do change quite significantly.

Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|    1.97 |    0.99 |     0.98 |     1 |  96484 |  773 | 754MB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=4148581417)
======================================================================================================================================================
| Id |       Operation       |  Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity | Activity Detail |
|    |                       |         | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |   (# samples)   |
======================================================================================================================================================
|  0 | SELECT STATEMENT      |         |         |       |         1 |     +1 |     1 |        0 |      |       |     . |          |                 |
|  1 |   HASH JOIN           |         |     100 | 14373 |         1 |     +1 |     1 |        0 |      |       |   1MB |          |                 |
|  2 |    JOIN FILTER CREATE | :BF0000 |      99 |  5832 |         1 |     +1 |     1 |      100 |      |       |     . |          |                 |
|  3 |     TABLE ACCESS FULL | T2      |      99 |  5832 |         1 |     +1 |     1 |      100 |  310 | 301MB |     . |          |                 |
|  4 |    JOIN FILTER USE    | :BF0000 |     10M |  8140 |         1 |     +1 |     1 |    15102 |      |       |     . |          |                 |
|  5 |     TABLE ACCESS FULL | T1      |     10M |  8140 |         1 |     +1 |     1 |    15102 |  463 | 453MB |     . |          |                 |
======================================================================================================================================================


In this case, the elapsed time dropped to 1.97 seconds (depending on your viewpoint that’s either a drop of “only 1.03 seconds” or drop of “an amazing 34.3%”; with the CPU time dropping from 2.24 seconds to 0.99 seconds (55.8% drop!)

In this case you’ll notice that the tablescan of t1 produced only 15,102 rows to pass up to the hash join at operation 1 thanks to the application of the predicate (not reported here): filter(SYS_OP_BLOOM_FILTER(:BF0000,”T1″.”ID”)). Instead of 40,000 calls for the next rowset the hash function has been applied during the tablescan and operation 5 has exhausted the tablescan after only about 60 calls. This is what has given us the (relatively) significant saving in CPU.

This example of the use of a Bloom filter highlights up the two points I referred to earlier.

  • First, although we see operations 4 and 5 as Join (Bloom) filter use and Table access full respectively I don’t think the data from the tablescan is being “passed up” from operation 5 to 4; I believe operation 4 can be views as a “placeholder” in the plan to allow us to see the Bloom filter in action, the hashing and filtering actually happening during the tablescan.
  • Secondly, we know that there are ultimately no rows in the result set, yet the application of the Bloom filter has not eliminated all the data. Remember that the bitmap that Oracle constructs of the hash table identifies used buckets, not actual values. Those 15,102 rows are rows that “might” find a match in the hash table because they belong in buckets that are flagged. A Bloom filter won’t discard any data that is needed, but it might fail to eliminate data that subsequently turns out to be unwanted.

How parallel is parallel anyway?

I’ll leave you with one other thought. Here’s an execution plan from 12c (12.2.0.1) which joins three dimension tables to a fact table. There are 343,000 rows in the fact table and the three joins individually identify about 4 percent of the data in the table. In a proper data warehouse we might have been looking at a bitmap star transformation solution for this query, but in a mixed system we might want to run warehouse queries against normalised data – this plan shows what Bloom filters can do to minimise the workload. The plan was acquired from memory after enabling rowsource execution statistics:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |      1 |        |        |      |            |      1 |00:00:00.05 |      22 |      3 |       |       |          |
|   1 |  SORT AGGREGATE              |          |      1 |      1 |        |      |            |      1 |00:00:00.05 |      22 |      3 |       |       |          |
|   2 |   PX COORDINATOR             |          |      1 |        |        |      |            |      2 |00:00:00.05 |      22 |      3 | 73728 | 73728 |          |
|   3 |    PX SEND QC (RANDOM)       | :TQ10000 |      0 |      1 |  Q1,00 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   4 |     SORT AGGREGATE           |          |      2 |      1 |  Q1,00 | PCWP |            |      2 |00:00:00.09 |    6681 |   6036 |       |       |          |
|*  5 |      HASH JOIN               |          |      2 |     26 |  Q1,00 | PCWP |            |     27 |00:00:00.09 |    6681 |   6036 |  2171K|  2171K|     2/0/0|
|   6 |       JOIN FILTER CREATE     | :BF0000  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|*  7 |        TABLE ACCESS FULL     | T3       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|*  8 |       HASH JOIN              |          |      2 |    612 |  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6634 |   6026 |  2171K|  2171K|     2/0/0|
|   9 |        JOIN FILTER CREATE    | :BF0001  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 10 |         TABLE ACCESS FULL    | T2       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 11 |        HASH JOIN             |          |      2 |  14491 |  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6614 |   6022 |  2171K|  2171K|     2/0/0|
|  12 |         JOIN FILTER CREATE   | :BF0002  |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|* 13 |          TABLE ACCESS FULL   | T1       |      2 |      3 |  Q1,00 | PCWP |            |      6 |00:00:00.01 |      20 |      4 |       |       |          |
|  14 |         JOIN FILTER USE      | :BF0000  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  15 |          JOIN FILTER USE     | :BF0001  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  16 |           JOIN FILTER USE    | :BF0002  |      2 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|  17 |            PX BLOCK ITERATOR |          |      2 |    343K|  Q1,00 | PCWC |            |     27 |00:00:00.08 |    6594 |   6018 |       |       |          |
|* 18 |             TABLE ACCESS FULL| T4       |     48 |    343K|  Q1,00 | PCWP |            |     27 |00:00:00.05 |    6594 |   6018 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

It’s a parallel plan, but it’s used the 12c “PQ_REPLICATE” strategy. The optimizer has decided that all the dimension tables are so small that it’s going to allow every PX process to read every (dimension) table through the buffer cache and build its own hash tables from them. (In earlier versions you might have seen the query coordinator scanning and broadcasting the three small tables, or one set of PX processes scanning and broadcasting to the other set).

So every PX process has an in-memory hash table of all three dimension tables and then (operation 17) they start a tablescan of the fact table, picking non-overlapping rowid ranges to scan. But since they’ve each created three in-memory hash tables they’ve also been able to create three Bloom filters each, which can all be applied simultaneously as the tablescan takes place; so instead of 343,000 rows being passed up the plan and through the first hash join (where we see from operation 11 that the number of surviving rows would have been about 14,500 ) we see all but 27 rows discarded very early on in the processing. Like bitmap indexes part of the power of Bloom filters lies in the fact that with the right plan the optimizer can combine them and identify a very small data set very precisely, very early.

The other thing I want you to realise about this plan, though, is that it’s not really an “extreme” parallel plan. It’s effectively running as a set of concurrent, non-interfering, serial plans. Since I was running (parallel 2) Oracle started just 2 PX processes: they both built three hash tables from the three dimension tables then split the fact table in half and took half each to do all the joins, and passed the nearly complete result to the query co-ordinator at the last moment. That’s as close as you can get to two serial, non-interfering, queries and still call it a parallel query. So, if you wonder why there might be any benefit in serial Bloom filters – Oracle’s actually being benefiting from it under the covers for several years.

Summary

Bloom filters trade a decrease in messaging against an increase in preparation and hashing operations. For Exadata systems with predicate offloading it’s very easy to see the potential benefit; for general parallel execution; it’s also fairly easy to see the potential benefit for parallel query execution what inter-process message between two sets of PX processes can be resource intensive; but even for serial queries there can be some benefit though, in absolute terms, they are likely to be only a small saving in CPU.