Search

Top 60 Oracle Blogs

Recent comments

Match_recognize

In the spirit of Cary Millsap’s comment: “The fastest way to do anything is to not do it at all”, here’s my take (possibly not an original one) on solving problems:

“The best time to solve a problem is before it has happened.”

I spend quite a lot of my “non-contact” time thinking about boundary cases, feature collisions, contention issues, and any other things that could go wrong when you start to implement real systems with (new) Oracle features. The benefit of doing this, of course, is that when I’m looking at a client’s system I can often solve problems because I recognise symptoms that I’ve previously created “in the lab”. The strange thing about this is that there have been times when I’ve pushed Oracle to a breaking point, documented it, and then dismissed the threat because “no one would do that in real life” only to find that someone has done it in real life.

All this is just a preamble to a demonstration of a threat with a terrific feature that is just beginning to gain greater acceptance as a solution to some interesting problems – and the demonstration is going to exaggerate the problem to a level that (probably) won’t appear in a production. The driving example appeared as a question on the OTN/ODC database forum:

“I need customers who have done a transaction in September but not in October.”

There are obviously many ways to address this type of requirement (my first thought was to use the MINUS operator), and a few questions you might ask before trying to address it, but the OP had supplied some data to play which consisted of just a few rows of a table with three columns and some data restricted to just one year, and one solution offered was a very simple query using the 12c feature match_recognize():


CREATE TABLE TEST_TABLE   
  ( T_ID NUMBER, -- trans-id  
    CUST_ID NUMBER,   
    TRANS_DT DATE  
  ) ;  
                  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (1,100,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (2,100,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (3,200,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (4,300,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (5,400,to_date('12-JAN-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (6,500,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (7,500,to_date('12-MAR-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (8,600,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (9,600,to_date('12-JUL-17','DD-MON-RR'));  

commit;

select * from test_table
match_recognize
(
  partition by cust_id
  order by trans_dt
  pattern( x+ y* $)
  define
    x as extract(month from trans_dt)  = 9,
    y as extract(month from trans_dt) != 10
);
 
   CUST_ID
----------
       200
       600
      

The obvious benefit of this solution over a solution involving a set-wise MINUS is that it need only scan the data set once (whereas the MINUS strategy will be scanning it twice with a select distinct in each scan) – but it’s a solution that is likely to be unfamiliar to many people and may need a little explanation.

The partition by cust_id order by trans_dt means we sort the data by those two columns, breaking on cust_id. Then for each cust_id we walk through the data looking for a pattern which is defined as: “one or more rows where the month is september followed by zero or more rows where the month is NOT october followed by the end of the set for the customer”. The SQL leaves many details to default so the result set is just the cust_id column and only one row per occurrence of the pattern (which, given the data set, can occur at most once per customer).

For a cust_id that shows a matching pattern the work we will have done is:

  • Walk through rows for Jan to Aug until we reach the first September – which is the start of pattern
  • Keep on walking through to the last of the Septembers – which is a partial match
  • One of
  • Walk through zero rows of November and December and reach the end of cust_id
  • Walk through one or more rows of November and/or December then reach the end of cust_id
  • Record the end of pattern by reporting one row
  • Move on to next cust_id

The excitement starts when we think about a cust_id that doesn’t have a matching pattern – and for that I’m going to generate a new, extreme, data set.


rem
rem     Script:         match_recognize_07.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level  comment to avoid WordPress format issue
)
select
        rownum                          id,
        99                              cust_id,
        to_date('01-Sep-2017')          trans_dt,
        lpad(rownum,1000,'0')           padding
from
        generator       v1,
        generator       v2
where
        rownum  comment to avoid WordPress format issue
;

update t1
set
        trans_dt = to_date('01-Oct-2017','dd-mon-yyyy')
where
        rownum = 1
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select  *
from    (
        select 
                t1.*,
                extract(year from trans_dt) yr, 
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding
        pattern( x+  y*  $ )
        define
                x as mth = 9,
                y as mth != 10
);

I’ve moved the calculation of month number from the define clause into an in-line view purely to make the match_recognize() clause a little tidier.

I’ve created a table with just one customer with 100,000 transactions on 1st September 2017, then I’ve updated one row from September to October. Thanks to that one row Oracle is not going to be able to find the requested pattern. I’ve added a padding column of 1,000 characters to the table and included it in the measures that I want to select, so Oracle will have to sort roughly 100MB of data (100,000 rows at roughly 1KB per row) before it starts walking the data to find matches – and, though it’s not visible in the script, the workarea settings mean the session won’t be allowed to expand its PGA to accommodate the whole 100MB.

Test 1 – comment out the update and see how long it takes to produce a result: 0.67 seconds, and the padding value reported was the last one from the pattern.
Test 2 – put the update back in place and try again:

After running for 46 seconds with no result and interrupting the query these are some figures from a snapshot of the session stats:

Name                                                 Value
----                                                 -----
CPU used when call started                           3,662
DB time                                              3,711
user I/O wait time                                   1,538
consistent gets                                     14,300
physical reads direct                            1,298,939
physical read IO requests                          736,478
physical read bytes                         10,640,908,288      
physical writes                                     25,228
physical writes direct                              25,228
physical reads direct temporary tablespace       1,298,939
physical writes direct temporary tablespace         25,228
table scan rows gotten                             100,000
table scan blocks gotten                            14,286

  • I’ve scanned a table of 14,286 blocks to find 100,000 rows.
  • I’ve sorted and spilled to disc, using roughly 25,000 blocks of direct path writes and reads to do the sort.
  • Then I’ve spend the rest of the time burning up CPU and reading 1.27 million blocks from the temporary tablespace trying to find a match

The way that basic pattern matching works on a match failure is to go back to the row after the one where the current match attempt started, and begin all over again. So in this example, after dumping 100MB of Septembers to temp Oracle started at row 1, read 999,999 rows, then found the October that failed the match; so it went to row 2 [ed: doing some very expensive back-tracking: see comment #2 from Stew Ashton], read 999,998 rows, then found the October that failed the match; so it went to row 3 and so on. Every time it went back to (nearly) the beginning it had to start re-reading that 100,000 rows from temp because the session wasn’t allowed to keep the whole 100MB in memory.

You need to avoid defining a pattern that has to scan large volumes of data to identify a single occurrence of the pattern if the matching process is likely to fail. Even if you can keep the appropriate volume of data in memory for the entire time and avoid a catastrophic volume of reads from the temporary tablespace you can still see a huge amount of CPU being used to process the data – when I reduced the table from 100,000 rows to 10,000 rows it still took me 99 CPU seconds to run the query.

tl;dr

The 12c match_recognize() is a terrific tool, but you must remember two important details about the default behaviour when you think about using it:

  • You will sort a volume of data that is the number of input rows multiplied but the total length of the measures/partition output.
  • If you have a long sequence of rows that ends up failing to match a pattern Oracle goes back to the row after the start of the previous match attempt.

With the usual proviso that “large”, “small” etc. are all relative: keep the data volume small, and try to define patterns that will be short  runs of rows.

Do note, however, that I engineered this example to produce a catastrophe. There are many non-default actions you can choose to minimise the workload you’re likely to produce with match_recognize(), and if you just spare a little time to think about worst case events you probably won’t need to face a scenario like this in a real production environment.

See also:

Part 6 (which includes a list of earlier installments) of an introductory series to match_recognize() by Keith Laker.

A pdf file of Keith Laker’s presentation on match_recognize(), including some technical implementation details.