Calcite / CALCITE-1570

Add MATCH_RECOGNIZE operator, for event pattern-matching

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: None
    • Fix Version/s: 1.13.0
    • Component/s: None
    • Labels:

      Description

      Add the MATCH_RECOGNIZE operator, for event pattern-matching. Oracle introduced this in Database 12c (for tables) and Esper implemented it (for streams of events).

      It would be most useful for streaming SQL but it makes sense in non-streaming SQL too (and of course it's good to be able to run streaming queries on historic data).

      Here is an example from oracle-base:

      SELECT *
      FROM   sales_history MATCH_RECOGNIZE (
               PARTITION BY product
               ORDER BY tstamp
               MEASURES  STRT.tstamp AS start_tstamp,
                         FINAL LAST(UP.tstamp) AS peak_tstamp,
                         MATCH_NUMBER() AS mno,
                         CLASSIFIER() AS cls
               ALL ROWS PER MATCH
               AFTER MATCH SKIP TO LAST DOWN
               PATTERN (STRT UP+ DOWN{1} UP+)
               DEFINE
                 UP AS UP.units_sold > PREV(UP.units_sold),
                 DOWN AS DOWN.units_sold < PREV(DOWN.units_sold)
             ) MR
      ORDER BY MR.product, MR.tstamp;
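The query above finds runs in units_sold that rise, fall exactly once, then rise again. As a rough illustration of what PATTERN (STRT UP+ DOWN{1} UP+) matches (not Calcite's or Oracle's implementation — just a sketch that encodes each row relative to its predecessor and matches a regex over the symbols):

```python
import re

def classify(units_sold):
    # Encode each row relative to the previous one: U = up, D = down,
    # X = flat. The first row gets a neutral symbol, since STRT has no
    # DEFINE clause and matches any row.
    symbols = ['S']
    for prev, cur in zip(units_sold, units_sold[1:]):
        symbols.append('U' if cur > prev else 'D' if cur < prev else 'X')
    return ''.join(symbols)

def find_matches(units_sold):
    # PATTERN (STRT UP+ DOWN{1} UP+): any starting row, one or more
    # rises, exactly one fall, then one or more rises again.
    # Returns (start_index, end_index) pairs of matched row ranges.
    s = classify(units_sold)
    return [(m.start(), m.end() - 1) for m in re.finditer(r'.U+DU+', s)]
```

This ignores MEASURES, AFTER MATCH SKIP, and ALL ROWS PER MATCH output shaping; it only demonstrates the row-classification-plus-regular-expression core of the operator.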
      

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          julianhyde Julian Hyde added a comment -

          Fabian Hueske has been building CEP functionality into Apache Flink and has talked about extensions to SQL for matching patterns. Here is an example from his presentation with Till Rohrmann, Streaming Analytics & CEP: Two sides of the same coin?:

          SELECT
            TUMBLE_START(tStamp, INTERVAL '1' DAY) as day,
            AVG(duration) as avgDuration
          FROM (
            -- CEP pattern
            SELECT (b.tStamp - a.tStamp) as duration, b.tStamp as tStamp
              FROM inputs
              PATTERN
                a FOLLOW BY b PARTITION BY orderId ORDER BY tStamp
                WITHIN INTERVAL '1' HOUR
            WHERE
              a.status = 'received' AND b.status = 'shipped'
          )
          GROUP BY
            TUMBLE(tStamp, INTERVAL '1' DAY)
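The FOLLOW BY pattern in that snippet pairs each order's 'received' event with its subsequent 'shipped' event within one hour. A minimal Python sketch of that matching step (the event layout and function name are assumptions for illustration, not Flink's API):

```python
from datetime import datetime, timedelta

def match_durations(events, within=timedelta(hours=1)):
    # events: (orderId, status, tStamp) tuples. Pair each 'received'
    # event with the next 'shipped' event for the same orderId, keeping
    # only pairs whose gap fits the window (WITHIN INTERVAL '1' HOUR).
    pending = {}    # orderId -> timestamp of its 'received' event
    durations = []
    for order_id, status, ts in sorted(events, key=lambda e: e[2]):
        if status == 'received':
            pending[order_id] = ts
        elif status == 'shipped' and order_id in pending:
            delta = ts - pending.pop(order_id)
            if delta <= within:
                durations.append((order_id, delta))
    return durations
```

The outer TUMBLE/AVG aggregation would then group these durations by day; that part is ordinary windowed SQL and is omitted here.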
          ransom Zhiqiang He added a comment -

          Julian Hyde Flink only supports the pattern API for now. I think MATCH_RECOGNIZE is more compatible with standard SQL syntax, and pattern-API programs can also be rewritten into MATCH_RECOGNIZE syntax. I want to support the MATCH_RECOGNIZE syntax in Calcite. If you agree, I will do it.

          ransom Zhiqiang He added a comment -

          Flink pattern JIRAs: https://issues.apache.org/jira/browse/FLINK-3318 https://issues.apache.org/jira/browse/FLINK-3319
          julianhyde Julian Hyde added a comment -

          Zhiqiang He, It would be great if you could contribute this feature. I suggest that we choose a subset of MATCH_RECOGNIZE to implement initially. We'd implement it all the way from SQL parsing, representation as a SqlNode AST, validation, representation as a new RelNode sub-class, and implementation (maybe via a rewrite rule to, say, the Window RelNode).

          ransom Zhiqiang He added a comment -

          Good! I will develop it.
          The MATCH_RECOGNIZE syntax is described here:
          https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8980

          julianhyde Julian Hyde added a comment -

          That's a lot. Can you cut it down to minimum viable functionality?

          ransom Zhiqiang He added a comment -

          OK, I will split it into some tasks.

          julianhyde Julian Hyde added a comment - edited

          Rather than slicing horizontally (into tasks each in a different area of the code), let's slice vertically and do a spike for the stupidest, simplest example. I propose that we do enough to implement this example from here and nothing more:

          select its.itemId
            from tkrfid_ItemTempStream MATCH_RECOGNIZE (
                PARTITION BY itemId
                MEASURES A.itemId as itemId
                PATTERN (A B* C)
                DEFINE
                    A  AS  (A.temp >= 25),
                    B  AS  ((B.temp >= 25) and (to_timestamp(B.element_time) - to_timestamp(A.element_time) < INTERVAL "0 00:00:05.00" DAY TO SECOND)),
                    C  AS  (to_timestamp(C.element_time) - to_timestamp(A.element_time) >= INTERVAL "0 00:00:05.00" DAY TO SECOND)
            ) as its
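In this spike example, a match fires when an item's temperature stays at or above 25 from row A until 5 seconds have elapsed since A (row C, which by its DEFINE clause has no temperature condition of its own). A minimal Python sketch of that semantics, assuming (itemId, temp, element_time) rows sorted by time — the function and field names are illustrative, not from Calcite:

```python
def find_hot_items(rows, threshold=25.0, window=5.0):
    # rows: (itemId, temp, element_time) tuples sorted by element_time.
    # PATTERN (A B* C): A starts a match when temp >= threshold; B rows
    # keep it alive while still hot and inside the window; C fires once
    # `window` seconds have elapsed since A, whatever the current temp.
    matches = []
    start = {}    # itemId -> element_time of that item's A row
    for item, temp, t in rows:
        if item in start and t - start[item] >= window:
            matches.append(item)        # C: >= 5 s elapsed since A
            del start[item]
        elif temp >= threshold:
            start.setdefault(item, t)   # A starts a match; B rows are no-ops
        else:
            start.pop(item, None)       # temp dropped inside the window: reset
    return matches
```

This is the "item stays hot for 5 seconds" detector in imperative form; the point of the spike is to produce the same result through parser, validator, and RelNode machinery.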
          

          But let's make it even dumber if we can. There's a lot that we don't know we don't know, and we will only find it when we go end-to-end. I don't want you to build a lot of SQL grammar (and other stuff) and have to tear it down.

          Also let's stick to standard SQL lexical norms, like single-quotes around intervals, double-quotes to quote identifiers. In other words, use what is already defined in Parser.jj, such as SimpleIdentifier() and IntervalLiteral().

          And let's write this only in terms of tables, not streams, at first. It will be much easier to test that way.

          ransom Zhiqiang He added a comment -

          Julian Hyde Thanks for your suggestion; I have updated the subtasks.

          julianhyde Julian Hyde added a comment -

          Resolved in release 1.13.0 (2017-06-26).


            People

            • Assignee:
              ransom Zhiqiang He
            • Reporter:
              julianhyde Julian Hyde
            • Votes:
              0
            • Watchers:
              5
