Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1935

Reference implementation for MATCH_RECOGNIZE

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We now have comprehensive support for parsing and validating MATCH_RECOGNIZE queries (see CALCITE-1570 and sub-tasks) but we cannot execute them. I know the purpose of this work is to do CEP within Flink, but a reference implementation that works on non-streaming data would be valuable.

      I propose that we add a class EnumerableMatch that can generate Java code to evaluate MATCH_RECOGNIZE queries on Enumerable data. It does not need to be efficient. I don't mind if it (say) buffers all the data in memory and makes O(n ^ 3) passes over it. People can make it more efficient over time.

      When we have a reference implementation, people can start playing with this feature. And we can start building a corpus of data sets, queries, and their expected result. The Flink implementation will be able to test against those same queries, and should give the same results, even though Flink will be reading streaming data.

      Let's create match.iq with the following query based on https://oracle-base.com/articles/12c/pattern-matching-in-oracle-database-12cr1:

      !set outputformat mysql
      !use match
      
      SELECT *
      FROM sales_history MATCH_RECOGNIZE (
               PARTITION BY product
               ORDER BY tstamp
               MEASURES  STRT.tstamp AS start_tstamp,
                         LAST(UP.tstamp) AS peak_tstamp,
                         LAST(DOWN.tstamp) AS end_tstamp,
                         MATCH_NUMBER() AS mno
               ONE ROW PER MATCH
               AFTER MATCH SKIP TO LAST DOWN
               PATTERN (STRT UP+ FLAT* DOWN+)
               DEFINE
                 UP AS UP.units_sold > PREV(UP.units_sold),
                 FLAT AS FLAT.units_sold = PREV(FLAT.units_sold),
                 DOWN AS DOWN.units_sold < PREV(DOWN.units_sold)
             ) MR
      ORDER BY MR.product, MR.start_tstamp;
      
      PRODUCT    START_TSTAM PEAK_TSTAMP END_TSTAMP         MNO
      ---------- ----------- ----------- ----------- ----------
      TWINKIES   01-OCT-2014 03-OCT-2014 06-OCT-2014          1
      TWINKIES   06-OCT-2014 08-OCT-2014 09-OCT-2014          2
      TWINKIES   09-OCT-2014 13-OCT-2014 16-OCT-2014          3
      TWINKIES   16-OCT-2014 18-OCT-2014 20-OCT-2014          4
      
      4 rows selected.
      
      !ok
      

        Activity

        Hide
        julianhyde Julian Hyde added a comment -

        Zhiqiang He or Dian Fu, could one of you take on this task?

        Show
        julianhyde Julian Hyde added a comment - Zhiqiang He or Dian Fu , could one of you take on this task?
        Hide
        dian.fu Dian Fu added a comment -

        Julian Hyde, sorry for late response. IMO, it's not easy to add the reference implementation in calcite as we need a NFA for pattern detection. Zhiqiang He, thoughts?

        Show
        dian.fu Dian Fu added a comment - Julian Hyde , sorry for late response. IMO, it's not easy to add the reference implementation in calcite as we need a NFA for pattern detection. Zhiqiang He , thoughts?
        Hide
        julianhyde Julian Hyde added a comment -

        Remember, it doesn't have to be efficient. Could you use Java's built-in regular expression support? Compile a regular expression, and as each row comes in, add a string token to a StringBuilder, then see whether the regular expression has matched.

        Show
        julianhyde Julian Hyde added a comment - Remember, it doesn't have to be efficient. Could you use Java's built-in regular expression support? Compile a regular expression, and as each row comes in, add a string token to a StringBuilder, then see whether the regular expression has matched.
        Hide
        ransom Zhiqiang He added a comment -

        I do not think we can use regular expression support. It is too simple and some case maybe can not support.
        I think I can develop it used NFA. but it is very complex. It may be cost a long time to develop.
        I will create some subtask when I familiar with corpus data sets.
        Julian Hyde can you give me a test case in calcite project for corpus data sets. I want to kown how to develop it and which interface will be implametion. thanks.

        Show
        ransom Zhiqiang He added a comment - I do not think we can use regular expression support. It is too simple and some case maybe can not support. I think I can develop it used NFA. but it is very complex. It may be cost a long time to develop. I will create some subtask when I familiar with corpus data sets. Julian Hyde can you give me a test case in calcite project for corpus data sets. I want to kown how to develop it and which interface will be implametion. thanks.
        Hide
        julianhyde Julian Hyde added a comment -

        Maybe regexp cannot handle everything. But we've been working on this feature for several months and there are no tests that run queries. A simple reference implementation will allow us to start writing tests. It doesn't matter too much that the reference implementation is not very efficient, or cannot handle complex cases. It's much more important that we have tests. We can use those tests to help us build the second, better, implementation.

        Show
        julianhyde Julian Hyde added a comment - Maybe regexp cannot handle everything. But we've been working on this feature for several months and there are no tests that run queries. A simple reference implementation will allow us to start writing tests. It doesn't matter too much that the reference implementation is not very efficient, or cannot handle complex cases. It's much more important that we have tests. We can use those tests to help us build the second, better, implementation.
        Hide
        ransom Zhiqiang He added a comment -

        But this simple reference implementation can not be used in Real System. It's test only.
        It is cost our more and more time only and no any other result。
        and we wil rewrite it used NFA finally, why we not accomplish it first?

        Show
        ransom Zhiqiang He added a comment - But this simple reference implementation can not be used in Real System. It's test only. It is cost our more and more time only and no any other result。 and we wil rewrite it used NFA finally, why we not accomplish it first?
        Hide
        julianhyde Julian Hyde added a comment -

        As you have said, building an NFA is expensive. It might be less effort to write if you have working tests before you write it. A reference implementation will be like scaffolding, getting you to those working tests.

        As I said earlier, this project has lots of algebra tests but no query tests. That seems off to me.

        Lastly, there will be several implementations of this feature (streaming, non-streaming, possibly multiple engines, and possibly implementation using rewrites to other operators). Tests will make those multiple implementations possible. A simple reference implementation will help us check semantics.

        Show
        julianhyde Julian Hyde added a comment - As you have said, building an NFA is expensive. It might be less effort to write if you have working tests before you write it. A reference implementation will be like scaffolding, getting you to those working tests. As I said earlier, this project has lots of algebra tests but no query tests. That seems off to me. Lastly, there will be several implementations of this feature (streaming, non-streaming, possibly multiple engines, and possibly implementation using rewrites to other operators). Tests will make those multiple implementations possible. A simple reference implementation will help us check semantics.
        Hide
        ransom Zhiqiang He added a comment -

        But regular expression can not conver all functions in match_recognize.
        so there is not point to do it by regular expression. many test case can not be supportted, so the whole match_recognize feture can not be used.

        Show
        ransom Zhiqiang He added a comment - But regular expression can not conver all functions in match_recognize. so there is not point to do it by regular expression. many test case can not be supportted, so the whole match_recognize feture can not be used.
        Hide
        julianhyde Julian Hyde added a comment -

        The important point is that we need to start executing MATCH_RECOGNIZE queries in tests. I don't really care whether you write a simple, slow "reference implementation" or you do more efficient implementation. I won't accept pull requests for parser/validator changes for much longer if they are not accompanied by tests that execute SQL.

        Show
        julianhyde Julian Hyde added a comment - The important point is that we need to start executing MATCH_RECOGNIZE queries in tests. I don't really care whether you write a simple, slow "reference implementation" or you do more efficient implementation. I won't accept pull requests for parser/validator changes for much longer if they are not accompanied by tests that execute SQL.
        Hide
        julianhyde Julian Hyde added a comment -
        Show
        julianhyde Julian Hyde added a comment - Here is a dev branch with match.iq: https://github.com/julianhyde/calcite/tree/1935-execute-match
        Hide
        ransom Zhiqiang He added a comment -

        I'm very sorry to replay so late.
        I still thinks that we can not used regular express to test match_recognize. measures and partition by , order by can not be defined by regulare express.
        so I still want to used NFA to run it.
        If you are agree it. I will continue to develop it based on your branch.
        and It may be cost a long time to develop it.

        Show
        ransom Zhiqiang He added a comment - I'm very sorry to replay so late. I still thinks that we can not used regular express to test match_recognize. measures and partition by , order by can not be defined by regulare express. so I still want to used NFA to run it. If you are agree it. I will continue to develop it based on your branch. and It may be cost a long time to develop it.
        Hide
        julianhyde Julian Hyde added a comment -

        There is an old result that a regexp is equivalent to an NFA (or a DFA, for that matter). See https://en.wikipedia.org/wiki/Regular_language. So, whatever sequence of events your NFA recognizes, there is an equivalent regular expression that matches precisely the same set of events.

        In other words, you can save yourself the effort of building an NFA by hacking the one inside the Java regexp implementation.

        I can see how an NFA would be much more efficient than other approaches. But if efficiency were not an issue, would you still build it?

        I strongly believe that your should build a very inefficient implementation first. O(n ^ 5) would be fine. Then when we have 20 running test queries, start on the second implementation, which will be efficient and perhaps also be able to run on unbounded input.

        Show
        julianhyde Julian Hyde added a comment - There is an old result that a regexp is equivalent to an NFA (or a DFA, for that matter). See https://en.wikipedia.org/wiki/Regular_language . So, whatever sequence of events your NFA recognizes, there is an equivalent regular expression that matches precisely the same set of events. In other words, you can save yourself the effort of building an NFA by hacking the one inside the Java regexp implementation. I can see how an NFA would be much more efficient than other approaches. But if efficiency were not an issue, would you still build it? I strongly believe that your should build a very inefficient implementation first. O(n ^ 5) would be fine. Then when we have 20 running test queries, start on the second implementation, which will be efficient and perhaps also be able to run on unbounded input.
        Hide
        ransom Zhiqiang He added a comment -

        ok. I will try to develop it by regexp expression.
        I will create some sub task in this issuess.
        and the develop may be cost some times.
        I will update process in tasks issuess.

        Show
        ransom Zhiqiang He added a comment - ok. I will try to develop it by regexp expression. I will create some sub task in this issuess. and the develop may be cost some times. I will update process in tasks issuess.

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            julianhyde Julian Hyde
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development