Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-786

Detect if materialized view can be used to rewrite a query in non-trivial cases

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4.0-incubating, 1.3.0-incubating
    • Fix Version/s: 1.4.0-incubating
    • Component/s: core
    • Labels:
      None

      Description

      Improvement to detection if MV can be used to rewrite queries in non-trivial cases.

      Pasting the email conversation below that happened over this which briefly discusses the approach taken:

      ---------- Forwarded message ----------
      From: Amogh Margoor <amoghm@qubole.com>
      Date: Fri, Jun 26, 2015 at 10:11 AM
      Subject: Re: Detect if materialized view can be used to rewrite a query in non-trivial cases
      To: dev@calcite.incubator.apache.org, Rajat Venkatesh <rvenkatesh@qubole.com>

      Hi Julian,

      Thanks a lot Julian for your feedback. I have inlined my response below which also includes the commit done.

      Regards,
      Amogh

      On Fri, Jun 26, 2015 at 1:05 AM, Julian Hyde <jhyde@apache.org> wrote:

      This is great work. Certainly consistent with where I am heading.

      I would not be inclined to use DNF (because of its tendency to inflate
      certain predicates) but if you are able to get something effective I
      will happily use it. I think you should package it behind a method –
      "find out what is left to satisfy p when you have already satisfied q"
      or something – and write lots of tests of that method, and it doesn't
      really matter what algorithm is behind it.

      Take a look at SubstitutionVisitor.simplfy(RexNode) and how it focuses
      on finding whether

      p1 AND p2 AND p3 AND NOT (q1 AND q2 AND q3)

      is satisfiable.

      >> I saw this method. I will try to use this in improvements to follow.
      >> It didnot seem to solve this currently: (x>10 => x>30) i.e., find if
      >> NOT (NOT(x>10 ) OR x >30) is satisfiable.
      >> We have currently packaged it as "if X => Y" (see RexImplicationChecker
      >> in the commit I shared below), but agree it should be
      >> more generic like what you suggested above and something we can try to achieve.

      Later we will want to know not just "can I satisfy query Q using
      materialization M?" but "can I satisfy part of Q using M, and what is
      left over?". I can convert most of Q to use an aggregate table over
      years 2012 .. 2014 and 2015 Jan - May, and then scan the raw data for
      June 1st onwards, that is a big win.

      >> This certainly should be something we should aim at.

      What branch are you working on? Your master branch
      https://github.com/qubole/incubator-calcite/tree/master seems to be
      the same as apache/master right now.

      >> We work on https://github.com/qubole/incubator-calcite/tree/qds-1.3 .
      >> This is the commit: https://github.com/qubole/incubator-calcite/pull/1/files?diff=unified
      >> We are in the process of writing UTs for it. We did most of the testing through our client code till now.
      >> We have created new Visitor extending SustitutionVisitor because did not want to mess with the existing code.
      >> More rules need to be added to the new Visitor.
      >> Will raise a PR once UTs are added and testing is complete.

      If you can divide this work into pull requests with unit tests, I will
      happy commit each change as you make progress.

      By the way, I logged a few jira cases connected to materialized view
      rewrite today. They were motivated by the phoenix team wanting to use
      secondary indexes. But they could by applied to any scan-project-sort
      materialization. See

      >> Thanks for sharing this info Julian. Will definitely take a look.

      Julian

      On Thu, Jun 25, 2015 at 11:46 AM, Amogh Margoor <amoghm@qubole.com> wrote:
      > Hi,
      > We were working on a problem to detect if materialized view can be used to
      > rewrite a query in non-trivial cases. Will briefly describe the problem and
      > approach below and would appreciate feedback on the same.
      >
      > Motivation
      > ---------------
      > For instance there exists a table "logs" and a partition (materialized
      > view) named "log_after_01_Jan" created on it and described by SQL :
      > "Select * from logs where log_date > '01-01-2015' ".
      >
      > Assume that the table "log_after_01_Jan" is much smaller than table "logs".
      >
      > For user query:
      > "Select log_body from logs where log_date > '03-03-2015' and
      > char_length(log_body) < 100",
      > we should detect that the materialized view "log_after_01_Jan" can be used
      > and transform the query into:
      >
      > "Select log_body from log_after_01_Jan where log_date > '03-03-2015' and
      > char_length(log_body) < 100"
      >
      > Approach
      > --------------
      > One of the fundamental problems we would come across here is to check if a
      > boolean condition X implies (=>) Y. This quickly reduces to the
      > Satisfiability problem which is NP complete for propositional logic. But
      > there are many instances like above which can be detected easily. We have
      > implemented an approach to handle several useful cases for few operators
      > and types of operands. Will be extending it further for more types of
      > operations.
      >
      > Top Level approach:
      >
      > 1. Currently, VolcanoPlanner:useMaterialization tries to rewrite original
      > query using MV using SubstitutionVisitor. Have extended SubstitutionVisitor
      > to detect above cases and do the substitution.
      >
      > 2. To check if a condition X => Y,
      > a. Convert both of them into Disjunctive Normal Form.
      > Say X is transformed into x_1 or x_2 or x_3 ... or x_m and
      > Y is transformed into y_1 or y_2 ,... or y_i, where any x_i and y_i
      > are conjunctions of atomic predicates.
      > For instance condition "(a>10 or b>20) and c <90" will be converted
      > to DNF: (a>10 and c<90) or (b>20 and c<90).
      >
      > b. For X=>Y to be a tautology i.e., hold always true, every conjunction
      > x_i should imply atleast one of the conjunction y_j.
      > We wrote some set of simple heuristics to check if a conjunction of
      > atomic predicates implies another.
      > This also involves executing RexNode using RexImplExecutor.
      >
      > We have checked in code for this in our fork of
      > calcite(qubole/incubator-calcite). This is ongoing work and we will be
      > making many more improvements to it. If this is useful or anybody is
      > interested in giving feedback then I can share the commit so that we can
      > discuss about it and take it forward.
      >
      > Regards,
      > Amogh
      > Member of Technical Staff
      > Qubole

        Activity

        Hide
        amargoor Amogh Margoor added a comment - - edited

        Hi Julian Hyde,
        cc: Rajat Venkatesh
        Raised a Pull request for the same: https://github.com/apache/incubator-calcite/pull/102
        Can you review this one and see if this can be added to calcite ?

        Thanks,
        Amogh

        Show
        amargoor Amogh Margoor added a comment - - edited Hi Julian Hyde , cc: Rajat Venkatesh Raised a Pull request for the same: https://github.com/apache/incubator-calcite/pull/102 Can you review this one and see if this can be added to calcite ? Thanks, Amogh
        Hide
        julianhyde Julian Hyde added a comment -

        Yes, will do.

        Show
        julianhyde Julian Hyde added a comment - Yes, will do.
        Hide
        julianhyde Julian Hyde added a comment -

        I've reviewed, and added some comments on individual files. You should fix those, and make sure that the build runs clean (mvn clean && mvn install site).

        But the biggest question is: why is MaterializedViewSubstitutionVisitor separate from SubstitutionVisitor? I see you try SubstitutionVisitor, then try MaterializedViewSubstitutionVisitor. Is it possible to combine into one class, and make only one call to "go"? Then we'd have a matcher that was really powerful.

        Show
        julianhyde Julian Hyde added a comment - I've reviewed, and added some comments on individual files. You should fix those, and make sure that the build runs clean (mvn clean && mvn install site). But the biggest question is: why is MaterializedViewSubstitutionVisitor separate from SubstitutionVisitor? I see you try SubstitutionVisitor, then try MaterializedViewSubstitutionVisitor. Is it possible to combine into one class, and make only one call to "go"? Then we'd have a matcher that was really powerful.
        Hide
        amargoor Amogh Margoor added a comment -

        Thanks for the review. I will make the suggested changes, but build runs clean on my branch supportMV without any test failure as reported by you. Even checkstyle is always on for me, so was surprised to see some styling issues there. Am I missing something here ? Btw tried merging master of apache/incubator-calcite with my branch, and it was already up-to-date.

        Regarding your question on new MaterializedViewSubstituitonVisitor, to start with it wasn't there. But later figured out that new rule would be required and decided to extend SV rather than changing it for 2 reasons:
        1. SV was already very lengthy.
        2. Didn't want to change behavior of SubstitutionVisitor for it's clients (if there exists any), as was not sure if this kind of substitution is ok with all use cases. Moreover, if there exists any clients then they would also be isolated from bugs in new rule.
        However, plan was to register all the old rules in new visitor and call 'go' only on that and not have two 'go' methods though. We are making this change in next set of improvements.

        Show
        amargoor Amogh Margoor added a comment - Thanks for the review. I will make the suggested changes, but build runs clean on my branch supportMV without any test failure as reported by you. Even checkstyle is always on for me, so was surprised to see some styling issues there. Am I missing something here ? Btw tried merging master of apache/incubator-calcite with my branch, and it was already up-to-date. Regarding your question on new MaterializedViewSubstituitonVisitor, to start with it wasn't there. But later figured out that new rule would be required and decided to extend SV rather than changing it for 2 reasons: 1. SV was already very lengthy. 2. Didn't want to change behavior of SubstitutionVisitor for it's clients (if there exists any), as was not sure if this kind of substitution is ok with all use cases. Moreover, if there exists any clients then they would also be isolated from bugs in new rule. However, plan was to register all the old rules in new visitor and call 'go' only on that and not have two 'go' methods though. We are making this change in next set of improvements.
        Hide
        julianhyde Julian Hyde added a comment -

        Did you "mvn clean"? The code won't even compile due to the "package plan;" declaration in two files.

        Checkstyle does not catch all style issues. (The version we are using is not very good at checking indentation.)

        I know it is hard to combine the two visitors. But if we don't, we will have one visitor that matches patterns A, B, C and another that matches patterns X, Y, Z, and neither will handle a query that contains patterns A and Y. So it is worth doing, I think.

        Show
        julianhyde Julian Hyde added a comment - Did you "mvn clean"? The code won't even compile due to the "package plan;" declaration in two files. Checkstyle does not catch all style issues. (The version we are using is not very good at checking indentation.) I know it is hard to combine the two visitors. But if we don't, we will have one visitor that matches patterns A, B, C and another that matches patterns X, Y, Z, and neither will handle a query that contains patterns A and Y. So it is worth doing, I think.
        Hide
        julianhyde Julian Hyde added a comment -

        Regarding

        Didn't want to change behavior of SubstitutionVisitor for it's clients (if there exists any), as was not sure if this kind of substitution is ok with all use cases. Moreover, if there exists any clients then they would also be isolated from bugs in new rule.

        Don't worry about that. Any functionality required by clients has a unit test. If not, it's my fault for not writing enough tests, not your fault for re-organizing the code. This is a basic principle of our development process - no feature without a matching test. Refactor away.

        Show
        julianhyde Julian Hyde added a comment - Regarding Didn't want to change behavior of SubstitutionVisitor for it's clients (if there exists any), as was not sure if this kind of substitution is ok with all use cases. Moreover, if there exists any clients then they would also be isolated from bugs in new rule. Don't worry about that. Any functionality required by clients has a unit test. If not, it's my fault for not writing enough tests, not your fault for re-organizing the code. This is a basic principle of our development process - no feature without a matching test. Refactor away.
        Hide
        amargoor Amogh Margoor added a comment -

        Hi Julian,
        I have take care of all your comments.
        MaterializedViewSubstitutionVisitor has been added old rules of substitution, thus we can only use new visitor.

        However, we did not see any test failures (not just me but even others who are using this code in our fork). But if the test is still failing for you, then we can @Ignore it as it only depicts false positive which can be worked on and not the false negative.

        Thanks,
        Amogh

        Show
        amargoor Amogh Margoor added a comment - Hi Julian, I have take care of all your comments. MaterializedViewSubstitutionVisitor has been added old rules of substitution, thus we can only use new visitor. However, we did not see any test failures (not just me but even others who are using this code in our fork). But if the test is still failing for you, then we can @Ignore it as it only depicts false positive which can be worked on and not the false negative. Thanks, Amogh
        Hide
        julianhyde Julian Hyde added a comment -

        What branch are those changes on? I don't see any changes in https://github.com/apache/incubator-calcite/pull/102.

        Show
        julianhyde Julian Hyde added a comment - What branch are those changes on? I don't see any changes in https://github.com/apache/incubator-calcite/pull/102 .
        Hide
        amargoor Amogh Margoor added a comment -

        You can see the commit if you scroll down the PR, just above my last comment. This is the commit btw: https://github.com/amoghmargoor/incubator-calcite/commit/0ae833f978ae738386b9073cafc1d8d2d60461e2

        Show
        amargoor Amogh Margoor added a comment - You can see the commit if you scroll down the PR, just above my last comment. This is the commit btw: https://github.com/amoghmargoor/incubator-calcite/commit/0ae833f978ae738386b9073cafc1d8d2d60461e2
        Hide
        amargoor Amogh Margoor added a comment - - edited

        Hi Julian,

        Could you see the commit in PR: https://github.com/amoghmargoor/incubator-calcite/commit/0ae833f978ae738386b9073cafc1d8d2d60461e2 ?
        In PR it is shown just above my last comment. It is submitted to the same branch supportMV on which PR is raised.

        Regards,
        Amogh

        Show
        amargoor Amogh Margoor added a comment - - edited Hi Julian, Could you see the commit in PR: https://github.com/amoghmargoor/incubator-calcite/commit/0ae833f978ae738386b9073cafc1d8d2d60461e2 ? In PR it is shown just above my last comment. It is submitted to the same branch supportMV on which PR is raised. Regards, Amogh
        Hide
        julianhyde Julian Hyde added a comment -

        OK, that's much better. I've rebased onto master and have done some fix-ups & refactoring, in branch https://github.com/julianhyde/incubator-calcite/tree/786-mv.

        Can you please review. If you say it's OK I'll squash your commits and mine on that branch and merge to master.

        1. "(y = 10) implies (y < 30 AND x > 30)" really?

        2. testSimpleDate() still fails for me. Yes, really. Maybe it's my timezone (but I doubt it). Maybe it's the Java version I'm running (JDK 1.8.0_51). I disabled it.

        3. "mvn site" still failed for me. I fixed it. But you really can't put raw ">" in javadoc.

        Show
        julianhyde Julian Hyde added a comment - OK, that's much better. I've rebased onto master and have done some fix-ups & refactoring, in branch https://github.com/julianhyde/incubator-calcite/tree/786-mv . Can you please review. If you say it's OK I'll squash your commits and mine on that branch and merge to master. 1. "(y = 10) implies (y < 30 AND x > 30)" really? 2. testSimpleDate() still fails for me. Yes, really. Maybe it's my timezone (but I doubt it). Maybe it's the Java version I'm running (JDK 1.8.0_51). I disabled it. 3. "mvn site" still failed for me. I fixed it. But you really can't put raw ">" in javadoc.
        Hide
        amargoor Amogh Margoor added a comment - - edited

        Julian, Thanks a lot for fixup. It looks much better now and will serve as a template for me to make smoother submissions from next time. Just left couple of minor comments.

        >> "(y = 10) implies (y < 30 AND x > 30)" really?
        My mistake. Pointed it in the commit above. It was intended to be y=10 implies y<30 or x>30.

        >> testSimpleDate() still fails for me. Yes, really. Maybe it's my timezone (but I doubt it). Maybe it's the Java version I'm running (JDK 1.8.0_51). I disabled it.

        I use JDK 1.7 and my wild guess is that the date format being thrown by 1.8 is something which we don't handle. I will take this as an opportunity to ask few suggestions from you. The real pain point currently is that for executor (in RexImplicationchecker) to execute, we need to give it a data context which should contain values of particular Java type. Now this requires converting data from one type to another and this is the messy part of my code, as it has to handle many combinations of input datatype and expected datatype. Any suggestions on how to go about it especially for type Date as partitioning on column of type Date is pretty common and a very important use case ?

        >> "mvn site" still failed for me. I fixed it. But you really can't put raw ">" in javadoc.
        My apologies. Would be careful with that from next time.

        I am perfectly OK with you squash merging it and committing. Thanks a lot.

        Show
        amargoor Amogh Margoor added a comment - - edited Julian, Thanks a lot for fixup. It looks much better now and will serve as a template for me to make smoother submissions from next time. Just left couple of minor comments. >> "(y = 10) implies (y < 30 AND x > 30)" really? My mistake. Pointed it in the commit above. It was intended to be y=10 implies y<30 or x>30. >> testSimpleDate() still fails for me. Yes, really. Maybe it's my timezone (but I doubt it). Maybe it's the Java version I'm running (JDK 1.8.0_51). I disabled it. I use JDK 1.7 and my wild guess is that the date format being thrown by 1.8 is something which we don't handle. I will take this as an opportunity to ask few suggestions from you. The real pain point currently is that for executor (in RexImplicationchecker) to execute, we need to give it a data context which should contain values of particular Java type. Now this requires converting data from one type to another and this is the messy part of my code, as it has to handle many combinations of input datatype and expected datatype. Any suggestions on how to go about it especially for type Date as partitioning on column of type Date is pretty common and a very important use case ? >> "mvn site" still failed for me. I fixed it. But you really can't put raw ">" in javadoc. My apologies. Would be careful with that from next time. I am perfectly OK with you squash merging it and committing. Thanks a lot.
        Hide
        julianhyde Julian Hyde added a comment -

        The real pain point currently is that for executor (in RexImplicationchecker) to execute, we need to give it a data context which should contain values of particular Java type. Now this requires converting data from one type to another and this is the messy part of my code, as it has to handle many combinations of input datatype and expected datatype. Any suggestions on how to go about it especially for type Date as partitioning on column of type Date is pretty common and a very important use case ?

        Since you're dealing with RexNodes, the constants will be of the types used internally by RexLiteral: BigDecimal for numbers, Calendar for date-times, etc. So you just need to handle those types.

        Also, rely on Executor as much as you can. It is tasked with implementing SQL semantics (3 valued boolean logic, conversions using CAST, implicit rounding and truncation and all that hard stuff). If you don't trust Executor, write a test. I've just added RexExecutorTest.checkConstant so you can easily check that say, DATE '2015-07-15' < DATE '2015-07-15' reduces to FALSE.

        I will squash and commit shortly, and then I will mark this issue closed. Please start developing on top of that commit as soon as it is checked in.

        Show
        julianhyde Julian Hyde added a comment - The real pain point currently is that for executor (in RexImplicationchecker) to execute, we need to give it a data context which should contain values of particular Java type. Now this requires converting data from one type to another and this is the messy part of my code, as it has to handle many combinations of input datatype and expected datatype. Any suggestions on how to go about it especially for type Date as partitioning on column of type Date is pretty common and a very important use case ? Since you're dealing with RexNodes, the constants will be of the types used internally by RexLiteral: BigDecimal for numbers, Calendar for date-times, etc. So you just need to handle those types. Also, rely on Executor as much as you can. It is tasked with implementing SQL semantics (3 valued boolean logic, conversions using CAST, implicit rounding and truncation and all that hard stuff). If you don't trust Executor, write a test. I've just added RexExecutorTest.checkConstant so you can easily check that say, DATE '2015-07-15' < DATE '2015-07-15' reduces to FALSE. I will squash and commit shortly, and then I will mark this issue closed. Please start developing on top of that commit as soon as it is checked in.
        Hide
        julianhyde Julian Hyde added a comment -
        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/incubator-calcite/commit/4a9b1939 . Thanks for the patch!
        Hide
        jnadeau Jacques Nadeau added a comment -

        Resolved in release 1.4.0-incubating (2015-08-23)

        Show
        jnadeau Jacques Nadeau added a comment - Resolved in release 1.4.0-incubating (2015-08-23)

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            amargoor Amogh Margoor
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development