Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1753

Push expressions into null-generating side of a join if they are "strong" (null-preserving)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13.0
    • Component/s: core
    • Labels:
      None

      Description

      ProjectJoinTransposeRule with preserveExprCondition pushes projects below outer-joins.

      I am pushing case statements past joins using ProjectJoinTransposeRule. For inner joins, the current behavior seems fine, but outer joins can lead to weird behavior, where the project is below the join and nulls can cause issues.

      select 
      	count(*), case when t3.a3 is not null then t3.a3 else 100 end
      from 
      	t1 left outer join t3 on t1.c1 = t3.c3
      group by
              case when t3.a3 is not null then t3.a3 else 100 end
      order by
              case when t3.a3 is not null then t3.a3 else 100 end
      

      Currently, ProjectJoinTransposeRule will push the case statement below the join as below. But, this case statement shouldn't be pushed. The query shouldn't return null but instead as 100 for any "unmatched" join condition since it is a left outer join with a case statement. But, the current plan would not prevent that.

      LogicalProject with case statement
         LogicalJoin
            LogicalTableScan(table=[t1])
            LogicalTableScan(table=[t3])
      
      LogicalProject 
         LogicalJoin
            LogicalProject with case statement
              LogicalTableScan(table=[t1])
            LogicalTableScan(table=[t3])
      

        Activity

        Hide
        minjikim MinJi Kim added a comment -

        I debated between adding another preserveExprCondition to PushProjector that would apply to outer join's nullable side. I couldn't think of a case where that would be useful (at least, use cases that I could think of, I wouldn't want to push any expressions to the nullable side of an outer join). Let me know if you think it would be. I will update the pull request.

        https://github.com/apache/calcite/pull/425

        Show
        minjikim MinJi Kim added a comment - I debated between adding another preserveExprCondition to PushProjector that would apply to outer join's nullable side. I couldn't think of a case where that would be useful (at least, use cases that I could think of, I wouldn't want to push any expressions to the nullable side of an outer join). Let me know if you think it would be. I will update the pull request. https://github.com/apache/calcite/pull/425
        Hide
        jni Jinfeng Ni added a comment -

        Seems to me that the reason we could not push the expressions below right table in left outer join is because those expressions do not have "null-if-null" properties. For "null-if-null" expressions, it seems to be safe to push down them below left outer join.

        Whether it's performance beneficial to push down the expression for a left outer join depends on whether left table has more rows than right table. In case left table has more rows, push down the expression to right side would means the expression would be evaluated fewer times, which would lead to some performance benefits.

        Show
        jni Jinfeng Ni added a comment - Seems to me that the reason we could not push the expressions below right table in left outer join is because those expressions do not have "null-if-null" properties. For "null-if-null" expressions, it seems to be safe to push down them below left outer join. Whether it's performance beneficial to push down the expression for a left outer join depends on whether left table has more rows than right table. In case left table has more rows, push down the expression to right side would means the expression would be evaluated fewer times, which would lead to some performance benefits.
        Hide
        minjikim MinJi Kim added a comment -

        I agree with what you said, Jinfeng Ni. Do you have a suggestion as to how we would we distinguish that for the operators? I suggested having another preserveExprCondition to allow the user of the rule to specify which operations they would want to push past outer joins, but that seems to be shifting the work to the user vs. figuring it out automatically.

        Show
        minjikim MinJi Kim added a comment - I agree with what you said, Jinfeng Ni . Do you have a suggestion as to how we would we distinguish that for the operators? I suggested having another preserveExprCondition to allow the user of the rule to specify which operations they would want to push past outer joins, but that seems to be shifting the work to the user vs. figuring it out automatically.
        Hide
        julianhyde Julian Hyde added a comment -

        Jinfeng Ni is correct about the "null-if-null" behavior. I call a predicate "strong" if it is null if any of its inputs is null, and we have some utility methods in https://calcite.apache.org/apidocs/org/apache/calcite/plan/Strong.html.

        Rather than making this change specifically in terms of CASE, could you make this in terms of strong-ness? And rather than passing join type, maybe you could pass an ImmutableBitSet of columns it needs to be a strong in?

        In your example, case when t3.a3 is not null then t3.a3 else 100 end is not strong in t3.a3. If t3.a3 is null, the expression will not necessarily be null.

        Let's check whether it is "if", "only if" or "if and only if". I believe we need "if", and let's prove it by considering an expression which is "if" but not "only if": "nullif(t3.a3, t3.a4)". I believe we can safely push this down. Jesus Camacho Rodriguez and Jinfeng Ni, can you check my reasoning?

        Show
        julianhyde Julian Hyde added a comment - Jinfeng Ni is correct about the "null-if-null" behavior. I call a predicate "strong" if it is null if any of its inputs is null, and we have some utility methods in https://calcite.apache.org/apidocs/org/apache/calcite/plan/Strong.html . Rather than making this change specifically in terms of CASE, could you make this in terms of strong-ness? And rather than passing join type, maybe you could pass an ImmutableBitSet of columns it needs to be a strong in? In your example, case when t3.a3 is not null then t3.a3 else 100 end is not strong in t3.a3. If t3.a3 is null, the expression will not necessarily be null. Let's check whether it is "if", "only if" or "if and only if". I believe we need "if", and let's prove it by considering an expression which is "if" but not "only if": "nullif(t3.a3, t3.a4)". I believe we can safely push this down. Jesus Camacho Rodriguez and Jinfeng Ni , can you check my reasoning?
        Hide
        minjikim MinJi Kim added a comment -

        Julian Hyde Thanks. Strong looks like what I was looking for! I will work on updating the patch.

        Show
        minjikim MinJi Kim added a comment - Julian Hyde Thanks. Strong looks like what I was looking for! I will work on updating the patch.
        Hide
        julianhyde Julian Hyde added a comment -

        Great. Please also extend RexProgramTest.testStrong to make sure Strong covers your use case.

        Show
        julianhyde Julian Hyde added a comment - Great. Please also extend RexProgramTest.testStrong to make sure Strong covers your use case.
        Hide
        jni Jinfeng Ni added a comment -

        Agreed with Julian Hyde that we only need "if" to push down, and we do not need "only if". We could push nullif(t3.a3, t3.a4. If any one of a3/a4 is null, nullif(t3.a3, t3.a4) --> t3.a3.

        In the example, case when t3.a3 is not null then t3.a3 else 100 end does not satisfy "if" requirement : when the input t3.a3 is null, it's output is not null.

        Another thing is that "Strong-ness" check seems to cover build-in SQL functions. In case people uses UDF (sqlkind = OTHER_FUNCTION), there is no way to specify which UDF has the "null-if-null" behavior. I assume we could cover those cases in future work.

        Show
        jni Jinfeng Ni added a comment - Agreed with Julian Hyde that we only need "if" to push down, and we do not need "only if". We could push nullif(t3.a3, t3.a4 . If any one of a3/a4 is null, nullif(t3.a3, t3.a4) --> t3.a3 . In the example, case when t3.a3 is not null then t3.a3 else 100 end does not satisfy "if" requirement : when the input t3.a3 is null, it's output is not null. Another thing is that "Strong-ness" check seems to cover build-in SQL functions. In case people uses UDF (sqlkind = OTHER_FUNCTION), there is no way to specify which UDF has the "null-if-null" behavior. I assume we could cover those cases in future work.
        Hide
        julianhyde Julian Hyde added a comment -

        We could add an annotation (or other metadata) to UDFs specifying how they treat null values.

        The majority of SQL functions return null if any of their arguments are null (and aggregate functions are not even invoked if argument is null) so I think that should be the default.

        There's an additional property which is whether they ONLY return null if their inputs are null. Nullif is a counter-example: for example Nullif(5, 5) evaluates to null.

        Show
        julianhyde Julian Hyde added a comment - We could add an annotation (or other metadata) to UDFs specifying how they treat null values. The majority of SQL functions return null if any of their arguments are null (and aggregate functions are not even invoked if argument is null) so I think that should be the default. There's an additional property which is whether they ONLY return null if their inputs are null. Nullif is a counter-example: for example Nullif(5, 5) evaluates to null.
        Hide
        minjikim MinJi Kim added a comment -

        I updated the PR with some changes. Please take a look at let me know if this is in the right direction! Thanks!

        https://github.com/apache/calcite/pull/425/

        Show
        minjikim MinJi Kim added a comment - I updated the PR with some changes. Please take a look at let me know if this is in the right direction! Thanks! https://github.com/apache/calcite/pull/425/
        Hide
        julianhyde Julian Hyde added a comment -

        Good catch on OR and AND not being strong, because OR(false, null) returns false.

        Do you think we could broaden ExprCondition.CASE to allow any expression? The rule will ensure that it is Strong. If so, we should rename CASE to TRUE.

        (I don't remember why we introduced ExprCondition, but in retrospect it was probably mostly that we didn't handle strong expressions properly. So, we could go even further and use ExprCondition.TRUE in the default rule.)

        In ProjectJoinTransposseRule#onMatch you can use call.builder() rather than relBuilderFactory.create(origProj.getCluster(), null).

        Show
        julianhyde Julian Hyde added a comment - Good catch on OR and AND not being strong, because OR(false, null) returns false . Do you think we could broaden ExprCondition.CASE to allow any expression? The rule will ensure that it is Strong. If so, we should rename CASE to TRUE . (I don't remember why we introduced ExprCondition , but in retrospect it was probably mostly that we didn't handle strong expressions properly. So, we could go even further and use ExprCondition.TRUE in the default rule.) In ProjectJoinTransposseRule#onMatch you can use call.builder() rather than relBuilderFactory.create(origProj.getCluster(), null) .
        Hide
        minjikim MinJi Kim added a comment - - edited

        Julian Hyde Sure thing. Will make the changes. I can test and see if we can change ExprCondition.TRUE as the default. I will add some more tests in that case as well.

        Show
        minjikim MinJi Kim added a comment - - edited Julian Hyde Sure thing. Will make the changes. I can test and see if we can change ExprCondition.TRUE as the default. I will add some more tests in that case as well.
        Hide
        minjikim MinJi Kim added a comment -

        Julian Hyde and Jinfeng Ni I updated the pull request. Please take a look.

        https://github.com/apache/calcite/pull/425/

        I noticed that ProjectFilterTransposeRule and ProjectSetOpTransposeRule also use PushProjector. I thought ProjectFilterTransposeRule seemed fine to change it to use ExprCondition.TRUE, but thought SetOp might need a little bit more thought. So, I left both of them as is. Let me know. While the patch is in review, I will continue work on making the other two rules to default to TRUE, and add more tests if I think that makes sense. Thanks!

        Show
        minjikim MinJi Kim added a comment - Julian Hyde and Jinfeng Ni I updated the pull request. Please take a look. https://github.com/apache/calcite/pull/425/ I noticed that ProjectFilterTransposeRule and ProjectSetOpTransposeRule also use PushProjector. I thought ProjectFilterTransposeRule seemed fine to change it to use ExprCondition.TRUE, but thought SetOp might need a little bit more thought. So, I left both of them as is. Let me know. While the patch is in review, I will continue work on making the other two rules to default to TRUE, and add more tests if I think that makes sense. Thanks!
        Hide
        minjikim MinJi Kim added a comment -

        It looks like I spoke too soon. I tried enabling ProjectFilterTransposeRule to use ExprCondition.TRUE, but that seem to cause a lot of issues. For one, pushing more general expressions down results in these weird plans where the filter isn't evaluating anything and pushes it to the project below to evaluate the filter condition and the filter just checks the condition to be true/false. I didn't think it was buying much, since pushing down the input reference reduces the number of columns selected and is useful, but pushing the filter condition evaluation didn't seem to be as useful in my mind. I think for now, I would be happy to keep it as it is. Do you agree?

              LogicalFilter(condition=[$3])
                LogicalProject(ENAME=[$1], SAL=[$5], DEPTNO=[$7], >=[>($5, 5000)])
        
        Show
        minjikim MinJi Kim added a comment - It looks like I spoke too soon. I tried enabling ProjectFilterTransposeRule to use ExprCondition.TRUE, but that seem to cause a lot of issues. For one, pushing more general expressions down results in these weird plans where the filter isn't evaluating anything and pushes it to the project below to evaluate the filter condition and the filter just checks the condition to be true/false. I didn't think it was buying much, since pushing down the input reference reduces the number of columns selected and is useful, but pushing the filter condition evaluation didn't seem to be as useful in my mind. I think for now, I would be happy to keep it as it is. Do you agree? LogicalFilter(condition=[$3]) LogicalProject(ENAME=[$1], SAL=[$5], DEPTNO=[$7], >=[>($5, 5000)])
        Hide
        julianhyde Julian Hyde added a comment -

        Yes, fine to keep it as is. Thanks for investigating.

        Show
        julianhyde Julian Hyde added a comment - Yes, fine to keep it as is. Thanks for investigating.
        Hide
        julianhyde Julian Hyde added a comment -

        MinJi Kim, Is your PR ready for review & commit?

        Show
        julianhyde Julian Hyde added a comment - MinJi Kim , Is your PR ready for review & commit?
        Hide
        minjikim MinJi Kim added a comment -

        Julian Hyde Yes it is! Sorry, if I wasn't clear earlier. Thanks for reviewing it!

        Show
        minjikim MinJi Kim added a comment - Julian Hyde Yes it is! Sorry, if I wasn't clear earlier. Thanks for reviewing it!
        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/2ce41591 .
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Resolved in release 1.13.0 (2017-06-26).

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.13.0 (2017-06-26).

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            minjikim MinJi Kim
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development