Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-850

Remove push down expressions from FilterJoinRule and create a new rule for it

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0-incubating
    • Component/s: core
    • Labels:
      None

      Description

      CALCITE-457 added pushing expressions in join conditions into projects below the join in the FilterJoinRule, so the expression would be computed beforehand and not in the join predicate.

      While this can be an interesting feature for some projects using Calcite, it is a different functionality and it should be a standalone independent rule. For instance, in Hive we do not want to enable it at the moment, as it causes some performance regressions in many test cases.

        Issue Links

          Activity

          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          I have created a pull request in https://github.com/apache/incubator-calcite/pull/122 . Changes in plans are swapping of join inputs.

          Julian Hyde, Jinfeng Ni, Vladimir Sitnikov, you all seemed to be involved in CALCITE-457, any thoughts? Thanks

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - I have created a pull request in https://github.com/apache/incubator-calcite/pull/122 . Changes in plans are swapping of join inputs. Julian Hyde , Jinfeng Ni , Vladimir Sitnikov , you all seemed to be involved in CALCITE-457 , any thoughts? Thanks
          Hide
          jni Jinfeng Ni added a comment -

          I looked at the code change. I think it makes sense to move the logic of pushing join expression into a separate rule, and it's up to each system to decide whether turn on/off such rule in their planner. The code change looks fine to me ( one minor comment).

          I'm a bit surprised that it caused performance regression in Hive by pushing expression into project below the join, though. I guess under two scenarios such push down would cause performance overhead:
          1 ) The join condition itself does not have filtering, or very less filtering. As such, it does not matter much whether the filter is applied in join operator, or in the filter operator after join.
          2) the join condition evaluation applies short-circuit evaluation optimization. As such, it might be possible to skip some expensive expression. In contrast, if we push down the expression, we will end up with evaluating every expression always.

          I guess such scenarios probably be reflected in the costing; it's up to the costing to decide which way to go, while the rule's job is to enumerate the possible different choices.

          Also, if the query' join is ANSI-sql style; join condition is in "ON" clause, then Calcite will do such pushdown in SqlToRelConverter always, before the opt phases kicks in.

          Show
          jni Jinfeng Ni added a comment - I looked at the code change. I think it makes sense to move the logic of pushing join expression into a separate rule, and it's up to each system to decide whether turn on/off such rule in their planner. The code change looks fine to me ( one minor comment). I'm a bit surprised that it caused performance regression in Hive by pushing expression into project below the join, though. I guess under two scenarios such push down would cause performance overhead: 1 ) The join condition itself does not have filtering, or very less filtering. As such, it does not matter much whether the filter is applied in join operator, or in the filter operator after join. 2) the join condition evaluation applies short-circuit evaluation optimization. As such, it might be possible to skip some expensive expression. In contrast, if we push down the expression, we will end up with evaluating every expression always. I guess such scenarios probably be reflected in the costing; it's up to the costing to decide which way to go, while the rule's job is to enumerate the possible different choices. Also, if the query' join is ANSI-sql style; join condition is in "ON" clause, then Calcite will do such pushdown in SqlToRelConverter always, before the opt phases kicks in.
          Hide
          jni Jinfeng Ni added a comment -

          I applied your commit, and used the new rule in Drill and run couple of unit testcases. All run fine. I'll run more. But for now, the new rule seems to be good.

          Show
          jni Jinfeng Ni added a comment - I applied your commit, and used the new rule in Drill and run couple of unit testcases. All run fine. I'll run more. But for now, the new rule seems to be good.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Thanks for the feedback Jinfeng Ni. Please, let me know when you finish running those additional tests, so I can push the changes to master; I'd like this to be included in 1.4.

          Concerning your comments about the regressions: I agree that pushing expressions on both sides of a join shouldn't create performance regressions in most cases. I also agree, that ultimately it should be a cost-based decision.
          Hive is a special case... I'll try to keep it short.
          Currently Hive rewrites the plan returned by Calcite into a HiveQL query, that is in turn parsed again, physically optimized, and executed by the Hive engine.
          As we do not do e.g. algorithm selection in Calcite and reflect this directly by translating the Calcite operators into Hive operators, even minimal plan changes can have a huge impact for our logic. The reason is that some physical optimizations might not kick in if they don't recognize a given plan pattern (e.g. transformation of reduce side joins into map joins, etc.).
          We have been working on closing the gap between Calcite and Hive through operator-to-operator translation (umbrella JIRA is HIVE-9132), which would definitively solve this kind of issues.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Thanks for the feedback Jinfeng Ni . Please, let me know when you finish running those additional tests, so I can push the changes to master; I'd like this to be included in 1.4. Concerning your comments about the regressions: I agree that pushing expressions on both sides of a join shouldn't create performance regressions in most cases. I also agree, that ultimately it should be a cost-based decision. Hive is a special case... I'll try to keep it short. Currently Hive rewrites the plan returned by Calcite into a HiveQL query, that is in turn parsed again, physically optimized, and executed by the Hive engine. As we do not do e.g. algorithm selection in Calcite and reflect this directly by translating the Calcite operators into Hive operators, even minimal plan changes can have a huge impact for our logic. The reason is that some physical optimizations might not kick in if they don't recognize a given plan pattern (e.g. transformation of reduce side joins into map joins, etc.). We have been working on closing the gap between Calcite and Hive through operator-to-operator translation (umbrella JIRA is HIVE-9132 ), which would definitively solve this kind of issues.
          Hide
          jni Jinfeng Ni added a comment -

          +1.

          I run the regression suite in Drill, and did not see any issue.

          Show
          jni Jinfeng Ni added a comment - +1. I run the regression suite in Drill, and did not see any issue.
          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/incubator-calcite/commit/9f1f73d3 Thanks for the feedback Jinfeng Ni
          Hide
          jnadeau Jacques Nadeau added a comment -

          Resolved in release 1.4.0-incubating (2015-08-23)

          Show
          jnadeau Jacques Nadeau added a comment - Resolved in release 1.4.0-incubating (2015-08-23)

            People

            • Assignee:
              jcamachorodriguez Jesus Camacho Rodriguez
              Reporter:
              jcamachorodriguez Jesus Camacho Rodriguez
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development