Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1495

SemiJoinRule should not apply to RIGHT and FULL JOIN

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11.0
    • Component/s: core
    • Labels:
      None

      Description

      SemiJoinRule should not apply to RIGHT and FULL JOIN; and it should strip LEFT JOIN once it has disovered that the rows from the right input are unique.

      Original description:

      For IN and EXISTS sub-queries Calcite currently generates plan consisting of GROUP BY on inner table followed by INNER JOIN with outer table. Such cases could be converted using this rule to use SEMI-JOIN to make it more efficient.

        Issue Links

          Activity

          Hide
          julianhyde Julian Hyde added a comment -

          We already have SemiJoinRule.

          I'm a bit skeptical about this one. For a few reasons:

          • To apply the rule, you need to know that none of the columns on the RHS are being used.
          • SemiJoin is a second-class citizen. There are many fewer rules for it than for Join. There are occasions when we want to treat SemiJoin as a kind of Join (e.g. heuristic join re-ordering)

          I suppose that if none of the columns on the RHS are used, we could push that Project through the join and join the LHS to a 0-column RHS (or a k-column RHS, where k is the cardinality of the join key, and decide not to project the join columns).

          We shouldn't look for an Aggregate specifically; we should look for a RHS that is unique (perhaps by means of Aggregate, perhaps by other means).

          Can you give an example of an optimization that this rule makes possible?

          Show
          julianhyde Julian Hyde added a comment - We already have SemiJoinRule. I'm a bit skeptical about this one. For a few reasons: To apply the rule, you need to know that none of the columns on the RHS are being used. SemiJoin is a second-class citizen. There are many fewer rules for it than for Join. There are occasions when we want to treat SemiJoin as a kind of Join (e.g. heuristic join re-ordering) I suppose that if none of the columns on the RHS are used, we could push that Project through the join and join the LHS to a 0-column RHS (or a k-column RHS, where k is the cardinality of the join key, and decide not to project the join columns). We shouldn't look for an Aggregate specifically; we should look for a RHS that is unique (perhaps by means of Aggregate, perhaps by other means). Can you give an example of an optimization that this rule makes possible?
          Hide
          vgarg Vineet Garg added a comment -

          Thanks Julian. I did not know that Calcite already has a SemiJoinRule. Does this Rule already takes your point #1 (none of the columns on the RHS are being used) into consideration or does the caller need to make sure of that ?

          Can you give an example of an optimization that this rule makes possible?

          The reason I suggested this rule is because HIVE supports LEFT SEMI JOIN natively. Doing LEFT SEMI JOIN only instead of doing GROUP BY + INNER JOIN should be more efficient in this case. I am not aware of any logical optimization this rule makes possible.

          Show
          vgarg Vineet Garg added a comment - Thanks Julian. I did not know that Calcite already has a SemiJoinRule. Does this Rule already takes your point #1 (none of the columns on the RHS are being used) into consideration or does the caller need to make sure of that ? Can you give an example of an optimization that this rule makes possible? The reason I suggested this rule is because HIVE supports LEFT SEMI JOIN natively. Doing LEFT SEMI JOIN only instead of doing GROUP BY + INNER JOIN should be more efficient in this case. I am not aware of any logical optimization this rule makes possible.
          Hide
          julianhyde Julian Hyde added a comment -

          You can have a physical semi-join operator without having a logical semi-join operator.

          Show
          julianhyde Julian Hyde added a comment - You can have a physical semi-join operator without having a logical semi-join operator.
          Hide
          julianhyde Julian Hyde added a comment -

          Yes, SemiJoinRule ensures that no columns from the RHS are used.

          One improvement: Currently it looks for an Aggregate as its input. It would be better if it looked for an input that is unique on the join key.

          Show
          julianhyde Julian Hyde added a comment - Yes, SemiJoinRule ensures that no columns from the RHS are used. One improvement: Currently it looks for an Aggregate as its input. It would be better if it looked for an input that is unique on the join key.
          Hide
          vgarg Vineet Garg added a comment -

          Thanks Julian. I think HIVE can use this existing rule. Should I convert this to an improvement JIRA to enhance SemiJoinRule ? ?

          Show
          vgarg Vineet Garg added a comment - Thanks Julian. I think HIVE can use this existing rule. Should I convert this to an improvement JIRA to enhance SemiJoinRule ? ?
          Hide
          julianhyde Julian Hyde added a comment -

          Maybe... but I think you should try the existing rule in Hive first. Maybe it does everything you need already. I'd start with some particular queries and set a goal of generating a particular plan. If there are deficiencies, sure, log them here.

          Show
          julianhyde Julian Hyde added a comment - Maybe... but I think you should try the existing rule in Hive first. Maybe it does everything you need already. I'd start with some particular queries and set a goal of generating a particular plan. If there are deficiencies, sure, log them here.
          Hide
          vgarg Vineet Garg added a comment -

          Sounds good. Thank for the help

          Show
          vgarg Vineet Garg added a comment - Sounds good. Thank for the help
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          This rule should also check for join type and should allow conversion only for inner join. Currently, it can convert an outer join which results in wrong result set.
          Also, rule needed to be updated such that it makes use of RelBuilder to allow generation of custom rel nodes.

          Show
          ashutoshc Ashutosh Chauhan added a comment - This rule should also check for join type and should allow conversion only for inner join. Currently, it can convert an outer join which results in wrong result set. Also, rule needed to be updated such that it makes use of RelBuilder to allow generation of custom rel nodes.
          Hide
          julianhyde Julian Hyde added a comment -

          This rule should also check for join type and should allow conversion only for inner join. Currently, it can convert an outer join which results in wrong result set.

          I agree. But suppose the rule matches a LEFT join whose right-hand columns are not used, and whose right input is unique (i.e. returns at most one row per key). That is a join you can remove, and replace by the left input. The rule might as well do that.

          Show
          julianhyde Julian Hyde added a comment - This rule should also check for join type and should allow conversion only for inner join. Currently, it can convert an outer join which results in wrong result set. I agree. But suppose the rule matches a LEFT join whose right-hand columns are not used, and whose right input is unique (i.e. returns at most one row per key). That is a join you can remove, and replace by the left input. The rule might as well do that.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Julian Hyde Until enhancement you suggested is implemented, short term this should disallow all kind of outer joins, since rule in its current form is unsafe.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Julian Hyde Until enhancement you suggested is implemented, short term this should disallow all kind of outer joins, since rule in its current form is unsafe.
          Hide
          julianhyde Julian Hyde added a comment -

          Changing the subject to refer to the incorrect behavior with LEFT, RIGHT, FULL outer. Creating SemiJoinRule is covered by CALCITE-330.

          Show
          julianhyde Julian Hyde added a comment - Changing the subject to refer to the incorrect behavior with LEFT, RIGHT, FULL outer. Creating SemiJoinRule is covered by CALCITE-330 .
          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/565d6392 .
          Hide
          julianhyde Julian Hyde added a comment -

          Resolved in release 1.11.0 (2017-01-11).

          Show
          julianhyde Julian Hyde added a comment - Resolved in release 1.11.0 (2017-01-11).

            People

            • Assignee:
              julianhyde Julian Hyde
              Reporter:
              vgarg Vineet Garg
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development