[CALCITE-3890] Derive IS NOT NULL filter for the inputs of inner join - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.31.0
Component/s: core
Labels:
- pull-request-available

Description

We can infer IS NOT NULL predicates from an inner join whose condition implies that some columns cannot be null. For instance, given

select * from a join b on a.id = b.id;

we can infer that a.id is not null and b.id is not null, and push these predicates down into the inputs of the join. The query then becomes

select * from (select * from a where id is not null) t1 join (select * from b where id is not null) t2 on t1.id = t2.id;
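
The equivalence claimed in the description can be checked on a small data set. The following is a minimal sketch using Python's built-in sqlite3 module rather than Calcite; the extra columns (name, tag) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database with NULL ids on both sides of the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, tag TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (NULL, 'a-null');
    INSERT INTO b VALUES (1, 'b1'), (3, 'b3'), (NULL, 'b-null');
""")

# Original inner join: a NULL id never satisfies a.id = b.id.
original = conn.execute("""
    SELECT a.id, a.name, b.id, b.tag
    FROM a JOIN b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

# Rewritten join: IS NOT NULL filters pushed below the join.
rewritten = conn.execute("""
    SELECT t1.id, t1.name, t2.id, t2.tag
    FROM (SELECT * FROM a WHERE id IS NOT NULL) t1
    JOIN (SELECT * FROM b WHERE id IS NOT NULL) t2
      ON t1.id = t2.id
    ORDER BY t1.id
""").fetchall()

print(original == rewritten)
```

Both queries return the same rows because rows with a NULL id can never match under an equality condition, so removing them before the join only reduces the work the join has to do.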

Issue Links

relates to

CALCITE-6363 Introduce a rule to derive more filters from inner join condition

Open

HIVE-26427 Unify JoinDeriveIsNotNullFilterRule with HiveJoinAddNotNullRule

Open

links to

GitHub Pull Request #2800

Activity

Julian Hyde added a comment - 01/Apr/20 05:25

I agree. And we can also push down filters: ‘id is not null’ to both inputs, in this case.

Hopefully we can use existing logic, e.g. class Strong, for this deduction.

Zoltan Haindrich added a comment - 01/Apr/20 10:29

Chunwei Lei: in Hive we have a rule which somewhat does this.
I was sure that a Calcite rule adds all the is not null conditions (and it is)...but apparently it was not contributed back, and it's still only available inside Hive.
It might be worth taking a look at it: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java

Vineet Garg added a comment - 02/Apr/20 00:14

As kgyrtkirk pointed out, we have been doing this in Hive for a while and it is a very useful rule. I will be more than glad to work on contributing this code to Calcite.

Chunwei Lei added a comment - 02/Apr/20 02:39 - edited

vgarg, you are welcome to contribute. I assigned the issue to you.

Chunwei Lei added a comment - 11/Apr/20 02:33

vgarg, kgyrtkirk: Instead of using a rule to add the IS NOT NULL predicate, which is the way it is done in Hive, I am wondering if we can do it when creating the JOIN operator. Maybe we can add a configuration option to RelBuilder that indicates whether to add the IS NOT NULL predicate when it creates the JOIN operator.

Julian Hyde added a comment - 11/Apr/20 05:41

I'm worried about adding a configuration parameter. We don't want to end up with hundreds of them. If it makes sense, let's just do it.

Chunwei Lei added a comment - 24/Apr/22 02:38 - edited

vgarg, any progress on this work? I would like to take over if you do not have time.

Chunwei Lei added a comment - 10/May/22 02:52

I opened a PR for this feature: https://github.com/apache/calcite/pull/2800. It would be great if someone can review it.

Julian Hyde added a comment - 10/May/22 17:33

The description does not say that the filter is a relational expression, nor does it say that the filter is before the join rather than after.

I understand why FULL join is not covered, but could you apply the rule to the non-NULL-generating sides of LEFT and RIGHT join? E.g.

Emp e LEFT JOIN Dept d USING (deptno)

becomes

(SELECT * FROM Emp WHERE deptno IS NOT NULL) LEFT JOIN Dept d USING (deptno)

Were you able to use Strong as I suggested? It should be easy to cover cases such as

Emp e JOIN Dept d ON e.deptno > d.deptno

or

Emp e JOIN Dept d ON e.deptno + d.deptno < 10
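
The kind of deduction asked for here can be illustrated with a toy analysis. The sketch below is plain Python and is not Calcite's Strong class or its API; the tuple-based expression encoding, the STRICT_OPS set, and both function names are assumptions invented for this illustration. It reports the columns that, when NULL, guarantee the join condition is not TRUE, which are the columns that can safely receive an IS NOT NULL filter in an inner join.

```python
# Operators assumed strict: they yield NULL whenever any operand is NULL.
STRICT_OPS = {"=", "<>", "<", "<=", ">", ">=", "+", "-", "*", "/"}


def null_if_null(expr):
    """Columns that, when NULL, force expr itself to evaluate to NULL."""
    if isinstance(expr, str):  # a string is a column reference, e.g. "e.deptno"
        return frozenset([expr])
    if isinstance(expr, tuple) and expr[0] in STRICT_OPS:
        return frozenset().union(*(null_if_null(arg) for arg in expr[1:]))
    return frozenset()  # literals and null-tolerant operators


def not_true_if_null(expr):
    """Columns that, when NULL, guarantee the condition is not TRUE."""
    if isinstance(expr, tuple) and expr[0] == "AND":
        # Any conjunct that is not TRUE makes the whole AND not TRUE.
        return frozenset().union(*(not_true_if_null(arg) for arg in expr[1:]))
    if isinstance(expr, tuple) and expr[0] == "OR":
        # Every disjunct must be not TRUE, so intersect the column sets.
        parts = [not_true_if_null(arg) for arg in expr[1:]]
        return frozenset.intersection(*parts) if parts else frozenset()
    return null_if_null(expr)


greater_than = (">", "e.deptno", "d.deptno")
sum_less_than = ("<", ("+", "e.deptno", "d.deptno"), 10)
print(sorted(not_true_if_null(greater_than)))
print(sorted(not_true_if_null(sum_less_than)))
```

For both conditions the sketch reports e.deptno and d.deptno, since comparison and arithmetic operators return NULL when an operand is NULL. An operator that is not in STRICT_OPS contributes no columns, which is the conservative choice.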

Chunwei Lei added a comment - 11/May/22 01:20

Emp e LEFT JOIN Dept d USING (deptno)

becomes

(SELECT * FROM Emp WHERE deptno IS NOT NULL) LEFT JOIN Dept d USING (deptno)

AFAIK, this transformation is wrong. For a left/right join, the non-NULL-generating side may contain null values, and those rows cannot be filtered out in advance because the outer join must still return them.
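
A small counterexample makes the objection concrete. This is a sketch using Python's sqlite3 module, not Calcite; the ename and dname columns and the sample rows are assumptions, and the join is written with ON instead of USING so that the compared columns are explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (ename TEXT, deptno INTEGER);
    CREATE TABLE dept (deptno INTEGER, dname TEXT);
    INSERT INTO emp VALUES ('Ann', 10), ('Bob', NULL);
    INSERT INTO dept VALUES (10, 'Sales');
""")

# Plain LEFT JOIN: every Emp row is preserved, including Bob.
left_join = conn.execute("""
    SELECT e.ename, d.dname
    FROM emp e LEFT JOIN dept d ON e.deptno = d.deptno
    ORDER BY e.ename
""").fetchall()

# Proposed rewrite: filtering Emp first removes Bob from the result.
filtered_left_join = conn.execute("""
    SELECT e.ename, d.dname
    FROM (SELECT * FROM emp WHERE deptno IS NOT NULL) e
    LEFT JOIN dept d ON e.deptno = d.deptno
    ORDER BY e.ename
""").fetchall()

print(left_join)
print(filtered_left_join)
```

The two results differ: the original query keeps the employee whose deptno is NULL, padded with a NULL dname, while the filtered version loses that row.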

Chunwei Lei added a comment - 11/May/22 01:45

"The description does not say that the filter is a relational expression, nor does it say that the filter is before the join rather than after."

The description has been updated.

Xurenhe added a comment - 11/May/22 02:20

I left some questions in the PR.
I am curious about how the join condition will be analyzed, for example:

join on t1.id IS NOT DISTINCT FROM t2.id

Chunwei Lei added a comment - 11/May/22 02:51

I am trying to use Strong to analyze the join condition, as Julian said.

Julian Hyde added a comment - 11/May/22 04:42

In the particular case of IS NOT DISTINCT FROM it is not safe to add IS NOT NULL filters. For example, Emp e JOIN Dept d ON e.deptno IS NOT DISTINCT FROM d.deptno will return a row where e.deptno and d.deptno are both null. Hopefully Strong knows this.
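
The null-safe case can be reproduced as follows. This sketch uses Python's sqlite3 module and SQLite's IS operator, which behaves as a null-safe equality, as a stand-in for IS NOT DISTINCT FROM; the ename and dname columns and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (ename TEXT, deptno INTEGER);
    CREATE TABLE dept (deptno INTEGER, dname TEXT);
    INSERT INTO emp VALUES ('Ann', 10), ('Bob', NULL);
    INSERT INTO dept VALUES (10, 'Sales'), (NULL, 'Unassigned');
""")

# Null-safe join: NULL matches NULL, so Bob joins with 'Unassigned'.
null_safe = conn.execute("""
    SELECT e.ename, d.dname
    FROM emp e JOIN dept d ON e.deptno IS d.deptno
    ORDER BY e.ename
""").fetchall()

# Adding IS NOT NULL filters to the inputs changes the result.
with_filters = conn.execute("""
    SELECT e.ename, d.dname
    FROM (SELECT * FROM emp WHERE deptno IS NOT NULL) e
    JOIN (SELECT * FROM dept WHERE deptno IS NOT NULL) d
      ON e.deptno IS d.deptno
    ORDER BY e.ename
""").fetchall()

print(null_safe)
print(with_filters)
```

The filtered query drops the row in which both deptno values are NULL, so the derived filters are not valid for this condition.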

Chunwei Lei added a comment - 12/May/22 02:29

Exactly. I think Strong can handle it well.

Chunwei Lei added a comment - 20/May/22 11:52

Fixed in https://github.com/apache/calcite/commit/acf82f7784823c30fb7a64e905c3acacd0ed4f2b.

Stamatis Zampetakis added a comment - 25/Jul/22 15:04

Hey Chunwei Lei, I didn't notice that this Jira was resolved, that's great! I see that there was a discussion before about taking inspiration from HiveJoinAddNotNullRule. Can you clarify the similarities/differences (if any) between the new rule that you added and the one used in Hive? I am asking because ideally I would like to avoid maintaining the same code in multiple places.

Chunwei Lei added a comment - 26/Jul/22 06:01

Thank you for your attention, zabetak. IMHO, the most important point about how to implement this rule is how to avoid applying the rule infinitely, which is the big difference between the new rule I added and the one used in Hive. To achieve this goal, Hive uses some extra data structures to save the generated predicates, while the new rule I added just uses MetadataQuery and RexSimplify to see whether the ISNOTNULL predicate is redundant or not.
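
The termination idea can be sketched in a few lines. This is a toy model in plain Python, not the Calcite rule and not the RelMetadataQuery or RexSimplify API: the set of strings stands in for whatever predicates the planner already knows to hold on a join input, and fire_once is a hypothetical name.

```python
def fire_once(known_predicates, join_columns):
    """Add missing IS NOT NULL predicates; return True if anything was added."""
    missing = [
        column for column in join_columns
        if f"{column} IS NOT NULL" not in known_predicates
    ]
    for column in missing:
        known_predicates.add(f"{column} IS NOT NULL")
    # When every derived predicate is already known, the rule is a no-op.
    return bool(missing)


known = set()   # predicates already known to hold on the join input
firings = 0
while fire_once(known, ["id"]):
    firings += 1

print(firings, sorted(known))
```

Because the second attempt finds the predicate already present, the loop stops after a single firing. No bookkeeping beyond the predicates themselves is needed, which matches the contrast drawn above with the extra data structures used in Hive.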

Andrei Sereda added a comment - 03/Aug/22 16:21

Resolved in release 1.31.0

People

Assignee: Chunwei Lei

Reporter: Chunwei Lei

Votes: 0

Watchers: 10

Dates

Created: 01/Apr/20 04:09

Updated: 15/Apr/24 03:00

Resolved: 20/May/22 11:52

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

4h 50m