Description
It looks like GreedyHeuristicJoinOrderAlgorithm always assumes every joins are associative.
Following query returns in inaccurate result:
select * FROM
customer c
right outer join nation n on c.c_custkey = n.n_nationkey
join region r on c.c_custkey = r.r_regionkey;
because GreedyHeuristicJoinOrderAlgorithm changes join order as
select * FROM
customer c
join region r on c.c_custkey = r.r_regionkey
right outer join nation n on c.c_custkey = n.n_nationkey;
I think getBestPair() should be fixed to avoid wrong join ordering.
Attachments
Issue Links
- is related to
-
TAJO-1342 The complex join query runs forever
- Resolved
Activity
For the above case, the customer and nation tables must be joined first, and then the result is joined with the region table.
In addition, the join type should be the outer join for the first join.
I think that we should handle joins separately according to their join types when optimizing the join order.
Thanks for explanation. I have few knowledge of tajo core, and just thought about in a cost aspect.
GitHub user sirpkt opened a pull request:
https://github.com/apache/tajo/pull/379
TAJO-1277: GreedyHeuristicJoinOrderAlgorithm sometimes wrongly assumes associativity of joins
Basically, it limits the range of join ordering until it meets outer join operations.
For example, in the case of (((((a inner join b) inner join c) outer join d) inner join e) inner join f),
join ordering is partitioned as three parts as
1) (a inner join b) inner join c
2) (result of 1) outer join d
3) (((result of 2) inner join e) inner join f)
Following modifications are included:
- findBestOrder() is changed to partition join ordering
- getBestPair() and findJoin() are changed to return the corresponding JoinEdges of the selected join because those JoinEdges should be removed before next join ordering
It passes 'mvn clean install' and several join cases I tested,
however, I'm not sure this is good approach.
Please, leave me comments about the patch.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sirpkt/tajo TAJO-1277
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/379.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #379
commit 74802e228dd3af8e885cc724c32d5746939c2c23
Author: Keuntae Park <sirpkt@apache.org>
Date: 2015-02-09T09:21:04Z
join optimizer is enhanced to distinguish non-associative join cases
Github user jihoonson commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24393011
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
+ joinEdgesForGiven = joinGraph.getIncomingEdges(relationString);
+ if (joinEdgesForGiven != null)
+ }
+
+ // check if the relation is the last piece of outer join
+ boolean endInnerRelation = false;
+ for (JoinEdge joinEdge: joinEdges) {
+ switch(joinEdge.getJoinType()) {
+ case LEFT_OUTER:
+ case RIGHT_OUTER:
+ case FULL_OUTER:
— End diff –
We should consider other join types such as ```anti join``` or ```semi join``` as well as ```inner join``` and ```outer join```.
So, I think that the iteration must be stopped when finding ```anti join``` or ```semi join```.
Github user jihoonson commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24393231
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
+ joinEdgesForGiven = joinGraph.getIncomingEdges(relationString);
+ if (joinEdgesForGiven != null)
+ }
+
+ // check if the relation is the last piece of outer join
+ boolean endInnerRelation = false;
+ for (JoinEdge joinEdge: joinEdges) {
— End diff –
Here, joinEdges seem to be sorted in the order of their occurrence in the user query.
So, I wonder how we can guarantee that joinEdges are always sorted in that order.
Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/379#issuecomment-73655137
@sirpkt, thanks for your nice work.
The basic approach looks good to me.
I left some comments.
Sometimes, outer joins can be associative under some conditions. (Please refer to http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins)
So, we should handle them, but I think that it should be done in another issue.
Github user sirpkt commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24468860
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
+ joinEdgesForGiven = joinGraph.getIncomingEdges(relationString);
+ if (joinEdgesForGiven != null)
+ }
+
+ // check if the relation is the last piece of outer join
+ boolean endInnerRelation = false;
+ for (JoinEdge joinEdge: joinEdges) {
+ switch(joinEdge.getJoinType()) {
+ case LEFT_OUTER:
+ case RIGHT_OUTER:
+ case FULL_OUTER:
— End diff –
Thank you for the comment, @jihoonson .
I'll add other join types.
Github user sirpkt commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24469183
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
+ joinEdgesForGiven = joinGraph.getIncomingEdges(relationString);
+ if (joinEdgesForGiven != null)
+ }
+
+ // check if the relation is the last piece of outer join
+ boolean endInnerRelation = false;
+ for (JoinEdge joinEdge: joinEdges) {
— End diff –
Actually, I thought that the order of JoinEdges in joinEdges list does not matter.
Since the given relation is the right relation for all the JoinEdges,
I just check the existence of JoinEdge that is outer join.
And, in my thought, the result does not vary regardless of the order of JoinEdge's occurrence.
Github user jihoonson commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24472011
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
+ joinEdgesForGiven = joinGraph.getIncomingEdges(relationString);
+ if (joinEdgesForGiven != null)
+ }
+
+ // check if the relation is the last piece of outer join
+ boolean endInnerRelation = false;
+ for (JoinEdge joinEdge: joinEdges) {
— End diff –
You are right. It was my misunderstanding.
Github user jihoonson commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24472049
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
— End diff –
Would you mind explaning these lines?
I found that the same join edge is added twice to the ```joinEdgesForGiven``` list.
Github user sirpkt commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24553012
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
— End diff –
Oh, it's my mistake.
When relationStrings has only one entry, this code may adds that entry twice.
When a LogicalNode contains two relations, for example, A and B,
above code first finds joinEdges whose right relation is "A, B", which is obtained by TUtil.collectionToString().
Next, it finds joinEdges whose right relation is "A" or "B", which is obtained by 'for (String relationString: relationStrings)'.
So, if a LogicalNode contains just one relation, this code may adds that relation repeatedly.
Duplicated relation does not affect the result but I'll patch not to have duplicated relations.
Github user jihoonson commented on a diff in the pull request:
https://github.com/apache/tajo/pull/379#discussion_r24555499
— Diff: tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java —
@@ -57,17 +54,69 @@ public FoundJoinOrder findBestOrder(LogicalPlan plan, LogicalPlan.QueryBlock blo
JoinEdge bestPair;
while (remainRelations.size() > 1) {
+ Set<LogicalNode> checkingRelations = new LinkedHashSet<LogicalNode>();
+
+ for (LogicalNode relation : remainRelations) {
+ Collection <String> relationStrings = PlannerUtil.getRelationLineageWithinQueryBlock(plan, relation);
+ List<JoinEdge> joinEdges = new ArrayList<JoinEdge>();
+ String relationCollection = TUtil.collectionToString(relationStrings, ",");
+ List<JoinEdge> joinEdgesForGiven = joinGraph.getIncomingEdges(relationCollection);
+ if (joinEdgesForGiven != null)
+ for (String relationString: relationStrings) {
— End diff –
Thanks for your explanation.
Github user sirpkt commented on the pull request:
https://github.com/apache/tajo/pull/379#issuecomment-74016250
I updated the patch to reflect the comments from @jihoonson .
- add semi and anti join cases
- avoid duplicated JoinEdges
Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/379#issuecomment-74204669
+1
The latest patch looks good to me.
Thanks for your work!
Github user hyunsik commented on the pull request:
https://github.com/apache/tajo/pull/379#issuecomment-74479771
+1
The patch contains a workaround code which can fix the bug right now.
I have some comments. Your workaround code seems to try to firstly find the order of joins whose join types is same. In my opinion, this approach may miss some of join order combinations. It may cause sub-optimal join orders in some cases. I think that the right way is to improve the findBestPair() to find only associative join orders. It would be great if we do it in another jira.
Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/379#issuecomment-74480604
@hyunsik, thanks for your review.
I created an issue to handle the potentially missed cases.
https://issues.apache.org/jira/browse/TAJO-1352
SUCCESS: Integrated in Tajo-master-build #585 (See https://builds.apache.org/job/Tajo-master-build/585/)
TAJO-1277: GreedyHeuristicJoinOrderAlgorithm sometimes wrongly assumes associativity of joins. (Keuntae Park via jihoon) (jihoonson: rev 95d9cc455dd7d25da6dd0df709084d7346efb410)
- tajo-core/src/test/resources/queries/TestJoinQuery/testJoinWithMultipleJoinTypes.sql
- tajo-core/src/test/resources/results/TestJoinQuery/testJoinWithMultipleJoinTypes.result
- tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinQuery.java
- CHANGES
- tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java
FAILURE: Integrated in Tajo-master-CODEGEN-build #224 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/224/)
TAJO-1277: GreedyHeuristicJoinOrderAlgorithm sometimes wrongly assumes associativity of joins. (Keuntae Park via jihoon) (jihoonson: rev 95d9cc455dd7d25da6dd0df709084d7346efb410)
- tajo-plan/src/main/java/org/apache/tajo/plan/joinorder/GreedyHeuristicJoinOrderAlgorithm.java
- tajo-core/src/test/resources/results/TestJoinQuery/testJoinWithMultipleJoinTypes.result
- tajo-core/src/test/java/org/apache/tajo/engine/query/TestJoinQuery.java
- tajo-core/src/test/resources/queries/TestJoinQuery/testJoinWithMultipleJoinTypes.sql
- CHANGES
How about rewriting logical plan for this case? I feel that rewriting logical plan like this seems to be more optimal, and lower cost.