Details
-
Improvement
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
Impala 1.2.1
-
None
Description
The amount of available memory changes the trade-off between partitioned and shuffle join strategies: if switching to shuffle join can avoid spilling to disk, it may be worth paying the cost of the additional network transfer.
There are two issues:
1. Join strategy decision only takes query mem-limit into account but ignore process mem-limit.
2. Join strategy decision does not take other joins of the same query into account. When multiple joins are present, memory consumption can be very high.
I (tarmstrong@cloudera.com) don't think we should attempt to fix #1 - there's a phase ordering problem here - we currently choose the best-performing plan then decide how much memory to allocate in admission control based on that plan. We can't preserve that while attempting to change the plan to fit the mem_limit. That said, I think the current heuristic is a little too aggressive about picking broadcast when the right side is very large - it should probably bias more towards shuffle as the right side gets larger.
Note that when IMPALA-3200 is completed, this shouldn't prevent the query running to completion, but still affects performance.