Details
- Improvement
- Status: Open
- Minor
- Resolution: Unresolved
- Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
- None
Description
Consider the following select statement:
select tB.bField, count(tA.aField) ct from tableA tA join tableB tB using (id) where (...) group by tB.bField order by ct
If tableB has a large number of rows (but still fewer than tableA), this query can be orders of magnitude slower than the equivalent query:
select tB.bField, count(tA.aField) ct from tableA tA join (select distinct bField, id[, ...] from tableB) tB using (id) where (...) group by tB.bField order by ct
It appears to me that the slower query gets bogged down with shuttling unnecessary data between nodes.
Would it be possible, and beneficial, for Impala's query optimizer to apply this rewrite automatically?
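To make the comparison concrete, here is a minimal sketch that runs both forms of the query against a tiny in-memory SQLite database. This is an illustration only: SQLite is not Impala, so it says nothing about performance or data movement between nodes, and it only checks whether the two forms return the same rows. The sample data, the payload column, and the run_both helper are hypothetical; the elided where clause and the optional extra columns in the subquery are left out.

```python
import sqlite3

# The two query shapes from the description, minus the elided "where (...)".
ORIGINAL = """
    select tB.bField, count(tA.aField) ct
    from tableA tA join tableB tB using (id)
    group by tB.bField order by ct
"""
REWRITTEN = """
    select tB.bField, count(tA.aField) ct
    from tableA tA
    join (select distinct bField, id from tableB) tB using (id)
    group by tB.bField order by ct
"""

# Hypothetical sample rows; "payload" stands in for tableB's unused columns.
A_ROWS = [(1, "a1"), (1, "a2"), (1, "a3"), (2, "a4"), (3, "a5")]
UNIQUE_B = [(1, "x", "p1"), (2, "y", "p2"), (3, "y", "p3")]
DUPLICATED_B = UNIQUE_B + [(1, "x", "p1-dup")]


def run_both(b_rows):
    """Load the sample data and return (original_result, rewritten_result)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tableA (id integer, aField text)")
    conn.execute("create table tableB (id integer, bField text, payload text)")
    conn.executemany("insert into tableA values (?, ?)", A_ROWS)
    conn.executemany("insert into tableB values (?, ?, ?)", b_rows)
    results = (conn.execute(ORIGINAL).fetchall(),
               conn.execute(REWRITTEN).fetchall())
    conn.close()
    return results


if __name__ == "__main__":
    print(run_both(UNIQUE_B))      # both forms agree
    print(run_both(DUPLICATED_B))  # the forms differ for bField 'x'
```

On this sample data the two forms agree when every (bField, id) pair occurs once in tableB. When a pair is repeated, the original query counts each matching tableA row once per duplicate while the rewritten query counts it once, so an automatic rewrite would presumably need to know that the pairs are unique.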
Attachments
Issue Links
- relates to: IMPALA-9875 Deduplicate build in joins with distinct semantics (Open)
- supersedes: IMPALA-10099 Push down DISTINCT aggregation for EXCEPT/INTERSECT (Closed)