[IMPALA-5260] Have query optimizer make joined tables distinct to improve performance - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
Fix Version/s: None
Component/s: Frontend
Labels:
- performance
- planner

Description

Consider the following select statement:

select tB.bField, count(tA.aField) ct
from tableA tA
join tableB tB using (id)
where (...)
group by tB.bField
order by ct

If tableB has a large number of rows (but still fewer than tableA), this query can be orders of magnitude slower than the equivalent query:

select tB.bField, count(tA.aField) ct
from tableA tA
join (select distinct bField, id[, ...] from tableB) tB using (id)
where (...)
group by tB.bField
order by ct

It appears to me that the slower query gets bogged down shuttling unnecessary tableB rows between nodes.

Is it possible, and beneficial, to make such a query improvement implicit in Impala's query optimizer?
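
To make the trade-off concrete, here is a minimal sketch that runs both query shapes on toy data. It uses Python's built-in sqlite3 module as a stand-in for Impala, so it says nothing about Impala's planner or network behaviour; it only compares how many tableB rows feed the join and whether the two forms return the same result. The sample rows and the helper name `compare` are hypothetical, and the elided `where (...)` clause is left out. Under these assumptions, the rewrite shrinks the join input in both cases, but the results match only when the duplicated (id, bField) pairs do not join to any tableA row.

```python
import sqlite3

# Both query shapes from the description, without the elided WHERE clause.
ORIGINAL = """
    SELECT tB.bField, COUNT(tA.aField) AS ct
    FROM tableA tA
    JOIN tableB tB USING (id)
    GROUP BY tB.bField
    ORDER BY ct
"""

REWRITTEN = """
    SELECT tB.bField, COUNT(tA.aField) AS ct
    FROM tableA tA
    JOIN (SELECT DISTINCT bField, id FROM tableB) tB USING (id)
    GROUP BY tB.bField
    ORDER BY ct
"""


def compare(rows_a, rows_b):
    """Hypothetical helper: load toy rows and run both query shapes.

    Returns (tableB rows, distinct (bField, id) rows,
             result of ORIGINAL, result of REWRITTEN).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tableA (id INTEGER, aField TEXT)")
    conn.execute("CREATE TABLE tableB (id INTEGER, bField TEXT)")
    conn.executemany("INSERT INTO tableA VALUES (?, ?)", rows_a)
    conn.executemany("INSERT INTO tableB VALUES (?, ?)", rows_b)
    raw = conn.execute("SELECT COUNT(*) FROM tableB").fetchone()[0]
    deduped = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT bField, id FROM tableB)"
    ).fetchone()[0]
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    conn.close()
    return raw, deduped, original, rewritten


rows_a = [(1, "a1"), (1, "a2"), (2, "a3")]

# Case 1: the duplicated tableB rows (id 9) never match tableA.
case_unmatched = compare(rows_a, [(1, "x"), (2, "y"), (9, "z"), (9, "z"), (9, "z")])

# Case 2: the duplicated tableB rows (id 1) do match tableA.
case_matched = compare(rows_a, [(1, "x"), (1, "x"), (1, "x"), (2, "y")])

print(case_unmatched)
print(case_matched)
```

In the second case the original query counts each matching tableA row once per duplicate, so an implicit rewrite would presumably need to establish that deduplication cannot change the aggregate before applying it.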

Issue Links

relates to

IMPALA-9875 Deduplicate build in joins with distinct semantics (Open)

supersedes

IMPALA-10099 Push down DISTINCT aggregation for EXCEPT/INTERSECT (Closed)

People

Assignee: Unassigned

Reporter: Michael Sokalski

Votes: 0

Watchers: 4

Dates

Created: 28/Apr/17 14:10

Updated: 07/Feb/23 11:10
