[IMPALA-10347] Explore approaches to optimizing queries that will likely be short-circuited by limits - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Distributed Exec
Labels:
- performance

Target Version: Product Backlog

Description

Based on discussion with amansinha, there are opportunities beyond ~~IMPALA-10314~~ to optimize queries where there is a limit and the query is unlikely to scan many files.

The problem is that we do all the work to generate scan ranges and schedule them upfront, which adds a lot of overhead if only a small number of files actually need to be processed.

A couple of ideas we had:

- Parallelize and/or otherwise optimize the scan range generation.
- Speculatively execute the query on a subset of files and then cancel and retry if we hit the limit.
- Incrementally generate scan ranges and assign them to executors so that scan range generation and execution can be overlapped. This is the most general solution but also has a lot of knock-on implications for other subsystems, like cardinality/memory estimation, scheduling, query execution, query coordination, etc.
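To make the incremental idea concrete, here is a minimal single-threaded sketch in Python. It is not Impala code: `ScanRange`, `generate_scan_ranges`, `read_rows`, and `run_with_limit` are hypothetical names, the fixed range size and the "one row per 100 bytes" reader are made-up stand-ins, and batches are interleaved with scanning rather than truly overlapped across executors. It only illustrates that lazy generation lets a query with a limit stop before most scan ranges are ever produced.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class ScanRange:
    """Hypothetical stand-in for a scan range: a byte span within one file."""
    path: str
    offset: int
    length: int


def generate_scan_ranges(
    files: List[Tuple[str, int]], range_size: int, stats: Dict[str, int]
) -> Iterator[ScanRange]:
    """Lazily yield scan ranges; nothing is computed until a consumer asks."""
    for path, size in files:
        for offset in range(0, size, range_size):
            stats["generated"] += 1
            yield ScanRange(path, offset, min(range_size, size - offset))


def read_rows(rng: ScanRange) -> List[str]:
    """Fake scan: pretend every 100 bytes of a range holds one row."""
    return [f"{rng.path}@{rng.offset + i}" for i in range(0, rng.length, 100)]


def run_with_limit(
    files: List[Tuple[str, int]],
    limit: int,
    range_size: int = 1000,
    batch_size: int = 2,
) -> Tuple[List[str], int]:
    """Pull scan ranges in small batches and stop once `limit` rows are found.

    Returns the rows and the number of scan ranges that were generated.
    """
    stats = {"generated": 0}
    ranges = generate_scan_ranges(files, range_size, stats)
    rows: List[str] = []
    while len(rows) < limit:
        batch = list(islice(ranges, batch_size))  # generate the next few ranges only
        if not batch:
            break  # all files exhausted before reaching the limit
        for rng in batch:
            rows.extend(read_rows(rng))
            if len(rows) >= limit:
                break
    return rows[:limit], stats["generated"]


if __name__ == "__main__":
    files = [("a", 5000), ("b", 5000)]  # 10 scan ranges in total at range_size=1000
    rows, generated = run_with_limit(files, limit=15)
    print(len(rows), "rows from", generated, "generated scan ranges")
```

In this toy setup the limit is reached after the first batch, so the remaining scan ranges are never generated. A real implementation would also have to address the knock-on effects listed above, such as estimating memory and cardinality without knowing the full set of scan ranges upfront.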

Issue Links

relates to: IMPALA-10314 Planning time for simple SELECT with LIMIT could be improved (Resolved)

People

Assignee: Unassigned

Reporter: Tim Armstrong

Votes: 0

Watchers: 1

Dates

Created: 20/Nov/20 20:18

Updated: 20/Nov/20 20:18