Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
ghx-label-7
Description
Based on discussion with amansinha, there are opportunities beyond IMPALA-10314 to optimize queries where there is a limit and the query is unlikely to scan many files.
The problem is that we do all the work to generate scan ranges and schedule them upfront, which adds a lot of overhead if only a small number of files actually need to be processed.
A couple of ideas we had:
- Parallelize and/or otherwise optimize the scan range generation
- Speculatively execute the query on a subset of files and then cancel and retry if we hit the limit
- Incrementally generate scan ranges and assign them to executors so that scan range generation and execution can be overlapped. This is the most general solution but also has a lot of knock-on implications for other subsystems, like cardinality/memory estimation, scheduling, query execution, query coordination, etc.
Attachments
Issue Links
- relates to
-
IMPALA-10314 Planning time for simple SELECT with LIMIT could be improved
- Resolved