[IMPALA-10314] Planning time for simple SELECT with LIMIT could be improved - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 3.4.0
Fix Version/s: Impala 4.0.0
Component/s: Frontend
Labels: None

Description

Consider a table t1 with the following characteristics:

HDFS, Parquet format, external table
number of partitions in t1: 39000 (2-level partitioning)
number of columns: 72
number of files: 350000

The planning time for the following query, which has a LIMIT but no ORDER BY, is fairly long:

select * from t1 limit 10;

Query Compilation: 4s411ms
   - Single node plan created: 3s812ms (3s259ms)

The bulk of the time is spent in HdfsScanNode.computeScanRangeLocations(), which iterates over all the partitions and the file descriptors within each partition to assign scan ranges based on data affinity. For trivial LIMIT queries, especially those with small LIMIT values, we should look at ways to improve the planning time.
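
To make the cost concrete, here is a minimal sketch of the difference between visiting every file descriptor and stopping early for a small LIMIT. It is written in Python for brevity and is not Impala code: `Partition`, `scan_all`, and `scan_with_limit` are hypothetical names, and the early exit rests on the assumption that each file contributes at least one row, so that LIMIT n needs at most n files. That assumption does not hold for empty files, so a real implementation would need a safeguard.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Partition:
    """Hypothetical stand-in for a table partition and its file descriptors."""
    name: str
    files: List[str] = field(default_factory=list)


def scan_all(partitions: List[Partition]) -> Tuple[List[Tuple[str, str]], int]:
    """Baseline: visit every file in every partition, as described above."""
    ranges = []
    visited = 0
    for part in partitions:
        for path in part.files:
            visited += 1
            ranges.append((part.name, path))
    return ranges, visited


def scan_with_limit(partitions: List[Partition],
                    max_files: int) -> Tuple[List[Tuple[str, str]], int]:
    """Early exit: stop once max_files files have been collected.

    Assumption (not from the issue): every file holds at least one row,
    so a query with LIMIT n can be satisfied by at most n files.
    """
    ranges = []
    visited = 0
    for part in partitions:
        for path in part.files:
            if len(ranges) >= max_files:
                return ranges, visited
            visited += 1
            ranges.append((part.name, path))
    return ranges, visited
```

With 1000 partitions of 9 files each, `scan_all` visits 9000 file descriptors, while `scan_with_limit(partitions, 10)` visits 10. On a table the size of t1 the gap is proportionally larger, since the baseline work grows with the total file count and the early-exit work grows only with the LIMIT value.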

Issue Links

is related to

IMPALA-10347 Explore approaches to optimizing queries that will likely be short-circuited by limits (Open)

IMPALA-10985 always_true hint is not needed if all predicates are on partitioning columns (Open)

relates to

IMPALA-10360 Allow a simple limit to be treated as a sampling hint where applicable (Resolved)

People

Assignee: Aman Sinha

Reporter: Aman Sinha

Votes: 0

Watchers: 3

Dates

Created: 09/Nov/20 17:47

Updated: 25/Oct/21 19:09

Resolved: 30/Nov/20 06:09