Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- Impala 3.4.0
- None
Description
Consider a table t1 with the following characteristics:
- HDFS, Parquet format, external table
- Number of partitions in t1: 39000 (2-level partitioning)
- Number of columns: 72
- Number of files: 350000
The planning time for the following query, which has a LIMIT but no ORDER BY, is fairly long:
select * from t1 limit 10;
Query Compilation: 4s411ms
- Single node plan created: 3s812ms (3s259ms)
The bulk of the time is spent in HdfsScanNode.computeScanRangeLocations(), which iterates over all the partitions and the file descriptors within them to assign scan ranges based on data affinity. For trivial LIMIT queries, especially those with small LIMIT values, we should look at ways to improve the planning time.
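To illustrate why the full iteration is costly and what a short-circuit could look like, here is a minimal sketch. It is not Impala's code and not necessarily the approach taken in the fix: the class LimitPlanningSketch, the Partition type, and both methods are hypothetical stand-ins. It also assumes that each selected file contributes at least one row, which is not guaranteed in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LimitPlanningSketch {
    /** Hypothetical stand-in for a partition: only its list of file names. */
    static final class Partition {
        final List<String> files;
        Partition(List<String> files) { this.files = files; }
    }

    /** Baseline: visits every file of every partition, so cost grows with the table. */
    static List<String> collectAllFiles(List<Partition> partitions) {
        List<String> result = new ArrayList<>();
        for (Partition p : partitions) {
            result.addAll(p.files);
        }
        return result;
    }

    /**
     * Short-circuit sketch: stops as soon as maxFiles files are selected.
     * ASSUMPTION: maxFiles files are enough to satisfy the LIMIT, i.e. every
     * selected file holds at least one row.
     */
    static List<String> collectFilesUpTo(List<Partition> partitions, int maxFiles) {
        List<String> result = new ArrayList<>();
        for (Partition p : partitions) {
            for (String file : p.files) {
                if (result.size() >= maxFiles) {
                    return result;
                }
                result.add(file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Small synthetic table: 1000 partitions with 2 files each.
        List<Partition> partitions = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            partitions.add(new Partition(Arrays.asList("p" + i + "_a", "p" + i + "_b")));
        }
        System.out.println("all files: " + collectAllFiles(partitions).size());
        System.out.println("files for limit 10: " + collectFilesUpTo(partitions, 10).size());
    }
}
```

With the short-circuit, the work is bounded by the LIMIT value rather than by the 350000 files in the table. A real implementation would also have to handle empty files and any predicates on the query, which this sketch ignores.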
Issue Links
- is related to
  - IMPALA-10347 Explore approaches to optimizing queries that will likely be short-circuited by limits (Open)
  - IMPALA-10985 always_true hint is not needed if all predicates are on partitioning columns (Open)
- relates to
  - IMPALA-10360 Allow a simple limit to be treated as a sampling hint where applicable (Resolved)