Details
-
Improvement
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
Impala 1.4, Impala 1.4.1, Impala 2.0, Impala 2.0.1, Impala 2.1, Impala 2.1.1, Impala 2.2
-
None
Description
The number of hosts a plan node is estimated to run on during plan generation and the actual number of hosts the query will run on could be different in the following scenarios:
1. Queries accessing tables on non-HDFS remote storage, e.g., S3 or Isilon
2. Queries with aggressive partition pruning
3. Large clusters relative to the size of the tables scanned
This discrepancy between the number of estimated hosts and the actual number of hosts can lead to suboptimal plan choices, in particular, a bad join strategy (broadcast vs. partitioned).
To determine whether a query is running slow due such a bad planning decision, you can examine the "hosts" field in the query plan from the profile, and contrast it with the actual hosts from the execution summary.