Details
Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Impala 2.2
None
Description
The target use case is small queries on large clusters.
Today Impala schedules queries on all Impalad instances regardless of how much data each Impalad would read. This spreads the work too thinly across nodes and exposes undesired scalability issues.
The proposal is to introduce a parameter that controls the Min/Max amount of data read by a single Impala instance.
The SimpleScheduler would combine several splits in order to satisfy the Min size requirement for a single Impalad before moving on to the next node.
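
The combining step could be sketched roughly as follows. This is only an illustration of the idea in Python, not Impala's SimpleScheduler code: the function name `assign_splits`, the parameters `min_bytes` and `max_bytes`, and the soft handling of the Max limit are all assumptions, and data locality is ignored entirely.

```python
from typing import Dict, List


def assign_splits(split_sizes: List[int], nodes: List[str],
                  min_bytes: int, max_bytes: int) -> Dict[str, List[int]]:
    """Greedily pack splits onto one node until it holds at least min_bytes.

    Hypothetical sketch: names and policy are assumptions, not Impala APIs.
    max_bytes is treated as a soft cap; once every node has been visited the
    loop wraps around and falls back to round-robin.
    """
    if not nodes:
        raise ValueError("at least one node is required")

    assignment: Dict[str, List[int]] = {n: [] for n in nodes}
    totals: Dict[str, int] = {n: 0 for n in nodes}
    idx = 0  # index of the node currently being filled

    for size in split_sizes:
        node = nodes[idx % len(nodes)]
        # Skip ahead if this split would push a non-empty node past the cap.
        if totals[node] > 0 and totals[node] + size > max_bytes:
            idx += 1
            node = nodes[idx % len(nodes)]
        assignment[node].append(size)
        totals[node] += size
        # Move to the next node only once the Min requirement is met.
        if totals[node] >= min_bytes:
            idx += 1

    return assignment


if __name__ == "__main__":
    # Four 64 MB splits, three nodes, Min = 128 MB, Max = 256 MB.
    mb = 1024 * 1024
    plan = assign_splits([64 * mb] * 4, ["a", "b", "c"], 128 * mb, 256 * mb)
    for node, splits in plan.items():
        print(node, [s // mb for s in splits])
```

With these inputs the small query lands on two nodes instead of three, which is the effect the proposal is after: nodes that would only read a trivial amount of data are left out of the plan.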