Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
Impala 2.5.0
-
None
Description
Impala and other SQL on Hadoop solutions use per partition statistics which creates a metadata scalability problem which I reckon outweighs benefits of having more accurate statistics.
This is the proposal is for a partitioned table :
- "Compute statistics" computes and stores per partition HLL same as before
- Catalog merges the HLL(s) for all partitions and stores/persists global statistics
- Impalad(s) never request per partition statics only global stats
- The only time the catalog needs to read the per partition HLL is when regenerating the global stats as part of adding/removing partitions
In other words during planning the partitioned table looks very similar to a non-partitioned table.
Attachments
Issue Links
- is blocked by
-
IMPALA-2649 improve incremental stats scalability
- Resolved