Impala and other SQL-on-Hadoop solutions use per-partition statistics, which creates a metadata scalability problem that I reckon outweighs the benefit of having more accurate statistics.
This is the proposal for a partitioned table:
- "Compute statistics" computes and stores per partition HLL same as before
- Catalog merges the HLL(s) for all partitions and stores/persists global statistics
- Impalad(s) never request per partition statics only global stats
- The only time the catalog needs to read the per partition HLL is when regenerating the global stats as part of adding/removing partitions
In other words, during planning a partitioned table looks very much like a non-partitioned table.
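The scheme above hinges on the fact that HLL sketches are mergeable: a register-wise max over per-partition sketches yields the same sketch you would get from scanning all partitions at once, so the catalog can regenerate global NDV stats without touching the raw data. A minimal sketch of the idea (a toy `HyperLogLog` class written for illustration, not Impala's actual implementation):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HLL for illustrating per-partition merge; not production code."""

    def __init__(self, p=12):
        self.p = p                  # precision: 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: first p bits pick a register, the rest give the rank
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max: this is what lets the catalog combine
        # per-partition sketches into one global sketch losslessly.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

# Two "partitions" with overlapping values: merging their sketches
# estimates the distinct count of the union, not the sum of the parts.
part_a = HyperLogLog()
part_b = HyperLogLog()
for v in range(10000):
    part_a.add(v)
for v in range(5000, 15000):
    part_b.add(v)
part_a.merge(part_b)
print(round(part_a.estimate()))  # close to the true NDV of 15000
```

Because the merge is a simple max and is idempotent, the catalog can re-run it any time the partition set changes, and the merged global stats are all the planner ever needs to see.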