[IMPALA-3563] Evaluate using global opposed to per partition statistics - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 2.5.0
Fix Version/s: None
Component/s: Catalog
Labels:
Target Version: Product Backlog

Description

Impala and other SQL-on-Hadoop solutions use per-partition statistics. This creates a metadata scalability problem which, I reckon, outweighs the benefit of having more accurate statistics.

This is the proposal for a partitioned table:

"Compute statistics" computes and stores a per-partition HLL, same as before
Catalog merges the HLL(s) of all partitions and persists the resulting global statistics
Impalad(s) never request per-partition statistics, only global stats
The only time the catalog needs to read the per-partition HLLs is when regenerating the global stats after partitions are added or removed

In other words, during planning a partitioned table looks very similar to a non-partitioned table.
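
The proposed flow could be sketched roughly as follows. This is an illustrative toy, not Impala's implementation: the HLL here is a minimal register-max HyperLogLog with an assumed precision of 10 bits, and the names TableStats, set_partition, drop_partition and global_ndv are hypothetical. It only shows that merging per-partition sketches is a register-wise max, and that the planner-facing value is a single global number that is regenerated only when partitions change.

```python
import hashlib
import math

P = 10          # precision bits (assumed for this sketch)
M = 1 << P      # number of HLL registers


def new_sketch():
    return [0] * M


def add(sketch, value):
    # 64-bit hash: top P bits pick the register, the rest give the rank.
    h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    sketch[idx] = max(sketch[idx], rank)


def merge(sketches):
    # Merging HLLs is a register-wise max, so no raw data is re-read.
    merged = new_sketch()
    for s in sketches:
        merged = [max(a, b) for a, b in zip(merged, s)]
    return merged


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range (linear counting) correction
    return raw


class TableStats:
    """Hypothetical catalog-side holder for one column of one table."""

    def __init__(self):
        self.partition_sketches = {}  # read only when partitions change
        self.global_ndv = 0           # the only value a planner would request

    def set_partition(self, name, values):
        s = new_sketch()
        for v in values:
            add(s, v)
        self.partition_sketches[name] = s
        self._regenerate()

    def drop_partition(self, name):
        del self.partition_sketches[name]
        self._regenerate()

    def _regenerate(self):
        merged = merge(self.partition_sketches.values())
        self.global_ndv = round(estimate(merged))
```

With this shape, the amount of statistics metadata a planner needs is independent of the partition count; the per-partition sketches stay on the catalog side.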

Issue Links

is blocked by: IMPALA-2649 improve incremental stats scalability (Resolved)

People

Assignee: Unassigned
Reporter: Mostafa Mokhtar
Votes: 0
Watchers: 5

Dates

Created: 18/May/16 00:56
Updated: 19/Jun/20 19:37