[SPARK-23445] ColumnStat refactoring - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None

Description

Refactor ColumnStat to be more flexible.

Split ColumnStat into ColumnStat and CatalogColumnStat, just as CatalogStatistics is split from Statistics. This detaches how the statistics are stored from how they are processed in the query plan. CatalogColumnStat keeps min and max as String, so it does not depend on dataType information.
For CatalogColumnStat, parse column names from property names in the metastore (the KEY_VERSION property), not from the metastore schema. This allows the catalog to read stats into CatalogColumnStat objects even if the schema itself is not in the metastore.
Make all fields optional. min, max and histogram for columns were optional already. Having them all optional is more consistent, and gives the flexibility to, e.g., drop some of the fields through transformations if they are difficult or impossible to calculate.
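
The three points above can be illustrated with a minimal, self-contained sketch. It is not the code from the pull request: the classes CatalogColumnStatSketch and ColumnStatSketch, the SimpleType hierarchy, the "colStats." key prefix and the "version" suffix are all hypothetical stand-ins chosen for illustration, and the real Spark classes carry more fields (for example the histogram).

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's actual classes.
sealed trait SimpleType
case object IntType extends SimpleType
case object StringType extends SimpleType

// Storage-side view: min/max stay as strings, every field is optional.
case class CatalogColumnStatSketch(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None)

// Plan-side view: min/max are typed values, every field is still optional.
case class ColumnStatSketch(
    distinctCount: Option[BigInt] = None,
    min: Option[Any] = None,
    max: Option[Any] = None,
    nullCount: Option[BigInt] = None)

object ColumnStatSketchDemo {
  // Assumed key layout: "<Prefix><column>.<field>"; both constants are hypothetical.
  val Prefix = "colStats."
  val VersionSuffix = ".version" // stands in for the KEY_VERSION property

  // Discover column names from the property keys alone, without a schema.
  def columnNames(props: Map[String, String]): Set[String] =
    props.keys
      .filter { k =>
        k.startsWith(Prefix) && k.endsWith(VersionSuffix) &&
        k.length > Prefix.length + VersionSuffix.length
      }
      .map(_.stripPrefix(Prefix).stripSuffix(VersionSuffix))
      .toSet

  // Read one column's stats; fields missing from the properties become None.
  def readCatalogStat(col: String, props: Map[String, String]): CatalogColumnStatSketch = {
    def get(field: String): Option[String] = props.get(s"$Prefix$col.$field")
    CatalogColumnStatSketch(
      distinctCount = get("distinctCount").map(BigInt(_)),
      min = get("min"),
      max = get("max"),
      nullCount = get("nullCount").map(BigInt(_)))
  }

  // Only this conversion needs the data type: string min/max become typed values.
  def toPlanStat(c: CatalogColumnStatSketch, dt: SimpleType): ColumnStatSketch = {
    def parse(s: String): Any = dt match {
      case IntType    => s.toInt
      case StringType => s
    }
    ColumnStatSketch(c.distinctCount, c.min.map(parse), c.max.map(parse), c.nullCount)
  }
}
```

In this sketch, the catalog side can call columnNames and readCatalogStat with nothing but a property map, and the data type is needed only when toPlanStat turns the stored strings into typed values for the query plan.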

The added flexibility will make it possible to have alternative implementations for stats, and will separate stats collection from stats and estimation processing in plans.

Issue Links

links to

[Github] Pull Request #20624 (juliuszsompolski)

[Github] Pull Request #35363 (Stove-hust)

People

Assignee: Juliusz Sompolski

Reporter: Juliusz Sompolski

Votes: 0

Watchers: 3

Dates

Created: 16/Feb/18 02:03

Updated: 30/Jan/22 02:55

Resolved: 27/Feb/18 07:38