Spark / SPARK-30394

Skip collecting stats in DetermineTableStats rule when a Hive table is convertible to a datasource table


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will scan HDFS files to collect table stats in the `DetermineTableStats` rule. But this can be expensive and inaccurate (it only measures the file size on disk, without accounting for the compression factor). We can actually skip this when the Hive table can be converted to a datasource table (Parquet, etc.) and do better estimation in `HadoopFsRelation`.
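      For context, here is a minimal sketch of the two configurations involved. The config keys (`spark.sql.statistics.fallBackToHdfs`, `spark.sql.hive.convertMetastoreParquet`) are real Spark settings; the session setup and table name are illustrative only, not part of this issue.

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("stats-fallback-sketch")
        .enableHiveSupport()
        // When a Hive table has no stats in the metastore, fall back to scanning
        // HDFS to compute the total file size -- the scan done in DetermineTableStats.
        .config("spark.sql.statistics.fallBackToHdfs", "true")
        // With this enabled (the default), Hive Parquet tables are converted to
        // datasource tables and read through HadoopFsRelation, which can estimate
        // sizes on its own.
        .config("spark.sql.hive.convertMetastoreParquet", "true")
        .getOrCreate()

      // Analysis of this query triggers DetermineTableStats; for a convertible
      // Parquet table the HDFS scan is the cost this issue proposes to skip.
      // (hypothetical_db.parquet_events is a made-up table name.)
      spark.table("hypothetical_db.parquet_events").count()
      ```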

      Before SPARK-28573, the implementation would update the `CatalogTableStatistics`, which caused the improper stats (for Parquet, this size is much smaller than the real size in memory) to be used in `JoinSelection` when the Hive table can be converted to a datasource table.
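      One hedged way to observe the size estimate the optimizer ends up using, assuming a hypothetical Parquet-backed Hive table (`DESCRIBE EXTENDED` and `EXPLAIN COST` are standard Spark SQL; the table name is illustrative):

      ```scala
      // Show the table-level statistics recorded in the catalog, if any.
      spark.sql("DESCRIBE EXTENDED hypothetical_db.parquet_events")
        .filter("col_name = 'Statistics'")
        .show(truncate = false)

      // EXPLAIN COST prints sizeInBytes per plan node. With the HDFS fallback the
      // reported size is the compressed on-disk size, which for Parquet can be far
      // smaller than the in-memory size that matters for join planning.
      spark.sql("EXPLAIN COST SELECT * FROM hypothetical_db.parquet_events")
        .show(truncate = false)
      ```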

      In our production environment, users' highly compressed Parquet tables can cause OOMs when doing `broadcastHashJoin` due to these improper stats.
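      A sketch of why the understated size triggers a broadcast, plus two possible workarounds until this improvement lands (the config keys are real Spark settings; the size figures are illustrative):

      ```scala
      // JoinSelection broadcasts a side whose estimated sizeInBytes is at or below
      // spark.sql.autoBroadcastJoinThreshold (10 MB by default). A heavily
      // compressed Parquet table of ~8 MB on disk may expand to hundreds of MB in
      // memory, so broadcasting it can OOM executors or the driver.
      println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

      // Workaround 1: disable the HDFS fallback so the misleading on-disk size is
      // not used; missing stats then default to a very large size
      // (spark.sql.defaultSizeInBytes), which prevents the unintended broadcast.
      spark.conf.set("spark.sql.statistics.fallBackToHdfs", "false")

      // Workaround 2: disable automatic broadcast joins for the affected queries.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
      ```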

    Attachments

    Issue Links

    Activity

    People

      Assignee: Unassigned
      Reporter: liupengcheng
      Votes: 0
      Watchers: 1

    Dates

      Created:
      Updated: