SPARK-30712: Estimate sizeInBytes from file metadata for parquet files


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, Spark applies a compressionFactor when calculating `sizeInBytes` for `HadoopFsRelation`, but this estimate is not accurate, and it is hard to choose the best `compressionFactor`. Sometimes this can cause OOMs, because an undersized estimate leads to an improper BroadcastHashJoin.

      So I propose to use the rowCount stored in the parquet `BlockMetaData` to estimate the in-memory size, which can be more accurate.
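      As a rough illustration (not the actual patch), here is a sketch contrasting the two estimates. The object and method names are hypothetical, using `schema.defaultSize` as the per-row width is just one possible choice, and `ParquetFileReader.readFooter` is the older footer API in parquet-hadoop:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.spark.sql.types.StructType

import scala.collection.JavaConverters._

object ParquetSizeEstimation {
  // Current-style estimate: scale the on-disk size by a fixed factor
  // (spark.sql.sources.fileCompressionFactor, default 1.0).
  def estimateViaCompressionFactor(fs: FileSystem, file: Path, factor: Double): Long =
    (fs.getFileStatus(file).getLen * factor).toLong

  // Proposed-style estimate: read only the parquet footer, sum the row
  // counts recorded in each row group's BlockMetaData, and multiply by a
  // per-row width (here the schema's defaultSize, one possible choice).
  def estimateViaRowCount(conf: Configuration, file: Path, schema: StructType): Long = {
    val footer = ParquetFileReader.readFooter(conf, file) // reads footer only, no data pages
    val totalRows = footer.getBlocks.asScala.map(_.getRowCount).sum
    totalRows * schema.defaultSize
  }
}
```

      Since only the footer is read, the row-count estimate stays cheap while tracking the actual data volume instead of a guessed compression ratio.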


    People

      Assignee: Unassigned
      Reporter: liupengcheng
      Votes: 0
      Watchers: 3
