Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.0
- Fix Version/s: None
- Component/s: None
Description
Currently, Spark uses a compressionFactor when calculating `sizeInBytes` for `HadoopFsRelation`, but this estimate is inaccurate, and it is hard to choose a good `compressionFactor`. Sometimes this causes OOMs due to an improperly chosen BroadcastHashJoin.
So I propose using the rowCount in the BlockMetadata to estimate the in-memory size, which can be more accurate (see the sketch below).
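A minimal sketch of the idea, not existing Spark code: it reads a Parquet footer and sums `getRowCount` over the block (row-group) metadata. The `estimateSizeInBytes` helper and the `estimatedRowSize` parameter are hypothetical illustrations; how the per-row size is derived (e.g. from the schema's field widths) is left open.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Hypothetical helper: estimate a file's in-memory size from the exact row
// counts stored in its Parquet block (row-group) metadata, instead of scaling
// the on-disk size by a fixed compressionFactor.
def estimateSizeInBytes(conf: Configuration, file: Path, estimatedRowSize: Long): Long = {
  // Only the footer is read; each BlockMetaData carries its row count.
  val footer = ParquetFileReader.readFooter(conf, file)
  val rowCount = footer.getBlocks.asScala.map(_.getRowCount).sum
  // rowCount is exact; estimatedRowSize (assumed supplied by the caller,
  // e.g. derived from the schema) is the remaining approximation.
  rowCount * estimatedRowSize
}
```

With this, the only approximation left is the per-row size, which can be bounded from the schema, rather than a single per-format compression ratio that varies widely across datasets.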
Attachments
Issue Links
- relates to: SPARK-24914 totalSize is not a good estimate for broadcast joins (In Progress)