Description
Currently, InsertIntoHadoopFsRelation can run out of memory when the number of table partitions is large. The problem is that we open one output writer per partition; when the incoming data is not clustered by partition key and the number of partitions is large, we end up with a large number of concurrently open output writers, leading to OOM.
The proposed solution is to fall back to a sort-based write path once the number of active partitions exceeds a certain threshold (e.g. 50?): sort the remaining rows by partition key so that only one output writer needs to be open at a time.
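The fallback described above can be illustrated with a small, self-contained sketch. This is not Spark's actual implementation; the names (`insert_with_fallback`, `max_open`) and the in-memory "writers" are hypothetical stand-ins for real output writers, chosen only to show the hash-based phase and the sort-based fallback:

```python
import itertools

def insert_with_fallback(rows, partition_key, max_open=3):
    """Write rows grouped by partition, capping concurrently open writers.

    Phase 1: hash-based - keep one open writer per partition seen so far.
    Phase 2: once a new partition would exceed max_open, close everything,
    sort the remaining rows by partition key, and stream them so that at
    most one writer is open at a time.
    """
    writers = {}        # partition value -> list of rows (simulated output file)
    open_writers = set()
    it = iter(rows)
    for row in it:
        key = partition_key(row)
        if key not in open_writers:
            if len(open_writers) >= max_open:
                # Sort-based fallback: sort this row plus all remaining
                # rows by partition key, then write partition by partition.
                open_writers.clear()
                for r in sorted(itertools.chain([row], it), key=partition_key):
                    writers.setdefault(partition_key(r), []).append(r)
                break
            open_writers.add(key)
        writers.setdefault(key, []).append(row)
    return writers
```

For example, with `max_open=2` and rows arriving in random partition order, partitions `a` and `b` are handled hash-style, and the first row for a third partition triggers the sort-based fallback for everything that remains:

```python
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]
result = insert_with_fallback(rows, lambda r: r[0], max_open=2)
```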
Issue Links
- blocks
  - SPARK-9707 Test sort-based fallback mode for dynamic partition insert (Resolved)
- is duplicated by
  - SPARK-8597 DataFrame partitionBy memory pressure scales extremely poorly (Closed)
  - SPARK-8968 dynamic partitioning in spark sql performance issue due to the high GC overhead (Resolved)