Description
Currently, InsertIntoHadoopFsRelation can run out of memory when the number of table partitions is large. The problem is that we open one output writer per partition; when the incoming data is not clustered by partition key and the number of partitions is large, we end up with a large number of concurrently open output writers, leading to OOM.
The proposed solution is to fall back to a sort-based write path once the number of active partitions exceeds a certain threshold (e.g. 50?): sort the remaining rows by partition key so that only one output writer needs to be open at a time.
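The fallback described above can be illustrated with a small, self-contained sketch. This is not Spark's actual implementation; the names (`insert_with_fallback`, `max_open`) and the in-memory "writers" are hypothetical stand-ins for real output writers, chosen only to show the hash-based phase and the sort-based fallback:

```python
import itertools

def insert_with_fallback(rows, partition_key, max_open=3):
    """Write rows grouped by partition, capping concurrently open writers.

    Phase 1: hash-based - keep one open writer per partition seen so far.
    Phase 2: once a new partition would exceed max_open, close everything,
    sort the remaining rows by partition key, and stream them so that at
    most one writer is open at a time.
    """
    writers = {}        # partition value -> list of rows (simulated output file)
    open_writers = set()
    it = iter(rows)
    for row in it:
        key = partition_key(row)
        if key not in open_writers:
            if len(open_writers) >= max_open:
                # Sort-based fallback: sort this row plus all remaining
                # rows by partition key, then write partition by partition.
                open_writers.clear()
                for r in sorted(itertools.chain([row], it), key=partition_key):
                    writers.setdefault(partition_key(r), []).append(r)
                break
            open_writers.add(key)
        writers.setdefault(key, []).append(row)
    return writers
```

For example, with `max_open=2` and rows arriving in random partition order, partitions `a` and `b` are handled hash-style, and the first row for a third partition triggers the sort-based fallback for everything that remains:

```python
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]
result = insert_with_fallback(rows, lambda r: r[0], max_open=2)
```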
Issue Links
- blocks
  - SPARK-9707 Test sort-based fallback mode for dynamic partition insert (Resolved)
- is duplicated by
  - SPARK-8597 DataFrame partitionBy memory pressure scales extremely poorly (Closed)
  - SPARK-8968 dynamic partitioning in spark sql performance issue due to the high GC overhead (Resolved)