SPARK-8890
Parent: SPARK-5180 Data source API improvement (Spark 1.5)

Reduce memory consumption for dynamic partition insert


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None
    • Sprint: Spark 1.5 release

      Description

      Currently, InsertIntoHadoopFsRelation can run out of memory when the number of table partitions is large. The problem is that we open one output writer for each partition; when incoming rows arrive in random partition order and the number of partitions is large, a large number of output writers end up open simultaneously, leading to OOM.

      The solution here is to fall back to a sort-based write path once the number of active partitions goes beyond a certain threshold (e.g. 50?): sorting the remaining rows by partition key makes each partition's rows contiguous, so only one output writer needs to be open at a time.


              People

              • Assignee: Michael Armbrust (marmbrus)
              • Reporter: Reynold Xin (rxin)
              • Votes: 0
              • Watchers: 7
