SPARK-8890: Reduce memory consumption for dynamic partition insert

Parent: SPARK-5180 Data source API improvement (Spark 1.5)

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None
    • Sprint: Spark 1.5 release

Description

    Currently, InsertIntoHadoopFsRelation can run out of memory when the number of table partitions is large. The problem is that we open one output writer per partition, so when incoming rows are not clustered by the partition columns and the partition count is high, a large number of writers end up open simultaneously, and their buffers exhaust memory, leading to OOM.

    The proposed solution is to inject a sort on the partition columns once the number of active partitions crosses a threshold (e.g. 50?). After sorting, each partition's rows arrive contiguously, so writers can be opened and closed one at a time.
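
    The sketch below illustrates that strategy in Scala. It is not the actual InsertIntoHadoopFsRelation code: Writer, the (partitionKey, row) pair type, and maxOpenWriters are simplified stand-ins, and a production version would use an external, spill-to-disk sort rather than buffering the remaining rows in memory.

        import scala.collection.mutable

        // Hypothetical stand-in for a per-partition file writer; the real code
        // wraps Hadoop output writers.
        final class Writer(val partition: String) {
          def write(row: String): Unit = ()  // would append the row to this partition's file
          def close(): Unit = ()             // would flush and close the file
        }

        object DynamicPartitionInsert {
          val maxOpenWriters = 50  // assumed threshold, per the "(e.g. 50?)" above

          // Consumes (partitionKey, row) pairs while bounding concurrent writers.
          def insert(rows: Iterator[(String, String)]): Unit = {
            val writers = mutable.Map.empty[String, Writer]

            // Fast path: one open writer per partition while under the cap.
            while (rows.hasNext && writers.size < maxOpenWriters) {
              val (part, row) = rows.next()
              writers.getOrElseUpdate(part, new Writer(part)).write(row)
            }
            writers.values.foreach(_.close())

            // Fallback: too many distinct partitions. Sort the remaining rows by
            // partition key so each partition's rows arrive contiguously; then at
            // most one writer is open at a time. (A real implementation would use
            // an external sort here instead of toSeq.sortBy.)
            if (rows.hasNext) {
              var current: Option[Writer] = None
              for ((part, row) <- rows.toSeq.sortBy(_._1)) {
                if (!current.exists(_.partition == part)) {
                  current.foreach(_.close())
                  current = Some(new Writer(part))
                }
                current.get.write(row)
              }
              current.foreach(_.close())
            }
          }
        }

    The key design point is that the sort trades one extra pass of CPU (and possibly disk) for bounded memory: after sorting, each writer's lifetime matches the contiguous run of its partition's rows, so memory no longer grows with the number of partitions.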


People

    Assignee: Michael Armbrust (marmbrus)
    Reporter: Reynold Xin (rxin)
    Votes: 0
    Watchers: 7

Agile

    Completed Sprint: Spark 1.5 release (ended 14/Aug/15)
