Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-5180 Data source API improvement (Spark 1.5)
  3. SPARK-8597

DataFrame partitionBy memory pressure scales extremely poorly

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersConvert to IssueLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Closed
    • Blocker
    • Resolution: Duplicate
    • 1.4.0
    • None
    • SQL
    • None
    • Spark 1.5 release

    Description

      I'm running into a strange memory scaling issue when using the partitionBy feature of DataFrameWriter.

      I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 different entries, with size on disk of about 20kb. There are 32 distinct values for column A and 32 distinct values for column B and all these are combined together (column C will contain a random number for each row - it doesn't matter) producing a 32*32 elements data set. I've imported this into Spark and I ran a partitionBy("A", "B") in order to test its performance. This should create a nested directory structure with 32 folders, each of them containing another 32 folders. It uses about 10Gb of RAM and it's running slow. If I increase the number of entries in the table from 32*32 to 128*128, I get Java Heap Space Out Of Memory no matter what value I use for Heap Space variable.

      Scala code:

      var df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("table.csv") 
      df.write.partitionBy("A", "B").mode("overwrite").parquet("table.parquet”)
      

      How I ran the Spark shell:

      bin/spark-shell --driver-memory 16g --master local[8] --packages com.databricks:spark-csv_2.10:1.0.3
      

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            mcheah Matt Cheah
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Agile

                Completed Sprint:
                Spark 1.5 release ended 14/Aug/15
                View on Board

                Slack

                  Issue deployment