SPARK-5075

Memory Leak when repartitioning SchemaRDD or running queries in general

      Description

      I'm trying to repartition a JSON dataset for better CPU utilization and save it in Parquet format for better performance. The JSON dataset is about 200 GB.

      from pyspark.sql import SQLContext

      # sc is the SparkContext provided by the PySpark shell
      sql_context = SQLContext(sc)

      # load ~200 GB of JSON from S3, repartition to 256 partitions, write as Parquet
      rdd = sql_context.jsonFile('s3n://some_path')
      rdd = rdd.repartition(256)
      rdd.saveAsParquetFile('hdfs://some_path')

      In Ganglia, when the dataset first loads it occupies about 200 GB in memory, which is expected. However, once the repartition starts, memory usage balloons to more than 2.5x that and is never released, so any subsequent operations fail with memory errors.

      https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
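
      A possible workaround, sketched here rather than verified against this issue (the paths, the s3n:// scheme, and the partition count of 256 are placeholders carried over from the snippet above): choose the partition count when the raw JSON text is first read, and let the SQLContext infer the schema from that RDD, so the separate repartition() shuffle never happens.

      from pyspark.sql import SQLContext
      sql_context = SQLContext(sc)

      # read the raw JSON lines with at least 256 partitions up front,
      # instead of loading with jsonFile() and then shuffling with repartition()
      lines = sc.textFile('s3n://some_path', 256)

      # infer the schema by scanning the text RDD and build a SchemaRDD from it
      rdd = sql_context.jsonRDD(lines)

      rdd.saveAsParquetFile('hdfs://some_path')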

      I'm also seeing similar memory-leak behavior when running repeated queries against a dataset:

      rdd = sql_context.parquetFile('hdfs://some_path')
      rdd.registerTempTable('events')

      # any query, run repeatedly
      sql_context.sql(anything)
      sql_context.sql(anything)
      sql_context.sql(anything)
      sql_context.sql(anything)

      will result in a memory usage pattern like this:
      http://cl.ly/image/180y2D3d1A0X

      It seems like intermediate results are not being garbage collected. Eventually I have to kill my session to be able to keep running queries.
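
      To make the pattern concrete, here is a minimal sketch of the repeated-query loop (the table name comes from the snippet above; the count(*) query is just a stand-in for any query, and nothing is explicitly cached):

      from pyspark.sql import SQLContext
      sql_context = SQLContext(sc)

      rdd = sql_context.parquetFile('hdfs://some_path')
      rdd.registerTempTable('events')

      # run an arbitrary query repeatedly and force evaluation with collect();
      # memory usage climbs with each iteration even though nothing is cached
      for i in range(20):
          result = sql_context.sql('SELECT count(*) FROM events')
          print(result.collect())

      If a table had been cached with sql_context.cacheTable('events'), then sql_context.uncacheTable('events') would release that copy, but no explicit caching is involved in this report.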

            People

            • Assignee: Unassigned
            • Reporter: Brad Willard