SPARK-46125

Memory leak when using createDataFrame with persist



    Description

      When I create a DataFrame from a pandas DataFrame and persist it (DISK_ONLY), some "byte[]" objects (totaling the size of the imported DataFrame) remain in the driver's heap memory.

      This is the sample code for reproducing it:

      import pandas as pd
      import gc
      
      from pyspark.sql import SparkSession
      from pyspark.storagelevel import StorageLevel
      
      spark = SparkSession.builder \
          .config("spark.driver.memory", "4g") \
          .config("spark.executor.memory", "4g") \
          .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
          .getOrCreate()
      
      pdf = pd.read_pickle('tmp/input.pickle')
      df = spark.createDataFrame(pdf)
      df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
      df.count()
      
      del pdf
      del df
      gc.collect()
      spark.sparkContext._jvm.System.gc()

      After running this code, I trigger a manual GC in VisualVM, but the driver's memory usage stays at about 550 MB (at startup it was about 50 MB).

      Then I tested adding "df = df.unpersist()" after the "df.count()" line, and everything was OK: memory usage after a manual GC was about 50 MB.

      I also tried reading from a Parquet file (without the unpersist line) with this code:

      import gc
      
      from pyspark.sql import SparkSession
      from pyspark.storagelevel import StorageLevel
      
      spark = SparkSession.builder \
          .config("spark.driver.memory", "4g") \
          .config("spark.executor.memory", "4g") \
          .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
          .getOrCreate()
      
      df = spark.read.parquet('tmp/input.parquet')
      df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
      df.count()
      
      del df
      gc.collect()
      spark.sparkContext._jvm.System.gc()

      Again, everything was fine: memory usage was about 50 MB after a manual GC.

      Attachments

        1. CreateDataFrameWithoutUnpersist.png
          137 kB
          Arman Yazdani
        2. CreateDataFrameWithUnpersist.png
          387 kB
          Arman Yazdani
        3. image-2023-11-28-12-55-58-461.png
          298 kB
          Josh Rosen
        4. ReadParquetWithoutUnpersist.png
          416 kB
          Arman Yazdani


          People

            Assignee: Unassigned
            Reporter: Arman Yazdani (arman1371)
