Spark / SPARK-46125

Memory leak when using createDataFrame with persist



    Description

      When I create a DataFrame from a pandas DataFrame and persist it with DISK_ONLY, "byte[]" objects (totaling the size of the imported DataFrame) remain in the driver's heap memory, even after all references are dropped and garbage collection runs.

      This sample code reproduces it:

      import pandas as pd
      import gc
      
      from pyspark.sql import SparkSession
      from pyspark.storagelevel import StorageLevel
      
      spark = SparkSession.builder \
          .config("spark.driver.memory", "4g") \
          .config("spark.executor.memory", "4g") \
          .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
          .getOrCreate()
      
      # Load a pandas DataFrame and convert it to a Spark DataFrame
      pdf = pd.read_pickle('tmp/input.pickle')
      df = spark.createDataFrame(pdf)
      df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
      df.count()  # materialize the persisted blocks
      
      # Drop all Python references and force GC on both sides
      del pdf
      del df
      gc.collect()
      spark.sparkContext._jvm.System.gc()

      After running this code, I perform a manual GC in VisualVM, but the driver's memory usage stays at about 550 MB (at startup it was about 50 MB).

      Then I tested again after adding "df = df.unpersist()" right after the "df.count()" line, and everything was OK: memory usage after the manual GC was back to about 50 MB.

      I also tried reading from a Parquet file (without the unpersist line), using this code:

      import gc
      
      from pyspark.sql import SparkSession
      from pyspark.storagelevel import StorageLevel
      
      spark = SparkSession.builder \
          .config("spark.driver.memory", "4g") \
          .config("spark.executor.memory", "4g") \
          .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
          .getOrCreate()
      
      # Read the same data from Parquet instead of createDataFrame
      df = spark.read.parquet('tmp/input.parquet')
      df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
      df.count()  # materialize the persisted blocks
      
      # Drop the Python reference and force GC on both sides
      del df
      gc.collect()
      spark.sparkContext._jvm.System.gc()

      Again everything was fine: memory usage was about 50 MB after the manual GC.

      Attachments

        1. image-2023-11-28-12-55-58-461.png
          298 kB
          Josh Rosen
        2. ReadParquetWithoutUnpersist.png
          416 kB
          Arman Yazdani
        3. CreateDataFrameWithUnpersist.png
          387 kB
          Arman Yazdani
        4. CreateDataFrameWithoutUnpersist.png
          137 kB
          Arman Yazdani

          People

            Assignee: Unassigned
            Reporter: Arman Yazdani (arman1371)
            Votes: 0
            Watchers: 2