Spark / SPARK-30443

"Managed memory leak detected" even with no calls to take() or limit()

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.2, 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit().

      According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), managed memory leaks should only occur when an iterator is not read to completion, e.g. after a call to take() or limit().
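
To make that expectation concrete, here is a toy model in plain Python. It is not Spark's memory manager: the `held` set and the `rows_with_buffer` generator are hypothetical stand-ins, used only to show how a resource that is released at the end of iteration stays held when the consumer stops early.

```python
import itertools

# Hypothetical stand-in for the memory a task currently holds.
held = set()


def rows_with_buffer(name, n):
    """Yield n rows; the 'buffer' is released only after the last row."""
    held.add(name)
    for i in range(n):
        yield i
    # Only reached when the consumer reads the iterator to completion.
    held.discard(name)


# Reading to completion releases the buffer.
full = list(rows_with_buffer("full", 5))

# Stopping after two rows (roughly what take(2) or limit(2) does) never
# reaches the release step, so the buffer is still recorded as held.
partial = list(itertools.islice(rows_with_buffer("partial", 5), 2))

print(sorted(held))  # ['partial']
```

The reproduction in this report never stops early in that way, which is why the warning is unexpected.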

      Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118"
      The reported leak size is always the same: 2097152 bytes (2 MB).
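
For anyone checking their own executor logs for the same pattern, the size and task ID can be extracted with a few lines of Python. This is a minimal sketch: the regular expression and the helper name `parse_leak_warning` are assumptions written against the single warning line quoted above, not part of any Spark API.

```python
import re

# Pattern written against the one warning line quoted above; other Spark
# versions or log layouts may format the message differently.
LEAK_RE = re.compile(
    r"Managed memory leak detected; size = (?P<size>\d+) bytes, TID = (?P<tid>\d+)"
)


def parse_leak_warning(line):
    """Return (size_in_bytes, task_id) for a leak warning line, else None."""
    match = LEAK_RE.search(line)
    if match is None:
        return None
    return int(match.group("size")), int(match.group("tid"))


line = (
    "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; "
    "size = 2097152 bytes, TID = 118"
)
size, tid = parse_leak_warning(line)

# 2097152 bytes = 2 * 1024 * 1024, i.e. exactly 2 MiB.
print(size / (1024 * 1024), tid)
```

Running the helper over a whole log file and collecting the distinct sizes is a quick way to confirm whether every warning reports the same 2 MB.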

      I have created a minimal test program that reproduces the warning:

      import pyspark.sql
      import pyspark.sql.functions as fx
      
      
      def main():
          builder = pyspark.sql.SparkSession.builder
          builder = builder.appName("spark-jira")
          spark = builder.getOrCreate()
      
          reader = spark.read
          reader = reader.format("csv")
          reader = reader.option("inferSchema", "true")
          reader = reader.option("header", "true")
      
          table_c = reader.load("c.csv")
          table_a = reader.load("a.csv")
          table_b = reader.load("b.csv")
      
          primary_filter = fx.col("some_code").isNull()
      
          new_primary_data = table_a.filter(primary_filter)
      
          new_ids = new_primary_data.select("some_id")
      
          new_data = table_b.join(new_ids, "some_id")
      
          new_data = new_data.select("some_id")
          result = table_c.join(new_data, "some_id", "left")
      
          result.repartition(1).write.json("results.json", mode="overwrite")
      
          spark.stop()
      
      
      if __name__ == "__main__":
          main()
      

      Our code is nothing out of the ordinary: just a few filters, selects and joins.

      The input data consists of 3 CSV files totalling roughly 2.6 GB uncompressed. I tried reducing the number of rows in the CSV input files, but the warning then no longer appeared. After compressing the files I was able to attach them below.

      Attachments

        1. a.csv.zip
          48.82 MB
          Luke Richter
        2. b.csv.zip
          18.87 MB
          Luke Richter
        3. c.csv.zip
          35.26 MB
          Luke Richter

          People

            Assignee: Unassigned
            Reporter: Luke Richter (ltrichter)
            Votes: 0
            Watchers: 5
