  1. Spark
  2. SPARK-30443

"Managed memory leak detected" even with no calls to take() or limit()



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.2, 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None


      Our Spark code triggers a "Managed memory leak detected" warning, even though it never calls take() or limit().

      According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), managed memory leaks should only occur when an iterator is not read to completion, as happens with take() or limit().

      Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118"
      The reported size of the managed memory leak is always 2097152 bytes (2 MB).
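
To check whether every occurrence really reports the same size, the warning lines can be pulled out of the executor log and converted to MiB. The following is only a sketch: the regular expression is derived from the single log line quoted above, and `parse_leak_warning` is a hypothetical helper name, not part of Spark.

```python
import re

# Pattern built from the one warning line quoted in this report (an assumption
# about the general log format, not a documented Spark contract).
LEAK_RE = re.compile(r"Managed memory leak detected; size = (\d+) bytes, TID = (\d+)")


def parse_leak_warning(line):
    """Return (size_in_bytes, task_id) for a leak warning line, or None if it does not match."""
    match = LEAK_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("2020-01-06 14:54:59 WARN Executor:66 - "
        "Managed memory leak detected; size = 2097152 bytes, TID = 118")
size_bytes, task_id = parse_leak_warning(line)

# 2097152 bytes / (1024 * 1024) = 2.0 MiB
print(size_bytes / (1024 * 1024), task_id)  # prints: 2.0 118
```

Running the same function over every line of a log file would show whether the size varies between tasks or stays fixed at 2 MiB.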

      I have created a minimal test program that reproduces the warning: 

      import pyspark.sql
      import pyspark.sql.functions as fx


      def main():
          builder = pyspark.sql.SparkSession.builder
          builder = builder.appName("spark-jira")
          spark = builder.getOrCreate()

          # One CSV reader configuration, shared by all three input files.
          reader = spark.read
          reader = reader.format("csv")
          reader = reader.option("inferSchema", "true")
          reader = reader.option("header", "true")
          table_c = reader.load("c.csv")
          table_a = reader.load("a.csv")
          table_b = reader.load("b.csv")

          # IDs of the rows in table_a whose some_code is null.
          primary_filter = fx.col("some_code").isNull()
          new_primary_data = table_a.filter(primary_filter)
          new_ids = new_primary_data.select("some_id")

          # Inner join: IDs present in both table_b and the filtered table_a.
          new_data = table_b.join(new_ids, "some_id")
          new_data = new_data.select("some_id")

          # Left join keeps every row of table_c; output is written as one partition.
          result = table_c.join(new_data, "some_id", "left")
          result.repartition(1).write.json("results.json", mode="overwrite")
      if __name__ == "__main__":
          main()

      Our code isn't anything out of the ordinary, just some filters, selects and joins.

      The input data consists of 3 CSV files, roughly 2.6GB in total uncompressed. I tried reducing the number of rows in the CSV input files, but the warning then no longer appeared, so the full files are needed to reproduce it. After compressing the files I was able to attach them below.
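
For anyone who wants to find the row count at which the warning starts to appear, the inputs can be cut down step by step. The sketch below is not the procedure used for this report; `truncate_csv` is a hypothetical helper, and it assumes each record sits on a single line (no quoted fields containing newlines).

```python
import itertools


def truncate_csv(src_path, dst_path, max_rows):
    """Copy the header line plus at most max_rows data lines from src_path to dst_path.

    Assumes one record per line. Returns the number of data lines written.
    """
    written = 0
    with open(src_path, "r", newline="") as src, open(dst_path, "w", newline="") as dst:
        lines = iter(src)
        # The first line is the header (the reader above uses header=true).
        dst.write(next(lines, ""))
        for line in itertools.islice(lines, max_rows):
            dst.write(line)
            written += 1
    return written
```

Calling, for example, `truncate_csv("a.csv", "a_small.csv", 100000)` and pointing the test program at the smaller file allows the threshold to be narrowed down by repeated halving.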


        1. c.csv.zip
          35.26 MB
          Luke Richter
        2. b.csv.zip
          18.87 MB
          Luke Richter
        3. a.csv.zip
          48.82 MB
          Luke Richter



            Assignee: Unassigned
            Reporter: Luke Richter (ltrichter)
            Votes: 0
            Watchers: 5

