Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.3.2, 2.4.4, 3.0.0
Fix Version/s: None
Component/s: None
Description
Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit().
According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), managed memory leaks should only be caused by not reading an iterator to completion, e.g. by calling take() or limit().
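For context, the pattern SPARK-14168 attributes the warning to looks like the following sketch (the session setup and file name here are placeholders, not part of our code):

import pyspark.sql

spark = pyspark.sql.SparkSession.builder.appName("leak-sketch").getOrCreate()
df = spark.read.format("csv").option("header", "true").load("a.csv")
# take() stops consuming each scanned partition's iterator early, which
# can leave task-managed memory unreleased and log the same warning
rows = df.take(5)
spark.stop()

Our program contains no such call.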
Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118"
The reported leak size is always the same: 2097152 bytes (2 MiB).
I have created a minimal test program that reproduces the warning:
import pyspark.sql
import pyspark.sql.functions as fx


def main():
    builder = pyspark.sql.SparkSession.builder
    builder = builder.appName("spark-jira")
    spark = builder.getOrCreate()

    reader = spark.read
    reader = reader.format("csv")
    reader = reader.option("inferSchema", "true")
    reader = reader.option("header", "true")

    table_c = reader.load("c.csv")
    table_a = reader.load("a.csv")
    table_b = reader.load("b.csv")

    primary_filter = fx.col("some_code").isNull()
    new_primary_data = table_a.filter(primary_filter)
    new_ids = new_primary_data.select("some_id")

    new_data = table_b.join(new_ids, "some_id")
    new_data = new_data.select("some_id")

    result = table_c.join(new_data, "some_id", "left")
    result.repartition(1).write.json("results.json", mode="overwrite")
    spark.stop()


if __name__ == "__main__":
    main()
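The script can be run as-is with spark-submit (e.g. spark-submit repro.py, where repro.py is just a placeholder file name); the warnings appear in the executor logs while the job runs.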
Our code isn't anything out of the ordinary, just some filters, selects and joins.
The input data consists of 3 CSV files, roughly 2.6 GB in total uncompressed. When I reduced the number of rows in the input files, the warning no longer appeared, so the size seems to matter. After compressing the files I was able to attach them below.
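For anyone without access to the attachments, input of a similar shape can be generated synthetically. The column names below match the repro script, but the row count, value ranges, and null ratio are guesses on my part and may need tuning upward before the warning reproduces:

import csv
import random

# Hypothetical generator for inputs shaped like the attached files.
# Column names come from the repro script; ROWS is a guess and may
# need to grow until the files approach the ~2.6 GB total.
ROWS = 10_000_000


def write_csv(path, header, make_row):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(ROWS):
            writer.writerow(make_row())


# None is written as an empty field, which Spark's CSV reader
# loads as null, so the some_code isNull() filter has matches.
write_csv("a.csv", ["some_id", "some_code"],
          lambda: [random.randrange(1_000_000),
                   random.choice([None, "X", "Y"])])
write_csv("b.csv", ["some_id"], lambda: [random.randrange(1_000_000)])
write_csv("c.csv", ["some_id"], lambda: [random.randrange(1_000_000)])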