Details
- Bug
- Status: Resolved
- Major
- Resolution: Not A Problem
- 2.4.3
- None
- None
- Amazon EMR - Spark 2.4.3
Description
Calling unpersist() on a cached DataFrame, even one that is not used afterwards, removes every InMemoryTableScan from the DAG.
Here's a simplified version of the code I'm using:
df = spark.read(...).where(...).cache()
df_a = union(df.select(...), df.select(...), df.select(...))
df_b = df.select(...)
df_c = df.select(...)
df_d = df.select(...)
df.unpersist()
join(df_a, df_b, df_c, df_d).write()
I've created an album with the two DAGs, with and without the unpersist() call.
I call unpersist() to prevent OOM during the join. As I understand it, even though all the DataFrames derive from df, unpersisting df after the selects shouldn't cause the cache() call to be ignored, right?