  Spark / SPARK-29035

unpersist() ignoring cache/persist()

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: Amazon EMR - Spark 2.4.3

    Description

      Calling unpersist() on a cached DataFrame, even after the DataFrame is no longer used directly, removes every InMemoryTableScan node from the DAG.

      Here's a simplified version of the code I'm using:

      df = spark.read(...).where(...).cache()
      df_a = union(df.select(...), df.select(...), df.select(...))
      df_b = df.select(...)
      df_c = df.select(...)
      df_d = df.select(...)
      df.unpersist()
      join(df_a, df_b, df_c, df_d).write()
      

      I've created an album with the two DAGs, with and without the unpersist() call.

      I call unpersist() to prevent an OOM during the join. From what I understand, even though all the DataFrames are derived from df, unpersisting df after the selects shouldn't cause the cache() call to be ignored, right?
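
The behaviour described above is consistent with Spark's lazy evaluation: cache() only marks the plan for caching, and nothing is materialized until an action such as write() runs. If unpersist() has already been called by then, the cached data is no longer registered when the query is planned, so no InMemoryTableScan appears. A minimal sketch of one way to keep the ordering safe, assuming a hypothetical helper named `cached` (not part of Spark) that relies only on DataFrame.cache() and DataFrame.unpersist():

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Hypothetical helper: keep `df` cached for the duration of a `with` block."""
    df.cache()  # lazy: marks the plan for caching, nothing is computed yet
    try:
        yield df
    finally:
        # Runs only after every action inside the block has finished
        # (or failed), so the cache is still registered when they execute.
        df.unpersist()


# Usage sketch, with the elisions from the report left as-is:
#
# with cached(spark.read(...).where(...)) as df:
#     df_a = union(df.select(...), df.select(...), df.select(...))
#     ...
#     join(df_a, df_b, df_c, df_d).write()   # action runs while df is cached
```

Note that this keeps df cached while the join runs, so it does not by itself reduce memory pressure; it only guarantees that the write happens before the cache is dropped.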

            People

              Assignee: Unassigned
              Reporter: Jose Silva (jose.lima.silva)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: 2h
                  Remaining: 2h
                  Logged: Not Specified