
Verify the SPARK-24613 Cache with UDF could not be matched with subsequent dependent caches

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL, Tests

    Description

      Verify whether recacheByCondition works correctly when the cached data involves a UDF. This is a follow-up of https://github.com/apache/spark/pull/21602

      Attachments

        Activity

          nafshartous Nick Afshartous added a comment - smilegator Can you please comment on whether this task is still relevant? If so, I'd like to look into it, provided you could elaborate on what "works well" means in the description.
          EveLiao Aoyuan Liao added a comment -

          smilegator I think recacheByCondition doesn't keep the cached plan. The following test fails:

          test("SPARK-24613 Cache with UDF could not be matched with subsequent dependent caches") {
            val udf1 = udf((x: Int) => x + 1)
            val df = spark.range(0, 10).toDF("a").withColumn("b", udf1($"a"))
            val df2 = df.agg(sum(df("b")))
            df.cache()
            df.count()
            df2.cache()

            df.unpersist() // recacheByCondition is called within unpersist

            // df2 should still be recognized as cached
            val plan = df2.queryExecution.withCachedData
            assert(plan.isInstanceOf[InMemoryRelation])

            // and its cached plan should scan the in-memory data via InMemoryTableScanExec
            val internalPlan = plan.asInstanceOf[InMemoryRelation].cacheBuilder.cachedPlan
            assert(internalPlan.find(_.isInstanceOf[InMemoryTableScanExec]).isDefined)
          }
          

          The second assertion fails, which means that the data is cached while the plan is not.

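          For reference, a minimal standalone sketch (not from the issue; it assumes a local Spark 3.x build with spark-sql on the classpath, and the object/app name is hypothetical) of how one might reproduce the scenario outside the test suite and inspect whether df2's cached plan still reuses the in-memory data after df.unpersist() triggers recacheByCondition:

          ```scala
          import org.apache.spark.sql.SparkSession
          import org.apache.spark.sql.execution.columnar.{InMemoryRelation, InMemoryTableScanExec}
          import org.apache.spark.sql.functions.{sum, udf}

          object Spark24613Repro {
            def main(args: Array[String]): Unit = {
              val spark = SparkSession.builder()
                .master("local[2]")
                .appName("SPARK-24613-repro")
                .getOrCreate()
              import spark.implicits._

              val udf1 = udf((x: Int) => x + 1)
              val df  = spark.range(0, 10).toDF("a").withColumn("b", udf1($"a"))
              val df2 = df.agg(sum(df("b")))

              df.cache()
              df.count()     // materialize df's cache
              df2.cache()    // df2's cached plan should reference df's cache

              // unpersist() invokes CacheManager.recacheByCondition on dependent caches
              df.unpersist()

              val plan = df2.queryExecution.withCachedData
              println(s"df2 recognized as cached: ${plan.isInstanceOf[InMemoryRelation]}")

              plan match {
                case imr: InMemoryRelation =>
                  val reusesCache = imr.cacheBuilder.cachedPlan
                    .find(_.isInstanceOf[InMemoryTableScanExec]).isDefined
                  println(s"cached plan contains InMemoryTableScanExec: $reusesCache")
                case _ => // df2 is no longer cached at all
              }

              spark.stop()
            }
          }
          ```

          Per the comment above, the second printout is the one expected to report false: the data is re-cached, but the rebuilt cached plan no longer matches the UDF-bearing cache.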

          People

            Assignee: Unassigned
            Reporter: Xiao Li (smilegator)
            Votes: 0
            Watchers: 5
