Spark / SPARK-24613

Cache with UDF could not be matched with subsequent dependent caches

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.2, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      When caching a query, we generate its execution plan from the query's logical plan. However, the logical plan we get from the Dataset has already been analyzed, and when we try to get the execution plan, this already-analyzed logical plan is analyzed again in the new QueryExecution object. Unfortunately, some rules have side effects when applied more than once; in this case the culprit is the HandleNullInputsForUDF rule. The re-analyzed plan now has an extra null check and can no longer be matched against the same plan. The following test would fail because df2's execution plan inside the CacheManager does not depend on df.

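      // Assumes a Spark SQL test suite that provides `spark`, the `$"..."` implicits,
      // `udf` and `sum` from org.apache.spark.sql.functions, and ScalaTest's `failAfter`.
      test("cache UDF result correctly 2") {
        // Each UDF call sleeps for 10 seconds, so any re-evaluation is easy to detect.
        val expensiveUDF = udf({x: Int => Thread.sleep(10000); x})
        val df = spark.range(0, 10).toDF("a").withColumn("b", expensiveUDF($"a"))
        val df2 = df.agg(sum(df("b")))
      
        // Cache and materialize df first, then cache df2, which is built on top of df.
        df.cache()
        df.count()
        df2.cache()
      
        // udf has been evaluated during caching, and thus should not be re-evaluated here
        failAfter(5 seconds) {
          df2.collect()
        }
      }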
      

      While it might be worth revisiting such analysis rules, we can also fix the CacheManager to avoid these potential problems.
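
      The mismatch described above can be illustrated with a toy model. This is only a sketch: `Expr`, `Col`, `Udf`, `NullGuard` and `addNullGuards` are hypothetical stand-ins invented for illustration, not Spark's Catalyst classes or the actual HandleNullInputsForUDF implementation. It assumes only that the rule wraps a UDF call in a null check each time it runs, and that cached plans are looked up by structural equality.

```scala
// Toy model only: these types are hypothetical and are not Spark's Catalyst classes.
sealed trait Expr
final case class Col(name: String) extends Expr
final case class Udf(name: String, arg: Expr) extends Expr
final case class NullGuard(checked: Expr, body: Expr) extends Expr

object ReanalysisSketch {
  // A rule with a side effect on repeated application: it wraps every Udf
  // in a NullGuard without checking whether a guard is already present.
  def addNullGuards(e: Expr): Expr = e match {
    case Udf(n, arg)        => NullGuard(arg, Udf(n, addNullGuards(arg)))
    case NullGuard(c, body) => NullGuard(c, addNullGuards(body))
    case other              => other
  }

  def main(args: Array[String]): Unit = {
    val raw = Udf("expensiveUDF", Col("a"))
    val analyzedOnce = addNullGuards(raw)           // plan as held by the Dataset
    val analyzedTwice = addNullGuards(analyzedOnce) // plan after being analyzed again

    // A cache keyed by the once-analyzed plan, compared by structural equality.
    val cache = Map[Expr, String](analyzedOnce -> "cached data for df")

    println(cache.contains(analyzedOnce))  // true: same plan, cache hit
    println(cache.contains(analyzedTwice)) // false: extra NullGuard, cache miss
  }
}
```

      In this sketch the lookup succeeds only when the plan is not pushed through the rule a second time, which is the kind of problem a CacheManager-side fix would sidestep.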

          People

            Assignee: Wei Xue (maryannxue)
            Reporter: Wei Xue (maryannxue)