SPARK-15915: CacheManager should use canonicalized plan for planToCache.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      A DataFrame whose plan overrides sameResult without comparing canonicalized plans cannot be cached with cacheTable: the cached data is never found when the DataFrame is planned.

      For example:

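          // Assumes `spark` is the active SparkSession.
          import org.apache.spark.sql.execution.columnar.InMemoryRelation
          import spark.implicits._  // provides toDF()

          val localRelation = Seq(1, 2, 3).toDF()
          localRelation.createOrReplaceTempView("localRelation")

          spark.catalog.cacheTable("localRelation")
          // The plan with cached data substituted should contain exactly one InMemoryRelation.
          assert(
            localRelation.queryExecution.withCachedData.collect {
              case i: InMemoryRelation => i
            }.size == 1)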

      This fails with:

      ArrayBuffer() had size 0 instead of expected size 1
      

      The reason is that spark.catalog.cacheTable("localRelation") makes CacheManager cache the plan wrapped in SubqueryAlias, whereas planning the DataFrame localRelation makes CacheManager look up the unwrapped plan, because the DataFrame's own plan is not wrapped.
      Some plans, such as LocalRelation and LogicalRDD, override the sameResult method but do not compare against the canonicalized plan, so CacheManager cannot detect that the two plans are the same.
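
      The mismatch can be illustrated with a small toy model that does not depend on Spark. Every type and method below (Plan, SubqueryAlias, LocalRelation, lookupRaw, lookupCanonical) is a hypothetical stand-in that only borrows names from the description; it is a sketch of the idea in the issue title, not the actual CacheManager code or the actual fix.

```scala
// Toy model only: hypothetical stand-ins, not Spark's real classes.
sealed trait Plan {
  // Form with result-irrelevant wrappers removed; by default a plan is its own canonical form.
  def canonicalized: Plan = this
  // Default comparison canonicalizes both sides.
  def sameResult(other: Plan): Boolean = canonicalized == other.canonicalized
}

// An alias does not change the result, so it canonicalizes to its child.
final case class SubqueryAlias(alias: String, child: Plan) extends Plan {
  override def canonicalized: Plan = child.canonicalized
}

// Overrides sameResult and matches on the raw argument,
// so an alias-wrapped copy of the same data is not recognized.
final case class LocalRelation(data: Seq[Int]) extends Plan {
  override def sameResult(other: Plan): Boolean = other match {
    case LocalRelation(otherData) => otherData == data
    case _                        => false
  }
}

object CacheLookupDemo {
  // Lookup that compares against the cached plan exactly as it was stored.
  def lookupRaw(cached: Seq[Plan], query: Plan): Option[Plan] =
    cached.find(c => query.sameResult(c))

  // Lookup that canonicalizes the cached plan before comparing.
  def lookupCanonical(cached: Seq[Plan], query: Plan): Option[Plan] =
    cached.find(c => query.sameResult(c.canonicalized))

  def main(args: Array[String]): Unit = {
    val relation = LocalRelation(Seq(1, 2, 3))
    // The table is cached under its alias-wrapped plan ...
    val cached = Seq[Plan](SubqueryAlias("localRelation", relation))
    // ... but the DataFrame is looked up by its unwrapped plan.
    println(lookupRaw(cached, relation).isDefined)
    println(lookupCanonical(cached, relation).isDefined)
  }
}
```

      In this model the raw lookup misses because LocalRelation.sameResult sees a SubqueryAlias rather than a LocalRelation, while the canonicalizing lookup strips the alias first and finds the entry.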


          People

            Assignee: Takuya Ueshin (ueshin)
            Reporter: Takuya Ueshin (ueshin)