Spark / SPARK-15915

CacheManager should use canonicalized plan for planToCache.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      A table cached with cacheTable is not picked up by a DataFrame whose plan overrides sameResult without comparing against the canonicalized plan.

      For example:

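          import org.apache.spark.sql.execution.columnar.InMemoryRelation
          import spark.implicits._  // for toDF(); assumes a SparkSession named `spark`

          val localRelation = Seq(1, 2, 3).toDF()
          localRelation.createOrReplaceTempView("localRelation")

          // Cache through the catalog, i.e. through the temp view's plan.
          spark.catalog.cacheTable("localRelation")
          // The DataFrame's own plan is expected to be served from the cache.
          assert(
            localRelation.queryExecution.withCachedData.collect {
              case i: InMemoryRelation => i
            }.size == 1)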
      

      This fails with:

      ArrayBuffer() had size 0 instead of expected size 1
      

      The reason is that spark.catalog.cacheTable("localRelation") makes CacheManager cache the plan wrapped in SubqueryAlias, whereas when the DataFrame localRelation is planned, CacheManager looks up the cached data using the unwrapped plan, because the DataFrame's own plan is not wrapped in SubqueryAlias.
      Some plans, such as LocalRelation and LogicalRDD, override the sameResult method but do not compare against the canonicalized plan, so CacheManager cannot detect that the two plans are the same.
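
      The mismatch described above can be reproduced without Spark in a small toy model. This is only a sketch: Plan, Local, Alias, ToyCache and ToyCacheDemo are hypothetical stand-ins invented for illustration, not Spark's classes, and the `canonicalize` flag merely imitates the idea in the issue title of storing the canonicalized plan as the plan to cache.

```scala
// Toy model only: none of these types are Spark classes.
sealed trait Plan {
  // Canonical form of the plan; by default a plan is its own canonical form.
  def canonicalized: Plan = this
  // Default comparison goes through the canonical forms of both sides.
  def sameResult(other: Plan): Boolean = canonicalized == other.canonicalized
}

// Stand-in for a leaf plan that overrides sameResult and matches on the
// raw argument instead of its canonical form.
final case class Local(data: Seq[Int]) extends Plan {
  override def sameResult(other: Plan): Boolean = other match {
    case Local(otherData) => data == otherData
    case _                => false // an Alias wrapper always ends up here
  }
}

// Stand-in for an alias wrapper: it does not change the result,
// so its canonical form is that of its child.
final case class Alias(name: String, child: Plan) extends Plan {
  override def canonicalized: Plan = child.canonicalized
}

// Hypothetical cache: optionally canonicalizes the plan before storing it.
final class ToyCache(canonicalize: Boolean) {
  private val cached = scala.collection.mutable.ArrayBuffer.empty[Plan]

  def cache(plan: Plan): Unit =
    cached += (if (canonicalize) plan.canonicalized else plan)

  def lookup(plan: Plan): Option[Plan] =
    cached.find(c => plan.sameResult(c))
}

object ToyCacheDemo {
  private val data = Local(Seq(1, 2, 3))
  private val viaView = Alias("localRelation", data) // what cacheTable would see

  // Stores the wrapped plan as-is, then looks up with the unwrapped plan.
  val withoutCanonicalization: Option[Plan] = {
    val cache = new ToyCache(canonicalize = false)
    cache.cache(viaView)
    cache.lookup(data)
  }

  // Stores the canonicalized plan, then looks up with the unwrapped plan.
  val withCanonicalization: Option[Plan] = {
    val cache = new ToyCache(canonicalize = true)
    cache.cache(viaView)
    cache.lookup(data)
  }
}
```

      In this model the lookup misses when the wrapped plan is stored unchanged, because Local.sameResult never unwraps the Alias, and it hits once the stored plan is canonicalized first. Making the override itself compare against other.canonicalized would be another way to get the same effect in the toy model.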

          People

            Assignee: Takuya Ueshin (ueshin)
            Reporter: Takuya Ueshin (ueshin)