Spark / SPARK-15915

CacheManager should use canonicalized plan for planToCache.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      cacheTable has no visible effect for a DataFrame whose logical plan overrides sameResult but does not compare canonicalized plans: the cached data is never found when the DataFrame is planned.

      For example:

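          import org.apache.spark.sql.execution.columnar.InMemoryRelation
          import spark.implicits._  // for toDF(); assumes `spark` is a SparkSession val

          val localRelation = Seq(1, 2, 3).toDF()
          localRelation.createOrReplaceTempView("localRelation")
      
          // Cache through the catalog, i.e. via the temp view name.
          spark.catalog.cacheTable("localRelation")
          // The DataFrame's own plan is expected to be served from the cache.
          assert(
            localRelation.queryExecution.withCachedData.collect {
              case i: InMemoryRelation => i
            }.size == 1)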
      

      This assertion fails with:

      ArrayBuffer() had size 0 instead of expected size 1
      

      The cause is a mismatch between the plan that is cached and the plan that is looked up. spark.catalog.cacheTable("localRelation") makes CacheManager cache the plan wrapped in SubqueryAlias, whereas planning the DataFrame localRelation makes CacheManager look up the unwrapped plan, because the DataFrame's own plan carries no SubqueryAlias.
      Some plans, such as LocalRelation and LogicalRDD, override the sameResult method without comparing canonicalized plans, so CacheManager cannot detect that the two plans are the same.
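
      The mismatch can be illustrated with a small toy model. This is only a sketch: Plan, LocalRelation, SubqueryAlias and ToyCacheManager below are simplified stand-ins written for this illustration, not Spark's classes, and the canonicalizeOnCache flag is a hypothetical switch that mimics the direction suggested by the issue title (canonicalizing the plan before it is cached). The actual patch may differ.

```scala
// Toy model only: these types imitate the shape of Spark's logical plans
// but are NOT Spark's classes or its real implementation.
sealed trait Plan {
  // Canonical form: drops wrappers that do not affect the result.
  def canonicalized: Plan = this
  // Default comparison goes through the canonical form of both sides.
  def sameResult(other: Plan): Boolean = canonicalized == other.canonicalized
}

final case class LocalRelation(rows: Seq[Int]) extends Plan {
  // Problematic override: inspects `other` directly, so a wrapped copy of
  // the same relation is not recognized as equivalent.
  override def sameResult(other: Plan): Boolean = other match {
    case LocalRelation(otherRows) => rows == otherRows
    case _                        => false
  }
}

final case class SubqueryAlias(alias: String, child: Plan) extends Plan {
  // The alias itself does not change the result, so it is stripped.
  override def canonicalized: Plan = child.canonicalized
}

// Hypothetical cache that finds entries via sameResult.
final class ToyCacheManager(canonicalizeOnCache: Boolean) {
  private var cached = List.empty[Plan]

  def cache(plan: Plan): Unit = {
    val stored = if (canonicalizeOnCache) plan.canonicalized else plan
    cached = stored :: cached
  }

  def lookup(plan: Plan): Option[Plan] = cached.find(c => plan.sameResult(c))
}

object Demo extends App {
  val relation = LocalRelation(Seq(1, 2, 3))
  // Caching by view name sees the relation wrapped in an alias.
  val viewPlan = SubqueryAlias("localRelation", relation)

  // Without canonicalization the unwrapped lookup misses the cached entry.
  val before = new ToyCacheManager(canonicalizeOnCache = false)
  before.cache(viewPlan)
  assert(before.lookup(relation).isEmpty)

  // Storing the canonicalized plan lets the same lookup succeed.
  val after = new ToyCacheManager(canonicalizeOnCache = true)
  after.cache(viewPlan)
  assert(after.lookup(relation).isDefined)
}
```

      In this model the lookup is driven by the unwrapped plan's own sameResult, which is why the override in LocalRelation decides the outcome regardless of how SubqueryAlias would compare.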

            People

            • Assignee: Takuya Ueshin (ueshin)
            • Reporter: Takuya Ueshin (ueshin)
            • Votes: 0
            • Watchers: 4