Spark / SPARK-33507 Improve and fix cache behavior in v1 and v2 / SPARK-33729

When refreshing cache, Spark should not use cached plan when recaching data


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, when a cache is refreshed, e.g. via the "REFRESH TABLE" command, Spark calls the refreshTable method within CatalogImpl:

        override def refreshTable(tableName: String): Unit = {
          val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
          val tableMetadata = sessionCatalog.getTempViewOrPermanentTableMetadata(tableIdent)
          val table = sparkSession.table(tableIdent)
      
          if (tableMetadata.tableType == CatalogTableType.VIEW) {
            // Temp or persistent views: refresh (or invalidate) any metadata/data cached
            // in the plan recursively.
            table.queryExecution.analyzed.refresh()
          } else {
            // Non-temp tables: refresh the metadata cache.
            sessionCatalog.refreshTable(tableIdent)
          }
      
          // If this table is cached as an InMemoryRelation, drop the original
          // cached version and make the new version cached lazily.
          val cache = sparkSession.sharedState.cacheManager.lookupCachedData(table)
      
          // uncache the logical plan.
          // note this is a no-op for the table itself if it's not cached, but will invalidate all
          // caches referencing this table.
          sparkSession.sharedState.cacheManager.uncacheQuery(table, cascade = true)
      
          if (cache.nonEmpty) {
            // save the cache name and cache level for recreation
            val cacheName = cache.get.cachedRepresentation.cacheBuilder.tableName
            val cacheLevel = cache.get.cachedRepresentation.cacheBuilder.storageLevel
      
            // recache with the same name and cache level.
            sparkSession.sharedState.cacheManager.cacheQuery(table, cacheName, cacheLevel)
          }
        }
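
      For reference, this path is reached through both the SQL command and the public Catalog API. A minimal example (assuming spark is the active SparkSession and my_table is just a placeholder table name):

          // Both of these end up invoking CatalogImpl.refreshTable for the table.
          spark.sql("REFRESH TABLE my_table")     // SQL command
          spark.catalog.refreshTable("my_table")  // Catalog API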
      

      Note that the table is created before the table relation cache is cleared, and is then used later in cacheQuery. This is incorrect since it still refers to the cached table relation, which could be stale.
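
      One way to avoid the stale reference (a minimal sketch only, not necessarily the change that was merged; freshTable is a hypothetical local) is to re-resolve the table after the relation cache has been refreshed and the old cache entry dropped, and build the new cache entry from that fresh plan:

          if (cache.nonEmpty) {
            // save the cache name and cache level for recreation
            val cacheName = cache.get.cachedRepresentation.cacheBuilder.tableName
            val cacheLevel = cache.get.cachedRepresentation.cacheBuilder.storageLevel

            // Re-resolve the table instead of reusing the `table` DataFrame that was
            // captured before the relation cache was refreshed, so cacheQuery is built
            // from a fresh logical plan rather than the stale cached relation.
            val freshTable = sparkSession.table(tableIdent)
            sparkSession.sharedState.cacheManager.cacheQuery(freshTable, cacheName, cacheLevel)
          }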

          People

            Assignee: Chao Sun (csun)
            Reporter: Chao Sun (csun)