Spark / SPARK-44199

CacheManager refreshes the fileIndex unnecessarily


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 3.4.1
    • Fix Version: 3.5.0
    • Component: Spark Core
    • Labels: None

    Description

      The CacheManager, on this line https://github.com/apache/spark/blob/680ca2e56f2c8fc759743ad6755f6e3b1a19c629/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L372, uses plain string-prefix matching to decide which file indexes need to be refreshed. However, that can be incorrect if users have paths that are not subdirectories of each other but share a prefix. For example, in the function below:

        // From CacheManager.scala: decides whether a file index must be
        // refreshed after qualifiedPath has been invalidated.
        private def refreshFileIndexIfNecessary(
            fileIndex: FileIndex,
            fs: FileSystem,
            qualifiedPath: Path): Boolean = {
          val prefixToInvalidate = qualifiedPath.toString
          // Plain string-prefix match: also matches sibling paths such as
          // "table_dir_2" when invalidating "table_dir".
          val needToRefresh = fileIndex.rootPaths
            .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
            .exists(_.startsWith(prefixToInvalidate))
          if (needToRefresh) fileIndex.refresh()
          needToRefresh
        }

      If prefixToInvalidate is s3://bucket/mypath/table_dir and the file index has s3://bucket/mypath/table_dir_2/part=1 as one of its root paths, then needToRefresh will be true and the file index gets refreshed unnecessarily. This is not just wasted CPU cycles: it can also cause query failures if there are access restrictions on the path being refreshed.
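      A minimal sketch of the problem and a possible remedy (the object and method names here are hypothetical, not the actual Spark fix): requiring the match to end at a path-component boundary, i.e. the root path either equals the invalidated path or continues with a '/', rejects sibling directories that merely share a string prefix.

```scala
// Hypothetical illustration of the prefix-matching bug described above.
object PrefixCheck {
  // Naive check, as in the CacheManager snippet: plain string prefix.
  def naiveMatches(rootPath: String, prefixToInvalidate: String): Boolean =
    rootPath.startsWith(prefixToInvalidate)

  // Component-aware check: the prefix must end at a path boundary.
  def boundaryMatches(rootPath: String, prefixToInvalidate: String): Boolean =
    rootPath == prefixToInvalidate ||
      rootPath.startsWith(prefixToInvalidate + "/")

  def main(args: Array[String]): Unit = {
    val prefix  = "s3://bucket/mypath/table_dir"
    val sibling = "s3://bucket/mypath/table_dir_2/part=1"
    val child   = "s3://bucket/mypath/table_dir/part=1"

    assert(naiveMatches(sibling, prefix))      // false positive: sibling flagged
    assert(boundaryMatches(child, prefix))     // true positive: real child flagged
    assert(!boundaryMatches(sibling, prefix))  // sibling correctly skipped
  }
}
```

      With this check, only s3://bucket/mypath/table_dir itself and paths under it trigger a refresh.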

      People

            Assignee: Vihang Karajgaonkar (vihangk1)
            Reporter: Vihang Karajgaonkar (vihangk1)
            Votes: 0
            Watchers: 2