Spark / SPARK-44199

CacheManager refreshes the fileIndex unnecessarily


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 3.4.1
    • Fix Version: 3.5.0
    • Component: Spark Core
    • Labels: None

    Description

      The CacheManager, on this line https://github.com/apache/spark/blob/680ca2e56f2c8fc759743ad6755f6e3b1a19c629/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L372, uses plain string-prefix matching to decide which file indexes need to be refreshed. However, that can be incorrect if users have paths that are not subdirectories of each other but share a prefix. For example, in the function below:

        // From CacheManager.scala: decides whether a file index must be
        // refreshed after qualifiedPath has been invalidated.
        private def refreshFileIndexIfNecessary(
            fileIndex: FileIndex,
            fs: FileSystem,
            qualifiedPath: Path): Boolean = {
          val prefixToInvalidate = qualifiedPath.toString
          // Plain string-prefix match: also matches sibling paths such as
          // "table_dir_2" when invalidating "table_dir".
          val needToRefresh = fileIndex.rootPaths
            .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
            .exists(_.startsWith(prefixToInvalidate))
          if (needToRefresh) fileIndex.refresh()
          needToRefresh
        }

      If prefixToInvalidate is s3://bucket/mypath/table_dir and the file index has s3://bucket/mypath/table_dir_2/part=1 as one of its root paths, then needToRefresh will be true and the file index gets refreshed unnecessarily. This is not just wasted CPU cycles: it can also cause query failures if there are access restrictions on the path being refreshed.
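      A minimal sketch of the problem and a possible remedy (the object and method names here are hypothetical, not the actual Spark fix): requiring the match to end at a path-component boundary, i.e. the root path either equals the invalidated path or continues with a '/', rejects sibling directories that merely share a string prefix.

```scala
// Hypothetical illustration of the prefix-matching bug described above.
object PrefixCheck {
  // Naive check, as in the CacheManager snippet: plain string prefix.
  def naiveMatches(rootPath: String, prefixToInvalidate: String): Boolean =
    rootPath.startsWith(prefixToInvalidate)

  // Component-aware check: the prefix must end at a path boundary.
  def boundaryMatches(rootPath: String, prefixToInvalidate: String): Boolean =
    rootPath == prefixToInvalidate ||
      rootPath.startsWith(prefixToInvalidate + "/")

  def main(args: Array[String]): Unit = {
    val prefix  = "s3://bucket/mypath/table_dir"
    val sibling = "s3://bucket/mypath/table_dir_2/part=1"
    val child   = "s3://bucket/mypath/table_dir/part=1"

    assert(naiveMatches(sibling, prefix))      // false positive: sibling flagged
    assert(boundaryMatches(child, prefix))     // true positive: real child flagged
    assert(!boundaryMatches(sibling, prefix))  // sibling correctly skipped
  }
}
```

      With this check, only s3://bucket/mypath/table_dir itself and paths under it trigger a refresh.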

      People

            Assignee: Vihang Karajgaonkar (vihangk1)
            Reporter: Vihang Karajgaonkar (vihangk1)
            Votes: 0
            Watchers: 2