Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.4.1
- None
Description
The CacheManager on this line https://github.com/apache/spark/blob/680ca2e56f2c8fc759743ad6755f6e3b1a19c629/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L372 uses a prefix based matching to decide which file index needs to be refreshed. However, that can be incorrect if the users have paths which are not subdirectories but share prefixes. For example, in the function below:
private def refreshFileIndexIfNecessary(
    fileIndex: FileIndex,
    fs: FileSystem,
    qualifiedPath: Path): Boolean = {
  val prefixToInvalidate = qualifiedPath.toString
  val needToRefresh = fileIndex.rootPaths
    .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
    .exists(_.startsWith(prefixToInvalidate))
  if (needToRefresh) fileIndex.refresh()
  needToRefresh
}
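The false positive can be reproduced with plain strings, using the same illustrative paths as in the example below (these are made-up paths, not real data):

```scala
object PrefixMatchDemo extends App {
  val prefixToInvalidate = "s3://bucket/mypath/table_dir"
  val siblingRootPath    = "s3://bucket/mypath/table_dir_2/part=1"

  // startsWith matches even though table_dir_2 is a sibling directory,
  // not a subdirectory of table_dir, so a file index rooted at this
  // path would be refreshed spuriously.
  val needToRefresh = siblingRootPath.startsWith(prefixToInvalidate)
  println(needToRefresh)
}
```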
If prefixToInvalidate is s3://bucket/mypath/table_dir and the file index has s3://bucket/mypath/table_dir_2/part=1 among its root paths, needToRefresh will be true and the file index is refreshed unnecessarily. This costs more than wasted CPU cycles: if access to the path being refreshed is restricted, the spurious refresh can fail the query.
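One way to avoid the false positive is to match only on a whole path component boundary, i.e. require the candidate path to equal the prefix or to continue with a path separator. This is a sketch of that idea, not the actual fix that was merged; isSubPath is a hypothetical helper name, and it assumes the qualified prefix does not already end with "/" (true for non-root Hadoop paths):

```scala
object SubPathDemo extends App {
  // Treat `path` as needing a refresh only if it IS the prefix or lies
  // strictly under it (next character after the prefix is a separator).
  def isSubPath(prefix: String, path: String): Boolean =
    path == prefix || path.startsWith(prefix + "/")

  // The sibling directory from the report no longer matches...
  val sibling = isSubPath("s3://bucket/mypath/table_dir",
                          "s3://bucket/mypath/table_dir_2/part=1")
  // ...while a genuine subdirectory still does.
  val child   = isSubPath("s3://bucket/mypath/table_dir",
                          "s3://bucket/mypath/table_dir/part=1")
  println((sibling, child))
}
```

With this check, .exists(_.startsWith(prefixToInvalidate)) in refreshFileIndexIfNecessary would become .exists(isSubPath(prefixToInvalidate, _)).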