Description
In ListingFileCatalog, the implementation of listLeafFiles is shown below. When the number of user-provided paths is less than sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold, parallel listing is never used. This differs from the 1.6 behavior: in 1.6, if the number of children of any inner directory exceeds the threshold, parallel listing is used for that directory. As a result, a query over a single root path containing thousands of partition directories is listed entirely on the driver in 2.0, whereas 1.6 would have distributed the listing.
protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
  if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
    HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
  } else {
    // Dummy jobconf to get to the pathFilter defined in configuration
    val jobConf = new JobConf(hadoopConf, this.getClass)
    val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
    val statuses: Seq[FileStatus] = paths.flatMap { path =>
      val fs = path.getFileSystem(hadoopConf)
      logInfo(s"Listing $path on driver")
      Try {
        HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
      }.getOrElse(Array.empty[FileStatus])
    }
    mutable.LinkedHashSet(statuses: _*)
  }
}
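For contrast, below is a minimal sketch (not the actual patch) of the 1.6-style behavior: the threshold is checked against the fan-out at each level of the recursion, not only against the number of top-level input paths. The listInParallel parameter is a hypothetical stand-in for a distributed listing such as HadoopFsRelation.listLeafFilesInParallel; path-filter and error handling are omitted for brevity.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Sketch only: recurse directory by directory, switching to the parallel
// path whenever the current level has at least `threshold` directories to
// scan. `listInParallel` is an assumed stand-in, not Spark's API.
def listLeafFilesRecursively(
    paths: Seq[Path],
    hadoopConf: Configuration,
    threshold: Int,
    listInParallel: Seq[Path] => Seq[FileStatus]): Seq[FileStatus] = {
  if (paths.length >= threshold) {
    // Too many directories at this level: distribute the listing.
    listInParallel(paths)
  } else {
    paths.flatMap { path =>
      val fs = path.getFileSystem(hadoopConf)
      val (dirs, files) = fs.listStatus(path).partition(_.isDirectory)
      // Recurse into child directories; a wide inner directory will
      // trip the threshold check in the recursive call.
      files ++ listLeafFilesRecursively(
        dirs.map(_.getPath), hadoopConf, threshold, listInParallel)
    }
  }
}

With this shape, even a single root path whose inner directories have a large fan-out would trigger parallel listing, which is what the 2.0 code above fails to do.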
Issue Links
- is broken by: SPARK-14959 Problem Reading partitioned ORC or Parquet files (Resolved)
- is related to: SPARK-16737 ListingFileCatalog comments about RPC calls in object store isn't correct (Resolved)
- links to: User 'yhuai' has created a pull request for this issue:
  https://github.com/apache/spark/pull/13830