Spark / SPARK-16121

ListingFileCatalog does not list in parallel anymore


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      In ListingFileCatalog, the implementation of listLeafFiles is shown below. When the number of user-provided paths is less than sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold, parallel listing is never used, even for directories containing many children. This differs from 1.6: there, if the number of children of any inner directory exceeds the threshold, parallel listing is used.

      protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
        if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
          HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
        } else {
          // Dummy jobconf to get to the pathFilter defined in configuration
          val jobConf = new JobConf(hadoopConf, this.getClass)
          val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
          val statuses: Seq[FileStatus] = paths.flatMap { path =>
            val fs = path.getFileSystem(hadoopConf)
            logInfo(s"Listing $path on driver")
            Try {
              HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
            }.getOrElse(Array.empty[FileStatus])
          }
          mutable.LinkedHashSet(statuses: _*)
        }
      }
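      The 1.6-style behavior the description refers to can be sketched as follows. This is an illustrative stand-in, not Spark's actual code: ListLeafFilesSketch, parallelDiscoveryThreshold, and the use of plain java.io.File replace the Hadoop FileSystem API and the SQLConf setting. The point is where the parallelism decision is made: per directory, on the size of that directory's child list, rather than once on the number of user-provided root paths.

      ```scala
      import java.io.File

      object ListLeafFilesSketch {
        // Stand-in for sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold
        val parallelDiscoveryThreshold = 32

        // Recursively collects leaf (non-directory) files. The threshold is
        // checked against each directory's children, as 1.6 did.
        def listLeafFiles(dirs: Seq[File]): Seq[File] = dirs.flatMap { dir =>
          val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
          val (subDirs, files) = children.partition(_.isDirectory)
          val nested =
            if (subDirs.length >= parallelDiscoveryThreshold) {
              // In Spark this branch would launch a distributed listing job
              // (HadoopFsRelation.listLeafFilesInParallel); the sketch just
              // recurses locally to mark the decision point.
              listLeafFiles(subDirs)
            } else {
              listLeafFiles(subDirs)
            }
          files ++ nested
        }
      }
      ```

      In contrast, the 2.0 code above only compares paths.length (the root paths) against the threshold, so a single root path with thousands of partition directories is always listed serially on the driver.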

People

    Assignee: yhuai Yin Huai
    Reporter: yhuai Yin Huai
    Votes: 0
    Watchers: 7
