
SPARK-16121: ListingFileCatalog does not list in parallel anymore

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      In ListingFileCatalog, the implementation of listLeafFiles is shown below. When the number of user-provided paths is less than the value of sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold, we do not use parallel listing at all, which is different from what 1.6 does (in 1.6, if the number of children of any inner directory is larger than the threshold, we use the parallel listing).

      protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
        if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
          HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession)
        } else {
          // Dummy jobconf to get to the pathFilter defined in configuration
          val jobConf = new JobConf(hadoopConf, this.getClass)
          val pathFilter = FileInputFormat.getInputPathFilter(jobConf)
          val statuses: Seq[FileStatus] = paths.flatMap { path =>
            val fs = path.getFileSystem(hadoopConf)
            logInfo(s"Listing $path on driver")
            Try {
              HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter)
            }.getOrElse(Array.empty[FileStatus])
          }
          mutable.LinkedHashSet(statuses: _*)
        }
      }
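
      For comparison, the following is a minimal sketch (not the actual Spark code) of the 1.6-style behavior the description refers to: the recursion checks the child count of every directory it visits, so a single wide inner directory is enough to trigger distributed listing. Here listInParallel is a hypothetical stand-in for HadoopFsRelation.listLeafFilesInParallel.

      import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

      // Sketch only: walk the tree on the driver, but hand a subtree off to a
      // distributed listing as soon as any directory has at least `threshold`
      // children, which is roughly what 1.6's HadoopFsRelation.listLeafFiles did.
      def listLeafFilesLikeOneSix(
          fs: FileSystem,
          status: FileStatus,
          threshold: Int,
          listInParallel: Seq[Path] => Seq[FileStatus]): Seq[FileStatus] = {
        if (!status.isDirectory) {
          Seq(status)
        } else {
          val children = fs.listStatus(status.getPath)
          if (children.length >= threshold) {
            // A wide inner directory triggers the parallel (distributed) listing.
            listInParallel(children.map(_.getPath))
          } else {
            // Otherwise keep recursing on the driver.
            children.toSeq.flatMap(child =>
              listLeafFilesLikeOneSix(fs, child, threshold, listInParallel))
          }
        }
      }

      The threshold in both snippets comes from the SQL conf behind parallelPartitionDiscoveryThreshold (spark.sql.sources.parallelPartitionDiscovery.threshold); with the 2.0 code above, only the count of top-level input paths is ever compared against it, so wide inner directories are always listed serially on the driver.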

          Activity

            apachespark Apache Spark added a comment -

            User 'yhuai' has created a pull request for this issue:
            https://github.com/apache/spark/pull/13830

            smilegator Xiao Li added a comment -

            I also saw this, but I thought it was by design. : )

            lian cheng Cheng Lian added a comment -

            Issue resolved by pull request 13830
            https://github.com/apache/spark/pull/13830

            mengxr Xiangrui Meng added a comment -

            Changed the fix versions to 2.0.1 and 2.1.0 since 2.0.0-RC1 is in vote.

            gaurav24 Gaurav Shah added a comment -

            mengxr, was this fixed in 2.0.0 or is it planned for 2.0.1? My partition discovery takes about 10 minutes, and I guess this should fix it.

            srowen Sean R. Owen added a comment -

            You can look at the branch/tag yourself: https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala
            Yes, the change seems to be applied.
            gaurav24 Gaurav Shah added a comment -

            Thanks srowen


            People

              Assignee: Yin Huai (yhuai)
              Reporter: Yin Huai (yhuai)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: