Hadoop Map/Reduce: MAPREDUCE-6760

LocatedFileStatusFetcher to use listFiles(recursive)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels:
      None
    • Target Version/s:

      Description

      LocatedFileStatusFetcher parallelizes path listing, but it still issues a separate list call for every subdirectory as it recurses down the tree.

      If we could switch it to use FileSystem.listFiles() with recursive set to true, object stores that have high-performance implementations of that operation would see a significant speedup.

      HADOOP-13208 implements that for S3A; Azure, Swift, etc. can do the same.
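
      To illustrate why a recursive listing helps on an object store, here is a minimal sketch in plain Java. It is a toy model, not Hadoop code: the class ListingCostSketch, its methods, and the listCalls counter are all hypothetical, and it assumes the store is a flat, sorted key space where "directories" are only key prefixes. It also ignores result paging, which a real store would add on top.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model (hypothetical, not a Hadoop class): an object store as a sorted set of keys. */
class ListingCostSketch {

    // Full object keys such as "data/y=2016/m=01/p0"; directories exist only as prefixes.
    private final TreeSet<String> keys = new TreeSet<>();

    // Number of simulated list requests sent to the store.
    int listCalls = 0;

    void put(String key) {
        keys.add(key);
    }

    /** Treewalk: one list request per directory, then recurse into each subdirectory. */
    List<String> treeWalk(String dir) {
        listCalls++;
        List<String> files = new ArrayList<>();
        TreeSet<String> subdirs = new TreeSet<>();
        for (String key : keys.tailSet(dir, true)) {
            if (!key.startsWith(dir)) {
                break; // sorted order: no later key can share the prefix
            }
            String rest = key.substring(dir.length());
            int slash = rest.indexOf('/');
            if (slash < 0) {
                files.add(key); // direct child file
            } else {
                subdirs.add(dir + rest.substring(0, slash + 1)); // child "directory"
            }
        }
        for (String sub : subdirs) {
            files.addAll(treeWalk(sub));
        }
        return files;
    }

    /** Flat listing: a single prefix scan returns every file under the prefix. */
    List<String> flatList(String prefix) {
        listCalls++;
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break;
            }
            files.add(key);
        }
        return files;
    }

    public static void main(String[] args) {
        ListingCostSketch store = new ListingCostSketch();
        store.put("data/a.txt");
        store.put("data/y=2016/m=01/p0");
        store.put("data/y=2016/m=01/p1");
        store.put("data/y=2016/m=02/p0");
        store.put("data/y=2017/m=01/p0");

        int walked = store.treeWalk("data/").size();
        int walkCalls = store.listCalls;
        store.listCalls = 0;
        int flat = store.flatList("data/").size();

        System.out.println("treewalk: " + walked + " files, " + walkCalls + " list calls");
        System.out.println("flat:     " + flat + " files, " + store.listCalls + " list calls");
    }
}
```

      In this model the treewalk cost grows with the number of directories, while the flat scan cost depends only on the number of keys under the prefix. That is the gap a store-specific listFiles(path, true) implementation can close.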

          Activity

          stevel@apache.org Steve Loughran added a comment -

          Note that if we find that listFiles() isn't quite what we want (it skips directories), we may want to extend that FS method, since we can update the object stores and the MR classes in sync. Things like Hive wouldn't need to change.

          stevel@apache.org Steve Loughran added a comment -

          One problem here is that it's using globStatus(). The code should check whether there's a wildcard in the scan and, if not, go straight to the optimal API call.

          Of course, if globStatus() did that check and action itself, everything that calls that API would get a speedup. That could be the better tactic.
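
          A rough sketch of the wildcard check, in plain Java with no Hadoop dependency. The class GlobCheckSketch and both method names are hypothetical. The character set is an assumption based on the glob syntax documented for globStatus() (?, *, [...], {...}, and backslash escapes); this sketch conservatively treats any backslash as "needs glob handling" rather than interpreting escapes.

```java
/** Hypothetical helper: decide whether a path pattern needs glob expansion at all. */
class GlobCheckSketch {

    /** Returns true if the pattern contains a character that glob syntax treats specially. */
    static boolean hasGlobMetachar(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            switch (pattern.charAt(i)) {
                case '*':
                case '?':
                case '[':
                case '{':
                case '\\': // escape character: leave these patterns to the glob code
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    /** Names the listing strategy a caller would pick; the strings are labels only. */
    static String chooseListing(String pattern) {
        return hasGlobMetachar(pattern) ? "globStatus" : "listFiles";
    }

    public static void main(String[] args) {
        System.out.println(chooseListing("/user/alice/input"));
        System.out.println(chooseListing("/user/alice/input/part-*"));
    }
}
```

          Whether the check lives in the caller or inside globStatus() itself is the design question raised above; the test is the same either way.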


            People

            • Assignee:
              Unassigned
              Reporter:
              stevel@apache.org Steve Loughran
            • Votes:
              0
              Watchers:
              5
