Hadoop HDFS: HDFS-202

Add a bulk FileSystem.getFileBlockLocations


    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: hdfs-client, namenode
    • Labels:
    • Hadoop Flags:
      Incompatible change, Reviewed


      Currently, map-reduce applications (specifically, file-based input formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
      This has several downsides:

      1. Even with only a few thousand files to process, the number of RPCs quickly becomes noticeable.
      2. The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. A few thousand input files therefore result in that many RPCs and 'searches'.

      It would be nice to have a FileSystem.getFileBlockLocations that can take a directory and return the block locations for all files in that directory. This would eliminate the per-file RPC and replace the per-file 'search' with a single 'scan'.
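
      To make the RPC arithmetic concrete, here is a minimal, self-contained sketch of the two access patterns. It is a toy model, not Hadoop code: ToyNameNode, ToyBlockLocation, getDirectoryBlockLocations and the other names are hypothetical stand-ins invented for illustration, and the sketch does not claim to match the API that the attached patches introduce. It only assumes that the namespace can be walked in sorted order, so that a directory's files can be collected in one scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these types are Hadoop classes.
class BulkBlockLocationsSketch {

    /** Hypothetical stand-in for one block's location. */
    static final class ToyBlockLocation {
        final long offset;
        final long length;
        final List<String> hosts;

        ToyBlockLocation(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Toy namenode: a sorted path -> blocks map plus a counter of simulated RPCs. */
    static final class ToyNameNode {
        private final TreeMap<String, List<ToyBlockLocation>> namespace = new TreeMap<>();
        int rpcCount = 0;

        void addFile(String path, List<ToyBlockLocation> blocks) {
            namespace.put(path, blocks);
        }

        /** One simulated RPC: list the files directly under dir. */
        List<String> listFiles(String dir) {
            rpcCount++;
            return new ArrayList<>(childrenOf(dir).keySet());
        }

        /** One simulated RPC and one lookup (the 'search') per file. */
        List<ToyBlockLocation> getFileBlockLocations(String path) {
            rpcCount++;
            List<ToyBlockLocation> blocks = namespace.get(path);
            return blocks == null ? new ArrayList<ToyBlockLocation>() : blocks;
        }

        /** Hypothetical bulk call: one simulated RPC, one 'scan' of the directory. */
        Map<String, List<ToyBlockLocation>> getDirectoryBlockLocations(String dir) {
            rpcCount++;
            return childrenOf(dir);
        }

        /** Walks the sorted namespace once and keeps the direct children of dir. */
        private Map<String, List<ToyBlockLocation>> childrenOf(String dir) {
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            Map<String, List<ToyBlockLocation>> out = new LinkedHashMap<>();
            for (Map.Entry<String, List<ToyBlockLocation>> e
                    : namespace.tailMap(prefix, true).entrySet()) {
                if (!e.getKey().startsWith(prefix)) {
                    break; // keys are sorted, so we have left the directory
                }
                if (e.getKey().indexOf('/', prefix.length()) < 0) {
                    out.put(e.getKey(), e.getValue()); // skip nested subdirectories
                }
            }
            return out;
        }
    }

    /** Current pattern: one listing call, then one call per file. */
    static Map<String, List<ToyBlockLocation>> perFile(ToyNameNode nn, String dir) {
        Map<String, List<ToyBlockLocation>> result = new LinkedHashMap<>();
        for (String path : nn.listFiles(dir)) {
            result.put(path, nn.getFileBlockLocations(path));
        }
        return result;
    }

    /** Proposed pattern: a single call for the whole directory. */
    static Map<String, List<ToyBlockLocation>> bulk(ToyNameNode nn, String dir) {
        return nn.getDirectoryBlockLocations(dir);
    }

    /** Builds a toy namespace with fileCount single-block files under /input. */
    static ToyNameNode sampleNameNode(int fileCount) {
        ToyNameNode nn = new ToyNameNode();
        for (int i = 0; i < fileCount; i++) {
            nn.addFile("/input/part-" + i, Arrays.asList(
                new ToyBlockLocation(0L, 64L * 1024 * 1024,
                    Arrays.asList("host-a", "host-b", "host-c"))));
        }
        return nn;
    }

    public static void main(String[] args) {
        ToyNameNode nn = sampleNameNode(8000);
        perFile(nn, "/input");
        int perFileRpcs = nn.rpcCount;
        nn.rpcCount = 0;
        bulk(nn, "/input");
        // prints "per-file RPCs: 8001, bulk RPCs: 1"
        System.out.println("per-file RPCs: " + perFileRpcs + ", bulk RPCs: " + nn.rpcCount);
    }
}
```

      In this model both patterns return the same path-to-blocks mapping; the only difference is the number of simulated round trips, which grows with the file count in the per-file case and stays constant in the bulk case.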

      When I tested this for terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...

      1. hdfsListFiles5.patch
        48 kB
        Hairong Kuang
      2. hdfsListFiles4.patch
        47 kB
        Hairong Kuang
      3. hdfsListFiles3.patch
        54 kB
        Hairong Kuang
      4. hdfsListFiles2.patch
        42 kB
        Hairong Kuang
      5. hdfsListFiles1.patch
        40 kB
        Hairong Kuang
      6. hdfsListFiles.patch
        43 kB
        Hairong Kuang



            • Assignee:
              Hairong Kuang
            • Reporter:
              Arun C Murthy
            • Votes:
              1
            • Watchers:
              14


              • Created: