  1. Hadoop HDFS
  2. HDFS-202

Add a bulk FileSystem.getFileBlockLocations


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: hdfs-client, namenode
    • Labels: None
    • Hadoop Flags: Incompatible change, Reviewed

    Description

      Currently, map-reduce applications (specifically, file-based input formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
      This has several downsides:

      1. Even with only a few thousand files to process, the number of RPCs quickly becomes noticeable.
      2. The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. A few thousand input files therefore result in that many RPCs and 'searches'.

      It would be nice to have a FileSystem.getFileBlockLocations variant that takes a directory and returns the block locations for all files in that directory. That would eliminate the per-file RPCs and also replace the per-file 'search' with a single 'scan'.
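
      To make the difference concrete, here is a minimal, self-contained Java sketch. It does not use the Hadoop API: ToyNameNode, Block, getDirectoryBlockLocations and sampleNameNode are hypothetical names invented for illustration, and the "namesystem" is just a sorted in-memory map. It only shows how a per-file lookup costs one call per file, while a directory-level call can answer with one call and one ordered scan.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BulkBlockLocationsSketch {

    /** Hypothetical stand-in for one block's location record. */
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Toy in-memory "namenode" that counts the calls made against it. */
    static final class ToyNameNode {
        // Sorted map from full file path to that file's blocks.
        private final TreeMap<String, List<Block>> files = new TreeMap<>();
        int rpcCount = 0;

        void addFile(String path, List<Block> blocks) {
            files.put(path, blocks);
        }

        /** Per-file call: one "RPC" and one map lookup ("search") per path. */
        List<Block> getFileBlockLocations(String path) {
            rpcCount++;
            List<Block> blocks = files.get(path);
            if (blocks == null) {
                throw new IllegalArgumentException("No such file: " + path);
            }
            return blocks;
        }

        /** Bulk call (assumed API): one "RPC" and one ordered scan over dir's direct children. */
        Map<String, List<Block>> getDirectoryBlockLocations(String dir) {
            rpcCount++;
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            Map<String, List<Block>> result = new LinkedHashMap<>();
            // Paths sharing a prefix are contiguous in sorted order, so scan until the prefix ends.
            for (Map.Entry<String, List<Block>> e : files.tailMap(prefix, true).entrySet()) {
                String path = e.getKey();
                if (!path.startsWith(prefix)) {
                    break;
                }
                if (path.indexOf('/', prefix.length()) >= 0) {
                    continue; // skip files in nested subdirectories
                }
                result.put(path, e.getValue());
            }
            return result;
        }
    }

    /** Builds a toy namespace: fileCount files in /input, plus one nested and one unrelated file. */
    static ToyNameNode sampleNameNode(int fileCount) {
        ToyNameNode nn = new ToyNameNode();
        List<Block> oneBlock = Collections.singletonList(
                new Block(0L, 1024L, Collections.singletonList("host1")));
        for (int i = 0; i < fileCount; i++) {
            nn.addFile(String.format("/input/part-%05d", i), oneBlock);
        }
        nn.addFile("/input/sub/nested", oneBlock);
        nn.addFile("/other/part-00000", oneBlock);
        return nn;
    }

    public static void main(String[] args) {
        int fileCount = 8000;
        ToyNameNode nn = sampleNameNode(fileCount);

        // Today's pattern: one call per input file.
        for (int i = 0; i < fileCount; i++) {
            nn.getFileBlockLocations(String.format("/input/part-%05d", i));
        }
        int perFileRpcs = nn.rpcCount;

        // Proposed pattern: one call for the whole directory.
        nn.rpcCount = 0;
        Map<String, List<Block>> all = nn.getDirectoryBlockLocations("/input");
        int bulkRpcs = nn.rpcCount;

        System.out.println("files=" + all.size()
                + " perFileRpcs=" + perFileRpcs + " bulkRpcs=" + bulkRpcs);
    }
}
```

      The sketch ignores everything that makes the real problem hard (very large directories, locking in the namesystem, response size), so it should be read as an illustration of the call pattern only, not as a proposal for the actual interface.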

      When I tested this for terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly, this is much more important for latency-sensitive applications...

      Attachments

        1. hdfsListFiles5.patch
          48 kB
          Hairong Kuang
        2. hdfsListFiles4.patch
          47 kB
          Hairong Kuang
        3. hdfsListFiles3.patch
          54 kB
          Hairong Kuang
        4. hdfsListFiles2.patch
          42 kB
          Hairong Kuang
        5. hdfsListFiles1.patch
          40 kB
          Hairong Kuang
        6. hdfsListFiles.patch
          43 kB
          Hairong Kuang

            People

              Assignee: Hairong Kuang (hairong)
              Reporter: Arun Murthy (acmurthy)
              Votes: 1
              Watchers: 14
