Hadoop Map/Reduce / MAPREDUCE-2349

Speed up list[Located]Status calls from input formats


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.0
    • Component/s: task
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When a job has many input paths, the listStatus calls (or the improved listLocatedStatus calls) invoked from the getSplits() method can take a long time. Most of that time is spent waiting for the previous call to complete before dispatching the next one.

      This can be sped up considerably by dispatching multiple calls at once (via executors). If the same filesystem client is used, the calls are pipelined much better (since calls are serialized) and impose no extra burden on the namenode, while greatly reducing the latency seen by the client. In a simple test during off-peak hours, this reduced the getSplits() time from about 3s to about 0.5s.
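
      The idea of dispatching the listing calls through an executor can be sketched roughly as follows. This is a minimal illustration and not the code from the attached patches: the class name ParallelListing, the method listAll, and the lister function are hypothetical, and lister stands in for whatever call the input format makes on a shared filesystem client (for example a listStatus-style call for one path).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
final class ParallelListing {

    /**
     * Submits one listing task per input path to a fixed-size thread pool
     * and collects the results in the same order as the input paths.
     *
     * @param paths      input paths to list
     * @param lister     assumed stand-in for a per-path listing call
     * @param numThreads number of listing calls allowed in flight at once
     */
    static <T> List<List<T>> listAll(List<String> paths,
                                     Function<String, List<T>> lister,
                                     int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // One task per path, so no call waits for an unrelated one to be dispatched.
            List<Callable<List<T>>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> lister.apply(path));
            }
            // invokeAll blocks until every task has finished and returns
            // the futures in submission order.
            List<List<T>> results = new ArrayList<>();
            for (Future<List<T>> future : pool.invokeAll(tasks)) {
                // get() rethrows a listing failure as an ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

      A caller would pass the job's input paths, a function that lists a single path, and a thread count. Results come back in input order, so the split computation that follows does not need to change.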

      Attachments

        1. MAPREDUCE-2349.1.wip.txt
          22 kB
          Siddharth Seth
        2. MAPREDUCE-2349.2.txt
          39 kB
          Siddharth Seth
        3. MAPREDUCE-2349.3.txt
          39 kB
          Siddharth Seth
        4. MAPREDUCE-2349.4.txt
          41 kB
          Siddharth Seth
        5. MAPREDUCE-2349.5.txt
          43 kB
          Siddharth Seth

          People

            Assignee: Siddharth Seth (sseth)
            Reporter: Joydeep Sen Sarma (jsensarma)
            Votes: 0
            Watchers: 14
