Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-7470

multi-thread mapreduce v1 FileOutputcommitter

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • mrv2

    Description

      In cloud environment, such as aws, aliyun etc., the internet delay is non-trival when we commit thounds of files.

      In our situation, the ping delay is about 0.03ms in IDC, but when move to Coud, the ping delay is about 3ms, which is roughly 100x slower. We found that, committing tens thounds of files will cost a few tens of minutes. The more files there are, the logger it takes.

      So we propose a new committer algorithm, which is a variant of committer algorithm version 1, called 3. In this new algorithm 3, in order to decrease the committer time, we use a thread pool to commit job's final output.

      Our test result in Cloud production shows that, the new algorithm 3 has decrease the committer time by serveral tens of times.

      Attachments

        1. MAPREDUCE-7470.0.patch
          6 kB
          TianyiMa

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            tianyima TianyiMa
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment