  Hadoop Map/Reduce
  MAPREDUCE-1838

DistRaid map tasks have large variance in running times


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.22.0
    • Component/s: contrib/raid
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      HDFS RAID uses map-reduce jobs to generate parity files for a set of source files. Each map task gets a subset of files to operate on. The current code assigns files by walking through the list of files passed to the constructor of DistRaid.

      The problem is that the list of files given to the constructor is in (roughly) directory-listing order. When a large number of files is added, files adjacent in that order tend to have similar sizes. As a result, one map task can end up with only large files while another gets only small files, increasing the variance in running times.

      We could do smarter assignment by using the file sizes.
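      One way to realize such size-based assignment is a greedy "longest processing time first" heuristic: sort the files by size in descending order, then hand each file to the map task whose assigned total is currently smallest. The sketch below is illustrative only (it is not the attached patch, and the class and method names are hypothetical, not from DistRaid):

```java
import java.util.*;

// Hypothetical sketch of size-balanced file assignment (greedy LPT).
// Larger files are placed first; each goes to the least-loaded task.
public class SizeBalancedAssigner {
    public static List<List<Long>> assign(long[] fileSizes, int numTasks) {
        // Sort sizes in descending order so big files are placed first.
        Long[] sorted = Arrays.stream(fileSizes).boxed().toArray(Long[]::new);
        Arrays.sort(sorted, Collections.reverseOrder());

        List<List<Long>> tasks = new ArrayList<>();
        long[] totals = new long[numTasks];
        for (int i = 0; i < numTasks; i++) tasks.add(new ArrayList<>());

        for (long size : sorted) {
            // Find the task with the smallest assigned total so far.
            int min = 0;
            for (int t = 1; t < numTasks; t++)
                if (totals[t] < totals[min]) min = t;
            tasks.get(min).add(size);
            totals[min] += size;
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Directory-order listing where similar sizes cluster together:
        // naive round-robin over contiguous chunks would give one task
        // ~400 units of work and the other ~4.
        long[] sizes = {100, 100, 100, 100, 1, 1, 1, 1};
        List<List<Long>> tasks = assign(sizes, 2);
        System.out.println(tasks); // both tasks end up with a total of 202
    }
}
```

      With contiguous assignment in directory order, the example above would split 400 vs. 4; the greedy heuristic balances both tasks at 202. LPT is a standard approximation for minimizing makespan and would directly reduce the run-time variance described here.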

      Attachments

        1. MAPREDUCE-1838.patch
          0.8 kB
          Ramkumar Vadali

        Activity


          People

            Assignee: Ramkumar Vadali (rvadali)
            Reporter: Ramkumar Vadali (rvadali)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:
