  1. Hadoop Map/Reduce
  2. MAPREDUCE-4882

Error in estimating the length of the output file in Spill Phase

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.20.2, 1.0.3
    • Fix Version/s: 2.6.0
    • Component/s: None
    • Environment:

      Any Environment

    • Tags:
      Spill Size

      Description

      The sortAndSpill() method in MapTask.java miscalculates the estimated length of the output file when the buffer has wrapped around.
      When "bufend < bufstart", the "long size" should be "(bufvoid - bufstart) + bufend", not "(bufvoid - bufend) + bufstart".

      Here is the original code in MapTask.java.
      private void sortAndSpill() throws IOException, ClassNotFoundException,
                                         InterruptedException {
        //approximate the length of the output file to be the length of the
        //buffer + header lengths for the partitions
        long size = (bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufend) + bufstart) +
                    partitions * APPROX_HEADER_LENGTH;
        FSDataOutputStream out = null;
      ------------------------------------------------------------------------------
      I ran a test with "TeraSort". A snippet from the mapper's log follows:

      MapTask: Spilling map output: record full = true
      MapTask: bufstart = 157286200; bufend = 10485460; bufvoid = 199229440
      MapTask: kvstart = 262142; kvend = 131069; length = 655360
      MapTask: Finished spill 3

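      In this case, the spilled bytes should be (199229440 - 157286200) + 10485460 = 52428700 (about 52 MB), because the number of spilled records is 524287 and each record takes 100 B. The current expression instead yields (199229440 - 10485460) + 157286200 = 346030180 (about 346 MB), before the partition headers are added.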
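
The arithmetic can be checked with a small standalone sketch. It is not code from MapTask.java: the class and method names are hypothetical, the header term (partitions * APPROX_HEADER_LENGTH) is left out because the partition count does not appear in the log, and the record-count formula is an assumption that kvstart/kvend index a circular array of the logged length.

```java
public class SpillSizeCheck {
    // Proposed estimate: bytes between bufstart and bufend in a circular
    // buffer whose usable end is bufvoid (hypothetical helper).
    public static long occupiedBytes(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufstart) + bufend;
    }

    // The expression quoted in the report, without the header term.
    public static long quotedEstimate(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufend) + bufstart;
    }

    // Assumption: kvstart and kvend wrap around an index array of size length.
    public static long recordCount(long kvstart, long kvend, long length) {
        return kvend >= kvstart
            ? kvend - kvstart
            : (length - kvstart) + kvend;
    }

    public static void main(String[] args) {
        // Values taken from the mapper log in the report.
        long bufstart = 157286200L, bufend = 10485460L, bufvoid = 199229440L;
        long kvstart = 262142L, kvend = 131069L, length = 655360L;

        long records = recordCount(kvstart, kvend, length);
        System.out.println("records          = " + records);        // 524287
        System.out.println("records * 100 B  = " + records * 100);  // 52428700
        System.out.println("proposed size    = "
            + occupiedBytes(bufstart, bufend, bufvoid));            // 52428700
        System.out.println("quoted size      = "
            + quotedEstimate(bufstart, bufend, bufvoid));           // 346030180
    }
}
```

With the logged values, the proposed expression agrees with 524287 records of 100 B each, while the quoted expression overestimates by roughly a factor of 6.6.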

            People

            • Assignee:
              jerrychenhf Haifeng Chen
            • Reporter:
              jerrylead Lijie Xu
            • Votes:
              0
            • Watchers:
              8


                Time Tracking

                Estimated: 1h
                Remaining: 1h
                Logged: Not Specified