Hadoop Map/Reduce / MAPREDUCE-4882

Error in estimating the length of the output file in Spill Phase

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.20.2, 1.0.3
    • Fix Version/s: 2.6.0
    • None
    • Any Environment

    • Spill Size

    Description

      The sortAndSpill() method in MapTask.java miscalculates the estimated length of the spill output file when the buffer has wrapped around.
      When "bufend < bufstart", the "long size" should be "(bufvoid - bufstart) + bufend", not "(bufvoid - bufend) + bufstart".

      Here is the original code in MapTask.java.
      private void sortAndSpill() throws IOException, ClassNotFoundException,
                                         InterruptedException {
        //approximate the length of the output file to be the length of the
        //buffer + header lengths for the partitions
        long size = (bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufend) + bufstart) +
                    partitions * APPROX_HEADER_LENGTH;
        FSDataOutputStream out = null;
      ------------------------------------------------------------------------------
      I ran a test with "TeraSort". A snippet from the mapper's log follows:

      MapTask: Spilling map output: record full = true
      MapTask: bufstart = 157286200; bufend = 10485460; bufvoid = 199229440
      MapTask: kvstart = 262142; kvend = 131069; length = 655360
      MapTask: Finished spill 3

      In this case, the spilled bytes should be (199229440 - 157286200) + 10485460 = 52428700 (about 52 MB), because 524287 records were spilled and each record takes 100 bytes.
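
      The arithmetic above can be checked with a small standalone sketch. It is not the MapTask code: the class and method names are hypothetical, and the header term (partitions * APPROX_HEADER_LENGTH) is left out because the log does not show those values. The record count assumes that kvstart and kvend index a circular array whose size is the logged "length". With the logged values, the expression as written yields 346030180 bytes, which is larger than bufvoid itself, while the proposed expression yields 52428700 bytes.

```java
// Hypothetical helper class for checking the numbers in the log snippet.
// It mirrors only the buffer-length part of the estimate in sortAndSpill().
public class SpillSizeEstimate {

    // Buffer bytes as computed by the expression quoted from MapTask.java.
    static long bufferBytesAsWritten(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufend) + bufstart;
    }

    // Buffer bytes as proposed in this report for the wrapped case:
    // the tail [bufstart, bufvoid) plus the head [0, bufend).
    static long bufferBytesProposed(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufstart) + bufend;
    }

    // Number of spilled records, ASSUMING kvstart/kvend are positions in a
    // circular index array of the given length.
    static long recordCount(long kvstart, long kvend, long length) {
        return kvend >= kvstart
            ? kvend - kvstart
            : (length - kvstart) + kvend;
    }

    public static void main(String[] args) {
        // Values taken from the mapper log in this report.
        long bufstart = 157286200L, bufend = 10485460L, bufvoid = 199229440L;
        long kvstart = 262142L, kvend = 131069L, length = 655360L;

        long records = recordCount(kvstart, kvend, length);
        System.out.println("records          = " + records);
        System.out.println("records * 100 B  = " + records * 100L);
        System.out.println("size as written  = "
            + bufferBytesAsWritten(bufstart, bufend, bufvoid));
        System.out.println("size as proposed = "
            + bufferBytesProposed(bufstart, bufend, bufvoid));
    }
}
```

      When the buffer has not wrapped (bufend >= bufstart), both methods return the same value, so the difference only shows up in spills like the one logged here.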

      Attachments

        1. MAPREDUCE-4882.patch
          10 kB
          Haifeng Chen

          People

            Assignee: Haifeng Chen (jerrychenhf)
            Reporter: Lijie Xu (jerrylead)
              Time Tracking

                Original Estimate: 1h
                Remaining Estimate: 1h
                Time Spent: Not Specified