  1. Hadoop Common
  2. HADOOP-2080

ChecksumFileSystem checksum file size incorrect.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.14.0, 0.14.1, 0.14.2
    • Fix Version/s: 0.15.0
    • Component/s: fs
    • Labels:
      None
    • Environment:

      Sun jdk1.6.0_02 running on Linux CentOS 5

      Description

      Periodically, reduce tasks hang. The log for the hung task shows a stack trace like this:

      2007-10-18 17:02:04,227 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Insufficient space
      at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.write(InMemoryFileSystem.java:174)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
      at java.io.DataOutputStream.write(DataOutputStream.java:90)
      at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
      at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:326)
      at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
      at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
      at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:310)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
      at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
      at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:685)
      at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:637)

      The problem stems from a miscalculation of the size of the checksum file that is created in the InMemoryFileSystem for the data being copied from a completed mapper task to the reducer task.

      The method used for calculating checksum file size is the following (ChecksumFileSystem:318):

      ((long)(Math.ceil((float)size/bytesPerSum)) + 1) * 4 + CHECKSUM_VERSION.length;

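      The issue here is the cast to float. A Java float has only 24 bits of significand precision, so it cannot represent every integer above 0x1000000 (16,777,216). For larger sizes the cast can round the value down, and the computed checksum file size then comes out too small. The fix is to replace this calculation with one that doesn't cast to float: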

      (((size+1)/bytesPerSum) + 2) * 4 + CHECKSUM_VERSION.length
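The precision loss described above can be reproduced in isolation. The following is a minimal sketch, not code from the patch: the class and method names are hypothetical, the chunk size of 512 bytes is an assumed value for bytesPerSum, and the exact ceiling division is shown only for comparison (it is not the replacement expression proposed in this report).

```java
class ChecksumSizeDemo {
    // Chunk count using the float cast from the expression quoted in this report.
    static long chunksViaFloat(long size, int bytesPerSum) {
        return (long) Math.ceil((float) size / bytesPerSum);
    }

    // Exact ceiling division in long arithmetic, shown for comparison only.
    static long chunksExact(long size, int bytesPerSum) {
        return (size + bytesPerSum - 1) / bytesPerSum;
    }

    public static void main(String[] args) {
        int bytesPerSum = 512;        // assumed chunk size, for illustration
        long size = 0x1000000L + 1;   // 16,777,217: smallest positive integer a float cannot hold exactly

        // (float) size rounds to 16,777,216.0f, so the partial last chunk is lost.
        System.out.println("float cast: " + chunksViaFloat(size, bytesPerSum));
        System.out.println("exact:      " + chunksExact(size, bytesPerSum));
    }
}
```

With these inputs the float-based count is one chunk short of the exact count, so a checksum file sized from it would be 4 bytes too small.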

        Attachments

        1. TestInternalFilesystem.java
          2 kB
          Richard Lee
        2. ChecksumFileSystem.java.patch
          0.7 kB
          Richard Lee
        3. hadoop-2080.patch
          5 kB
          Owen O'Malley

            People

            • Assignee:
              omalley Owen O'Malley
              Reporter:
              rl337 Richard Lee
