Description
Periodically, reduce tasks hang. When you consult the task's log, you see a stack trace like this:
2007-10-18 17:02:04,227 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Insufficient space
at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.write(InMemoryFileSystem.java:174)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:326)
at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:310)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:685)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:637)
The problem stems from a miscalculation of the size of the checksum file created in the InMemoryFileSystem for the map output being copied from a completed map task to the reduce task.
The checksum file size is calculated with the following expression (ChecksumFileSystem:318):
((long)(Math.ceil((float)size/bytesPerSum)) + 1) * 4 + CHECKSUM_VERSION.length;
The issue is the cast to float. A single-precision float has only 24 bits of mantissa, so the cast loses precision for any size over 0x1000000 (16 MB); the computed checksum file size can come out smaller than the space actually needed, and the copy fails with the IOException above. The fix is to replace this calculation with one that avoids the float cast:
(((size+1)/bytesPerSum) + 2) * 4 + CHECKSUM_VERSION.length
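Note that the integer-only expression rounds up and may overestimate by a few bytes, which is safe when reserving space. As a rough illustration, here is a standalone sketch (not Hadoop code) of the two calculations side by side; it assumes bytesPerSum = 512 (a typical io.bytes.per.checksum value) and a 4-byte CHECKSUM_VERSION header, as in ChecksumFileSystem:

// Standalone sketch demonstrating the precision loss in the float cast.
// Assumes bytesPerSum = 512 and CHECKSUM_VERSION.length = 4.
public class ChecksumSizeDemo {
  static final int BYTES_PER_SUM = 512;
  static final int CHECKSUM_VERSION_LENGTH = 4;

  // Original calculation: the (float) cast rounds 'size' to 24 bits of mantissa.
  static long withFloatCast(long size) {
    return ((long) (Math.ceil((float) size / BYTES_PER_SUM)) + 1) * 4
        + CHECKSUM_VERSION_LENGTH;
  }

  // Proposed integer-only calculation; rounds up, so it may overestimate slightly.
  static long integerOnly(long size) {
    return (((size + 1) / BYTES_PER_SUM) + 2) * 4 + CHECKSUM_VERSION_LENGTH;
  }

  public static void main(String[] args) {
    // 0x40000000 + 1 is not exactly representable as a float (the spacing
    // between floats at 2^30 is 64), so the cast rounds it down to 0x40000000
    // and a whole 512-byte checksum chunk goes unaccounted for.
    long size = 0x40000000L + 1;
    System.out.println("float cast:   " + withFloatCast(size)); // 8388616
    System.out.println("integer only: " + integerOnly(size));   // 8388620
  }
}

For this input the float-based calculation comes out 4 bytes short of what the checksum data actually requires, which is consistent with the "Insufficient space" failure above: the reservation made in the InMemoryFileSystem is exceeded when the checksums are written.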