Details
Bug
Status: Closed
Major
Resolution: Duplicate
0.20.2, 1.0.3
None
Any Environment
Spill Size
Description
The sortAndSpill() method in MapTask.java estimates the length of the output file incorrectly when the buffer has wrapped around.
When "bufend < bufstart", the "long size" should be computed as "(bufvoid - bufstart) + bufend", not "(bufvoid - bufend) + bufstart".
Here is the original code in MapTask.java.
private void sortAndSpill() throws IOException, ClassNotFoundException,
                                   InterruptedException {
  //approximate the length of the output file to be the length of the
  //buffer + header lengths for the partitions
  long size = (bufend >= bufstart
      ? bufend - bufstart
      : (bufvoid - bufend) + bufstart) +
              partitions * APPROX_HEADER_LENGTH;
  FSDataOutputStream out = null;
------------------------------------------------------------------------------
I ran a test with "TeraSort". A snippet from the mapper's log follows:
MapTask: Spilling map output: record full = true
MapTask: bufstart = 157286200; bufend = 10485460; bufvoid = 199229440
MapTask: kvstart = 262142; kvend = 131069; length = 655360
MapTask: Finished spill 3
In this case, Spill Bytes should be (199229440 - 157286200) + 10485460 = 52428700 (about 52 MB), because 524287 records were spilled and each record takes 100 bytes.
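
The arithmetic above can be checked with a small standalone sketch. It is not the MapTask code itself: the class name SpillSizeCheck and the helpers usedBytes and quotedEstimate are hypothetical, introduced only for illustration, and the "partitions * APPROX_HEADER_LENGTH" term is left out so that only the buffer portion of the estimate is compared.

```java
public class SpillSizeCheck {

    // Proposed formula (hypothetical helper): bytes between bufstart and
    // bufend in a circular buffer whose usable end is bufvoid.
    static long usedBytes(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufstart) + bufend;
    }

    // Expression as quoted in the report from MapTask.java
    // (hypothetical helper name, header term omitted).
    static long quotedEstimate(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufend) + bufstart;
    }

    public static void main(String[] args) {
        // Values taken from the mapper log in the report.
        long bufstart = 157286200L;
        long bufend = 10485460L;
        long bufvoid = 199229440L;

        // Expected size from the record count: 524287 records at 100 bytes each.
        long expected = 524287L * 100L;

        System.out.println("expected from records: " + expected);
        System.out.println("proposed formula:      " + usedBytes(bufstart, bufend, bufvoid));
        System.out.println("quoted formula:        " + quotedEstimate(bufstart, bufend, bufvoid));
    }
}
```

With the logged values, the proposed formula gives 52428700, which equals 524287 * 100, while the quoted expression gives 346030180, which is larger than bufvoid itself.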