
Issue Links:
Blocker
This issue is blocked by:
HADOOP-2053 OutOfMemoryError : Java heap space errors in hadoop 0.14
Incorporates
This issue is part of:
HADOOP-2919 Create fewer copies of buffer data during sort/spill
Resolution Date: 31/Mar/08 10:54 PM
Description
MapTask#MapOutputBuffer uses a plain DataOutputBuffer, which defaults to a 32-byte buffer, and the DataOutputBuffer#write call doubles the underlying byte array whenever it needs more space.
However, for maps that output any decent amount of data (e.g. 128MB in examples/Sort.java), this means the buffer grows painfully slowly from 2^5 to 2^28 bytes, and each step creates a new array and then copies the old contents into it.
I reckon we could do much better in the MapTask. For example, we could start with a buffer of size 1/4KB and quadruple, rather than double, up to, say, 4/8/16MB, and then resume doubling (or less).
This ramps up quickly at first, which minimizes the number of System.arraycopy calls and the small buffers left for GC; doubling later keeps the buffer from ramping up too quickly, which minimizes memory wasted to fragmentation.
Of course, this issue is about benchmarking and figuring out whether all this is worth it and, if so, what the right set of trade-offs is.
Thoughts?
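As a rough way to compare the two growth schedules described above, the sketch below counts how many reallocations (each one a new array plus a copy) a buffer needs to reach a target capacity. The class and method names are hypothetical, and the 1 KB starting size and 16 MB switch-over point are assumed values for illustration, not measured recommendations.

```java
public class GrowthPolicySketch {
    // Assumed switch-over point: quadruple below 16 MB, double at or above it.
    static final long FAST_LIMIT = 16L << 20;

    /** Next capacity under plain doubling (the current behaviour described above). */
    static long nextDoubling(long capacity) {
        return capacity * 2;
    }

    /** Next capacity under the proposed schedule: quadruple first, then double. */
    static long nextHybrid(long capacity) {
        return capacity < FAST_LIMIT ? capacity * 4 : capacity * 2;
    }

    /** Counts how many times the buffer must be reallocated to reach the target capacity. */
    static int reallocations(long initial, long target, boolean hybrid) {
        int count = 0;
        long capacity = initial;
        while (capacity < target) {
            capacity = hybrid ? nextHybrid(capacity) : nextDoubling(capacity);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long target = 1L << 28; // 2^28 bytes, the upper figure quoted in the description
        System.out.println("doubling from 32 bytes: " + reallocations(32, target, false));
        System.out.println("hybrid from 1 KB:       " + reallocations(1L << 10, target, true));
    }
}
```

Under these assumed parameters, reaching 2^28 bytes takes 23 reallocations with plain doubling from 32 bytes and 11 with the quadruple-then-double schedule. The count says nothing about total bytes copied or heap fragmentation, which is what the proposed benchmarking would have to settle.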
made changes - 15/Oct/07 10:48 AM
Field | Original Value | New Value
Link | (none) | This issue is blocked by HADOOP-2053 [ HADOOP-2053 ]

made changes - 04/Jan/08 10:20 AM
Fix Version/s | 0.16.0 [ 12312740 ] | (none)

made changes - 31/Mar/08 10:54 PM
Resolution | (none) | Duplicate [ 3 ]
Fix Version/s | (none) | 0.17.0 [ 12312913 ]
Assignee | Arun C Murthy [ acmurthy ] | Chris Douglas [ chris.douglas ]
Status | Open [ 1 ] | Resolved [ 5 ]

made changes - 17/Apr/08 05:21 AM
Fix Version/s | 0.17.0 [ 12312913 ] | (none)

made changes - 17/Apr/08 05:26 AM
Status | Resolved [ 5 ] | Closed [ 6 ]

made changes - 08/Jul/09 04:52 PM
Component/s | mapred [ 12310690 ] | (none)

From the HADOOP-2053 stack trace:
<stack>
task_200710112103_0001_m_000015_1: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:340)
</stack>
Text.write is directly calling DataOutputStream.write, which in turn calls ByteArrayOutputStream.write.
What I expected was DataOutputBuffer.write --> DataOutputBuffer.Buffer.write.
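One possible explanation, assuming DataOutputBuffer extends DataOutputStream and its inner Buffer extends ByteArrayOutputStream without overriding these write methods (worth verifying against the 0.14 source): a Java stack frame names the class that declares the method, not the runtime class of the object, so inherited methods show up under the superclass name. The sketch below uses hypothetical stand-in classes, not Hadoop's, to show the effect.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class InheritedFrameSketch {
    // Hypothetical stand-in for the inner Buffer: inherits write(byte[], int, int) unchanged.
    static class SketchBuffer extends ByteArrayOutputStream { }

    // Hypothetical stand-in for DataOutputBuffer: a DataOutputStream over SketchBuffer.
    static class SketchOutputBuffer extends DataOutputStream {
        SketchOutputBuffer() {
            super(new SketchBuffer());
        }
    }

    /** Returns true if any frame in the trace belongs to the named class. */
    static boolean hasFrame(Throwable t, String className) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getClassName().equals(className)) {
                return true;
            }
        }
        return false;
    }

    /** Forces a failure inside the inherited write path and returns the exception. */
    static Throwable provoke() {
        try {
            // A null array makes the inherited ByteArrayOutputStream.write throw.
            new SketchOutputBuffer().write(null, 0, 1);
            return null;
        } catch (Exception e) {
            return e;
        }
    }

    public static void main(String[] args) {
        Throwable t = provoke();
        System.out.println("ByteArrayOutputStream frame present: "
                + hasFrame(t, "java.io.ByteArrayOutputStream"));
        System.out.println("SketchBuffer frame present: "
                + hasFrame(t, SketchBuffer.class.getName()));
    }
}
```

If that assumption holds for the real classes, the trace above would be consistent with the write going through DataOutputBuffer and its Buffer after all, with the frames simply reported under the JDK superclasses.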