Description
We were seeing OutOfMemoryErrors with stack traces like the following (Hadoop 0.17.0):
java.lang.OutOfMemoryError
    at java.util.zip.Deflater.init(Native Method)
    at java.util.zip.Deflater.<init>(Deflater.java:123)
    at java.util.zip.Deflater.<init>(Deflater.java:132)
    at org.apache.hadoop.io.CompressedWritable.write(CompressedWritable.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
    [...]
A Google search found the following long-standing issue in Java in which use of java.util.zip.Deflater causes an OutOfMemoryError:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189
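The gist is that each Deflater holds native zlib buffers that are released only when end() is called or the object is finalized, so allocating many short-lived Deflaters can exhaust native memory long before the Java heap fills. A loop along these lines illustrates the pressure (illustrative sketch only; whether it actually fails depends on heap and native memory limits and on GC timing):

    import java.util.zip.Deflater;

    public class DeflaterPressure {
      public static void main(String[] args) {
        byte[] input = new byte[1024];
        byte[] output = new byte[2048];
        for (int i = 0; i < 1000000; i++) {
          Deflater deflater = new Deflater(Deflater.BEST_SPEED);
          deflater.setInput(input);
          deflater.finish();
          while (!deflater.finished()) {
            deflater.deflate(output);
          }
          // No deflater.end() here: the native buffers stay alive until
          // finalization, which is the pressure that can surface as an
          // OutOfMemoryError in Deflater.init.
        }
      }
    }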
CompressedWritable instantiates a Deflater, but does not call deflater.end(). It should do that in order to release the Deflater's resources immediately, instead of waiting for the object to be finalized.
We applied this change locally and saw a marked improvement in the stability of our application's memory usage.
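Concretely, the change amounts to something like the following (CompressedWritable.write() paraphrased from the 0.17.0 source; the try/finally and the deflater.end() call are the new parts):

    public final void write(DataOutput out) throws IOException {
      if (compressed == null) {
        ByteArrayOutputStream deflated = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
          DataOutputStream dout =
            new DataOutputStream(new DeflaterOutputStream(deflated, deflater));
          writeCompressed(dout);
          dout.close();
        } finally {
          deflater.end(); // free the native zlib buffers now, not at finalization
        }
        compressed = deflated.toByteArray();
      }
      out.writeInt(compressed.length);
      out.write(compressed);
    }

Putting end() in a finally block guarantees the native buffers are released even if writeCompressed() throws.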
This may also affect the SequenceFile compression types, because org.apache.hadoop.io.compress.zlib.BuiltInZlib{Deflater,Inflater} extend java.util.zip.{Deflater,Inflater}. org.apache.hadoop.io.compress.Compressor defines an end() method, but I do not see that this method is ever called.
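If so, the fix there is the same discipline applied at the call sites: whoever obtains a Compressor should call end() on it when finished. A minimal sketch of the pattern (the createCompressor() usage reflects my reading of the 0.17.0 API, and the helper is hypothetical):

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.Compressor;

    public class CompressorCleanup {
      // Hypothetical helper: release the Compressor as soon as it is done.
      static void withCompressor(CompressionCodec codec) {
        Compressor compressor = codec.createCompressor(); // may be a BuiltInZlibDeflater
        try {
          // ... setInput()/compress()/finish() as usual ...
        } finally {
          compressor.end(); // release native state immediately, mirroring Deflater.end()
        }
      }
    }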