Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Hadoop Flags: Reviewed
Description
I recently analyzed JVM heap dumps from Hive running a big workload. Two excerpts from the analysis, done with jxray (www.jxray.com), are given below. It turns out that nearly half of the live memory is taken up by objects awaiting finalization, and the biggest offender among them is the class OpensslAesCtrCryptoCodec:
401,189K (39.7%) (1 of sun.misc.Cleaner)
  <-- Java Static: sun.misc.Cleaner.first

400,572K (39.6%) (14001 of org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec, org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager, java.util.jar.JarFile etc.)
  <-- j.l.r.Finalizer.referent <-- j.l.r.Finalizer.{next} <-- sun.misc.Cleaner.next <-- sun.misc.Cleaner.{next} <-- Java Static: sun.misc.Cleaner.first

270,673K (26.8%) (2138 of org.apache.hadoop.mapred.JobConf)
  <-- org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec.conf <-- j.l.r.Finalizer.referent <-- j.l.r.Finalizer.{next} <-- sun.misc.Cleaner.next <-- sun.misc.Cleaner.{next} <-- Java Static: sun.misc.Cleaner.first

---------------------

102,232K (10.1%) (1 of j.l.r.Finalizer)
  <-- Java Static: java.lang.ref.Finalizer.unfinalized

101,676K (10.1%) (8613 of org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec, java.util.zip.ZipFile$ZipFileInflaterInputStream, org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager etc.)
  <-- j.l.r.Finalizer.referent <-- j.l.r.Finalizer.{next} <-- Java Static: java.lang.ref.Finalizer.unfinalized
This heap dump was taken using 'jmap -dump:live', which forces the JVM to run a full GC before dumping the heap. So we are already looking at the heap right after a GC, and yet all these unfinalized objects are still there. I think this happens because the JVM always runs just one finalization thread, and thus the queue of objects that need finalization may get processed too slowly. My understanding is that finalization works as follows (see the sketch after the list):
1. When GC runs, it discovers that an object x that overrides finalize() is unreachable.
2. x is added to the finalization queue. So technically x is still reachable: it occupies memory, and all the objects that it references stay in memory as well.
3. The finalization thread processes objects from the finalization queue serially, so x may stay in memory for a long time.
4. x.finalize() is invoked, and then x becomes truly unreachable. If x stayed in memory long enough, it has by now been promoted to the Old Generation of the heap, so only a full GC can clean it up.
5. When a full GC finally occurs, x gets cleaned up.
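To make this lifecycle concrete, here is a minimal, self-contained sketch (not Hadoop code; all names are mine) showing how a finalizable object pins a large referenced buffer until the single finalizer thread gets around to it:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FinalizationDemo {
    static final CountDownLatch finalized = new CountDownLatch(1);

    static class Holder {
        // 64 MB that stays pinned for as long as this Holder awaits finalization
        private final byte[] payload = new byte[64 * 1024 * 1024];

        @Override
        protected void finalize() throws Throwable {
            try {
                Thread.sleep(1000); // simulate slow cleanup; the lone finalizer thread is serial
                finalized.countDown();
            } finally {
                super.finalize();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Holder h = new Holder();
        h = null; // step 1: unreachable, but payload is NOT yet collectable
        // steps 2-4: GC enqueues the Holder; the finalizer thread works through the queue
        while (!finalized.await(200, TimeUnit.MILLISECONDS)) {
            System.gc(); // a hint only; repeated until finalize() has actually run
        }
        System.gc(); // step 5: only now can the 64 MB payload be reclaimed
        System.out.println("Holder finalized; payload now collectable");
    }
}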
So finalization is formally reliable, but in practice it is quite possible for a lot of unreachable but unfinalized objects to flood the memory. I guess we are seeing all these OpensslAesCtrCryptoCodec objects while they are in phase 3 above. And the really bad thing is that these objects in turn keep a whole lot of other stuff in memory, in particular JobConf objects. Such a JobConf has nothing to do with finalization, yet the GC cannot release it until the corresponding OpensslAesCtrCryptoCodec is gone.
Here is the OpensslAesCtrCryptoCodec.finalize() method, with my comments:
protected void finalize() throws Throwable {
  try {
    Closeable r = (Closeable) this.random;
    r.close(); // Relevant only when (random instanceof OsSecureRandom == true)
  } catch (ClassCastException e) {
  }
  super.finalize(); // Not needed, no finalize() in superclasses
}
So finalize() in this class, which may keep a whole tree of objects in memory, is relevant only when the codec is configured to use the OsSecureRandom class. The latter reads random bytes from a configured file, and needs finalization to close the input stream associated with that file.
The suggested fix is to remove finalize() from OpensslAesCtrCryptoCodec and add it to the only class in this "family" that really needs it, OsSecureRandom. That will ensure that only OsSecureRandom objects (if/when they are used) stay in memory awaiting finalization, and no other, irrelevant objects. A sketch of what this could look like follows.
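This is a minimal sketch of the suggested change, not the final patch. It assumes OsSecureRandom's existing close() method (the class already implements Closeable) releases the stream reading from the random device:

// In OpensslAesCtrCryptoCodec: delete the finalize() override entirely.

// In OsSecureRandom: add a finalizer that delegates to the existing close().
@Override
protected void finalize() throws Throwable {
  try {
    close(); // closes the input stream reading from the configured random-device file
  } finally {
    super.finalize();
  }
}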
Note that this solution means that streams are still closed lazily, which, in principle, may cause its own problems. So the most reliable fix would be to call OsSecureRandom.close() explicitly when it is no longer needed. But the above fix is a necessary first step anyway: it will remove the most acute memory problem and will not make anything else worse than it currently is.
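For illustration only: if CryptoCodec were made Closeable (the subject of HADOOP-14524, linked below), call sites could release the underlying OsSecureRandom deterministically instead of waiting for the finalizer thread. A hypothetical usage sketch, assuming such a Closeable CryptoCodec:

// Hypothetical: assumes CryptoCodec implements Closeable (HADOOP-14524)
try (CryptoCodec codec = CryptoCodec.getInstance(conf)) {
  // ... use the codec for encryption/decryption ...
} // close() runs here, releasing OsSecureRandom's file stream immediately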
Attachments
Issue Links
- is related to HADOOP-14524 Make CryptoCodec Closeable so it can be cleaned up proactively (Resolved)