Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 0.6.0
Description
While running Shark with too few reduce tasks, an overloaded worker machine raised java.lang.OutOfMemoryError: GC overhead limit exceeded. The worker's Java process then hung at 100% CPU, spending all of its time in the garbage collector. The master did not detect this failure, so the entire job hung.
Handling and reporting failures caused by OutOfMemoryError is complicated because the error can be thrown at many different locations, depending on which allocation fails.
I'm not sure that it's safe to recover from OutOfMemoryError, so worker processes should probably die once they raise that error. We might be able to do this in an uncaught exception handler.
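To make the uncaught-exception-handler idea concrete, here is a minimal sketch in plain Java. It is not the fix that was eventually applied. The class name OomExitHandler, the exit code 52, and the injectable terminator are all hypothetical and chosen only for illustration. The sketch assumes that halting the JVM immediately is preferable to a normal exit, since shutdown hooks may themselves need to allocate memory.

```java
import java.util.function.IntConsumer;

/** Hypothetical handler: terminates the JVM when a thread dies from OutOfMemoryError. */
final class OomExitHandler implements Thread.UncaughtExceptionHandler {
    /** Illustrative exit code; not a value defined by Spark or Shark. */
    static final int OOM_EXIT_CODE = 52;

    /** Upper bound on cause-chain traversal, so a cyclic chain cannot loop forever. */
    private static final int MAX_CAUSE_DEPTH = 32;

    private final IntConsumer terminator;

    /** Default: halt immediately, skipping shutdown hooks and finalizers. */
    OomExitHandler() {
        this(code -> Runtime.getRuntime().halt(code));
    }

    /** Lets a caller (for example a test) substitute the action that ends the process. */
    OomExitHandler(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /** Returns true if t, or a throwable in its cause chain, is an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        Throwable cur = t;
        for (int depth = 0; cur != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (cur instanceof OutOfMemoryError) {
                return true;
            }
            cur = cur.getCause();
        }
        return false;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (isOutOfMemory(error)) {
            // Die right away so a supervisor can notice the dead worker.
            terminator.accept(OOM_EXIT_CODE);
        } else {
            // Other failures are only reported here; the process keeps running.
            System.err.println("Uncaught exception in thread "
                    + thread.getName() + ": " + error);
        }
    }

    /** Installs the handler for every thread that has no handler of its own. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new OomExitHandler());
    }
}
```

A worker would call OomExitHandler.install() once at startup. Note that this only covers errors that propagate out of a thread's run method. Code that catches Throwable and continues would swallow the OutOfMemoryError before the handler sees it, so such catch blocks would need to rethrow it.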