  Spark / SPARK-599

OutOfMemoryErrors can cause workers to hang indefinitely


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels: None

    Description

      While running Shark with an insufficient number of reduce tasks, an overloaded worker machine raised java.lang.OutOfMemoryError: GC overhead limit exceeded. This caused that worker's Java process to hang at 100% CPU, spending all of its time in the garbage collector. The master never detected this failure, so the entire job hung.

      Handling and reporting failures due to OutOfMemoryError is complicated by the fact that the error can be raised at many different locations, depending on which allocation triggered it.

      I'm not sure that it's safe to recover from OutOfMemoryError, so worker processes should probably die once they raise that error. We might be able to do this in an uncaught exception handler.
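      As a rough sketch of that idea (illustrative only, not necessarily the handler Spark shipped; the OomExitHandler name is hypothetical), a JVM-wide default uncaught exception handler could halt the process whenever an OutOfMemoryError escapes any thread, so the master sees a dead worker instead of a hung one:

          object OomExitHandler {
            def install(): Unit = {
              Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
                override def uncaughtException(thread: Thread, throwable: Throwable): Unit =
                  throwable match {
                    case _: OutOfMemoryError =>
                      // Halt rather than System.exit: shutdown hooks may themselves
                      // allocate and hang once the heap is exhausted.
                      Runtime.getRuntime.halt(1)
                    case other =>
                      System.err.println(s"Uncaught exception in ${thread.getName}: $other")
                  }
              })
            }
          }

      Alternatively, stock HotSpot JVMs accept -XX:OnOutOfMemoryError="kill -9 %p" (and, on newer JVMs, -XX:+ExitOnOutOfMemoryError), which kill the process without any application-level handler.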

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Josh Rosen (joshrosen)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: