Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.23.0, 1.0.0
    • Fix Version/s: None
    • Component/s: datanode, namenode
    • Labels:
      None

      Description

      We should gracefully handle j.l.OutOfMemoryError exceptions in the NN or DN. We should catch them in a high-level handler, cleanly fail the RPC (vs sending back the OOM stack trace) or background thread, and shut down the NN or DN. Currently the process is left in a not-well-tested state (it continuously fails RPCs and internal threads, may or may not recover, and doesn't shut down gracefully).
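
      A minimal sketch of the proposed guard, for illustration only (handleCall, sendCleanError, and beginShutdown are hypothetical names, not existing NN/DN methods):

          // Hypothetical top-level guard around RPC dispatch.
          try {
              response = handleCall(call);
          } catch (OutOfMemoryError oom) {
              // Fail the RPC cleanly instead of serializing the OOM stack
              // trace back to the client...
              sendCleanError(call, "server out of memory");
              // ...then shut the daemon down rather than limping along.
              beginShutdown("OutOfMemoryError in RPC handler", oom);
          }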

        Activity

        Todd Lipcon added a comment -

        One option we use in HBase is to set the JVM to kill -9 itself immediately on OOME. Probably a little too much for the NN but reasonable default for the DN/TT.

        On the NN I'd rather see the critical threads all get uncaughtExceptionHandlers attached which abort the NN if they fail. So if an individual rpc handler OOMEs (eg by an invalid request making it try to allocate a 4G array or something) it won't take down the NN, whereas if the LeaseManager OOMEs it should.
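
        A sketch of that approach, purely for illustration (runLeaseManager is a hypothetical stand-in for the real thread body):

            // Abort the whole process if a critical background thread dies
            // with an uncaught OOME (or any other Throwable).
            Thread critical = new Thread(() -> runLeaseManager(), "LeaseManager");
            critical.setUncaughtExceptionHandler((t, e) -> {
                System.err.println("FATAL: critical thread " + t.getName() + " died: " + e);
                // halt() skips shutdown hooks, which could themselves hit an OOME.
                Runtime.getRuntime().halt(1);
            });
            critical.start();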

        Shaneal Manek added a comment -

        Incidentally, I worked with a jvmti agent a while ago that did a thread/heap dump on OOM. It was really useful for debugging.

        The license is compatible, so it may be worth scavenging some of that code/functionality - check it out if curious: https://github.com/Greplin/polarbear

        Tsz Wo Nicholas Sze added a comment -

        OutOfMemoryError is a subclass of Error, which "indicates serious problems that a reasonable application should not try to catch" according to the javadoc.

        It is hard to handle OutOfMemoryError. One problem is that there could be more OutOfMemoryErrors being thrown when handling the first OutOfMemoryError.

        Eli Collins added a comment -

        HDFS isn't really an application. If we labor on, subsequent failures can result in data loss. IMO it's better to fail fast.

        Uma Maheswara Rao G added a comment -

        I too agree. Recently I have debugged many issues due to OOMEs in my clusters, for example HADOOP-7916 and HDFS-2850.

        Suresh Srinivas added a comment -

        that a reasonable application should not try to catch

        Nicholas, I think what this means is that an application should not try to catch it for recovery purposes. I think failing fast instead of trying to recover seems like a reasonable choice.

        @Uma

        I too agree.

        You are agreeing with Eli?

        Uma Maheswara Rao G added a comment -

        Suresh, yes you are right.
        Thinking again, how can we do this (fast fail) in client code? The client runs alongside several kinds of applications, right? And it will again be up to the user whether to fast-fail on OOME or not. We will have IPC threads and streamer threads running on the client side. Am I missing something?

        Tsz Wo Nicholas Sze added a comment -

        Sure, logging a message and then calling System.exit(..) is probably doable. I doubt we can do anything more than that.
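
        That minimal version might look like this sketch (serveRequests is a hypothetical stand-in for the daemon's main loop):

            try {
                serveRequests();
            } catch (OutOfMemoryError oom) {
                // Keep the OOME path as simple as possible: anything fancier
                // risks further OOMEs while handling the first one.
                System.err.println("FATAL: out of memory, exiting: " + oom);
                System.exit(-1);
            }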

        Eli Collins added a comment -

        @Uma, this is for server-side OOMs, i.e. the NN/DN should fail fast when they get an unrecoverable OOM (as Todd points out, not all OOMs are unrecoverable). The client will still either get a response indicating failure, as it does today, or its connection will die when the NN shuts down.

        Todd Lipcon added a comment -

        We could simply recommend the following Java option in order to make OOME fail fast:
        -XX:OnOutOfMemoryError="kill -9 %p"
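
        In hadoop-env.sh that might look like the sketch below (HADOOP_NAMENODE_OPTS/HADOOP_DATANODE_OPTS are the conventional hooks; the quoting of the space-containing value is fiddly, so verify it against your shell and Hadoop version):

            # Assumed hadoop-env.sh convention; %p expands to the JVM's pid.
            export HADOOP_NAMENODE_OPTS="-XX:OnOutOfMemoryError='kill -9 %p' $HADOOP_NAMENODE_OPTS"
            export HADOOP_DATANODE_OPTS="-XX:OnOutOfMemoryError='kill -9 %p' $HADOOP_DATANODE_OPTS"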

        Suresh Srinivas added a comment -

        @Eli ... as Todd points out not all OOMs are unrecoverable ...

        On the NN I'd rather see the critical threads all get uncaughtExceptionHandlers attached which abort the NN if they fail. So if an individual rpc handler OOMEs (eg by an invalid request making it try to allocate a 4G array or something) it won't take down the NN, whereas if the LeaseManager OOMEs it should.

        I think this may not be a good idea. In fact I would say it is more important to shut down the NN when an RPC handler gets an OOME. Let's say an RPC handler updated the in-memory namespace and was about to add it to the editlog. The system was indeed running out of memory, and before the editlog could be written the handler got an OOME. If we do not shut down at this time, we could end up with interesting data corruption issues.

        Instead of trying to categorize which OOMEs are safe and which are not, we should use the kill -9 option. In cases where an OOME is caused by the system trying to create a large object, we could add appropriate size/limit checks.
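
        For illustration, such a check during RPC request deserialization might look like this sketch (MAX_DATA_LENGTH and readChecked are hypothetical names, not existing Hadoop code):

            // Hypothetical guard: reject oversized requests up front instead
            // of letting "new byte[hugeLength]" throw an OOME mid-handler.
            static final int MAX_DATA_LENGTH = 64 * 1024 * 1024; // illustrative limit

            static byte[] readChecked(java.io.DataInputStream in) throws java.io.IOException {
                int length = in.readInt();
                if (length < 0 || length > MAX_DATA_LENGTH) {
                    throw new java.io.IOException(
                        "Requested length " + length + " exceeds limit " + MAX_DATA_LENGTH);
                }
                byte[] data = new byte[length];
                in.readFully(data);
                return data;
            }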

        Todd Lipcon added a comment -

        OK, you've convinced me

        Suresh Srinivas added a comment -

        I am going to mark this as Won't Fix. If anyone disagrees, then reopen with a reason.

        Eli Collins added a comment -

        No longer think we should do the kill -9 option?

        Suresh Srinivas added a comment -

        I actually thought about it. But given that the title says "gracefully handle", and killing is not graceful, I decided to close the bug.

        Feel free to change the title and reopen. Or perhaps file a new Jira.

        Eli Collins added a comment -

        Makes sense =)


          People

          • Assignee:
            Unassigned
          • Reporter:
            Eli Collins
          • Votes:
            0
          • Watchers:
            9
