Hadoop HDFS / HDFS-3347

NullPointerException When trying to log to editstreams

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None
    • Environment:

      HDFS

      Description

      When I try to create a file, I get an exception:

      2012-05-02 17:42:55,768 DEBUG hdfs.StateChange (NameNodeRpcServer.java:create(402)) - *DIR* NameNode.create: file /a._COPYING_ for DFSClient_NONMAPREDUCE_1515782500_1 at 10.18.40.95
      2012-05-02 17:42:55,770 DEBUG hdfs.StateChange (FSNamesystem.java:startFileInternal(1547)) - DIR* NameSystem.startFile: src=/a._COPYING_, holder=DFSClient_NONMAPREDUCE_1515782500_1, clientMachine=10.18.40.95, createParent=true, replication=1, createFlag=[CREATE, OVERWRITE]
      2012-05-02 17:42:55,778 WARN  ipc.Server (Server.java:run(1701)) - IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.18.40.95:37973: error: java.lang.NullPointerException
      java.lang.NullPointerException
      	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logEdit(FSEditLog.java:348)
      	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logGenerationStamp(FSEditLog.java:755)
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStamp(FSNamesystem.java:4357)
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1621)
      	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1509)
      	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:409)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:200)
      	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:42590)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:428)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:905)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
      	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1684)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:396)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1682)
      

      Analysing the same; will provide the details soon.

        Issue Links

          Activity

          amith added a comment -

          All operations are failing after this.
          Robert Joseph Evans added a comment -

          What version of Hadoop are you using?
          amith added a comment -

          I got it on trunk (3.0).
          Kihwal Lee added a comment -

          What are dfs.namenode.name.dir and dfs.namenode.edits.dir set to? Did the directory exist before running the namenode?
          Kihwal Lee added a comment -

          More questions:

          Did it work for a while and then exhibit this symptom?
          Do you see a "Rolling edit logs" log message around the time the issue occurs?
          Are any other exceptions or errors occurring?
          amith added a comment -

          Hi,

          The problem is with transitionToActive for the namenode.

          Analysis:

          Here:

          try {
            editLogStream.write(op);
          } catch (IOException ex) {
            // All journals failed, it is handled in logSync.
          }
          

          editLogStream is found to be null!

          The problem is in transitionToActive of the NN, in:

          void startActiveServices() throws IOException {
              LOG.info("Starting services required for active state");
              writeLock();
              try {
                FSEditLog editLog = dir.fsImage.getEditLog();

                if (!editLog.isOpenForWrite()) {
                  // During startup, we're already open for write during initialization.
                  editLog.initJournalsForWrite();     // <-- IAE thrown here causes the transition failure
                  ...
                  dir.fsImage.editLog.openForWrite(); // <-- editLogStream is initialised here; skipped because of the IAE
                }
                ...
            }
          

          FSEditLog#startLogSegment() populates the 'editLogStream' variable; this is done when the NN starts as active.

          Here, editLog.initJournalsForWrite() can fail, throwing:

          java.lang.IllegalArgumentException: No class configured for bkpr
          	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getJournalClass(FSEditLog.java:1204)
          	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1218)
          	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:242)
          	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournalsForWrite(FSEditLog.java:210)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:596)
          	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1246)
          	at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
          	at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
          	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
          	at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1178)
          	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:981)
          	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:80)
          	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:2827)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:428)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:905)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1684)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:396)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1205)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1682)
          

          which caused the init to fail.

          If I check the status of the NN using the command:

          ./hdfs haadmin -getServiceState nn1
          active

          I get the reply as active!

          Now any write operation that logs to the edits causes the NPE.

          I feel that if a transition fails due to an exception, the NN should remain in its old state (here, standby).

          Please correct me if I am wrong
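
          The failure mode described above can be sketched with a minimal, self-contained example (hypothetical class and method names, not the actual HDFS code): if journal initialisation throws an unchecked exception before the stream is opened, the stream field stays null and the next write fails with an NPE, mirroring the stack trace in the description.

```java
// Hypothetical sketch of the failure mode; class and method names are
// illustrative, not the real HDFS classes.
public class TransitionSketch {
    static class EditLog {
        StringBuilder stream; // stays null if initialisation fails early

        void initJournals(String scheme) {
            // Mirrors the "No class configured for bkpr" IllegalArgumentException
            if (!"file".equals(scheme)) {
                throw new IllegalArgumentException("No class configured for " + scheme);
            }
        }

        void openForWrite() {
            stream = new StringBuilder(); // never reached when initJournals() throws
        }

        void logEdit(String op) {
            stream.append(op); // NullPointerException when stream was never opened
        }
    }

    public static void main(String[] args) {
        EditLog log = new EditLog();
        try {
            log.initJournals("bkpr"); // throws IAE, so openForWrite() is skipped
            log.openForWrite();
        } catch (IllegalArgumentException iae) {
            System.out.println("transition failed: " + iae.getMessage());
        }
        try {
            log.logEdit("OP_ADD"); // reproduces the NPE from the report
        } catch (NullPointerException npe) {
            System.out.println("NPE on write, as in the bug report");
        }
    }
}
```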

          amith added a comment -

          Or is it better to remove the stream when createJournal() fails for any non-'file' scheme?
          amith added a comment -

          Can we silently skip adding this stream to the journal set by catching the exception, so that the NN can become fully active? The fix would then be simple, instead of reverting the NN state back to standby.
          Kihwal Lee added a comment -

          The simplest thing you can do is to change HAState.setStateInternal() so that it catches Exception and revert the state before re-throwing. Runtime exceptions such as IAE will be logged on the server-side at WARN level by RPC Server.

          If we start thinking about handling exceptions and dropping streams in FSEditLog.initJournals(), the problem becomes more complex. In many places, it is assumed that FSEditLog.journalSet contains all configured edit streams, working or non-working. This can change in the future if the journal stream recovery becomes more dynamic based on something like zk-backed active journal set state storage.

          Todd Lipcon added a comment -

          I agree with Kihwal. There's already another JIRA open about handling errors during state transition. IMO we should completely bail out, rather than trying to revert state – it's too likely we're in a bizarre corner case that we haven't tested for, if we fail in the midst of a state transition.
          Kihwal Lee added a comment -

          IMO we should completely bail out, rather than trying to revert state

          Do you think the following is reasonable?

          In HAState.setStateInternal(),

          • ServiceFailedException : revert the state while still holding the context.writelock
          • All other Throwables : bail out.
          • When bailing out, do something similar to what's done in JournalSet.mapJournalsAndReportErrors().
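
          A minimal sketch of the plan above (hypothetical names; the real HAState/HAContext wiring is more involved): on ServiceFailedException the previous state is restored before re-throwing, while any other Throwable propagates so the caller can bail out.

```java
// Hypothetical sketch of revert-on-ServiceFailedException; not the actual
// HAState code, and the real fix must hold the context write lock around this.
public class SetStateSketch {
    enum HAStateId { STANDBY, ACTIVE }

    static class ServiceFailedException extends Exception {
        ServiceFailedException(String msg) { super(msg); }
    }

    static HAStateId state = HAStateId.STANDBY;

    // 'failure' selects what the transition body throws, for demonstration.
    static void enterState(String failure) throws ServiceFailedException {
        if ("service".equals(failure)) throw new ServiceFailedException("journals failed");
        if ("runtime".equals(failure)) throw new IllegalArgumentException("bad config");
    }

    static void setStateInternal(HAStateId next, String failure) throws Exception {
        HAStateId prev = state;
        state = next;                 // optimistic transition
        try {
            enterState(failure);
        } catch (ServiceFailedException sfe) {
            state = prev;             // revert; in the real NN, still under the write lock
            throw sfe;
        } catch (Throwable t) {
            // All other Throwables: bail out. The real fix would terminate the
            // process, similar to JournalSet.mapJournalsAndReportErrors().
            throw t;
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            setStateInternal(HAStateId.ACTIVE, "service");
        } catch (ServiceFailedException e) {
            System.out.println("reverted to " + state); // prints "reverted to STANDBY"
        }
        setStateInternal(HAStateId.ACTIVE, null);
        System.out.println("now " + state);             // prints "now ACTIVE"
    }
}
```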
          Todd Lipcon added a comment -

          ServiceFailedException : revert the state while still holding the context.writelock

          I'm worried this is easier said than done, but in principle I like the idea.
          amith added a comment -

          Hi,
          I agree with Kihwal Lee and Todd.

          For any ServiceFailedException we can revert the state, and for other exceptions (including Throwable) we can bail out.

          Some doubts:

          1. Is there any possibility of adding this logic to the failover controllers, rather than including it in the NN and increasing its complexity?

          Like:

          try {
                HAServiceProtocolHelper.transitionToActive(
                    toSvc.getProxy(conf, rpcTimeoutToNewActive));
              } catch (ServiceFailedException sfe) {
                LOG.error("Unable to make " + toSvc + " active (" +
                    sfe.getMessage() + "). Failing back.");
                failed = true;
                cause = sfe;
              } catch (IOException ioe) {
                LOG.error("Unable to make " + toSvc +
                    " active (unable to connect). Failing back.", ioe);
                failed = true;
                cause = ioe;
          +    } catch (Throwable th) {
          +      LOG.error("Unexpected error making " + toSvc +
          +          " active. Failing back.", th);
          +      failed = true;
          +      cause = th;  // the draft had 'ioe' here, which is out of scope in this catch
          +    }
          
          

          2. Can we mark the NN state as Active only when it successfully completes the transition to Active (and similarly for Standby)?

          This could solve the issue, since every operation has a state check.

          Thanks in advance

          Vinayakumar B added a comment -

          IMO, there are the following options with fewer code changes:
          1. Shut down the NameNode on an error during state transition.
          2. Validate all mandatory configurations before starting the process.
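
          Option 2 above could look like the following sketch (hypothetical method names, not the actual HDFS configuration API): check every edits-dir URI scheme against the configured journal classes before the NameNode starts, so a scheme such as 'bkpr' with no matching class fails fast at startup instead of mid-transition.

```java
// Hypothetical startup-time validation sketch; not the real HDFS config API.
import java.net.URI;
import java.util.Set;

public class ConfigCheckSketch {
    // Throws before the NameNode starts serving, instead of during
    // transitionToActive as in this report.
    static void validateEditsDirs(Iterable<URI> editsDirs, Set<String> configuredSchemes) {
        for (URI dir : editsDirs) {
            String scheme = dir.getScheme() == null ? "file" : dir.getScheme();
            if (!"file".equals(scheme) && !configuredSchemes.contains(scheme)) {
                throw new IllegalArgumentException(
                    "No journal class configured for scheme '" + scheme + "' in " + dir);
            }
        }
    }

    public static void main(String[] args) {
        // "file" is always valid; "bkpr" has no configured class, so it fails fast.
        validateEditsDirs(java.util.List.of(URI.create("file:///data/edits")), Set.of());
        try {
            validateEditsDirs(java.util.List.of(URI.create("bkpr://host/ledgers")), Set.of());
        } catch (IllegalArgumentException iae) {
            System.out.println("fails fast: " + iae.getMessage());
        }
    }
}
```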

          Vinayakumar B added a comment -

          Oops, I just saw the fix done by Aaron in HDFS-3026, so this issue will not occur again.
          Closing this issue.
          Vinayakumar B added a comment -

          Already fixed in HDFS-3026.

            People

            • Assignee: amith
            • Reporter: amith
            • Votes: 0
            • Watchers: 13

              Dates

              • Created:
                Updated:
                Resolved:
