Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-5291

Clients need to retry when Active NN is in SafeMode

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 2.1.1-beta
    • 2.3.0
    • ha
    • None
    • Reviewed

    Description

      In our test, we saw NN immediately went into safemode after transitioning to active state. This can cause HBase region server to timeout and kill itself. We should allow clients to retry when HA is enabled and ANN is in SafeMode.

      ============================================
      Some log snippets:

      standby state to active transition

      2013-10-02 00:13:49,482 INFO  ipc.Server (Server.java:run(2068)) - IPC Server handler 69 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from IP:33911 Call#1483 Retry#1: error: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
      2013-10-02 00:13:49,689 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth successful for nn/hostname@EXAMPLE.COM (auth:SIMPLE)
      2013-10-02 00:13:49,696 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful for nn/hostname@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
      2013-10-02 00:13:49,700 INFO  namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1013)) - Stopping services started for standby state
      2013-10-02 00:13:49,701 WARN  ha.EditLogTailer (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted
      java.lang.InterruptedException: sleep interrupted
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:356)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1463)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:454)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTail
      2013-10-02 00:13:49,704 INFO  namenode.FSNamesystem (FSNamesystem.java:startActiveServices(885)) - Starting services required for active state
      2013-10-02 00:13:49,719 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(419)) - Starting recovery process for unclosed journal segments...
      2013-10-02 00:13:49,755 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth successful for hbase/hostname@EXAMPLE.COM (auth:SIMPLE)
      2013-10-02 00:13:49,761 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful for hbase/hostname@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
      2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(421)) - Successfully started new epoch 85
      2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(249)) - Beginning recovery of unclosed segment starting at txid 887112
      2013-10-02 00:13:49,874 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(258)) - Recovery prepare phase complete. Responses:
      IP:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
      172.18.145.97:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
      2013-10-02 00:13:49,875 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recover
      

      And then we get into safemode

      Construction[IP:1019|RBW]]} size 0
      2013-10-02 00:13:50,277 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: IP:1019 is added to blk_IP157{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[IP:1019|RBW], ReplicaUnderConstruction[172.18.145.96:1019|RBW], ReplicaUnde
      rConstruction[IP:1019|RBW]]} size 0
      2013-10-02 00:13:50,279 INFO  hdfs.StateChange (FSNamesystem.java:reportStatus(4703)) - STATE* Safe mode ON.
      The reported blocks 1071 needs additional 5 blocks to reach the threshold 1.0000 of total blocks 1075.
      Safe mode will be turned off automatically
      2013-10-02 00:13:50,279 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: IP:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
      2013-10-02 00:13:50,280 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: 172.18.145.99:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
      2013-10-02 00:13:50,281 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: 172.18.145.97:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
      

      Attachments

        1. nn.log
          807 kB
          Arpit Gupta
        2. HDFS-5291.000.patch
          27 kB
          Jing Zhao
        3. HDFS-5291.001.patch
          27 kB
          Jing Zhao
        4. HDFS-5291.002.patch
          29 kB
          Jing Zhao
        5. HDFS-5291.003.patch
          29 kB
          Jing Zhao
        6. HDFS-5291.003.patch
          30 kB
          Jing Zhao

        Activity

          People

            jingzhao Jing Zhao
            arpitgupta Arpit Gupta
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: