[HDFS-5291] Clients need to retry when Active NN is in SafeMode - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.1.1-beta
Fix Version/s: 2.3.0
Component/s: ha
Labels:
None

Hadoop Flags:

Reviewed

Description

In our test, we saw NN immediately went into safemode after transitioning to active state. This can cause HBase region server to timeout and kill itself. We should allow clients to retry when HA is enabled and ANN is in SafeMode.

============================================
Some log snippets:

standby state to active transition

2013-10-02 00:13:49,482 INFO  ipc.Server (Server.java:run(2068)) - IPC Server handler 69 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from IP:33911 Call#1483 Retry#1: error: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
2013-10-02 00:13:49,689 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth successful for nn/hostname@EXAMPLE.COM (auth:SIMPLE)
2013-10-02 00:13:49,696 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful for nn/hostname@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2013-10-02 00:13:49,700 INFO  namenode.FSNamesystem (FSNamesystem.java:stopStandbyServices(1013)) - Stopping services started for standby state
2013-10-02 00:13:49,701 WARN  ha.EditLogTailer (EditLogTailer.java:doWork(336)) - Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:334)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1463)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:454)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTail
2013-10-02 00:13:49,704 INFO  namenode.FSNamesystem (FSNamesystem.java:startActiveServices(885)) - Starting services required for active state
2013-10-02 00:13:49,719 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(419)) - Starting recovery process for unclosed journal segments...
2013-10-02 00:13:49,755 INFO  ipc.Server (Server.java:saslProcess(1342)) - Auth successful for hbase/hostname@EXAMPLE.COM (auth:SIMPLE)
2013-10-02 00:13:49,761 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(111)) - Authorization successful for hbase/hostname@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnfinalizedSegments(421)) - Successfully started new epoch 85
2013-10-02 00:13:49,839 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(249)) - Beginning recovery of unclosed segment starting at txid 887112
2013-10-02 00:13:49,874 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recoverUnclosedSegment(258)) - Recovery prepare phase complete. Responses:
IP:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
172.18.145.97:8485: segmentState { startTxId: 887112 endTxId: 887531 isInProgress: true } lastWriterEpoch: 84 lastCommittedTxId: 887530
2013-10-02 00:13:49,875 INFO  client.QuorumJournalManager (QuorumJournalManager.java:recover

And then we get into safemode

Construction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,277 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: IP:1019 is added to blk_IP157{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[IP:1019|RBW], ReplicaUnderConstruction[172.18.145.96:1019|RBW], ReplicaUnde
rConstruction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,279 INFO  hdfs.StateChange (FSNamesystem.java:reportStatus(4703)) - STATE* Safe mode ON.
The reported blocks 1071 needs additional 5 blocks to reach the threshold 1.0000 of total blocks 1075.
Safe mode will be turned off automatically
2013-10-02 00:13:50,279 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: IP:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,280 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: 172.18.145.99:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0
2013-10-02 00:13:50,281 INFO  BlockStateChange (BlockManager.java:logAddStoredBlock(2237)) - BLOCK* addStoredBlock: blockMap updated: 172.18.145.97:1019 is added to blk_IP158{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.18.145.99:1019|RBW], ReplicaUnderConstruction[172.18.145.97:1019|RBW], ReplicaUnderConstruction[IP:1019|RBW]]} size 0

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

nn.log
02/Oct/13 23:52
807 kB
Arpit Gupta
HDFS-5291.000.patch
07/Oct/13 02:10
27 kB
Jing Zhao
HDFS-5291.001.patch
07/Oct/13 05:22
27 kB
Jing Zhao
HDFS-5291.002.patch
07/Oct/13 07:19
29 kB
Jing Zhao
HDFS-5291.003.patch
07/Oct/13 17:50
29 kB
Jing Zhao
HDFS-5291.003.patch
07/Oct/13 20:50
30 kB
Jing Zhao

Clients need to retry when Active NN is in SafeMode

Details

Description

Attachments

Attachments

Activity

People

Dates