Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.1.1
Description
When upgrading Hadoop to a new version (following, for example, https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade), I ran into the following situation.
I upgrade the JNs one by one:
- Upgrade and restart JN1.
- The NN sees the JN as offline:
WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to write txns 1205396-1205399. Will try to write to this JN again after the next log roll.
- No log roll happens for some time (at least 1 min).
- Upgrade and restart JN2.
- The NN sees this JN go offline as well:
WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to write txns 1205799-1205800. Will try to write to this JN again after the next log roll.
- BUT at this point there is no JN quorum:
FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246))
10.73.67.212:8485: null [success]
2 exceptions thrown:
10.73.67.132:8485: Journal disabled until next roll
10.73.67.68:8485: End of File Exception between local host is: "srv05.lt01.gismt.crpt.tech/10.73.67.132"; destination host is: "srv07.lt01.gismt.crpt.tech":8485; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
even though JN1 is already back online.
It looks like the NN should retry the JNs marked as offline before giving up.
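To make the sequence above easier to follow, here is a toy model of it in Python. This is not Hadoop code: `JournalNode`, `ToyQuorumWriter`, and the `retry_disabled` switch are hypothetical names made up for illustration, and the model ignores how a journal that missed transactions would have to resynchronize before accepting writes again. It only shows why skipping a journal until the next log roll can cost the quorum during a rolling restart, and how one retry of the skipped journals would change the outcome.

```python
class JournalNode:
    """Hypothetical stand-in for a JournalNode; only tracks reachability."""

    def __init__(self, name):
        self.name = name
        self.online = True


class ToyQuorumWriter:
    """Toy model of a quorum writer (illustrative assumption, not the QJM)."""

    def __init__(self, journals, retry_disabled=False):
        self.journals = journals
        self.disabled = set()  # journals skipped until the next log roll
        self.retry_disabled = retry_disabled

    def _majority(self):
        return len(self.journals) // 2 + 1

    def _attempt(self, skip_disabled):
        acks = 0
        for jn in self.journals:
            if skip_disabled and jn.name in self.disabled:
                continue  # "Journal disabled until next roll"
            if jn.online:
                acks += 1
                self.disabled.discard(jn.name)
            else:
                self.disabled.add(jn.name)
        return acks

    def write(self):
        """Return True if a majority of journals acknowledged the write."""
        acks = self._attempt(skip_disabled=True)
        if acks < self._majority() and self.retry_disabled:
            # Suggested behavior: retry skipped journals before giving up.
            acks = self._attempt(skip_disabled=False)
        return acks >= self._majority()

    def roll(self):
        """A log roll re-enables every journal."""
        self.disabled.clear()


def rolling_restart(retry_disabled):
    jn1, jn2, jn3 = (JournalNode(n) for n in ("JN1", "JN2", "JN3"))
    writer = ToyQuorumWriter([jn1, jn2, jn3], retry_disabled)
    results = []

    jn1.online = False               # restart JN1
    results.append(writer.write())   # JN2 + JN3 ack; JN1 becomes disabled

    jn1.online = True                # JN1 is back, but no log roll yet
    jn2.online = False               # restart JN2
    results.append(writer.write())   # only JN3 acks unless JN1 is retried
    return results


print(rolling_restart(retry_disabled=False))
print(rolling_restart(retry_disabled=True))
```

In this model the second write fails without the retry because JN1 is still skipped while JN2 is down, which matches the FATAL message above. With the retry enabled, JN1 and JN3 together form a majority.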