Details
Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
QuorumJournalManager (HDFS-3077)
None
None
Hadoop Flags: Reviewed
Description
While running some stress tests, I ran into a failover issue when the current edit log segment written by the old active NameNode is large. With a 327MB log segment containing 6.4M transactions, the JournalNode (JN) took ~11 seconds to read and validate it during the recovery step. This exceeded the 10-second timeout for createNewEpoch, so recovery failed.
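A back-of-envelope sketch of the numbers above. It assumes read-and-validate time scales linearly with segment size; that assumption is not something measured in this report. The class and method names (`RecoveryTimeEstimate`, `rate`, `maxWithinTimeout`, `exceedsTimeout`) are illustrative only and are not part of Hadoop.

```java
// Back-of-envelope check of the failover numbers reported above.
// Assumption (not measured): JN read/validate time scales linearly with segment size.
// Class and method names are hypothetical, not part of Hadoop.
public class RecoveryTimeEstimate {

    /** Amount (MB or transactions) processed per second. */
    static double rate(double amount, double seconds) {
        return amount / seconds;
    }

    /** Largest amount that could be processed before the timeout, at the same rate. */
    static double maxWithinTimeout(double amount, double seconds, double timeoutSeconds) {
        return rate(amount, seconds) * timeoutSeconds;
    }

    /** True if the observed processing time runs past the timeout. */
    static boolean exceedsTimeout(double seconds, double timeoutSeconds) {
        return seconds > timeoutSeconds;
    }

    public static void main(String[] args) {
        double segmentMb = 327.0;   // segment size from the report
        double txns = 6.4e6;        // transactions in that segment
        double readSeconds = 11.0;  // approximate JN read + validate time
        double timeout = 10.0;      // createNewEpoch timeout

        System.out.printf("Throughput: %.1f MB/s, %.0f txns/s%n",
                rate(segmentMb, readSeconds), rate(txns, readSeconds));
        System.out.printf("Largest segment within %.0fs: ~%.0f MB, ~%.2fM txns%n",
                timeout,
                maxWithinTimeout(segmentMb, readSeconds, timeout),
                maxWithinTimeout(txns, readSeconds, timeout) / 1e6);
        System.out.println("Exceeds timeout: " + exceedsTimeout(readSeconds, timeout));
    }
}
```

Under the linear assumption, the reported rate works out to roughly 30 MB/s, so only segments up to about 297 MB (about 5.8M transactions) would be validated within the 10-second timeout.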