Description
The standby NameNode uses the following method to decide whether it is time to trigger an edit log roll on the active NameNode:
/**
 * @return true if the configured log roll period has elapsed.
 */
private boolean tooLongSinceLastLoad() {
  return logRollPeriodMs >= 0 &&
      (monotonicNow() - lastLoadTimeMs) > logRollPeriodMs;
}
In doTailEdits(), lastLoadTimeMs is updated whenever the standby successfully tails any edits:
if (editsLoaded > 0) {
  lastLoadTimeMs = monotonicNow();
}
The default for dfs.ha.log-roll.period is 120 seconds and the default for dfs.ha.tail-edits.period is 60 seconds. With in-progress edit log tailing enabled, the standby loads edits on nearly every tail, so lastLoadTimeMs is reset roughly every 60 seconds and the elapsed time rarely exceeds the 120-second roll period. As a result, tooLongSinceLastLoad() almost never returns true, and the edit log is not rolled for a long time, until dfs.namenode.edit.log.autoroll.multiplier.threshold takes effect.
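The timing argument above can be checked with a small standalone simulation. This is only a sketch and not the EditLogTailer code: the class RollCheckSimulation, the method countRollTriggers, and the rule that finalized-only tailing loads edits only after a roll has been triggered are assumptions made for illustration. The check itself mirrors the quoted tooLongSinceLastLoad() condition, with the clock passed in as a parameter.

```java
public class RollCheckSimulation {

    /** Same condition as the quoted method, with the clock passed in explicitly. */
    public static boolean tooLongSinceLastLoad(long nowMs, long lastLoadTimeMs,
                                               long logRollPeriodMs) {
        return logRollPeriodMs >= 0 && (nowMs - lastLoadTimeMs) > logRollPeriodMs;
    }

    /**
     * Counts how often the roll check fires during a simulated run.
     * Assumption: the active writes edits continuously. With in-progress
     * tailing every tail loads edits; without it, edits become loadable
     * only after a roll has finalized the current segment.
     */
    public static int countRollTriggers(long tailPeriodMs, long logRollPeriodMs,
                                        long durationMs, boolean inProgressTailing) {
        long lastLoadTimeMs = 0;
        boolean rolledSinceLastLoad = false;
        int triggers = 0;
        for (long now = tailPeriodMs; now <= durationMs; now += tailPeriodMs) {
            // The check runs before each tail attempt.
            if (tooLongSinceLastLoad(now, lastLoadTimeMs, logRollPeriodMs)) {
                triggers++;
                rolledSinceLastLoad = true;
            }
            boolean editsLoaded = inProgressTailing || rolledSinceLastLoad;
            if (editsLoaded) {
                // Mirrors the lastLoadTimeMs update in doTailEdits().
                lastLoadTimeMs = now;
                rolledSinceLastLoad = false;
            }
        }
        return triggers;
    }

    public static void main(String[] args) {
        long tail = 60_000L;    // dfs.ha.tail-edits.period default, in ms
        long roll = 120_000L;   // dfs.ha.log-roll.period default, in ms
        long hour = 3_600_000L; // simulated duration
        System.out.println("in-progress tailing: "
                + countRollTriggers(tail, roll, hour, true));
        System.out.println("finalized-only tailing: "
                + countRollTriggers(tail, roll, hour, false));
    }
}
```

Under these assumptions the in-progress case never triggers a roll during the simulated hour, because the elapsed time since the last load is always one tail period (60 seconds), which never exceeds the 120-second roll period. The finalized-only case triggers rolls periodically.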
In our deployment, this resulted in in-progress edit logs being deleted. The standby was able to checkpoint twice while the in-progress edit log kept growing on the active. When NNStorageRetentionManager decided to clean up old checkpoints and edit logs, it removed the in-progress edit log from both the active and QJM, because the txnid of the in-progress edit log was older than the 2 most recent checkpoints. A few minutes' worth of metadata was irrecoverably lost.
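The sequence of events can be illustrated with a toy model. This is not the NNStorageRetentionManager implementation: the class RetentionToyModel, the Segment type, and the purge rule (delete every segment whose first txid is below the oldest retained checkpoint) are simplifying assumptions, and the real retention logic takes additional settings into account. The sketch only shows why a long-lived in-progress segment can fall below the purge threshold once two newer checkpoints exist.

```java
import java.util.ArrayList;
import java.util.List;

public class RetentionToyModel {

    /** Hypothetical edit log segment, identified by its first txid. */
    public static final class Segment {
        public final long firstTxId;
        public final boolean inProgress;

        public Segment(long firstTxId, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.inProgress = inProgress;
        }
    }

    /**
     * Txid of the oldest checkpoint that is still retained.
     * checkpointTxIds must be non-empty and sorted in ascending order.
     */
    public static long purgeThreshold(List<Long> checkpointTxIds, int retain) {
        int oldestRetained = Math.max(0, checkpointTxIds.size() - retain);
        return checkpointTxIds.get(oldestRetained);
    }

    /**
     * Assumed purge rule: any segment that starts before the threshold is
     * deleted, regardless of whether it is still in progress.
     */
    public static List<Segment> purge(List<Segment> segments, long threshold) {
        List<Segment> purged = new ArrayList<>();
        for (Segment s : segments) {
            if (s.firstTxId < threshold) {
                purged.add(s);
            }
        }
        return purged;
    }

    public static void main(String[] args) {
        // One in-progress segment that was never rolled (illustrative txids).
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(1_000L, true));

        // The standby checkpoints twice while the segment keeps growing.
        List<Long> checkpoints = new ArrayList<>();
        checkpoints.add(5_000L);
        checkpoints.add(9_000L);

        long threshold = purgeThreshold(checkpoints, 2);
        for (Segment s : purge(segments, threshold)) {
            System.out.println("purged segment starting at txid " + s.firstTxId
                    + ", inProgress=" + s.inProgress);
        }
    }
}
```

In this model the in-progress segment starts before both retained checkpoints, so it is selected for deletion even though it still holds the most recent edits.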
Attachments
Issue Links
- breaks
  - HDFS-14349 Edit log may be rolled more frequently than necessary with multiple Standby nodes (Open)
- relates to
  - HDFS-10519 Add a configuration option to enable in-progress edit log tailing (Resolved)