Affects Version/s: None
Fix Version/s: None
In one of our clusters, a NameNode restart failed because a stale/failed txn was present on one JournalNode (JN) but not on the others.
1. Full cluster restart
2. The startLogSegment txn (195222) was synced on only one JN and failed on the others because they were shutting down. On those JNs only the editlog file was created and the txn was not synced, so after the restart their segments were marked as empty.
3. The cluster restarted. During failover, this new log segment was missed by recovery because the JN holding it was slow to respond to the recovery call.
4. Recovery on the other JNs succeeded because they had no in-progress files.
5. editlog.openForWrite() detected that txn 195222 was already available and failed the failover.
The same steps repeated until the stale editlog on that JN was manually deleted.
Since QJM is a quorum of JNs, a txn is considered successful only if it is written to at least a minimum quorum of JNs; otherwise it is treated as failed.
The same rule should therefore be applied when selecting streams for reading.
Stale/failed txns that are present on fewer than a quorum of JNs should not be considered for reading.
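
The quorum rule proposed above can be illustrated with a minimal standalone sketch. This is not the QJM code; the class and method names (QuorumSegmentFilter, majoritySize, selectQuorumSegments) are hypothetical, each JN's response is reduced to a plain list of segment start txids, and the start txid 195000 is an invented value used only for the example. A segment is kept for reading only if a majority of the configured JNs report it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Hadoop code base.
final class QuorumSegmentFilter {

    /** Smallest number of JNs that forms a majority, e.g. 2 out of 3. */
    static int majoritySize(int numJournalNodes) {
        return numJournalNodes / 2 + 1;
    }

    /**
     * @param segmentsPerJn   for each responding JN, the start txids of the
     *                        segments it reports
     * @param numJournalNodes total number of configured JNs
     * @return start txids reported by at least a majority of JNs, ascending
     */
    static List<Long> selectQuorumSegments(List<List<Long>> segmentsPerJn,
                                           int numJournalNodes) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (List<Long> jnSegments : segmentsPerJn) {
            // Count each segment at most once per JN.
            for (Long startTxId : new HashSet<>(jnSegments)) {
                counts.merge(startTxId, 1, Integer::sum);
            }
        }
        int quorum = majoritySize(numJournalNodes);
        List<Long> selected = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() >= quorum) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Only the first JN holds the stale segment starting at txn 195222.
        List<List<Long>> reports = Arrays.asList(
                Arrays.asList(195000L, 195222L),
                Arrays.asList(195000L),
                Arrays.asList(195000L));
        System.out.println(selectQuorumSegments(reports, 3)); // prints [195000]
    }
}
```

With three JNs the majority is two, so the segment starting at 195222, seen on a single JN, is dropped from the read set, while the segment seen on all three is kept.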
HDFS-10519 does similar work to consider only 'durable' txns based on 'committedTxnId'. However, updating 'committedTxnId' on every flush with one more RPC seems likely to be a problem for performance.