What is the behavior by standby, with this patch, if it has completely read the last segment and is waiting for the new segment to be completed? I believe in that case it would anyway return zero.
Not quite. With this patch, if the standby NN has never been in the active state, the metric will always output 18179, probably because of some oddity in the way metrics output negative values (curSegmentTxId is initially set to HdfsConstants.INVALID_TXID, which is -12345). This is obviously incorrect. If the standby NN has previously been in the active state, this metric will always output 2, which is also incorrect.
We will end up reading from in_progress log for automatic failover to reduce the failover times.
Maybe. I strongly suspect that the time for automatic failover will be dominated by the time to detect the failure of the active and fence it, not by the time it takes to read the most recent edit log segment once we've decided to fail over. In that case, reading in-progress edit logs will provide little benefit.
Regardless, this isn't how it's implemented now.
This would be one less place to change when standby starts reading from in_progress.
Except that we should write a test asserting that this metric outputs the correct values, in which case this code might change anyway. We don't yet know how reading in-progress edit logs will be implemented.
Regarding testing, any HA test will run into it. I have a 100% hit rate on the actual cluster
Sure, but none of the tests will fail because of this error, will they? You'll see an error in the NN log if you look, but only if you look. And even if tests were failing without this patch, there's still no test asserting that the metric outputs the correct value in the case of the standby NN.
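
To make the kind of assertion I'm asking for concrete, here is a rough, self-contained sketch. It is not the NN code: the class, the method, and the formula (last written txid minus segment start txid, plus one) are all hypothetical stand-ins, and it assumes, as the quoted question suggests, that zero is the value we want from a standby. Only the sentinel value mirrors HdfsConstants.INVALID_TXID.

```java
// Hypothetical model for illustration only. None of these names are Hadoop
// APIs; INVALID_TXID simply mirrors the value of HdfsConstants.INVALID_TXID.
final class TxnMetricSketch {
    static final long INVALID_TXID = -12345L;

    /**
     * ASSUMED metric formula: transactions written to the current segment.
     * A standby (or a NN with no open segment) reports 0 instead of doing
     * arithmetic on the sentinel or on stale segment state.
     */
    static long transactionsSinceLastLogRoll(boolean isActive,
                                             long curSegmentTxId,
                                             long lastWrittenTxId) {
        if (!isActive || curSegmentTxId == INVALID_TXID) {
            return 0L;
        }
        return lastWrittenTxId - curSegmentTxId + 1;
    }

    // Minimal assertion helper so the sketch runs without a test framework.
    static void check(long expected, long actual, String label) {
        if (expected != actual) {
            throw new AssertionError(
                label + ": expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        // Standby that has never been active: the sentinel must not leak out.
        check(0L, transactionsSinceLastLogRoll(false, INVALID_TXID, 0L),
              "standby, never active");
        // Standby that was previously active: stale state must not leak out.
        check(0L, transactionsSinceLastLogRoll(false, 101L, 102L),
              "standby, previously active");
        // Active NN: ordinary count for the current segment.
        check(5L, transactionsSinceLastLogRoll(true, 101L, 105L),
              "active");
    }
}
```

A real test would read the value through the metrics system on a standby NN in an HA mini-cluster rather than call a helper directly; the point is only that both standby cases get an explicit assertion.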