Hi Andrew, thanks for the explanation. I think I understand your concern now: rolling on the ANN based only on the number of edits may cause issues in some scenarios. If there are no further operations, the SBN may wait a long time before it can tail the edits that sit in an in-progress segment.
Checkpointing combines the edit log with the fsimage, and we purge unnecessary log segments afterwards.
But I'm still a little confused about this part. I fail to see the difference between time-based rolling triggered by the SBN and time-based rolling done by the ANN itself. In the current code, the SBN still triggers rolling through an RPC to the ANN. Also, this does not affect checkpointing and purging: when the SBN does a checkpoint, both the SBN and the ANN purge old edits in their own storage (the SBN purges before uploading the checkpoint, and the ANN purges after receiving the new fsimage).
So I guess a possible solution may be to just let the ANN roll every 2 minutes. I think this can achieve almost the same effect as the current mechanism, without delaying the failover. Or do you see any counterexamples with this change?
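To make the proposal concrete, here is a rough sketch of the roll decision the ANN could make locally. This is only an illustration, not the actual HDFS code: the class name `EditLogRollPolicy`, its fields, and the rule that an empty segment is never rolled are all my assumptions.

```java
// Hypothetical sketch of an ANN-side roll policy; not the real HDFS implementation.
final class EditLogRollPolicy {
    // Count-based threshold: roll once this many txns are in the open segment (assumed).
    private final long maxTxnsPerSegment;
    // Time-based threshold: roll once the open segment is this old, e.g. 2 minutes.
    private final long maxSegmentAgeMillis;

    EditLogRollPolicy(long maxTxnsPerSegment, long maxSegmentAgeMillis) {
        this.maxTxnsPerSegment = maxTxnsPerSegment;
        this.maxSegmentAgeMillis = maxSegmentAgeMillis;
    }

    /**
     * Decide whether the in-progress segment should be finalized.
     *
     * @param txnsInSegment    transactions written to the in-progress segment
     * @param segmentAgeMillis time since the in-progress segment was opened
     */
    boolean shouldRoll(long txnsInSegment, long segmentAgeMillis) {
        if (txnsInSegment <= 0) {
            // Assumption: nothing for the SBN to tail, so rolling gains nothing.
            return false;
        }
        if (txnsInSegment >= maxTxnsPerSegment) {
            // Count-based trigger: the segment is large enough.
            return true;
        }
        // Time-based trigger: edits exist but no further operations arrived,
        // so finalize the segment to let the SBN tail it.
        return segmentAgeMillis >= maxSegmentAgeMillis;
    }
}
```

A periodic thread on the ANN could call `shouldRoll` with the current segment's counters and roll locally when it returns true, so no RPC from the SBN is involved.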
Back to the solution of changing the RPC timeout. It looks like we have not set a timeout for this NN-->NN RPC right now (correct me if I'm wrong). Setting a timeout (e.g., 20s, like the default timeout from client to NN) can of course improve the failover time in our test case, but I still prefer the above solution because it makes the rolling behavior simpler and more predictable (in particular, it removes the RPC call from the SBN to the ANN).