I thought a bit about that, but it would require another communication channel between the active and the standby, and it has implications for decommissioning standbys as well.
For example, one solution I considered was to have the SBN write a file into the shared edits dir marking the latest txnid for which it had a checkpoint. The ANN could then use that to determine what point the edit logs could be purged to. However, this was problematic for several reasons:
1) Decommissioning an SBN becomes more complicated than just turning it off – if you just turn it off, then the active will never again purge edit logs, which seems problematic.
2) Dropping a file in the shared edits dir breaks the "journal" abstraction - we'd need to implement a different back-channel for BK-based logging, for example.
3) Extra code complexity, especially if in the future we want to support multiple SBNs.
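To make problem (1) concrete, here is a minimal sketch of what the marker-file approach would compute on the active side. This is not code from the patch: the class and method names are hypothetical, and it assumes each SBN publishes a single "latest checkpointed txid" marker that the ANN reads back as a number.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical sketch of the rejected marker-file scheme; not NameNode code.
final class MarkerBasedPurge {
    /**
     * Returns the highest txid up to which edit logs could be purged, given
     * the checkpoint markers written by each standby. Returns 0 ("purge
     * nothing") when no marker exists, on the assumption that txids start at 1.
     */
    static long purgeableUpTo(Collection<Long> markerTxIds) {
        if (markerTxIds.isEmpty()) {
            return 0L;
        }
        long min = Long.MAX_VALUE;
        for (long txId : markerTxIds) {
            // The slowest standby bounds what may be purged.
            min = Math.min(min, txId);
        }
        return min;
    }

    public static void main(String[] args) {
        // One healthy standby at txid 5000, one that was switched off at 120.
        System.out.println(purgeableUpTo(Arrays.asList(5000L, 120L)));
    }
}
```

A standby that is simply turned off leaves its marker at its last value, so the minimum never advances and the active stops purging. Avoiding that would need some explicit decommission step to remove the marker.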
I also considered the operator perspective of consistency with other similar systems. In configuring MySQL replication, for example, the operator configures a "binary log retention period" as a number of days for which to retain older binlogs. If the slave is down for longer than this period, then it has to be re-bootstrapped with an rsync from the master.
Given that we intend to later implement automatic bootstrapping if an SBN is started with a too-old image (HDFS-2731), that seems like a much simpler solution to the problem.
The other advantage of the method implemented here is that other systems which want to consume edit logs probably will want higher retention as well, without the complexity of implementing a back-channel "purge" command to the NN.
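For comparison, a retention-based policy needs no feedback from the standby at all. The sketch below is illustrative only, not the code in the patch: the names are hypothetical, and it assumes retention is expressed as a count of extra transactions kept behind the active's latest checkpoint, with txids starting at 1.

```java
// Hypothetical sketch of a retention-based purge policy; not NameNode code.
final class RetentionPurge {
    /**
     * First txid that must be kept. A checkpoint at txid C already covers
     * edits up to C, so only C+1 onward is strictly needed; we additionally
     * keep extraEditsRetained transactions before that point.
     */
    static long firstTxIdToKeep(long latestCheckpointTxId, long extraEditsRetained) {
        return Math.max(1L, latestCheckpointTxId + 1 - extraEditsRetained);
    }

    /**
     * A standby whose image ends at standbyImageTxId must read edits starting
     * at the next txid. If those were already purged, it has to be
     * re-bootstrapped from the active.
     */
    static boolean needsRebootstrap(long standbyImageTxId, long firstTxIdRetained) {
        return standbyImageTxId + 1 < firstTxIdRetained;
    }

    public static void main(String[] args) {
        long keepFrom = firstTxIdToKeep(1000L, 200L);
        System.out.println("keep from txid " + keepFrom);
        System.out.println("standby at 700 needs re-bootstrap: " + needsRebootstrap(700L, keepFrom));
        System.out.println("standby at 900 needs re-bootstrap: " + needsRebootstrap(900L, keepFrom));
    }
}
```

The purge decision depends only on local state and one configured number, so it works the same way for any journal implementation and for any number of standbys or other edit log consumers.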