Combination of this and
HDFS-3077 sound very much like ZAB with one difference - in the use of 2-phase commit for the write broadcast.
Lets say there is 1 active NN writing to a quorum set of journal daemons. This is the same as ZAB. Active NN writes edits and ZAB leader writes new states.
ZAB uses 2 phase commits (without abort) for each write while our design is getting away without it. I am wondering why we can get away with it.
My guess is that each follower in ZAB can also serve reads from clients. Hence, it cannot serve an update until it is guaranteed that a quorum of followers has agreed on that update. That is what 2 phase commit gives.
In our case, the active NN is the only server for client reads. Hence, updates are not served to clients until a quorum acks back.
However, the above would break for us if the standby NN is using any journal daemon to refresh its state. Because, ideally, a journal node should not inform the standby about an update until the it knows that the update has been accepted by a quorum of journal daemons. That would require a 2 phase commit.
E.g. Standby NN3 reads the last edit written to JN1 by old active NN1, before NN1 realized that it has lost quorum to NN2 (by failing to write to JN2 and JN3).
Perhaps we can get away with this by using some assumptions on timeouts, or by additional constraints on the standby. Eg. that it only syncs with finalized edit segments.
If we say that the standby sync with only the finalized log segments in order to be safe from the above, then IMO, the tailing of the edits by the standby should not be done by the standby directly but via a journal daemon API for the standby. This JD API would ensure that only valid edits are being sent to the standby (edits from finalized segments or edits known to be safely committed to a quorum of journal daemons). This way the correctness of the journal protocol would remain inside it. Instead of leaking it into the standby by having the standby code remember rules for tailing edits.