I thought we were not going to have external STONITH using special devices, and that that is mainly the reason we are jumping through hoops to implement fencing in the journal daemons.
In the current design, which uses a filer, we require external stonith devices. There is no correct way of doing it without either stonith or storage fencing.
The proposal with journal-daemon-based fencing is essentially the same as storage fencing - just that we do it with our own software storage instead of a NAS/SAN.
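To make the equivalence concrete, here is a toy sketch (hypothetical names, not the actual journal daemon code) of epoch-based writer fencing: a new active claims a higher epoch, after which the journal rejects the stale writer, exactly as storage fencing would.

```python
class JournalNode:
    """Toy journal daemon: accepts writes only from the highest epoch promised."""
    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        # A new writer "fences" its predecessors by claiming a higher epoch.
        if epoch <= self.promised_epoch:
            raise IOError("epoch %d <= promised %d" % (epoch, self.promised_epoch))
        self.promised_epoch = epoch

    def journal(self, epoch, record):
        # A stale writer (lower epoch) is rejected: this is storage fencing.
        if epoch < self.promised_epoch:
            raise IOError("fenced: writer epoch %d < %d" % (epoch, self.promised_epoch))
        self.edits.append(record)

jn = JournalNode()
jn.new_epoch(1)
jn.journal(1, "edit-A")      # old active writes fine
jn.new_epoch(2)              # new active takes over, fencing epoch 1
try:
    jn.journal(1, "edit-B")  # old active is now rejected
except IOError as e:
    print("old writer fenced:", e)
```

Whether the storage is a filer or our own daemons, the fencing decision is made at the storage, not at the old active.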
Why is the behaviour different from what happens when the ZKFC loses the ephemeral node? Currently, when the ZKFC loses its ephemeral node, it will shut down the active NN.
No, it doesn't - it will transition it to standby. But, as I commented elsewhere, this is redundant, because the new active is actually going to fence it anyway before taking over.
Similarly, if the active NN does not hear from the ZKFC, it implies that the ZKFC is dead or stuck in a GC pause, which essentially results in the loss of the ephemeral node.
But this can reduce uptime. For example, imagine an administrator accidentally changes the ACL on ZooKeeper. This causes both ZKFCs to get an authentication error and crash at the same time. With your design, both NNs will then commit suicide. With the existing implementation, the system will continue to run in its existing state: no new failovers will occur, but whoever is active will remain active.
If the active NN loses quorum, it has to shut down.
Yes, it has to shut down before it does any edits, or it has to be fenced by the next active. Notification of session loss is asynchronous. The same is true of your proposal. In either case it can take arbitrarily long before it "notices" that it should not be active. So we still require that the new active fence it before it becomes active. So, this proposal doesn't solve any problems.
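The race is easy to see in a toy, single-threaded timeline (hypothetical names, purely illustrative): the old active keeps issuing writes between the moment its session actually expires and the moment the expiry callback is delivered, which is why the new active must fence it through the storage rather than trust the notification.

```python
# Toy timeline: session expiry is only *noticed* asynchronously, so the old
# active can still issue writes in the window before the callback fires.
events = []

class OldActive:
    def __init__(self):
        self.session_alive = True   # what the ZK server believes
        self.noticed_loss = False   # what the NN believes

    def write_edit(self, txid):
        # The NN happily writes as long as it hasn't *noticed* the loss.
        if not self.noticed_loss:
            events.append(("write", txid))

nn = OldActive()
nn.write_edit(1)
nn.session_alive = False        # session expires on the server side...
nn.write_edit(2)                # ...but the NN still writes: the race window
nn.noticed_loss = True          # expiry callback finally delivered
nn.write_edit(3)                # now suppressed
print(events)  # txid 2 happened after the session was already gone
```

No amount of self-shutdown logic closes this window; only fencing by the new active does.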
In fact, one of the most difficult APIs to implement correctly would be transitionToStandby() from the active state.
We already have that implemented. It syncs any existing edits, and then stops allowing new ones. We allow failover from one node to another without aborting, so long as it's graceful. This is perfectly correct. If we need to do a non-graceful failover, we fence the node by STONITH or by disallowing further access to the edit logs (which indirectly causes the node to abort, since logSync() fails).
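A minimal sketch of that graceful path (hypothetical names, not the actual NameNode code): transitionToStandby() drains and syncs any pending edits, then closes the log for writes, so a straggler logSync-style call fails fast and the writer aborts itself.

```python
import threading

class EditLog:
    """Toy edit log with a graceful transition-to-standby."""
    def __init__(self):
        self.lock = threading.Lock()
        self.open_for_write = True
        self.pending = []
        self.synced = []

    def log_edit(self, record):
        with self.lock:
            if not self.open_for_write:
                # After transition/fencing, further writes abort the caller.
                raise IOError("edit log closed for writes")
            self.pending.append(record)

    def log_sync(self):
        with self.lock:
            self.synced.extend(self.pending)
            self.pending.clear()

    def transition_to_standby(self):
        with self.lock:
            # 1) sync anything already logged, 2) stop allowing new edits
            self.synced.extend(self.pending)
            self.pending.clear()
            self.open_for_write = False

log = EditLog()
log.log_edit("mkdir /a")
log.transition_to_standby()     # drains pending edits, then closes
try:
    log.log_edit("mkdir /b")    # a straggler write now fails fast
except IOError as e:
    print("rejected:", e)
print(log.synced)               # only the pre-transition edit survived
```

The non-graceful case just skips the cooperative step: revoking the old active's access to the logs produces the same IOError-on-write, forcing it to abort.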
It seems you're trying to solve problems we've already solved.