1. I discussed with Arpit Agarwal offline and he suggested us use the same logic in updateActorStatesFromHeartbeat to update the active NN bpServiceToActive, which has dealt with several cases carefully. Moreover, if we are updating bpServiceToActive we should likely also update lastActiveClaimTxId. To achieve this, I think we can pass NNHAStatusHeartbeatProto instead of HAServiceStateProto in NamespaceInfoProto.
Mingliang Liu, I actually did it this way in the patch on purpose. The entire logic of updating bpServiceToActive will occur before any heartbeats start, since we are doing this during the handshake between the DN and the NN. If we send in an NNHAStatusHeartbeatProto instead of a HAServiceStateProto then we will have to deal with the lastActiveClaimTxId as you have mentioned. However, this would require more serious changes to the code, since we would have to either set and send along a TxId on the NN side (extra code change for what I see is negligible benefit) or we would need to arbitrarily create one on the DN side (would need to set it to be below the first heartbeat TxId, so it would have to be a negative number or would have to make extra changes).
At this point, we want the DN to have an active before it starts trying to do anything with it (the whole point of this fix). If, for whatever reason, both NNs declare themselves as active, then it will choose the first one and ignore the second. If the wrong assertion is made, then it will talk to the standby and we will get a simple standby exception and then once the next heartbeat comes we will update the correct active. So worst case scenario we get a standby exception and retry, which is still loads better than the NPE that we were getting before. I think that since this is such a small window that it is unnecessary to make more changes with the TxId.
Daryn Sharp may have more thoughts on this.
2. For the unit test, can we set a very large heartbeat interval in configuration, and check the active NN is not null after cluster.waitForActive()? Mocked tests are useful as well and can be kept. Another idea is to drop heartbeat request against a spied HeartbeatManager.
This should be fairly easy to do. I'll put up a patch shortly with this added test.