Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
None
Description
We have the following within our docs (point 4 here):
In the first phase, the membership coordinator sends out a view preparation message to all members and waits 12 seconds for a view preparation ack return message from each member. If the coordinator does not receive an ack message from a member within 12 seconds, the coordinator attempts to connect to the member’s failure-detection socket. If the coordinator cannot connect to the member’s failure-detection socket, the coordinator declares the member dead and starts the membership view protocol again from the beginning.
These 12 seconds refer to the viewAckTimeout property within the GMSJoinLeave class, which is calculated as follows:
    long ackCollectionTimeout = config.getMemberTimeout() * 2 * 12437 / 10000;
    if (ackCollectionTimeout < 1500) {
      ackCollectionTimeout = 1500;
    } else if (ackCollectionTimeout > 12437) {
      ackCollectionTimeout = 12437;
    }
    ackCollectionTimeout = Long
        .getLong(GeodeGlossary.GEMFIRE_PREFIX + "VIEW_ACK_TIMEOUT", ackCollectionTimeout)
        .longValue();
    this.viewAckTimeout = ackCollectionTimeout;
So the actual value of viewAckTimeout is member-timeout * 2 * 1.2437 milliseconds (roughly 2.5 times member-timeout), but it can't be lower than 1.5 seconds or higher than 12.437 seconds, unless the user configures the undocumented VIEW_ACK_TIMEOUT system property. I haven't found any tests or anything else related to that property, so it shouldn't be used at all: we don't know what the negative implications, if any, might be.
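To make the clamping concrete, here is a minimal standalone sketch of the same arithmetic. The class and method names (ViewAckTimeoutSketch, computeViewAckTimeout) are hypothetical and not part of Geode; the sketch takes member-timeout in milliseconds and deliberately leaves out the VIEW_ACK_TIMEOUT system property override.

```java
public class ViewAckTimeoutSketch {
    // Bounds copied from the snippet quoted above, in milliseconds.
    static final long MIN_ACK_TIMEOUT_MS = 1500;
    static final long MAX_ACK_TIMEOUT_MS = 12437;

    // Hypothetical helper: mirrors the quoted calculation without the
    // system property override (an assumption made to keep it standalone).
    static long computeViewAckTimeout(long memberTimeoutMs) {
        // Same integer arithmetic as the original: memberTimeout * 2 * 1.2437
        long ackCollectionTimeout = memberTimeoutMs * 2 * 12437 / 10000;
        if (ackCollectionTimeout < MIN_ACK_TIMEOUT_MS) {
            ackCollectionTimeout = MIN_ACK_TIMEOUT_MS;
        } else if (ackCollectionTimeout > MAX_ACK_TIMEOUT_MS) {
            ackCollectionTimeout = MAX_ACK_TIMEOUT_MS;
        }
        return ackCollectionTimeout;
    }

    public static void main(String[] args) {
        // Sample member-timeout values chosen only for illustration.
        long[] samples = {500, 1000, 3000, 5000, 10000};
        for (long memberTimeoutMs : samples) {
            System.out.println("member-timeout=" + memberTimeoutMs
                + " ms -> viewAckTimeout=" + computeViewAckTimeout(memberTimeoutMs) + " ms");
        }
    }
}
```

With these inputs, a member-timeout of 500 ms is raised to the 1500 ms floor, 1000 ms gives 2487 ms, 3000 ms gives 7462 ms, and anything from 5000 ms upward gives the 12437 ms ceiling.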
We should either remove the internal check and allow the user to fully configure this property (member-timeout * 2 by default) or add better documentation about this internal timeout and why it shouldn't be changed outside of the fixed interval.