Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
In a test that kills and restarts locators, one of the restarting locators times out trying to join the distributed system. Logs show that another locator was becoming the membership coordinator and was delayed in sending out a membership view when it processed a join request for a member that was already in the distributed system. The sequence of events was:
locator A gets join request from node 1 and sends a PREPARE
node 1 sets its identity's view ID using the PREPAREd view
locator A is killed
node 1 sends a join request to locator B. Its identity has a view ID set.
node 2 sends a join request to locator B and gets a PREPARE
locator B processes node 1's join request and assigns a new view ID to it
locator B processes node 2's join request and assigns a new view ID to it
locator B sends the PREPARE with these two new nodes. The view also contains node 1's original ID, so node 1 appears in it twice.
locator B times out waiting for a response from node 1 with the new view ID and declares it crashed. It sends out a new PREPARE without that address.
node 2 gives up waiting
locator B gets no response from node 2 and declares it crashed, sends out a new PREPARE without node 2 and succeeds.
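The core of the sequence above is that a join request from a process already present in the view is treated as a brand-new member. The following is a minimal Python sketch of that failure mode and of one possible guard against it. It is not Geode's code or the actual fix: `MemberId`, `build_view_naive`, and `build_view_dedup` are hypothetical names, and skipping the stale request is only one assumed way to handle it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemberId:
    """Hypothetical stand-in for a member identity (not a Geode class)."""
    host: str
    port: int
    pid: int
    view_id: Optional[int] = None  # None until a coordinator assigns one

    def same_process(self, other: "MemberId") -> bool:
        # Compare everything except the view ID.
        return (self.host, self.port, self.pid) == (other.host, other.port, other.pid)


def build_view_naive(current: List[MemberId], requests: List[MemberId],
                     new_view_id: int) -> List[MemberId]:
    """Adds every requester with a fresh view ID, even if it is already a member."""
    members = list(current)
    for req in requests:
        members.append(MemberId(req.host, req.port, req.pid, new_view_id))
    return members


def build_view_dedup(current: List[MemberId], requests: List[MemberId],
                     new_view_id: int) -> List[MemberId]:
    """Assumed guard: ignore a request whose process is already in the view."""
    members = list(current)
    for req in requests:
        if any(m.same_process(req) for m in members):
            continue  # already present under an earlier view ID
        members.append(MemberId(req.host, req.port, req.pid, new_view_id))
    return members
```

With a current view that already holds process 616 at v46, the naive builder produces two entries for that process (v46 and v60), while the guarded builder keeps only the original one.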
Here are log snippets showing the problem. Process 616 has a JoinRequest queued when this locator becomes coordinator. The JoinRequest ID has v46 already in it, showing that a PREPARE has already been sent with this member in it.
The locator then creates a new View that has process 616's ID in it twice: once with v46 and once with v60.
locatorgemfire_2_2_29835/system.log: [fine 2019/03/27 22:22:22.817 PDT locatorgemfire_2_2_host2_29835 <Geode Membership View Creator> tid=0xba] processing request JoinRequestMessage(rs-GEM-2463-1622a0i32xlarge-hydra-client-17(peergemfire_2_1_host2_616:616)<ec><v46>:41004) failureDetectionPort:43747
locatorgemfire_2_2_29835/system.log: [fine 2019/03/27 22:22:22.817 PDT locatorgemfire_2_2_host2_29835 <Geode Membership View Creator> tid=0xba] processing request JoinRequestMessage(rs-GEM-2463-1622a0i32xlarge-hydra-client-17(locatorgemfire_2_3_host2_746:746:locator)<ec>:41002) failureDetectionPort:52188
locatorgemfire_2_2_29835/system.log: [info 2019/03/27 22:22:22.818 PDT locatorgemfire_2_2_host2_29835 <Geode Membership View Creator> tid=0xba] preparing new view View[rs-GEM-2463-1622a0i32xlarge-hydra-client-17(locatorgemfire_2_2_host2_29835:29835:locator)<ec><v24>:41001|60] members: [rs-GEM-2463-1622a0i32xlarge-hydra-client-17(locatorgemfire_2_2_host2_29835:29835:locator)<ec><v24>:41001, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(peergemfire_2_2_host2_30052:30052)<ec><v25>:41007{lead}, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(locatorgemfire_2_4_host2_31300:31300:locator)<ec><v29>:41003, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(locatorgemfire_2_1_host2_31671:31671:locator)<ec><v41>:41000, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(peergemfire_2_2_host2_31856:31856)<ec><v42>:41006, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(peergemfire_2_1_host2_32560:32560)<ec><v44>:41005, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(peergemfire_2_1_host2_616:616)<ec><v46>:41004, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(peergemfire_2_1_host2_616:616)<ec><v60>:41004, rs-GEM-2463-1622a0i32xlarge-hydra-client-17(locatorgemfire_2_3_host2_746:746:locator)<ec><v60>:41002]
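Spotting the doubled entry by eye in a view line this long is error-prone. The following is a rough Python sketch for scanning a "preparing new view" log line for members listed more than once with different view IDs. The regular expression is inferred only from the member format visible in the lines above (`host(name)<flags><vN>:port`), so it is an assumption and may not cover other Geode log formats. `duplicate_members` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# host(name)<flag>...<vN>:port, as seen in the log excerpt; the <vN> part is optional.
MEMBER_RE = re.compile(r"([\w.-]+)\(([^)]*)\)(?:<[a-z]+>)*(?:<v(\d+)>)?:(\d+)")


def duplicate_members(view_line: str) -> dict:
    """Return {(host, name, port): [view IDs]} for members listed more than once."""
    # Only look at the member list, not the view creator printed before it.
    _, _, member_list = view_line.partition("members: [")
    view_ids = defaultdict(list)
    for host, name, view_id, port in MEMBER_RE.findall(member_list):
        key = (host, name, int(port))
        view_ids[key].append(int(view_id) if view_id else None)
    return {key: ids for key, ids in view_ids.items() if len(ids) > 1}
```

Run against the view line above, this kind of check should flag only `peergemfire_2_1_host2_616:616` on port 41004, with view IDs 46 and 60.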