Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-6977

WAN Auto Discovery Failure

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10.0
    • Component/s: membership, wan
    • Labels:

      Description

      The topology for the test is 4 clusters with 4 locators and 4 servers each, the sites are connected using a ring topology (cluster1 knows about cluster2, cluster2 knows about cluster3, cluster3 knows about cluster4 and cluster4 knows about cluster1).
      After the startup finished, it was found out that one locator (out of sixteen) didn't know about another locator in other site.

      According to the docs:

      Discovery for Multi-Site Systems
      Each Geode cluster in a WAN configuration uses locators to discover remote clusters as well as local members.
      Each locator in a WAN configuration defines a unique distributed-system-id property that identifies the local cluster to which it belongs. A locator uses the remote-locators property to define the addresses of one or more locators in remote clusters to use for WAN distribution.
      When a locator starts up, it contacts each locator that is configured in the remote-locators property to exchange information about the available locators and gateway receivers in the cluster. The locator also shares information about locators and gateway receivers in any other Geode clusters that have connected to the cluster. Connected clusters can then use the shared gateway receiver information to distribute region events according to their configured gateway senders.
      Each time a new locator starts up or an existing locator shuts down, the changed information is broadcast to other connected Geode clusters.
      

      I've added some extra logging to the LocatorMembershipListenerImpl class and reproduced the issue from scratch. In this particular run locator_1_1 (rs-GEM-2603-1105a2i3large-hydra-client-7[23955]) was not found in locator_3_3 (rs-GEM-2603-1105a2i3large-hydra-client-7[22655]). The member locator 1_1 is the last one to startup:

      [info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234> tid=0x15] Locator started on 10.32.111.135[22488]
      [info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759> tid=0x17] Locator started on 10.32.109.20[23639]
      [info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840> tid=0x17] Locator started on 10.32.109.20[24645]
      [info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306> tid=0x15] Locator started on 10.32.111.135[23280]
      [info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863> tid=0x15] Locator started on 10.32.109.20[29961]
      [info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780> tid=0x15] Locator started on 10.32.109.20[27708]
      [info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860> tid=0x15] Locator started on 10.32.109.20[22615]
      [info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797> tid=0x17] Locator started on 10.32.109.20[25209]
      [info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766> tid=0x15] Locator started on 10.32.109.20[24073]
      [info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246> tid=0x15] Locator started on 10.32.111.135[28694]
      [info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849> tid=0x15] Locator started on 10.32.109.20[20182]
      [info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314> tid=0x15] Locator started on 10.32.111.135[22655]
      [info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229> tid=0x15] Locator started on 10.32.111.135[28293]
      [info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297> tid=0x15] Locator started on 10.32.111.135[22923]
      [info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326> tid=0x15] Locator started on 10.32.111.135[26144]
      [info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225> tid=0x15] Locator started on 10.32.111.135[23955]
      

      After locator_1_1 joins the cluster, every locator informs the other locators about this new addition (and also about the other locators known so far), as expected.
      The logs below show that all locators tried to share the addition of locator locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955] with locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]:

      /locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT <Thread-26> tid=0x7c] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT <Thread-24> tid=0x6c] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT <Thread-44> tid=0xa7] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT <Thread-25> tid=0x6d] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT <Thread-64> tid=0xa2] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT <Thread-29> tid=0x78] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT <Thread-42> tid=0x93] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT <Thread-23> tid=0x76] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT <Thread-25> tid=0x6c] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT <Thread-30> tid=0x7f] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT <Thread-23> tid=0x74] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT <Thread-25> tid=0x77] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT <Thread-53> tid=0xb1] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT <Thread-28> tid=0xa1] TcpClient sending LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
      

      One second afterwards, every single locator also logs that it failed to share the information about locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955] with locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]:

      ./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT <Thread-26> tid=0x7c] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT <Thread-24> tid=0x6c] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT <Thread-44> tid=0xa7] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT <Thread-25> tid=0x6d] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT <Thread-64> tid=0xa2] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT <Thread-29> tid=0x78] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT <Thread-42> tid=0x93] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT <Thread-23> tid=0x76] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT <Thread-25> tid=0x6c] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT <Thread-30> tid=0x7f] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT <Thread-23> tid=0x74] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT <Thread-25> tid=0x77] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT <Thread-53> tid=0xb1] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT <Thread-28> tid=0xa1] Locator Membership listener could not exchange locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655
      

      This is also backed up by the fact that locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655] is the only one that never logs the reception of the LocatorJoinMessage for locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]:

      ##### grep -lR "LocatorJoinMessage{distributedSystemId=1 locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]" . | sort
      ./locatorgemfire_1_1_6225/system.log
      ./locatorgemfire_1_2_6229/system.log
      ./locatorgemfire_1_3_6234/system.log
      ./locatorgemfire_1_4_6246/system.log
      ./locatorgemfire_2_1_8759/system.log
      ./locatorgemfire_2_2_8766/system.log
      ./locatorgemfire_2_3_8780/system.log
      ./locatorgemfire_2_4_8797/system.log
      ./locatorgemfire_3_1_6297/system.log
      ./locatorgemfire_3_2_6306/system.log
      ./locatorgemfire_3_4_6326/system.log
      ./locatorgemfire_4_1_8840/system.log
      ./locatorgemfire_4_2_8849/system.log
      ./locatorgemfire_4_3_8860/system.log
      ./locatorgemfire_4_4_8863/system.log
      

      The exception when trying to exchange the locator information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with rs-GEM-2603-1105a2i3large-hydra-client-7:22655 is always the same on every single locator:

      java.net.SocketTimeoutException: connect timed out
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:589)
      	at org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898)
      	at org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839)
      	at org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828)
      	at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205)
      	at org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120)
      

      Even though there are no errors on the logs for locator_3_3 within the relevant time frame, the statistics show that the cpu is at 100% and that there are no samples taken for around 5 seconds, probably due to a short JVM pause or resource contention, that's probably the reason why the connection from other locators failed. That said, there were no more additions/deletions of extra locators within any of the clusters either (which would have fired another round of messages from every locator to every other known locator about the addition/deletion), so in the end "it is expected" that locator_3_3 didn't have the information about locator_1_1 in its internal map of known locators.
      There's no retry logic within LocatorJoinMessage.locatorJoined and the timeout of 1 second seems to be too short: any locator might be busy for a short period of time and, as such, the information about the new locator will be missed until there's another locator membership change; this ticket is to change that: increase the timeout and retry a couple of times by default.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                jujoramos Juan José Ramos Cassella
                Reporter:
                jujoramos Juan José Ramos Cassella
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2.5h
                  2.5h