GEODE-8901

Surviving side server forcefully disconnected after network drop

    Description

      During a network partition, locator-0 and server-0 were cut off from the other members of the distributed system (DS): locator-1, server-1, server-2 (the lead member), and server-3. As expected, locator-0 logs "Operation not permitted" exceptions for the four surviving-side members:

       

      [warn 2020/12/16 23:14:02.827 GMT <Geode Failure Detection thread 2> tid=0x78] Unable to send message to 10.108.1.130(gemfire-cluster-server-2:1)<v2>:41000
      java.io.IOException: Operation not permitted
      [warn 2020/12/16 23:14:02.938 GMT <Geode Heartbeat Sender> tid=0x22] Unable to send message to 10.108.3.134(gemfire-cluster-locator-1:1:locator)<ec><v0>:41000
      java.io.IOException: Operation not permitted
      [warn 2020/12/16 23:14:06.701 GMT <Geode Membership View Creator> tid=0x79] Unable to send message to 10.108.3.135(gemfire-cluster-server-1:1)<v4>:41000
      java.io.IOException: Operation not permitted
      [warn 2020/12/16 23:14:10.322 GMT <Geode Failure Detection thread 3> tid=0x7a] Unable to send message to 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000
      java.io.IOException: Operation not permitted
      

      As expected, we see the loss of quorum:

      [warn 2020/12/16 23:14:11.718 GMT <Geode Membership View Creator> tid=0x79] total weight lost in this view change is 28 of 51.  Quorum has been lost!

      However, we expected a lost weight of 38 (10 + 15 + 10 + 3 for server-1, server-2, server-3, and locator-1, respectively). The logged value of 28 is 10 less than that, which equals server-3's weight. What we see instead is that server-3 is forcefully disconnected as well. This may happen because server-3 passes an availability check after the "Operation not permitted" exception above:

      [info 2020/12/16 23:14:10.323 GMT <Geode Failure Detection thread 3> tid=0x7a] Performing availability check for suspect member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 reason=Unable to send messages to this member via JGroups
      ...
      [warn 2020/12/16 23:14:11.711 GMT <Geode Membership View Creator> tid=0x79] these members failed to respond to the view change: [10.108.3.134(gemfire-cluster-locator-1:1:locator)<ec><v0>:41000, 10.108.3.135(gemfire-cluster-server-1:1)<v4>:41000, 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000, 10.108.1.130(gemfire-cluster-server-2:1)<v2>:41000]
      [info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> tid=0x7c] checking state of member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000
      [info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> tid=0x7c] member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 passed availability check
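      The weight arithmetic behind the quorum message can be checked with a small sketch. It uses the per-member weights quoted in this report (3 for locator-1, 10 for server-1 and server-3, 15 for the lead member server-2) and assumes locator-0 and server-0 carry the same weights as their peers, which is consistent with the logged total of 51. The class and method names are hypothetical and are not part of Geode.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative lookup table only; this is not a Geode API.
class QuorumWeightSketch {
    static final Map<String, Integer> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("locator-0", 3);  // assumption: same weight as locator-1
        WEIGHTS.put("locator-1", 3);
        WEIGHTS.put("server-0", 10);  // assumption: same weight as the other non-lead servers
        WEIGHTS.put("server-1", 10);
        WEIGHTS.put("server-2", 15);  // lead member
        WEIGHTS.put("server-3", 10);
    }

    // Sums the weights of the given members.
    static int weightOf(Collection<String> members) {
        int sum = 0;
        for (String member : members) {
            sum += WEIGHTS.get(member);
        }
        return sum;
    }

    public static void main(String[] args) {
        int total = weightOf(WEIGHTS.keySet());
        // All four surviving-side members counted as lost by locator-0.
        int expectedLost = weightOf(List.of("locator-1", "server-1", "server-2", "server-3"));
        // Same set without server-3, which passed the availability check.
        int withoutServer3 = weightOf(List.of("locator-1", "server-1", "server-2"));

        System.out.println("total weight: " + total);
        System.out.println("expected lost weight: " + expectedLost);
        System.out.println("lost weight without server-3: " + withoutServer3);
    }
}
```

      Under these assumptions the sketch gives a total of 51, an expected loss of 38, and a loss of 28 when server-3 is left out, which matches the "28 of 51" in the log.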

      This issue looks similar to GEODE-8721, which was fixed in commit b7afc604b9c2fafe4388dcdcf05fc7ec49c0ce86, but the failure logs do not contain the log message associated with GEODE-8721:

      Availability check detected recent message traffic for suspect member

      That message includes a timestamp showing the time of contact. In GEODE-8721, that timestamp is continually updated.
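      One way to confirm that a log does not show the GEODE-8721 symptom is to scan it for that message text. The following is a minimal sketch using plain substring matching; the class and method names are hypothetical, and the marker string is copied from the message quoted above.

```java
import java.util.List;

// Hypothetical helper for triaging logs; not part of Geode.
class RecentTrafficLogCheck {
    // Message text associated with GEODE-8721, as quoted in this report.
    static final String MARKER =
        "Availability check detected recent message traffic for suspect member";

    // Returns true if any log line contains the GEODE-8721 message.
    static boolean mentionsRecentTraffic(List<String> logLines) {
        for (String line : logLines) {
            if (line.contains(MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Lines taken from the excerpts in this report.
        List<String> lines = List.of(
            "[info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> tid=0x7c] "
                + "checking state of member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000",
            "[info 2020/12/16 23:14:11.714 GMT <Geode View Creator verification thread 1> tid=0x7c] "
                + "member 10.108.0.192(gemfire-cluster-server-3:1)<v3>:41000 passed availability check");
        System.out.println("GEODE-8721 message present: " + mentionsRecentTraffic(lines));
    }
}
```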

       

            People

              kaslami Kamilla Aslami