GEODE-7750

If a member is force-disconnected from the distributed system repeatedly, it reconnects each time, but after the second force disconnect, it shuts down after reconnecting

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11.0
    • Fix Version/s: 1.12.0
    • Component/s: membership
    • Labels: None

    Description

      If I start a server using gfsh and repeatedly invoke the MembershipManagerHelper.crashDistributedSystem method to force it to disconnect from the distributed system, it reconnects each time, but after the second force disconnect, the server shuts down after reconnecting.
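      The MembershipCycleHealthFunction named in the stack trace below is not attached to the ticket; here is a minimal sketch of what such a function could look like, assuming only the test-only MembershipManagerHelper API that appears in the trace:

      import org.apache.geode.cache.execute.Function;
      import org.apache.geode.cache.execute.FunctionContext;
      import org.apache.geode.distributed.DistributedSystem;
      import org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper;
      
      public class MembershipCycleHealthFunction implements Function<Void> {
        @Override
        public boolean hasResult() {
          return false; // nothing useful to return; the member is about to disconnect
        }
      
        @Override
        public void execute(FunctionContext<Void> context) {
          DistributedSystem system = context.getCache().getDistributedSystem();
          // Test-only helper: simulates sickness ("beSick"/"playDead") and then
          // force-disconnects the member, which triggers auto-reconnect.
          MembershipManagerHelper.crashDistributedSystem(system);
        }
      
        @Override
        public String getId() {
          return getClass().getSimpleName();
        }
      }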

      Here is truncated logging for the first disconnect:

      [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] crashing distributed system: Connected "server1" (id=76d0ecd7) to distributed system using locators "localhost[23456]" logging to cacheserver.log  started at Wed Jan 29 11:32:37 PST 2020
      
      [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] GroupMembershipService.beSick invoked for 192.168.99.1(server1:51825)<ec><v3>:41001 - simulating sickness
      
      [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] GroupMembershipService.playDead invoked for 192.168.99.1(server1:51825)<ec><v3>:41001
      
      [fatal 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44] Membership service failure: for testing
      org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: for testing
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2004)
      	at org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper.crashDistributedSystem(MembershipManagerHelper.java:66)
      	at MembershipCycleHealthFunction.execute(MembershipCycleHealthFunction.java:20)
      	at org.apache.geode.internal.cache.execute.AbstractExecution.executeFunctionLocally(AbstractExecution.java:328)
      	at org.apache.geode.internal.cache.execute.AbstractExecution.lambda$executeFunctionOnLocalNode$1(AbstractExecution.java:299)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:449)
      	at org.apache.geode.distributed.internal.ClusterOperationExecutors.doFunctionExecutionThread(ClusterOperationExecutors.java:379)
      	at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
      	at java.lang.Thread.run(Thread.java:745)
      
      [info 2020/01/29 11:41:07.616 PST <Function Execution Processor3> tid=0x44] CacheServer configuration saved
      
      [info 2020/01/29 11:41:07.626 PST <Geode Failure Detection Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting
      
      [info 2020/01/29 11:41:07.626 PST <DisconnectThread> tid=0x47] Stopping membership services
      
      [info 2020/01/29 11:41:07.630 PST <ReconnectThread> tid=0x47] Disconnecting old DistributedSystem to prepare for a reconnect attempt
      
      [info 2020/01/29 11:41:07.632 PST <ReconnectThread> tid=0x47] GemFireCache[id = 101500074; isClosing = true; isShutDownAll = false; created = Wed Jan 29 11:32:38 PST 2020; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
      
      [info 2020/01/29 11:41:07.633 PST <ReconnectThread> tid=0x47] Cache server on port 40401 is shutting down.
      
      [info 2020/01/29 11:41:07.657 PST <ReconnectThread> tid=0x47] Shutting down DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001. At least one Exception occurred.
      
      [info 2020/01/29 11:41:07.759 PST <ReconnectThread> tid=0x47] Now closing distribution for 192.168.99.1(server1:51825)<ec><v3>:41001
      
      [info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] DistributionManager stopped in 103ms.
      
      [info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] Marking DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001 as closed.
      

       Here is truncated logging for the first reconnect:

      [info 2020/01/29 11:42:07.755 PST <ReconnectThread> tid=0x47] Attempting to reconnect to the distributed system.  This is attempt #1.
      
      [info 2020/01/29 11:42:07.775 PST <ReconnectThread> tid=0x47] Starting membership services
      
      [info 2020/01/29 11:42:07.778 PST <ReconnectThread> tid=0x47] Established local address 192.168.99.1(server1:51825):41001
      
      [info 2020/01/29 11:42:07.779 PST <ReconnectThread> tid=0x47] GemFire P2P Listener started on /192.168.99.1:50431
      
      [info 2020/01/29 11:42:07.793 PST <ReconnectThread> tid=0x47] Attempting to join the distributed system through coordinator 192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address 192.168.99.1(server1:51825)<ec>:41001
      
      [info 2020/01/29 11:42:08.104 PST <unicast receiver,boglesbymac-63966> tid=0x1b] received new view: View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members: [192.168.99.1(locator:51077:locator)<ec><v0>:41000, 192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 192.168.99.1(server1:51825)<ec><v5>:41001]
      old view is: null
      
      [info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Starting DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001.  (took 331 ms)
      
      [info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Initial (distribution manager) view, View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members: [192.168.99.1(locator:51077:locator)<ec><v0>:41000, 192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 192.168.99.1(server1:51825)<ec><v5>:41001]
      
      [info 2020/01/29 11:42:08.111 PST <P2P message reader for 192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=4 port=64693> tid=0x5c] Member 192.168.99.1(locator:51077:locator)<ec><v0>:41000 is  equivalent or in the same redundancy zone.
      
      [info 2020/01/29 11:42:08.113 PST <P2P message reader for 192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=5 port=64696> tid=0x61] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is  equivalent or in the same redundancy zone.
      
      [info 2020/01/29 11:42:08.113 PST <ReconnectThread> tid=0x47] DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001 started on localhost[23456]. There were 2 other DMs. others: [192.168.99.1(locator:51077:locator)<ec><v0>:41000, 192.168.99.1(server2:51826)<ec><v3>:41002]  (took 343 ms) 
      
      [info 2020/01/29 11:42:08.392 PST <ReconnectThread> tid=0x47] Cache server connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with backlog 1280.
      
      [info 2020/01/29 11:42:08.395 PST <ReconnectThread> tid=0x47] CacheServer Configuration:  port=40401 max-connections=800 max-threads=0 notify-by-subscription=true socket-buffer-size=32768 maximum-time-between-pings=60000 maximum-message-count=230000 message-time-to-live=180 eviction-policy=none capacity=1 overflow directory=. groups=[] loadProbe=ConnectionCountProbe loadPollInterval=5000 tcpNoDelay=true
      
      [info 2020/01/29 11:42:23.399 PST <ReconnectThread> tid=0x47] Reconnect completed.
      New DistributedSystem is Connected "server1" (id=17253324) to distributed system using locators "localhost[23456]" logging to cacheserver.log  started at Wed Jan 29 11:42:07 PST 2020
      New Cache is GemFireCache[id = 1930372133; isClosing = false; isShutDownAll = false; created = Wed Jan 29 11:42:08 PST 2020; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]
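      For context, the cache server whose connection listener these messages describe is created through the ordinary CacheServer API; a minimal sketch of how such a server is started (the port mirrors the log above; everything else is left at defaults):

      import java.io.IOException;
      import org.apache.geode.cache.Cache;
      import org.apache.geode.cache.server.CacheServer;
      
      public class CacheServerSetup {
        // Starting a cache server binds the connection listener and spawns the
        // "Cache Server Acceptor" thread; on reconnect,
        // InternalDistributedSystem.createAndStartCacheServers is supposed to redo this.
        public static CacheServer startCacheServer(Cache cache) throws IOException {
          CacheServer server = cache.addCacheServer();
          server.setPort(40401); // port taken from the log above
          server.start();
          return server;
        }
      }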
      

      The second disconnect looks similar to the first.

      The second reconnect also looks similar to the first, except for a key difference.

      [info 2020/01/29 11:44:54.243 PST <ReconnectThread> tid=0x78] Attempting to reconnect to the distributed system.  This is attempt #1.
      
      [info 2020/01/29 11:44:54.259 PST <ReconnectThread> tid=0x78] Starting membership services
      
      [info 2020/01/29 11:44:54.260 PST <ReconnectThread> tid=0x78] Established local address 192.168.99.1(server1:51825):41001
      
      [info 2020/01/29 11:44:54.261 PST <ReconnectThread> tid=0x78] GemFire P2P Listener started on /192.168.99.1:57338
      
      [info 2020/01/29 11:44:54.273 PST <ReconnectThread> tid=0x78] Attempting to join the distributed system through coordinator 192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address 192.168.99.1(server1:51825)<ec>:41001
      
      [info 2020/01/29 11:44:54.580 PST <unicast receiver,boglesbymac-63966> tid=0x1b] received new view: View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members: [192.168.99.1(locator:51077:locator)<ec><v0>:41000, 192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 192.168.99.1(server1:51825)<ec><v7>:41001]
      old view is: null
      
      [info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Starting DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001.  (took 324 ms)
      
      [info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Initial (distribution manager) view, View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members: [192.168.99.1(locator:51077:locator)<ec><v0>:41000, 192.168.99.1(server2:51826)<ec><v3>:41002{lead}, 192.168.99.1(server1:51825)<ec><v7>:41001]
      
      [info 2020/01/29 11:44:54.690 PST <P2P message reader for 192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=5 port=65360> tid=0x8d] Member 192.168.99.1(locator:51077:locator)<ec><v0>:41000 is  equivalent or in the same redundancy zone.
      
      [info 2020/01/29 11:44:54.691 PST <P2P message reader for 192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=6 port=65363> tid=0x92] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is  equivalent or in the same redundancy zone.
      
      [info 2020/01/29 11:44:54.692 PST <ReconnectThread> tid=0x78] DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 started on localhost[23456]. There were 2 other DMs. others: [192.168.99.1(locator:51077:locator)<ec><v0>:41000, 192.168.99.1(server2:51826)<ec><v3>:41002]  (took 439 ms) 
      
      [info 2020/01/29 11:45:09.931 PST <ReconnectThread> tid=0x78] Reconnect completed.
      New DistributedSystem is Connected "server1" (id=ed24822) to distributed system using locators "localhost[23456]" logging to cacheserver.log  started at Wed Jan 29 11:44:54 PST 2020
      New Cache is GemFireCache[id = 198638695; isClosing = false; isShutDownAll = false; created = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]
      

      As soon as the ReconnectThread completes the second reconnect, the Distributed system shutdown hook is invoked, and the member shuts down:

      [info 2020/01/29 11:45:09.932 PST <Distributed system shutdown hook> tid=0xd] VM is exiting - shutting down distributed system
      
      [info 2020/01/29 11:45:09.940 PST <Distributed system shutdown hook> tid=0xd] GemFireCache[id = 198638695; isClosing = true; isShutDownAll = false; created = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
      
      [info 2020/01/29 11:45:09.948 PST <Distributed system shutdown hook> tid=0xd] Shutting down DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001. 
      
      [info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd] Now closing distribution for 192.168.99.1(server1:51825)<ec><v7>:41001
      
      [info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd] Stopping membership services
      
      [info 2020/01/29 11:45:10.056 PST <Geode Failure Detection Server thread 1> tid=0x8a] GMSHealthMonitor server thread exiting
      
      [info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd] DistributionManager stopped in 113ms.
      
      [info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd] Marking DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 as closed.
      

      That is because the Cache Server Acceptor thread is not restarted during the second reconnect. The 'Cache server connection listener bound to address ...' message is not logged during the second reconnect.

      A thread like this exists in the member after the first reconnect, but not the second:

      "Cache Server Acceptor 0.0.0.0/0.0.0.0:40402 local port: 40402" #63 prio=5 os_prio=31 tid=0x00007f9142bb3800 nid=0xac03 runnable [0x000070000a7cf000]
         java.lang.Thread.State: RUNNABLE
      	at java.net.PlainSocketImpl.socketAccept(Native Method)
      	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
      	at java.net.ServerSocket.implAccept(ServerSocket.java:545)
      	at java.net.ServerSocket.accept(ServerSocket.java:513)
      	at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.accept(AcceptorImpl.java:1263)
      	at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.run(AcceptorImpl.java:1202)
      	at java.lang.Thread.run(Thread.java:745)
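      An illustrative way to confirm this from inside the member (not part of the ticket) is to ask the reconnected cache for its cache servers; if the list is empty or the server is not running, the acceptor thread above will be missing as well:

      import java.util.List;
      import org.apache.geode.cache.Cache;
      import org.apache.geode.cache.server.CacheServer;
      
      public class AcceptorCheck {
        public static void logCacheServerState(Cache cache) {
          List<CacheServer> servers = cache.getCacheServers();
          if (servers.isEmpty()) {
            cache.getLogger().warning("No cache servers present after reconnect");
            return;
          }
          for (CacheServer server : servers) {
            cache.getLogger().info("Cache server on port " + server.getPort()
                + " isRunning=" + server.isRunning());
          }
        }
      }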
      

      Here is a stack trace showing the code path taken in the first reconnect. This code path is not taken in the second reconnect.

      [info 2020/01/29 12:34:01.956 PST <ReconnectThread> tid=0x45] Cache server connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with backlog 1280.
      java.lang.Exception
      	at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:605)
      	at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:303)
      	at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:429)
      	at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:386)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.createAndStartCacheServers(InternalDistributedSystem.java:2750)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2634)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2390)
      	at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1242)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2311)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1283)
      	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:2027)
      	at java.lang.Thread.run(Thread.java:745)
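      For completeness, the reconnect cycle exercised here can also be observed through the public DistributedSystem reconnect API; a minimal sketch, assuming a handle to the original system:

      import java.util.concurrent.TimeUnit;
      import org.apache.geode.distributed.DistributedSystem;
      
      public class ReconnectWatcher {
        public static DistributedSystem awaitReconnect(DistributedSystem system)
            throws InterruptedException {
          if (system.isReconnecting()) {
            // Blocks until the reconnect attempt finishes or the timeout elapses.
            boolean reconnected = system.waitUntilReconnected(5, TimeUnit.MINUTES);
            if (reconnected) {
              // The old DistributedSystem is defunct; callers must switch to the new one.
              return system.getReconnectedSystem();
            }
          }
          return system;
        }
      }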
      

People

    Assignee: echobravo (Ernest Burghardt)
    Reporter: boglesby (Barrett Oglesby)