Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.15.0
Description
I've hit this issue while executing some chaos testing over a GemFire cluster using 2 locators and 3 servers; SSL is enabled, and a dummy SecurityManager is configured which authenticates any user and always returns true within the authorize method. There are 3 PARTITION_REDUNDANT regions configured, one per client, each with 1 redundant copy. Once the cluster is up and running, 3 clients continuously execute Region.put operations on a well known set of keys for its own Region (created with PROXY type), and another process executes the following logic in parallel (pseudocode):
for server in ${servers} do # Pause the JVM for 30 seconds to simulate a stop the world GC kill -STOP server sleep 30 # Unpause the JVM, wait for member to reconnect and regions to recover redundancy configured kill -CONT "${SERVER_PID}" waitForReconnectcompletedInServerLog waitForNumBucketsWithoutRedundancyToBeZeroInGfshShowRegionMetrics done
The test works fine most of the time, but randomly fails due to an unexpected exception logged within the logs for at least one locator. The exception is always reported from a P2P message reader thread on the locator for a server member that has just returned to life, as an example:
#### LOCATOR-0 [warn 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] P2P message reader@354fac47 timed out during a membership check. [fatal 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] Error deserializing P2P handshake message java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte being 1 at org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875) at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825) at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649) at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) #### SERVER-2 [warn 2021/08/17 05:20:44.012 GMT system-test-gemfire-server-2 <StatSampler> tid=0x35] Statistics sampling thread detected a wakeup delay of 29070 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics. [warn 2021/08/17 05:20:44.015 GMT system-test-gemfire-server-2 <Geode Heartbeat Sender> tid=0x23] Failure detection heartbeat-generation thread overslept by more than a full period. Asleep time: 31,175,291,931 nanoseconds. Period: 2,500,000,000 nanoseconds. [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast receiver,system-test-gemfire-server-2-25596> tid=0x1e] saving cache server configuration for use with the cluster-configuration service on reconnect [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast receiver,system-test-gemfire-server-2-25596> tid=0x1e] cache server configuration saved [info 2021/08/17 05:20:44.233 GMT system-test-gemfire-server-2 <DisconnectThread> tid=0xe5] Stopping membership services [info 2021/08/17 05:20:44.455 GMT system-test-gemfire-server-2 <ReconnectThread> tid=0xe5] Disconnecting old DistributedSystem to prepare for a reconnect attempt [info 2021/08/17 05:20:44.463 GMT system-test-gemfire-server-2 <ReconnectThread> tid=0xe5] GemFireCache[id = 252990056; isClosing = true; isShutDownAll = false; created = Tue Aug 17 05:11:50 GMT 2021; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing. [info 2021/08/17 05:20:44.544 GMT system-test-gemfire-server-2 <ReconnectThread> tid=0xe5] Cache server on port 40404 is shutting down. [info 2021/08/17 05:20:44.565 GMT system-test-gemfire-server-2 <Queue Removal Thread> tid=0x5e] The QueueRemovalThread is done.
The full set of logs and statistics from all member of the cluster can be found here.
Contrary to what the exception message states, it's worth noticing that all members within the cluster (including clients) are using the same Geode version.
—
Below are some extra logs from when I was able to reproduce the issue with log-level=fine configured on all members:
#### LOCATOR-0 [debug 2021/08/16 20:40:22.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P handshake remoteAddr is system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 (GEODE 1.15.0) [warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P message reader@6673b475 timed out during a membership check. [fatal 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] Error deserializing P2P handshake message java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte being 1 at org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875) at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825) at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649) at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] readMessages terminated id=/10.12.0.9:58373 from system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 isHandshakeReader=false [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] Stopping P2P message reader@6673b475 for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 #### SERVER-1 [warn 2021/08/16 20:40:22.698 GMT system-test-gemfire-server-1 <StatSampler> tid=0x35] Statistics sampling thread detected a wakeup delay of 29219 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics. ... [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] Starting thread for P2P handshake reader@6161c139-47 [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake reader@6161c139-47> tid=0x33] Starting P2P handshake reader@6161c139-47 on Socket[addr=/10.12.0.9,port=58373,localport=57818] [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake reader@6161c139-47> tid=0x33] readMessages terminated id=/10.12.1.6:44219 from system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000 isHandshakeReader=false [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake reader@6161c139-47> tid=0x33] Stopping P2P handshake reader@6161c139-47 for system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000 [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] ConnectionTable: created an ordered connection: system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47) [debug 2021/08/16 20:40:23.059 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] Sending on these 1 connections: [system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)] [debug 2021/08/16 20:40:23.065 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] distribution of message aborted by shutdown: UpdateOperation(EntryEventImpl[op=CREATE;region=/_notificationRegion_10.12.1.6<v3>41000;key=org.apache.geode.management.internal.NotificationKey@9b5d432b;callbackArg=null;originRemote=false;originMember=system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000;callbacksInvoked;id=EventID[system-test-gemfire-server-1(system-test-gemfire-server-1)<v3>:41000;threadID=2;sequenceID=5]])
The full logs and statistics for all member of the cluster can be found here.
I'm not familiar with this part of the code but, from the above logs, it looks like the same reader (reader@6161c139-47) is being used by two different threads at the same time?, is this expected?, might that be the cause of the exception we're seeing?.
Attachments
Issue Links
- links to