Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
Sometimes a server node fails with the following stack trace:
SEVERE: TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.NullPointerException
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1149)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5022)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2690)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2491)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6675)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2574)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
There are two problems here:
- An uncaught exception in cacheMetrics() unconditionally fails the node, because it is thrown in the discovery thread. All non-trivial code running in that thread should probably be wrapped in try-catch.
- Cache destruction lacks proper locking (see also IGNITE-6580, IGNITE-7278 and IGNITE-7165).
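To illustrate both points, here is a minimal sketch of what a guarded metrics collector could look like. It is not the actual fix and does not use Ignite's API: GuardedCacheMetrics, registerCache, destroyCache and collect are hypothetical names, and the per-cache metric is reduced to a single Long for brevity. The read-write lock keeps cache destruction from racing with metrics collection, and the inner try-catch keeps a failing cache from propagating an exception into the calling thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical stand-in for a per-node cache metrics registry; not Ignite's API. */
class GuardedCacheMetrics {
    /** Cache ID mapped to a supplier of that cache's metric (assumed: one Long per cache). */
    private final Map<Integer, Supplier<Long>> caches = new HashMap<>();

    /** Readers collect metrics; writers register or destroy caches. */
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    void registerCache(int cacheId, Supplier<Long> metric) {
        lock.writeLock().lock();
        try {
            caches.put(cacheId, metric);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Destruction takes the write lock, so it cannot interleave with collect(). */
    void destroyCache(int cacheId) {
        lock.writeLock().lock();
        try {
            caches.remove(cacheId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Intended to be called from a critical thread: skips failing caches instead of throwing. */
    Map<Integer, Long> collect() {
        Map<Integer, Long> snapshot = new HashMap<>();
        lock.readLock().lock();
        try {
            for (Map.Entry<Integer, Supplier<Long>> e : caches.entrySet()) {
                try {
                    Long value = e.getValue().get();
                    if (value != null) {
                        snapshot.put(e.getKey(), value);
                    }
                } catch (RuntimeException ex) {
                    // Report and continue; one broken cache must not stop the caller.
                    System.err.println("Failed to collect metrics for cache " + e.getKey() + ": " + ex);
                }
            }
        } finally {
            lock.readLock().unlock();
        }
        return snapshot;
    }
}
```

With this shape, a cache whose supplier throws a NullPointerException is simply missing from the returned snapshot, and the remaining caches are still reported.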
Issue Links
- blocks: IGNITE-7540 Sequential checkpoints cause overwrite of already cleaned & freed offheap page (Resolved)