[IGNITE-7476] Server node will join with failure gathering metrics - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.5
Component/s: None
Labels: None

Description

Sometimes a server node fails with the following trace:

SEVERE: TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.NullPointerException
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1149)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5022)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2690)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2491)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6675)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2574)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

Two problems here:

An uncaught exception in cacheMetrics() unconditionally brings the node down, because it is thrown in the discovery thread. All non-trivial code running in that thread should probably be wrapped in try-catch.
There is no proper locking when a cache is destroyed (see also IGNITE-6580, ~~IGNITE-7278~~ and ~~IGNITE-7165~~).
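
A minimal sketch of the first point, assuming a generic guard around metrics gathering. The `SafeMetrics` class and its `guarded` helper are hypothetical names used for illustration only; they are not part of Ignite and not the fix from the linked pull request. The idea is that a runtime exception thrown while collecting metrics is logged and replaced by a fallback value instead of escaping into the discovery worker thread.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the Ignite API.
public class SafeMetrics {
    /**
     * Runs the supplier and returns its result. If the supplier throws a
     * RuntimeException (for example an NPE caused by a concurrently destroyed
     * cache), the error is logged and the fallback is returned instead.
     */
    public static <T> T guarded(Supplier<T> supplier, T fallback) {
        try {
            return supplier.get();
        } catch (RuntimeException e) {
            System.err.println("Failed to gather metrics, using fallback: " + e);
            return fallback;
        }
    }

    /** Simulates metrics gathering that hits a cache which is already gone. */
    static Integer brokenMetrics() {
        throw new NullPointerException("cache context is null");
    }

    public static void main(String[] args) {
        // The NPE is caught inside guarded(), so the calling thread keeps running.
        Integer metrics = guarded(SafeMetrics::brokenMetrics, -1);
        System.out.println("metrics = " + metrics);
    }
}
```

A guard like this only addresses the first problem. The underlying race with cache destruction would still need proper locking.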

Issue Links

blocks

IGNITE-7540 Sequential checkpoints cause overwrite of already cleaned & freed offheap page

Resolved

links to

GitHub Pull Request #3448

People

Assignee: Ilya Kasnacheev

Reporter: Ilya Kasnacheev

Votes: 0

Watchers: 2

Dates

Created: 19/Jan/18 11:58

Updated: 12/Feb/18 12:18

Resolved: 12/Feb/18 12:18