Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.5.0.final
Fix Version/s: None
Component/s: None
Labels: None
Environment:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
Description
During normal cluster operation we got the following error, which completely killed all communication with this node:
Runtime error caught during grid runnable execution: GridWorker [name=grid-nio-worker-1, gridName=null, finished=false, isCancelled=false, hashCode=558690914, interrupted=false, runner=grid-nio-worker-1-#69%null%]
java.lang.AssertionError: null
    at org.apache.ignite.internal.direct.stream.v2.DirectByteBufferStreamImplV2$1.create(DirectByteBufferStreamImplV2.java:100) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.direct.stream.v2.DirectByteBufferStreamImplV2$1.create(DirectByteBufferStreamImplV2.java:98) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.direct.stream.v2.DirectByteBufferStreamImplV2.readArray(DirectByteBufferStreamImplV2.java:1337) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.direct.stream.v2.DirectByteBufferStreamImplV2.readByteArray(DirectByteBufferStreamImplV2.java:948) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.direct.DirectMessageReader.readByteArray(DirectMessageReader.java:173) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.managers.communication.GridIoMessage.readFrom(GridIoMessage.java:289) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridDirectParser.decode(GridDirectParser.java:76) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:104) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridConnectionBytesVerifyFilter.onMessageReceived(GridConnectionBytesVerifyFilter.java:123) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:107) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onMessageReceived(GridNioServer.java:2149) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioFilterChain.onMessageReceived(GridNioFilterChain.java:173) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:903) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeys(GridNioServer.java:1463) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1398) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1280) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
Update 10 May 2017
I got the same error and have the following suggestions:
- We cannot use assertions on data read from the network. All such assertions should be removed and replaced with a runtime exception. When the exception is thrown, it should be logged, and the connection should be closed and then reopened (Ignite resends unacked messages automatically).
- In my case all NIO threads died, but the cluster did not kick the node out (we need to add a test for this and fix it).
- We need to write a CRC after each message, e.g. a hash of all written field hashes, types, and array lengths. If CRC validation fails on the receiver, an exception should be thrown (see the first point) and the connection should be restored.
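The first and third suggestions could look roughly like the sketch below. It is not Ignite code: the class ChecksummedFrame, the exception MalformedMessageException, the maxLen limit, and the frame layout [int length][payload][int crc32] are all hypothetical names and choices made for illustration. For simplicity it computes a CRC32 over the raw payload bytes with java.util.zip.CRC32 rather than a hash of field hashes, types, and array lengths as proposed above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative only: a length-prefixed frame with a trailing CRC32 (hypothetical layout). */
final class ChecksummedFrame {
    /** Hypothetical unchecked exception thrown for malformed or corrupted input. */
    static final class MalformedMessageException extends RuntimeException {
        MalformedMessageException(String msg) {
            super(msg);
        }
    }

    private ChecksummedFrame() {
    }

    /** Writes [int length][payload][int crc32] into a new buffer that is ready for reading. */
    static ByteBuffer encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);

        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 4);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.putInt((int) crc.getValue());
        buf.flip();
        return buf;
    }

    /**
     * Reads one frame. Every value taken from the wire is validated with an explicit
     * check that throws, instead of an assert, so bad input cannot kill the reader
     * thread with an AssertionError.
     */
    static byte[] decode(ByteBuffer buf, int maxLen) {
        if (buf.remaining() < 4)
            throw new MalformedMessageException("Frame too short for length header");

        int len = buf.getInt();

        if (len < 0 || len > maxLen)
            throw new MalformedMessageException("Invalid array length: " + len);

        if (buf.remaining() < (long) len + 4)
            throw new MalformedMessageException("Frame truncated, expected " + len + " payload bytes");

        byte[] payload = new byte[len];
        buf.get(payload);
        int expectedCrc = buf.getInt();

        CRC32 crc = new CRC32();
        crc.update(payload, 0, len);

        if ((int) crc.getValue() != expectedCrc)
            throw new MalformedMessageException("CRC mismatch");

        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode("ping".getBytes(StandardCharsets.UTF_8));

        // Flip one payload bit to simulate corruption on the wire.
        frame.put(4, (byte) (frame.get(4) ^ 0x01));

        try {
            decode(frame, 1024);
        }
        catch (MalformedMessageException e) {
            // A real receiver would log this, close the connection and reopen it.
            System.out.println("Rejected frame: " + e.getMessage());
        }
    }
}
```

In a real NIO worker the catch block would sit around message decoding, so that a corrupted message results in a logged error and a connection reset rather than a dead thread.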