[IGNITE-4111] Communication fails to send message if target node did not finish join process - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8
Component/s: general
Labels: None

Description

Currently the following scenario is possible:

1. The joining node has sent a join request and is waiting for TcpDiscoveryNodeAddFinishedMessage inside ServerImpl.joinTopology.
2. Other nodes already see this node and can send messages to it (for example, trying to run a compute job on it).
3. The joining node cannot receive the message: TcpCommunicationSpi hangs inside onFirstMessage on the getSpiContext call, so the sending node gets an error while trying to establish the connection.

Possible fix: if the SPI context is not yet available in onFirstMessage(), TcpCommunicationSpi should send a special response indicating that this node is not ready yet, and the sender should retry after some time.
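A minimal sketch of the proposed handshake in plain Java. This is not Ignite's actual TcpCommunicationSpi code: HandshakeResult, NODE_NOT_READY, JoiningNode, finishJoin and RetryingSender.connect are hypothetical names. The SPI context is modeled as a reference that stays null until the join finishes, and the sender retries with a fixed delay and a bounded number of attempts.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical handshake response codes; the real SPI wire format differs.
enum HandshakeResult { OK, NODE_NOT_READY }

// Receiver side (hypothetical): instead of blocking in onFirstMessage() until the
// SPI context exists, answer NODE_NOT_READY while the node is still joining.
class JoiningNode {
    // Stays null until the join finishes (e.g. TcpDiscoveryNodeAddFinishedMessage processed).
    private final AtomicReference<Object> spiContext = new AtomicReference<>();

    void finishJoin() {
        spiContext.set(new Object());
    }

    HandshakeResult onFirstMessage() {
        return spiContext.get() == null ? HandshakeResult.NODE_NOT_READY : HandshakeResult.OK;
    }
}

// Sender side (hypothetical): retry after a delay while the target reports "not ready".
class RetryingSender {
    // Returns the 1-based attempt number on which the handshake succeeded.
    static int connect(JoiningNode target, int maxAttempts, long delayMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (target.onFirstMessage() == HandshakeResult.OK)
                return attempt;

            try {
                Thread.sleep(delayMs); // Target is still joining: back off, then retry.
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for target node", e);
            }
        }

        throw new IllegalStateException(
            "Target node did not finish join after " + maxAttempts + " attempts");
    }
}
```

The bounded attempt count keeps a sender from retrying forever against a node whose join never completes; a real fix would presumably tie this limit to the SPI's existing connection timeouts.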

We also need to check internal code for places where a message can be sent to a node unnecessarily. One such place is GridCachePartitionExchangeManager.refreshPartitions: the message is sent to all known nodes, but here we can filter by node order / finished exchange version.
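A sketch of the filtering idea for refreshPartitions, assuming that a node's order equals the topology version at which it joined, so any node whose order is higher than the topology version of the last finished exchange is skipped. NodeInfo and PartitionRefreshTargets.filter are hypothetical; real code would work with Ignite's ClusterNode and its exchange/topology version types.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified node descriptor (Ignite's ClusterNode also exposes order()).
final class NodeInfo {
    final String id;
    final long order; // Assumed: topology version at which the node joined.

    NodeInfo(String id, long order) {
        this.id = id;
        this.order = order;
    }
}

final class PartitionRefreshTargets {
    // Keep only nodes already covered by the last finished exchange; nodes that
    // joined later may still be inside joinTopology and unable to accept messages.
    static List<NodeInfo> filter(List<NodeInfo> knownNodes, long finishedExchangeTopVer) {
        return knownNodes.stream()
            .filter(n -> n.order <= finishedExchangeTopVer)
            .collect(Collectors.toList());
    }
}
```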