Ignite / IGNITE-4111

Communication fails to send message if target node did not finish join process

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8
    • Component/s: general
    • Labels: None

    Description

      Currently this scenario is possible:

      • The joining node has sent its join request and is waiting for TcpDiscoveryNodeAddFinishedMessage inside ServerImpl.joinTopology.
      • Other nodes already see this node and can send messages to it (for example, by trying to run a compute job on it).
      • The joining node cannot receive the message: TcpCommunicationSpi hangs inside 'onFirstMessage' on the 'getSpiContext' call, so the sending node gets an error when it tries to establish the connection.

      Possible fix: if the SPI context is not yet available in onFirstMessage(), TcpCommunicationSpi should send a special response indicating that this node is not ready, and the sender should retry after a delay.
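
      The proposed handshake behaviour can be sketched roughly as follows. This is only an illustration of the idea, not the code that went into Ignite: the class, the HandshakeReply enum, and the methods onFirstMessage and connectWithRetry are hypothetical stand-ins, and the readiness flag replaces the real SPI context lookup.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are real Ignite APIs. */
public class NotReadyRetrySketch {
    /** Possible handshake replies; NODE_NOT_READY stands for the proposed special response. */
    enum HandshakeReply { OK, NODE_NOT_READY }

    /** Receiver side: reply immediately instead of blocking until the SPI context exists. */
    static HandshakeReply onFirstMessage(AtomicBoolean spiContextReady) {
        return spiContextReady.get() ? HandshakeReply.OK : HandshakeReply.NODE_NOT_READY;
    }

    /**
     * Sender side: retry the handshake with a fixed delay.
     * Returns the number of attempts used, or -1 if the sender gave up or was interrupted.
     */
    static int connectWithRetry(Supplier<HandshakeReply> handshake, int maxAttempts, long retryDelayMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (handshake.get() == HandshakeReply.OK)
                return attempt;

            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryDelayMs);
                }
                catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicBoolean ready = new AtomicBoolean(false);
        int[] calls = {0};

        // Simulate a node that finishes its join process just before the third handshake.
        Supplier<HandshakeReply> handshake = () -> {
            if (++calls[0] == 3)
                ready.set(true);
            return onFirstMessage(ready);
        };

        System.out.println("Connected after attempts: " + connectWithRetry(handshake, 5, 10L));
    }
}
```

      The point of the sketch is that the receiver never blocks: the sender gets a definite "not ready" answer and decides for itself how long to keep retrying.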

      We also need to check the internal code for places where a message may be sent to a node unnecessarily. One such place is GridCachePartitionExchangeManager.refreshPartitions: the message is sent to all known nodes, but here we can filter by node order / finished exchange version.
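
      A minimal sketch of such a filter, assuming that a node's join order can be compared directly with the topology version of the last finished exchange. The Node class, the refreshTargets method, and the lastFinishedTopVer parameter are hypothetical and do not mirror the actual GridCachePartitionExchangeManager code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the actual Ignite implementation. */
public class RefreshTargetFilterSketch {
    /** Minimal stand-in for a cluster node: an id plus its discovery join order. */
    static final class Node {
        final String id;
        final long order;

        Node(String id, long order) {
            this.id = id;
            this.order = order;
        }
    }

    /**
     * Keeps only nodes whose join order is covered by the last finished exchange
     * (assumption: order and topology version are directly comparable).
     */
    static List<String> refreshTargets(List<Node> knownNodes, long lastFinishedTopVer) {
        List<String> targets = new ArrayList<>();

        for (Node n : knownNodes) {
            if (n.order <= lastFinishedTopVer)
                targets.add(n.id);
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Node> known = Arrays.asList(new Node("A", 1), new Node("B", 2), new Node("C", 3));

        // Node C joined at order 3, but the last finished exchange is version 2, so C is skipped.
        System.out.println(refreshTargets(known, 2L)); // prints [A, B]
    }
}
```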

      Attachments

        1. test onFirstMessage hang.log
          60 kB
          Alexandr Kuramshin

        Activity

          People

            NSAmelchev Nikita Amelchev
            sboikov Semen Boikov
            Votes: 0
            Watchers: 9

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 20m