Ignite / IGNITE-10933

Node may hang on join to topology and not move forward


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8
    • Component/s: None
    • Labels: None

    Description

      Several nodes joining the topology simultaneously may hang for a long time.

      This can happen either on the first start of all cluster nodes or when nodes join an already formed topology.

      The logs of the affected nodes show messages like:

      2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]

      2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]

      ...

      and so on for a long time, with no other messages in between.
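
      To check whether a node's log shows this symptom, the warnings can be counted and timed. The following is only a rough sketch: the class name JoinRetryScan and its methods are hypothetical helpers, not part of Ignite, and the regular expression assumes the timestamp and level layout seen in the excerpt above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Ignite class): finds join-retry warnings in log lines.
final class JoinRetryScan {
    // Assumes lines start with "yyyy-MM-dd HH:mm:ss.SSS [WARN ]" as in the excerpt.
    private static final Pattern JOIN_RETRY = Pattern.compile(
        "^\\s*(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) \\[WARN ?\\].*"
            + "Node has not been connected to topology and will repeat join process");

    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Returns the timestamps of all join-retry warnings, in log order. */
    static List<LocalDateTime> joinRetries(List<String> lines) {
        List<LocalDateTime> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = JOIN_RETRY.matcher(line);
            if (m.find())
                out.add(LocalDateTime.parse(m.group(1), TS));
        }
        return out;
    }

    /** Time between the first and the last warning; zero if fewer than two. */
    static Duration span(List<LocalDateTime> retries) {
        if (retries.size() < 2)
            return Duration.ZERO;
        return Duration.between(retries.get(0), retries.get(retries.size() - 1));
    }

    public static void main(String[] args) {
        // Abbreviated copies of the two warnings quoted in this issue.
        List<String> sample = Arrays.asList(
            "2019-01-11 18:37:39.296 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
                + "Node has not been connected to topology and will repeat join process.",
            "2019-01-11 18:43:09.374 [WARN ][Thread-56][o.a.i.s.d.tcp.TcpDiscoverySpi] "
                + "Node has not been connected to topology and will repeat join process.");
        List<LocalDateTime> retries = joinRetries(sample);
        System.out.println(retries.size() + " join retries over " + span(retries));
    }
}
```

      Repeated warnings from the same thread spread over several minutes, with nothing else logged in between, match the hang described here.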

      UPDATE: this behavior is caused by a TcpDiscoveryClientReconnectMessage stored in the pending objects collection being transferred to the joining node, which invalidates the socket connection to the joining node and marks it as failed.

      Reproduced by the following scenario:

      1. Create a topology in a specific order: srv1, srv2, client, srv3, srv4.
      2. Delay the client reconnect.
      3. Trigger a topology change by restarting srv2 (this triggers a reconnect to the next node), srv3, and srv4.
      4. Resume the reconnect to a node with an empty EnsuredMessageHistory (this triggers a discovery message of type TcpDiscoveryClientReconnectMessage) and wait for completion.
      5. Add a new node to the topology.

      The new node will either fail with an assertion or get stuck on join forever, depending on timing.

      The same scenario could probably be triggered by a temporary loss of connection to the joining node.

      v.pyatkov, thanks for help with the investigation.

       

       


            People

              ascherbakov Alexey Scherbakov
              v.pyatkov Vladislav Pyatkov

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m