  1. Ignite
  2. IGNITE-13012

Fix failure detection timeout. Simplify node ping routine.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.1
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
    • Release Note:
      Fixed processing of the failure detection timeout in TcpDiscoverySpi. If a node fails to send a message or ping, it now drops the current connection strictly within this timeout and begins establishing a new connection much faster.
    • Ignite Flags:
      Release Notes Required

      Description

      A connection failure may not be detected within IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. In addition, the node ping routine is duplicated.
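      To make the worst-case figure concrete, here is a minimal arithmetic sketch. It assumes both values are in milliseconds and uses an illustrative timeout of 10 000 ms; the class and method names are hypothetical and are not part of Ignite.

```java
class WorstCaseDelay {
    /** Connection check interval quoted in this issue; assumed to be milliseconds. */
    static final long CON_CHECK_INTERVAL = 500;

    /**
     * Worst-case detection delay as described above: the fixed check
     * interval is added on top of the configured timeout.
     */
    static long worstCaseDelay(long failureDetectionTimeout) {
        return CON_CHECK_INTERVAL + failureDetectionTimeout;
    }

    public static void main(String[] args) {
        long failureDetectionTimeout = 10_000; // illustrative value, an assumption
        long delay = worstCaseDelay(failureDetectionTimeout);
        System.out.println("Configured timeout: " + failureDetectionTimeout + " ms");
        System.out.println("Worst-case delay:   " + delay + " ms");
    }
}
```

      Under these assumptions the failure may go unnoticed for 500 ms longer than the configured timeout, and the overshoot does not shrink when the timeout is lowered because the interval is a constant.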

      We should fix:

      1. The failure detection timeout should take into account the last sent message. Currently, the ping is bound to its own timestamp:

      ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent

      This is odd, because any discovery message checks the connection.

      2. Make the connection check interval depend on the failure detection timeout (FDT). The current value is a constant:

      static int ServerImpl.CON_CHECK_INTERVAL = 500

      3. Remove the additional, quickened connection check. Once fix 1 is done, it becomes even more useless.
      Although TCP discovery has a connection check period, it may send a ping before this period elapses. For some reason, this premature ping also depends on the time of the last received message.

      4. Do not alarm the user with “Node seems disconnected” when everything is OK. Once fixes 1 and 3 are done, this message becomes even more useless.
      A node may log at INFO level: “Local node seems to be disconnected from topology …” when it is not actually disconnected at all.
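
      Fixes 1 and 2 could be sketched roughly as follows. This is not the code from the attached patch: the class, its methods, and the divisor CHECKS_PER_TIMEOUT are hypothetical names chosen for illustration. The sketch only shows the idea that any sent message resets the ping timer and that the check interval is derived from the timeout instead of being a constant.

```java
class ConnectionCheckSketch {
    /** Hypothetical divisor: how many check intervals fit into one timeout. */
    static final int CHECKS_PER_TIMEOUT = 4;

    private final long failureDetectionTimeout;

    /** Time of the last sent message of any kind, not only of pings. */
    private long lastMsgSentTime;

    ConnectionCheckSketch(long failureDetectionTimeout, long now) {
        if (failureDetectionTimeout <= 0)
            throw new IllegalArgumentException("Timeout must be positive.");

        this.failureDetectionTimeout = failureDetectionTimeout;
        this.lastMsgSentTime = now;
    }

    /** Fix 2: the interval follows the timeout instead of a fixed 500. */
    long connCheckInterval() {
        return Math.max(1, failureDetectionTimeout / CHECKS_PER_TIMEOUT);
    }

    /** Fix 1: every sent discovery message counts as a connection check. */
    void onMessageSent(long now) {
        lastMsgSentTime = now;
    }

    /** A dedicated ping is needed only if nothing was sent for a whole interval. */
    boolean pingRequired(long now) {
        return now - lastMsgSentTime >= connCheckInterval();
    }

    public static void main(String[] args) {
        // Timestamps are passed in explicitly to keep the example deterministic.
        ConnectionCheckSketch check = new ConnectionCheckSketch(2_000, 0);

        System.out.println("Interval: " + check.connCheckInterval());
        System.out.println("Ping at 400? " + check.pingRequired(400));

        check.onMessageSent(450); // a regular message postpones the ping

        System.out.println("Ping at 900? " + check.pingRequired(900));
        System.out.println("Ping at 950? " + check.pingRequired(950));
    }
}
```

      With a single timestamp updated by every outgoing message, a separate premature ping (fix 3) has nothing left to add, which is why the description calls it even more useless once fix 1 is in place.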

        Attachments

        1. IGNITE-13012-patch.patch
          18 kB
          Vladimir Steshin


              People

              • Assignee:
                vladsz83 Vladimir Steshin
                Reporter:
                vladsz83 Vladimir Steshin

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 3h 40m