Apache IoTDB / IOTDB-953

[Distributed] Improve handling when a node cannot be reached

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • None
    • Core/Cluster
    • None

    Description

      When a node fails to send a request to another node, it records one failure for that node in its ClientPool. Once the failure count reaches 3, the pool refuses to hand out clients for that node for 60 s.
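
      To make the behaviour concrete, here is a minimal sketch of such a failure gate. It is a hypothetical model written for this issue, not IoTDB's actual ClientPool code: the class name, method names, and the explicit `nowMillis` parameter are assumptions; only the threshold of 3 and the 60 s window come from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the described behaviour; not IoTDB's actual ClientPool. */
class FailureCountingGate {
    static final int MAX_FAILURES = 3;         // threshold from the description
    static final long BLOCK_MILLIS = 60_000L;  // 60 s block window from the description

    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();

    /** Called whenever a request to the node fails, whatever the cause. */
    synchronized void onError(String node, long nowMillis) {
        int count = failures.merge(node, 1, Integer::sum);
        if (count >= MAX_FAILURES) {
            // Start the block window and reset the counter for the next round.
            blockedUntil.put(node, nowMillis + BLOCK_MILLIS);
            failures.remove(node);
        }
    }

    /** Returns true while the gate refuses to hand out clients for the node. */
    synchronized boolean isBlocked(String node, long nowMillis) {
        Long until = blockedUntil.get(node);
        return until != null && nowMillis < until;
    }
}
```

      Note that nothing in this model clears the block before the window expires, and `onError` does not look at the cause of the failure.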

      This implementation has three main drawbacks:
      1. It does not distinguish network connection errors from other errors. Every call to a client's `onError()` increments the failure count, even when the call was not caused by a network failure.
      2. Heartbeats should not be subject to this mechanism. One purpose of heartbeats is to detect whether a node is still alive, and they need clients to do so. If the mechanism blocks them, we lose the chance to reconnect to the node early and must wait the full 60 s even if the node has already recovered.
      3. Heartbeat successes do not unblock other requests. Because heartbeats use a separate client pool, a successful heartbeat to a node only unblocks other heartbeats to that node; other requests stay blocked for 60 s because they draw clients from a different pool.
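
      One possible way to address the three drawbacks is sketched below. This is not the fix that was merged for this issue; it is an illustration under stated assumptions. `NodeAvailability` and its methods are hypothetical names, a single instance is assumed to be shared by the heartbeat pool and the regular request pool, and treating any `IOException` in the cause chain as a network failure is a rough heuristic chosen for the example.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; names are illustrative and not taken from IoTDB. */
class NodeAvailability {
    static final int MAX_FAILURES = 3;
    static final long BLOCK_MILLIS = 60_000L;

    private static final class State { int failures; long blockedUntil; }

    // Assumed to be shared by the heartbeat pool and the regular request pool.
    private final Map<String, State> states = new HashMap<>();

    /** Drawback 1: count only errors that look like network failures. */
    synchronized void onError(String node, Throwable cause, long nowMillis) {
        if (!isNetworkError(cause)) {
            return;
        }
        State s = states.computeIfAbsent(node, k -> new State());
        if (++s.failures >= MAX_FAILURES) {
            s.blockedUntil = nowMillis + BLOCK_MILLIS;
            s.failures = 0;
        }
    }

    /** Drawback 3: any success, including a heartbeat, lifts the block for all requests. */
    synchronized void onSuccess(String node) {
        states.remove(node);
    }

    /** Drawback 2: heartbeats are never refused, so they can detect recovery early. */
    synchronized boolean mayAcquireClient(String node, boolean isHeartbeat, long nowMillis) {
        if (isHeartbeat) {
            return true;
        }
        State s = states.get(node);
        return s == null || nowMillis >= s.blockedUntil;
    }

    /** Assumption: an IOException anywhere in the cause chain means a network failure. */
    static boolean isNetworkError(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException) {
                return true;
            }
        }
        return false;
    }
}
```

      With shared state, a heartbeat that succeeds can call `onSuccess` and immediately re-enable ordinary requests to the recovered node instead of leaving them blocked for the rest of the 60 s window.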

            People

              houliang Houliang Qi
              jt2594838 Tian Jiang