  1. Solr
  2. SOLR-14356

PeerSync should not fail with SocketTimeoutException from hanging nodes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (9.0), 8.6
    • Component/s: None
    • Labels:
      None

      Description

      Currently, in PeerSync (during leader election), if requesting versions from a node throws an exception, we skip that node when the exception is one of the following types:

      • ConnectTimeoutException
      • NoHttpResponseException
      • SocketException

      Sometimes the other node hangs but still accepts connections. In that case a SocketTimeoutException is thrown, we consider the PeerSync process failed, and the whole shard stays leaderless for as long as the hung node is still there.

      We can't just blindly add SocketTimeoutException to the list above, since Shalin Shekhar Mangar mentioned that timeouts can sometimes happen for genuine reasons too, e.g. a temporary GC pause.
      I think the general idea here is that we obey the leaderVoteWait restriction and retry syncing with the other nodes when a connection/timeout exception happens.
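
      To make the idea above concrete, here is a rough Java sketch, not the attached patch and not Solr's actual PeerSync code. It assumes a hypothetical VersionRequest callback that asks one replica for its versions, skips replicas that fail with SocketException, and retries only the replicas that failed with SocketTimeoutException until a leaderVoteWait deadline passes. The names PeerSyncRetrySketch, VersionRequest, Outcome, syncWithRetry and pauseMs are all invented for illustration. Only java.net exceptions are handled so the sketch has no dependencies; ConnectTimeoutException and NoHttpResponseException (Apache HttpClient) would presumably be handled like the skippable case.

```java
import java.io.IOException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PeerSyncRetrySketch {

  /** Hypothetical stand-in for "request versions from one replica". */
  public interface VersionRequest {
    void run(String replica) throws IOException;
  }

  public enum Outcome { SYNCED, SKIPPED, TIMED_OUT }

  /** Classifies one attempt; any other IOException still propagates as a failure. */
  public static Outcome tryReplica(String replica, VersionRequest request) throws IOException {
    try {
      request.run(replica);
      return Outcome.SYNCED;
    } catch (SocketTimeoutException e) {
      // The node accepted the connection but did not answer in time:
      // it may be hung, or it may just be in a GC pause, so retry later.
      return Outcome.TIMED_OUT;
    } catch (SocketException e) {
      // The node is unreachable: skip it, as the description says is done today.
      return Outcome.SKIPPED;
    }
  }

  /**
   * Retries timed-out replicas until none time out or leaderVoteWaitMs elapses.
   * Returns the replicas that were still timing out at the end (empty if none).
   */
  public static List<String> syncWithRetry(List<String> replicas, VersionRequest request,
                                           long leaderVoteWaitMs, long pauseMs)
      throws IOException, InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(leaderVoteWaitMs);
    List<String> pending = new ArrayList<>(replicas);
    while (true) {
      List<String> timedOut = new ArrayList<>();
      for (String replica : pending) {
        if (tryReplica(replica, request) == Outcome.TIMED_OUT) {
          timedOut.add(replica);
        }
      }
      // Stop when nothing timed out, or when the leaderVoteWait budget is used up.
      if (timedOut.isEmpty() || System.nanoTime() - deadline >= 0) {
        return timedOut;
      }
      Thread.sleep(pauseMs);
      pending = timedOut; // only retry the replicas that timed out
    }
  }
}
```

      An empty result means no replica is still timing out. What the caller should do with a non-empty result once leaderVoteWait has elapsed (for example, treat those replicas as skipped) is not stated in this issue and is left open in the sketch.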

        Attachments

        1. SOLR-14356.patch
          1 kB
          Cao Manh Dat
        2. SOLR-14356.patch
          1 kB
          Cao Manh Dat

          Activity

            People

            • Assignee:
              caomanhdat Cao Manh Dat
              Reporter:
              caomanhdat Cao Manh Dat
            • Votes:
              0
              Watchers:
              5