SOLR-9716

RecoveryStrategy sends prep recovery cmd without setting a request timeout

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Currently, RecoveryStrategy sends the prep recovery command without setting a request timeout. This can be a long-running request, so if a network partition occurs in the middle of it, the recovering core will stay down forever.
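      The effect of a missing read timeout can be illustrated outside Solr. The following is a minimal sketch using plain java.net sockets, not the code in RecoveryStrategy; the class and method names (ReadTimeoutSketch, readTimesOut) are hypothetical. A local server accepts the TCP connection but never replies, standing in for a peer lost behind a network partition. With a read timeout set, the blocked read gives up with SocketTimeoutException; with a timeout of 0 it would block indefinitely.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration; none of these names come from Solr.
class ReadTimeoutSketch {

    /**
     * Connects to a local server that never sends a byte and attempts one read.
     * Returns true if the read gave up because the read timeout expired.
     * Note: a timeout of 0 means "wait forever" and would never return here.
     */
    static boolean readTimesOut(int readTimeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server socket is bound but accept() is never called; the OS still
        // completes the TCP handshake, so the client connects and then hears nothing.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(readTimeoutMillis); // the read timeout
            try {
                client.getInputStream().read();
                return false; // data or EOF arrived, so no timeout occurred
            } catch (SocketTimeoutException e) {
                return true; // the read was abandoned instead of hanging
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

      With the timeout in place, the caller gets an exception it can handle (for example by retrying) rather than a thread that is blocked forever.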

      1. SOLR-9716.patch
        8 kB
        Shalin Shekhar Mangar
      2. SOLR-9716.patch
        8 kB
        Cao Manh Dat
      3. SOLR-9716.patch
        7 kB
        Cao Manh Dat
      4. SOLR-9716.patch
        9 kB
        Cao Manh Dat
      5. SOLR-9716.patch
        3 kB
        Cao Manh Dat

          Activity

          caomanhdat Cao Manh Dat added a comment -

          Initial solution for this issue, without tests. This patch was tested on solr-jepsen (https://github.com/LucidWorks/jepsen/tree/solr-jepsen) and passed.

          This is a critical issue: it can cause a replica to stay down forever, so I think we can commit this patch first and create another issue for adding a unit test for this ticket.

          caomanhdat Cao Manh Dat added a comment -

          The updated patch includes a test.

          In this patch I introduced a new class called ChaosHttpSolrClient, which randomly waits forever (in the case of a PREPRECOVERY request) if socketTimeOut is not set.

          caomanhdat Cao Manh Dat added a comment -

          Cleaner test based on TestInjection.

          caomanhdat Cao Manh Dat added a comment -

          Updated patch: modified TestInjection.injectPrepRecoveryOpPauseForever() to make sure that it does not pause on every call. Otherwise, the test will sometimes fail because of a timeout while waiting for the collection to become active.
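          One way to keep an injected pause from firing on every call is to combine a random chance with a cap on the total number of pauses. The sketch below is only an assumption about how such a hook could look; it is not the code of TestInjection.injectPrepRecoveryOpPauseForever(), and the class name, constructor parameters, and the cap itself are hypothetical.

```java
import java.util.Random;

// Hypothetical illustration; not the actual TestInjection implementation.
class PauseInjectionSketch {
    private final Random random;
    private final int percentChance; // chance (0-100) that a call is selected for a pause
    private final int maxPauses;     // assumed cap so later requests can still succeed
    private int pausesInjected = 0;

    PauseInjectionSketch(long seed, int percentChance, int maxPauses) {
        this.random = new Random(seed);
        this.percentChance = percentChance;
        this.maxPauses = maxPauses;
    }

    /** Returns true if this call should simulate a request that never returns. */
    synchronized boolean shouldPause() {
        if (random.nextInt(100) >= percentChance) {
            return false; // not selected this time
        }
        if (pausesInjected >= maxPauses) {
            return false; // cap reached: stop injecting so recovery can finish
        }
        pausesInjected++;
        return true;
    }

    public static void main(String[] args) {
        PauseInjectionSketch injector = new PauseInjectionSketch(42L, 50, 1);
        for (int i = 0; i < 5; i++) {
            System.out.println("call " + i + " pauses: " + injector.shouldPause());
        }
    }
}
```

          Because the number of pauses is bounded, a retried request eventually goes through, which is what lets a test wait for the collection to become active without timing out.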

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Dat. This looks good. Earlier the read timeout was effectively infinite so I think we should probably wait for more than a minute. I'll bump this up to 5 minutes and commit this patch.

          caomanhdat Cao Manh Dat added a comment -

          That would be great!

          shalinmangar Shalin Shekhar Mangar added a comment -

          The earlier patches did not break the loop after a successful call. This patch fixes it and increases the max wait to 300 seconds. I'll commit after running tests and precommit.
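          The shape of such a loop can be sketched as follows: each attempt is given a bounded time, the loop stops as soon as an attempt succeeds, and an overall deadline caps the total wait. This is a generic sketch under stated assumptions, not the code committed to RecoveryStrategy; the class name, method name, and the use of an ExecutorService are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not the actual RecoveryStrategy loop.
class PrepRecoveryRetrySketch {

    /**
     * Retries the attempt until it succeeds or maxWaitMillis has elapsed.
     * Returns the number of attempts made on success, or -1 if the deadline passed.
     */
    static int attemptsUntilSuccess(Callable<Boolean> attempt,
                                    long perAttemptMillis,
                                    long maxWaitMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        int attempts = 0;
        boolean succeeded = false;
        try {
            while (System.nanoTime() - deadline < 0) {
                attempts++;
                Future<Boolean> future = executor.submit(attempt);
                try {
                    if (future.get(perAttemptMillis, TimeUnit.MILLISECONDS)) {
                        succeeded = true;
                        break; // leave the loop after a successful call
                    }
                } catch (TimeoutException e) {
                    future.cancel(true); // give up on the hung attempt and retry
                } catch (ExecutionException e) {
                    // the attempt failed with an exception; fall through and retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            executor.shutdownNow(); // interrupt any attempt that is still running
        }
        return succeeded ? attempts : -1;
    }

    public static void main(String[] args) {
        // Values here are illustrative; the issue discusses a 300-second max wait.
        int attempts = attemptsUntilSuccess(() -> true, 1_000, 300_000);
        System.out.println("succeeded after attempts: " + attempts);
    }
}
```

          Without the break, a loop like this would keep issuing requests after a success until the deadline expired, which is the kind of defect the comment above describes.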

          jira-bot ASF subversion and git services added a comment -

          Commit 1f1990d8be9fbbe0d95a10f3be1dffccec969a32 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f1990d ]

          SOLR-9716: RecoveryStrategy sends prep recovery command without setting read time out which can cause replica recovery to hang indefinitely on network partitions

          jira-bot ASF subversion and git services added a comment -

          Commit 9a8030171cfaf529e5de0edae0a5ceddb871d3ff in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9a80301 ]

          SOLR-9716: RecoveryStrategy sends prep recovery command without setting read time out which can cause replica recovery to hang indefinitely on network partitions

          (cherry picked from commit 1f1990d)

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Dat!

          jira-bot ASF subversion and git services added a comment -

          Commit 1f1990d8be9fbbe0d95a10f3be1dffccec969a32 in lucene-solr's branch refs/heads/apiv2 from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f1990d ]

          SOLR-9716: RecoveryStrategy sends prep recovery command without setting read time out which can cause replica recovery to hang indefinitely on network partitions


            People

            • Assignee:
              shalinmangar Shalin Shekhar Mangar
              Reporter:
              caomanhdat Cao Manh Dat
            • Votes:
              0
              Watchers:
              3
