Solr / SOLR-9446

Leader failure after creating a freshly replicated index can send nodes into recovery even if index was not changed

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.3, master (7.0)
    • Component/s: replication (java)
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

      Description

      We noticed this issue while migrating a Solr index from machines A1, A2, and A3 to B1, B2, and B3. We followed the steps below (there were no updates during the migration process).

      • The index had replicas on machines A1, A2, and A3. Let's say A1 was the leader at the time.
      • We added 3 more replicas: B1, B2, and B3. These nodes synced with the leader via replication, so these fresh nodes have no tlogs.
      • We shut down one of the old nodes (A3).
      • We then shut down the leader (A1).
      • A new leader was elected (let's say A2).
      • The new leader asked all the replicas to sync with it.
      • The fresh nodes (the ones without tlogs) first tried PeerSync, but since there was no frame of reference, PeerSync failed and the fresh nodes fell back to replication.

      Although replication would not copy all the segments again, it seems we could short-circuit the sync and put nodes back into the active state as soon as possible.

      If a freshly replicated index becomes the leader for some reason, it can still send nodes (both other freshly replicated indexes and the old replicas) into recovery. Here is the scenario:

      • A freshly replicated node becomes the leader.
      • The new leader nevertheless asks all the replicas to sync with it.
      • The replicas (including the old ones) ask the leader for versions, but the leader has no update log, so the replicas cannot compute the missing versions and fall back to replication (see the sketch below).
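      A minimal sketch of why this version exchange fails when the leader has no update log (illustrative only; the class and method names below are hypothetical, not the actual Solr code):

          // Illustrative sketch: during PeerSync a replica asks the leader for recent versions.
          // A freshly replicated leader has no tlog, so it reports no versions, the replica has
          // no frame of reference, and the sync falls back to replication even though the
          // index contents may already be identical.
          import java.util.Collections;
          import java.util.List;

          class PeerSyncFallbackSketch {
              /** Versions the leader reports from its update log; empty if it has no tlog. */
              static List<Long> getVersionsFromLeader(boolean leaderHasTlog) {
                  return leaderHasTlog ? List.of(101L, 102L, 103L) : Collections.emptyList();
              }

              static boolean peerSync(boolean leaderHasTlog) {
                  List<Long> leaderVersions = getVersionsFromLeader(leaderHasTlog);
                  if (leaderVersions.isEmpty()) {
                      return false; // PeerSync fails; the caller falls back to replication
                  }
                  // Normal PeerSync would fetch and replay only the missing versions here.
                  return true;
              }
          }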
        Attachments

        • SOLR-9446.patch (14 kB, Pushkar Raste)


          Activity

          Pushkar Raste added a comment -

          I can think of a couple of ways to solve it using a fingerprint comparison:

          1. Add a fingerprint check in SyncStrategy.syncToMe() and request a replica to sync only if the fingerprint does not match
          2. Add a fingerprint check in RecoveryStrategy.doRecovery() and initiate recovery only if the fingerprint does not match
          3. Add a fingerprint check in PeerSync.sync() to check whether we are already in sync

          I think we almost always try PeerSync before trying replication, so #3 should work.
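          A rough sketch of the fingerprint idea behind all three options (hypothetical names; the actual patch uses Solr's IndexFingerprint class, whose exact fields and API are not reproduced here):

              // Hypothetical sketch: a fingerprint summarizes the versions/docs present in a core,
              // so two cores with equal fingerprints are already in sync and need no recovery.
              final class Fingerprint {
                  final long maxVersionEncountered;
                  final long versionsHash;
                  final long numDocs;

                  Fingerprint(long maxVersionEncountered, long versionsHash, long numDocs) {
                      this.maxVersionEncountered = maxVersionEncountered;
                      this.versionsHash = versionsHash;
                      this.numDocs = numDocs;
                  }

                  /** True if the other core already holds the same data as this one. */
                  boolean matches(Fingerprint other) {
                      return maxVersionEncountered == other.maxVersionEncountered
                          && versionsHash == other.versionsHash
                          && numDocs == other.numDocs;
                  }
              }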

          Pushkar Raste added a comment -

          It also seems that if I take either approach #1 or approach #2, I will have to add the check in more than one place to cover multiple scenarios (e.g. LIR, a node coming out of a long GC pause, the getVersions call to RealTimeGetComponent with sync).

          As I mentioned in the last comment, since we always try PeerSync first, adding a check in PeerSync.sync() seems like the easiest/cleanest way to fix it.
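          Building on the Fingerprint sketch above, a rough outline of where such a check could sit in the sync path (approach #3); again these are hypothetical names, not the actual PeerSync code:

              // Hypothetical placement of the short circuit: compare fingerprints before doing
              // any version exchange, and report success immediately if they already match.
              final class SyncShortCircuitSketch {
                  private final Fingerprint leaderFingerprint;
                  private final Fingerprint myFingerprint;

                  SyncShortCircuitSketch(Fingerprint leaderFingerprint, Fingerprint myFingerprint) {
                      this.leaderFingerprint = leaderFingerprint;
                      this.myFingerprint = myFingerprint;
                  }

                  boolean sync() {
                      if (leaderFingerprint != null && leaderFingerprint.matches(myFingerprint)) {
                          // Indexes already match; stay ACTIVE instead of dropping into recovery.
                          return true;
                      }
                      return syncVersionsFromLeader();
                  }

                  private boolean syncVersionsFromLeader() {
                      // Placeholder for the normal PeerSync flow (request versions, replay missing ones).
                      return false;
                  }
              }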

          ASF GitHub Bot added a comment -

          GitHub user praste opened a pull request:

          https://github.com/apache/lucene-solr/pull/73

          SOLR-9446 Do a fingerprint check before starting PeerSync

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/praste/lucene-solr SOLR-9446

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/73.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #73


          commit 82e2fb5914a202f7577b92b999370cfb6fcc605b
          Author: Pushkar Raste <praste@bloomberg.net>
          Date: 2016-08-26T17:50:40Z

          SOLR-9446 Do a fingerprint check before starting PeerSync


          Noble Paul added a comment -

          I found the following assertion commented out in the test case:

          // assertEquals("FreshNode went into recovery", numRequestsBefore, numRequestsAfter);
          

          I tested by uncommenting it; it passes anyway.
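          For context, a rough sketch of what that assertion guards (hypothetical helper names; the real test's setup is not shown here): the fresh replica's replication handler should see no new fetch requests across the leader change.

              // Hypothetical outline: capture the replication request count on the fresh replica,
              // force a leader change, and assert the count did not grow, i.e. the fresh node
              // did not fall back into replication recovery.
              long numRequestsBefore = getReplicationRequestCount(freshReplica);
              killCurrentLeader();
              waitForNewLeaderAndActiveReplicas();
              long numRequestsAfter = getReplicationRequestCount(freshReplica);
              assertEquals("FreshNode went into recovery", numRequestsBefore, numRequestsAfter);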

          ASF subversion and git services added a comment -

          Commit 15cee3141c160c38756ceed73bd1cd88002c01c9 in lucene-solr's branch refs/heads/master from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=15cee31 ]

          SOLR-9446: Leader failure after creating a freshly replicated index can send nodes into recovery even if index was not changed

          ASF subversion and git services added a comment -

          Commit 8502995e3b1ce66db49be26b23a3fa3c169345a8 in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8502995 ]

          SOLR-9446: Leader failure after creating a freshly replicated index can send nodes into recovery even if index was not changed

          Jim Musil added a comment -

          FWIW, this was a particularly bad problem for us. In the scenario outlined in the description, our old nodes were going down at different times, generally while the new nodes were in recovery. This produced a situation where all the live nodes were in recovery but could never recover. The new nodes did not serve requests and the collection was dead in the water.

          Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.


            People

            • Assignee: Noble Paul
            • Reporter: Pushkar Raste
            • Votes: 0
            • Watchers: 9
