Solr
SOLR-8367

The new LIR 'all replicas participate' failsafe code needs to be improved.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels: None

      Description

      For one, it currently only kicks in for the first replica that attempts to become the leader. If a different replica is the one stuck in LIR, the failsafe won't help it.

      Second, when we attempt to become the leader knowing we might fail due to LIR, we should not put other replicas into recovery when they fail to sync with us - not until we know we will actually be the leader.
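
      For context, the failsafe boils down to a condition along these lines - a minimal sketch with made-up names, not the actual election code - and it needs to be evaluated for whichever replica is currently attempting to lead, not only the first one in line:

      {code:java}
      import java.util.Set;

      // Minimal sketch of the 'all replicas participate' idea; the names are
      // illustrative, not the real Solr classes or methods.
      class LirFailsafeSketch {
        /**
         * A candidate flagged by leader-initiated recovery (LIR) may still become
         * the leader if every known replica of the shard is taking part in the
         * election: in that case no replica with possibly newer data is left out.
         */
        static boolean mayBecomeLeaderDespiteLir(boolean inLir,
                                                 Set<String> allReplicas,
                                                 Set<String> electionParticipants) {
          if (!inLir) {
            return true; // not in LIR, normal leadership rules apply
          }
          return electionParticipants.containsAll(allReplicas);
        }
      }
      {code}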

      Attachments

      1. SOLR-8367.patch
        17 kB
        Mark Miller
      2. SOLR-8367.patch
        16 kB
        Mark Miller
      3. SOLR-8367.patch
        6 kB
        Mark Miller

          Activity

          Mark Miller added a comment -

          Patch with initial logic change.

          We need to see if all the replicas are involved even if we are not the first replica in line for the leader election.

          We also wait to request recoveries until we actually see we will be the leader.
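
          As a rough illustration of the second change (all names here are hypothetical, not the patch itself): peers that fail to sync are only remembered during the attempt, and the recovery requests go out only once leadership is actually confirmed.

          {code:java}
          import java.util.ArrayList;
          import java.util.List;
          import java.util.function.BooleanSupplier;
          import java.util.function.Consumer;
          import java.util.function.Predicate;

          // Illustrative sketch only: remember the peers that failed to sync, but
          // only ask them to recover once leadership is actually confirmed.
          class DeferredRecoverySketch {
            static void syncThenMaybeRequestRecoveries(List<String> otherReplicas,
                                                       Predicate<String> syncWithPeer,
                                                       BooleanSupplier confirmedLeader,
                                                       Consumer<String> requestRecovery) {
              List<String> failedToSync = new ArrayList<>();
              for (String replica : otherReplicas) {
                if (!syncWithPeer.test(replica)) {
                  failedToSync.add(replica); // note it, but take no action yet
                }
              }
              // Only after the election has actually gone our way do we push the
              // peers that failed to sync into recovery.
              if (confirmedLeader.getAsBoolean()) {
                failedToSync.forEach(requestRecovery);
              }
            }
          }
          {code}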

          Mark Miller added a comment -

          Working on a test, just hitting some issues. I think someone broke it so that when you restart a Jetty instance in a test, it's not getting the same data dir or something.

          Mark Miller added a comment -

          Or maybe I'm just being dopey and I need to hard-code a filesystem-based directory.

          I'm just surprised that when you don't, the tlogs don't survive a restart either. Perhaps that's because of the old index dir cleanup code.

          Mark Miller added a comment -

          Okay, here is basically what we need with additional testing. Still need to give it another once over.

          Mike Drob added a comment -

          In SyncStrategy, the try-catch at line ~218 can go away.
          Do we need to check isClosed again during requestRecoveries? It's possible that it's false while the recoveries are being set up but has changed to true by the time we actually make the RPC (see the sketch after this comment).
          An optimization might be to break out of the whole loop when closed, since it looks like not much will be happening anyway.

          In ZkController, can we log which retry we are on (in both places)? That will make parsing logs easier when this failure happens.

          You have a couple System.out.println in SolrCore that could probably be log.debug or even trace.

          In the test, you could store ((HttpSolrClient) clients.get(0)).getBaseURL() as a local variable instead of loading it twice.

          If you're fixing shard count to 2 in the test, do we still want to createCollection(testCollectionName, 1, 3, 1); for three shards?

          Architecture question – What happens when you send an update FROMLEADER to the one that happens to be the leader? Also, why are we using decreasing versions?
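
          The isClosed point above is the usual time-of-check/time-of-use gap; a generic sketch of the cheap mitigation, re-checking the flag right before each request goes out, could look like this (illustrative names, not the real SyncStrategy code):

          {code:java}
          import java.util.List;
          import java.util.concurrent.atomic.AtomicBoolean;

          // Generic sketch: the closed flag can flip between building the recovery
          // requests and sending them, so it is re-checked just before each send.
          class ClosedRecheckSketch {
            private final AtomicBoolean closed = new AtomicBoolean(false);

            void requestRecoveries(List<Runnable> recoveryRequests) {
              if (closed.get()) {
                return; // checked once while setting up ...
              }
              for (Runnable request : recoveryRequests) {
                if (closed.get()) {
                  return; // ... and re-checked right before each request is fired off
                }
                request.run();
              }
            }

            void close() {
              closed.set(true);
            }
          }
          {code}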

          Mark Miller added a comment - edited

          You have a couple System.out.println in SolrCore that could probably be log.debug or even trace.

          System.out is not allowed in SolrCore; I just have not cleaned it up yet.

          If you're fixing shard count to 2 in the test, do we still want to createCollection(testCollectionName, 1, 3, 1); for three shards?

          Yes, the create API uses 3 shards (there is a control jetty).

          What happens when you send an update FROMLEADER to the one that happens to be the leader?

          See the conflict exception we check for.

          Also, why are we using decreasing versions?

          Because it doesn't matter for this test.

          Mark Miller added a comment -

          I've cleaned up the patch and beasted the test. Looking good.

          Mark Miller added a comment -

          An optimization might be to break out of the whole loop when closed, since it looks like not much will be happening anyway.

          Since they are all fired off async, I don't know that it is really worth it. All the isClosed stuff is really just a best effort to bail early; it's not really critical that it happens at every point.

          Mike Drob added a comment -

          Since they are all fired off async, I don't know that it is really worth it. All the isClosed stuff is really just a best effort to bail early; it's not really critical that it happens at every point.

          I see what you're saying about it being async, so it was still possible for a close to sneak in before this patch as well. If we're closed but still ask a replica to recover, I see that it has its own checks for shutting down and closed, so things will be fine there.

          Unrelated: while trying to trace the execution path here, I noticed that CoreAdminHandler::handleRequestRecoveryAction creates a thread and starts it without either giving it a name or submitting it to an executor. Should I file a separate JIRA for that? Looks like that thread was added by you in SOLR-4254, maybe that was before we had the executors everywhere.
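
          For reference, the cleanup suggested here is plain Java rather than anything Solr-specific: either give the ad-hoc thread a name, or hand the work to an executor whose thread factory names its threads. A sketch with placeholder names, not the CoreAdminHandler code:

          {code:java}
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.atomic.AtomicInteger;

          // Two usual fixes for an anonymous fire-and-forget thread: name it, or
          // submit it to an executor whose factory names the threads it creates.
          // The runnable and thread names here are placeholders.
          class NamedRecoveryThreadSketch {
            private static final AtomicInteger COUNTER = new AtomicInteger();

            static void nameTheThread(Runnable recoveryCmd) {
              new Thread(recoveryCmd, "recovery-cmd-" + COUNTER.incrementAndGet()).start();
            }

            static void useAnExecutor(Runnable recoveryCmd) {
              // In real code the executor would be long-lived and shared.
              ExecutorService executor = Executors.newSingleThreadExecutor(
                  r -> new Thread(r, "recovery-cmd-" + COUNTER.incrementAndGet()));
              executor.submit(recoveryCmd);
              executor.shutdown();
            }
          }
          {code}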

          Mark Miller added a comment -

          Should I file a separate JIRA for that?

          Yeah, probably a good idea.

          maybe that was before we had the executors everywhere.

          Yeah, most of the first-pass cloud code was not using executors - recovery stuff, SyncStrategy stuff, etc. Most of it has been converted over time.

          ASF subversion and git services added a comment -

          Commit 1718987 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1718987 ]

          SOLR-8367: Fix the LeaderInitiatedRecovery 'all replicas participate' fail-safe.

          ASF subversion and git services added a comment -

          Commit 1719005 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1719005 ]

          SOLR-8367: Fix the LeaderInitiatedRecovery 'all replicas participate' fail-safe.

          Mark Miller added a comment -

          Thanks for the review Mike! Let me know if anything else comes up.


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 7

              Dates

              • Created:
                Updated:
                Resolved:

                Development