SOLR-5240

SolrCloud node doesn't (quickly) come all the way back

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.5
    • Fix Version/s: 4.5, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Killing a single node and bringing it back up can result in "waiting until we see more replicas up..."

      1. SOLR-5240.patch
        1 kB
        Yonik Seeley

        Activity

        Yonik Seeley added a comment -

        I was doing some ad-hoc testing of the current state of cloud when I ran across this bug.
        Steps to reproduce:

        # our standard cloud bootstrap
        java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=c1 -DzkRun -DnumShards=2 -jar start.jar
        # create a new collection
        curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=c2&replicationFactor=4&router=implicit&shards=s1&maxShardsPerNode=100"
        # now kill the server (CTRL-C on the console or whatever)
        
        # now restart the server
        java -DzkRun -jar start.jar
        
        # the admin console is now unresponsive, and the logs show:
        12628 [coreLoadExecutor-4-thread-2] INFO  org.apache.solr.cloud.ShardLeaderElectionContext  – Waiting until we see more replicas up for shard s1: total=4 found=3 timeoutin=179999
        32701 [coreLoadExecutor-4-thread-2] INFO  org.apache.solr.cloud.ShardLeaderElectionContext  – Waiting until we see more replicas up for shard s1: total=4 found=3 timeoutin=159926
        [...]
        
        Yonik Seeley added a comment -

        Since 3 out of the 4 replicas are up... my guess is that this has to do with parallel core loading (which defaults to loading 3 cores at a time). And I guess the "wait for more replicas" is part of the core "loading"... hence a temporary deadlock until things time out.
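The suspected starvation can be reproduced outside Solr with a small sketch. This is not Solr's code: the class and method names are hypothetical, and it assumes that each load task marks its replica as up and then waits, with a timeout, until every replica of the shard is up. With a pool of 3 threads and 4 replicas, the fourth task cannot start until one of the first three gives up waiting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from Solr.
class CoreLoadStarvationSketch {

    /**
     * Runs one "load" task per replica on a fixed pool of poolThreads threads.
     * Each task marks its replica as up, then waits up to waitMillis for all
     * replicas to be up. Returns true if at least one task hit the timeout.
     */
    static boolean anyLoadTimedOut(int poolThreads, int replicas, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolThreads);
        CountDownLatch allUp = new CountDownLatch(replicas);     // reaches 0 when every replica is up
        CountDownLatch finished = new CountDownLatch(replicas);  // reaches 0 when every task has ended
        AtomicInteger timedOut = new AtomicInteger();

        for (int i = 0; i < replicas; i++) {
            pool.execute(() -> {
                allUp.countDown(); // this replica is now "up"
                try {
                    // Holds the pool thread while waiting for the other replicas.
                    if (!allUp.await(waitMillis, TimeUnit.MILLISECONDS)) {
                        timedOut.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            });
        }

        try {
            finished.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for load tasks", e);
        } finally {
            pool.shutdown();
        }
        return timedOut.get() > 0;
    }

    public static void main(String[] args) {
        // Fewer threads than replicas: the last task is queued behind the waiters.
        System.out.println("3 threads, 4 replicas, timed out: " + anyLoadTimedOut(3, 4, 500));
        // One thread per replica: all tasks run at once and nobody should wait long.
        System.out.println("4 threads, 4 replicas, timed out: " + anyLoadTimedOut(4, 4, 500));
    }
}
```

On a typical run the first call reports a timeout and the second does not, which mirrors the "total=4 found=3" log lines: three loaders hold all the threads while the replica they are waiting for sits in the queue.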

        Yonik Seeley added a comment -

        Here's the simplest patch that fixes it - removing any executor thread limit when in ZK mode. Note that this deadlock-until-timeout situation can also easily happen even when replicas of a particular shard aren't on the same node. All that is required is to have more than 3 cores per node.
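A minimal sketch of the idea behind the fix, not the patch itself: pick an unbounded executor when running in ZK mode and keep the configured limit otherwise. The factory name, the `zkMode` flag, and the `configuredThreads` parameter are assumptions made for illustration; the attached patch may express this differently.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Solr.
class CoreLoadExecutorSketch {

    /**
     * Returns an executor for core loading.
     * In ZK mode the pool has no practical thread limit, so a load task that
     * waits for other replicas can never block the tasks it is waiting for.
     * Outside ZK mode the pool is capped at configuredThreads.
     */
    static ThreadPoolExecutor newCoreLoadExecutor(boolean zkMode, int configuredThreads) {
        if (zkMode) {
            // SynchronousQueue hands each task directly to a thread,
            // creating a new thread whenever none is idle.
            return new ThreadPoolExecutor(
                    0, Integer.MAX_VALUE, 5L, TimeUnit.SECONDS, new SynchronousQueue<>());
        }
        // Fixed-size pool: extra tasks wait in the queue until a thread is free.
        return new ThreadPoolExecutor(
                configuredThreads, configuredThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor cloud = newCoreLoadExecutor(true, 3);
        ThreadPoolExecutor standalone = newCoreLoadExecutor(false, 3);
        System.out.println("ZK mode max threads: " + cloud.getMaximumPoolSize());
        System.out.println("non-ZK max threads: " + standalone.getMaximumPoolSize());
        cloud.shutdown();
        standalone.shutdown();
    }
}
```

The trade-off is that a node with many cores may start many threads at once during startup; the comment above accepts that in exchange for removing the wait-until-timeout stall.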

        Mark Miller added a comment -

        +1 - any other fix seems somewhat complicated.

        ASF subversion and git services added a comment -

        Commit 1523871 from Yonik Seeley in branch 'dev/trunk'
        [ https://svn.apache.org/r1523871 ]

        SOLR-5240: unlimited core loading threads to fix waiting-for-other-replicas deadlock

        ASF subversion and git services added a comment -

        Commit 1523872 from Yonik Seeley in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1523872 ]

        SOLR-5240: unlimited core loading threads to fix waiting-for-other-replicas deadlock

        ASF subversion and git services added a comment -

        Commit 1523873 from Yonik Seeley in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1523873 ]

        SOLR-5240: unlimited core loading threads to fix waiting-for-other-replicas deadlock

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            0
            Watchers:
            5

            Dates

            • Created:
              Updated:
              Resolved:
