SOLR-5243: killing a shard in one collection can result in leader election in a different collection

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Discovered while doing some more ad-hoc testing... if I create two collections with the same shard name and then kill the leader in one, it can sometimes cause a leader election in the other (leaving the first leaderless).

      1. SOLR-5243.patch
        3 kB
        Mark Miller
      2. SOLR-5243.patch
        3 kB
        Mark Miller

          Activity

          Yonik Seeley added a comment -

          Steps to reproduce:

          # Bring up 2 nodes.
          cp -rp example example2
          cd example
          java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myConf -DzkRun -DnumShards=2 -jar start.jar
          
          # In a second terminal, from the same parent directory:
          cd example2
          java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
          
          # The steps below assume two collections, c2 and c3, that each have a shard named s1 (creation commands were not included in the original steps).
          # If both leaders aren't on port 8983, kill example2 and then bring it back up.
          
          # Look up the core name for the c2/s1 leader and unload it.
          curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=c2_s1_replica2"
          
          # Now observe two things:
          # 1) c2/s1 is now leaderless
          # 2) The leader of c3/s1 has switched to port 7574
          # From the logs on port 7574 we can see that leader election was kicked off for the wrong collection...
          
          102432 [main-EventThread] INFO  org.apache.solr.cloud.ShardLeaderElectionContext  – Running the leader process for shard s1
          102484 [main-EventThread] INFO  org.apache.solr.cloud.ShardLeaderElectionContext  – Checking if I should try and be the leader.
          102484 [main-EventThread] INFO  org.apache.solr.cloud.ShardLeaderElectionContext  – My last published State was Active, it's okay to be the leader.
          102484 [main-EventThread] INFO  org.apache.solr.cloud.ShardLeaderElectionContext  – I may be the new leader - try and sync
          102485 [main-EventThread] INFO  org.apache.solr.cloud.SyncStrategy  – Sync replicas to http://192.168.1.104:7574/solr/c3_s1_replica2/
          
          
          Yonik Seeley added a comment -

          It appears like the election process is OK... it's the unload that results in the wrong ephemeral node being removed.

          Mark Miller added a comment -

          It looks like Sami was storing the electionContexts by coreNodeName.

          Mark Miller added a comment -

          It would seem this only happens if you have the same core node name in different collections.
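
To illustrate the collision described in the comments above, here is a minimal sketch in plain Java. It is not the code from the attached patch: the class `ElectionContextKeys`, the `ContextKey` type, the value `core_node1`, and the string stand-ins for election contexts are all assumptions made for illustration. It shows only why a map keyed by coreNodeName alone cannot distinguish two collections, and how a key that also includes the collection name avoids that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ElectionContextKeys {

    /** Hypothetical composite key: collection name plus coreNodeName. */
    static final class ContextKey {
        private final String collection;
        private final String coreNodeName;

        ContextKey(String collection, String coreNodeName) {
            this.collection = collection;
            this.coreNodeName = coreNodeName;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ContextKey)) return false;
            ContextKey other = (ContextKey) o;
            return collection.equals(other.collection)
                && coreNodeName.equals(other.coreNodeName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(collection, coreNodeName);
        }
    }

    /** Keyed by coreNodeName only: which context is cancelled when c2's core is unloaded? */
    static String cancelledWithNameOnlyKey() {
        Map<String, String> contexts = new HashMap<>();
        contexts.put("core_node1", "c2/s1"); // registered for collection c2
        contexts.put("core_node1", "c3/s1"); // same name in c3 replaces c2's entry
        return contexts.remove("core_node1"); // unloading c2's core removes c3's context
    }

    /** Keyed by (collection, coreNodeName): the two entries no longer collide. */
    static String cancelledWithCompositeKey() {
        Map<ContextKey, String> contexts = new HashMap<>();
        contexts.put(new ContextKey("c2", "core_node1"), "c2/s1");
        contexts.put(new ContextKey("c3", "core_node1"), "c3/s1");
        return contexts.remove(new ContextKey("c2", "core_node1"));
    }

    public static void main(String[] args) {
        System.out.println("name-only key cancels: " + cancelledWithNameOnlyKey());
        System.out.println("composite key cancels: " + cancelledWithCompositeKey());
    }
}
```

With the name-only key, unloading the c2 core removes the entry that belongs to c3, which matches the reported symptom: c3/s1 re-elects a leader while c2/s1 is left without one.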

          Mark Miller added a comment -

          Hmm... not all tests are passing with that patch yet.

          Mark Miller added a comment -

          Another patch.

          Had to remove an assert in unregister, because it did not actually make sense on a failed core start, but this bug was hiding that.

          ASF subversion and git services added a comment -

          Commit 1524286 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1524286 ]

          SOLR-5243: Killing a shard in one collection can result in leader election in a different collection if they share the same coreNodeName.

          ASF subversion and git services added a comment -

          Commit 1524287 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1524287 ]

          SOLR-5243: Killing a shard in one collection can result in leader election in a different collection if they share the same coreNodeName.

          ASF subversion and git services added a comment -

          Commit 1524288 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1524288 ]

          SOLR-5243: CHANGES entry.

          ASF subversion and git services added a comment -

          Commit 1524289 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1524289 ]

          SOLR-5243: CHANGES entry.

          ASF subversion and git services added a comment -

          Commit 1524290 from Mark Miller in branch 'dev/branches/lucene_solr_4_5'
          [ https://svn.apache.org/r1524290 ]

          SOLR-5243: Killing a shard in one collection can result in leader election in a different collection if they share the same coreNodeName.

          ASF subversion and git services added a comment -

          Commit 1524291 from Mark Miller in branch 'dev/branches/lucene_solr_4_5'
          [ https://svn.apache.org/r1524291 ]

          SOLR-5243: CHANGES entry.

          Shalin Shekhar Mangar added a comment -

          I think this fix either caused a bug or uncovered a bug in shard splitting. ShardSplitTest has been failing sporadically since this was committed.

          Mark/Yonik, just off the top of your head, any idea why that would happen?

          http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/828/
          http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/7589/

          Yonik Seeley added a comment -

          Do the shard split tests start up more than 3 cores per CoreContainer? If not, there should be no impact. If so, then the change in timing may have uncovered a different issue.

          Adrien Grand added a comment -

          4.5 release -> bulk close


            People

            • Assignee:
              Mark Miller
              Reporter:
              Yonik Seeley
            • Votes:
              0
              Watchers:
              7
