SOLR-12066: Clean up deleted cores when a node starts


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.3.1
    • Component/s: AutoScaling, SolrCloud
    • Labels:
      None

      Description

      Initially, when SOLR-12047 was created, it looked like waiting only 3 seconds for a state in ZK was the culprit for cores not loading.

      But it turns out to be something else. Here are the steps to reproduce the problem:

      • Create a 3-node cluster.
      • Create a 1-shard, 2-replica collection that uses node1 and node2 ( http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true )
      • Stop node2: ./bin/solr stop -p 7574
      • Solr creates a new replica on node3 after 30 seconds because of the ".auto_add_replicas" trigger.
      • At this point state.json lists the replicas as being on node1 and node3.
      • Start node2. Bam!
        java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
        ...
        Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
        at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
        ...
        Caused by: org.apache.solr.common.SolrException: 
        at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
        at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
        ...
        Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
        ...
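
To make the last exception concrete, here is a minimal Python sketch of the kind of lookup that fails. It is an illustration only, not Solr's actual code: the function name `replica_exists` is hypothetical, and apart from `test_node_lost`, `shard1` and `core_node4`, the replica names and the node3 address in the sample state are made up.

```python
import json

# Hypothetical, trimmed state.json content after the replica moved to node3.
# Only the collection name, shard name and core_node4 come from the report;
# the other replica names and the node3 address are invented for illustration.
STATE_JSON = """
{"test_node_lost": {"shards": {"shard1": {"replicas": {
  "core_node3": {"core": "test_node_lost_shard1_replica_n1",
                 "node_name": "127.0.0.1:8983_solr"},
  "core_node6": {"core": "test_node_lost_shard1_replica_n5",
                 "node_name": "127.0.0.1:8984_solr"}
}}}}}
"""


def replica_exists(state, collection, shard, core_node_name):
    """Return True if core_node_name is listed under the shard in state."""
    replicas = (
        state.get(collection, {})
        .get("shards", {})
        .get(shard, {})
        .get("replicas", {})
    )
    return core_node_name in replicas


state = json.loads(STATE_JSON)

# node2 still has a core on disk that claims to be core_node4,
# but the cluster state no longer lists that replica.
node2_core_is_known = replica_exists(state, "test_node_lost", "shard1", "core_node4")
print(node2_core_is_known)
```

When the lookup returns `False` for a core that is still on disk, the node is in the situation shown in the stack trace above.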

       

      The practical effect of this is small, since the move-replica operation has already placed the replica on another JVM. But to the user it is very confusing what is happening. They can never get rid of this error unless they manually clean up the data directory on node2 and restart.
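
A manual cleanup like this could start with a dry-run listing of leftover core directories. The Python sketch below is a rough illustration under stated assumptions: each core directory sits directly under the Solr home and contains a `core.properties` file with a `coreNodeName` entry, and the set of live replica names has been read from state.json by some other means. The function names are hypothetical, and the code only lists directories; it never deletes anything.

```python
from pathlib import Path


def read_core_properties(path):
    """Parse a simple key=value properties file into a dict."""
    props = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def find_stale_core_dirs(solr_home, live_core_node_names):
    """List core directories whose coreNodeName is not in the live set.

    Assumption: core directories are direct children of solr_home.
    """
    stale = []
    for props_file in sorted(Path(solr_home).glob("*/core.properties")):
        name = read_core_properties(props_file).get("coreNodeName")
        if name is not None and name not in live_core_node_names:
            stale.append(props_file.parent)
    return stale


# Example call (the path and replica names are placeholders):
# for core_dir in find_stale_core_dirs("/path/to/solr/home", {"core_node3"}):
#     print("stale core directory:", core_dir)
```

Reviewing the listed directories before removing anything by hand keeps the step reversible.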

       

      Please note: I chose autoAddReplicas=true to reproduce this, but a user could be using a node lost trigger and run into the same issue.

        Attachments

        1. SOLR-12066.patch
          6 kB
          Cao Manh Dat
        2. SOLR-12066.patch
          7 kB
          Cao Manh Dat

              People

              • Assignee: Cao Manh Dat (caomanhdat)
              • Reporter: Varun Thacker (varun)