Solr / SOLR-12066

Cleanup deleted core when node start


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.3.1
    • Component/s: AutoScaling, SolrCloud
    • Labels: None

    Description

      Initially, when SOLR-12047 was created, it looked like waiting only 3 seconds for a state in ZK was the culprit for cores not loading up.

       

      But it turned out to be something else. Here are the steps to reproduce this problem:

       

      • Create a 3-node cluster.
      • Create a 1 shard x 2 replica collection on node1 and node2 ( http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true ); a SolrJ sketch of these first two steps follows the list.
      • Stop node2: ./bin/solr stop -p 7574
      • Solr will create a new replica on node3 after 30 seconds because of the ".auto_add_replicas" trigger.
      • At this point state.json has info about the replicas being on node1 and node3.
      • Start node2. Bam!
        java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
        ...
        Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
        at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
        ...
        Caused by: org.apache.solr.common.SolrException: 
        at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
        at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
        ...
        Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
        ...
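
      For completeness, here is a minimal SolrJ sketch of the first two steps above (cluster assumed to be already running, collection created with autoAddReplicas=true). The ZK address "localhost:9983" and the "_default" configset are assumptions for illustration; they are not part of the original report.

        import java.util.Collections;
        import java.util.Optional;

        import org.apache.solr.client.solrj.impl.CloudSolrClient;
        import org.apache.solr.client.solrj.request.CollectionAdminRequest;

        public class CreateTestNodeLost {
          public static void main(String[] args) throws Exception {
            // Assumed embedded-ZK address of the 3-node test cluster; adjust to your setup
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
              // Same request as the Collections API URL above:
              // 1 shard x 2 NRT replicas, autoAddReplicas=true ("_default" configset is an assumption)
              CollectionAdminRequest.createCollection("test_node_lost", "_default", 1, 2)
                  .setAutoAddReplicas(true)
                  .process(client);
            }
            // Then stop node2 (./bin/solr stop -p 7574), wait ~30s for the .auto_add_replicas
            // trigger to add a replica on node3, and start node2 again to hit the error above.
          }
        }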

       

      The practical effect of this is not big, since the MoveReplica operation has already put the replica on another JVM. But to the user it is super confusing what is happening: they can never get rid of this error unless they manually clean up the data directory on node2 and restart.

       

      Please note: I chose autoAddReplicas=true to reproduce this, but a user could be using a node lost trigger and run into the same issue.
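
      For reference, below is a rough sketch of the cleanup idea in the summary: when a node starts and one of its local cores points at a coreNodeName that no longer exists in the collection's state.json (because the replica was re-created elsewhere while the node was down), drop the leftover core instead of failing in preRegister. This is an illustrative sketch against public Solr classes, not the committed patch; the helper name cleanupIfStale and its call site are assumptions.

        import org.apache.solr.common.cloud.ClusterState;
        import org.apache.solr.common.cloud.DocCollection;
        import org.apache.solr.common.cloud.Replica;
        import org.apache.solr.common.cloud.Slice;
        import org.apache.solr.core.CoreContainer;
        import org.apache.solr.core.CoreDescriptor;

        /** Illustrative sketch only: drop a core whose replica vanished from state.json. */
        final class StaleCoreCleanup {

          /** Hypothetical helper; imagined to run for each local core during node startup. */
          static void cleanupIfStale(CoreContainer cc, ClusterState clusterState, CoreDescriptor cd) {
            String collection   = cd.getCloudDescriptor().getCollectionName();
            String shardId      = cd.getCloudDescriptor().getShardId();
            String coreNodeName = cd.getCloudDescriptor().getCoreNodeName();

            DocCollection coll = clusterState.getCollectionOrNull(collection);
            Slice slice        = (coll == null)  ? null : coll.getSlice(shardId);
            Replica replica    = (slice == null) ? null : slice.getReplica(coreNodeName);

            if (replica == null) {
              // The replica was re-created elsewhere (e.g. by the .auto_add_replicas trigger)
              // while this node was down, so remove the local leftover instead of failing.
              cc.unload(cd.getName(), true /* deleteIndexDir */, true /* deleteDataDir */,
                  true /* deleteInstanceDir */);
            }
          }
        }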

      Attachments

        Issue Links

        Activity


          People

            Assignee: Cao Manh Dat
            Reporter: Varun Thacker
            Votes: 1
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
