  Solr / SOLR-6133

More robust collection-delete



    Description

      If the Solr nodes are not "stable" (completely up and running etc.), a collection-delete request might result in a partly deleted collection. You might say that it is fair that you are not able to have a collection deleted if all of its shards are not actively running - even though I would like a mechanism that just deletes them when/if they ever come up again. But even when all shards claim to be actively running you can still end up with partly deleted collections - that is not acceptable IMHO. At least clusterstate should always reflect the actual state, so that you are able to detect that your collection-delete request was only partly carried out - which parts were successfully deleted and which were not (including information about data-folder deletion).

      The text above sounds like an epic-sized task, with potentially numerous problems to fix, so in order not to leave this ticket "open forever" I will point out a particular scenario where I see problems. When this problem is corrected we can close this ticket. Other tickets will have to deal with other collection-delete issues.
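
      To make the detection part concrete, here is a minimal SolrJ 4.x sketch (my own illustration, not part of the attached code) of how a client can currently check whether clusterstate still references a collection after a delete. The collection name "mycoll" and the embedded ZooKeeper address 192.168.78.239:9983 are taken from the scenario described below.

        import org.apache.solr.client.solrj.impl.CloudSolrServer;
        import org.apache.solr.common.cloud.ClusterState;

        public class CheckClusterstate {
          public static void main(String[] args) throws Exception {
            // Connect via the embedded ZooKeeper running on node1 (jetty port 8983 + 1000)
            CloudSolrServer server = new CloudSolrServer("192.168.78.239:9983");
            server.connect();
            try {
              ClusterState state = server.getZkStateReader().getClusterState();
              if (state.getCollections().contains("mycoll")) {
                // clusterstate still contains (parts of) the collection - the delete was incomplete
                System.out.println("mycoll still in clusterstate: " + state.getSlices("mycoll"));
              } else {
                System.out.println("mycoll is gone from clusterstate");
              }
            } finally {
              server.shutdown();
            }
          }
        }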

      Here is what I did and saw:

      • Logged into one of my Linux machines with IP 192.168.78.239
      • Prepared for Solr install
        mkdir -p /xXX/solr
        cd /xXX/solr
        
      • Downloaded solr-4.7.2.tgz
      • Installed Solr 4.7.2 and prepared for three "nodes"
        tar zxvf solr-4.7.2.tgz
        cd solr-4.7.2/
        cp -r example node1
        cp -r example node2
        cp -r example node3
        
      • Uploaded the Solr config to ZooKeeper (bootstrapped from node1)
        cd node1
        java -DzkRun -Dhost=192.168.78.239 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
        CTRL-C to stop solr (node1) again after it started completely
        
      • Started all three Solr nodes
        nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
        cd ../node2
        nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
        cd ../node3
        nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
        
      • Created a collection "mycoll"
        curl 'http://192.168.78.239:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=6&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconf'
        
      • Collected "Cloud Graph" image, clusterstate.json and info about data folders (see attached coll_delete_problem.zip | after_create_all_solrs_still_running). You will see that everything is as it is supposed to be. Two shards per node, six all in all, it is all reflected in clusterstate and there is a data-folder for each shard
      • Stopped all three Solr nodes
        kill $(ps -ef | grep 8985 | grep -v grep | awk '{print $2}')
        kill $(ps -ef | grep 8984 | grep -v grep | awk '{print $2}')
        kill $(ps -ef | grep 8983 | grep -v grep | awk '{print $2}')
        
      • Started Solr node1 only (wait for it to start completely)
        cd ../node1
        nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
        Wait for it to start fully - might take a minute or so
        
      • Collected "Cloud Graph" image, clusterstate.json and info about data folders (see attached coll_delete_problem.zip | after_create_solr1_restarted_solr2_and_3_not_started_yet). You will see that everything is as it is supposed to be. Two shards per node, six all in all, the four on node2 and node3 are down, it is all reflected in clusterstate and there is still a data-folder for each shard
      • Started CollDelete.java (see attached coll_delete_problem.zip) - it deletes collection "mycoll" as soon as all three Solrs are live and all shards are "active" (a sketch approximating what it does follows after this list)
      • Started the remaining two Solr nodes
        cd ../node2
        nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
        cd ../node3
        nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
        
      • After CollDelete.java finished, collected the "Cloud Graph" image, clusterstate.json, info about data folders and the output from CollDelete.java (see attached coll_delete_problem.zip | after_create_all_solrs_restarted_delete_coll_while_solr2_and_3_was_starting_up). You will see that not everything is as it is supposed to be. All info about "mycoll" has been deleted from clusterstate - ok. But data folders remain for node2 and node3 - not ok.
        • CollDelete output
          All 3 solrs live
          All (6) shards active. Now deleting
          {responseHeader={status=0,QTime=1823},failure={192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr},success={192.168.78.239:8983_solr={responseHeader={status=0,QTime=305}},192.168.78.239:8983_solr={responseHeader={status=0,QTime=260}}}}
          
      • Please note that subsequent attempts to delete collection "mycoll" will fail, because Solr claims that "mycoll" does not exist.
      • I stopped the Solrs again
      • Collected stdouterr files (see attached coll_delete_problem.zip)
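
      For reference, here is a rough approximation (against the SolrJ 4.x API) of what the attached CollDelete.java does, as referenced in the list above: wait until all three Solrs are live and every replica of "mycoll" reports "active", then send the Collections API DELETE. This is a sketch only - see coll_delete_problem.zip for the actual code.

        import org.apache.solr.client.solrj.impl.CloudSolrServer;
        import org.apache.solr.client.solrj.request.QueryRequest;
        import org.apache.solr.common.cloud.ClusterState;
        import org.apache.solr.common.cloud.Replica;
        import org.apache.solr.common.cloud.Slice;
        import org.apache.solr.common.cloud.ZkStateReader;
        import org.apache.solr.common.params.ModifiableSolrParams;

        public class CollDeleteSketch {
          public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("192.168.78.239:9983");
            server.connect();
            ZkStateReader zk = server.getZkStateReader();
            while (true) {
              // ZkStateReader keeps clusterstate updated via ZooKeeper watches, so polling it is enough
              ClusterState state = zk.getClusterState();
              boolean allLive = state.getLiveNodes().size() >= 3;
              boolean allActive = true;
              // assumes "mycoll" exists in clusterstate, which it does in this scenario
              for (Slice slice : state.getSlices("mycoll")) {
                for (Replica replica : slice.getReplicas()) {
                  if (!"active".equals(replica.getStr(ZkStateReader.STATE_PROP))) allActive = false;
                }
              }
              if (allLive && allActive) break;
              Thread.sleep(500);
            }
            System.out.println("All 3 solrs live, all shards active. Now deleting");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "DELETE");
            params.set("name", "mycoll");
            QueryRequest delete = new QueryRequest(params);
            delete.setPath("/admin/collections");
            System.out.println(server.request(delete)); // prints a response like the one shown above
            server.shutdown();
          }
        }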

      In this scenario you see that, because you send the delete-collection request while some Solrs have not completely started yet, you end up in a situation where it seems like the collection has been deleted, but data folders are still left on disk taking up disk space. The most significant thing is that this happens even though the client (sending the delete request) waits until all Solrs are live and all shards of the collection to be deleted claim to be active. What more can a careful client do?
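
      Note in the CollDelete output above that the overall responseHeader has status=0 even though the per-node "failure" section is non-empty. So one thing a client can do today is to dig the partial failures out of the delete response itself - a small helper sketch (my own, not part of the attachment):

        import org.apache.solr.common.util.NamedList;

        public class DeleteResponseCheck {
          /** Returns true if a Collections API DELETE response reports per-node failures. */
          public static boolean partlyFailed(NamedList<Object> response) {
            // status=0 in the responseHeader does not guarantee that every core was deleted;
            // per-node errors show up under the "failure" key, as in the output above
            Object failures = response.get("failure");
            if (failures != null) {
              System.err.println("Collection only partly deleted, failures: " + failures);
              return true;
            }
            return false;
          }
        }

      In the scenario above this would at least have flagged the refused connections to ports 8984 and 8985 - but it still does not tell you whether the data folders were cleaned up.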

      In this particular case, where you specifically wait for the Solrs to be live and the shards to be active, I think we should make sure that everything is deleted correctly (including folders).
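
      As an illustration of the folder part, this is roughly how the leftover data folders can be spotted on disk - assuming the default core naming used by the Collections API in 4.x (<collection>_shard<N>_replica<M>). The paths are the install dirs from the scenario above.

        import java.io.File;

        public class LeftoverDataDirs {
          public static void main(String[] args) {
            String[] solrHomes = {
                "/xXX/solr/solr-4.7.2/node1/solr",
                "/xXX/solr/solr-4.7.2/node2/solr",
                "/xXX/solr/solr-4.7.2/node3/solr"
            };
            for (String home : solrHomes) {
              File[] cores = new File(home).listFiles();
              if (cores == null) continue;
              for (File core : cores) {
                // a core instance dir of the deleted collection that still holds a data folder
                File data = new File(core, "data");
                if (core.getName().startsWith("mycoll_") && data.exists()) {
                  System.out.println("Leftover data folder: " + data.getAbsolutePath());
                }
              }
            }
          }
        }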

      I am not looking for a bullet-proof solution. I believe we can always come up with crazy scenarios where you end up with a half-deleted collection. But this particular scenario should work, I believe.

      Please note that I have seen other scenarios where only parts of the information in clusterstate are deleted (try removing the parts about waiting for active shards in CollDelete.java, so that you only wait for live Solrs) - they just seem to be harder to reproduce consistently. But the fact that such situations can also occur might help when designing the robust solution.

      Please also note that I tested this on 4.7.2 because it is the latest Java 6-enabled release and I only had Java 6 on my machine. One of my colleagues has tested 4.8.1 on a machine with Java 7 - no difference.

      Attachments

        1. coll_delete_problem.zip (173 kB) - Per Steffensen
        2. CollDelete.java (3 kB) - Per Steffensen


          People

            Assignee: Unassigned
            Reporter: Per Steffensen (steff1193)
