Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.7.2, 4.8.1
Fix Version/s: None
Description
If the Solr nodes are not "stable" (completely up and running, etc.), a collection-delete request might result in a partly deleted collection. You might say it is fair that a collection cannot be deleted while some of its shards are not actively running, although I would like a mechanism that deletes them when/if they come up again. But even when all shards claim to be actively running, you can still end up with a partly deleted collection, and that is not acceptable IMHO. At the very least, clusterstate should always reflect the actual state, so that you can detect that your collection-delete request was only partly carried out, and see which parts were successfully deleted and which were not (including information about data-folder deletion).
The text above sounds like an epic-sized task with potentially numerous problems to fix, so in order not to make this ticket "open forever" I will point out one particular scenario where I see problems. When this problem is corrected we can close this ticket. Other tickets will have to deal with other collection-delete issues.
Here is what I did and saw:
- Logged into one of my Linux machines with IP 192.168.78.239
- Prepared for Solr install
    mkdir -p /xXX/solr
    cd /xXX/solr
- Downloaded solr-4.7.2.tgz
- Installed Solr 4.7.2 and prepared for three "nodes"
    tar zxvf solr-4.7.2.tgz
    cd solr-4.7.2/
    cp -r example node1
    cp -r example node2
    cp -r example node3
- Initialized Solr config into Solr
    cd node1
    java -DzkRun -Dhost=192.168.78.239 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
  Press CTRL-C to stop Solr (node1) again after it has started completely.
- Started all three Solr nodes
    nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
    cd ../node2
    nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
    cd ../node3
    nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
- Created a collection "mycoll"
    curl 'http://192.168.78.239:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=6&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconf'
- Collected "Cloud Graph" image, clusterstate.json and info about data folders (see attached coll_delete_problem.zip | after_create_all_solrs_still_running). You will see that everything is as it is supposed to be. Two shards per node, six all in all, it is all reflected in clusterstate and there is a data-folder for each shard
- Stopped all three Solr nodes
    kill $(ps -ef | grep 8985 | grep -v grep | awk '{print $2}')
    kill $(ps -ef | grep 8984 | grep -v grep | awk '{print $2}')
    kill $(ps -ef | grep 8983 | grep -v grep | awk '{print $2}')
- Started Solr node1 only (wait for it to start completely)
    cd ../node1
    nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
  Wait for it to start fully; this might take a minute or so.
- Collected "Cloud Graph" image, clusterstate.json and info about data folders (see attached coll_delete_problem.zip | after_create_solr1_restarted_solr2_and_3_not_started_yet). You will see that everything is as it is supposed to be. Two shards per node, six all in all, the four on node2 and node3 are down, it is all reflected in clusterstate and there is still a data-folder for each shard
- Started CollDelete.java (see attached coll_delete_problem.zip) - will delete collection "mycoll" when all three Solrs are live and all shards are "active"
- Started the remaining two Solr nodes
    cd ../node2
    nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
    cd ../node3
    nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
- After CollDelete.java finished, collected "Cloud Graph" image, clusterstate.json, info about data folders and output from CollDelete.java (see attached coll_delete_problem.zip | after_create_all_solrs_restarted_delete_coll_while_solr2_and_3_was_starting_up). You will see that not everything is as it is supposed to be. All info about "mycoll" was deleted from clusterstate - ok. But data-folders remain for node2 and node3 - not ok.
- CollDelete output
    All 3 solrs live
    All (6) shards active. Now deleting
    {responseHeader={status=0,QTime=1823},failure={192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr},success={192.168.78.239:8983_solr={responseHeader={status=0,QTime=305}},192.168.78.239:8983_solr={responseHeader={status=0,QTime=260}}}}
- Please note that subsequent attempts to collection-delete "mycoll" will fail, because Solr claims that "mycoll" does not exist.
- I stopped the Solrs again
- Collected stdouterr files (see attached coll_delete_problem.zip)
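The "all shards active" precondition used in the steps above can be approximated from a saved copy of clusterstate.json. This is only a minimal sketch and not the attached CollDelete.java: the function name `all_replicas_active` is hypothetical, and it assumes that non-active replicas appear in the file as `"state":"down"`, `"state":"recovering"` or `"state":"recovery_failed"`.

```shell
# Hypothetical helper: succeed only if no replica in the given
# clusterstate.json copy is reported in a non-active state.
# Assumption: replica states are written as "state":"<value>".
all_replicas_active() {
  state_file=$1
  # Treat an unreadable file as an error rather than as "all active".
  [ -r "$state_file" ] || return 2
  if grep -Eq '"state" *: *"(down|recovering|recovery_failed)"' "$state_file"; then
    return 1
  fi
  return 0
}
```

As the scenario shows, passing such a check says nothing about whether every node already accepts HTTP requests, which is exactly the gap this ticket is about.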
In this scenario you see that because the delete-collection request is sent while some Solrs have not completely started yet, you end up in a situation where it seems like the collection has been deleted, but data-folders are still left on disk, taking up disk space. The most significant thing is that this happens even though the client (sending the delete-request) waits until all Solrs are live and all shards of the collection to be deleted claim to be active. What more can a careful client do?
In this particular case, where you specifically wait for Solrs to be live and shards to be active, I think we should make sure that everything is deleted (including folders) correctly.
I am not looking for a bullet-proof solution. I believe we can always come up with crazy scenarios where you end up with a half-deleted collection. But this particular scenario should work, I believe.
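To check for the leftover data-folders described here, something like the following can be run after the delete request. It is a rough sketch under assumptions: the function name `leftover_core_dirs` is hypothetical, the nodes are laid out as in the steps above (`nodeN/solr` under the unpacked solr-4.7.2 directory), and the core directories are assumed to be named `<collection>_shard<N>_replica<M>`.

```shell
# Hypothetical helper: list core directories that still exist on disk
# for a collection, looking at <base_dir>/<node>/solr/<core_dir>.
# Assumption: core dirs are named <collection>_shard<N>_replica<M>.
leftover_core_dirs() {
  base_dir=$1
  collection=$2
  find "$base_dir" -mindepth 3 -maxdepth 3 -type d \
    -path "*/solr/${collection}_shard*_replica*" | sort
}
```

For the setup in this ticket the call would be `leftover_core_dirs /xXX/solr/solr-4.7.2 mycoll`; any path it prints is a core directory that survived the collection-delete.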
Please note that I have seen other scenarios where only parts of the information in clusterstate are deleted (try removing the parts about waiting for active shards in CollDelete.java, so that you are only waiting for live Solrs); they just seem to be harder to reproduce consistently. But the fact that such situations can also occur might help when designing a robust solution.
Please also note that I tested this on 4.7.2 because it is the latest java6-enabled release and I only had java6 on my machine. One of my colleagues has tested 4.8.1 on a machine with java7 - no difference.