I had a discussion with Varun about this issue. We have two problems here:
1. Solr corrupts the index during replication recovery.
2. Such a corrupt index puts Solr into an infinite recovery loop.
For #1 the problem is clear: we have open searchers on uncommitted/flushed files, and those files get mixed with files from the leader, causing corruption.
Possible solutions for #1 are either (a) switch to a different index dir, move/copy files from committed segments into it, and use the index.properties approach to open a searcher on the new index dir, or (b) close the searcher, roll back the writer, and then download the necessary files.
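To make option (a) concrete, here is a minimal sketch of the pointer-file idea: a small properties file in the data dir names the directory that should be opened next. The class and method names are hypothetical, the `index` key and the default dir name `index` are assumptions about the file layout, and this is plain JDK code rather than Solr's actual implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical helper, not a Solr class.
class IndexPointerDemo {

    /** Records newIndexDirName in dataDir/index.properties via a temp file plus rename. */
    static void pointToNewIndexDir(Path dataDir, String newIndexDirName) throws IOException {
        Properties props = new Properties();
        props.setProperty("index", newIndexDirName); // key name is an assumption
        Path tmp = Files.createTempFile(dataDir, "index", ".properties.tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, "index pointer");
        }
        // Write fully first, then swap, so a reader never sees a half-written file.
        Files.move(tmp, dataDir.resolve("index.properties"), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns the dir named by index.properties, or "index" when no pointer file exists. */
    static String readIndexDirName(Path dataDir) throws IOException {
        Path file = dataDir.resolve("index.properties");
        if (!Files.exists(file)) {
            return "index";
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        return props.getProperty("index", "index");
    }

    public static void main(String[] args) throws IOException {
        Path dataDir = Files.createTempDirectory("data");
        pointToNewIndexDir(dataDir, "index.recovered");
        System.out.println(readIndexDirName(dataDir));
    }
}
```

The point of the indirection is that the old directory is never modified in place, so open searchers on it keep seeing a consistent set of files until they are released.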
Closing the searcher is not as simple as it sounds, because the searcher is ref-counted and close() doesn't actually close it immediately. Also, a request might open a new searcher at any time, so it is a very involved change.
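The deferred close can be illustrated with a toy reference counter. This is only a sketch with hypothetical names (`RefCountedResource`, `incref`, `decref`), not Solr's actual searcher holder; it shows why the owner dropping its reference does not release the files while a request still holds one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a ref-counted searcher holder.
class RefCountedDemo {

    static final class RefCountedResource {
        // Starts at 1: the owner's own reference.
        private final AtomicInteger refCount = new AtomicInteger(1);
        private volatile boolean released = false;

        /** A request takes a reference before using the resource. */
        void incref() {
            refCount.incrementAndGet();
        }

        /** Drops one reference; the resource is released only when the count hits zero. */
        void decref() {
            if (refCount.decrementAndGet() == 0) {
                released = true; // the underlying files would be released here
            }
        }

        boolean isReleased() {
            return released;
        }
    }

    public static void main(String[] args) {
        RefCountedResource searcher = new RefCountedResource();
        searcher.incref(); // an in-flight request holds the searcher
        searcher.decref(); // the owner "closes" it
        System.out.println("released after owner close: " + searcher.isReleased());
        searcher.decref(); // the request finishes
        System.out.println("released after request ends: " + searcher.isReleased());
    }
}
```

So "close the searcher" really means "stop handing out new references and wait for the count to drain", which is what makes option (b) invasive.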
For #2, everywhere we open a reader/searcher or writer, we should be ready to handle corrupt index exceptions.
I think we should first try to solve the problem of corrupting the index. So let's try the deletion approach that Varun outlined. If that fails, then we should switch to a new index dir: move/copy over files from commit points, fetch the missing segments from the leader, and use the index.properties approach to completely move to a new index directory.
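As a rough sketch of the fallback, the split between "copy from the local commit point" and "fetch from the leader" is a set intersection and a set difference over file names. The names below are hypothetical, and matching by name alone is a simplifying assumption; a real check would presumably also compare sizes or checksums before reusing a local file.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical planning helper, not a Solr class.
class RecoveryPlanDemo {

    /** Files in both the local commit point and the leader's commit: candidates for a local copy. */
    static Set<String> reusable(Set<String> localCommitted, Set<String> leaderFiles) {
        Set<String> out = new TreeSet<>(leaderFiles);
        out.retainAll(localCommitted);
        return out;
    }

    /** Files the leader has that the local commit point lacks: must be downloaded. */
    static Set<String> toFetch(Set<String> localCommitted, Set<String> leaderFiles) {
        Set<String> out = new TreeSet<>(leaderFiles);
        out.removeAll(localCommitted);
        return out;
    }

    public static void main(String[] args) {
        Set<String> local = new TreeSet<>(Arrays.asList("_0.cfs", "_1.cfs"));
        Set<String> leader = new TreeSet<>(Arrays.asList("_1.cfs", "_2.cfs", "segments_3"));
        System.out.println("copy locally: " + reusable(local, leader));
        System.out.println("fetch from leader: " + toFetch(local, leader));
    }
}
```

Only files from local commit points go into the first set, so uncommitted files that an open searcher may still reference never reach the new directory.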
The second problem that we need to solve is that a corrupted index trashes the server. We should be able to recover from such a scenario instead of going into an infinite recovery loop.
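One possible shape for this, sketched with hypothetical names: try the current index dir once, and on a corruption error switch to a fresh dir instead of retrying the same files. `CorruptIndexSignal` is a stand-in defined here (Lucene has its own `CorruptIndexException`), and the fresh dir is assumed to be filled by a full copy from the leader before it is opened.

```java
import java.io.IOException;

// Hypothetical sketch, not Solr code.
class CorruptionFallbackDemo {

    /** Stand-in for a corruption error raised while opening an index. */
    static class CorruptIndexSignal extends IOException {
        CorruptIndexSignal(String msg) {
            super(msg);
        }
    }

    /** Abstracts "open a reader/searcher or writer on this dir". */
    interface IndexOpener {
        void open(String indexDir) throws IOException;
    }

    /**
     * Opens currentDir; if it is corrupt, opens freshDir instead.
     * Returns the dir that opened successfully. A failure on freshDir
     * propagates to the caller, so there is no unbounded retry here.
     */
    static String openWithFallback(IndexOpener opener, String currentDir, String freshDir)
            throws IOException {
        try {
            opener.open(currentDir);
            return currentDir;
        } catch (CorruptIndexSignal e) {
            opener.open(freshDir);
            return freshDir;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulates a corrupt "index" dir and a healthy replacement.
        IndexOpener opener = dir -> {
            if (dir.equals("index")) {
                throw new CorruptIndexSignal("bad segments file in " + dir);
            }
        };
        System.out.println(openWithFallback(opener, "index", "index.fresh"));
    }
}
```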
Let's fix these two problems (in that order) and then figure out ways to optimize recovery.
Longer term, we need to change our code so that we can close the searchers, roll back the writer, delete uncommitted files, and then attempt replication recovery.
Also, my earlier comment on non-cloud Solr was wrong:
"In SolrCloud we could just close the searcher before rollback because a replica in recovery won't get any search requests but that's not practical in standalone Solr because it'd cause downtime."
In standalone Solr this is not a problem, because indexing and soft commits do not happen on slaves. In any case, changing the code to close the searcher etc. is a big change.