Description
While testing SOLR-11458, ab ran into an interesting failure that resulted in different document counts between the leader and the replica. The test is MoveReplicaHDFSTest on the jira/solr-11458-2 branch.
The failure is rare but reproducible on beasting:
reproduce with: ant test -Dtestcase=MoveReplicaHDFSTest -Dtests.method=testNormalFailedMove -Dtests.seed=161856CB543CD71C -Dtests.slow=true -Dtests.locale=ar-SA -Dtests.timezone=US/Michigan -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

[junit4] FAILURE 14.2s | MoveReplicaHDFSTest.testNormalFailedMove <<<
[junit4]    > Throwable #1: java.lang.AssertionError: expected:<100> but was:<56>
[junit4]    > 	at __randomizedtesting.SeedInfo.seed([161856CB543CD71C:31134983787E4905]:0)
[junit4]    > 	at org.apache.solr.cloud.MoveReplicaTest.testFailedMove(MoveReplicaTest.java:305)
[junit4]    > 	at org.apache.solr.cloud.MoveReplicaHDFSTest.testNormalFailedMove(MoveReplicaHDFSTest.java:69)
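To beast the single failing method, something like the build's beast target can be used (the iteration count here is arbitrary, for illustration only):

ant beast -Dbeast.iters=50 -Dtestcase=MoveReplicaHDFSTest -Dtests.method=testNormalFailedMove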
The root problem is that when the old replica is not live during deletion of a collection, the corresponding HDFS data of that replica is not removed. As a result, when a new collection is created with the same name as the deleted one, the new replicas reuse the old HDFS data. This leads to many problems in leader election and recovery.
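For illustration only, below is a minimal sketch of the kind of cleanup that is missing, written against the Hadoop FileSystem API. The directory layout (hdfsHome/collection/coreNodeName/data), the class name, and the method signature are assumptions made for this sketch; the real layout is governed by the HdfsDirectoryFactory configuration, and an actual fix would have to hook into the collection-deletion path so the data of down replicas is removed as well.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleHdfsDataCleanup {

  // Deletes a replica's HDFS data directory even if the hosting node is down,
  // so a future collection created with the same name cannot pick up stale data.
  public static void deleteReplicaData(Configuration conf, String hdfsHome,
      String collection, String coreNodeName) throws IOException {
    // Assumed layout: <hdfsHome>/<collection>/<coreNodeName>/data
    Path dataDir = new Path(hdfsHome, collection + "/" + coreNodeName + "/data");
    // newInstance avoids closing a cached FileSystem shared with other callers.
    try (FileSystem fs = FileSystem.newInstance(dataDir.toUri(), conf)) {
      if (fs.exists(dataDir)) {
        // Recursive delete: remove index segments and tlog files together.
        fs.delete(dataDir, true);
      }
    }
  }
}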
Attachments
Issue Links
- blocks: SOLR-11458 Bugs in MoveReplicaCmd handling of failures (Closed)
- is related to: SOLR-9566 Can we avoid doing recovery when collections are first created? (Closed)