Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
7.4.1
-
None
-
None
Description
After a replica does a full recovery, the old index directory is deleted, however the snapshot metadata is not updated in Zookeeper. This means that the affected core will have snapshots which point to a non-existing index directory.
Steps to reproduce (I used Solr 7.4.0 to test this but it likely affects newer versions too):
(1) Create any collection in SolrCloud with more than 1 replica per shard.
The state.json of the testcollection2 I used:
state.json
(2) Start adding documents to the collection. After some documents are added, creating a snapshot. In this example I called it snap2.
INFO (qtp577405636-12140)-c:testcollection2----o.a.s.s.HttpSolrCall: [admin] webapp=null path=/admin/collections params={name=testcollection2&action=CREATESNAPSHOT&collection=testcollection2&commitName=snap2&wt=javabin&version=2} status=0 QTime=280
The snapshot is created successfully for both cores, the metadata in ZooKeeper looks like this:
snap2-before.txt
For core_node4 the index directory is /solr/testcollection2/core_node4/data/index.
(3) Shut down one of the Solr servers. Here I use the server hosting core_node4.
(4) Continue adding documents to the collection (add at least 100 documents to ensure that the replica which is shut down will go to a full replication recovery on the next start)
Start the server:
INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSyncWithLeader: PeerSync: core=testcollection2_shard1_replica_n2 url=https://snapshot-test-2.example.com:8985/solr Received 99 versions from https://snapshot-test-3.example.com:8985/solr/testcollection2_shard1_replica_n1/ INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSync: PeerSync: core=testcollection2_shard1_replica_n2 url=https://snapshot-test-2.example.com:8985/solr Our versions are too old. ourHighThreshold=1683104801277083650 otherLowThreshold=1683139494002294784 ourHighest=1683104801293860865 otherHighest=1683139494085132289 INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSyncWithLeader: PeerSync: core=testcollection2_shard1_replica_n2 url=https://snapshot-test-2.example.com:8985/solr DONE. sync failed INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.c.RecoveryStrategy: PeerSync Recovery was not successful - trying replication. INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.c.RecoveryStrategy: Starting Replication Recovery.
(5) After the replication is finished, the index.properties points to the new directory index.20201112075340480
hdfs dfs -cat /solr/testcollection2/core_node4/data/index.properties #index.properties #Thu Nov 12 07:58:52 GMT+00:00 2020 index=index.20201112075340480
And the original index directory for core_node4 has been deleted:
hdfs dfs -du -h /solr/testcollection2/core_node4/data 0 0 /solr/testcollection2/core_node4/data/index 4.5 G 8.9 G /solr/testcollection2/core_node4/data/index.20201112075340480 84 168 /solr/testcollection2/core_node4/data/index.properties 215 430 /solr/testcollection2/core_node4/data/replication.properties 401 802 /solr/testcollection2/core_node4/data/snapshot_metadata 9.2 M 18.5 M /solr/testcollection2/core_node4/data/tlog
The snapshot metadata in Zookeeper is exactly the same as in step (2), so the snap2 still points to the index directory /solr/testcollection2/core_node4/data/index which is empty by this time.
(6) Try deleting the snapshot snap2.
In Zookeeper, the snapshot metadata will change:
- the snapshot status goes to failed
- the snapshot metadata for the core_node3 (the legit core) is deleted
- the snapshot metadata for the core_node4 (which is in the stale status) will remain there
The solr server log will contain the following error message:
INFO (qtp577405636-1029)----x:testcollection2_shard1_replica_n2-o.a.s.s.HttpSolrCall: [admin] webapp=null path=/admin/cores params={core=testcollection2_shard1_replica_n2&qt=/admin/cores&name=shard1&action=DELETESNAPSHOT&comm itName=snap2&wt=javabin&version=2} status=500 QTime=60 ERROR (qtp577405636-1029)----x:testcollection2_shard1_replica_n2-o.a.s.s.HttpSolrCall: null:org.apache.solr.common.SolrException: Error handling 'DELETESNAPSHOT' action at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:365) at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199) at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:799) at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:780) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:504) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:376) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:322) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:513) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:539) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:259) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://snapshot-test-1.example.com:8020/solr/testcollection2/core_node4/data/index lockFactory=org.apache.lucene.store.NoLockFactory@b06d116); maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: [] at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:833) at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:110) at org.apache.solr.core.snapshots.SolrSnapshotManager.deleteSnapshotIndexFiles(SolrSnapshotManager.java:287) at org.apache.solr.core.snapshots.SolrSnapshotManager.deleteSnapshotIndexFiles(SolrSnapshotManager.java:225) at org.apache.solr.core.SolrCore.deleteNamedSnapshot(SolrCore.java:542) at org.apache.solr.handler.admin.DeleteSnapshotOp.execute(DeleteSnapshotOp.java:42) at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360) ... 38 more
(7) After trying the DELETESNAPSHOT command again, it succeeds without issues, the metadata for snap2 is entirely removed from ZK.