Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-15003

SolrCloud Snapshot metadata inconsistent after core replication

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 7.4.1
    • None
    • SolrCloud
    • None

    Description

      After a replica does a full recovery, the old index directory is deleted, however the snapshot metadata is not updated in Zookeeper. This means that the affected core will have snapshots which point to a non-existing index directory.

      Steps to reproduce (I used Solr 7.4.0 to test this but it likely affects newer versions too):

      (1) Create any collection in SolrCloud with more than 1 replica per shard.
      The state.json of the testcollection2 I used:
      state.json

      (2) Start adding documents to the collection. After some documents are added, creating a snapshot. In this example I called it snap2.

      INFO (qtp577405636-12140)-c:testcollection2----o.a.s.s.HttpSolrCall: [admin] webapp=null path=/admin/collections params={name=testcollection2&action=CREATESNAPSHOT&collection=testcollection2&commitName=snap2&wt=javabin&version=2} status=0 QTime=280 
      

      The snapshot is created successfully for both cores, the metadata in ZooKeeper looks like this:
      snap2-before.txt

      For core_node4 the index directory is /solr/testcollection2/core_node4/data/index.

      (3) Shut down one of the Solr servers. Here I use the server hosting core_node4.

      (4) Continue adding documents to the collection (add at least 100 documents to ensure that the replica which is shut down will go to a full replication recovery on the next start)

      Start the server:

      INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
      e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSyncWithLeader: PeerSync: core=testcollection2_shard1_replica_n2 url=https://snapshot-test-2.example.com:8985/solr  Received 99 versions from https://snapshot-test-3.example.com:8985/solr/testcollection2_shard1_replica_n1/
      INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
      e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSync: PeerSync: core=testcollection2_shard1_replica_n2 url=https://snapshot-test-2.example.com:8985/solr  Our versions are too old. ourHighThreshold=1683104801277083650 otherLowThreshold=1683139494002294784 ourHighest=1683104801293860865 otherHighest=1683139494085132289
      INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
      e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.u.PeerSyncWithLeader: PeerSync: core=testcollection2_shard1_replica_n2 url=https://snapshot-test-2.example.com:8985/solr DONE. sync failed
      INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
      e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.c.RecoveryStrategy: PeerSync Recovery was not successful - trying replication.
      INFO (recoveryExecutor-4-thread-1-processing-n:snapshot-test-2.example.com:8985_solr x:testcollection2_shard1_replica_n2 s:shard1 c:testcollection2 r:core_nod
      e4)-c:testcollection2-s:shard1-r:core_node4-x:testcollection2_shard1_replica_n2-o.a.s.c.RecoveryStrategy: Starting Replication Recovery.
      

      (5) After the replication is finished, the index.properties points to the new directory index.20201112075340480

      hdfs dfs -cat  /solr/testcollection2/core_node4/data/index.properties
      #index.properties
      #Thu Nov 12 07:58:52 GMT+00:00 2020
      index=index.20201112075340480
      

      And the original index directory for core_node4 has been deleted:

      hdfs dfs -du -h  /solr/testcollection2/core_node4/data
      0      0       /solr/testcollection2/core_node4/data/index
      4.5 G  8.9 G   /solr/testcollection2/core_node4/data/index.20201112075340480
      84     168     /solr/testcollection2/core_node4/data/index.properties
      215    430     /solr/testcollection2/core_node4/data/replication.properties
      401    802     /solr/testcollection2/core_node4/data/snapshot_metadata
      9.2 M  18.5 M  /solr/testcollection2/core_node4/data/tlog
      

      The snapshot metadata in Zookeeper is exactly the same as in step (2), so the snap2 still points to the index directory /solr/testcollection2/core_node4/data/index which is empty by this time.

      (6) Try deleting the snapshot snap2.

      In Zookeeper, the snapshot metadata will change:

      • the snapshot status goes to failed
      • the snapshot metadata for the core_node3 (the legit core) is deleted
      • the snapshot metadata for the core_node4 (which is in the stale status) will remain there

      The solr server log will contain the following error message:

      INFO (qtp577405636-1029)----x:testcollection2_shard1_replica_n2-o.a.s.s.HttpSolrCall: [admin] webapp=null path=/admin/cores params={core=testcollection2_shard1_replica_n2&qt=/admin/cores&name=shard1&action=DELETESNAPSHOT&comm
      itName=snap2&wt=javabin&version=2} status=500 QTime=60
      ERROR (qtp577405636-1029)----x:testcollection2_shard1_replica_n2-o.a.s.s.HttpSolrCall: null:org.apache.solr.common.SolrException: Error handling 'DELETESNAPSHOT' action
              at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:365)
              at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396)
              at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)
              at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
              at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:799)
              at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:780)
              at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:504)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:376)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:322)
              at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
              at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
              at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
              at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
              at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
              at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:513)
              at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
              at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
              at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
              at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
              at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
              at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
              at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
              at org.eclipse.jetty.server.Server.handle(Server.java:539)
              at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
              at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
              at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
              at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
              at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:259)
              at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
              at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
              at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(NRTCachingDirectory(BlockDirectory(HdfsDirectory@hdfs://snapshot-test-1.example.com:8020/solr/testcollection2/core_node4/data/index lockFactory=org.apache.lucene.store.NoLockFactory@b06d116); maxCacheMB=192.0 maxMergeSizeMB=16.0)): files: []
              at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:833)
              at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:110)
              at org.apache.solr.core.snapshots.SolrSnapshotManager.deleteSnapshotIndexFiles(SolrSnapshotManager.java:287)
              at org.apache.solr.core.snapshots.SolrSnapshotManager.deleteSnapshotIndexFiles(SolrSnapshotManager.java:225)
              at org.apache.solr.core.SolrCore.deleteNamedSnapshot(SolrCore.java:542)
              at org.apache.solr.handler.admin.DeleteSnapshotOp.execute(DeleteSnapshotOp.java:42)
              at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
              ... 38 more
      

      (7) After trying the DELETESNAPSHOT command again, it succeeds without issues, the metadata for snap2 is entirely removed from ZK.

      Attachments

        1. snap2-failed.png
          49 kB
          Istvan Farkas
        2. snap2-before.txt
          16 kB
          Istvan Farkas
        3. state.json
          1 kB
          Istvan Farkas

        Activity

          People

            Unassigned Unassigned
            warper Istvan Farkas
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: