SOLR-10398: Multiple LIR requests can fail PeerSync even if it succeeds


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.3
    • Component/s: None
    • Labels: None

    Description

      I've seen a scenario where multiple LIR (leader-initiated recovery) requests happen around the same time.
      In this case, even though PeerSync succeeded, recovery ended up failing and triggered a full index fetch.

      Sequence of events:
      T1: The leader puts the replica into LIR and sets the replica's LIRState to DOWN.
      T2: The replica begins PeerSync and its LIRState changes.
      T3: The leader puts the replica into LIR again and the replica's LIRState is set back to DOWN.
      T4: The PeerSync triggered by the LIR at T1 succeeds, but the replica then examines its own LIRState, which is now DOWN, and fails, triggering a full replication.
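
      To make the race easier to see, here is a minimal, self-contained sketch of the interleaving described above. It is not Solr code: the class name, the AtomicReference standing in for the replica's LIR state node, and the latches used to force the T1-T4 ordering are all hypothetical.

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.atomic.AtomicReference;

      // Illustrative sketch of the T1-T4 race; all names are hypothetical.
      public class LirRaceSketch {

          // Stands in for the replica's leader-initiated-recovery state node.
          static final AtomicReference<String> lirState = new AtomicReference<>("ACTIVE");

          public static void main(String[] args) throws InterruptedException {
              CountDownLatch firstLirApplied = new CountDownLatch(1);
              CountDownLatch peerSyncStarted = new CountDownLatch(1);
              CountDownLatch secondLirApplied = new CountDownLatch(1);

              // Leader side: issues the two LIR requests (T1 and T3).
              Thread leader = new Thread(() -> {
                  lirState.set("DOWN");              // T1: first LIR marks the replica DOWN
                  firstLirApplied.countDown();
                  await(peerSyncStarted);            // wait until the replica is mid-PeerSync
                  lirState.set("DOWN");              // T3: second LIR marks the replica DOWN again
                  secondLirApplied.countDown();
              });

              // Replica side: recovery thread running PeerSync (T2 and T4).
              Thread recovery = new Thread(() -> {
                  await(firstLirApplied);
                  lirState.set("RECOVERING");        // T2: replica starts recovering
                  peerSyncStarted.countDown();
                  await(secondLirApplied);           // PeerSync is in flight while T3 happens
                  boolean peerSyncSucceeded = true;  // T4: PeerSync itself succeeded...
                  if (peerSyncSucceeded && "DOWN".equals(lirState.get())) {
                      // ...but the post-sync check sees the DOWN written by the second LIR.
                      System.out.println("PeerSync succeeded, yet LIRState is DOWN -> full replication");
                  } else {
                      System.out.println("PeerSync succeeded -> registering as active");
                  }
              });

              leader.start();
              recovery.start();
              leader.join();
              recovery.join();
          }

          private static void await(CountDownLatch latch) {
              try {
                  latch.await();
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }
      }

      Run as written, the sketch always takes the failure branch: the check after PeerSync reads the DOWN written by the second LIR rather than the state the recovery started with.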

      Log snippets

      T1 from the leader logs:

      solr.log.2:12779:2017-03-23 03:03:18.706 INFO  (qtp1076677520-9812) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
      

      T2 from the replica logs:

      solr.log.1:2017-03-23 03:03:26.724 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Attempting to PeerSync from http://server:8983/solr/test_shard73_replica1/ - recoveringAfterStartup=false
      

      T3 from the leader logs:

      solr.log.2:2017-03-23 03:03:43.268 INFO  (qtp1076677520-9796) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
      

      T4 from the replica logs:

      2017-03-23 03:05:38.009 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy PeerSync Recovery was successful - registering as Active.
      2017-03-23 03:05:38.012 ERROR (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover.:org.apache.solr.common.SolrException: Cannot publish state of core 'test_shard73_replica2' as active without recovering first!
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1179)
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1135)
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
       at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:415)
       at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
      
       2017-03-23 03:05:47.014 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.h.IndexFetcher Starting download to NRTCachingDirectory(MMapDirectory@/data4/test_shard73_replica2/data/index.20170323030546697 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4aa1e5c0; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
      

      I don't know what the best approach to tackle this problem is, but I'll post suggestions after doing some research. I wanted to create this Jira to track the issue.


          People

            Assignee: Unassigned
            Reporter: Varun Thacker
            Votes: 0
            Watchers: 3

