Solr > SOLR-10398

Multiple LIR requests can fail PeerSync even if it succeeds


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.3
    • Component/s: None
    • Labels: None

      Description

      I've seen a scenario where multiple LIR requests happen around the same time.
      In this case, even though PeerSync succeeded, recovery still failed, causing a full index fetch.

      Sequence of events:
      T1: Leader puts the replica in LIR and sets the replica's LIRState to DOWN
      T2: Replica begins PeerSync and its LIRState changes
      T3: Leader puts the replica in LIR again, and the replica's LIRState is set back to DOWN
      T4: The PeerSync triggered at T1 succeeds, but the replica then examines its own LIRState, which is now DOWN, and fails, triggering a full replication
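      The failure mode above is a classic check-then-act race: the recovery thread checks its LIR state after PeerSync completes, but a second LIR request may have reset that state in the meantime. A minimal standalone sketch of the race (class, method, and state names are illustrative only, not actual Solr APIs):

      ```java
      import java.util.concurrent.atomic.AtomicReference;

      public class LirRaceSketch {
          enum LirState { DOWN, RECOVERING, ACTIVE }

          // Shared LIR state, normally kept in ZooKeeper; modeled here as an atomic.
          static final AtomicReference<LirState> lirState = new AtomicReference<>();

          // Models the recovery thread: PeerSync succeeds, then the replica tries
          // to publish itself ACTIVE -- but only if its LIR state still permits it.
          static boolean tryPublishActive() {
              boolean peerSyncSucceeded = true; // T4: PeerSync from T1 succeeds
              if (!peerSyncSucceeded) {
                  return false;
              }
              // Check-then-act gap: the state read here may have been reset to
              // DOWN by a second LIR request (T3) after PeerSync started.
              return lirState.get() != LirState.DOWN;
          }

          public static void main(String[] args) {
              lirState.set(LirState.DOWN);        // T1: leader puts replica in LIR
              lirState.set(LirState.RECOVERING);  // T2: replica begins PeerSync
              lirState.set(LirState.DOWN);        // T3: second LIR request arrives
              // T4: PeerSync succeeded, yet the stale DOWN state wins
              System.out.println(tryPublishActive()
                      ? "registered ACTIVE"
                      : "falls back to full index fetch");
          }
      }
      ```

      Running the sketch prints "falls back to full index fetch", mirroring the log excerpts below: the successful PeerSync is discarded because the state check cannot distinguish the original LIR request from the later one.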

      Log snippet

      T1 from the leader logs:

      solr.log.2:12779:2017-03-23 03:03:18.706 INFO  (qtp1076677520-9812) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
      

      T2 from the replica logs:

      solr.log.1:2017-03-23 03:03:26.724 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Attempting to PeerSync from http://server:8983/solr/test_shard73_replica1/ - recoveringAfterStartup=false
      

      T3 from the leader logs:

      solr.log.2:2017-03-23 03:03:43.268 INFO  (qtp1076677520-9796) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
      

      T4 from the replica logs:

      2017-03-23 03:05:38.009 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy PeerSync Recovery was successful - registering as Active.
      2017-03-23 03:05:38.012 ERROR (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover.:org.apache.solr.common.SolrException: Cannot publish state of core 'test_shard73_replica2' as active without recovering first!
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1179)
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1135)
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
       at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:415)
       at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
      
       2017-03-23 03:05:47.014 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.h.IndexFetcher Starting download to NRTCachingDirectory(MMapDirectory@/data4/test_shard73_replica2/data/index.20170323030546697 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4aa1e5c0; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
      

      I'm not sure what the best approach to tackle this problem is, but I'll post suggestions after doing some research. I wanted to create this Jira to track the issue.

               People

               • Assignee: Unassigned
               • Reporter: Varun Thacker
               • Votes: 0
               • Watchers: 3
