SOLR-10398: Multiple LIR requests can fail PeerSync even if it succeeds


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.3
    • Component/s: None
    • Labels: None

    Description

      I've seen a scenario where multiple LIR (leader-initiated recovery) requests happen around the same time.
      In this case, even though PeerSync succeeded, recovery ended up failing and triggered a full index fetch.

      Sequence of events:
      T1: The leader puts the replica into LIR and sets the replica's LIRState to DOWN.
      T2: The replica begins PeerSync and its LIRState changes.
      T3: The leader puts the replica into LIR again and the replica's LIRState is set back to DOWN.
      T4: The PeerSync triggered by the LIR at T1 succeeds, but the replica then examines its own LIRState, which is now DOWN, and fails, triggering a full replication.
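
      To make the race easier to see, here is a minimal, self-contained sketch of the interleaving described above. It is not Solr code: the class name, the AtomicReference standing in for the replica's LIR state node, and the latches used to force the T1-T4 ordering are all hypothetical.

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.atomic.AtomicReference;

      // Illustrative sketch of the T1-T4 race; all names are hypothetical.
      public class LirRaceSketch {

          // Stands in for the replica's leader-initiated-recovery state node.
          static final AtomicReference<String> lirState = new AtomicReference<>("ACTIVE");

          public static void main(String[] args) throws InterruptedException {
              CountDownLatch firstLirApplied = new CountDownLatch(1);
              CountDownLatch peerSyncStarted = new CountDownLatch(1);
              CountDownLatch secondLirApplied = new CountDownLatch(1);

              // Leader side: issues the two LIR requests (T1 and T3).
              Thread leader = new Thread(() -> {
                  lirState.set("DOWN");              // T1: first LIR marks the replica DOWN
                  firstLirApplied.countDown();
                  await(peerSyncStarted);            // wait until the replica is mid-PeerSync
                  lirState.set("DOWN");              // T3: second LIR marks the replica DOWN again
                  secondLirApplied.countDown();
              });

              // Replica side: recovery thread running PeerSync (T2 and T4).
              Thread recovery = new Thread(() -> {
                  await(firstLirApplied);
                  lirState.set("RECOVERING");        // T2: replica starts recovering
                  peerSyncStarted.countDown();
                  await(secondLirApplied);           // PeerSync is in flight while T3 happens
                  boolean peerSyncSucceeded = true;  // T4: PeerSync itself succeeded...
                  if (peerSyncSucceeded && "DOWN".equals(lirState.get())) {
                      // ...but the post-sync check sees the DOWN written by the second LIR.
                      System.out.println("PeerSync succeeded, yet LIRState is DOWN -> full replication");
                  } else {
                      System.out.println("PeerSync succeeded -> registering as active");
                  }
              });

              leader.start();
              recovery.start();
              leader.join();
              recovery.join();
          }

          private static void await(CountDownLatch latch) {
              try {
                  latch.await();
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }
      }

      Run as written, the sketch always takes the failure branch: the check after PeerSync reads the DOWN written by the second LIR rather than the state the recovery started with.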

      Log snippets

      T1 from the leader logs:

      solr.log.2:12779:2017-03-23 03:03:18.706 INFO  (qtp1076677520-9812) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
      

      T2 from the replica logs:

      solr.log.1:2017-03-23 03:03:26.724 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Attempting to PeerSync from http://server:8983/solr/test_shard73_replica1/ - recoveringAfterStartup=false
      

      T3 from the leader logs:

      solr.log.2:2017-03-23 03:03:43.268 INFO  (qtp1076677520-9796) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
      

      T4 from the replica logs:

      2017-03-23 03:05:38.009 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy PeerSync Recovery was successful - registering as Active.
      2017-03-23 03:05:38.012 ERROR (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover.:org.apache.solr.common.SolrException: Cannot publish state of core 'test_shard73_replica2' as active without recovering first!
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1179)
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1135)
       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
       at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:415)
       at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
      
       2017-03-23 03:05:47.014 INFO  (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.h.IndexFetcher Starting download to NRTCachingDirectory(MMapDirectory@/data4/test_shard73_replica2/data/index.20170323030546697 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4aa1e5c0; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
      

      I don't know what the best approach to tackle this problem is, but I'll post suggestions after doing some research. I wanted to create this Jira to track the issue.


          People

            Assignee: Unassigned
            Reporter: Varun Thacker
            Votes: 0
            Watchers: 3

