SOLR-8085: Fix a variety of issues that can result in replicas getting out of sync.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None

      Description

      I've been discussing this fail I found with Yonik.

      The problem seems to be that a replica tries to recover and publishes itself as recovering - the attempt then fails, but docs are now coming in from the leader. The replica tries to recover again and has by then received enough docs to pass peer sync.

      I'm trying a possible solution now where we won't allow peer sync after a recovery that is not successful.
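
      As a minimal sketch of that idea only (names and structure are illustrative, not the actual patch attached here): remember whether the previous recovery attempt for a core succeeded, and skip the peer sync shortcut when it did not.

        // Illustrative sketch only, not the committed SOLR-8085 change.
        class PerCoreRecoveryState {
            private volatile boolean lastRecoverySucceeded = true;

            void recoveryFinished(boolean success) {
                lastRecoverySucceeded = success;
            }

            boolean allowPeerSync() {
                // After a failed attempt, docs buffered from the leader can make the replica
                // look "close enough" to pass peer sync even though older updates are missing,
                // so force full replication instead.
                return lastRecoverySucceeded;
            }
        }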

      Attachments

      1. fail.150922_125320
        14.79 MB
        Yonik Seeley
      2. fail.150922_130608
        8.01 MB
        Yonik Seeley
      3. SOLR-8085.patch
        6 kB
        Mark Miller
      4. SOLR-8085.patch
        3 kB
        Mark Miller
      5. SOLR-8085.patch
        2 kB
        Yonik Seeley

          Activity

          Yonik Seeley added a comment -

          Here's a fail from HdfsChaosMonkeySafeLeaderTest on trunk (with some minor patches provided by Mark).

          Throwable #1: java.lang.AssertionError: shard3 is not consistent. Got 658 from http://127.0.0.1:44649/collection1lastClient and got 330 from http://127.0.0.1:38249/collection1

          Yonik Seeley added a comment -

          Another one that has a smaller log file.

          Throwable #1: java.lang.AssertionError: shard3 is not consistent. Got 375 from http://127.0.0.1:40940/collection1lastClient and got 230 from http://127.0.0.1:47239/collection1

          Mark Miller added a comment -

          I'm trying a possible solution now where we won't allow peer sync after a recovery that is not successful.

          This seems to work out well.

          Most of these issues show up more easily when running the heavier HDFS tests, so I am running those. Because truncate support in HDFS was not added until 2.7, when a replication fails we still replay all the buffered docs. A badly timed server restart can then pass peer sync incorrectly if this happens, regardless of my attempted fix, which does not carry across Jetty restarts.

          Yonik Seeley added a comment -

          OK, here's an analysis of fail.150922_130608.

          It looks like LIR happens during normal recovery after startup, and we finally end up doing recovery with recoveringAfterStartup=false, which uses recent versions in peer sync (and includes docs that were buffered while we were recovering) rather than the true startup versions. This causes peer sync to pass when it should not have. (A small sketch of the flag's effect follows the annotated log below.)

          
          (add first appears, node 47239 appears to be coming up at the time)
            2> 74317 INFO  (qtp324612161-378) [n:127.0.0.1:40940_ c:collection1 s:shard3 r:core_node6 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2} {add=[0-333 (1513033816377131008)]} 0 19
            2> 74317 INFO  (qtp324612161-378) [n:127.0.0.1:40940_ c:collection1 s:shard3 r:core_node6 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2} {add=[0-333 (1513033816377131008)]} 0 19
            2> 74317 INFO  (qtp661063741-234) [n:127.0.0.1:38911_ c:collection1 s:shard3 r:core_node2 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={wt=javabin&version=2} {add=[0-333 (1513033816377131008)]} 0 31
            
          (node coming up)
            2> 75064 INFO  (coreLoadExecutor-196-thread-1-processing-n:127.0.0.1:47239_) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.VersionInfo Refreshing highest value of _version_ for 256 version buckets from index
            2> 75065 INFO  (coreLoadExecutor-196-thread-1-processing-n:127.0.0.1:47239_) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.UpdateLog Took 31.0ms to seed version buckets with highest version 1513033812159758336
            2> 75120 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ZkController Replaying tlog for http://127.0.0.1:47239/collection1/ during startup... NOTE: This can take a while.
            2> 75162 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.UpdateLog add add{flags=a,_version_=1513033807303802880,id=1-3}
            2> 75162 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{flags=a,_version_=1513033807303802880,id=1-3} LocalSolrQueryRequest{update.distrib=FROMLEADER&log_replay=true}
          
          (replay finished)
            2> 75280 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{flags=a,_version_=1513033812159758336,id=1-132} LocalSolrQueryRequest{update.distrib=FROMLEADER&log_replay=true}
           
          (meanwhile, the leader is asking us to recover?)
            2> 75458 WARN  (updateExecutor-14-thread-5-processing-x:collection1 r:core_node2 http:////127.0.0.1:47239//collection1// n:127.0.0.1:38911_ s:shard3 c:collection1) [n:127.0.0.1:38911_ c:collection1 s:shard3 r:core_node2 x:collection1] o.a.s.c.LeaderInitiatedRecoveryThread Asking core=collection1 coreNodeName=core_node10 on http://127.0.0.1:47239 to recover; unsuccessful after 2 of 120 attempts so far ...
            
          (and we see the request to recover)
            2> 75475 INFO  (qtp2087242119-1282) [n:127.0.0.1:47239_    ] o.a.s.h.a.CoreAdminHandler It has been requested that we recover: core=collection1
          
          (so we cancel the existing recovery)
            2> 75478 INFO  (Thread-1246) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.DefaultSolrCoreState Running recovery - first canceling any ongoing recovery
          
            2> 75552 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true
          
            2> 75610 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy ###### startupVersions=[1513033812159758336, [...]
            2> 75611 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Publishing state of core collection1 as recovering, leader is http://127.0.0.1:38911/collection1/ and I am http://127.0.0.1:47239/collection1/
            2> 75611 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ZkController publishing state=recovering
           
            2> 75642 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Sending prep recovery command to http://127.0.0.1:38911; WaitForState: action=PREPRECOVERY&core=collection1&nodeName=127.0.0.1%3A47239_&coreNodeName=core_node10&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
          
          (In the meantime, we still receive updates.  1-414 is one of the updates that we are missing!!!  I assume we are buffering these?)
            2> 75782 DEBUG (qtp2087242119-1281) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=1-414} {update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}
            2> 75828 INFO  (qtp2087242119-1281) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2} {add=[1-414 (1513033817943703552)]} 0 45
          
          (what's this about?)
            2> 75923 INFO  (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.s.SolrIndexSearcher Opening Searcher@2ba72586[collection1] main
          
          (wait... replay? is this from when we started up and we never stopped it?  should we have?)
            2> 76135 DEBUG (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE FINISH LocalSolrQueryRequest{update.distrib=FROMLEADER&log_replay=true}
          
          
          (log replay finishes... the dbq *:* shows that this is the tlog from the beginning of this test.. i.e. we never got to do a commit at first)
            2> 76181 INFO  (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] {deleteByQuery=*:* (-1513033805984694272),add=[1-3 (1513033807303802880), 0-5 (1513033807434874880), 1-6 (1513033807628861440), 1-7 (1513033807699116032), 1-9 (1513033807797682176), 0-14 (1513033807941337088), 0-15 (1513033807976988672), 1-14 (1513033807991668736), 1-15 (1513033808035708928), 0-17 (1513033808075554816), ... (61 adds)]} 0 1060
            2> 76182 WARN  (recoveryExecutor-199-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=61 deletes=0 deleteByQuery=1 errors=0 positionOfStart=0}
          
          (we're canceling recovery again?)
            2> 76182 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.DefaultSolrCoreState Running recovery - first canceling any ongoing recovery
            2> 76186 WARN  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Stopping recovery for core=collection1 coreNodeName=core_node10
            
          (did this fail because we canceled the recovery?)
            2> 76194 ERROR (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Error while trying to recover.:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://127.0.0.1:38911
            2> 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
            2> 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
            2> 	at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:598)
            2> 	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:361)
            2> 	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
            2> Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://127.0.0.1:38911
            2> 	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:590)
            2> 	at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
            2> 	at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
            2> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            2> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
            2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            2> 	at java.lang.Thread.run(Thread.java:745)
            2> Caused by: java.net.SocketException: Socket closed
            2> 76202 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Finished recovery process.
          
          (note: this is the last we hear of this thread... not clear if that's OK or not)
            2> 76202 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ActionThrottle The last recovery attempt started 711ms ago.
            2> 76202 INFO  (coreZkRegister-190-thread-1-processing-n:127.0.0.1:47239_ x:collection1 s:shard3 c:collection1 r:core_node10) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.ActionThrottle Throttling recovery attempts - waiting for 9288ms
            
           (still receiving updates, while we are waiting to do the next recovery attempt...  are these still buffering? how do we tell?)
            2> 76252 DEBUG (qtp2087242119-1283) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=0-432} {update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:38911/collection1/&wt=javabin&version=2}
          
          
          (OK, looks like we are finally trying the recovery process again... but note that "recoveringAfterStartup" is now false!!!!)
            2> 85491 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=false
            2> 85531 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Publishing state of core collection1 as recovering, leader is http://127.0.0.1:38911/collection1/ and I am http://127.0.0.1:47239/collection1/
          
          (Peersync.  Because recoveringAfterStartup is false, it will use most recent versions to compare, rather than startup versions)
          (As an example, version 1513033825996767232 is very recent, at timestamp 83465, while we were still not recovered)
            2> 92554 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy Attempting to PeerSync from http://127.0.0.1:38911/collection1/ - recoveringAfterStartup=false
            2> 92554 DEBUG (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:47239 startingVersions=100 [1513033825996767232, [...]
            2> 92646 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:47239  Received 100 versions from http://127.0.0.1:38911/collection1/
            2> 92647 DEBUG (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:47239  sorted versions from http://127.0.0.1:38911/collection1/ = [1513033825996767232, ...
            2> 92647 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:47239  Our versions are newer. ourLowThreshold=1513033822028955648 otherHigh=1513033825172586496
            2> 92647 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:47239 DONE. sync succeeded
          
          (And we pass peersync when we should not)
            2> 92796 INFO  (RecoveryThread-collection1) [n:127.0.0.1:47239_ c:collection1 s:shard3 r:core_node10 x:collection1] o.a.s.c.RecoveryStrategy PeerSync Recovery was successful - registering as Active.
          
            2> ######shard3 is not consistent.  Got 375 from http://127.0.0.1:40940/collection1lastClient and got 230 from http://127.0.0.1:47239/collection1
            2> ###### sizes=375,230
            2> ###### Only in http://127.0.0.1:40940/collection1: [{_version_=1513033815669342208, id=0-295}, [...], {_version_=1513033816377131008, id=0-333}]
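
          To make the flag's effect concrete, here is a rough sketch (not the real RecoveryStrategy/PeerSync code) of what recoveringAfterStartup selects for comparison. With the flag false, the most recent local versions include updates that were only buffered while the replica was still recovering, which is what lets the sync pass here.

            import java.util.List;

            // Illustrative only: the shape of the bug described above.
            class VersionChoiceSketch {
                static List<Long> versionsToCompare(boolean recoveringAfterStartup,
                                                    List<Long> startupVersions,
                                                    List<Long> recentVersions) {
                    // true  -> compare the versions the replica had at startup
                    // false -> compare its most recent versions, buffered updates included
                    return recoveringAfterStartup ? startupVersions : recentVersions;
                }
            }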
           
          
          Yonik Seeley added a comment -

          OK, here's one possible patch I think.

          Mark Miller added a comment -

          but note that "recoveringAfterStartup" is now false!!!!

          Yeah, lots of ways to lose the class and start over - so if you really want a field to persist, it has to be static.

          Mark Miller added a comment -

          Looked at the patch - yeah, or put it on the core state

          Yonik Seeley added a comment -

          Yeah, it can't be static because each core needs its own state.
          We could also maintain it as a normal variable in RecoveryStrategy and either reuse RecoveryStrategy objects, or initialize future objects from past objects. Thoughts on best approach?
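
          As a purely illustrative sketch of the per-core option under discussion (not code from either patch), the flag could live in state keyed by core, or on the shared per-core state object, rather than in a static field:

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Illustrative only: keep the "last recovery succeeded" flag per core so it
            // survives RecoveryStrategy instances being thrown away, without sharing a
            // single static boolean across every core.
            class RecoverySuccessTracker {
                private final Map<String, Boolean> lastRecoverySucceeded = new ConcurrentHashMap<>();

                void recordResult(String coreName, boolean success) {
                    lastRecoverySucceeded.put(coreName, success);
                }

                boolean mayPeerSync(String coreName) {
                    // a core that has never attempted recovery defaults to allowing peer sync
                    return lastRecoverySucceeded.getOrDefault(coreName, Boolean.TRUE);
                }
            }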

          Mark Miller added a comment -

          Could be a static map or something in RecoveryStrategy too - but seeing as we already store another variable like this in the default core state, this made a lot of sense to me.

          With your patch and running on a patched version of 4.10.3, I was still only seeing one other type of fail.

          Docs that came in during recovery - after publishing as recovering but before buffering started - would end up causing a false peer sync pass if enough of them came in.

          I seem to have worked around this issue by buffering docs before peer sync and before publishing as RECOVERING (the signal for the leader to start sending updates).
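
          As a purely illustrative outline of that ordering (not the committed RecoveryStrategy change), the essential point is that buffering starts before the RECOVERING publish, so every update the leader forwards from that moment on lands in the buffer:

            // Illustrative pseudo-structure only.
            interface RecoveryOps {
                void bufferUpdates();
                void publishState(String state);
                boolean tryPeerSync();
                void replicateFromLeader();
                void applyBufferedUpdates();
            }

            class BufferFirstRecovery {
                void recover(RecoveryOps ops) {
                    ops.bufferUpdates();            // 1. buffer before anything else
                    ops.publishState("recovering"); // 2. leader starts forwarding updates; they are buffered
                    if (!ops.tryPeerSync()) {       // 3. cheap sync attempt
                        ops.replicateFromLeader();  // 4. fall back to full index replication
                    }
                    ops.applyBufferedUpdates();     // 5. replay what was buffered above
                    ops.publishState("active");
                }
            }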

          With my current runs using no deletes, I have not yet found a fail after this on this version of the code.

          Mark Miller added a comment -

          I should mention one other change, since I am mostly testing the HDFS version of this ChaosMonkey test: after talking to Yonik, I also fixed an issue where, because we don't have truncate support, we were replaying buffered docs on a failure just to get past them. We really should not do that, as it can lead to bad peer sync passes, and I have a fix for that as well. I'll file a separate JIRA issue for that one.

          Mark Miller added a comment -

          For the last comment I filed SOLR-8094.

          Mark Miller added a comment -

          Here is a patch I just ported to trunk that starts buffering before peer sync and before we publish as RECOVERING.

          With Yonik Seeley's fix and the couple of fixes I have, I've been able to hit 200 HdfsChaosMonkey test runs without replica inconsistency on my 4.10.3 + backports based code for the first time.

          Mark Miller added a comment -

          Here is a patch I just ported to trunk that starts buffering before peer sync and before we publish as RECOVERING.

          It looks like this may pose a problem for shard splitting.

          Mark Miller added a comment -

          Never mind - the patch was missing a change. Here is a new patch that combines with Yonik's.

          ASF subversion and git services added a comment -

          Commit 1706423 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1706423 ]

          SOLR-8085: Fix a variety of issues that can result in replicas getting out of sync.

          ASF subversion and git services added a comment -

          Commit 1706424 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1706424 ]

          SOLR-8085: Fix a variety of issues that can result in replicas getting out of sync.

          Mark Miller added a comment -

          I also have a test change we may need to prevent false fails - I'll spin that off into its own issue.

          Joel Bernstein added a comment -

          Mark Miller, I'm getting errors in the TestSQLHandler test cases (the test extends AbstractFullDistribZkTestBase) following this commit. In the test case it looks like one of the shards is not having documents added to it. Still investigating...

          Joel Bernstein added a comment -

          I think I'm using the wrong method to index documents and this ticket just exposed that.

          Joel Bernstein added a comment -

          Ok, the issue was that I wasn't calling: waitForRecoveriesToFinish(false);

          Once I added this, the test passes.
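
          For reference, a hedged example of that fix: everything except the waitForRecoveriesToFinish(false) call is illustrative test scaffolding, not taken from TestSQLHandler.

            import org.apache.solr.cloud.AbstractFullDistribZkTestBase;
            import org.junit.Test;

            // Illustrative test sketch: wait for all replicas to finish recovering before
            // indexing, otherwise updates sent to a still-recovering replica may be buffered
            // and a shard can look like it is missing documents.
            public class ExampleCloudTest extends AbstractFullDistribZkTestBase {
                @Test
                public void testAfterRecoveries() throws Exception {
                    waitForRecoveriesToFinish(false); // the call Joel added
                    // ... index documents and assert on per-shard results ...
                }
            }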

          Shalin Shekhar Mangar added a comment -

          Good catch guys! Thanks for fixing this. Special thanks to Yonik for analysing the failure logs.

          Mark Miller added a comment -

          I also have a test change we may need to prevent false fails - I'll spin that off into its own issue.

          SOLR-8121


            People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 7
