Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-7673

Race condition in shard splitting can cause operation to hang indefinitely or sub-shards to never become active

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.10.4, 5.2
    • Fix Version/s: 5.3, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      I found this while digging into the failure on https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java8/69/

         [junit4]   2> 390032 INFO  (OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000) [n:127.0.0.1:57926_    ] o.a.s.c.o.ReplicaMutator Update state numShards=1 message={
         [junit4]   2>   "core":"testasynccollectioncreation_shard1_0_replica2",
         [junit4]   2>   "core_node_name":"core_node5",
         [junit4]   2>   "roles":null,
         [junit4]   2>   "base_url":"http://127.0.0.1:38702",
         [junit4]   2>   "node_name":"127.0.0.1:38702_",
         [junit4]   2>   "numShards":"1",
         [junit4]   2>   "state":"active",
         [junit4]   2>   "shard":"shard1_0",
         [junit4]   2>   "collection":"testasynccollectioncreation",
         [junit4]   2>   "operation":"state"}
         [junit4]   2> 390033 INFO  (OverseerStateUpdate-93987739315601411-127.0.0.1:57926_-n_0000000000) [n:127.0.0.1:57926_    ] o.a.s.c.o.ZkStateWriter going to update_collection /collections/testasynccollectioncreation/state.json version: 18
         [junit4]   2> 390033 INFO  (RecoveryThread-testasynccollectioncreation_shard1_0_replica2) [n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5 x:testasynccollectioncreation_shard1_0_replica2] o.a.s.u.UpdateLog Took 2 ms to seed version buckets with highest version 1503803851028824064
         [junit4]   2> 390033 INFO  (zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_    ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/testasynccollectioncreation/state.json for collection testasynccollectioncreation has occurred - updating... (live nodes size: 2)
         [junit4]   2> 390033 INFO  (RecoveryThread-testasynccollectioncreation_shard1_0_replica2) [n:127.0.0.1:38702_ c:testasynccollectioncreation s:shard1_0 r:core_node5 x:testasynccollectioncreation_shard1_0_replica2] o.a.s.c.RecoveryStrategy Finished recovery process.
         [junit4]   2> 390033 INFO  (zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_    ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/testasynccollectioncreation/state.json for collection testasynccollectioncreation has occurred - updating... (live nodes size: 2)
         [junit4]   2> 390034 INFO  (zkCallback-167-thread-1-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_    ] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to ver 19 
         [junit4]   2> 390034 INFO  (zkCallback-173-thread-1-processing-n:127.0.0.1:38702_) [n:127.0.0.1:38702_    ] o.a.s.c.c.ZkStateReader Updating data for testasynccollectioncreation to ver 19 
         [junit4]   2> 390298 INFO  (qtp1859109869-1664) [n:127.0.0.1:38702_    ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :requeststatus with paramsrequestid=1004&action=REQUESTSTATUS&wt=javabin&version=2 
         [junit4]   2> 390299 INFO  (qtp1859109869-1664) [n:127.0.0.1:38702_    ] o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/collections params={requestid=1004&action=REQUESTSTATUS&wt=javabin&version=2} status=0 QTime=1 
         [junit4]   2> 390348 INFO  (qtp1859109869-1665) [n:127.0.0.1:38702_    ] o.a.s.h.a.CoreAdminHandler Checking request status for : 10042599422118342586
         [junit4]   2> 390348 INFO  (qtp1859109869-1665) [n:127.0.0.1:38702_    ] o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores params={qt=/admin/cores&requestid=10042599422118342586&action=REQUESTSTATUS&wt=javabin&version=2} status=0 QTime=1 
         [junit4]   2> 390348 INFO  (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_    ] o.a.s.c.OverseerCollectionProcessor Asking sub shard leader to wait for: testasynccollectioncreation_shard1_0_replica2 to be alive on: 127.0.0.1:38702_
         [junit4]   2> 390349 INFO  (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_    ] o.a.s.c.OverseerCollectionProcessor Creating replica shard testasynccollectioncreation_shard1_1_replica2 as part of slice shard1_1 of collection testasynccollectioncreation on 127.0.0.1:38702_
         [junit4]   2> 390349 INFO  (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_    ] o.a.s.c.c.ZkStateReader Load collection config from:/collections/testasynccollectioncreation
         [junit4]   2> 390350 INFO  (OverseerThreadFactory-891-thread-4-processing-n:127.0.0.1:57926_) [n:127.0.0.1:57926_    ] o.a.s.c.c.ZkStateReader path=/collections/testasynccollectioncreation configName=conf1 specified config exists in ZooKeeper
         [junit4]   2> 390352 INFO  (qtp381614561-1631) [n:127.0.0.1:57926_    ] o.a.s.s.SolrDispatchFilter [admin] webapp=null path=/admin/cores params={nodeName=127.0.0.1:38702_&core=testasynccollectioncreation_shard1_0_replica1&async=10042599424132346542&qt=/admin/cores&coreNodeName=core_node5&action=PREPRECOVERY&checkLive=true&state=recovering&onlyIfLeader=true&wt=javabin&version=2} status=0 QTime=2 
      

      The following sequence of events lead to deadlock:

      1. testasynccollectioncreation_shard1_0_replica2 (core_node5) becomes active
      2. OCP asks sub-shard leader testasynccollectioncreation_shard1_0_replica1 to wait until replica2 is in recovery

      At this point, the test just keeps pinging for status until timeout and fails.

        Attachments

        1. Lucene-Solr-Tests-trunk-Java8-69.log
          1.31 MB
          Shalin Shekhar Mangar
        2. SOLR-7673.patch
          13 kB
          Shalin Shekhar Mangar

          Activity

            People

            • Assignee:
              shalinmangar Shalin Shekhar Mangar
              Reporter:
              shalinmangar Shalin Shekhar Mangar
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: