  Solr / SOLR-5615

Deadlock while trying to recover after a ZK session expiry

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.4, 4.5, 4.6
    • Fix Version/s: 4.6.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      The sequence of events which might trigger this is as follows:

      • Leader of a shard, say OL, has a ZK expiry
      • The new leader, NL, starts the election process
      • NL, through Overseer, clears the current leader (OL) for the shard from the cluster state
      • OL reconnects to ZK, calls onReconnect from event thread (main-EventThread)
      • OL marks itself down
      • OL sets up watches for cluster state, and then retrieves it (with no leader for this shard)
      • NL, through Overseer, updates cluster state to mark itself leader for the shard
      • OL tries to register itself as a replica and, still on the event thread, waits until the
        cluster state is updated with the new leader
      • ZK sends a watch update to OL, but it cannot be processed: the event thread that would
        handle it is the one blocked waiting for it.

      Oops. OL only breaks out of this when its attempt to register itself as a replica times out, after 20 mins.
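
The sequence above can be reproduced in miniature without ZooKeeper. The sketch below is only an illustration, not Solr code and not the attached patch: a single-threaded executor stands in for main-EventThread, a CountDownLatch stands in for "cluster state now shows the new leader", and the class and method names are hypothetical. The first method waits on the event thread itself, so the queued watch update cannot run until the wait times out. The second hands the wait to a separate worker thread, which is one common way to keep an event thread free; it is an assumption here that this resembles the eventual fix.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class EventThreadDeadlockSketch {

    // Waits for the "leader seen" signal on the event thread itself.
    // The signal is queued behind the waiting task, so it cannot be
    // delivered before the timeout expires.
    static boolean leaderSeenWhenWaitingOnEventThread(long timeoutMs) throws Exception {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        CountDownLatch leaderSeen = new CountDownLatch(1);
        try {
            Callable<Boolean> onReconnect = () -> {
                // Stand-in for the watch update: queued on the same single thread.
                eventThread.execute(leaderSeen::countDown);
                // Stand-in for "wait until the cluster state has a leader".
                return leaderSeen.await(timeoutMs, TimeUnit.MILLISECONDS);
            };
            return eventThread.submit(onReconnect).get();
        } finally {
            eventThread.shutdownNow();
        }
    }

    // Same scenario, but the blocking wait runs on a separate worker thread,
    // so the event thread returns and can process the queued watch update.
    static boolean leaderSeenWhenWaitingOffEventThread(long timeoutMs) throws Exception {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch leaderSeen = new CountDownLatch(1);
        try {
            Callable<Boolean> waitForLeader =
                () -> leaderSeen.await(timeoutMs, TimeUnit.MILLISECONDS);
            Callable<Future<Boolean>> onReconnect = () -> {
                eventThread.execute(leaderSeen::countDown);
                return worker.submit(waitForLeader);
            };
            return eventThread.submit(onReconnect).get().get();
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("waiting on event thread, leader seen: "
            + leaderSeenWhenWaitingOnEventThread(500));
        System.out.println("waiting off event thread, leader seen: "
            + leaderSeenWhenWaitingOffEventThread(500));
    }
}
```

In the first case the only way out is the timeout, which mirrors the 20 minute wait described above. The timeout values are arbitrary and kept short so the sketch finishes quickly.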

        Attachments

        1. SOLR-5615.patch
          4 kB
          Mark Miller
        2. SOLR-5615.patch
          4 kB
          Mark Miller
        3. SOLR-5615.patch
          10 kB
          Mark Miller

            People

            • Assignee:
              markrmiller@gmail.com Mark Miller
              Reporter:
              andyetitmoves Ramkumar Aiyengar
            • Votes:
              2
              Watchers:
              10
