Solr / SOLR-11590

Synchronize ZK connect/disconnect handling

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.2, 8.0
    • Component/s: None
    • Labels: None

    Description

      Here are two sequences of a disconnect followed by a reconnect:

      1. 2017-10-31T08:34:23.106-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
      2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
      3. 2017-10-31T08:34:23.107-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
      
      1. 2017-10-31T08:36:46.541-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
      2. 2017-10-31T08:36:46.549-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
      3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
      

      In the first disconnect the sequence is: receive the Disconnected watch event, execute the disconnect code, execute the connect code.
      In the second disconnect the sequence is: receive the Disconnected watch event, execute the connect code, execute the disconnect code.

      In the second sequence of events, if the JVM has leader replicas, all updates start failing with "Cannot talk to ZooKeeper - Updates are disabled." This starts exactly 27 seconds later (the ZK client timeout is 30s, and 90% of 30 = 27, which is when the code decides the session is likely expired). There are no leadership changes, since the session never actually expired. Unless the node is restarted, all updates to the system continue to fail.

      These log lines are from Solr 5.3, where the WatchedEvent was still being logged at INFO.

      Based on the log ordering, we process the connect code first and the disconnect code afterwards, i.e. out of order. The connection is active but the flag is not set, so after 27 seconds zkCheck starts complaining that the connection is likely expired.
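
      The effect of the ordering can be reproduced with a small stand-alone model. This is only a sketch: ConnectionFlagModel and its methods are hypothetical names invented for illustration, not Solr's ConnectionManager code. It assumes a single connected flag, a recorded disconnect time, and a "likely expired" threshold of 90% of the client timeout, as described above. Time is passed in explicitly so the run is deterministic.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical model of the state described in this issue; not Solr code.
public class ConnectionFlagModel {
    private final long clientTimeoutMs;
    private boolean connected = true;
    private long lastDisconnectNanos;

    public ConnectionFlagModel(long clientTimeoutMs) {
        this.clientTimeoutMs = clientTimeoutMs;
    }

    // Stand-in for the "connect code": marks the connection as live.
    public void onConnected() {
        connected = true;
    }

    // Stand-in for the "disconnect code": clears the flag and records when.
    public void onDisconnected(long nowNanos) {
        connected = false;
        lastDisconnectNanos = nowNanos;
    }

    // 90% of the client timeout, e.g. 27000 ms for a 30000 ms timeout.
    public long likelyExpiredAfterMs() {
        return clientTimeoutMs * 90 / 100;
    }

    // True once the flag has been "disconnected" for longer than the threshold.
    public boolean isLikelyExpired(long nowNanos) {
        if (connected) {
            return false;
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - lastDisconnectNanos);
        return elapsedMs > likelyExpiredAfterMs();
    }

    public static void main(String[] args) {
        long sec = TimeUnit.SECONDS.toNanos(1);

        // First sequence from the logs: disconnect code, then connect code.
        ConnectionFlagModel inOrder = new ConnectionFlagModel(30_000);
        inOrder.onDisconnected(0);
        inOrder.onConnected();
        System.out.println("in order, t=28s: " + inOrder.isLikelyExpired(28 * sec));

        // Second sequence: connect code runs first, disconnect code runs last.
        ConnectionFlagModel outOfOrder = new ConnectionFlagModel(30_000);
        outOfOrder.onConnected();
        outOfOrder.onDisconnected(0);
        System.out.println("out of order, t=26s: " + outOfOrder.isLikelyExpired(26 * sec));
        System.out.println("out of order, t=28s: " + outOfOrder.isLikelyExpired(28 * sec));
    }
}
```

      In this model the in-order sequence never reports the session as likely expired, while the out-of-order sequence leaves the flag cleared and starts reporting it once more than 27 seconds have passed, even though no further event arrives.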

      A related Jira is SOLR-5721

      ZK gives us ordered watch events (https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees), but from what I understand Solr can still process them out of order. We could take a lock and synchronize ConnectionManager#connected and ConnectionManager#disconnected.

      Would that be the right approach to take?
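
      A minimal sketch of the locking idea, under stated assumptions: SynchronizedHandlers, stateLock, and the trace list are hypothetical and exist only for this illustration. The method names connected() and disconnected() mirror the ones mentioned above, but the bodies are placeholders, not the actual ConnectionManager logic or the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of guarding both handlers with one lock; not Solr code.
public class SynchronizedHandlers {
    private final Object stateLock = new Object();
    private final List<String> trace = new ArrayList<>();
    private boolean connected;

    // Placeholder for the connect handling; runs entirely under the lock.
    public void connected() {
        synchronized (stateLock) {
            trace.add("connect:start");
            connected = true;
            trace.add("connect:end");
        }
    }

    // Placeholder for the disconnect handling; runs entirely under the same lock.
    public void disconnected() {
        synchronized (stateLock) {
            trace.add("disconnect:start");
            connected = false;
            trace.add("disconnect:end");
        }
    }

    public boolean isConnected() {
        synchronized (stateLock) {
            return connected;
        }
    }

    public List<String> trace() {
        synchronized (stateLock) {
            return new ArrayList<>(trace);
        }
    }

    // Runs both handlers on separate threads and checks that every
    // "start" entry is immediately followed by its matching "end" entry.
    public static boolean handlersNeverInterleave(int rounds) {
        for (int i = 0; i < rounds; i++) {
            SynchronizedHandlers h = new SynchronizedHandlers();
            Thread t1 = new Thread(h::disconnected);
            Thread t2 = new Thread(h::connected);
            t1.start();
            t2.start();
            try {
                t1.join();
                t2.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            List<String> t = h.trace();
            if (t.size() != 4) {
                return false;
            }
            for (int j = 0; j < 4; j += 2) {
                String first = t.get(j);
                String prefix = first.substring(0, first.indexOf(':'));
                if (!first.endsWith(":start") || !t.get(j + 1).equals(prefix + ":end")) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("never interleaved: " + handlersNeverInterleave(1000));
    }
}
```

      Note what this sketch does and does not show: the shared lock keeps the two handlers from interleaving, so each one's state update is applied as a unit. The lock does not by itself decide which handler runs first; that still depends on how the events are handed to threads.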

      Attachments

        1. SOLR-11590.patch
          2 kB
          Noble Paul
        2. SOLR-11590.patch
          3 kB
          Varun Thacker

          People

            Assignee: Varun Thacker
            Reporter: Varun Thacker
            Votes: 0
            Watchers: 7
