Solr / SOLR-4099

Suspect the ZooKeeper client thread doesn't call back the watcher, which causes the Overseer collection processor to stop working.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA, 4.0-BETA, 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: SolrCloud
    • Labels: None
    • Environment: ZooKeeper version: 3.2

    Description

      In our test environment, the ZooKeeper version is older than the required version; we are not using the 3.3.6 version that Solr uses by default.

      The Overseer collection processor stopped working. The thread dump shows the Overseer waiting in LatchChildWatcher.await.
      Checking /overseer/collection-queue-work in ZooKeeper shows that many collection operations are blocked in the queue.

      After checking the logic, I suspect that the ZooKeeper client did not call back the watcher registered on the path "/overseer/collection-queue-work". Unfortunately, the relevant logging is at debug level, so the logs do not show it.

      This happens very rarely. But when it does, it is fatal: we have to stop the leader server.

      As a compensating fix, I suggest not waiting indefinitely for the notification. Instead, wait for a bounded time (ten minutes, half an hour, or some other value) and then recheck the queue. Of course, if the notification does arrive, processing continues normally right away.
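
The bounded wait suggested above could look roughly like the following. This is a minimal sketch only, not the attached patch and not Solr's actual LatchChildWatcher: the class name TimedLatchWatcher, the method names, and the timeout handling are all assumptions made for illustration, and the ZooKeeper Watcher interface is deliberately left out so that the example depends only on the JDK.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical stand-in for a latch-style child watcher.
 * Not Solr's real class; names and behavior are illustrative assumptions.
 */
class TimedLatchWatcher {
    private final Object lock = new Object();
    private boolean notified = false;

    /** Would be called from the ZooKeeper event thread when the watch fires. */
    void process() {
        synchronized (lock) {
            notified = true;
            lock.notifyAll();
        }
    }

    /**
     * Waits until the watcher has been notified or the timeout elapses.
     * Returns true if the notification arrived, false on timeout, so the
     * caller can recheck the queue instead of blocking forever.
     */
    boolean await(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (!notified) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    return false; // timed out; never call wait(0), which blocks indefinitely
                }
                lock.wait(remainingMs); // loop guards against spurious wakeups
            }
            return true;
        }
    }
}
```

A consumer loop would call await with the chosen timeout and then re-read the children of the queue path whether or not the call returned true, so a missed callback delays processing by at most one timeout period rather than stalling it.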

      Attachments

        1. patch-4099.txt
          2 kB
          Raintung Li

          People

            Assignee: Mark Miller (markrmiller@gmail.com)
            Reporter: Raintung Li (raintung.li)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Original Estimate: 12h
                Remaining Estimate: 12h
                Time Spent: Not Specified