Solr
  1. Solr
  2. SOLR-6336

DistributedQueue (and it's use in OCP) leaks ZK Watches

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      The current DistributedQueue implementation leaks ZK watches whenever it finds children or times out on finding one. OCP uses this in its event loop and can loop tight in some conditions (when exclusivity checks fail), leading to lots of watches which get triggered together on the next event (could be a while for some activities like shard splitting).

      This gets exposed by SOLR-6261 which spawns a new thread for every parallel watch event.

        Issue Links

          Activity

          Hide
          ASF GitHub Bot added a comment -

          GitHub user andyetitmoves opened a pull request:

          https://github.com/apache/lucene-solr/pull/80

          Reuse watcher in DistributedQueue across peek/take

          Initial patch for SOLR-6336. Some more work can probably be done (other functions in the queue still do this, and probably tests would be good to check this in general), but here's an initial fix which passes tests and fixes Jenkins failures currently happening..

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/bloomberg/lucene-solr fix-watch-leak

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/80.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #80


          commit d61f5de2056a4ea841c2d637b9def1c2cb8597b0
          Author: Ramkumar Aiyengar <raiyengar@bloomberg.net>
          Date: 2014-08-07T12:30:22Z

          Reuse watcher in DistributedQueue across peek/take


          Show
          ASF GitHub Bot added a comment - GitHub user andyetitmoves opened a pull request: https://github.com/apache/lucene-solr/pull/80 Reuse watcher in DistributedQueue across peek/take Initial patch for SOLR-6336 . Some more work can probably be done (other functions in the queue still do this, and probably tests would be good to check this in general), but here's an initial fix which passes tests and fixes Jenkins failures currently happening.. You can merge this pull request into a Git repository by running: $ git pull https://github.com/bloomberg/lucene-solr fix-watch-leak Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/80.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #80 commit d61f5de2056a4ea841c2d637b9def1c2cb8597b0 Author: Ramkumar Aiyengar <raiyengar@bloomberg.net> Date: 2014-08-07T12:30:22Z Reuse watcher in DistributedQueue across peek/take
          Hide
          Ramkumar Aiyengar added a comment -

          other functions in the queue still do this

          That's only offer, which is best managed by the caller as the watcher depends on the arguments passed in. In any case, that's handled reasonably well by CollectionsHandler which uses it.

          Show
          Ramkumar Aiyengar added a comment - other functions in the queue still do this That's only offer , which is best managed by the caller as the watcher depends on the arguments passed in. In any case, that's handled reasonably well by CollectionsHandler which uses it.
          Hide
          ASF subversion and git services added a comment -

          Commit 1616654 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1616654 ]

          SOLR-6336: DistributedQueue can easily create too many ZooKeeper Watches. (closes #80)

          Show
          ASF subversion and git services added a comment - Commit 1616654 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1616654 ] SOLR-6336 : DistributedQueue can easily create too many ZooKeeper Watches. (closes #80)
          Hide
          ASF subversion and git services added a comment -

          Commit 1616655 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1616655 ]

          SOLR-6336: DistributedQueue can easily create too many ZooKeeper Watches. (closes #80)

          Show
          ASF subversion and git services added a comment - Commit 1616655 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1616655 ] SOLR-6336 : DistributedQueue can easily create too many ZooKeeper Watches. (closes #80)
          Hide
          ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/lucene-solr/pull/80

          Show
          ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/lucene-solr/pull/80
          Hide
          ASF GitHub Bot added a comment -

          GitHub user andyetitmoves opened a pull request:

          https://github.com/apache/lucene-solr/pull/81

          Cache children as well so that they can be returned when the watcher is reused

          Fixes an issue with apache/lucene-solr#80 pulled to SOLR-6336. If the first `getChildren` actually returns nodes, and the second request happens before the watch is fired, currently it will return no children.

          The rest of the patch is just minor code cleanup.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/bloomberg/lucene-solr trunk-reuse-latch-fix

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/81.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #81


          commit f603e7506d1fe5d956a75cdb13897b1b7af7ac70
          Author: Ramkumar Aiyengar <andyetitmoves@gmail.com>
          Date: 2014-08-08T07:44:10Z

          Cache children as well so that they can be returned when the watcher is reused


          Show
          ASF GitHub Bot added a comment - GitHub user andyetitmoves opened a pull request: https://github.com/apache/lucene-solr/pull/81 Cache children as well so that they can be returned when the watcher is reused Fixes an issue with apache/lucene-solr#80 pulled to SOLR-6336 . If the first `getChildren` actually returns nodes, and the second request happens before the watch is fired, currently it will return no children. The rest of the patch is just minor code cleanup. You can merge this pull request into a Git repository by running: $ git pull https://github.com/bloomberg/lucene-solr trunk-reuse-latch-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/81.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #81 commit f603e7506d1fe5d956a75cdb13897b1b7af7ac70 Author: Ramkumar Aiyengar <andyetitmoves@gmail.com> Date: 2014-08-08T07:44:10Z Cache children as well so that they can be returned when the watcher is reused
          Hide
          Ramkumar Aiyengar added a comment -

          Fixed a bug with children not being returned when the watcher is reused.

          Show
          Ramkumar Aiyengar added a comment - Fixed a bug with children not being returned when the watcher is reused.
          Hide
          ASF subversion and git services added a comment -

          Commit 1616771 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1616771 ]

          SOLR-6336: Cache children as well so that they can be returned when the watcher is reused.

          Show
          ASF subversion and git services added a comment - Commit 1616771 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1616771 ] SOLR-6336 : Cache children as well so that they can be returned when the watcher is reused.
          Hide
          ASF subversion and git services added a comment -

          Commit 1616772 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1616772 ]

          SOLR-6336: Cache children as well so that they can be returned when the watcher is reused.

          Show
          ASF subversion and git services added a comment - Commit 1616772 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1616772 ] SOLR-6336 : Cache children as well so that they can be returned when the watcher is reused.
          Hide
          Mark Miller added a comment -

          Thanks Ramkumar!

          Show
          Mark Miller added a comment - Thanks Ramkumar!
          Hide
          Ramkumar Aiyengar added a comment -

          I raised SOLR-6370 to catch such issues proactively.

          Show
          Ramkumar Aiyengar added a comment - I raised SOLR-6370 to catch such issues proactively.
          Hide
          Jessica Cheng Mallet added a comment -

          Please let me know if I'm supposed to open a new issue (not sure what the policy is).

          I'm encountering a bug from this patch where now I'm stuck in a loop making getChildren() request to zookeeper with this thread dump:

          Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
          java.lang.Object.wait()
          org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
          org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
          org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
          org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
          org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
          org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
          org.apache.solr.cloud.DistributedQueue.getChildren(long)
          org.apache.solr.cloud.DistributedQueue.peek(long)
          org.apache.solr.cloud.DistributedQueue.peek(boolean)
          org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
          java.lang.Thread.run()

          Looking at the code, I think the issue is that LatchChildWatcher#process always sets the event to its member, regardless of its type, but the problem is that once an event is set, the await no longer waits. In this state, the while loop in getChildren(long), when called with wait being Integer.MAX_VALUE will come back, NOT wait at await because event != null, but then it still will not get any children.

          while (true) {
          if (!children.isEmpty()) break;
          watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
          if (watcher.getWatchedEvent() != null)

          Unknown macro: { children = orderedChildren(null); }

          if (wait != Long.MAX_VALUE) break;
          }

          I think the fix would be to only set the event in the watcher if the type is a NodeChildrenChanged.

          Show
          Jessica Cheng Mallet added a comment - Please let me know if I'm supposed to open a new issue (not sure what the policy is). I'm encountering a bug from this patch where now I'm stuck in a loop making getChildren() request to zookeeper with this thread dump: Thread-51 [WAITING] CPU time: 1d 15h 0m 57s java.lang.Object.wait() org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration) org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher) org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation) org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean) org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher) org.apache.solr.cloud.DistributedQueue.getChildren(long) org.apache.solr.cloud.DistributedQueue.peek(long) org.apache.solr.cloud.DistributedQueue.peek(boolean) org.apache.solr.cloud.Overseer$ClusterStateUpdater.run() java.lang.Thread.run() Looking at the code, I think the issue is that LatchChildWatcher#process always sets the event to its member, regardless of its type, but the problem is that once an event is set, the await no longer waits. In this state, the while loop in getChildren(long), when called with wait being Integer.MAX_VALUE will come back, NOT wait at await because event != null, but then it still will not get any children. while (true) { if (!children.isEmpty()) break; watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait); if (watcher.getWatchedEvent() != null) Unknown macro: { children = orderedChildren(null); } if (wait != Long.MAX_VALUE) break; } I think the fix would be to only set the event in the watcher if the type is a NodeChildrenChanged.
          Hide
          Shawn Heisey added a comment -

          Jessica Cheng Mallet, because this issue is resolved and you're having what looks to you like a related bug, it's standard practice to open a new issue. Sometimes discussion may continue on an issue after it's resolved, typically that would be for clarification purposes, to decide whether a new issue should be filed. You can link the new issue to this as a related issue after you create it.

          I do not understand the cloud/zookeeper internals well enough to know what you were saying above. One day I hope to.

          Show
          Shawn Heisey added a comment - Jessica Cheng Mallet , because this issue is resolved and you're having what looks to you like a related bug, it's standard practice to open a new issue. Sometimes discussion may continue on an issue after it's resolved, typically that would be for clarification purposes, to decide whether a new issue should be filed. You can link the new issue to this as a related issue after you create it. I do not understand the cloud/zookeeper internals well enough to know what you were saying above. One day I hope to.
          Hide
          Jessica Cheng Mallet added a comment -

          Thanks for the clarification Shawn Heisey! I'll open a new issue. Thanks!

          Show
          Jessica Cheng Mallet added a comment - Thanks for the clarification Shawn Heisey ! I'll open a new issue. Thanks!
          Hide
          Mark Miller added a comment -

          Best rule of thumb: if the issue is released, new issue and link it to the related one. If it's not released, reopen the issue.

          Show
          Mark Miller added a comment - Best rule of thumb: if the issue is released, new issue and link it to the related one. If it's not released, reopen the issue.
          Hide
          Jessica Cheng Mallet added a comment -

          Got it! Thanks!

          Show
          Jessica Cheng Mallet added a comment - Got it! Thanks!
          Hide
          ASF GitHub Bot added a comment -

          Github user andyetitmoves closed the pull request at:

          https://github.com/apache/lucene-solr/pull/81

          Show
          ASF GitHub Bot added a comment - Github user andyetitmoves closed the pull request at: https://github.com/apache/lucene-solr/pull/81

            People

            • Assignee:
              Mark Miller
              Reporter:
              Ramkumar Aiyengar
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development