  1. Solr
  2. SOLR-6631

DistributedQueue spinning on calling zookeeper getChildren()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.4, 5.0
    • Component/s: SolrCloud
    • Labels:

      Description

      The change from SOLR-6336 introduced a bug that leaves the Overseer stuck in a loop making getChildren() requests to ZooKeeper. The thread dump looks like this:

      Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
      java.lang.Object.wait()
      org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
      org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
      org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
      org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
      org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
      org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
      org.apache.solr.cloud.DistributedQueue.getChildren(long)
      org.apache.solr.cloud.DistributedQueue.peek(long)
      org.apache.solr.cloud.DistributedQueue.peek(boolean)
      org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
      java.lang.Thread.run()

      Looking at the code, I think the issue is that LatchChildWatcher#process always stores the incoming event in its member variable event, regardless of the event's type. Once that member is set, await no longer waits. In this state, the while loop in getChildren(long), when called with wait equal to Long.MAX_VALUE, loops back and does NOT block in await because event != null, yet it still does not get any children, so it repeats indefinitely.

      while (true) {
        if (!children.isEmpty()) break;
        watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
        if (watcher.getWatchedEvent() != null) {
          children = orderedChildren(null);
        }
        if (wait != Long.MAX_VALUE) break;
      }

      I think the fix would be to set the event in the watcher only if its type is not None, since a None event reports a connection state change rather than a change to the queue's children.
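
To make the proposed fix concrete, here is a minimal, self-contained sketch of a latch-style watcher that ignores None events. It is not the code from the attached patch. The class LatchWatcherSketch, its nested EventType enum (a stand-in for ZooKeeper's Watcher.Event.EventType), and the ignoreNone flag are all hypothetical names introduced for illustration, so that the buggy and the proposed behavior can be compared without a ZooKeeper dependency.

```java
// Illustrative sketch only: every type here is a hypothetical stand-in,
// not Solr's LatchChildWatcher or ZooKeeper's Watcher API.
class LatchWatcherSketch {

    // Stand-in for ZooKeeper's event types; None models a connection-state event.
    enum EventType { None, NodeChildrenChanged }

    static final class LatchWatcher {
        private final Object lock = new Object();
        private final boolean ignoreNone; // true = proposed fix, false = reported behavior
        private EventType event;          // null until an event has been latched

        LatchWatcher(boolean ignoreNone) {
            this.ignoreNone = ignoreNone;
        }

        void process(EventType type) {
            // Proposed fix: a None event says nothing about the children,
            // so it must not release the latch.
            if (ignoreNone && type == EventType.None) {
                return;
            }
            synchronized (lock) {
                event = type;
                lock.notifyAll();
            }
        }

        // Returns immediately if an event is already latched; otherwise waits
        // up to timeoutMs for one (a single wait, as a sketch).
        void await(long timeoutMs) throws InterruptedException {
            synchronized (lock) {
                if (event != null) {
                    return;
                }
                lock.wait(timeoutMs);
            }
        }

        EventType getWatchedEvent() {
            synchronized (lock) {
                return event;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LatchWatcher buggy = new LatchWatcher(false);
        LatchWatcher fixed = new LatchWatcher(true);

        // Both watchers receive a connection-state event first.
        buggy.process(EventType.None);
        fixed.process(EventType.None);

        // The buggy watcher has latched None, so every later await() returns at once.
        buggy.await(50);
        System.out.println("buggy latched: " + buggy.getWatchedEvent());

        // The fixed watcher still blocks (here for up to 50 ms) and stays unlatched.
        fixed.await(50);
        System.out.println("fixed latched: " + fixed.getWatchedEvent());

        // A real child change still releases the fixed watcher.
        fixed.process(EventType.NodeChildrenChanged);
        System.out.println("fixed after change: " + fixed.getWatchedEvent());
    }
}
```

With ignoreNone set to false, a single None event leaves the member set, so a loop like the one quoted above would call await and orderedChildren back to back without ever blocking. With ignoreNone set to true, the loop would fall back to waiting for the timeout until a child-related event arrives.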

        Attachments

        1. SOLR-6631.patch
          9 kB
          Timothy Potter
        2. SOLR-6631.patch
          5 kB
          Timothy Potter

              People

              • Assignee:
                thelabdude Timothy Potter
                Reporter:
                mewmewball Jessica Cheng Mallet
              • Votes:
                0
                Watchers:
                9