Solr
  1. Solr
  2. SOLR-6631

DistributedQueue spinning on calling zookeeper getChildren()

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.4, 5.0
    • Component/s: SolrCloud
    • Labels:

      Description

      The change from SOLR-6336 introduced a bug where now I'm stuck in a loop making getChildren() request to zookeeper with this thread dump:

      Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
      java.lang.Object.wait()
      org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
      org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
      org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
      org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
      org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
      org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
      org.apache.solr.cloud.DistributedQueue.getChildren(long)
      org.apache.solr.cloud.DistributedQueue.peek(long)
      org.apache.solr.cloud.DistributedQueue.peek(boolean)
      org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
      java.lang.Thread.run()

      Looking at the code, I think the issue is that LatchChildWatcher#process always sets the event to its member variable event, regardless of its type, but the problem is that once the member event is set, the await no longer waits. In this state, the while loop in getChildren(long), when called with wait being Integer.MAX_VALUE will loop back, NOT wait at await because event != null, but then it still will not get any children.

      while (true) {
      if (!children.isEmpty()) break;
      watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
      if (watcher.getWatchedEvent() != null)
      { children = orderedChildren(null); }
      if (wait != Long.MAX_VALUE) break;
      }

      I think the fix would be to only set the event in the watcher if the type is not None.

      1. SOLR-6631.patch
        9 kB
        Timothy Potter
      2. SOLR-6631.patch
        5 kB
        Timothy Potter

        Issue Links

          Activity

          Hide
          Timothy Potter added a comment -

          Here's a cut of a patch with the start of a unit test for DistributedQueue. I'm curious how we might be able to trigger a "None" event to get raised so I can add that to the unit test? Mainly I want to reproduce the scenario described in the bug report in the unit test if possible, but wasn't able to get ZK to raise a "None" event.

          Show
          Timothy Potter added a comment - Here's a cut of a patch with the start of a unit test for DistributedQueue. I'm curious how we might be able to trigger a "None" event to get raised so I can add that to the unit test? Mainly I want to reproduce the scenario described in the bug report in the unit test if possible, but wasn't able to get ZK to raise a "None" event.
          Hide
          Hoss Man added a comment -

          I'm curious how we might be able to trigger a "None" event to get raised so I can add that to the unit test?

          IIUC: "EventType.None" only happens when there are "session level events" – ie: KeeperState in (Disconnected, SyncConnected (reconnected), Expired ).

          I think the fix would be to only set the event in the watcher if the type is not None.

          I'm not very familiar with this code, perhaps a better approach would be to only proceed with this code if the EventType recived is in an explicit set of expected types?

          whichever way makes sense, one trick i find very helpful in situations like this (a: dealing with enums from third party packages; b: wanting to behave according to partitions of the enum space) is to not just "do X if state in (A) " but "do X if state in (A) else no-op if state in (B) else ERROR" so that if someone upgrades zookeeper and there are suddenly all new EventTypes we don't expect, they aren't silently ignored.

          the EnumSet.allOf() and EnumSet.complimentsOf() methods can also help write very targetted unit tests to alert you to unexpected values as soon as you upgrade.

          So for example...

          public class DistributedQueue {
            public static final EnumSet<EventType> EXPECTED_EVENTS = EnumSet.of(...);
            public static final EnumSet<EventType> IGNORED_EVENTS = EnumSet.of(...);
            ...
              if (EXPECTED_EVENTS.contains(event.getType()) {
                // do stuff
                ...
              } else if (IGNORED_EVENTS.contains(event.getType()) {
                // NO-OP
              } else {
                log.error("WTF EVENT IS THIS? " + ...)
              }
            ...
          }
          public class TestDistributedQueue {
            ...
            /**
             * if this test fails, don't change it - go audit these EnumSets and all their usages
             */
            public void testSanityOfEventTypes() {
              EnumSet<EventType> known = EnumSet.copyOf(DistributedQueue.EXPECTED_EVENTS);
              known.addAll(DistributedQueue.IGNORED_EVENTS);
          
              EnumSet<EventType> unknown = EnumSet.complementOf(known);
              assertEquals("un-known EventTypes found, zk upgrade?", EnumSet.noneOf(EventType.class), unknown)
            }
          
          Show
          Hoss Man added a comment - I'm curious how we might be able to trigger a "None" event to get raised so I can add that to the unit test? IIUC: "EventType.None" only happens when there are "session level events" – ie: KeeperState in (Disconnected, SyncConnected (reconnected), Expired ). I think the fix would be to only set the event in the watcher if the type is not None. I'm not very familiar with this code, perhaps a better approach would be to only proceed with this code if the EventType recived is in an explicit set of expected types? whichever way makes sense, one trick i find very helpful in situations like this (a: dealing with enums from third party packages; b: wanting to behave according to partitions of the enum space) is to not just "do X if state in (A) " but "do X if state in (A) else no-op if state in (B) else ERROR" so that if someone upgrades zookeeper and there are suddenly all new EventTypes we don't expect, they aren't silently ignored. the EnumSet.allOf() and EnumSet.complimentsOf() methods can also help write very targetted unit tests to alert you to unexpected values as soon as you upgrade. So for example... public class DistributedQueue { public static final EnumSet<EventType> EXPECTED_EVENTS = EnumSet.of(...); public static final EnumSet<EventType> IGNORED_EVENTS = EnumSet.of(...); ... if (EXPECTED_EVENTS.contains(event.getType()) { // do stuff ... } else if (IGNORED_EVENTS.contains(event.getType()) { // NO-OP } else { log.error( "WTF EVENT IS THIS? " + ...) } ... } public class TestDistributedQueue { ... /** * if this test fails, don't change it - go audit these EnumSets and all their usages */ public void testSanityOfEventTypes() { EnumSet<EventType> known = EnumSet.copyOf(DistributedQueue.EXPECTED_EVENTS); known.addAll(DistributedQueue.IGNORED_EVENTS); EnumSet<EventType> unknown = EnumSet.complementOf(known); assertEquals( "un-known EventTypes found, zk upgrade?" , EnumSet.noneOf(EventType.class), unknown) }
          Hide
          Timothy Potter added a comment -

          Thanks for the feedback Hoss. I was actually wondering if it would suffice to handle NodeChildrenChanged EventTypes in the LatchChildWatcher process method, i.e. change the code to:

          if (eventType == Event.EventType.NodeChildrenChanged)

          { ... }

          Mark Miller or Ramkumar Aiyengar do either of you have any insight you can share on this? Specifically, I'd like to change the LatchChildWatcher.process to set the event member and notifyAll only if the EventType is NodeChildrenChanged, i.e.

              @Override
              public void process(WatchedEvent event) {
                Event.EventType eventType = event.getType();
                LOG.info("LatchChildWatcher fired on path: " + event.getPath() + " state: "
                    + event.getState() + " type " + eventType);
                if (eventType == Event.EventType.NodeChildrenChanged) {
                  synchronized (lock) {
                    this.event = event;
                    lock.notifyAll();
                  }
                }
              }
          

          Or do we need to handle the other event types and just not affect the event if the type is None as originally suggested by Jessica Cheng Mallet?

          Need to get this one committed soon

          Show
          Timothy Potter added a comment - Thanks for the feedback Hoss. I was actually wondering if it would suffice to handle NodeChildrenChanged EventTypes in the LatchChildWatcher process method, i.e. change the code to: if (eventType == Event.EventType.NodeChildrenChanged) { ... } Mark Miller or Ramkumar Aiyengar do either of you have any insight you can share on this? Specifically, I'd like to change the LatchChildWatcher.process to set the event member and notifyAll only if the EventType is NodeChildrenChanged, i.e. @Override public void process(WatchedEvent event) { Event.EventType eventType = event.getType(); LOG.info( "LatchChildWatcher fired on path: " + event.getPath() + " state: " + event.getState() + " type " + eventType); if (eventType == Event.EventType.NodeChildrenChanged) { synchronized (lock) { this .event = event; lock.notifyAll(); } } } Or do we need to handle the other event types and just not affect the event if the type is None as originally suggested by Jessica Cheng Mallet ? Need to get this one committed soon
          Hide
          Mark Miller added a comment -

          if (eventType == Event.EventType.NodeChildrenChanged) {

          +1 - we are only interested in waiting around to see a child added - this watcher should not need to consider other events.

          Show
          Mark Miller added a comment - if (eventType == Event.EventType.NodeChildrenChanged) { +1 - we are only interested in waiting around to see a child added - this watcher should not need to consider other events.
          Hide
          ASF subversion and git services added a comment -

          Commit 1635131 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1635131 ]

          SOLR-6631: DistributedQueue spinning on calling zookeeper getChildren()

          Show
          ASF subversion and git services added a comment - Commit 1635131 from Timothy Potter in branch 'dev/trunk' [ https://svn.apache.org/r1635131 ] SOLR-6631 : DistributedQueue spinning on calling zookeeper getChildren()
          Hide
          ASF subversion and git services added a comment -

          Commit 1635142 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1635142 ]

          SOLR-6631: DistributedQueue spinning on calling zookeeper getChildren()

          Show
          ASF subversion and git services added a comment - Commit 1635142 from Timothy Potter in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1635142 ] SOLR-6631 : DistributedQueue spinning on calling zookeeper getChildren()
          Hide
          ASF subversion and git services added a comment -

          Commit 1635155 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1635155 ]

          Backout fix for SOLR-6631 as things like create collection are hanging now

          Show
          ASF subversion and git services added a comment - Commit 1635155 from Timothy Potter in branch 'dev/trunk' [ https://svn.apache.org/r1635155 ] Backout fix for SOLR-6631 as things like create collection are hanging now
          Hide
          ASF subversion and git services added a comment -

          Commit 1635157 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1635157 ]

          Backout fix for SOLR-6631 as things like create collection are hanging now

          Show
          ASF subversion and git services added a comment - Commit 1635157 from Timothy Potter in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1635157 ] Backout fix for SOLR-6631 as things like create collection are hanging now
          Hide
          Jessica Cheng Mallet added a comment -

          I originally thought NodeChildrenChanged would be enough too, but it made the tests hang forever. That's when I realized that the zk.exist() call in offer() also uses this watcher, so it's not enough to just watch for NodeChildrenChanged.

          We can either make the watcher set all not None events (None events don't remove watches, so they need to be excluded), or use a different kind of watch in the zk.exist() call.

          Show
          Jessica Cheng Mallet added a comment - I originally thought NodeChildrenChanged would be enough too, but it made the tests hang forever. That's when I realized that the zk.exist() call in offer() also uses this watcher, so it's not enough to just watch for NodeChildrenChanged. We can either make the watcher set all not None events (None events don't remove watches, so they need to be excluded), or use a different kind of watch in the zk.exist() call.
          Hide
          Mark Miller added a comment -

          Whoops - glossed over that in my IDE call hierarchy.

          a different kind of watch in the zk.exist() call

          I lean towards that - subclass or something - I think it's better for the current name of the watcher and think it's better the watch only processes the events it is actually interested in.

          Not a real big deal either way, but in the other case, let's change the name of the Watcher.

          Show
          Mark Miller added a comment - Whoops - glossed over that in my IDE call hierarchy. a different kind of watch in the zk.exist() call I lean towards that - subclass or something - I think it's better for the current name of the watcher and think it's better the watch only processes the events it is actually interested in. Not a real big deal either way, but in the other case, let's change the name of the Watcher.
          Hide
          Timothy Potter added a comment -

          Thanks for the help Mark Miller and Jessica Cheng Mallet! I think this patch is ready to go (and no hangs this time)! Please give a quick review and I'll get it committed.

          Show
          Timothy Potter added a comment - Thanks for the help Mark Miller and Jessica Cheng Mallet ! I think this patch is ready to go (and no hangs this time)! Please give a quick review and I'll get it committed.
          Hide
          Mark Miller added a comment -

          +1, looks great to me.

          Show
          Mark Miller added a comment - +1, looks great to me.
          Hide
          Jessica Cheng Mallet added a comment -

          +1, thanks Tim!

          Show
          Jessica Cheng Mallet added a comment - +1, thanks Tim!
          Hide
          ASF subversion and git services added a comment -

          Commit 1635573 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1635573 ]

          SOLR-6631: Part deux - refactor LatchChildWatcher to LatchWatcher that takes an optional EventType during construction to only release the latch when a specific event type is received.

          Show
          ASF subversion and git services added a comment - Commit 1635573 from Timothy Potter in branch 'dev/trunk' [ https://svn.apache.org/r1635573 ] SOLR-6631 : Part deux - refactor LatchChildWatcher to LatchWatcher that takes an optional EventType during construction to only release the latch when a specific event type is received.
          Hide
          ASF subversion and git services added a comment -

          Commit 1635576 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1635576 ]

          SOLR-6631: Part deux - refactor LatchChildWatcher to LatchWatcher that takes an optional EventType during construction to only release the latch when a specific event type is received.

          Show
          ASF subversion and git services added a comment - Commit 1635576 from Timothy Potter in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1635576 ] SOLR-6631 : Part deux - refactor LatchChildWatcher to LatchWatcher that takes an optional EventType during construction to only release the latch when a specific event type is received.
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.
          Hide
          Shalin Shekhar Mangar added a comment -

          Reopening to backport to 4.10.4

          Show
          Shalin Shekhar Mangar added a comment - Reopening to backport to 4.10.4
          Hide
          ASF subversion and git services added a comment -

          Commit 1662429 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1662429 ]

          SOLR-6631: DistributedQueue spinning on calling zookeeper getChildren()

          Show
          ASF subversion and git services added a comment - Commit 1662429 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1662429 ] SOLR-6631 : DistributedQueue spinning on calling zookeeper getChildren()
          Hide
          ASF subversion and git services added a comment -

          Commit 1662435 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1662435 ]

          SOLR-6631: Fix bad merge

          Show
          ASF subversion and git services added a comment - Commit 1662435 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1662435 ] SOLR-6631 : Fix bad merge
          Hide
          Michael McCandless added a comment -

          Bulk close for 4.10.4 release

          Show
          Michael McCandless added a comment - Bulk close for 4.10.4 release

            People

            • Assignee:
              Timothy Potter
              Reporter:
              Jessica Cheng Mallet
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development