
SOLR-5436: Eliminate the 1500ms wait in overseer loop

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, Trunk
    • Component/s: None
    • Labels: None

      Description

      The Overseer thread waits 1500 ms before it polls for new events. The wait should be eliminated; the loop should simply block until new events arrive, the way it is done in OverseerCollectionProcessor.
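
      As a stand-alone illustration of the idea (this is not the Overseer code; the queue and the event strings are made up), the loop can block for the first event instead of sleeping, then drain whatever has piled up and publish a single cluster state update for the whole batch:

      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;

      // Stand-alone sketch, not Solr code: block on the work queue until an event
      // arrives instead of sleeping 1500 ms between polls, then drain the backlog
      // and publish once per batch.
      public class BlockingOverseerLoopSketch {
        public static void main(String[] args) throws InterruptedException {
          BlockingQueue<String> stateUpdateQueue = new LinkedBlockingQueue<>();
          stateUpdateQueue.add("core A is now ACTIVE");
          stateUpdateQueue.add("core B is now RECOVERING");

          String head = stateUpdateQueue.take();   // blocks until an event exists
          int processed = 0;
          while (head != null) {
            processed++;                           // stand-in for applying the state change
            head = stateUpdateQueue.poll();        // non-blocking drain of the backlog
          }
          System.out.println("published one cluster state update covering " + processed + " events");
        }
      }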

      Attachments

      1. SOLR-5436.patch
        7 kB
        Noble Paul
      2. SOLR-5436.patch
        10 kB
        Noble Paul
      3. SOLR-5436.patch
        5 kB
        Noble Paul
      4. SOLR-5436.patch
        7 kB
        Noble Paul
      5. SOLR-5436.patch
        9 kB
        Mark Miller
      6. SOLR-5436.patch
        12 kB
        Mark Miller
      7. SOLR-5436.patch
        13 kB
        Mark Miller
      8. SOLR-5436.patch
        13 kB
        Mark Miller
      9. SOLR-5436.patch
        22 kB
        Mark Miller
      10. SOLR-5436.patch
        22 kB
        Mark Miller

        Activity

        Shalin Shekhar Mangar added a comment -

        As I understand it, the reason behind the wait is to group the processing of multiple events together so that the cluster state change is not broadcast to all watching nodes too many times. But this can work along with splitting the cluster state in SOLR-5381.

        Noble Paul added a comment -

        Shalin Shekhar Mangar: Originally we did not have a blocking lookup of the queue; that was also one of the reasons.

        Mark Miller added a comment -

        Originally we did not have a blocking lookup of the queue

        Right - the history is that Sami wrote the Overseer loop first, later I wrote the OverseerCollectionProcessor loop - I didn't like the polling and used the blocking lookup of the queue and just never got to catching the Overseer loop impl up.

        Noble Paul added a comment -

        What if there are state change events happening in quick succession (like a rolling restart)? Will it not lead to too many updates getting propagated to all the nodes? Is it a real problem?

        Mark Miller added a comment -

        Shalin's original comment? Yeah, I suppose it does make sense. How much of a problem it is probably depends on the scale.

        Noble Paul added a comment -

        In my patch, any event would get processed within STATE_UPDATE_DELAY, and all normal events would get processed within 100 ms.

        Noble Paul added a comment -

        Added a new method, peek(waitTime), to DistributedQueue.
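
        As a rough, self-contained model of what a peek(waitTime) offers (not the committed patch; a local latch stands in for the ZooKeeper child watcher the real DistributedQueue would use): look at the head of the queue and, if it is empty, wait up to waitTime milliseconds to be notified before looking again.

        import java.util.ArrayDeque;
        import java.util.Deque;

        // Illustrative model only, not the SOLR-5436 patch.
        class TimedPeekQueueSketch {
          private final Deque<byte[]> items = new ArrayDeque<>();
          private final Object latch = new Object();

          byte[] peek(long waitTime) throws InterruptedException {
            synchronized (latch) {
              if (items.isEmpty()) {
                latch.wait(waitTime);     // woken by offer(), or times out
              }
              return items.peekFirst();   // may still be null after a timeout
            }
          }

          void offer(byte[] data) {
            synchronized (latch) {
              items.addLast(data);
              latch.notifyAll();          // wake anyone blocked in peek(waitTime)
            }
          }
        }

        A caller that wants to group bursts of events can then block indefinitely for the first event and use a short waitTime (e.g. 100 ms) for follow-up events before publishing one combined update.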

        Noble Paul added a comment - edited

        This time with all the tests passing.

        Mark Miller, please do a review. If it is fine I can check it in soon.

        Mark Miller added a comment -

        Yeah, I'll take a look today.

        Mark Miller added a comment -

        I've looked it over, but I have to look closer at a couple things and I got tied up in other issues. Let me take a closer look tomorrow.

        Noble Paul added a comment -

        Optimized the DistributedQueue code.

        Mark Miller added a comment -

        I'm reviewing the latest patch.

        Mark Miller added a comment -

        Looks good.

        Here is the same patch, but with some minor cleanup, and it eliminates the 5-second poll of the main queue.

        Noble Paul added a comment - edited

        I kept the 5-second peek() on purpose, because if the Overseer changes and no state event happens, there is no way for the node to know it. So let's keep it.

        Mark Miller added a comment -

        Because if the Overseer changes and no state event happens, there is no way for the node to know it.

        There is no reason the Overseer needs to poll every 5 seconds to avoid this. We know when a new Overseer is elected.

        If we keep the 5 second poll of the queue, this issue gains very little IMO.

        Mark Miller added a comment -

        If for some odd reason the Overseer stops being the Overseer without losing its connection to ZooKeeper, this patch will stop the current Overseer from running when it's alerted that the Overseer has changed.

        Mark Miller added a comment -

        Keep in mind, that latch watcher will fire on zk connection events as well as when items are added to the queue.
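
        To illustrate the point (a sketch, not Solr's actual latch watcher): process() is invoked for connection-state events (EventType.None) as well as for queue activity, so a blocked waiter is woken by both and has to re-check the queue and the connection when it wakes up.

        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;

        // Illustrative latch-style watcher, not Solr's exact implementation.
        class LatchWatcherSketch implements Watcher {
          private final Object lock = new Object();

          @Override
          public void process(WatchedEvent event) {
            synchronized (lock) {
              // EventType.None           -> connection/session state change
              // NodeChildrenChanged etc. -> something happened on the queue node
              lock.notifyAll();
            }
          }

          public void await(long timeoutMs) throws InterruptedException {
            synchronized (lock) {
              lock.wait(timeoutMs);   // returns on queue activity, connection change, or timeout
            }
          }
        }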

        Noble Paul added a comment -

        Yeah, the Latch Watcher gets a callback, but peek(true) would not return unless some entry is added to the queue. What happens when the Overseer changes and no entry is added to the queue?

        Mark Miller added a comment -

        Ah, right - good point. We don't catch the watcher connection event and have peek fail.

        The latest change should handle this case though. Either the connection to ZK is gone, and the peek call will fail when it retries after getting the connection watcher event; or the connection might have failed and come back before the retry tries to talk to ZK - in that case, the code change I just made should kill the current Overseer threads, either when we are alerted the Overseer has changed or when we start a new Overseer because the ZK connection was expired.

        Mark Miller added a comment -

        New patch - I was a little off there.

        For the case where the queue loop does not fail on a down zookeeper because it bounced quickly or something, I said that when we start a new overseer, that will close the old threads - but that only happens if we are elected the overseer. This patch stops any local overseer threads on joining the overseer election.

        Noble Paul added a comment - edited

        Does this patch take care of this scenario?

        The Overseer went into a GC pause or something, other nodes assume it is down and re-elect another Overseer, and there is no event in the queue, so this Overseer is still waiting on the queue.

        I don't see the exit condition from the loop in the peek() method.

        Mark Miller added a comment -

        The Overseer went into a GC pause or something, other nodes assume it is down and re-elect another Overseer.

        This will only happen if our ephemeral overseer leader node goes down - which means we lost our connection to zookeeper. If we lose our connection to zookeeper, the queue thread will exit trying to talk to zk. If we reconnect to zookeeper before the queue loop has a chance to fail (some kind of wicked quick flap), we will stop the overseer threads on getting in line to be the Overseer again. Also, if zk tells us the overseer leader went down, we stop any overseer threads we might have.

        I think it's about as strong as the poll.

        Mark Miller added a comment -

        I don't see the exit condition from the loop in the peek() method

        You shouldn't need one.

        Mark Miller added a comment -

        You could of course explicitly add one - catch the connection change notification from the watcher and then cause the loop to fail on the next run. I don't think it's strictly necessary, but it wouldn't hurt.
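
        A sketch of that optional explicit exit (illustrative only, not the committed code): the watcher records whether it was fired by a connection event, and the peek/retry loop checks that flag before blocking again, so it fails on the next run instead of silently waiting.

        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;

        // Illustrative sketch, not the committed code.
        class ConnectionAwareLatchSketch implements Watcher {
          private final Object lock = new Object();
          private volatile boolean connectionChanged = false;

          @Override
          public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.None) {
              connectionChanged = true;   // connection/session change, not queue activity
            }
            synchronized (lock) {
              lock.notifyAll();           // wake whoever is blocked waiting on the queue
            }
          }

          void throwIfConnectionChanged() throws KeeperException {
            if (connectionChanged) {
              // called by the loop before it retries, so it fails on the next run
              throw new KeeperException.ConnectionLossException();
            }
          }
        }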

        Mark Miller added a comment -

        Updated patch to trunk changes.

        Noble Paul added a comment - edited

        I think that exception handling is better in the peek() method than expecting the consumer of the API to take care of it.

        Either way, it works.

        Mark Miller added a comment -

        better than expecting the consumer of the API to take care of that

        It's really an Overseer specific thing though - I don't think it's a problem for the consumer of the DistributedQueue API. We are dealing with special Overseer considerations, and I don't really see the strategy for that as a consumer.

        A standard consumer of the queue API does not necessarily need to detect this kind of ZK connection flap - it's kind of Overseer specific I think.

        Adding it may actually hurt the DistributedQueue consumer experience - you expect to keep working the queue in the face of ZK connection/disconnection without any special handling.

        Mark Miller added a comment -

        This patch is even better. It ensures we stop any Overseer threads before we reconnect to ZooKeeper on an expiration. This should be even stronger than the 5 second poll.
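
        A minimal sketch of that shape (hypothetical names, not Solr's actual reconnect code): on session expiration, a before-reconnect hook runs first and shuts down any Overseer threads this node still owns, and only then is the new ZooKeeper session established.

        // Hypothetical shape of the reconnect path; names are illustrative only.
        interface BeforeReconnect {
          void command();
        }

        class ReconnectSketch {
          private final BeforeReconnect beforeReconnect;

          ReconnectSketch(BeforeReconnect beforeReconnect) {
            this.beforeReconnect = beforeReconnect;
          }

          void onSessionExpired() {
            if (beforeReconnect != null) {   // guards against the NPE mentioned below
              beforeReconnect.command();     // e.g. close Overseer state-updater threads
            }
            // ...then reconnect to ZooKeeper and rejoin the Overseer election...
          }
        }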

        Mark Miller added a comment -

        One more patch attached - fixes a silly NPE bug when BeforeReconnect was null.

        Noble Paul added a comment -

        Is there anything more we need to do before committing this?

        ASF subversion and git services added a comment -

        Commit 1544255 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1544255 ]

        SOLR-5436: Eliminate the 1500ms wait in overseer loop as well as polling the ZK distributed queue.

        ASF subversion and git services added a comment -

        Commit 1544257 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1544257 ]

        SOLR-5436: Eliminate the 1500ms wait in overseer loop as well as polling the ZK distributed queue.

        Noble Paul added a comment -

        Why is it not yet resolved?


          People

          • Assignee:
            Mark Miller
          • Reporter:
            Noble Paul
          • Votes:
            0
          • Watchers:
            5
