SOLR-7869: Overseer does not handle BadVersionException correctly


      Description

      If /clusterstate.json is modified externally, then the Overseer can go into an infinite loop upon a BadVersionException, alternately trying to execute the main queue and then the work queue:

      ERROR - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer work queue loop
      org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
              at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
              at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
              at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
              at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
              at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
              at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
              at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
              at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:168)
              at java.lang.Thread.run(Thread.java:745)
      INFO  - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = {
        "operation":"state",
        "state":"down",
        "base_url":"http://127.0.1.1:7574/solr",
        "core":"test_shard1_replica1",
        "roles":null,
        "node_name":"127.0.1.1:7574_solr",
        "shard":null,
        "collection":"test",
        "core_node_name":"core_node1"} current state version: 9
      INFO  - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=null message={
        "operation":"state",
        "state":"down",
        "base_url":"http://127.0.1.1:7574/solr",
        "core":"test_shard1_replica1",
        "roles":null,
        "node_name":"127.0.1.1:7574_solr",
        "shard":null,
        "collection":"test",
        "core_node_name":"core_node1"}
      INFO  - 2015-08-04 18:49:56.224; [   ] org.apache.solr.cloud.overseer.ReplicaMutator; shard=shard1 is already registered
      ERROR - 2015-08-04 18:49:56.225; [   ] org.apache.solr.cloud.Overseer$ClusterStateUpdater; Exception in Overseer main queue loop
      org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /clusterstate.json
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
              at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
              at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
              at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
              at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)
              at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:180)
              at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:67)
              at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:286)
              at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:213)
              at java.lang.Thread.run(Thread.java:745)
      INFO  - 2015-08-04 18:49:56.225; [   ] org.apache.solr.common.cloud.ZkStateReader; Updating data for gettingstarted to ver 8
      
      Attachments

      1. SOLR-7869.patch
        23 kB
        Shalin Shekhar Mangar
      2. SOLR-7869.patch
        23 kB
        Shalin Shekhar Mangar
      3. SOLR-7869.patch
        8 kB
        Shalin Shekhar Mangar
      4. SOLR-7869.patch
        3 kB
        Scott Blum

        Activity

        Scott Blum added a comment -

        But what's the right fix? Having looked through the code a bit now, Overseer.ClusterStateUpdater has a very baked-in assumption that no one else is updating cluster state. Copies of ClusterState float around and get updated over and over during processing, with the assumption that the local node is performing an atomic sequence of operations to get to a desired end state. How can external changes be merged in? My impulse was to catch BadVersionException, refresh ClusterState from ZK, then re-apply all the queued updates against the refreshed state. However, I'm afraid that approach violates all of ClusterStateUpdater's assumptions. I think the only thing to do is just clobber whatever is in ZK with what Overseer wants to write, even though that seems less than ideal.
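
        [Editorial note: a minimal sketch of the catch/refresh/re-apply idea described above. Every name here (State, Zk, casWrite, apply) is a hypothetical stand-in, not the actual Overseer or ZkStateWriter code.]

        import java.util.List;

        import org.apache.zookeeper.KeeperException;

        class RetryOnBadVersionSketch {

          // Stand-in for the real ClusterState: just the serialized bytes plus the
          // znode version that the CAS write must match.
          static class State {
            final byte[] json;
            final int zkVersion;
            State(byte[] json, int zkVersion) { this.json = json; this.zkVersion = zkVersion; }
          }

          // Stand-in for SolrZkClient/ZooKeeper access.
          interface Zk {
            State read() throws KeeperException, InterruptedException;
            void casWrite(byte[] data, int expectedVersion) throws KeeperException, InterruptedException;
          }

          static void process(Zk zk, State local, List<byte[]> queuedUpdates)
              throws KeeperException, InterruptedException {
            State state = local;
            while (true) {
              try {
                byte[] merged = apply(state, queuedUpdates); // re-applied on every retry
                zk.casWrite(merged, state.zkVersion);        // throws BadVersionException if stale
                return;
              } catch (KeeperException.BadVersionException e) {
                // external write detected: refresh from ZK, then re-apply the
                // same queued updates against the fresh state
                state = zk.read();
              }
            }
          }

          private static byte[] apply(State s, List<byte[]> updates) {
            return s.json; // merge logic elided in this sketch
          }
        }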

        Scott Blum added a comment -

        Attached a TEST-ONLY patch that reproduces the failure. This is not a fix.

        Shalin Shekhar Mangar added a comment -

        Thanks Scott!

        My impulse was to catch BadVersionException, refresh ClusterState from ZK, then re-apply all the queued updates against the refreshed state.

        That is the right fix. That is how I intended it to work but I obviously didn't write enough tests.

        However, I'm afraid that approach violates all of ClusterUpdater's assumptions.

        Originally the Overseer would force-update the cluster state at the beginning of the loop, apply the updates, and write to ZK. This was wasteful because most of the time the Overseer is the only one writing to ZK state. This is why I introduced a local cluster state which is written to ZK using CAS (compare-and-set), removing the need to refresh the cluster state. If that CAS fails, it means either that someone has changed state externally or that multiple overseers have started processing due to a bug. At this point, we go back to the beginning of the loop, check if we are still leader, force-refresh the cluster state, process the work queue, and then continue on to the main queue.
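
        [Editorial note: ZooKeeper's versioned setData is the CAS primitive being described here; the SolrZkClient.setData call in the stack trace above ultimately delegates to it. A hedged illustration, with the path and helper name chosen for this example:]

        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        class CasWriteExample {
          // Writes new cluster state only if nobody else has modified the znode
          // since we read it at expectedVersion; throws BadVersionException
          // otherwise. Returns the version to use for the next CAS.
          static int writeClusterState(ZooKeeper zk, byte[] data, int expectedVersion)
              throws KeeperException, InterruptedException {
            Stat stat = zk.setData("/clusterstate.json", data, expectedVersion);
            return stat.getVersion();
          }
        }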

        Shalin Shekhar Mangar added a comment -

        We can fix ZkStateWriter easily with the following:

        if (clusterState != null && !prevState.getZkClusterStateVersion().equals(clusterState.getZkClusterStateVersion())) {
          // our local cluster state is out of sync with what is in ZK, re-initialize
          // this is safe because either 1) any pending updates are in #updates and will be applied again, or
          // 2) they have already been written to ZK
          // i.e. it is guaranteed that overwriting cluster state with prevState will not discard any updates
          // that Overseer had performed unless such an act was done externally by the user
          clusterState = prevState;
          isClusterStateModified = true;
        }
        

        However, the test isn't quite right. We can't handle BadVersionException inside ZkStateWriter itself; it has to be propagated to the Overseer, which should apply the logic I outlined in the previous comment. So any test should test at the Overseer level and not just ZkStateWriter.

        Shalin Shekhar Mangar added a comment -

        Test + fix. I tried to reproduce this in OverseerTest but it was proving too difficult. The randomized test I had would reproduce maybe once in five runs, so I went back to the test Scott had written and augmented it.

        1. I wonder if it is better to assert that the given cluster state version is greater than ZkStateWriter's internal cluster state version, instead of blindly using the given cluster state when the versions are not equal.
        2. I also wonder whether a better fix is to re-create the ZkStateWriter object entirely if refreshClusterState is true in the Overseer. The reason: what if a user modifies a collection's state.json directly but doesn't modify /clusterstate.json? In that case, our current fix won't work.
        Scott Blum added a comment -

        1) So this doesn't actually fix anything yet, because there are no changes to Overseer itself? Presumably you'd need to catch BVE in overseer and force-refresh reader clusterState?

        2) Just noting that this seems the opposite of what we discussed earlier. I interpreted your earlier comments to mean that we should blow away the ZK data in favor of the overseer data, since overseer is authoritative. This patch seems to do the opposite, preferring external user changes. To wit: "it is guaranteed that overwriting cluster state with prevState will not discard any updates that Overseer had performed unless such an act was done externally by the user".

        3) In ZkStateWriterTest, I note that ZkStateWriter isn't super amenable to testing; it's kind of subtle that enqueuing an update sometimes causes a flush and sometimes doesn't. Dunno if it's better or worse to have test-visible methods for doing a queue-without-flush and then explicit flush.

        4) In ZkStateWriterTest.testExternalModificationToSharedClusterState(), first try block, you're missing a fail() after the enqueueUpdate call to assert that the exception really did occur (see the sketch after point 5 below). In the first catch block, I'm not sure it's good to log the expected exception; I always find it confusing when tests log exceptions that don't actually cause the test to fail. I would remove the second catch block: if you get any exception other than the one you expect, it's best to just let it escape and let the test framework catch it.

        5) In a similar fashion, I would remove the second try/catch block entirely, keeping just the body of the try. You expect that none of it will throw an exception, so leave it unadorned and the test framework will handle it if one does.
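
        [Editorial note: a sketch of the pattern points 4 and 5 describe. This is a fragment from inside a hypothetical test method; the enqueueUpdate/writePendingUpdates calls and their arguments are illustrative, not the actual test code.]

        import org.apache.zookeeper.KeeperException;
        import static org.junit.Assert.fail;

        // first block: the exception is expected, so fail() if it never arrives
        try {
          writer.enqueueUpdate(reader.getClusterState(), command, null); // hypothetical call
          fail("expected a BadVersionException after the external modification");
        } catch (KeeperException.BadVersionException expected) {
          // expected; no need to log it
        }

        // second block: no surrounding try/catch at all; if this throws,
        // the test framework reports the failure with a full stack trace
        writer.writePendingUpdates();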

        Shalin Shekhar Mangar added a comment -

        1) So this doesn't actually fix anything yet, because there are no changes to Overseer itself? Presumably you'd need to catch BVE in overseer and force-refresh reader clusterState?

        Actually, Overseer already has code that catches all exceptions and refreshes the cluster state. The bug is in ZkStateWriter, which does not use the refreshed cluster state if maybeRefreshBefore returns true, and hence tries to compare-and-set using the outdated cluster state version.

        2) Just noting that this seems the opposite of what we discussed earlier. I interpreted your earlier comments to mean that we should blow away the ZK data in favor of the overseer data, since overseer is authoritative. This patch seems do the opposite, preferring external user changes. To wit "it is guaranteed that overwriting cluster state with prevState will not discard any updates that Overseer had performed unless such an act was done externally by the user".

        Maybe I wasn't clear enough, but I did mean the opposite of what you understood. The user has made some changes, either accidentally (in which case they totally deserve what's coming) or presumably to fix something that went wrong in the cluster state (which could be because of a genuine bug). In both cases, overwriting changes that a user has himself made seems wrong to me. We should just roll with it. Therefore the Overseer refreshes the cluster state and starts using it as the base for future operations.

        3) In ZkStateWriterTest, I note that ZkStateWriter isn't super amenable to testing, it's kind of subtle that enqueuing an update sometimes causes a flush, and sometimes does. Dunno if it's better or worse to have test-visible methods for doing a queue-without-flush and then explicit flush.

        You are right. testZkStateWriterBatching is a horrible test and I should have written a better one. In particular, maybeFlushAfter also updates the local state (lastStateFormat, lastCollectionName) before the write happens; we should change that. But I am not sure what you mean by "Dunno if it's better or worse to have test-visible methods for doing a queue-without-flush and then explicit flush"?

        Re: #4 and #5 – good point. I'll fix that.

        Scott Blum added a comment -

        1) "Actually Overseer already has code that catches all exceptions and refreshes the cluster state."

        In that case, I would suggest an explicit catch block there for BadVersionException, and log something milder than error? The current KeeperException catch block always logs an error.
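
        [Editorial note: a hedged sketch of what that milder catch block might look like. The QueueOp interface and refreshClusterState flag are assumptions for illustration, not the actual Overseer code.]

        import org.apache.zookeeper.KeeperException;
        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        class OverseerLoopSketch {
          private static final Logger log = LoggerFactory.getLogger(OverseerLoopSketch.class);
          private boolean refreshClusterState;

          interface QueueOp { void run() throws KeeperException; }

          void runOnce(QueueOp op) {
            try {
              op.run();
            } catch (KeeperException.BadVersionException e) {
              // expected when cluster state was modified externally:
              // refresh and retry, logging at a level milder than ERROR
              log.warn("Bad version writing to ZK, will refresh cluster state: {}", e.getMessage());
              refreshClusterState = true;
            } catch (KeeperException e) {
              // anything else remains a genuine error
              log.error("Exception in Overseer queue loop", e);
            }
          }
        }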

        2) Okay, I totally buy that argument! But see note below.

        3) I'm merely saying that the refactoring necessary to make ZkStateWriter more easily testable might negatively impact the flow of the existing code. I'm not sure; you'd have to try it.

        Having stewed on this a bit, I think there's somewhat of a fundamental disconnect in how ZkStateWriter's batching interacts with the Overseer work queue.

        Work items get removed from the queue before the corresponding change is actually written out to the cluster state. You can see in the original design that there's this neat peek->poll dance in the Overseer loop that attempts to enforce guaranteed state updates by not discarding the work item until after the state is written out. But the batching implementation gets rid of this guarantee, and that's why I perceive we're now in a state where the overseer updates and external updates can even be in conflict with each other.

        Imagine (as a thought experiment) we got rid of batching. If we did that, no external change could "lose" a work item, because we'd be committing one item at a time, so the retry operation on a bad version exception would always re-grab the most recent, unapplied work item. Now obviously, we don't want to get rid of batching, because efficiency. But I really do think we're batching in the wrong place. I think batching actually needs to happen in Overseer, because it has to be tied to discarding work items. Ideally, we'd peek N work items from the head of the queue, set up all the pending updates / cluster state mods in ZkStateWriter, then try to commit everything. If it succeeds, great, remove all the processed items from the queue. If it fails, then reread cluster state and retry all the items again — roughly like the sketch below.
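
        [Editorial note: a rough sketch of that proposal, purely as a thought experiment; the WorkQueue and BatchWriter interfaces here are invented for illustration.]

        import java.util.List;

        class OverseerBatchSketch {
          interface WorkQueue {
            List<byte[]> peek(int n);          // look at up to n items without removing them
            void remove(List<byte[]> items);   // discard items only after they are durable
          }

          interface BatchWriter {
            void apply(List<byte[]> items) throws Exception; // stage pending updates
            void commit() throws Exception;                  // single CAS write for the batch
          }

          static void runOneBatch(WorkQueue queue, BatchWriter writer) {
            List<byte[]> batch = queue.peek(10);
            try {
              writer.apply(batch);
              writer.commit();
              queue.remove(batch);   // only now is it safe to drop the work items
            } catch (Exception e) {
              // reread cluster state and retry the same items on the next pass;
              // nothing was removed from the queue, so no update can be lost
            }
          }
        }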

        Make any sense?

        (Now, all that said, that work may be more effort than it's worth, and maybe we should just focus on not having to use the queue to make cluster state updates in the format v2 world.)

        Shalin Shekhar Mangar added a comment -

        In that case, I would suggest an explicit catch block there for BadVersionException, and log something milder than error? The current KeeperException catch block always logs an error.

        +1

        Work items get removed from the queue before the corresponding change is actually written out to the cluster state. You can see in the original design that there's this neat peek->poll dance in the Overseer loop that attempts to enforce guaranteed state updates by not discarding the work item until after the state is written out. But the batching implementation gets rid of this guarantee, and that's why I perceive we're now in a state where the overseer updates and external updates can even be in conflict with each other.

        No, items are never removed until they're flushed to ZK. This invariant is maintained everywhere. If you find that I have missed a case then please do let me know. When operating on the work queue, batching is completely disabled because we have no fallback in case we fail. When operating on the main queue, we add the operation to the work queue when the onEnqueue event is called, and we remove the item from the work queue once the onWrite event is called. This ensures that we never lose an operation.

        But I really do think we're batching in the wrong place. I think batching actually needs to happen in Overseer, because it has to be tied to discarding work items.

        That used to be the case. I refactored Overseer extensively in 5.0 in SOLR-6554.

        (Now, all that said, that work may be more effort than it's worth, and maybe we should just focus on not having to use the queue to make cluster state updates in the format v2 world.)

        Yeah, I think a more worthwhile improvement would be SOLR-6760.

        Shalin Shekhar Mangar added a comment -

        Here's a better fix which discards ZkStateWriter on a BadVersionException and starts afresh. The previous approach didn't work when an external change was made to a state.json with no changes to /clusterstate.json. Such changes could be detected and resolved inside ZkStateWriter, but that would make the class unnecessarily complex.

        ZkStateWriter will put itself into an invalid state upon a BadVersionException and will disallow all future operations. Callers are expected to discard such an instance and create a fresh ZkStateWriter instance for future use.
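
        [Editorial note: a minimal sketch of that invalid-state guard. The field and method names (invalidState, casWrite) are assumptions for illustration, not the committed code.]

        import org.apache.zookeeper.KeeperException;

        class InvalidatingWriterSketch {
          private boolean invalidState = false;

          void enqueueUpdate(byte[] update) throws KeeperException, InterruptedException {
            if (invalidState) {
              throw new IllegalStateException(
                  "ZkStateWriter saw a BadVersionException; discard it and create a fresh instance");
            }
            try {
              casWrite(update);
            } catch (KeeperException.BadVersionException e) {
              invalidState = true; // poison this instance so all future calls fail fast
              throw e;
            }
          }

          private void casWrite(byte[] update) throws KeeperException, InterruptedException {
            // versioned ZK write elided in this sketch
          }
        }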

        I added two tests in ZkStateWriterTest which simulate an external change to /clusterstate.json and to a state.json, and assert that an IllegalStateException is thrown on any subsequent invocation of enqueueUpdate or writePendingUpdates.

        I also added a test which asserts that the Overseer can keep processing events after a BadVersionException (indirectly testing that a fresh ZkStateWriter is created upon said exception).

        I also added copious amounts of javadocs to the ZkStateWriter class for future reference.

        Scott Blum added a comment -

        Great change! Couple of minor nits:

        1) There's a log(SHALIN) to remove in OverseerTest

        2) testExternalModificationToSharedClusterState() has two try/catch(expected) blocks with no fail() statement at the end of the try block; testExternalModificationToStateFormat2() has three instances of the same problem.

        Shalin Shekhar Mangar added a comment -

        Thanks for the review Scott!

        Both of your comments are now incorporated into this patch. I'll run precommit and tests and commit once they succeed.

        ASF subversion and git services added a comment -

        Commit 1695746 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1695746 ]

        SOLR-7869: Overseer does not handle BadVersionException correctly and, in some cases, can go into an infinite loop if cluster state in ZooKeeper is modified externally

        ASF subversion and git services added a comment -

        Commit 1695763 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1695763 ]

        SOLR-7869: Overseer does not handle BadVersionException correctly and, in some cases, can go into an infinite loop if cluster state in ZooKeeper is modified externally

        Shalin Shekhar Mangar added a comment -

        Thanks Scott!


  People

    Assignee: Shalin Shekhar Mangar
    Reporter: Shalin Shekhar Mangar