Solr
  1. Solr
  2. SOLR-5756

A utility API to move collections from stateFormat=1 to stateFormat=2

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      SOLR-5473 allows creation of collection with state stored outside of clusterstate.json. We would need an API to move existing 'internal' collections outside

      1. CollectionStateFormat2Test-failure.log
        557 kB
        Shalin Shekhar Mangar
      2. CollectionStateFormat2Test-failure-r1695176.log
        1.13 MB
        Shalin Shekhar Mangar
      3. CollectionStateFormat2Test-passed-r1695176.log
        624 kB
        Shalin Shekhar Mangar
      4. sarowe-jenkins-Lucene-Solr-tests-trunk-1522-CollectionStateFormat2-failure.txt
        560 kB
        Steve Rowe
      5. SOLR-5756.patch
        58 kB
        Shalin Shekhar Mangar
      6. SOLR-5756.patch
        52 kB
        Shalin Shekhar Mangar
      7. SOLR-5756.patch
        51 kB
        Shalin Shekhar Mangar
      8. SOLR-5756-fix.patch
        4 kB
        Scott Blum
      9. SOLR-5756-fix.patch
        0.9 kB
        Shalin Shekhar Mangar
      10. SOLR-5756-fix.patch-failure.log
        523 kB
        Steve Rowe
      11. SOLR-5756-part2.patch
        6 kB
        Scott Blum
      12. SOLR-5756-trunk.patch
        44 kB
        Scott Blum

        Issue Links

          Activity

          Hide
          Scott Blum added a comment -

          Hi, has any work started on this issue? We have a deployment with a very large clusterstate.json (most of our collections are there). New collections added since our last upgrade have their own split state.json, but we still have an enormous number of collections using the shared file. We are suspicious that the large degree of contention on clusterstate.json is affecting the stability of our cluster, so we'd like to split it apart to see if things improve.

          A few questions:

          1) Do you think it would be safe to do this manually on a running cluster? I've only spent a few hours looking at the overseer code, but I got the impression that I might just be able to populate all the state.json nodes manually, followed by emptying clusterstate.json. That last step should tickle all the running servers, forcing a reload which will get all servers into the right separated state. At least, that's my theory. Does that sound right to you?

          2) Suppose I wanted to try to write a patch for this issue to help solve it for everyone, is that a reasonable thing to attempt for someone with a lot of ZK knowledge but pretty new to Solr? Or are there a lot of subtleties?

          3) Can you opine on the specifics of having an API to move the state out vs. a forced migration? From what I read on SOLR-5473, it sounds like eventually we'd just want to force everyone into split state. Is it too "soon" to do that?

          (Unrelated to this specific issue, I'm actually a committer on Apache Curator, and I have a general interest in understanding and possibly helping improve overseer's ZK interactions. Are there any docs outside of the code itself you might recommend for me to read?)

          Show
          Scott Blum added a comment - Hi, has any work started on this issue? We have a deployment with a very large clusterstate.json (most of our collections are there). New collections added since our last upgrade have their own split state.json, but we still have an enormous number of collections using the shared file. We are suspicious that the large degree of contention on clusterstate.json is affecting the stability of our cluster, so we'd like to split it apart to see if things improve. A few questions: 1) Do you think it would be safe to do this manually on a running cluster? I've only spent a few hours looking at the overseer code, but I got the impression that I might just be able to populate all the state.json nodes manually, followed by emptying clusterstate.json. That last step should tickle all the running servers, forcing a reload which will get all servers into the right separated state. At least, that's my theory. Does that sound right to you? 2) Suppose I wanted to try to write a patch for this issue to help solve it for everyone, is that a reasonable thing to attempt for someone with a lot of ZK knowledge but pretty new to Solr? Or are there a lot of subtleties? 3) Can you opine on the specifics of having an API to move the state out vs. a forced migration? From what I read on SOLR-5473 , it sounds like eventually we'd just want to force everyone into split state. Is it too "soon" to do that? (Unrelated to this specific issue, I'm actually a committer on Apache Curator, and I have a general interest in understanding and possibly helping improve overseer's ZK interactions. Are there any docs outside of the code itself you might recommend for me to read?)
          Hide
          Noble Paul added a comment -

          Do you think it would be safe to do this manually on a running cluster?

          It is never fully safe , There are intermediate state changes which you may lose

          Suppose I wanted to try to write a patch for this issue to help solve it for everyone, is that a reasonable thing to attempt for someone with a lot of ZK knowledge but pretty new to Solr?

          Is it too "soon" to do that?

          It is not too soon. 5.0 is out for a while

          A lot of ZK knowledge is not required to do this. You need to know how overseer updates state and how states are updated by ZkStateReader.createClusterStateWatchersAndUpdate in each node

          Can you opine on the specifics of having an API to move the state out vs. a forced migration?

          I shall do a follow up comment of what are the steps involved in actually doing this

          Show
          Noble Paul added a comment - Do you think it would be safe to do this manually on a running cluster? It is never fully safe , There are intermediate state changes which you may lose Suppose I wanted to try to write a patch for this issue to help solve it for everyone, is that a reasonable thing to attempt for someone with a lot of ZK knowledge but pretty new to Solr? Is it too "soon" to do that? It is not too soon. 5.0 is out for a while A lot of ZK knowledge is not required to do this. You need to know how overseer updates state and how states are updated by ZkStateReader.createClusterStateWatchersAndUpdate in each node Can you opine on the specifics of having an API to move the state out vs. a forced migration? I shall do a follow up comment of what are the steps involved in actually doing this
          Hide
          Noble Paul added a comment -

          The steps to migrate state from main clusterstate.

          1. Add support to ZKStateReader to lookup the /collections/<collection-name>state.json if a collection is suddenly missing from the main clusterstate.json . If the state.json is found add a listener to that if that requires to be watched
          2. Add a collection-admin command to migrate the state outside

          A user should first upgrade all servers to a new version and then execute the command

          Show
          Noble Paul added a comment - The steps to migrate state from main clusterstate. Add support to ZKStateReader to lookup the /collections/<collection-name>state.json if a collection is suddenly missing from the main clusterstate.json . If the state.json is found add a listener to that if that requires to be watched Add a collection-admin command to migrate the state outside A user should first upgrade all servers to a new version and then execute the command
          Hide
          Scott Blum added a comment -

          That sounds great. I will get on this in the next couple days. One clarifying questions:

          Suppose on read/reload, data exists in both the collection's `state.json` and the shared `clusterstate.json` on load, which one should it prefer, and it should any corrective action happen to de-dup? I would presume that it should prefer `state.json` (and eagerly remove the entry from `clusterstate.json`?) in this case, since it indicates someone successfully wrote out the new one but failed to delete the old one.

          Show
          Scott Blum added a comment - That sounds great. I will get on this in the next couple days. One clarifying questions: Suppose on read/reload, data exists in both the collection's `state.json` and the shared `clusterstate.json` on load, which one should it prefer, and it should any corrective action happen to de-dup? I would presume that it should prefer `state.json` (and eagerly remove the entry from `clusterstate.json`?) in this case, since it indicates someone successfully wrote out the new one but failed to delete the old one.
          Hide
          Scott Blum added a comment -

          Noble Paul Shalin Shekhar Mangar do either of you ever hang out on #solr-dev? I have a few questions that I think would greatly benefit from a real time chat.

          Show
          Scott Blum added a comment - Noble Paul Shalin Shekhar Mangar do either of you ever hang out on #solr-dev? I have a few questions that I think would greatly benefit from a real time chat.
          Hide
          Scott Blum added a comment -

          Also, what branch should I be working against? I have something in a basically working state I'd like feedback on.

          Show
          Scott Blum added a comment - Also, what branch should I be working against? I have something in a basically working state I'd like feedback on.
          Hide
          Anshum Gupta added a comment -

          Scott Blum people generally work on trunk and then merge/backport it to branch_5x.

          Show
          Anshum Gupta added a comment - Scott Blum people generally work on trunk and then merge/backport it to branch_5x.
          Hide
          Noble Paul added a comment -

          always prefer 'state.json' .

          Show
          Noble Paul added a comment - always prefer 'state.json' .
          Hide
          Shalin Shekhar Mangar added a comment -

          Scott Blum - I am 'shalin' on irc

          Show
          Shalin Shekhar Mangar added a comment - Scott Blum - I am 'shalin' on irc
          Hide
          Scott Blum added a comment -

          Initial stab at something working, based on 5.2.1. Haven't run through test suite yet.

          Show
          Scott Blum added a comment - Initial stab at something working, based on 5.2.1. Haven't run through test suite yet.
          Hide
          Scott Blum added a comment -

          (couple of comments and fixed one thing)

          Show
          Scott Blum added a comment - (couple of comments and fixed one thing)
          Hide
          Scott Blum added a comment -

          (more bugfixes... will tests pass?)

          Show
          Scott Blum added a comment - (more bugfixes... will tests pass?)
          Hide
          Scott Blum added a comment -

          I think the tests are actually passing now.

          Show
          Scott Blum added a comment - I think the tests are actually passing now.
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Scott Blum. This approach looks promising. A few comments on the patch:

          1. The return value of updateFromSharedClusterState(clusterState.getLiveNodes(), null); in ZkStateReader.updateClusterState is discarded. How is the cluster state updated in that case?
          2. Perhaps rename StateWatcher.refresh to refreshAndAddWatch to make it more clear?
          3. All collections (whether in clusterstate.json or in individual state.json) having local cores are added to interestingCollections and therefore a StateWatcher is created even if the collection was in clusterstate.json (as a result one useless zk call is made?)
          4. ZkStateReader.removeZKWatch should not remove the collection from the cluster state just because all local cores belonging to that collection have been removed. A proper updateClusterState should be called in this case.

          This patch has a lot of changes to the way synchronization is done so I will review it in more detail. But before that, can you please rebase the patch to trunk?

          Show
          Shalin Shekhar Mangar added a comment - Thanks Scott Blum . This approach looks promising. A few comments on the patch: The return value of updateFromSharedClusterState(clusterState.getLiveNodes(), null); in ZkStateReader.updateClusterState is discarded. How is the cluster state updated in that case? Perhaps rename StateWatcher.refresh to refreshAndAddWatch to make it more clear? All collections (whether in clusterstate.json or in individual state.json) having local cores are added to interestingCollections and therefore a StateWatcher is created even if the collection was in clusterstate.json (as a result one useless zk call is made?) ZkStateReader.removeZKWatch should not remove the collection from the cluster state just because all local cores belonging to that collection have been removed. A proper updateClusterState should be called in this case. This patch has a lot of changes to the way synchronization is done so I will review it in more detail. But before that, can you please rebase the patch to trunk?
          Hide
          Scott Blum added a comment -

          1) Great catch, totally missed that.

          2) Sure, not a problem. I thought it was kind of implied by the classname (StateWatcher / ClusterStateWatcher) but explicit methods are always good.

          3) Yes, but in the 99% case (no external state exists) the getData call fails, no watcher is left on the node, and everything should just clean itself up. I don't think this adds much additional overhead considering that getStateFormat2CollectionNames() checks existence on every collection for a state.json, even ones that are not "interesting", and this happens on every update, not just on startup which is where the initial set of StateWatchers get created. So I think lower-hanging fruit would be to optimize the updateFromSharedClusterState -> getStateFormat2CollectionNames() code path. I would love to talk more about how/why the rest of the system relies on the lazy-loading bits.

          4) I don't think I quite follow, let's discuss on IRC.

          Show
          Scott Blum added a comment - 1) Great catch, totally missed that. 2) Sure, not a problem. I thought it was kind of implied by the classname (StateWatcher / ClusterStateWatcher) but explicit methods are always good. 3) Yes, but in the 99% case (no external state exists) the getData call fails, no watcher is left on the node, and everything should just clean itself up. I don't think this adds much additional overhead considering that getStateFormat2CollectionNames() checks existence on every collection for a state.json, even ones that are not "interesting", and this happens on every update, not just on startup which is where the initial set of StateWatchers get created. So I think lower-hanging fruit would be to optimize the updateFromSharedClusterState -> getStateFormat2CollectionNames() code path. I would love to talk more about how/why the rest of the system relies on the lazy-loading bits. 4) I don't think I quite follow, let's discuss on IRC.
          Hide
          Scott Blum added a comment -

          Okay, per discussion on IRC let me summarize the big glaring problem with the existing patch:

          I'm trying to use clusterState to do two very different things:
          1) Serve as an in-memory copy of the contents of clusterstate.json
          2) Serve as the "amalgam" result of combining shared clusterstate.json with all of the non-shared state.json.

          I need to explicitly break this apart into two different fields.

          Show
          Scott Blum added a comment - Okay, per discussion on IRC let me summarize the big glaring problem with the existing patch: I'm trying to use clusterState to do two very different things: 1) Serve as an in-memory copy of the contents of clusterstate.json 2) Serve as the "amalgam" result of combining shared clusterstate.json with all of the non-shared state.json. I need to explicitly break this apart into two different fields.
          Hide
          Scott Blum added a comment -

          Try this. I haven't run tests all the way through, but most of the way.

          Show
          Scott Blum added a comment - Try this. I haven't run tests all the way through, but most of the way.
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Scott. A few more comments:

          1. Need a null check for "children" in ZkStateReader.refreshLazyCollections()
          2. I don't see where the clusterState object in ZkStateReader is initialized? The createClusterStateWatchersAndUpdate() method just calls constructState(new HashSet<>(liveNodes)); but doesn't set its return value to the clusterState variable.
          3. Nit-pick: It looks like the code for updating live nodes in ZkStateReader.createClusterStateWatchersAndUpdate() can use updateLiveNodes() directly (with a watcher)?
          4. Nit-pick: There was a lot of controversy on naming collections as "external" when this was originally implemented in SOLR-5473. In the interest of avoiding controversy and bike-shedding, can you rename "external" to "stateFormat2" wherever possible? e.g. "externalCollectionExists" to "isStateFormat2()" etc.

          I think now we are in a position to write a test which:

          1. Creates a collection in shared cluster state
          2. Creates a state.json for the same collection
          3. Mutates state in shared cluster state and asserts that those changes are not visible in ZkStateReader
          4. Deletes old cluster state and ensures that everything still works fine.
          Show
          Shalin Shekhar Mangar added a comment - Thanks Scott. A few more comments: Need a null check for "children" in ZkStateReader.refreshLazyCollections() I don't see where the clusterState object in ZkStateReader is initialized? The createClusterStateWatchersAndUpdate() method just calls constructState(new HashSet<>(liveNodes)); but doesn't set its return value to the clusterState variable. Nit-pick: It looks like the code for updating live nodes in ZkStateReader.createClusterStateWatchersAndUpdate() can use updateLiveNodes() directly (with a watcher)? Nit-pick: There was a lot of controversy on naming collections as "external" when this was originally implemented in SOLR-5473 . In the interest of avoiding controversy and bike-shedding, can you rename "external" to "stateFormat2" wherever possible? e.g. "externalCollectionExists" to "isStateFormat2()" etc. I think now we are in a position to write a test which: Creates a collection in shared cluster state Creates a state.json for the same collection Mutates state in shared cluster state and asserts that those changes are not visible in ZkStateReader Deletes old cluster state and ensures that everything still works fine.
          Hide
          Scott Blum added a comment -

          1) I don't understand. `if (children == null || children.isEmpty()) {` isn't that the null check?

          2) constructState() actually updates the field as a side effect, and it's the ONLY place I'm ever writing to the field. I put it inside the method because I was just having to do clusterState = constructState() at every call site. I can outline it if you think it's unclear, or I can rename the method. I was just sticking with the old name, but maybe that doesn't make sense now. "updateView" maybe?

          3) I actually intentionally left the live node and alias code alone, even though I have the strong urge to refactor it to match the new patterns I'm creating, since it wasn't directly related to the change. That said, I would be totally happy to create a LiveNodeWatcher along the lines of StateWatcher (which would enable shared code) if you think I should go ahead with that!

          4) Awesome, I was hoping for some guidance on this. For "shared" clusterstate, should I use "shared" or "legacyFormat"?

          As far as testing goes, I'll hit you up on IRC for some pointers on writing this. Ideally, if you could point me at an existing test I could start from?

          Show
          Scott Blum added a comment - 1) I don't understand. `if (children == null || children.isEmpty()) {` isn't that the null check? 2) constructState() actually updates the field as a side effect, and it's the ONLY place I'm ever writing to the field. I put it inside the method because I was just having to do clusterState = constructState() at every call site. I can outline it if you think it's unclear, or I can rename the method. I was just sticking with the old name, but maybe that doesn't make sense now. "updateView" maybe? 3) I actually intentionally left the live node and alias code alone, even though I have the strong urge to refactor it to match the new patterns I'm creating, since it wasn't directly related to the change. That said, I would be totally happy to create a LiveNodeWatcher along the lines of StateWatcher (which would enable shared code) if you think I should go ahead with that! 4) Awesome, I was hoping for some guidance on this. For "shared" clusterstate, should I use "shared" or "legacyFormat"? As far as testing goes, I'll hit you up on IRC for some pointers on writing this. Ideally, if you could point me at an existing test I could start from?
          Hide
          Scott Blum added a comment -

          With the changes we discussed! Tests passing.

          Show
          Scott Blum added a comment - With the changes we discussed! Tests passing.
          Hide
          Scott Blum added a comment -

          Now with 100% more testing.

          Show
          Scott Blum added a comment - Now with 100% more testing.
          Hide
          Scott Blum added a comment -

          Now includes fix for SOLR-7870

          Show
          Scott Blum added a comment - Now includes fix for SOLR-7870
          Hide
          Scott Blum added a comment -

          Part 2 of the patch, adds a MIGRATESTATEFORMAT collection command. Bikeshedding on name is welcome.

          Show
          Scott Blum added a comment - Part 2 of the patch, adds a MIGRATESTATEFORMAT collection command. Bikeshedding on name is welcome.
          Hide
          Shalin Shekhar Mangar added a comment -

          Combined patch. Tests and precommit pass. I'll commit this shortly.

          Show
          Shalin Shekhar Mangar added a comment - Combined patch. Tests and precommit pass. I'll commit this shortly.
          Hide
          Shalin Shekhar Mangar added a comment -

          Patch updated to trunk. I'm running tests and precommit now.

          Show
          Shalin Shekhar Mangar added a comment - Patch updated to trunk. I'm running tests and precommit now.
          Hide
          Shalin Shekhar Mangar added a comment - - edited
          1. I changed the Collection API from using "name" to "collection" which, to me, seems much more clear.
          2. I added a end-to-end test in TestCollectionAPI to migrate state format using the new collection API.
          3. Added SolrJ support (CollectionAdminRequest.MigrateStateFormat)
          Show
          Shalin Shekhar Mangar added a comment - - edited I changed the Collection API from using "name" to "collection" which, to me, seems much more clear. I added a end-to-end test in TestCollectionAPI to migrate state format using the new collection API. Added SolrJ support (CollectionAdminRequest.MigrateStateFormat)
          Hide
          ASF subversion and git services added a comment -

          Commit 1694692 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1694692 ]

          SOLR-5756: A utility Collection API to move a collection from stateFormat=1 to stateFormat=2
          SOLR-7870: Write a test which asserts that requests to stateFormat=2 collection succeed on a node even after all local replicas of that collection have been removed

          Show
          ASF subversion and git services added a comment - Commit 1694692 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1694692 ] SOLR-5756 : A utility Collection API to move a collection from stateFormat=1 to stateFormat=2 SOLR-7870 : Write a test which asserts that requests to stateFormat=2 collection succeed on a node even after all local replicas of that collection have been removed
          Hide
          Shalin Shekhar Mangar added a comment -

          There are a few conflicts on branch_5x because SOLR-7855 hasn't been backported yet. I'll wait for it to be complete.

          Show
          Shalin Shekhar Mangar added a comment - There are a few conflicts on branch_5x because SOLR-7855 hasn't been backported yet. I'll wait for it to be complete.
          Hide
          Scott Blum added a comment -

          Awesome, thanks! I presume there won't be a backport to 5.3? (I'm planning to backport it myself to 5.2.1, which is what we're running in prod)

          Show
          Scott Blum added a comment - Awesome, thanks! I presume there won't be a backport to 5.3? (I'm planning to backport it myself to 5.2.1, which is what we're running in prod)
          Hide
          Shalin Shekhar Mangar added a comment -

          Thank you! No, this won't make it to 5.3 because the release branch has already been created. Furthermore, this is too risky a change to sneak in at the last minute.

          Show
          Shalin Shekhar Mangar added a comment - Thank you! No, this won't make it to 5.3 because the release branch has already been created. Furthermore, this is too risky a change to sneak in at the last minute.
          Hide
          Scott Blum added a comment -

          nit: CollectionAdminRequest.java has a copy-pasted comment that needs an update

          Show
          Scott Blum added a comment - nit: CollectionAdminRequest.java has a copy-pasted comment that needs an update
          Hide
          ASF subversion and git services added a comment -

          Commit 1694718 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1694718 ]

          SOLR-5756: Fix copy/paste mistake in comment

          Show
          ASF subversion and git services added a comment - Commit 1694718 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1694718 ] SOLR-5756 : Fix copy/paste mistake in comment
          Hide
          Shalin Shekhar Mangar added a comment -

          nit: CollectionAdminRequest.java has a copy-pasted comment that needs an update

          Oops, thanks!

          Show
          Shalin Shekhar Mangar added a comment - nit: CollectionAdminRequest.java has a copy-pasted comment that needs an update Oops, thanks!
          Hide
          ASF subversion and git services added a comment -

          Commit 1695026 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1695026 ]

          SOLR-5756: A utility Collection API to move a collection from stateFormat=1 to stateFormat=2
          SOLR-7870: Write a test which asserts that requests to stateFormat=2 collection succeed on a node even after all local replicas of that collection have been removed

          Show
          ASF subversion and git services added a comment - Commit 1695026 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1695026 ] SOLR-5756 : A utility Collection API to move a collection from stateFormat=1 to stateFormat=2 SOLR-7870 : Write a test which asserts that requests to stateFormat=2 collection succeed on a node even after all local replicas of that collection have been removed
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Scott!

          Show
          Shalin Shekhar Mangar added a comment - Thanks Scott!
          Hide
          Shalin Shekhar Mangar added a comment - - edited

          I am seeing frequent failures in CollectionStateFormat2Test since this was committed.

          http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-MacOSX/2561/
          http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/13599/

          A typical failure looks like:

            2> NOTE: reproduce with: ant test  -Dtestcase=CollectionStateFormat2Test -Dtests.method=test -Dtests.seed=B21128F51A7A37B8 -Dtests.slow=true -Dtests.locale=fr -Dtests.timezone=GB -Dtests.asserts=true -Dtests.file.encoding=UTF-8
          [00:27:20.744] ERROR   58.6s J1 | CollectionStateFormat2Test.test <<<
             > Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:41444/w_b/w: Could not find collection : myExternColl
             >    at __randomizedtesting.SeedInfo.seed([B21128F51A7A37B8:3A45172FB4865A40]:0)
             >    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560)
             >    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
             >    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
             >    at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376)
             >    at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856)
             >    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799)
             >    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
             >    at org.apache.solr.cloud.CollectionStateFormat2Test.testZkNodeLocation(CollectionStateFormat2Test.java:84)
             >    at org.apache.solr.cloud.CollectionStateFormat2Test.test(CollectionStateFormat2Test.java:40)
             >    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          
          Show
          Shalin Shekhar Mangar added a comment - - edited I am seeing frequent failures in CollectionStateFormat2Test since this was committed. http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-MacOSX/2561/ http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/13599/ A typical failure looks like: 2> NOTE: reproduce with: ant test -Dtestcase=CollectionStateFormat2Test -Dtests.method=test -Dtests.seed=B21128F51A7A37B8 -Dtests.slow= true -Dtests.locale=fr -Dtests.timezone=GB -Dtests.asserts= true -Dtests.file.encoding=UTF-8 [00:27:20.744] ERROR 58.6s J1 | CollectionStateFormat2Test.test <<< > Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http: //127.0.0.1:41444/w_b/w: Could not find collection : myExternColl > at __randomizedtesting.SeedInfo.seed([B21128F51A7A37B8:3A45172FB4865A40]:0) > at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560) > at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234) > at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226) > at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376) > at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328) > at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085) > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856) > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799) > at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220) > at org.apache.solr.cloud.CollectionStateFormat2Test.testZkNodeLocation(CollectionStateFormat2Test.java:84) > at org.apache.solr.cloud.CollectionStateFormat2Test.test(CollectionStateFormat2Test.java:40) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          Hide
          Shalin Shekhar Mangar added a comment -

          Full test log attached.

          Show
          Shalin Shekhar Mangar added a comment - Full test log attached.
          Hide
          Shalin Shekhar Mangar added a comment -

          The problem is that the ZkStateReader automatically deletes a collection from cluster state once it sees that the state node no longer exists but some times the actual core unload command reaches the node after said state.json is deleted and the node replies by saying that it could not find such a core.

          Show
          Shalin Shekhar Mangar added a comment - The problem is that the ZkStateReader automatically deletes a collection from cluster state once it sees that the state node no longer exists but some times the actual core unload command reaches the node after said state.json is deleted and the node replies by saying that it could not find such a core.
          Hide
          Shalin Shekhar Mangar added a comment -

          Here are two more logs of the same test against the same revision with the same seed etc:

          1. CollectionStateFormat2Test-passed-r1695176.log
          2. CollectionStateFormat2Test-failure-r1695176.log
          Show
          Shalin Shekhar Mangar added a comment - Here are two more logs of the same test against the same revision with the same seed etc: CollectionStateFormat2Test-passed-r1695176.log CollectionStateFormat2Test-failure-r1695176.log
          Hide
          Steve Rowe added a comment -

          Attaching the CollectionStateFormat2 log excerpted from tests-failures.txt from the most recent trunk failure on my (Linux) Jenkins: http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk/1522/.

          Show
          Steve Rowe added a comment - Attaching the CollectionStateFormat2 log excerpted from tests-failures.txt from the most recent trunk failure on my (Linux) Jenkins: http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk/1522/ .
          Hide
          Shalin Shekhar Mangar added a comment -

          Based on discussion with Scott on irc, this may be a potential fix.

          Show
          Shalin Shekhar Mangar added a comment - Based on discussion with Scott on irc, this may be a potential fix.
          Hide
          Steve Rowe added a comment -

          I beasted with Shalin's SOLR-5756-fix.patch applied to trunk - attaching the log.

          Show
          Steve Rowe added a comment - I beasted with Shalin's SOLR-5756 -fix.patch applied to trunk - attaching the log.
          Hide
          Scott Blum added a comment -

          A real fix, I'm pretty sure. I found the race condition that was causing the flakiness.

          Show
          Scott Blum added a comment - A real fix, I'm pretty sure. I found the race condition that was causing the flakiness.
          Hide
          ASF subversion and git services added a comment -

          Commit 1696335 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1696335 ]

          SOLR-5756: Fix for race condition between unwatch and watcher fire event

          Show
          ASF subversion and git services added a comment - Commit 1696335 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1696335 ] SOLR-5756 : Fix for race condition between unwatch and watcher fire event
          Hide
          ASF subversion and git services added a comment -

          Commit 1696336 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1696336 ]

          SOLR-5756: Fix for race condition between unwatch and watcher fire event

          Show
          ASF subversion and git services added a comment - Commit 1696336 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1696336 ] SOLR-5756 : Fix for race condition between unwatch and watcher fire event
          Hide
          Shalin Shekhar Mangar added a comment -

          That fix seems to have solved the problem. Good sleuthing, Scott!

          Show
          Shalin Shekhar Mangar added a comment - That fix seems to have solved the problem. Good sleuthing, Scott!

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Noble Paul
            • Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development