Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9014

Deprecate and reduce usage of ClusterState methods which may make calls to ZK via the lazy collection reference

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1, 7.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      ClusterState has a bunch of methods such as getSlice and getReplica which internally call getCollectionOrNull that ends up making a call to ZK via the lazy collection reference. Many classes use these methods even though a DocCollection object is available. In such cases, multiple redundant calls to ZooKeeper can happen if the collection is not watched locally. This is especially true for Overseer classes which operate on all collections.

      We should audit all usages of these methods and replace them with calls to appropriate DocCollection methods.

      1. SOLR-9014.patch
        58 kB
        Shalin Shekhar Mangar
      2. SOLR-9014-deprecate-getCollections.patch
        35 kB
        Shalin Shekhar Mangar

        Activity

        Hide
        shalinmangar Shalin Shekhar Mangar added a comment - - edited

        This is turning out to be interesting. We have to revisit a few assumptions:

        1. The Overseer has a wait loop to see a certain condition to be true in many places. The earlier assumption was that updateClusterState was expensive and therefore it was better to wait until you see the state. But not that we have lazy collections and a collection specific forceUpdateCollection, the wait loop is actually as expensive because it ends up reading the collection state from ZooKeeper – sometime as frequently as 100ms. We should return the resolved reference from ZkStateReader#forceUpdateCollection and use it in such places.
        2. The ClusterState#getCollections was supposed to be lightweight i.e. it just read and returned the names of known collections from local cached state. This was changed in SOLR-6629 to resolve the reference. This means that it ends up going to ZK for each non-watched collection. So API calls like downnode etc are way more expensive than needed. It is better to start returning a List<DocCollection> from this method instead.
        Show
        shalinmangar Shalin Shekhar Mangar added a comment - - edited This is turning out to be interesting. We have to revisit a few assumptions: The Overseer has a wait loop to see a certain condition to be true in many places. The earlier assumption was that updateClusterState was expensive and therefore it was better to wait until you see the state. But not that we have lazy collections and a collection specific forceUpdateCollection, the wait loop is actually as expensive because it ends up reading the collection state from ZooKeeper – sometime as frequently as 100ms. We should return the resolved reference from ZkStateReader#forceUpdateCollection and use it in such places. The ClusterState#getCollections was supposed to be lightweight i.e. it just read and returned the names of known collections from local cached state. This was changed in SOLR-6629 to resolve the reference. This means that it ends up going to ZK for each non-watched collection. So API calls like downnode etc are way more expensive than needed. It is better to start returning a List<DocCollection> from this method instead.
        Hide
        dragonsinth Scott Blum added a comment -

        Hard to get the right context in mind in all these cases, eh?

        1. If the overseer had set a watch on the collection it cared about, it would be more efficient to wait and loop. Maybe we could fix this via SOLR-8323? Overseer could set a temporary watch on things it cared about.

        One thing puzzles me in what you said tho; if the collection is lazy but existent, I think forceUpdateCollection() is basically exactly as efficient as just polling the lazy collection. So going back to forceUpdateCollection() won't make it any better. But in cases where the collection is watching, looping is far more efficient.

        2) Agreed, maybe we should expose getCollectionStates() or getCollectionNames().

        Show
        dragonsinth Scott Blum added a comment - Hard to get the right context in mind in all these cases, eh? 1. If the overseer had set a watch on the collection it cared about, it would be more efficient to wait and loop. Maybe we could fix this via SOLR-8323 ? Overseer could set a temporary watch on things it cared about. One thing puzzles me in what you said tho; if the collection is lazy but existent, I think forceUpdateCollection() is basically exactly as efficient as just polling the lazy collection. So going back to forceUpdateCollection() won't make it any better. But in cases where the collection is watching, looping is far more efficient. 2) Agreed, maybe we should expose getCollectionStates() or getCollectionNames().
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        You are right. I was confused. I haven't dug into SOLR-8323 yet but if you say so

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - You are right. I was confused. I haven't dug into SOLR-8323 yet but if you say so
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        I found SOLR-9030 while working on this issue.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - I found SOLR-9030 while working on this issue.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Here's a first pass at cleaning this up.

        I tried to change getCollections to return DoCollection objects instead of collection names but somehow it tickles SOLR-9030 even more. I'll take another shot at it in a bit.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Here's a first pass at cleaning this up. I tried to change getCollections to return DoCollection objects instead of collection names but somehow it tickles SOLR-9030 even more. I'll take another shot at it in a bit.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 922265b478296992189434040517368cf93d1b09 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=922265b ]

        SOLR-9014: Deprecate and reduce usage of ClusterState methods which may make calls to ZK via the lazy collection reference

        Show
        jira-bot ASF subversion and git services added a comment - Commit 922265b478296992189434040517368cf93d1b09 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=922265b ] SOLR-9014 : Deprecate and reduce usage of ClusterState methods which may make calls to ZK via the lazy collection reference
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Scott Blum – In SOLR-6629, we added a CollectionRef.get() call in ClusterState.getCollections but reading the code again, I no longer understand why it was necessary. Is there ever a chance that the collectionStates inside ClusterState.java would ever have a collection that cannot be resolved?

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Scott Blum – In SOLR-6629 , we added a CollectionRef.get() call in ClusterState.getCollections but reading the code again, I no longer understand why it was necessary. Is there ever a chance that the collectionStates inside ClusterState.java would ever have a collection that cannot be resolved?
        Hide
        dragonsinth Scott Blum added a comment - - edited

        Yes, unfortunately. ClusterState.collectionStates is driven (in part) by /solr/collections/<children>. In particular, if /solr/collections/foo exists, the foo collection is not being watched, and /solr/collections/foo/state.json does NOT exist, then the collection will appear in collectionStates as a LazyCollectionRef, but it won't resolve to a DocCollection since there's no state.json. We don't poll or watch for the existence of the state.json for non-watched collections.

        Watched collections don't have this problem, due to how interestingCollections vs. watchedCollections is handled in ZkStateReader.

        We should figure out if there's a better API for ClusterState to handle this more efficiently. If you naively remove the guard in ClusterState.getCollections(), what will happen is a lot of calling code will break. For example, in Assign.getNodeNameVsShardCount(), the caller loops over the returned set of collection names, calling clusterState.getCollection(collectionName) and expecting the result to be non-null. We would either need to update all those callers to check, or else have LazyCollectionRef.get() return an empty DocCollection if the node doesn't exist.

        Show
        dragonsinth Scott Blum added a comment - - edited Yes, unfortunately. ClusterState.collectionStates is driven (in part) by /solr/collections/<children>. In particular, if /solr/collections/foo exists, the foo collection is not being watched, and /solr/collections/foo/state.json does NOT exist, then the collection will appear in collectionStates as a LazyCollectionRef, but it won't resolve to a DocCollection since there's no state.json. We don't poll or watch for the existence of the state.json for non-watched collections. Watched collections don't have this problem, due to how interestingCollections vs. watchedCollections is handled in ZkStateReader. We should figure out if there's a better API for ClusterState to handle this more efficiently. If you naively remove the guard in ClusterState.getCollections(), what will happen is a lot of calling code will break. For example, in Assign.getNodeNameVsShardCount(), the caller loops over the returned set of collection names, calling clusterState.getCollection(collectionName) and expecting the result to be non-null. We would either need to update all those callers to check, or else have LazyCollectionRef.get() return an empty DocCollection if the node doesn't exist.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Yes, unfortunately. ClusterState.collectionStates is driven (in part) by /solr/collections/<children>. In particular, if /solr/collections/foo exists, the foo collection is not being watched, and /solr/collections/foo/state.json does NOT exist, then the collection will appear in collectionStates as a LazyCollectionRef, but it won't resolve to a DocCollection since there's no state.json. We don't poll or watch for the existence of the state.json for non-watched collections.

        This is what I don't understand. If /solr/collections/foo exists, won't the collection exist either in legacy clusterstate.json or in /solr/collections/foo/state.json? Granted that there is an edge case where a collection might be under construction but otherwise this should hold true? So if the goal is to return a Set of names of collections, why must we try to resolve each entry in ClusterState.collectionStates?

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Yes, unfortunately. ClusterState.collectionStates is driven (in part) by /solr/collections/<children>. In particular, if /solr/collections/foo exists, the foo collection is not being watched, and /solr/collections/foo/state.json does NOT exist, then the collection will appear in collectionStates as a LazyCollectionRef, but it won't resolve to a DocCollection since there's no state.json. We don't poll or watch for the existence of the state.json for non-watched collections. This is what I don't understand. If /solr/collections/foo exists, won't the collection exist either in legacy clusterstate.json or in /solr/collections/foo/state.json? Granted that there is an edge case where a collection might be under construction but otherwise this should hold true? So if the goal is to return a Set of names of collections, why must we try to resolve each entry in ClusterState.collectionStates?
        Hide
        dragonsinth Scott Blum added a comment -

        Correct, the edge case is what we're guarding against here. I would be completely on board with making collectionStates() cheap, but we would need to go update all the call sites to guard against that edge case; ie, using a key returned from collectionStates(), all the call sites need to expect that getCollection(key) might return null.

        Show
        dragonsinth Scott Blum added a comment - Correct, the edge case is what we're guarding against here. I would be completely on board with making collectionStates() cheap, but we would need to go update all the call sites to guard against that edge case; ie, using a key returned from collectionStates(), all the call sites need to expect that getCollection(key) might return null.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        I tried doing this (burned a few days actually) but there are just too many places to fix and thinking more on this, it just feels like a hack and a bad API. I think a better approach may be to ensure that we always create the collection znode and state.json together in one go, maybe as a 'multi' transaction. This would require a lot of cleanup e.g. bootstrapping mode may need to go away, ZkController.createCollectionZkNode must be purged, ZkCli's linkConfig command would no longer operate if the collection doesn't exist (do people even use it that way?) but I think we should do it.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - I tried doing this (burned a few days actually) but there are just too many places to fix and thinking more on this, it just feels like a hack and a bad API. I think a better approach may be to ensure that we always create the collection znode and state.json together in one go, maybe as a 'multi' transaction. This would require a lot of cleanup e.g. bootstrapping mode may need to go away, ZkController.createCollectionZkNode must be purged, ZkCli's linkConfig command would no longer operate if the collection doesn't exist (do people even use it that way?) but I think we should do it.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 451feb0f8ea820e21dbf3a55c7a5664d2f750803 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=451feb0 ]

        SOLR-9014: Deprecate and reduce usage of ClusterState methods which may make calls to ZK via the lazy collection reference
        (cherry picked from commit 922265b)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 451feb0f8ea820e21dbf3a55c7a5664d2f750803 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=451feb0 ] SOLR-9014 : Deprecate and reduce usage of ClusterState methods which may make calls to ZK via the lazy collection reference (cherry picked from commit 922265b)
        Hide
        dragonsinth Scott Blum added a comment -

        I don't know... seems like a dicey game to try to ensure that no collection node can ever exist without a state.json. Even the process of migrating state format would have to be transactional. It seems really tricky to both get all the edge cases right and convince ourselves there's never a legitimate use case (like having a non-collection just for holding shared config).

        I think what I would most be in favor of is deprecating the existing API and return a Map<String, DocCollection> to at least avoid the cases where we're doing double lookups today. Most callers are iterating over the returned set and fetching individual DocCollections in a loop, which causes a second lookup, which is the real problem. You can never avoid doing at least one lookup anyway.

        An alternate, wacky idea, is to do the same thing but return a custom Map implementation that lazily determines the map contents.

        Show
        dragonsinth Scott Blum added a comment - I don't know... seems like a dicey game to try to ensure that no collection node can ever exist without a state.json. Even the process of migrating state format would have to be transactional. It seems really tricky to both get all the edge cases right and convince ourselves there's never a legitimate use case (like having a non-collection just for holding shared config). I think what I would most be in favor of is deprecating the existing API and return a Map<String, DocCollection> to at least avoid the cases where we're doing double lookups today. Most callers are iterating over the returned set and fetching individual DocCollections in a loop, which causes a second lookup, which is the real problem. You can never avoid doing at least one lookup anyway. An alternate, wacky idea, is to do the same thing but return a custom Map implementation that lazily determines the map contents.
        Hide
        dragonsinth Scott Blum added a comment -

        BTW, I'm surprised this was super tricky. There aren't really that many callers. I count:

        16 - solr-core
         9 - solr-core-tests
         2 - solr-test-framework
         5 - solrj
         1 - solrj-tests
        
        Show
        dragonsinth Scott Blum added a comment - BTW, I'm surprised this was super tricky. There aren't really that many callers. I count: 16 - solr-core 9 - solr-core-tests 2 - solr-test-framework 5 - solrj 1 - solrj-tests
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        No, actually, I segued into my idea of creating the state.json together with the collection znode which took so much time. Also, somehow returning List<DocCollection> from ClusterState.getCollections seems to trigger SOLR-9030 much more often. I'll take another crack at it even though I don't like this solution so much.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - No, actually, I segued into my idea of creating the state.json together with the collection znode which took so much time. Also, somehow returning List<DocCollection> from ClusterState.getCollections seems to trigger SOLR-9030 much more often. I'll take another crack at it even though I don't like this solution so much.
        Hide
        elyograg Shawn Heisey added a comment -

        As you read this, remember that I have never looked at a single line of any of zookeeper code in Solr, so I'm a strictly 30,000-foot-view guy with only a little bit of practical experience with SolrCloud. I do have some opinions about the things I do understand.

        bootstrapping mode may need to go away, ZkController.createCollectionZkNode must be purged, ZkCli's linkConfig command would no longer operate if the collection doesn't exist (do people even use it that way?)

        I have never liked bootstrapping mode. I know that we need a way for somebody to convert a non-cloud install to a cloud install, but bootstrapping has always felt like an extreme hack, producing strange collections that I would want to remove from production as quickly as possible. Perhaps bootstrapping can be moved to the scripting layer... but I'm sure a lot of thought will need to go into it.

        I can't say anything about ZkController.createCollectionZkNode, because I don't know the code.

        Quite some time ago, I was really surprised to learn that linkconfig would work on a collection that didn't even exist yet. Although I can see the appeal, I believe that we should not have multiple ways to achieve the same result. Unless a config with the same name as the collection happens to exist, collection.configName should be a required parameter on the CREATE action of the Collections API.

        Show
        elyograg Shawn Heisey added a comment - As you read this, remember that I have never looked at a single line of any of zookeeper code in Solr, so I'm a strictly 30,000-foot-view guy with only a little bit of practical experience with SolrCloud. I do have some opinions about the things I do understand. bootstrapping mode may need to go away, ZkController.createCollectionZkNode must be purged, ZkCli's linkConfig command would no longer operate if the collection doesn't exist (do people even use it that way?) I have never liked bootstrapping mode. I know that we need a way for somebody to convert a non-cloud install to a cloud install, but bootstrapping has always felt like an extreme hack, producing strange collections that I would want to remove from production as quickly as possible. Perhaps bootstrapping can be moved to the scripting layer... but I'm sure a lot of thought will need to go into it. I can't say anything about ZkController.createCollectionZkNode, because I don't know the code. Quite some time ago, I was really surprised to learn that linkconfig would work on a collection that didn't even exist yet. Although I can see the appeal, I believe that we should not have multiple ways to achieve the same result. Unless a config with the same name as the collection happens to exist, collection.configName should be a required parameter on the CREATE action of the Collections API.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        You have bootstrapping because SolrCloud did not even have a Collections API in 4.0.

        Show
        markrmiller@gmail.com Mark Miller added a comment - You have bootstrapping because SolrCloud did not even have a Collections API in 4.0.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        bootstrapping mode may need to go away,

        Ideally, stateFormat=1, bootstrapping, predefined cores, legacyCloudMode would have all gone away completely in 6x.

        Tests have been what makes it sticky to do though. Hopefully we can do it for 7x.

        Show
        markrmiller@gmail.com Mark Miller added a comment - bootstrapping mode may need to go away, Ideally, stateFormat=1, bootstrapping, predefined cores, legacyCloudMode would have all gone away completely in 6x. Tests have been what makes it sticky to do though. Hopefully we can do it for 7x.
        Hide
        dsmiley David Smiley added a comment -

        Ideally, stateFormat=1, bootstrapping, predefined cores, legacyCloudMode would have all gone away completely in 6x.

        +1 to that! numShards as a System property is evil.

        Show
        dsmiley David Smiley added a comment - Ideally, stateFormat=1, bootstrapping, predefined cores, legacyCloudMode would have all gone away completely in 6x. +1 to that! numShards as a System property is evil.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -
        1. Deprecates the ClusterState.getCollections method and introduces a new getCollectionsMap method.
        2. Removes a redundant call at ZkController.publishAndWaitForDownStates
        3. Fixed SQLHandler.open which called getCollections twice, once to get size and then again to actually get the set of collection names

        As I said earlier this change trips SOLR-9030 a lot more. I'll fix that issue first before I commit this patch.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Deprecates the ClusterState.getCollections method and introduces a new getCollectionsMap method. Removes a redundant call at ZkController.publishAndWaitForDownStates Fixed SQLHandler.open which called getCollections twice, once to get size and then again to actually get the set of collection names As I said earlier this change trips SOLR-9030 a lot more. I'll fix that issue first before I commit this patch.
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Thanks Shawn, we should fix linkConfig – I don't think we should support that use-case. We should also remove bootstrapping, it is long overdue. I'll create an issue though I won't have the time to attack it soon.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Thanks Shawn, we should fix linkConfig – I don't think we should support that use-case. We should also remove bootstrapping, it is long overdue. I'll create an issue though I won't have the time to attack it soon.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit f5497a33e29d087dc0e87ccc697e85f5018d8702 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f5497a3 ]

        SOLR-9014: Deprecate ClusterState.getCollections and introduce a new ClusterState.getCollectionsMap instead

        Show
        jira-bot ASF subversion and git services added a comment - Commit f5497a33e29d087dc0e87ccc697e85f5018d8702 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f5497a3 ] SOLR-9014 : Deprecate ClusterState.getCollections and introduce a new ClusterState.getCollectionsMap instead
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 6ade99947a6e123e3783eb3c3799525e4328e8bc in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6ade999 ]

        SOLR-9014: Fix javadoc

        Show
        jira-bot ASF subversion and git services added a comment - Commit 6ade99947a6e123e3783eb3c3799525e4328e8bc in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6ade999 ] SOLR-9014 : Fix javadoc
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 7bfaa51079aaea13630bcadc8054ba91277b6e04 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7bfaa51 ]

        SOLR-9014: Deprecate ClusterState.getCollections and introduce a new ClusterState.getCollectionsMap instead
        (cherry picked from commit f5497a3)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 7bfaa51079aaea13630bcadc8054ba91277b6e04 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7bfaa51 ] SOLR-9014 : Deprecate ClusterState.getCollections and introduce a new ClusterState.getCollectionsMap instead (cherry picked from commit f5497a3)
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 7ce1c2cb74e3c66a22d7ffc07ef392d651848212 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7ce1c2c ]

        SOLR-9014: Fix javadoc
        (cherry picked from commit 6ade999)

        Show
        jira-bot ASF subversion and git services added a comment - Commit 7ce1c2cb74e3c66a22d7ffc07ef392d651848212 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7ce1c2c ] SOLR-9014 : Fix javadoc (cherry picked from commit 6ade999)
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Let's create separate issues for the other improvements discussed here.

        Thanks to all who have contributed!

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Let's create separate issues for the other improvements discussed here. Thanks to all who have contributed!
        Hide
        hossman Hoss Man added a comment -

        Manually correcting fixVersion per Step #S5 of LUCENE-7271

        Show
        hossman Hoss Man added a comment - Manually correcting fixVersion per Step #S5 of LUCENE-7271

          People

          • Assignee:
            shalinmangar Shalin Shekhar Mangar
            Reporter:
            shalinmangar Shalin Shekhar Mangar
          • Votes:
            2 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development