Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9106

Cache cluster properties in ZkStateReader

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master (7.0)
    • Fix Version/s: 6.1
    • Component/s: None
    • Labels:
      None

      Description

      ZkStateReader currently makes calls into ZK every time getClusterProps() is called. Instead we should keep the data locally and use a Watcher to update it.

      1. SOLR-9106.patch
        44 kB
        Alan Woodward
      2. SOLR-9106.patch
        21 kB
        Alan Woodward

        Activity

        Hide
        romseygeek Alan Woodward added a comment -

        Patch. Also adds a type-safe .getClusterProperty() method, which neatens a few things up a bit.

        Show
        romseygeek Alan Woodward added a comment - Patch. Also adds a type-safe .getClusterProperty() method, which neatens a few things up a bit.
        Hide
        dragonsinth Scott Blum added a comment -

        LGTM if tests pass. One comment:

        -  public static boolean isLegacy(Map clusterProps) {
        -    return !"false".equals(clusterProps.get(ZkStateReader.LEGACY_CLOUD));
        +  public static boolean isLegacy(ZkStateReader stateReader) {
        +    return stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, "false").equals("false") == false;
           }
        

        I think !"false".equals(stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, "false")) would be slightly more readable more like the original. That whole "false").equals("false") == false bit makes my head spin LOL.

        Show
        dragonsinth Scott Blum added a comment - LGTM if tests pass. One comment: - public static boolean isLegacy(Map clusterProps) { - return ! " false " .equals(clusterProps.get(ZkStateReader.LEGACY_CLOUD)); + public static boolean isLegacy(ZkStateReader stateReader) { + return stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, " false " ).equals( " false " ) == false ; } I think !"false".equals(stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, "false")) would be slightly more readable more like the original. That whole "false").equals("false") == false bit makes my head spin LOL.
        Hide
        varunthacker Varun Thacker added a comment -

        Hi Alan,

        There is one common catch block now

        catch (KeeperException | InterruptedException e) {

        Should we handle them separately as we are currently doing so that we can set the interrupt flag?

        Show
        varunthacker Varun Thacker added a comment - Hi Alan, There is one common catch block now catch (KeeperException | InterruptedException e) { Should we handle them separately as we are currently doing so that we can set the interrupt flag?
        Hide
        romseygeek Alan Woodward added a comment -

        Scott Blum yeah, you're right, that's pretty ugly. And it's also wrong, as legacyCloud defaults to 'true' at the moment. Will fix!

        Varun Thacker that's dealt with by the call to SolrZkClient.checkInterrupted()

        Show
        romseygeek Alan Woodward added a comment - Scott Blum yeah, you're right, that's pretty ugly. And it's also wrong, as legacyCloud defaults to 'true' at the moment. Will fix! Varun Thacker that's dealt with by the call to SolrZkClient.checkInterrupted()
        Hide
        varunthacker Varun Thacker added a comment -

        that's dealt with by the call to SolrZkClient.checkInterrupted()

        True! I obviously didn't read the LOG.error line .

        Show
        varunthacker Varun Thacker added a comment - that's dealt with by the call to SolrZkClient.checkInterrupted() True! I obviously didn't read the LOG.error line .
        Hide
        romseygeek Alan Woodward added a comment -

        Hm, so a couple of the tests complicate this a bit by calling ZkStateReader without setting up all the watches, etc (something that I want to make not possible in SOLR-9056). To get around this, I've introduced a ClusterProperties class that will read and write properties in real-time, if need be.

        I've also had to make the ZkController baseURL non-final, because it needs to be set after the state reader watches are created, and that isn't done in the constructor. Which is annoying, but again SOLR-9056 cleans that up so I'm not too concerned.

        All tests pass with this patch.

        Show
        romseygeek Alan Woodward added a comment - Hm, so a couple of the tests complicate this a bit by calling ZkStateReader without setting up all the watches, etc (something that I want to make not possible in SOLR-9056 ). To get around this, I've introduced a ClusterProperties class that will read and write properties in real-time, if need be. I've also had to make the ZkController baseURL non-final, because it needs to be set after the state reader watches are created, and that isn't done in the constructor. Which is annoying, but again SOLR-9056 cleans that up so I'm not too concerned. All tests pass with this patch.
        Hide
        dragonsinth Scott Blum added a comment -

        LGTM

        Show
        dragonsinth Scott Blum added a comment - LGTM
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 77962f4af4e7f376255954f00e8b97f3dff94396 in lucene-solr's branch refs/heads/master from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=77962f4 ]

        SOLR-9106: Cache cluster properties on ZkStateReader

        Show
        jira-bot ASF subversion and git services added a comment - Commit 77962f4af4e7f376255954f00e8b97f3dff94396 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=77962f4 ] SOLR-9106 : Cache cluster properties on ZkStateReader
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit dd23fa40153214095527cdeb77c4255cafbc21f9 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dd23fa4 ]

        SOLR-9106: Cache cluster properties on ZkStateReader

        Show
        jira-bot ASF subversion and git services added a comment - Commit dd23fa40153214095527cdeb77c4255cafbc21f9 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dd23fa4 ] SOLR-9106 : Cache cluster properties on ZkStateReader
        Hide
        romseygeek Alan Woodward added a comment -

        Thanks all.

        Show
        romseygeek Alan Woodward added a comment - Thanks all.
        Hide
        noble.paul Noble Paul added a comment -

        Hi Alan Woodward I missed this. Not watching clusterprops was done deliberately. it is read rarely and watching is expensive.

        Show
        noble.paul Noble Paul added a comment - Hi Alan Woodward I missed this. Not watching clusterprops was done deliberately. it is read rarely and watching is expensive.
        Hide
        hgadre Hrishikesh Gadre added a comment -

        We found a scenario where the ZK watch was not fired in a timely fashion resulting in the unit test failure (SOLR-9242). The changes made to fix that failure are exactly opposite to the intent of this JIRA (i.e. it brings back the old behavior).

        https://github.com/apache/lucene-solr/commit/eefdc62c997f7936b6db203111d8149dc934b243

        I agree with Nobel here. Alan Woodward is reading cluster properties every-time a serious performance problem?

        CC Varun Thacker

        Show
        hgadre Hrishikesh Gadre added a comment - We found a scenario where the ZK watch was not fired in a timely fashion resulting in the unit test failure ( SOLR-9242 ). The changes made to fix that failure are exactly opposite to the intent of this JIRA (i.e. it brings back the old behavior). https://github.com/apache/lucene-solr/commit/eefdc62c997f7936b6db203111d8149dc934b243 I agree with Nobel here. Alan Woodward is reading cluster properties every-time a serious performance problem? CC Varun Thacker
        Hide
        varunthacker Varun Thacker added a comment -

        I think its okay to cache it .

        We should maybe add some documentation that getClusterProperty might not always have a latest value if an update has just taken place and that in scenarios where we want the latest value we should forceRefresh it before reading.

        So things like backup/restore can forceRefresh it since its a user facing API where the user can set some values in the cluster property file and expect it to be there. Things like the OverseerAutoReplicaFailoverThread where the feature might kick in after a couple of seconds because of this caching is fine.

        Show
        varunthacker Varun Thacker added a comment - I think its okay to cache it . We should maybe add some documentation that getClusterProperty might not always have a latest value if an update has just taken place and that in scenarios where we want the latest value we should forceRefresh it before reading. So things like backup/restore can forceRefresh it since its a user facing API where the user can set some values in the cluster property file and expect it to be there. Things like the OverseerAutoReplicaFailoverThread where the feature might kick in after a couple of seconds because of this caching is fine.
        Hide
        noble.paul Noble Paul added a comment -

        Varun Thacker the problem is, this ticket introduced a new problem where there was none. As Hrishikesh Gadre said reading that value once in a while was not a performance issue. Now we are paying the price for constantly watching it and we still don't have up to date data.

        Show
        noble.paul Noble Paul added a comment - Varun Thacker the problem is, this ticket introduced a new problem where there was none. As Hrishikesh Gadre said reading that value once in a while was not a performance issue. Now we are paying the price for constantly watching it and we still don't have up to date data.
        Hide
        romseygeek Alan Woodward added a comment -

        Not watching clusterprops was done deliberately. it is read rarely and watching is expensive.

        This seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here?

        If it's true, then it needs to be explicitly documented on the .getClusterProperties() method that it will always read data from ZK - maybe it should be called .readClusterProperties() instead. And it ends up being used in some expected places, for example during admin requests from a CloudSolrClient.

        Show
        romseygeek Alan Woodward added a comment - Not watching clusterprops was done deliberately. it is read rarely and watching is expensive. This seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here? If it's true, then it needs to be explicitly documented on the .getClusterProperties() method that it will always read data from ZK - maybe it should be called .readClusterProperties() instead. And it ends up being used in some expected places, for example during admin requests from a CloudSolrClient.
        Hide
        noble.paul Noble Paul added a comment -

        bqThis seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here
        The problem with watches is that we really don't know when the changes are propagated to each node. If we read the data just in time we always get the latest data. Watching is light weight, I haven't done any benchmarking yet.

        Show
        noble.paul Noble Paul added a comment - bqThis seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here The problem with watches is that we really don't know when the changes are propagated to each node. If we read the data just in time we always get the latest data. Watching is light weight, I haven't done any benchmarking yet.
        Hide
        romseygeek Alan Woodward added a comment -

        If we read the data just in time we always get the latest data

        If you need this behaviour, you can use an explicit ClusterProperties object. ZkCLI and some tests already do this. I'll add some javadocs to ZkStateReader.getClusterProperty() to make this more explicit though.

        Show
        romseygeek Alan Woodward added a comment - If we read the data just in time we always get the latest data If you need this behaviour, you can use an explicit ClusterProperties object. ZkCLI and some tests already do this. I'll add some javadocs to ZkStateReader.getClusterProperty() to make this more explicit though.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 8247f9f0cd8fccf7ecfb6d76cb8f180079ece3bb in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8247f9f ]

        SOLR-9106: Add javadocs to ZkStateReader cluster properties methods

        Show
        jira-bot ASF subversion and git services added a comment - Commit 8247f9f0cd8fccf7ecfb6d76cb8f180079ece3bb in lucene-solr's branch refs/heads/branch_6x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8247f9f ] SOLR-9106 : Add javadocs to ZkStateReader cluster properties methods
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit c2db9fae2cc312a13a66e6dab9989ed65738fe02 in lucene-solr's branch refs/heads/master from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2db9fa ]

        SOLR-9106: Add javadocs to ZkStateReader cluster properties methods

        Show
        jira-bot ASF subversion and git services added a comment - Commit c2db9fae2cc312a13a66e6dab9989ed65738fe02 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2db9fa ] SOLR-9106 : Add javadocs to ZkStateReader cluster properties methods

          People

          • Assignee:
            romseygeek Alan Woodward
            Reporter:
            romseygeek Alan Woodward
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development