Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9106

Cache cluster properties in ZkStateReader

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.0
    • Fix Version/s: 6.1
    • Component/s: None
    • Labels:
      None

      Description

      ZkStateReader currently makes calls into ZK every time getClusterProps() is called. Instead we should keep the data locally and use a Watcher to update it.

      1. SOLR-9106.patch
        44 kB
        Alan Woodward
      2. SOLR-9106.patch
        21 kB
        Alan Woodward

        Issue Links

          Activity

          Hide
          romseygeek Alan Woodward added a comment -

          Patch. Also adds a type-safe .getClusterProperty() method, which neatens a few things up a bit.

          Show
          romseygeek Alan Woodward added a comment - Patch. Also adds a type-safe .getClusterProperty() method, which neatens a few things up a bit.
          Hide
          dragonsinth Scott Blum added a comment -

          LGTM if tests pass. One comment:

          -  public static boolean isLegacy(Map clusterProps) {
          -    return !"false".equals(clusterProps.get(ZkStateReader.LEGACY_CLOUD));
          +  public static boolean isLegacy(ZkStateReader stateReader) {
          +    return stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, "false").equals("false") == false;
             }
          

          I think !"false".equals(stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, "false")) would be slightly more readable more like the original. That whole "false").equals("false") == false bit makes my head spin LOL.

          Show
          dragonsinth Scott Blum added a comment - LGTM if tests pass. One comment: - public static boolean isLegacy(Map clusterProps) { - return ! " false " .equals(clusterProps.get(ZkStateReader.LEGACY_CLOUD)); + public static boolean isLegacy(ZkStateReader stateReader) { + return stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, " false " ).equals( " false " ) == false ; } I think !"false".equals(stateReader.getClusterProperty(ZkStateReader.LEGACY_CLOUD, "false")) would be slightly more readable more like the original. That whole "false").equals("false") == false bit makes my head spin LOL.
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Alan,

          There is one common catch block now

          catch (KeeperException | InterruptedException e) {

          Should we handle them separately as we are currently doing so that we can set the interrupt flag?

          Show
          varunthacker Varun Thacker added a comment - Hi Alan, There is one common catch block now catch (KeeperException | InterruptedException e) { Should we handle them separately as we are currently doing so that we can set the interrupt flag?
          Hide
          romseygeek Alan Woodward added a comment -

          Scott Blum yeah, you're right, that's pretty ugly. And it's also wrong, as legacyCloud defaults to 'true' at the moment. Will fix!

          Varun Thacker that's dealt with by the call to SolrZkClient.checkInterrupted()

          Show
          romseygeek Alan Woodward added a comment - Scott Blum yeah, you're right, that's pretty ugly. And it's also wrong, as legacyCloud defaults to 'true' at the moment. Will fix! Varun Thacker that's dealt with by the call to SolrZkClient.checkInterrupted()
          Hide
          varunthacker Varun Thacker added a comment -

          that's dealt with by the call to SolrZkClient.checkInterrupted()

          True! I obviously didn't read the LOG.error line .

          Show
          varunthacker Varun Thacker added a comment - that's dealt with by the call to SolrZkClient.checkInterrupted() True! I obviously didn't read the LOG.error line .
          Hide
          romseygeek Alan Woodward added a comment -

          Hm, so a couple of the tests complicate this a bit by calling ZkStateReader without setting up all the watches, etc (something that I want to make not possible in SOLR-9056). To get around this, I've introduced a ClusterProperties class that will read and write properties in real-time, if need be.

          I've also had to make the ZkController baseURL non-final, because it needs to be set after the state reader watches are created, and that isn't done in the constructor. Which is annoying, but again SOLR-9056 cleans that up so I'm not too concerned.

          All tests pass with this patch.

          Show
          romseygeek Alan Woodward added a comment - Hm, so a couple of the tests complicate this a bit by calling ZkStateReader without setting up all the watches, etc (something that I want to make not possible in SOLR-9056 ). To get around this, I've introduced a ClusterProperties class that will read and write properties in real-time, if need be. I've also had to make the ZkController baseURL non-final, because it needs to be set after the state reader watches are created, and that isn't done in the constructor. Which is annoying, but again SOLR-9056 cleans that up so I'm not too concerned. All tests pass with this patch.
          Hide
          dragonsinth Scott Blum added a comment -

          LGTM

          Show
          dragonsinth Scott Blum added a comment - LGTM
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 77962f4af4e7f376255954f00e8b97f3dff94396 in lucene-solr's branch refs/heads/master from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=77962f4 ]

          SOLR-9106: Cache cluster properties on ZkStateReader

          Show
          jira-bot ASF subversion and git services added a comment - Commit 77962f4af4e7f376255954f00e8b97f3dff94396 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=77962f4 ] SOLR-9106 : Cache cluster properties on ZkStateReader
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit dd23fa40153214095527cdeb77c4255cafbc21f9 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dd23fa4 ]

          SOLR-9106: Cache cluster properties on ZkStateReader

          Show
          jira-bot ASF subversion and git services added a comment - Commit dd23fa40153214095527cdeb77c4255cafbc21f9 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dd23fa4 ] SOLR-9106 : Cache cluster properties on ZkStateReader
          Hide
          romseygeek Alan Woodward added a comment -

          Thanks all.

          Show
          romseygeek Alan Woodward added a comment - Thanks all.
          Hide
          noble.paul Noble Paul added a comment -

          Hi Alan Woodward I missed this. Not watching clusterprops was done deliberately. it is read rarely and watching is expensive.

          Show
          noble.paul Noble Paul added a comment - Hi Alan Woodward I missed this. Not watching clusterprops was done deliberately. it is read rarely and watching is expensive.
          Hide
          hgadre Hrishikesh Gadre added a comment -

          We found a scenario where the ZK watch was not fired in a timely fashion resulting in the unit test failure (SOLR-9242). The changes made to fix that failure are exactly opposite to the intent of this JIRA (i.e. it brings back the old behavior).

          https://github.com/apache/lucene-solr/commit/eefdc62c997f7936b6db203111d8149dc934b243

          I agree with Nobel here. Alan Woodward is reading cluster properties every-time a serious performance problem?

          CC Varun Thacker

          Show
          hgadre Hrishikesh Gadre added a comment - We found a scenario where the ZK watch was not fired in a timely fashion resulting in the unit test failure ( SOLR-9242 ). The changes made to fix that failure are exactly opposite to the intent of this JIRA (i.e. it brings back the old behavior). https://github.com/apache/lucene-solr/commit/eefdc62c997f7936b6db203111d8149dc934b243 I agree with Nobel here. Alan Woodward is reading cluster properties every-time a serious performance problem? CC Varun Thacker
          Hide
          varunthacker Varun Thacker added a comment -

          I think its okay to cache it .

          We should maybe add some documentation that getClusterProperty might not always have a latest value if an update has just taken place and that in scenarios where we want the latest value we should forceRefresh it before reading.

          So things like backup/restore can forceRefresh it since its a user facing API where the user can set some values in the cluster property file and expect it to be there. Things like the OverseerAutoReplicaFailoverThread where the feature might kick in after a couple of seconds because of this caching is fine.

          Show
          varunthacker Varun Thacker added a comment - I think its okay to cache it . We should maybe add some documentation that getClusterProperty might not always have a latest value if an update has just taken place and that in scenarios where we want the latest value we should forceRefresh it before reading. So things like backup/restore can forceRefresh it since its a user facing API where the user can set some values in the cluster property file and expect it to be there. Things like the OverseerAutoReplicaFailoverThread where the feature might kick in after a couple of seconds because of this caching is fine.
          Hide
          noble.paul Noble Paul added a comment -

          Varun Thacker the problem is, this ticket introduced a new problem where there was none. As Hrishikesh Gadre said reading that value once in a while was not a performance issue. Now we are paying the price for constantly watching it and we still don't have up to date data.

          Show
          noble.paul Noble Paul added a comment - Varun Thacker the problem is, this ticket introduced a new problem where there was none. As Hrishikesh Gadre said reading that value once in a while was not a performance issue. Now we are paying the price for constantly watching it and we still don't have up to date data.
          Hide
          romseygeek Alan Woodward added a comment -

          Not watching clusterprops was done deliberately. it is read rarely and watching is expensive.

          This seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here?

          If it's true, then it needs to be explicitly documented on the .getClusterProperties() method that it will always read data from ZK - maybe it should be called .readClusterProperties() instead. And it ends up being used in some expected places, for example during admin requests from a CloudSolrClient.

          Show
          romseygeek Alan Woodward added a comment - Not watching clusterprops was done deliberately. it is read rarely and watching is expensive. This seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here? If it's true, then it needs to be explicitly documented on the .getClusterProperties() method that it will always read data from ZK - maybe it should be called .readClusterProperties() instead. And it ends up being used in some expected places, for example during admin requests from a CloudSolrClient.
          Hide
          noble.paul Noble Paul added a comment -

          bqThis seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here
          The problem with watches is that we really don't know when the changes are propagated to each node. If we read the data just in time we always get the latest data. Watching is light weight, I haven't done any benchmarking yet.

          Show
          noble.paul Noble Paul added a comment - bqThis seems exactly backwards to me? Watches are pretty lightweight. Do you have some numbers here The problem with watches is that we really don't know when the changes are propagated to each node. If we read the data just in time we always get the latest data. Watching is light weight, I haven't done any benchmarking yet.
          Hide
          romseygeek Alan Woodward added a comment -

          If we read the data just in time we always get the latest data

          If you need this behaviour, you can use an explicit ClusterProperties object. ZkCLI and some tests already do this. I'll add some javadocs to ZkStateReader.getClusterProperty() to make this more explicit though.

          Show
          romseygeek Alan Woodward added a comment - If we read the data just in time we always get the latest data If you need this behaviour, you can use an explicit ClusterProperties object. ZkCLI and some tests already do this. I'll add some javadocs to ZkStateReader.getClusterProperty() to make this more explicit though.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8247f9f0cd8fccf7ecfb6d76cb8f180079ece3bb in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8247f9f ]

          SOLR-9106: Add javadocs to ZkStateReader cluster properties methods

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8247f9f0cd8fccf7ecfb6d76cb8f180079ece3bb in lucene-solr's branch refs/heads/branch_6x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8247f9f ] SOLR-9106 : Add javadocs to ZkStateReader cluster properties methods
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c2db9fae2cc312a13a66e6dab9989ed65738fe02 in lucene-solr's branch refs/heads/master from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2db9fa ]

          SOLR-9106: Add javadocs to ZkStateReader cluster properties methods

          Show
          jira-bot ASF subversion and git services added a comment - Commit c2db9fae2cc312a13a66e6dab9989ed65738fe02 in lucene-solr's branch refs/heads/master from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2db9fa ] SOLR-9106 : Add javadocs to ZkStateReader cluster properties methods

            People

            • Assignee:
              romseygeek Alan Woodward
              Reporter:
              romseygeek Alan Woodward
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development