Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8662

SchemaManager doesn't wait correctly for replicas to update

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5.1, 6.0.1, 6.1, 7.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently in SchemaManager, waitForOtherReplicasToUpdate doesn't execute since it gets passed the timeout value instead of the end time

      1. SOLR-8662.patch
        24 kB
        Varun Thacker
      2. SOLR-8662.patch
        24 kB
        Varun Thacker
      3. SOLR-8662.patch
        21 kB
        Noble Paul
      4. SOLR-8662.patch
        19 kB
        Varun Thacker

        Issue Links

          Activity

          Hide
          varunthacker Varun Thacker added a comment -
          • ZkSolrResourceLoader#openResource retires fetching the resource in case of session expiry. It also sets back the interrupt flag.
          • Schema API should only be used if it's a managed resource AND mutable. The mutable check was missing the mutable part
          • In waitForOtherReplicasToUpdate there was a bug in the time conversion. Fixed that
          • There was no usage of test-files/solr/configsets/configset-1 . Renamed it to cloud-managed and made necessary changes to support cloud in managed schema mode. This configset is used in the test

          Without the changes to SchemaManager the test would fail.

          Show
          varunthacker Varun Thacker added a comment - ZkSolrResourceLoader#openResource retires fetching the resource in case of session expiry. It also sets back the interrupt flag. Schema API should only be used if it's a managed resource AND mutable. The mutable check was missing the mutable part In waitForOtherReplicasToUpdate there was a bug in the time conversion. Fixed that There was no usage of test-files/solr/configsets/configset-1 . Renamed it to cloud-managed and made necessary changes to support cloud in managed schema mode. This configset is used in the test Without the changes to SchemaManager the test would fail.
          Hide
          steve_rowe Steve Rowe added a comment -

          Varun, I couldn't get your patch to apply using git apply (git v2.7.4) - something about the file moves made it so that git couldn't find change targets - the patch appears to have been produced by IntelliJ. Since I use IntelliJ, I was able to apply the patch using it. But I think it's better to use cmdline git when producing patches so that you don't restrict the audience.

          Show
          steve_rowe Steve Rowe added a comment - Varun, I couldn't get your patch to apply using git apply (git v2.7.4) - something about the file moves made it so that git couldn't find change targets - the patch appears to have been produced by IntelliJ. Since I use IntelliJ, I was able to apply the patch using it. But I think it's better to use cmdline git when producing patches so that you don't restrict the audience.
          Hide
          steve_rowe Steve Rowe added a comment - - edited

          Lots of good changes in your patch, Varun! I ran all Solr+Solrj+contrib tests and everything passes.

          I found a few issues though.

          This change is good, but there remains a pre-existing problem:

          -    int timeout = req.getParams().getInt(BaseSolrResource.UPDATE_TIMEOUT_SECS, -1);
          -    long startTime = System.nanoTime();
          -    long endTime = timeout > 0 ? System.nanoTime() + (timeout * 1000 * 1000) : Long.MAX_VALUE;
          +    //The default timeout is 60s when no BaseSolrResource.UPDATE_TIMEOUT_SECS is specified
          +    int timeout = req.getParams().getInt(BaseSolrResource.UPDATE_TIMEOUT_SECS, 60);
          +
          +    //If BaseSolrResource.UPDATE_TIMEOUT_SECS=0 or -1 then end time is unlimited.
          +    long endTime = timeout > 0 ? TimeUnit.NANOSECONDS.convert(timeout, TimeUnit.SECONDS) + System.nanoTime() : Long.MAX_VALUE;
          

          The problem is that System.nanoTime() can return anything in the range [Long.MIN_VALUE, Long.MAX_VALUE] and it intentionally overflows, so making endTime be Long.MAX_VALUE is effectively choosing a random time in the future; this is not the same as "unlimited". This same situation is dealt with in the QueryTimeout implementations - used by ExitableDirectoryReader - by adding a large value to the current value of nanoTime() to arrive at an effectively unlimited end time.


          Your changes in ZkSolrResourceLoader.openResource() introduce the possibility of retrying 10 times to fetch a resource from ZK when it's not there (according to zkController.pathExists(file)) - I think in this case the retry loop should be immediately exited, so that the classpath search can start immediately. (I'm assuming that the intent here is not to wait around for another thread/process to finish uploading a resource - if that's the case then the branch for when the path doesn't exist should have a timeout, which it doesn't in your patch.)


          Without the changes to SchemaManager the test would fail.

          Indeed, when I revert the SchemaManager changes, TestManagedSchemaAPI.testAddFieldAndDocument() fails. But testReloadAndAddSimple() always succeeds, regardless of the SchemaManager changes, so I'm not sure why it's there, since it doesn't have anything to do with new field visibility - can it be removed?

          Show
          steve_rowe Steve Rowe added a comment - - edited Lots of good changes in your patch, Varun! I ran all Solr+Solrj+contrib tests and everything passes. I found a few issues though. This change is good, but there remains a pre-existing problem: - int timeout = req.getParams().getInt(BaseSolrResource.UPDATE_TIMEOUT_SECS, -1); - long startTime = System .nanoTime(); - long endTime = timeout > 0 ? System .nanoTime() + (timeout * 1000 * 1000) : Long .MAX_VALUE; + //The default timeout is 60s when no BaseSolrResource.UPDATE_TIMEOUT_SECS is specified + int timeout = req.getParams().getInt(BaseSolrResource.UPDATE_TIMEOUT_SECS, 60); + + //If BaseSolrResource.UPDATE_TIMEOUT_SECS=0 or -1 then end time is unlimited. + long endTime = timeout > 0 ? TimeUnit.NANOSECONDS.convert(timeout, TimeUnit.SECONDS) + System .nanoTime() : Long .MAX_VALUE; The problem is that System.nanoTime() can return anything in the range [ Long.MIN_VALUE , Long.MAX_VALUE ] and it intentionally overflows, so making endTime be Long.MAX_VALUE is effectively choosing a random time in the future; this is not the same as "unlimited". This same situation is dealt with in the QueryTimeout implementations - used by ExitableDirectoryReader - by adding a large value to the current value of nanoTime() to arrive at an effectively unlimited end time. Your changes in ZkSolrResourceLoader.openResource() introduce the possibility of retrying 10 times to fetch a resource from ZK when it's not there (according to zkController.pathExists(file) ) - I think in this case the retry loop should be immediately exited, so that the classpath search can start immediately. (I'm assuming that the intent here is not to wait around for another thread/process to finish uploading a resource - if that's the case then the branch for when the path doesn't exist should have a timeout, which it doesn't in your patch.) Without the changes to SchemaManager the test would fail. Indeed, when I revert the SchemaManager changes, TestManagedSchemaAPI.testAddFieldAndDocument() fails. But testReloadAndAddSimple() always succeeds, regardless of the SchemaManager changes, so I'm not sure why it's there, since it doesn't have anything to do with new field visibility - can it be removed?
          Hide
          noble.paul Noble Paul added a comment -

          We should use the TimeOut class for timeout checks. Much simpler

          Show
          noble.paul Noble Paul added a comment - We should use the TimeOut class for timeout checks. Much simpler
          Hide
          noble.paul Noble Paul added a comment -

          I think we should fix this and include this in 5.5.1 as well

          Show
          noble.paul Noble Paul added a comment - I think we should fix this and include this in 5.5.1 as well
          Hide
          varunthacker Varun Thacker added a comment -

          Thanks Steve and Noble for the review!
          Created the new patch from GIT instead of IntelliJ

          In SchemaManager if "updateTimeOutSecs" in 0 or -1 then we retry for 10 minutes . Earlier it was unlimited. I think capping the unlimited to 10 minutes is fair enough. Thoughts? It also uses the TimeOut class for the math. Also upped default timeout to 10 minutes instead of 1 minute earlier

          Your changes in ZkSolrResourceLoader.openResource() introduce the possibility of retrying 10 times to fetch a resource from ZK when it's not there

          Good catch! I've fixed that. We won't retry if the path doesn't exist in ZK.

          I've kept testReloadAndAddSimple() for now. In one test run when I was using AbstractDistribZkTestBase, I'd seen after the reload the session had expired while reading the managed-schema file. So I had added the test and retry logic. This was a couple of months back when I had initially worked on the patch so I don't remember everything.

          Show
          varunthacker Varun Thacker added a comment - Thanks Steve and Noble for the review! Created the new patch from GIT instead of IntelliJ In SchemaManager if "updateTimeOutSecs" in 0 or -1 then we retry for 10 minutes . Earlier it was unlimited. I think capping the unlimited to 10 minutes is fair enough. Thoughts? It also uses the TimeOut class for the math. Also upped default timeout to 10 minutes instead of 1 minute earlier Your changes in ZkSolrResourceLoader.openResource() introduce the possibility of retrying 10 times to fetch a resource from ZK when it's not there Good catch! I've fixed that. We won't retry if the path doesn't exist in ZK. I've kept testReloadAndAddSimple() for now. In one test run when I was using AbstractDistribZkTestBase, I'd seen after the reload the session had expired while reading the managed-schema file. So I had added the test and retry logic. This was a couple of months back when I had initially worked on the patch so I don't remember everything.
          Hide
          steve_rowe Steve Rowe added a comment -

          Varun, I think at least some of the latest patch was created with IntelliJ (I see Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP in there) - maybe you posted the wrong one?

          Show
          steve_rowe Steve Rowe added a comment - Varun, I think at least some of the latest patch was created with IntelliJ (I see Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP in there) - maybe you posted the wrong one?
          Hide
          steve_rowe Steve Rowe added a comment -

          and also I get errors when I attempt to apply it with git apply

          Show
          steve_rowe Steve Rowe added a comment - and also I get errors when I attempt to apply it with git apply
          Hide
          varunthacker Varun Thacker added a comment -

          Same patch as before. This one was created using git diff > SOLR-8662.patch .

          The previous patch was created using git format-patch --stdout -p --no-prefix origin > SOLR-8662.patch which didn't seem to apply correctly with git apply.

          I tried this patch and git apply worked

          Show
          varunthacker Varun Thacker added a comment - Same patch as before. This one was created using git diff > SOLR-8662 .patch . The previous patch was created using git format-patch --stdout -p --no-prefix origin > SOLR-8662 .patch which didn't seem to apply correctly with git apply . I tried this patch and git apply worked
          Hide
          steve_rowe Steve Rowe added a comment -

          crap, download failure on my part (wget renaming to *.1). sorry for the noise...

          Show
          steve_rowe Steve Rowe added a comment - crap, download failure on my part (wget renaming to *.1 ). sorry for the noise...
          Hide
          steve_rowe Steve Rowe added a comment -

          Thanks Varun for the latest patch, I'm looking now.

          Show
          steve_rowe Steve Rowe added a comment - Thanks Varun for the latest patch, I'm looking now.
          Hide
          steve_rowe Steve Rowe added a comment -

          +1 to the latest patch, LGTM, thanks Varun.

          Show
          steve_rowe Steve Rowe added a comment - +1 to the latest patch, LGTM, thanks Varun.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 352e9a705b1c63a21136def424b023126319d870 in lucene-solr's branch refs/heads/branch_5x from Varun Thacker
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=352e9a7 ]

          SOLR-8662: SchemaManager waits correctly for replicas to be notified of a new change

          Show
          jira-bot ASF subversion and git services added a comment - Commit 352e9a705b1c63a21136def424b023126319d870 in lucene-solr's branch refs/heads/branch_5x from Varun Thacker [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=352e9a7 ] SOLR-8662 : SchemaManager waits correctly for replicas to be notified of a new change
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit b3fe7f70030024b2a87286dbb81c25d4701856bb in lucene-solr's branch refs/heads/branch_5_5 from Varun Thacker
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b3fe7f7 ]

          SOLR-8662: SchemaManager waits correctly for replicas to be notified of a new change

          Show
          jira-bot ASF subversion and git services added a comment - Commit b3fe7f70030024b2a87286dbb81c25d4701856bb in lucene-solr's branch refs/heads/branch_5_5 from Varun Thacker [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b3fe7f7 ] SOLR-8662 : SchemaManager waits correctly for replicas to be notified of a new change
          Hide
          varunthacker Varun Thacker added a comment -
          • Committed to master and branch_6x with commit = 44c9cd2 . I missed out adding the SOLR-8662 prefix.
          • branch_5x TestManagedSchemaAPI uses MiniSolrCloudCluster.
          • Committed to branch_5x and branch_5_5 with commit = 352e9a7 .

          Thanks Steve and Noble for the review!

          Show
          varunthacker Varun Thacker added a comment - Committed to master and branch_6x with commit = 44c9cd2 . I missed out adding the SOLR-8662 prefix. branch_5x TestManagedSchemaAPI uses MiniSolrCloudCluster. Committed to branch_5x and branch_5_5 with commit = 352e9a7 . Thanks Steve and Noble for the review!
          Hide
          hossman Hoss Man added a comment -

          Manually correcting fixVersion per Step #S5 of LUCENE-7271

          Show
          hossman Hoss Man added a comment - Manually correcting fixVersion per Step #S5 of LUCENE-7271
          Hide
          steve_rowe Steve Rowe added a comment -

          Reopening to backport to 6.0.1.

          Show
          steve_rowe Steve Rowe added a comment - Reopening to backport to 6.0.1.
          Hide
          steve_rowe Steve Rowe added a comment -

          Bulk close issues included in the 6.0.1 release.

          Show
          steve_rowe Steve Rowe added a comment - Bulk close issues included in the 6.0.1 release.

            People

            • Assignee:
              varunthacker Varun Thacker
              Reporter:
              varunthacker Varun Thacker
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development