Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9832

Schema modifications are not immediately visible on the coordinating node

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      As noted on SOLR-9751, PreAnalyzedFieldManagedSchemaCloudTest.testAdd2Fields() has been failing on Jenkins. When I beast this test on my Jenkins box, it fails about 1% of the time. E.g. from https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/2247/:

        [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=PreAnalyzedFieldManagedSchemaCloudTest -Dtests.method=testAdd2Fields -Dtests.seed=CD72F125201C0C76 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=is -Dtests.timezone=Antarctica/McMurdo -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
        [junit4] ERROR   0.09s J0 | PreAnalyzedFieldManagedSchemaCloudTest.testAdd2Fields <<<
        [junit4]    > Throwable #1: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[https://127.0.0.1:39011/solr/managed-preanalyzed, https://127.0.0.1:33343/solr/managed-preanalyzed]
        [junit4]    > 	at __randomizedtesting.SeedInfo.seed([CD72F125201C0C76:656743CEFC1A9F80]:0)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:414)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1292)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:1062)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:1004)
        [junit4]    > 	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
        [junit4]    > 	at org.apache.solr.schema.PreAnalyzedFieldManagedSchemaCloudTest.addField(PreAnalyzedFieldManagedSchemaCloudTest.java:61)
        [junit4]    > 	at org.apache.solr.schema.PreAnalyzedFieldManagedSchemaCloudTest.testAdd2Fields(PreAnalyzedFieldManagedSchemaCloudTest.java:52)
        [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
        [junit4]    > Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:39011/solr/managed-preanalyzed: No such path /schema/fields/field2
        [junit4]    > 	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:593)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:251)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:435)
        [junit4]    > 	at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:387)
      

        Issue Links

          Activity

          Hide
          steve_rowe Steve Rowe added a comment -

          Patch with a new test that fails about 4% of the time on my Jenkins box, and a fix. The new test doesn't involve PreAnalyzedField(Type).

          The issue appears to be that a schema modification request will persist the new schema to ZooKeeper, then wait up to a configurable amount of time (10 minutes by default) for all replicas to get the new schema, and then return success. A ZooKeeper watch on the coordinating core's config directory eventually triggers a core reload for the persisted schema changes, but sometimes a request sent to a different node will trigger a reload for an older schema. As a result, after returning success for a schema modification, the coordinating node will briefly host an older schema.

          The attached patch triggers an immediate core reload after schema changes are successfully persisted to ZooKeeper, and once that's finished, checks that other replicas have the new schema. The patch also: forces a schema staleness check & update when the ZK watch is created for the new core's schema in ManagedIndexSchemaFactory.inform(); and removes an unnecessary schema download just prior to reloading the core in the listener created in SolrCore.getConfListener().

          I beasted the new test for 500 iterations with the patch, and it did not fail. I also beasted PreAnalyzedFieldManagedSchemaCloudTest.testAdd2Fields() for 300 iterations with the patch, and it did not fail.

          Precommit and all Solr tests pass with the patch; I think it's ready.

          I'll commit to master and let it soak for a couple days.

          Show
          steve_rowe Steve Rowe added a comment - Patch with a new test that fails about 4% of the time on my Jenkins box, and a fix. The new test doesn't involve PreAnalyzedField(Type). The issue appears to be that a schema modification request will persist the new schema to ZooKeeper, then wait up to a configurable amount of time (10 minutes by default) for all replicas to get the new schema, and then return success. A ZooKeeper watch on the coordinating core's config directory eventually triggers a core reload for the persisted schema changes, but sometimes a request sent to a different node will trigger a reload for an older schema. As a result, after returning success for a schema modification, the coordinating node will briefly host an older schema. The attached patch triggers an immediate core reload after schema changes are successfully persisted to ZooKeeper, and once that's finished, checks that other replicas have the new schema. The patch also: forces a schema staleness check & update when the ZK watch is created for the new core's schema in ManagedIndexSchemaFactory.inform() ; and removes an unnecessary schema download just prior to reloading the core in the listener created in SolrCore.getConfListener() . I beasted the new test for 500 iterations with the patch, and it did not fail. I also beasted PreAnalyzedFieldManagedSchemaCloudTest.testAdd2Fields() for 300 iterations with the patch, and it did not fail. Precommit and all Solr tests pass with the patch; I think it's ready. I'll commit to master and let it soak for a couple days.
          Hide
          dsmiley David Smiley added a comment -

          Nice investigation; I bet this was a doozy to figure out.

          Show
          dsmiley David Smiley added a comment - Nice investigation; I bet this was a doozy to figure out.
          Hide
          steve_rowe Steve Rowe added a comment -

          Staring at logs, coordinating ports/timings/zkversions, imagining transient failure modes, and a frosty beverage: heaven!!!

          Show
          steve_rowe Steve Rowe added a comment - Staring at logs, coordinating ports/timings/zkversions, imagining transient failure modes, and a frosty beverage: heaven!!!
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit bf3a3137be8a70ceed884e87c3ada276e82b187b in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bf3a313 ]

          SOLR-9832: Schema modifications are not immediately visible on the coordinating node

          Show
          jira-bot ASF subversion and git services added a comment - Commit bf3a3137be8a70ceed884e87c3ada276e82b187b in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bf3a313 ] SOLR-9832 : Schema modifications are not immediately visible on the coordinating node
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit bf3a3137be8a70ceed884e87c3ada276e82b187b in lucene-solr's branch refs/heads/apiv2 from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bf3a313 ]

          SOLR-9832: Schema modifications are not immediately visible on the coordinating node

          Show
          jira-bot ASF subversion and git services added a comment - Commit bf3a3137be8a70ceed884e87c3ada276e82b187b in lucene-solr's branch refs/heads/apiv2 from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bf3a313 ] SOLR-9832 : Schema modifications are not immediately visible on the coordinating node
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 765ea6f5a8dea0377420d975053e8583ae74d98e in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=765ea6f ]

          SOLR-9832: Schema modifications are not immediately visible on the coordinating node

          Show
          jira-bot ASF subversion and git services added a comment - Commit 765ea6f5a8dea0377420d975053e8583ae74d98e in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=765ea6f ] SOLR-9832 : Schema modifications are not immediately visible on the coordinating node
          Hide
          steve_rowe Steve Rowe added a comment -

          After the push to master, no test failures there, but since there were a few failures on branch_6x in that time, the change seems to be positive. I cherry-picked to branch_6x.

          Show
          steve_rowe Steve Rowe added a comment - After the push to master, no test failures there, but since there were a few failures on branch_6x in that time, the change seems to be positive. I cherry-picked to branch_6x.

            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              steve_rowe Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development