Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9554

Multiple ManagedIndexSchemaFactory upgrades running simultaneously can clash, causing cores not to load

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.3, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      If a collection is created using a configset with a ManagedSchemaFactory but no managed-schema file, then multiple cores may try and convert the schema file simultaneously.

      1. SOLR-9554-test.patch
        13 kB
        Mikhail Khludnev
      2. SOLR-9554-test.patch
        11 kB
        Mikhail Khludnev
      3. SOLR-9554-just-fix.patch
        1 kB
        Mikhail Khludnev
      4. SOLR-9554.patch
        10 kB
        Alan Woodward
      5. SOLR-9554.patch
        11 kB
        Mikhail Khludnev
      6. SOLR-9554.patch
        1.0 kB
        Andrey Kudryavtsev
      7. SOLR-9554.patch
        13 kB
        Mikhail Khludnev
      8. SOLR-9554.patch
        12 kB
        Mikhail Khludnev

        Activity

        Hide
        romseygeek Alan Woodward added a comment -

        See https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/862/consoleFull for an example. The '.system' collection isn't created properly, because one of the replicas crashes on startup due to this race.

        I guess the straightforward fix is to detect if 'schema.xml' isn't there any more, and retry reading 'managed-schema' instead.

        Show
        romseygeek Alan Woodward added a comment - See https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/862/consoleFull for an example. The '.system' collection isn't created properly, because one of the replicas crashes on startup due to this race. I guess the straightforward fix is to detect if 'schema.xml' isn't there any more, and retry reading 'managed-schema' instead.
        Hide
        romseygeek Alan Woodward added a comment -

        Another failure caused by this: https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/869/consoleFull

        In this case, the race is as follows:

        • core 1 finds that there's no managed-schema
        • core 2 finds that there's no managed-schema
        • core 1 loads schema.xml, upgrades to managed-schema, and moves schema.xml to schema.xml.bak
        • core 2 tries to load schema.xml and crashes with a NullPointerException

        I've tried writing a test case to hammer on this, but it refuses to fail - possibly it needs to be running on a slow enough machine? In any case, I think we need some kind of distributed lock when doing the fallback-and-upgrade manoeuvre.

        Show
        romseygeek Alan Woodward added a comment - Another failure caused by this: https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/869/consoleFull In this case, the race is as follows: core 1 finds that there's no managed-schema core 2 finds that there's no managed-schema core 1 loads schema.xml, upgrades to managed-schema, and moves schema.xml to schema.xml.bak core 2 tries to load schema.xml and crashes with a NullPointerException I've tried writing a test case to hammer on this, but it refuses to fail - possibly it needs to be running on a slow enough machine? In any case, I think we need some kind of distributed lock when doing the fallback-and-upgrade manoeuvre.
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        Can you attach this test?

        Show
        mkhludnev Mikhail Khludnev added a comment - Can you attach this test?
        Hide
        romseygeek Alan Woodward added a comment -

        Sure, here it is.

        Show
        romseygeek Alan Woodward added a comment - Sure, here it is.
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        when I tried to catch this race with debugger, I found that mock responds false by default, that bypasses condition in ManagedIndexSchemaFactory.zkUgradeToManagedSchema()

          if (zkController.pathExists(nonManagedSchemaPath)) {
        

        here I modified mock. but it still can't catch this race..

        Show
        mkhludnev Mikhail Khludnev added a comment - when I tried to catch this race with debugger, I found that mock responds false by default, that bypasses condition in ManagedIndexSchemaFactory.zkUgradeToManagedSchema() if (zkController.pathExists(nonManagedSchemaPath)) { here I modified mock. but it still can't catch this race..
        Hide
        werder Andrey Kudryavtsev added a comment -

        ManagedIndexSchemaFactory#create contains a pretty cool code in terms of multithreading execution.

        But it seems like even if there would be a race between several threads execute this code, it won't lead to that NPE that was logged in Jenkins job. I mean - loadedResource won't be null in 169 line due to race. It is just my understanding and it easily could be wrong.

        But still - NPE is here in Jenkins.

        One of possible options - there was an IOException in 146 line. That exception was lost, but as a result we could get loadedResource=null in 169 line.

        So why don't we start diving into this issue with this small change - log possible exception in catch block. I think it should be logged in any case.

        Show
        werder Andrey Kudryavtsev added a comment - ManagedIndexSchemaFactory#create contains a pretty cool code in terms of multithreading execution. But it seems like even if there would be a race between several threads execute this code, it won't lead to that NPE that was logged in Jenkins job. I mean - loadedResource won't be null in 169 line due to race. It is just my understanding and it easily could be wrong. But still - NPE is here in Jenkins. One of possible options - there was an IOException in 146 line . That exception was lost, but as a result we could get loadedResource=null in 169 line. So why don't we start diving into this issue with this small change - log possible exception in catch block. I think it should be logged in any case.
        Hide
        mkhludnev Mikhail Khludnev added a comment - - edited

        I make test really really mad. It synchronizes race to reproduce it. Note: to reproduce the NPE failure you need to run it a few times -Dtests.iters=10, just because now shuffle(winner,looser) adds one more race.
        Now we can think how to fix it, and it's worth to do.
        But might be more important question how to avoid such races in zoo once and for all or detect them automatically. Can we prohibit writes to Zookeeper at all?

        Show
        mkhludnev Mikhail Khludnev added a comment - - edited I make test really really mad. It synchronizes race to reproduce it. Note: to reproduce the NPE failure you need to run it a few times -Dtests.iters=10, just because now shuffle(winner,looser) adds one more race. Now we can think how to fix it, and it's worth to do. But might be more important question how to avoid such races in zoo once and for all or detect them automatically. Can we prohibit writes to Zookeeper at all?
        Hide
        mkhludnev Mikhail Khludnev added a comment - - edited

        Attaching a simplified test without that latch. Now it's just random sleep. I think it's crucial to provide a zkcliet per thread, Zookeper has stale reads without sync. Patch also contains a FIX but it's commented, at least we need to refactor ManagedIndexSchemaFactory.loadedResource - this is an unfortunate example of non-final-lity. It's interesting that ManagedIndexSchemaFactory.create(String, SolrConfig) already defend from the discussed race, it just missed one assignment.

        Show
        mkhludnev Mikhail Khludnev added a comment - - edited Attaching a simplified test without that latch. Now it's just random sleep. I think it's crucial to provide a zkcliet per thread, Zookeper has stale reads without sync. Patch also contains a FIX but it's commented, at least we need to refactor ManagedIndexSchemaFactory.loadedResource - this is an unfortunate example of non-final-lity. It's interesting that ManagedIndexSchemaFactory.create(String, SolrConfig) already defend from the discussed race, it just missed one assignment.
        Hide
        romseygeek Alan Woodward added a comment -

        Aha, yes, I see the fix. And I agree that this needs to be refactored, but maybe just commit the fix here for now and open another JIRA to clean the whole thing up?

        Show
        romseygeek Alan Woodward added a comment - Aha, yes, I see the fix. And I agree that this needs to be refactored, but maybe just commit the fix here for now and open another JIRA to clean the whole thing up?
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        oh. Please go ahead. I'll tweak the test later.

        Show
        mkhludnev Mikhail Khludnev added a comment - oh. Please go ahead. I'll tweak the test later.
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        I'm going to commit SOLR-9554-just-fix.patch. It has no test, just assignment fix is there. It should fix existing tests.

        Show
        mkhludnev Mikhail Khludnev added a comment - I'm going to commit SOLR-9554-just-fix.patch . It has no test, just assignment fix is there. It should fix existing tests.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 64b0c91df16b09d430957092f71b4991c2a66db2 in lucene-solr's branch refs/heads/master from Mikhail Khludnev
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=64b0c91 ]

        SOLR-9554: fix NullPointerException when cores move schema.xml to
        managed-schema concurrently. No new test added yet.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 64b0c91df16b09d430957092f71b4991c2a66db2 in lucene-solr's branch refs/heads/master from Mikhail Khludnev [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=64b0c91 ] SOLR-9554 : fix NullPointerException when cores move schema.xml to managed-schema concurrently. No new test added yet.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1c614e1d4df4915dd19eb96f478a0379deecef69 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1c614e1 ]

        SOLR-9554: fix NullPointerException when cores move schema.xml to
        managed-schema concurrently. No new test is added yet.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1c614e1d4df4915dd19eb96f478a0379deecef69 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1c614e1 ] SOLR-9554 : fix NullPointerException when cores move schema.xml to managed-schema concurrently. No new test is added yet.
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        SOLR-9554-test.patch makes test simple as possible. There is no refactoring for ManagedIndexSchemaFactory yet. TBC.

        The question is: SuspendingZkClient reminds the best testing practices. Can it be generalized and used somewhere else?

        Show
        mkhludnev Mikhail Khludnev added a comment - SOLR-9554-test.patch makes test simple as possible. There is no refactoring for ManagedIndexSchemaFactory yet. TBC. The question is: SuspendingZkClient reminds the best testing practices. Can it be generalized and used somewhere else?
        Hide
        mkhludnev Mikhail Khludnev added a comment -

        Turns out refactoring ManagedIndexSchemaFactory is not so easy. I.ll just commit SOLR-9554-test.patch. Objections?

        Show
        mkhludnev Mikhail Khludnev added a comment - Turns out refactoring ManagedIndexSchemaFactory is not so easy. I.ll just commit SOLR-9554-test.patch . Objections?
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 67a9d3a44f5709f31172b58c0747981afd20f468 in lucene-solr's branch refs/heads/master from Mikhail Khludnev
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=67a9d3a ]

        SOLR-9554: adding a test for concurrent schema upgrade in cloud.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 67a9d3a44f5709f31172b58c0747981afd20f468 in lucene-solr's branch refs/heads/master from Mikhail Khludnev [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=67a9d3a ] SOLR-9554 : adding a test for concurrent schema upgrade in cloud.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit cc1841cbb73f1c5c54208549cb20cf8244fa5104 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc1841c ]

        SOLR-9554: adding a test for concurrent schema upgrade in cloud.

        Show
        jira-bot ASF subversion and git services added a comment - Commit cc1841cbb73f1c5c54208549cb20cf8244fa5104 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc1841c ] SOLR-9554 : adding a test for concurrent schema upgrade in cloud.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit cc4c780227e999339e083cababff96912c4fbb53 in lucene-solr's branch refs/heads/master from Mikhail Khludnev
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc4c780 ]

        SOLR-9554: clear statics to fix the test failure

        Show
        jira-bot ASF subversion and git services added a comment - Commit cc4c780227e999339e083cababff96912c4fbb53 in lucene-solr's branch refs/heads/master from Mikhail Khludnev [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc4c780 ] SOLR-9554 : clear statics to fix the test failure
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 81a859b962214732f142653c90164f08625bf213 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=81a859b ]

        SOLR-9554: clear statics to fix the test failure

        Show
        jira-bot ASF subversion and git services added a comment - Commit 81a859b962214732f142653c90164f08625bf213 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=81a859b ] SOLR-9554 : clear statics to fix the test failure
        Hide
        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.3.0 release.

        Show
        shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

          People

          • Assignee:
            Unassigned
            Reporter:
            romseygeek Alan Woodward
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development