Solr
  1. Solr
  2. SOLR-4858

UpdateLog not informed of core reload (leads to incorrect cores/update handlers being used)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.2.1
    • Fix Version/s: 4.3.1, 4.4, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      I havene't been able to make sense of this yet, but trying to track down another bug lead me to discover that the following combination leads to problems...

      • updateLog enabled
      • do a core reload
      • do a delete by query *:*

      ...leave out any one of the three, and everything works fine.

      1. find_leak.sh
        0.4 kB
        Alexey Serba
      2. SOLR-4858_fix.patch
        6 kB
        Yonik Seeley
      3. SOLR-4858_fix.patch
        5 kB
        Yonik Seeley
      4. SOLR-4858_fix.patch
        5 kB
        Yonik Seeley
      5. SOLR-4858_fix.patch
        2 kB
        Yonik Seeley
      6. SOLR-4858.patch
        25 kB
        Hoss Man
      7. SOLR-4858.patch
        2 kB
        Hoss Man
      8. SOLR-4858.patch
        5 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          Patch with a test case demonstrating the problem – this randomizes the use of hte updateLog, and 50% of the time it fails.

          (haven't even begun to try and make sense of why this fails)

          NOTE: test started as a patch to TestCoreContainer, hence the slightly odd layer to be testing at and the hoops needed to do the delete by query ... now that i better understand the magic 3 criteria for causing the problem, a simpler test is probably possible.

          Show
          Hoss Man added a comment - Patch with a test case demonstrating the problem – this randomizes the use of hte updateLog, and 50% of the time it fails. (haven't even begun to try and make sense of why this fails) NOTE: test started as a patch to TestCoreContainer, hence the slightly odd layer to be testing at and the hoops needed to do the delete by query ... now that i better understand the magic 3 criteria for causing the problem, a simpler test is probably possible.
          Hide
          Anshum Gupta added a comment -

          Flipping the core reload and delete i.e. delete followed by a core reload also makes it pass.

          Show
          Anshum Gupta added a comment - Flipping the core reload and delete i.e. delete followed by a core reload also makes it pass.
          Hide
          Hoss Man added a comment -

          simplified test case.

          i removed the randomness and replaced it with two distinct methods testing the simple sequence of events with and without updated log enabled...

          # this will pass
          
          ant test -Dtestcase=TestReloadAndDeleteDocs -Dtests.method=testReloadAndDeleteDocsNoUpdateLog
          
          # this will cause a searcher leak because the directory failed to close
          
          ant test -Dtestcase=TestReloadAndDeleteDocs -Dtests.method=testReloadAndDeleteDocsWithUpdateLog
          
          Show
          Hoss Man added a comment - simplified test case. i removed the randomness and replaced it with two distinct methods testing the simple sequence of events with and without updated log enabled... # this will pass ant test -Dtestcase=TestReloadAndDeleteDocs -Dtests.method=testReloadAndDeleteDocsNoUpdateLog # this will cause a searcher leak because the directory failed to close ant test -Dtestcase=TestReloadAndDeleteDocs -Dtests.method=testReloadAndDeleteDocsWithUpdateLog
          Hide
          Yonik Seeley added a comment -

          It appears any deleteByQuery will cause this (the MatchAllDocuments deleteByQuery actually has special handling in Solr, so it's an important distinction).

          Show
          Yonik Seeley added a comment - It appears any deleteByQuery will cause this (the MatchAllDocuments deleteByQuery actually has special handling in Solr, so it's an important distinction).
          Hide
          Alexey Serba added a comment -
          # this will cause a searcher leak because the directory failed to close
          ant test -Dtestcase=TestReloadAndDeleteDocs -Dtests.method=testReloadAndDeleteDocsWithUpdateLog
          

          It seems that this test started failing after the following commit:

          0226c616297c84196753f0989b45471b59c7c09a is the first bad commit
          commit 0226c616297c84196753f0989b45471b59c7c09a
          Author: Mark Robert Miller <markrmiller@apache.org>
          Date:   Mon Mar 18 04:51:18 2013 +0000
          
              SOLR-4604: SolrCore is not using the UpdateHandler that is passed to it in SolrCore#reload.
              
              git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x@1457641 13f79535-47bb-0310-9956-ffa450edef68
          

          https://github.com/apache/lucene-solr/commit/0226c616297c84196753f0989b45471b59c7c09a

          Show
          Alexey Serba added a comment - # this will cause a searcher leak because the directory failed to close ant test -Dtestcase=TestReloadAndDeleteDocs -Dtests.method=testReloadAndDeleteDocsWithUpdateLog It seems that this test started failing after the following commit: 0226c616297c84196753f0989b45471b59c7c09a is the first bad commit commit 0226c616297c84196753f0989b45471b59c7c09a Author: Mark Robert Miller <markrmiller@apache.org> Date: Mon Mar 18 04:51:18 2013 +0000 SOLR-4604: SolrCore is not using the UpdateHandler that is passed to it in SolrCore#reload. git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x@1457641 13f79535-47bb-0310-9956-ffa450edef68 https://github.com/apache/lucene-solr/commit/0226c616297c84196753f0989b45471b59c7c09a
          Hide
          Hoss Man added a comment -

          A much larger patch...

          I initially found this bug because of a weird failure in a test i have on dependent project, and it took me longer then i would have liked to reproduce in a solr test because i didn't realize it was caused by using the updateLog, and i didn't realize how few solr tests take advantage of the updateLog.

          so with that in mind, it seemed to me like we should probably increase the test coverage of hte updatLog to see if there are any more situations that tickle bugs besides this odd edge case of reload+deleteByQuery.

          so in this updated patch...

          • same TestReloadAndDeleteDocs as before
          • the test solrconfig.xml now defaults to using the updateLog
          • SolrTestCaseJ4 uses randomization to occasionally disable the update log with a sys property
          • there is currently a nocommit in SolrTestCaseJ4 forcing the sys prop to always be true
          • any tests using solrconfig.xml that have an explicit need to use/not-use updateLog override the sysprop explicitly
          • a few schema files that did not have version fields are updated to include them

          ...this still only scratches the surface of increasing the test coverage for the UpdateLog, but it already exposes a reproducible failure in AutoCommitTest with the same symptoms as my TestReloadAndDeleteDocs...

          • ERROR Timeout waiting for all directory ref counts...
          • searcher leak.

          (i have not yet narrowed down which method in AutoCommitTest the dir factory ref count is lost in)

          Show
          Hoss Man added a comment - A much larger patch... I initially found this bug because of a weird failure in a test i have on dependent project, and it took me longer then i would have liked to reproduce in a solr test because i didn't realize it was caused by using the updateLog, and i didn't realize how few solr tests take advantage of the updateLog. so with that in mind, it seemed to me like we should probably increase the test coverage of hte updatLog to see if there are any more situations that tickle bugs besides this odd edge case of reload+deleteByQuery. so in this updated patch... same TestReloadAndDeleteDocs as before the test solrconfig.xml now defaults to using the updateLog SolrTestCaseJ4 uses randomization to occasionally disable the update log with a sys property there is currently a nocommit in SolrTestCaseJ4 forcing the sys prop to always be true any tests using solrconfig.xml that have an explicit need to use/not-use updateLog override the sysprop explicitly a few schema files that did not have version fields are updated to include them ...this still only scratches the surface of increasing the test coverage for the UpdateLog, but it already exposes a reproducible failure in AutoCommitTest with the same symptoms as my TestReloadAndDeleteDocs... ERROR Timeout waiting for all directory ref counts... searcher leak. (i have not yet narrowed down which method in AutoCommitTest the dir factory ref count is lost in)
          Hide
          Hoss Man added a comment -

          It seems that this test started failing after the following commit:

          Hmmm... git bisect?

          What confuses me is that r1457641 seems to have been undone by r1457647 ? .. but i guess maybe r1457641 broke it, and then subsequent commits kept it broken even when r1457647 reverted that specific change? (totally possible that the other changes in r1457647 are problematic here since SOLR-4604 in general is about updateLog and core reload.)

          Show
          Hoss Man added a comment - It seems that this test started failing after the following commit: Hmmm... git bisect? What confuses me is that r1457641 seems to have been undone by r1457647 ? .. but i guess maybe r1457641 broke it, and then subsequent commits kept it broken even when r1457647 reverted that specific change? (totally possible that the other changes in r1457647 are problematic here since SOLR-4604 in general is about updateLog and core reload.)
          Hide
          Anshum Gupta added a comment -

          I'm trying to get an intersection of the special handling for the MatchAllDocuments deleteByQuery and changes from SOLR-4604 to get a better grip on what's happening.
          At the same time, it might have just been broken from even before r1457641.

          Show
          Anshum Gupta added a comment - I'm trying to get an intersection of the special handling for the MatchAllDocuments deleteByQuery and changes from SOLR-4604 to get a better grip on what's happening. At the same time, it might have just been broken from even before r1457641.
          Show
          Anshum Gupta added a comment - Potential cause? https://issues.apache.org/jira/browse/SOLR-4617 SolrCore#reload needs to pass the deletion policy to the next SolrCore through it's constructor rather than setting a field after. . http://svn.apache.org/viewvc?view=revision&revision=1458880 Adding it here so that it gets easier to track.
          Hide
          Yonik Seeley added a comment -

          I'm trying to get an intersection of the special handling for the MatchAllDocuments

          A normal deleteByQuery also triggers the same bug though.

          deletionPolicy sharing could have something to do with this, given that it doesn't seem like Solr's original deletion policy was meant to be shared by multiple writers?

          Show
          Yonik Seeley added a comment - I'm trying to get an intersection of the special handling for the MatchAllDocuments A normal deleteByQuery also triggers the same bug though. deletionPolicy sharing could have something to do with this, given that it doesn't seem like Solr's original deletion policy was meant to be shared by multiple writers?
          Hide
          Mark Miller added a comment - - edited

          deletionPolicy sharing could have something to do with this, given that it doesn't seem like Solr's original deletion policy was meant to be shared by multiple writers?

          It actually wans't intended to be shared across new writers - in the above mentioned issues, it's still the same writer.

          We did recently add a call to create a new index writer on reload though – and on a closer look, that ends up passing the old deletion policy to the new writer. Perhaps we should look at that - it looks like it uses a couple data structures that perhaps should be created fresh and not passed on (DefaultSolrCoreState#createMainIndexWriter).

          Show
          Mark Miller added a comment - - edited deletionPolicy sharing could have something to do with this, given that it doesn't seem like Solr's original deletion policy was meant to be shared by multiple writers? It actually wans't intended to be shared across new writers - in the above mentioned issues, it's still the same writer. We did recently add a call to create a new index writer on reload though – and on a closer look, that ends up passing the old deletion policy to the new writer. Perhaps we should look at that - it looks like it uses a couple data structures that perhaps should be created fresh and not passed on (DefaultSolrCoreState#createMainIndexWriter).
          Hide
          Alexey Serba added a comment -

          Hmmm... git bisect?

          Yes

          sh> git status
          # On branch branch_4x
          nothing to commit (working directory clean)
          
          sh> git clean -d -f -x; git checkout -- .; git bisect reset; git bisect start; git bisect good 539d7dba81a59f8644ef268ad238b5a9fc547645; git bisect bad d19f9bf904197d4e9594340e21a961fd617036cb;  git bisect run sh -c "bash ../find_leak.sh 1>> ../bisect.log"
          
          Show
          Alexey Serba added a comment - Hmmm... git bisect? Yes sh> git status # On branch branch_4x nothing to commit (working directory clean) sh> git clean -d -f -x; git checkout -- .; git bisect reset; git bisect start; git bisect good 539d7dba81a59f8644ef268ad238b5a9fc547645; git bisect bad d19f9bf904197d4e9594340e21a961fd617036cb; git bisect run sh -c "bash ../find_leak.sh 1>> ../bisect.log"
          Hide
          Yonik Seeley added a comment -

          OK, I think I found the issue: I don't believe ulog.init(uhandler,core) is being called on a reload now.
          Not sure when this was lost (or where it was previously called from)... need to check out an older version (maybe 4.0) to figure that out.

          Show
          Yonik Seeley added a comment - OK, I think I found the issue: I don't believe ulog.init(uhandler,core) is being called on a reload now. Not sure when this was lost (or where it was previously called from)... need to check out an older version (maybe 4.0) to figure that out.
          Hide
          Yonik Seeley added a comment -

          OK, it was 4604...

          There used to be this code in the DirectUpdateHandler2 constructor:

              this.ulog = updateHandler.getUpdateLog();
              if (this.ulog != null) {
                this.ulog.init(this, core);
              }
          

          It was removed and not replaced with anything.

          Show
          Yonik Seeley added a comment - OK, it was 4604... There used to be this code in the DirectUpdateHandler2 constructor: this .ulog = updateHandler.getUpdateLog(); if ( this .ulog != null ) { this .ulog.init( this , core); } It was removed and not replaced with anything.
          Hide
          Anshum Gupta added a comment -

          Thanks for that Yonik. I did try adding that back and the test still failed for me.
          After your comment, I just did an ant clean and re-ran, works just right. Need to run the entire suite yet though.

          Also, if you think that Deletion policy shouldn't be shared with new writers, can we have another JIRA for that (if you intend to close this one)?

          Show
          Anshum Gupta added a comment - Thanks for that Yonik. I did try adding that back and the test still failed for me. After your comment, I just did an ant clean and re-ran, works just right. Need to run the entire suite yet though. Also, if you think that Deletion policy shouldn't be shared with new writers, can we have another JIRA for that (if you intend to close this one)?
          Hide
          Mark Miller added a comment -

          It was removed and not replaced with anything.

          Right - that was the entire change of 4604 - at the time, that second call to init was causing file handle leaks if I remember right. And removing it had no affect on tests.

          Show
          Mark Miller added a comment - It was removed and not replaced with anything. Right - that was the entire change of 4604 - at the time, that second call to init was causing file handle leaks if I remember right. And removing it had no affect on tests.
          Hide
          Mark Miller added a comment -

          If we add it back, we should be sure it's no longer causing a double init call.

          Show
          Mark Miller added a comment - If we add it back, we should be sure it's no longer causing a double init call.
          Hide
          Yonik Seeley added a comment -

          Here's my fix (untested).
          I'll work on a test that hits this more directly.

          Show
          Yonik Seeley added a comment - Here's my fix (untested). I'll work on a test that hits this more directly.
          Hide
          Yonik Seeley added a comment -

          Here's an updated fix that goes back to the way it worked before (init called at the end of the DUH2 constructor). Slightly messier, but slightly safer.

          Show
          Yonik Seeley added a comment - Here's an updated fix that goes back to the way it worked before (init called at the end of the DUH2 constructor). Slightly messier, but slightly safer.
          Hide
          Yonik Seeley added a comment -

          It's hard for me to get a clean "ant test" these days, but I don't think any of the failures I'm seeing are related to my latest patch here.

          Show
          Yonik Seeley added a comment - It's hard for me to get a clean "ant test" these days, but I don't think any of the failures I'm seeing are related to my latest patch here.
          Hide
          Mark Miller added a comment -

          It's hard for me to get a clean "ant test" these days

          What are the fails?

          Show
          Mark Miller added a comment - It's hard for me to get a clean "ant test" these days What are the fails?
          Hide
          Yonik Seeley added a comment -

          Whew, I just got a clean build.

          What are the fails?

          Most prevalent seem to be shard splitting (either ShardSplitTest or ChaosMonkeyShardSplitTest)

          Show
          Yonik Seeley added a comment - Whew, I just got a clean build. What are the fails? Most prevalent seem to be shard splitting (either ShardSplitTest or ChaosMonkeyShardSplitTest)
          Hide
          Mark Miller added a comment -

          Yeah, you can see those fails on jenkins emails as well - perhaps a timing issue because they seem to pass consistently on my fast machine.

          Show
          Mark Miller added a comment - Yeah, you can see those fails on jenkins emails as well - perhaps a timing issue because they seem to pass consistently on my fast machine.
          Hide
          Yonik Seeley added a comment -

          OK, hopefully the final patch that makes things fail faster by checking if the core is still open we open a new searcher.

          Show
          Yonik Seeley added a comment - OK, hopefully the final patch that makes things fail faster by checking if the core is still open we open a new searcher.
          Hide
          Hoss Man added a comment -

          FYI: with your second to last patch (https://issues.apache.org/jira/secure/attachment/12585629/SOLR-4858_fix.patch) all tests pass even with my SOLR-4858.patch applied

          Show
          Hoss Man added a comment - FYI: with your second to last patch ( https://issues.apache.org/jira/secure/attachment/12585629/SOLR-4858_fix.patch ) all tests pass even with my SOLR-4858 .patch applied
          Hide
          Yonik Seeley added a comment -

          Committed to trunk, 4x, 43x

          Show
          Yonik Seeley added a comment - Committed to trunk, 4x, 43x
          Hide
          Hoss Man added a comment -

          Looks like yonik forgot to svn add the new test file in my mongo patch, and didn't notice i had a nocommit in there ...

          there is currently a nocommit in SolrTestCaseJ4 forcing the sys prop to always be true

          Fixed...

          Committed revision 1488431.
          Committed revision 1488432.
          Committed revision 1488433.

          Show
          Hoss Man added a comment - Looks like yonik forgot to svn add the new test file in my mongo patch, and didn't notice i had a nocommit in there ... there is currently a nocommit in SolrTestCaseJ4 forcing the sys prop to always be true Fixed... Committed revision 1488431. Committed revision 1488432. Committed revision 1488433.
          Hide
          Yonik Seeley added a comment -

          Looks like yonik forgot to svn add the new test file in my mongo patch

          It was actually redundant with the update log test I added (in TestRecovery), but it doesn't hurt.

          Show
          Yonik Seeley added a comment - Looks like yonik forgot to svn add the new test file in my mongo patch It was actually redundant with the update log test I added (in TestRecovery), but it doesn't hurt.
          Hide
          Shalin Shekhar Mangar added a comment -

          Bulk close after 4.3.1 release

          Show
          Shalin Shekhar Mangar added a comment - Bulk close after 4.3.1 release

            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Hoss Man
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development