SOLR-8135: SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection reproducible failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 6.0, 6.1, 7.0
    • Component/s: None
    • Labels: None

      Description

      No idea what's going on here, noticed it while testing out an unrelated patch – seed reproduces against pristine trunk...

         [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=SolrCloudExampleTest -Dtests.method=testLoadDocsIntoGettingStartedCollection -Dtests.seed=59EA523FFF6CB60F -Dtests.slow=true -Dtests.locale=es_MX -Dtests.timezone=Africa/Porto-Novo -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
         [junit4] FAILURE 49.5s | SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection <<<
         [junit4]    > Throwable #1: java.lang.AssertionError: Delete action failed!
         [junit4]    > 	at __randomizedtesting.SeedInfo.seed([59EA523FFF6CB60F:4A896050CE030FA9]:0)
         [junit4]    > 	at org.apache.solr.cloud.SolrCloudExampleTest.doTestDeleteAction(SolrCloudExampleTest.java:169)
         [junit4]    > 	at org.apache.solr.cloud.SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection(SolrCloudExampleTest.java:145)
         [junit4]    > 	at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:963)
         [junit4]    > 	at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:938)
      
      Attachments

      1. SOLR-8135.failure.log
        916 kB
        Hoss Man
      2. SOLR-8135.patch
        5 kB
        Noble Paul
      3. SOLR-8135.patch
        2 kB
        Erick Erickson


          Activity

          hossman Hoss Man added a comment -

          attaching ant output testing against...

          hossman@tray:~/lucene/dev$ svnversion && svn info | grep URL
          1707150
          URL: https://svn.apache.org/repos/asf/lucene/dev/trunk
          Relative URL: ^/lucene/dev/trunk
          

          with command...

          ANT_ARGS="" ant test -Dtestcase=SolrCloudExampleTest -Dtests.method=testLoadDocsIntoGettingStartedCollection -Dtests.seed=59EA523FFF6CB60F -Dtests.slow=true -Dtests.locale=es_MX -Dtests.timezone=Africa/Porto-Novo -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 > SOLR-8135.failure.log
          
          jira-bot ASF subversion and git services added a comment -

          Commit 1709142 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1709142 ]

          SOLR-8135: Disable test until it can be fixed

          jira-bot ASF subversion and git services added a comment -

          Commit 1709143 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1709143 ]

          SOLR-8135: Disable test until it can be fixed

          thetaphi Uwe Schindler added a comment -

          It also failed on almost every build on Jenkins (2 of 3 builds failed). Thanks for disabling it. I was afraid to do this...

          gus_heck Gus Heck added a comment -

          also with

          ant test  -Dtestcase=SolrCloudExampleTest -Dtests.method=testLoadDocsIntoGettingStartedCollection -Dtests.seed=D7797CB78640592B -Dtests.slow=true -Dtests.locale=mt_MT -Dtests.timezone=Jamaica -Dtests.asserts=true -Dtests.file.encoding=US-ASCII 

          on 5.x

          thelabdude Timothy Potter added a comment -

          Noble Paul Can you take a look at this failure? From the logs, it looks like something is breaking in the config overlay / refresh layer when a collection is being deleted. There's nothing wrong with this test.

          erickerickson Erick Erickson added a comment -

          I did notice that if I added code that delayed for 15 seconds after the doTestConfigUpdate call in that test, it would succeed. Just hacking around for clues, I tried watching the ZK state to see whether we were missing a state change (i.e. whether any nodes weren't active) and didn't see any. Whether I looped for 15 seconds looking for non-active states or just slept for 15 seconds, the test would consistently pass.

          I may have a bit of time to look at this on the flight home tonight.....

          And if I commented out the doTestConfigUpdate call, it would succeed.

          None of which helps pinpoint what's going on, but helps me get a clue where to start looking...
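
          Purely as an illustration of the kind of "wait until everything is active" loop described above, here is a minimal sketch, assuming a CloudSolrClient named cloudClient from the test harness; it is not the actual test code:

            import org.apache.solr.client.solrj.impl.CloudSolrClient;
            import org.apache.solr.common.cloud.ClusterState;
            import org.apache.solr.common.cloud.DocCollection;
            import org.apache.solr.common.cloud.Replica;
            import org.apache.solr.common.cloud.Slice;

            /** Poll the ZK-derived cluster state until every replica of the collection
             *  is ACTIVE or the timeout expires. */
            static void waitForActiveReplicas(CloudSolrClient cloudClient, String collection,
                                              long timeoutMs) throws InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (System.currentTimeMillis() < deadline) {
                ClusterState clusterState = cloudClient.getZkStateReader().getClusterState();
                DocCollection coll = clusterState.getCollection(collection);
                boolean allActive = true;
                for (Slice slice : coll.getSlices()) {
                  for (Replica replica : slice.getReplicas()) {
                    if (replica.getState() != Replica.State.ACTIVE) {
                      allActive = false;
                    }
                  }
                }
                if (allActive) return;   // no non-active replicas observed
                Thread.sleep(250);       // keep polling until the deadline
              }
            }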

          erickerickson Erick Erickson added a comment - edited

          So far, this patch keeps the test from failing. I hacked it in based on two hints: buried in the failure output was a message about "could not reload core blah blah blah", and the test succeeded if I commented out the config-update bits.

          Maybe a race condition between multiple API calls modifying the configs and core reloading? This patch forces the collection to reload as part of updating the config...

          Noble Paul Mark Miller Yonik Seeley
          This feels like a band-aid though. I see in the ref guide that listeners are registered on the znode and reloads happen at some point, although I'm unclear on what exactly triggers them. It seems like any modification of the config file should be atomic, in the sense that as long as the updates are valid, any core reload should get a config that loads successfully. How do we guarantee atomic reads/writes of the config files, especially when the read path is different from the write path?

          I don't have the logs right now, so I don't have a good sense of what in core reload was really failing; I'll see if I can get some of that info.

          NOTE: I'm on a plane so I could only beast this lightly. Nevertheless, I never got 4 successful runs before this patch and got 16 with the patch so it certainly seems in the right neighborhood. I'll be able to give it a more thorough spin when I'm not on battery. Assuming we think this is the right fix.
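
          For reference, a hedged sketch of what "force the collection to reload" could look like from the client side, issuing a Collections API RELOAD after the config change; the SolrClient variable and the helper name are assumptions, and this is not the attached patch:

            import org.apache.solr.client.solrj.SolrClient;
            import org.apache.solr.client.solrj.SolrRequest;
            import org.apache.solr.client.solrj.request.GenericSolrRequest;
            import org.apache.solr.common.params.ModifiableSolrParams;

            /** Ask the Collections API to RELOAD the collection so every core re-reads its config. */
            static void reloadCollection(SolrClient client, String collection) throws Exception {
              ModifiableSolrParams params = new ModifiableSolrParams();
              params.set("action", "RELOAD");   // Collections API action
              params.set("name", collection);   // e.g. "gettingstarted"
              client.request(new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params));
            }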

          erickerickson Erick Erickson added a comment -

          Here's all the stack trace related to reloading the core that I could find in a failing run.

          [junit4] 2> 50808 INFO (SolrConfigHandler-refreshconf) [n:127.0.0.1:52555_cib c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica2] o.a.s.c.ConfigSetProperties Did not find ConfigSet properties, assuming default properties: Can't find resource 'configsetprops.json' in classpath or '/configs/gettingstarted', cwd=/Users/Erick/apache/solr/beast-tmp/3/J0
          [junit4] 2> 50808 INFO (SolrConfigHandler-refreshconf) [n:127.0.0.1:52555_cib c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica2] o.a.s.c.CoreContainer Reloading SolrCore 'gettingstarted_shard2_replica2' using configuration from collection gettingstarted
          [junit4] 2> 50808 ERROR (SolrConfigHandler-refreshconf) [n:127.0.0.1:52555_cib c:gettingstarted s:shard2 r:core_node2 x:gettingstarted_shard2_replica2] o.a.s.h.SolrConfigHandler Unable to refresh conf
          [junit4] 2> org.apache.solr.common.SolrException: Unable to reload core [gettingstarted_shard2_replica2]
          [junit4] 2> at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:837)
          [junit4] 2> at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2602)
          [junit4] 2> at org.apache.solr.handler.SolrConfigHandler$Command$1.run(SolrConfigHandler.java:212)
          [junit4] 2> Caused by: java.lang.IllegalStateException: IndexWriter has been closed
          [junit4] 2> at org.apache.solr.update.SolrCoreState.increfSolrCoreState(SolrCoreState.java:53)
          [junit4] 2> at org.apache.solr.core.SolrCore.reload(SolrCore.java:462)
          [junit4] 2> at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:832)

          markrmiller@gmail.com Mark Miller added a comment -

          Thanks Erick, looks like a reload is somehow triggering after the core is already closed?

          erickerickson Erick Erickson added a comment -

          An 80-iteration beast of the test ran fine last night with this patch. And yes, I did remove the BadApple annotation.

          I'm pretty convinced this is not the right fix though, it just helps identify the problem area.

          erickerickson Erick Erickson added a comment -

          Mark:

          That's my assumption, got a full day today so can't do anything on it until at least this evening though.

          thelabdude Timothy Potter added a comment -

          I can fix the test to wait longer after it issues a config change, but that would mask a race condition in which the SolrConfigHandler refreshconf thread attempts to refresh a core that is being shut down.

             [junit4]   2> 42606 INFO  (qtp1036148039-29) [n:127.0.0.1:58528_    ] o.a.s.c.SolrCore [gettingstarted_shard1_replica1]  CLOSING SolrCore org.apache.solr.core.SolrCore@53fda895
          

          and then later in the log, you have:

             [junit4]   2> 42629 INFO  (SolrConfigHandler-refreshconf) [n:127.0.0.1:58528_ c:gettingstarted s:shard1 r:core_node4 x:gettingstarted_shard1_replica1] o.a.s.c.ConfigSetProperties Did not find ConfigSet properties, assuming default properties: Can't find resource 'configsetprops.json' in classpath or '/configs/gettingstarted', cwd=/home/hossman/lucene/dev/solr/build/solr-core/test/J0
             [junit4]   2> 42629 INFO  (SolrConfigHandler-refreshconf) [n:127.0.0.1:58528_ c:gettingstarted s:shard1 r:core_node4 x:gettingstarted_shard1_replica1] o.a.s.c.CoreContainer Reloading SolrCore 'gettingstarted_shard1_replica1' using configuration from collection gettingstarted
             [junit4]   2> 42630 ERROR (SolrConfigHandler-refreshconf) [n:127.0.0.1:58528_ c:gettingstarted s:shard1 r:core_node4 x:gettingstarted_shard1_replica1] o.a.s.h.SolrConfigHandler Unable to refresh conf
             [junit4]   2> org.apache.solr.common.SolrException: Unable to reload core [gettingstarted_shard1_replica1]
             [junit4]   2>        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:837)
             [junit4]   2>        at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2602)
             [junit4]   2>        at org.apache.solr.handler.SolrConfigHandler$Command$1.run(SolrConfigHandler.java:212)
          

          I see some checks for whether the core is closed, so maybe there's nothing we can do to detect this race condition any better, but I would appreciate it if Noble Paul could take a closer look at the checks. If not, the only fix here is to wait longer for the config changes to take effect before issuing the delete.

          thelabdude Timothy Potter added a comment -

          The other interesting thing to note is that the SolrConfigHandler does try to wait up to 30 seconds for agreement across all cores, but that does not mean the config has been applied on all cores, apparently. For instance, in the log you see:

             [junit4]   2> 41864 INFO  (qtp1735783147-155) [n:127.0.0.1:35628_ c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica2] o.a.s.h.SolrConfigHandler Took 905.0ms to set the property overlay to be of version 0 for collection gettingstarted
          

          But there is still config loading activity going on in the cores after this point, so that wait is somewhat misleading. I still think the fix here is to have the test wait longer before issuing the delete, but we should also revisit whether the SolrConfigHandler can do better at detecting that a core is shutting down.
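
          As an illustration, a hedged sketch of the kind of client-side check a test could make instead of a blind sleep: ask one core's /config/overlay endpoint for its znodeVersion and compare it with the expected value. The helper name, the expectedVersion argument, and the exact deserialized shape of the response are assumptions.

            import java.util.Map;
            import org.apache.solr.client.solrj.SolrClient;
            import org.apache.solr.client.solrj.SolrRequest;
            import org.apache.solr.client.solrj.request.GenericSolrRequest;
            import org.apache.solr.common.params.ModifiableSolrParams;
            import org.apache.solr.common.util.NamedList;

            /** True once the core's config overlay reports at least the expected znodeVersion.
             *  Assumes coreClient's base URL points at a single core. */
            static boolean overlayAtLeast(SolrClient coreClient, int expectedVersion) throws Exception {
              NamedList<Object> rsp = coreClient.request(
                  new GenericSolrRequest(SolrRequest.METHOD.GET, "/config/overlay", new ModifiableSolrParams()));
              Object overlay = rsp.get("overlay");
              Object v = (overlay instanceof NamedList) ? ((NamedList<?>) overlay).get("znodeVersion")
                       : (overlay instanceof Map) ? ((Map<?, ?>) overlay).get("znodeVersion") : null;
              return v instanceof Number && ((Number) v).intValue() >= expectedVersion;
            }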

          thelabdude Timothy Potter added a comment -

          Chatting with Chris Hostetter, it seems like the SolrConfigHandler should be able to leverage the CloseHook support in SolrCore to detect that a core is being closed and stop its attempts at refreshing config.

          Moreover, as I thought about the previous comment some more, I actually think the true bug here is that the config handler returns before the config change has been applied (so it seems from the log, anyway) ... I'm not sure why anyone would want a config update request to return before the change has been applied across the cluster. As things stand, it's hard for a client app to reason about whether a change is fully applied without adding arbitrary waits, which would just be a band-aid in SolrCloudExampleTest.
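
          A hedged sketch of the CloseHook idea: register a hook on the SolrCore so a (hypothetical) refresh task can notice the core is closing and stop trying to reload it. The stopRefreshing flag and where it would be consulted are assumptions, not the actual SolrConfigHandler code.

            import java.util.concurrent.atomic.AtomicBoolean;
            import org.apache.solr.core.CloseHook;
            import org.apache.solr.core.SolrCore;

            /** Flip a flag when the core begins closing so a refresh loop can bail out early. */
            static AtomicBoolean registerRefreshGuard(SolrCore core) {
              final AtomicBoolean stopRefreshing = new AtomicBoolean(false);
              core.addCloseHook(new CloseHook() {
                @Override
                public void preClose(SolrCore c) {
                  stopRefreshing.set(true);   // core is going away; don't attempt further reloads
                }
                @Override
                public void postClose(SolrCore c) {
                  // nothing to do after close
                }
              });
              return stopRefreshing;
            }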

          noble.paul Noble Paul added a comment -

          Diagnosis: This happens because of a race condition. Multiple threads invoked an unload or reload in parallel, and the race caused one of them to fail. In our code, any failure in reload() marks the core as a permanent failure. But this is not really a failure; it is just that two threads simultaneously tried to perform the reload/unload.

          Solution: We need to handle this exception differently. So I created a new exception called CoreIsClosedException, which doesn't add the core to the permanently failed cores when the IndexWriter has been closed.

          Follow-up question for others: if a reload fails like this, should we retry?
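
          For concreteness, a minimal sketch of the shape such an exception could take, plus where (per the stack trace earlier in this issue) it might be thrown; the committed patch may differ in class hierarchy and detail.

            import org.apache.solr.common.SolrException;

            /** Signals that a core was closed out from under an operation such as reload;
             *  callers can treat it as transient rather than a permanent core failure. */
            public class CoreIsClosedException extends SolrException {
              public CoreIsClosedException(String msg) {
                super(SolrException.ErrorCode.INVALID_STATE, msg);
              }
            }

            // ...and, roughly, the throw site suggested by the stack trace above
            // (SolrCoreState.increfSolrCoreState), again only a sketch:
            //   if (solrCoreStateRefCnt == 0) {
            //     throw new CoreIsClosedException("IndexWriter has been closed");
            //   }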

          erickerickson Erick Erickson added a comment -

          I can reasonably reliably get this to fail overnight (14 out of 1,000) with the following error, so let me know if there's a patch I can try:

          > Throwable #1: java.lang.AssertionError: Delete action failed!
          [junit4] > at __randomizedtesting.SeedInfo.seed([ECD502B89BFAF991:FFB630D7AA954037]:0)
          [junit4] > at org.apache.solr.cloud.SolrCloudExampleTest.doTestDeleteAction(SolrCloudExampleTest.java:171)
          [junit4] > at org.apache.solr.cloud.SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection(SolrCloudExampleTest.java:147)
          [junit4] > at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:965)
          [junit4] > at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:940)
          [junit4] > at java.lang.Thread.run(Thread.java:745)

          erickerickson Erick Erickson added a comment -

          This feels a little like a band-aid. If I'm reading this right, the sequence is this:

          1> the test changes the configs, kicking off reloads for the cores
          2> the test deletes the collection

          <2> happens before <1> is complete, hence the race: the reload doesn't get around to reloading the core until after the delete has happened and the core is closed. Is this about right?

          Tim's comment about a delay suggests that if we waited for the config changes to finish reloading before continuing the test, we'd be OK. It seems like a better option would be to have Solr not start the delete while reloads were pending, or more generally not start carrying out one command while another potentially conflicting operation was being carried out.

          Not sure how hard having Solr be smart enough to delay "potentially conflicting" operations would be though.

          noble.paul Noble Paul added a comment -

          This feels a little like a band-aid.

          Erick, I'm not trying to make this test pass. This is a more serious issue. A race condition should not put a core in an unrecoverable error situation. This would cause SolrCloud to lose replicas randomly. I'm surprised users have not experienced it yet. So, this fix must be done anyway.

          Tim's comment about a delay suggests that if we waited for the config changes to complete reloading before continuing the test we'd be OK

          That is fine if you wish to just avoid hitting the race condition. But in this case we should be glad that this test has uncovered a race condition.

          erickerickson Erick Erickson added a comment -

          Noble:

          Well, perhaps I phrased things poorly, so let's start over and not even consider the test.

          bq: A race condition should not put a core in an unrecoverable error situation

          Totally agree. If I understand the patch though, the reload operation and the delete operation are competing. In this particular case the reload was caused by changing the configs, but that's largely immaterial. The delete got in there before the reload operation and closed the core. What happens with throwing this new exception is that the reload operation still fails, in the sense that the core is still unavailable, right? I don't quite see how throwing a new exception and catching it, but not adding it to the failures list, changes the fact that the core failed to reload; it's still unavailable. How does it ever recover?

          Or are you saying that in this case, since the core is deleted it really doesn't need to recover? That still doesn't seem to cover the general case of the core being closed during a reload operation. There's a comment somewhere in the code that perhaps the reload should be retried. It'll still fail in this case, but are there others where reloading will succeed and thus we should retry?

          All that said, it'll be at least tomorrow night before I can beast this patch....

          noble.paul Noble Paul added a comment -

          CoreContainer.coreInitFailures is a map intended to track cores that have failed to load, most likely due to unrecoverable errors (such as index corruption). It also blocks other operations (such as delete), because any operation involving that core consults this map. A race condition is not unrecoverable: had the race not occurred, the core would never have entered this unrecoverable list, and that is the expected behavior.
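
          To make that intended handling concrete, a self-contained sketch of the pattern (reusing the CoreIsClosedException sketched above): only genuine reload failures are recorded in the failure map, so a close/delete race doesn't leave the core marked as permanently broken. The names mirror this discussion, not necessarily the committed code.

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            class ReloadFailureTracking {
              // analogous in spirit to CoreContainer.coreInitFailures
              private final Map<String, Exception> coreInitFailures = new ConcurrentHashMap<>();

              void reload(String coreName) {
                try {
                  doReload(coreName);                    // may race with a concurrent close/unload
                  coreInitFailures.remove(coreName);     // a clean reload clears any earlier failure
                } catch (CoreIsClosedException e) {
                  throw e;                               // closed mid-reload: transient, do not record it
                } catch (Exception e) {
                  coreInitFailures.put(coreName, e);     // genuine failure: remember it
                  throw new RuntimeException("Unable to reload core [" + coreName + "]", e);
                }
              }

              private void doReload(String coreName) throws Exception {
                // placeholder for the real reload logic
              }
            }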

          erickerickson Erick Erickson added a comment -

          Noble Paul Should we commit this? I just beasted this for 1,000 runs last night. I saw some failures, but they were either "zombie threads" or "suite timeout exceeded", which I don't think are relevant to this code.

          noble.paul Noble Paul added a comment -

          It surely would have avoided the failures. I was asking for a second opinion from others to review the code and confirm there are no unintended consequences.

          I guess we should commit this and get it into 6.0 anyway.

          jira-bot ASF subversion and git services added a comment -

          Commit 8cc978b53b1299a27de492d7114cd2d4e353b6cb in lucene-solr's branch refs/heads/master from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8cc978b ]

          SOLR-8135: SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection reproducible failure

          jira-bot ASF subversion and git services added a comment -

          Commit 6689e1c55a7efffadb18d25179d9d87c121af5d0 in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6689e1c ]

          SOLR-8135: SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection reproducible failure

          jira-bot ASF subversion and git services added a comment -

          Commit 96b059c9dd67eef3a49da63d388fac7f4d3809f2 in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=96b059c ]

          SOLR-8135: SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection reproducible failure

          jira-bot ASF subversion and git services added a comment -

          Commit 4a795ee371c1bb19285cca0fb7336807f6e2840f in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4a795ee ]

          SOLR-8135: typo in comment

          jira-bot ASF subversion and git services added a comment -

          Commit 72f6261f55b6d851d071c8358a05b52e019416c9 in lucene-solr's branch refs/heads/branch_6_0 from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=72f6261 ]

          SOLR-8135: SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection reproducible failure

          jira-bot ASF subversion and git services added a comment -

          Commit d3f310a18d3c7787219515e82b76d18ef6a8e050 in lucene-solr's branch refs/heads/branch_6_0 from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d3f310a ]

          SOLR-8135: SolrCloudExampleTest.testLoadDocsIntoGettingStartedCollection reproducible failure

          jira-bot ASF subversion and git services added a comment -

          Commit a774273c641b7f95b09da9440ef43f24f5207584 in lucene-solr's branch refs/heads/branch_6_0 from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a774273 ]

          SOLR-8135: typo in comment

          hossman Hoss Man added a comment -

          Manually correcting fixVersion per Step #S6 of LUCENE-7271


            People

            • Assignee: noble.paul Noble Paul
            • Reporter: hossman Hoss Man
            • Votes: 0
            • Watchers: 7
