SOLR-9029: Regular failures of ZkStateReaderTest.testStateFormatUpdateWithExplicitRefreshLazy

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1, master (7.0)
    • Component/s: None
    • Labels:
      None

      Description

      Jenkins started complaining semi-regularly about ZkStateReaderTest.testStateFormatUpdateWithExplicitRefreshLazy on March 7 (53 failures in 45 days at current count).

      March 7th is, not coincidentally, when commit 093a8ce57c06f1bf2f71ddde52dcc7b40cbd6197 for SOLR-8745 was made, modifying both the test and a bunch of ClusterState code.


      Sample failure...

      https://builds.apache.org/job/Lucene-Solr-Tests-master/1096

         [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ZkStateReaderTest -Dtests.method=testStateFormatUpdateWithExplicitRefreshLazy -Dtests.seed=78F99EDE682EC04B -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=tr-TR -Dtests.timezone=Europe/Tallinn -Dtests.asserts=true -Dtests.file.encoding=UTF-8
         [junit4] ERROR   0.45s J0 | ZkStateReaderTest.testStateFormatUpdateWithExplicitRefreshLazy <<<
         [junit4]    > Throwable #1: org.apache.solr.common.SolrException: Could not find collection : c1
         [junit4]    > 	at __randomizedtesting.SeedInfo.seed([78F99EDE682EC04B:13B63EA311211D71]:0)
         [junit4]    > 	at org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:170)
         [junit4]    > 	at org.apache.solr.cloud.overseer.ZkStateReaderTest.testStateFormatUpdate(ZkStateReaderTest.java:135)
         [junit4]    > 	at org.apache.solr.cloud.overseer.ZkStateReaderTest.testStateFormatUpdateWithExplicitRefreshLazy(ZkStateReaderTest.java:46)
         [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
      

      ...I've also seen this fail locally, but I've never been able to reproduce it with the same seed.

          Activity

          hossman Hoss Man added a comment -

          Shalin Shekhar Mangar & Scott Blum - anything jump out at you?

          dragonsinth Scott Blum added a comment -

          Scanned through the code, nothing jumps out at me. I'll dig deeper at some point.

          dragonsinth Scott Blum added a comment -

          Super puzzling. We've tested that the ZK node exists, and the fact that reader.forceUpdateCollection() is called on the same thread that subsequently checks that the collection exists practically eliminates data visibility problems.

          dragonsinth Scott Blum added a comment -

          Finally found it... there's a very rare edge case in forceUpdateCollection() that only occurs when a collection moves from being in the legacy collection state straight to being a lazy collection, without ever being observed missing. Basically, it requires you to not see any ZK events during the execution of the test method. I can repro this by putting early exits in LegacyClusterStateWatcher and CollectionsChildWatcher to prevent any watch events from taking effect.
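
To make the edge case concrete, here is a toy model of the scenario described above. It is only a sketch: every class, field, and method name is hypothetical, the branch structure of the "buggy" and "fixed" variants is an assumption rather than Solr's actual forceUpdateCollection() code, and plain sets stand in for ZooKeeper and for the reader's cache. The point it illustrates is that if the cache still lists a collection as legacy and no watch event has corrected that, a refresh of only the legacy state makes the collection disappear.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: every name here is hypothetical, not Solr's real API.
class ForceUpdateSketch {
    // Stand-in for ZooKeeper contents.
    final Set<String> zkLegacy = new HashSet<>();        // collections in the shared legacy state
    final Set<String> zkPerCollection = new HashSet<>(); // collections with their own state node

    // Stand-in for the reader's cached view, normally kept fresh by watch events.
    final Set<String> cachedLegacy = new HashSet<>();
    final Set<String> cachedLazy = new HashSet<>();

    private void refreshLegacy() {
        cachedLegacy.clear();
        cachedLegacy.addAll(zkLegacy);
    }

    private void probePerCollection(String name) {
        if (zkPerCollection.contains(name)) {
            cachedLazy.add(name);
        }
    }

    boolean knows(String name) {
        return cachedLegacy.contains(name) || cachedLazy.contains(name);
    }

    // Assumed buggy shape: a collection cached as legacy only gets a legacy refresh.
    void forceUpdateBuggy(String name) {
        if (cachedLegacy.contains(name)) {
            refreshLegacy(); // the collection vanishes here if it has migrated
        } else if (!cachedLazy.contains(name)) {
            refreshLegacy();
            if (!cachedLegacy.contains(name)) {
                probePerCollection(name);
            }
        }
    }

    // Assumed fixed shape: after a legacy refresh, fall back to probing per-collection state.
    void forceUpdateFixed(String name) {
        if (cachedLazy.contains(name)) {
            return; // already tracked lazily
        }
        refreshLegacy();
        if (!cachedLegacy.contains(name)) {
            probePerCollection(name);
        }
    }

    // Builds the scenario: c1 migrates out of legacy state and no watch event fires.
    static ForceUpdateSketch migratedWithoutEvents() {
        ForceUpdateSketch s = new ForceUpdateSketch();
        s.zkLegacy.add("c1");
        s.refreshLegacy();           // reader caches c1 as a legacy collection
        s.zkLegacy.remove("c1");     // c1 moves to per-collection state...
        s.zkPerCollection.add("c1"); // ...but the cache is never told
        return s;
    }

    public static void main(String[] args) {
        ForceUpdateSketch buggy = migratedWithoutEvents();
        buggy.forceUpdateBuggy("c1");
        System.out.println("buggy variant sees c1: " + buggy.knows("c1"));

        ForceUpdateSketch fixed = migratedWithoutEvents();
        fixed.forceUpdateFixed("c1");
        System.out.println("fixed variant sees c1: " + fixed.knows("c1"));
    }
}
```

In this model the buggy variant ends up not knowing about c1 at all, which corresponds to the "Could not find collection : c1" failure, while the variant that falls back to probing the per-collection state finds it.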

          dragonsinth Scott Blum added a comment -

          Testing a fix now: https://github.com/fullstorydev/lucene-solr/tree/SOLR-9029
          Hoss Man Shalin Shekhar Mangar if you'd like to look at the change.

          jira-bot ASF subversion and git services added a comment -

          Commit 89c65af2a6e5f1c8216c1202f65e8d670ef14385 in lucene-solr's branch refs/heads/master from Scott Blum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=89c65af ]

          SOLR-9029: fix rare ZkStateReader visibility race during collection state format update

          jira-bot ASF subversion and git services added a comment -

          Commit 89857653cafdafe5396abe946cc3d7f4fec1377d in lucene-solr's branch refs/heads/branch_6x from Scott Blum
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8985765 ]

          SOLR-9029: fix rare ZkStateReader visibility race during collection state format update

          markrmiller@gmail.com Mark Miller added a comment -

          Great, I was seeing this a lot locally.

          hossman Hoss Man added a comment -

          Manually correcting fixVersion per Step #S5 of LUCENE-7271

          steve_rowe Steve Rowe added a comment -

          Not backporting to 6.0.1, since the modifications are to ZkStateReader.forceUpdateCollection(), introduced by SOLR-8745, which won't be backported to branch_6_0.


            People

            • Assignee:
              dragonsinth Scott Blum
              Reporter:
              hossman Hoss Man