Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8875

ZkStateWriter.writePendingUpdates() can return null clusterState; NPE in Overseer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5.2, 5.6, 6.0.1, 6.1
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      While trying to get the test in Varun's patch in SOLR-5750 (SolrCloud collection backup & restore) to succeed, I kept hitting an NPE in Overseer in which clusterState was null. I added a bunch of asserts and found where it was happening which I worked around temporarily. See https://github.com/apache/lucene-solr/commit/fd9c4d59e8dbe0e9fbd99a40ed2ff746c1ae7a0c#diff-9ed614eee66b9e685d73446b775dc043R247 which is ZkStateWriter.writePendingUpdates() returning null, overwriting the current non-null clusterState. This happens nearly every time for me when I run that test.

      1. SOLR-8875.patch
        0.5 kB
        Scott Blum

        Issue Links

          Activity

          Hide
          dsmiley David Smiley added a comment -

          As an aside, I think this Overseer code would be clearer to me if the refreshClusterState boolean went away and we simply checked for clusterState == null (less variables and more clarity of purpose of what the if (refreshClusterState) is to resolve – it's to get a new cluster state because it has none. In IRC Shalin Shekhar Mangar agreed with me.

          Show
          dsmiley David Smiley added a comment - As an aside, I think this Overseer code would be clearer to me if the refreshClusterState boolean went away and we simply checked for clusterState == null (less variables and more clarity of purpose of what the if (refreshClusterState) is to resolve – it's to get a new cluster state because it has none. In IRC Shalin Shekhar Mangar agreed with me.
          Hide
          dragonsinth Scott Blum added a comment -

          Hi David,

          Can you give me a little more detail on how to repro?

          I checked out fd9c4d59e8dbe0e9fbd99a40ed2ff746c1ae7a0c then ran TestCloudBackupRestore with -ea, and it passed for me. Do I need to run it a bunch of times to see the problem?

          Show
          dragonsinth Scott Blum added a comment - Hi David, Can you give me a little more detail on how to repro? I checked out fd9c4d59e8dbe0e9fbd99a40ed2ff746c1ae7a0c then ran TestCloudBackupRestore with -ea, and it passed for me. Do I need to run it a bunch of times to see the problem?
          Hide
          dsmiley David Smiley added a comment -

          Thanks for investigating Scott. You need to comment out the modifications I did in Overseer that avoid overwriting the cluster state with null. The rest of Overseer changes outside this method should remain.

          Show
          dsmiley David Smiley added a comment - Thanks for investigating Scott. You need to comment out the modifications I did in Overseer that avoid overwriting the cluster state with null. The rest of Overseer changes outside this method should remain.
          Hide
          dragonsinth Scott Blum added a comment -

          Can you send me a SHA that captures the state to repro?

          Show
          dragonsinth Scott Blum added a comment - Can you send me a SHA that captures the state to repro?
          Hide
          dsmiley David Smiley added a comment -

          I just pushed a branch by the name of this issue. It has one edit from SOLR-5750's branch, which is to cause an assertion failure instead of just logging an error. Simply run TestCloudBackupRestore.

          Show
          dsmiley David Smiley added a comment - I just pushed a branch by the name of this issue. It has one edit from SOLR-5750 's branch, which is to cause an assertion failure instead of just logging an error. Simply run TestCloudBackupRestore.
          Hide
          dragonsinth Scott Blum added a comment -

          Perfect! Thank you

          Show
          dragonsinth Scott Blum added a comment - Perfect! Thank you
          Hide
          dragonsinth Scott Blum added a comment -

          I think this may be a one-liner; in the calling loop, I think there are cases where you can get all the way to the bottom without having actually written any updates, which fails to ever initialize ZkStateWriter.clusterState().

          To be honest, I'm not a fan of both the Overseer loop and ZkStateWriter both trying to keep a source-of-truth clusterState variable, it's a recipe for getting out of sync.

          Show
          dragonsinth Scott Blum added a comment - I think this may be a one-liner; in the calling loop, I think there are cases where you can get all the way to the bottom without having actually written any updates, which fails to ever initialize ZkStateWriter.clusterState(). To be honest, I'm not a fan of both the Overseer loop and ZkStateWriter both trying to keep a source-of-truth clusterState variable, it's a recipe for getting out of sync.
          Hide
          dsmiley David Smiley added a comment -

          This code is new to me but what you say resonates with me. I also got the sense these two things were trying to keep track of the clusterState, and that seems redundant. When I get back to SOLR-5750 in the next few days I'll apply this and see if it goes away. If so I'll commit it. It'll be good to have Jenkins beating on it for awhile.

          Show
          dsmiley David Smiley added a comment - This code is new to me but what you say resonates with me. I also got the sense these two things were trying to keep track of the clusterState, and that seems redundant. When I get back to SOLR-5750 in the next few days I'll apply this and see if it goes away. If so I'll commit it. It'll be good to have Jenkins beating on it for awhile.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4cfd33f7434509d535fa8d5faaf3d20ced1935ce in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4cfd33f ]

          SOLR-8875: Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer.
          (cherry picked from commit 3bbf8aa)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4cfd33f7434509d535fa8d5faaf3d20ced1935ce in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4cfd33f ] SOLR-8875 : Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer. (cherry picked from commit 3bbf8aa)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3bbf8aaa8a9f57c43d64e9c361184c379d90b9c9 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3bbf8aa ]

          SOLR-8875: Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3bbf8aaa8a9f57c43d64e9c361184c379d90b9c9 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3bbf8aa ] SOLR-8875 : Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer.
          Hide
          dsmiley David Smiley added a comment -

          Thanks Scott!

          Show
          dsmiley David Smiley added a comment - Thanks Scott!
          Hide
          steve_rowe Steve Rowe added a comment -

          Reopening to backport to 6.0.1.

          Show
          steve_rowe Steve Rowe added a comment - Reopening to backport to 6.0.1.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit a2732fd6f0899097defced1b8ee86dbc0099ac8d in lucene-solr's branch refs/heads/branch_6_0 from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a2732fd ]

          SOLR-8875: Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer.
          (cherry picked from commit 3bbf8aa)

          Show
          jira-bot ASF subversion and git services added a comment - Commit a2732fd6f0899097defced1b8ee86dbc0099ac8d in lucene-solr's branch refs/heads/branch_6_0 from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a2732fd ] SOLR-8875 : Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer. (cherry picked from commit 3bbf8aa)
          Hide
          steve_rowe Steve Rowe added a comment -

          Bulk close issues included in the 6.0.1 release.

          Show
          steve_rowe Steve Rowe added a comment - Bulk close issues included in the 6.0.1 release.
          Hide
          steve_rowe Steve Rowe added a comment -

          Reopening to backport to 5.6 and 5.5.2.

          Show
          steve_rowe Steve Rowe added a comment - Reopening to backport to 5.6 and 5.5.2.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 83bd4fc00272b5171613e44adaf523587e752f1a in lucene-solr's branch refs/heads/branch_5_5 from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=83bd4fc ]

          SOLR-8875: Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer.
          (cherry picked from commit 3bbf8aa)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 83bd4fc00272b5171613e44adaf523587e752f1a in lucene-solr's branch refs/heads/branch_5_5 from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=83bd4fc ] SOLR-8875 : Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer. (cherry picked from commit 3bbf8aa)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5d5719877683934b7699369fa5468d56910840f4 in lucene-solr's branch refs/heads/branch_5x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5d57198 ]

          SOLR-8875: Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer.
          (cherry picked from commit 3bbf8aa)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5d5719877683934b7699369fa5468d56910840f4 in lucene-solr's branch refs/heads/branch_5x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5d57198 ] SOLR-8875 : Fix null clusterState from ZkStateWriter. Revert my sanity check asserts in Overseer. (cherry picked from commit 3bbf8aa)
          Hide
          steve_rowe Steve Rowe added a comment -

          Bulk close issues released with 5.5.2.

          Show
          steve_rowe Steve Rowe added a comment - Bulk close issues released with 5.5.2.

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              dsmiley David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development