SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, Trunk
    • Component/s: None
    • Labels:
      None

      Description

      Seeing this a bunch on jenkins - previous leader ephemeral node is still around for some reason.

          Activity

          ASF subversion and git services added a comment -

          Commit 1572370 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1572370 ]

          SOLR-5596: Improve this test.

          ASF subversion and git services added a comment -

          Commit 1572371 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1572371 ]

          SOLR-5596: Improve this test.

          Mark Miller added a comment -

          That last attempt did not work - I just saw this again locally.

          Mark Miller added a comment -

          SOLR-5799 may solve this. My best guess is that the previous leader is just taking a little longer than we would expect to have its ephemeral leader registration node removed.

          Mark Miller added a comment -

          SOLR-5799 was just committed. We now wait a short time if an ephemeral leader registration node already exists: if we are simply catching it briefly before it goes away, we wait and, once it is gone, create our own ephemeral registration node.
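The wait-then-create behavior described in this comment can be sketched roughly as follows. This is an illustration only, not the SOLR-5799 patch: `LeaderRegistration`, its in-memory map standing in for ZooKeeper, the method names, the path, and the timeout values are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory stand-in for ZooKeeper; not Solr or ZooKeeper code. */
final class LeaderRegistration {
    // Stand-in for the znode tree: path -> id of the owner that created the node.
    private final ConcurrentMap<String, String> nodes = new ConcurrentHashMap<>();

    /** Simulates an ephemeral node disappearing when its owner's session ends. */
    void remove(String path) {
        nodes.remove(path);
    }

    String ownerOf(String path) {
        return nodes.get(path);
    }

    /**
     * Tries to create the registration node for {@code owner}. If a node is
     * already present, polls every {@code pollMs} until it goes away or
     * {@code timeoutMs} elapses. Returns true only if our node was created.
     */
    boolean registerWithWait(String path, String owner, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            // putIfAbsent is atomic, mirroring a create that fails with "node exists".
            if (nodes.putIfAbsent(path, owner) == null) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // the previous node outlived our patience
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        LeaderRegistration zk = new LeaderRegistration();
        String path = "/overseer_elect/leader"; // illustrative path
        zk.registerWithWait(path, "old-leader", 0, 1);
        // The old leader's node lingers briefly, then goes away.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            zk.remove(path);
        }).start();
        System.out.println("registered: " + zk.registerWithWait(path, "new-leader", 2000, 10));
    }
}
```

A bounded wait like this only helps when the old node really is about to disappear; if it never goes away, the caller still has to handle the failed registration.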

          Mark Miller added a comment -

          So we still hit this - pretty surprising. I've gone over the test a couple times and have not spotted the problem yet, but I think it must be an issue with the test.

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Shalin Shekhar Mangar added a comment -

          I was looking into the logs of this failure today:
          http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Linux/10616/

             [junit4]   2> 472241 T2893 oazsp.FileTxnLog.commit WARN fsync-ing the write ahead log in SyncThread:0 took 11588ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
          

          This warning can be caused by a slow machine, but it also appears on fast machines when many writes hit ZooKeeper in quick succession, which is what testShardLeaderChange does. Perhaps we should add a small wait between operations?

          Would it make sense to set forceSync to no for ZooKeeper in our tests? At the very least, it would reduce the spurious failures and let us concentrate on fixing real bugs.

          See http://mail-archives.apache.org/mod_mbox/zookeeper-user/201401.mbox/%3CCABtFeVwoXh1d8D+tO0wyLMBap_CRbY6L9i9wh2Le7s1ZkPN+uA@mail.gmail.com%3E
          and http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/zookeeper_psuedo_scalability_and_absolute

          Mark Miller added a comment -

          Would it make sense to set forceSync to no for ZooKeeper in our tests?

          I think I tried it many months ago and still saw the problem. I can't remember exactly what settings I tried though, so feel free to see if you can get it to work. We don't need to worry about this type of thing with zookeeper for 99.9% of our tests.

          Shalin Shekhar Mangar added a comment -

          I'll take a crack at it.

          ASF subversion and git services added a comment -

          Commit 1608555 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1608555 ]

          SOLR-5596: Set system property zookeeper.forceSync=no for Solr test cases
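For readers unfamiliar with the setting, here is a minimal sketch of toggling `zookeeper.forceSync` from test code. The property name comes from the commit message; the `ZkTestProps` class and its `withForceSyncDisabled` helper are hypothetical and are not what the commit adds. As far as I know, ZooKeeper reads this flag once when its transaction-log class is loaded, so it would need to be set before any embedded server starts; treat that as an assumption to verify against the ZooKeeper version in use.

```java
import java.util.function.Supplier;

/** Hypothetical test helper; not part of Solr or ZooKeeper. */
final class ZkTestProps {
    // System property named in the commit message above.
    static final String FORCE_SYNC = "zookeeper.forceSync";

    /**
     * Runs {@code body} with zookeeper.forceSync=no, then restores whatever
     * value (or absence of a value) was there before.
     */
    static <T> T withForceSyncDisabled(Supplier<T> body) {
        String previous = System.getProperty(FORCE_SYNC);
        System.setProperty(FORCE_SYNC, "no");
        try {
            return body.get();
        } finally {
            if (previous == null) {
                System.clearProperty(FORCE_SYNC);
            } else {
                System.setProperty(FORCE_SYNC, previous);
            }
        }
    }

    public static void main(String[] args) {
        String seen = withForceSyncDisabled(() -> System.getProperty(FORCE_SYNC));
        System.out.println("forceSync inside the test body: " + seen);
    }
}
```

Restoring the previous value keeps the setting from leaking into unrelated tests that share the JVM. A test base class could instead set the property once in a static initializer.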

          ASF subversion and git services added a comment -

          Commit 1608559 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1608559 ]

          SOLR-5596: Set system property zookeeper.forceSync=no for Solr test cases

          ASF subversion and git services added a comment -

          Commit 1608562 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1608562 ]

          SOLR-5596: Remove initCore call from afterClass

          ASF subversion and git services added a comment -

          Commit 1608565 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1608565 ]

          SOLR-5596: Remove initCore call from afterClass

          Mark Miller added a comment -

          Yeah, I think this is the same result as when I tried to remove the forceSync - still happens: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/4201/

          Mark Miller added a comment -

          I think this may actually be due to SOLR-6426 (SolrZkClient clean can fail due to a race with children nodes).

          ASF subversion and git services added a comment -

          Commit 1620247 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1620247 ]

          SOLR-5596: Raise zk client timeout for mock objects.

          ASF subversion and git services added a comment -

          Commit 1620248 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1620248 ]

          SOLR-5596: Raise zk client timeout for mock objects.

          Mark Miller added a comment -

          No, it can still happen.

          ASF subversion and git services added a comment -

          Commit 1620319 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1620319 ]

          SOLR-6428: Occasional OverseerTest#testOverseerFailure fail due to missing election node.
          SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.

          ASF subversion and git services added a comment -

          Commit 1620320 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1620320 ]

          SOLR-6428: Occasional OverseerTest#testOverseerFailure fail due to missing election node.
          SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.

          Mark Miller added a comment -

          Okay, now I think this will stop. We will see.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Mark Miller
              Reporter:
              Mark Miller
            • Votes:
              0
              Watchers:
              7