SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, Trunk
    • Component/s: None
    • Labels:
      None

      Description

      Seeing this a bunch on jenkins - previous leader ephemeral node is still around for some reason.

        Issue Links

          Activity

          Mark Miller created issue -
          ASF subversion and git services added a comment -

          Commit 1572370 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1572370 ]

          SOLR-5596: Improve this test.

          ASF subversion and git services added a comment -

          Commit 1572371 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1572371 ]

          SOLR-5596: Improve this test.

          Mark Miller added a comment -

          That last attempt did not work - I just saw this again locally.

          Mark Miller made changes -
          Field Original Value New Value
          Link This issue relates to SOLR-5799 [ SOLR-5799 ]
          Mark Miller added a comment -

          SOLR-5799 may solve this. My best guess is that the previous leader is just taking a little longer than we would expect to have its ephemeral leader registration node removed.

          Mark Miller added a comment -

          SOLR-5799 was just committed. We now wait a short time if an ephemeral leader registration node already exists: if we are simply catching it briefly before it goes away, we wait and, once it is gone, create our own ephemeral registration node.
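
A minimal sketch of the wait-then-register idea described above, not the actual SOLR-5799 patch. `NodeStore`, `InMemoryNodeStore`, `LeaderRegistration`, and `registerWhenFree` are hypothetical names introduced for illustration, and the in-memory store stands in for ZooKeeper so the example runs without a server.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a ZooKeeper client; this is not Solr's API.
interface NodeStore {
    boolean exists(String path);
    boolean createEphemeral(String path); // returns false if the node already exists
    void delete(String path);
}

// In-memory implementation so the sketch runs without a ZooKeeper server.
class InMemoryNodeStore implements NodeStore {
    private final Set<String> nodes = ConcurrentHashMap.newKeySet();
    public boolean exists(String path) { return nodes.contains(path); }
    public boolean createEphemeral(String path) { return nodes.add(path); }
    public void delete(String path) { nodes.remove(path); }
}

class LeaderRegistration {
    // Retry creating the registration node until the previous leader's node
    // has gone away or the timeout expires. Returns true once our node exists.
    static boolean registerWhenFree(NodeStore store, String path,
                                    long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            // Attempt the create directly instead of check-then-create,
            // so there is no window between the check and the create.
            if (store.createEphemeral(path)) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false; // the old node outlived the wait
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

If the old node never disappears, the method gives up after the timeout and returns false, which would correspond to the "leader node already exists" failure this issue tracks.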

          Mark Miller made changes -
          Assignee Mark Miller [ markrmiller@gmail.com ]
          Mark Miller made changes -
          Fix Version/s 4.8 [ 12326254 ]
          Fix Version/s 5.0 [ 12321664 ]
          Mark Miller added a comment -

          So we still hit this - pretty surprising. I've gone over the test a couple times and have not spotted the problem yet, but I think it must be an issue with the test.

          Mark Miller made changes -
          Link This issue is related to SOLR-5834 [ SOLR-5834 ]
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326731 ]
          Fix Version/s 4.8 [ 12326254 ]
          Shalin Shekhar Mangar added a comment -

          I was looking into the logs of this failure today:
          http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Linux/10616/

             [junit4]   2> 472241 T2893 oazsp.FileTxnLog.commit WARN fsync-ing the write ahead log in SyncThread:0 took 11588ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
          

          This warning can be caused by a slow machine, but it also happens on fast machines if you do a lot of writes to ZooKeeper very quickly, which is what testShardLeaderChange does. Perhaps we should add a small wait between operations?

          Would it make sense to set forceSync to no for ZooKeeper in our tests? At the very least, it would reduce the spurious failures and let us concentrate on fixing real bugs.

          See http://mail-archives.apache.org/mod_mbox/zookeeper-user/201401.mbox/%3CCABtFeVwoXh1d8D+tO0wyLMBap_CRbY6L9i9wh2Le7s1ZkPN+uA@mail.gmail.com%3E
          and http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/zookeeper_psuedo_scalability_and_absolute

          Mark Miller added a comment -

          "Would it make sense to set forceSync to no for ZooKeeper in our tests?"

          I think I tried it many months ago and still saw the problem. I can't remember exactly what settings I tried though, so feel free to see if you can get it to work. We don't need to worry about this type of thing with ZooKeeper for 99.9% of our tests.

          Shalin Shekhar Mangar added a comment -

          I'll take a crack at it.

          ASF subversion and git services added a comment -

          Commit 1608555 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1608555 ]

          SOLR-5596: Set system property zookeeper.forceSync=no for Solr test cases
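
For illustration, a test harness could set and later restore this property roughly as sketched below. The property name `zookeeper.forceSync` and the value `no` come from the commit message above; the `ZkTestProps` class and its method names are hypothetical and are not taken from the Solr test framework. The sketch assumes the property is set before the embedded ZooKeeper server starts, since ZooKeeper reads it when setting up its transaction log.

```java
// Hypothetical helper; not part of Solr or ZooKeeper.
class ZkTestProps {
    static final String FORCE_SYNC = "zookeeper.forceSync";
    private static String previous;

    // Call before the embedded ZooKeeper server is started, so that
    // transaction log writes are not fsync-ed during tests.
    static void disableForceSync() {
        previous = System.setProperty(FORCE_SYNC, "no");
    }

    // Call after the test class finishes to put back the earlier value.
    static void restoreForceSync() {
        if (previous == null) {
            System.clearProperty(FORCE_SYNC);
        } else {
            System.setProperty(FORCE_SYNC, previous);
        }
    }
}
```

Skipping fsync trades durability for speed, which is usually acceptable for a throwaway test server but not for production.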

          ASF subversion and git services added a comment -

          Commit 1608559 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1608559 ]

          SOLR-5596: Set system property zookeeper.forceSync=no for Solr test cases

          ASF subversion and git services added a comment -

          Commit 1608562 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1608562 ]

          SOLR-5596: Remove initCore call from afterClass

          ASF subversion and git services added a comment -

          Commit 1608565 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1608565 ]

          SOLR-5596: Remove initCore call from afterClass

          Shalin Shekhar Mangar made changes -
          Assignee Mark Miller [ markrmiller@gmail.com ] Shalin Shekhar Mangar [ shalinmangar ]
          Mark Miller added a comment -

          Yeah, I think this is the same result as when I tried to remove the forceSync - still happens: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/4201/

          Mark Miller added a comment -

          I think this may actually be due to SOLR-6426 (SolrZkClient clean can fail due to a race with children nodes).

          ASF subversion and git services added a comment -

          Commit 1620247 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1620247 ]

          SOLR-5596: Raise zk client timeout for mock objects.

          ASF subversion and git services added a comment -

          Commit 1620248 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1620248 ]

          SOLR-5596: Raise zk client timeout for mock objects.

          Mark Miller added a comment -

          No, it can still happen.

          Mark Miller made changes -
          Link This issue is related to SOLR-6428 [ SOLR-6428 ]
          ASF subversion and git services added a comment -

          Commit 1620319 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1620319 ]

          SOLR-6428: Occasional OverseerTest#testOverseerFailure fail due to missing election node.
          SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.

          ASF subversion and git services added a comment -

          Commit 1620320 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1620320 ]

          SOLR-6428: Occasional OverseerTest#testOverseerFailure fail due to missing election node.
          SOLR-5596: OverseerTest.testOverseerFailure - leader node already exists.

          Mark Miller added a comment -

          Okay, now I think this will stop. We will see.

          Mark Miller made changes -
          Assignee Shalin Shekhar Mangar [ shalinmangar ] Mark Miller [ markrmiller@gmail.com ]
          Mark Miller made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 4.11 [ 12327845 ]
          Fix Version/s 4.9 [ 12326731 ]
          Resolution Fixed [ 1 ]
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Anshum Gupta made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
          Open -> Resolved | 239d 20h 12m | 1 | Mark Miller | 29/Aug/14 22:28
          Resolved -> Closed | 177d 7h 32m | 1 | Anshum Gupta | 23/Feb/15 05:01

            People

            • Assignee:
              Mark Miller
              Reporter:
              Mark Miller
            • Votes:
              0
              Watchers:
              7