Solr / SOLR-6231

RollingRestartTest failures on jenkins

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud, Tests
    • Labels:
      None

      Description

      A somewhat rare failure on Jenkins. An overseer was available to service requests, but even after waiting for 60 seconds, none of the designates had become the overseer.

      Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Windows/4081/
      Java: 32bit/jdk1.8.0_20-ea-b21 -client -XX:+UseSerialGC
      
      1 tests failed.
      REGRESSION:  org.apache.solr.cloud.RollingRestartTest.testDistribSearch
      
      Error Message:
      No overseer designate as leader found after restart #3: 127.0.0.1:60996_
      
      Stack Trace:
      java.lang.AssertionError: No overseer designate as leader found after restart #3: 127.0.0.1:60996_
              at __randomizedtesting.SeedInfo.seed([5263BF570390CF79:D385314F74CFAF45]:0)
              at org.junit.Assert.fail(Assert.java:93)
              at org.apache.solr.cloud.RollingRestartTest.restartWithRolesTest(RollingRestartTest.java:100)
              at org.apache.solr.cloud.RollingRestartTest.doTest(RollingRestartTest.java:61)
              at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
      
      1. SOLR-6231.patch
        5 kB
        Shalin Shekhar Mangar

          Activity

          Shalin Shekhar Mangar added a comment -

          The good thing about this failure is that in all instances I've seen, we always have an overseer. It's just that the overseer is not one of the designates. I looked at the logs of a few failures and it seemed like the re-prioritization was still in progress when we timed out.

          Here's a patch to harden the process. We have a max timeout of 300 seconds and a smaller 60-second timeout for finding designates, which is pushed further ahead each time we see a new overseer elected. The idea is that if the overseer hasn't changed within 60 seconds, we're unlikely to see a new overseer and we should stop. But if the overseer changed, then re-prioritization is in progress and we should wait longer.
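
The two-timeout idea described in this comment can be sketched roughly as follows. This is not the code from SOLR-6231.patch: the class `DesignateWait`, the method `waitForDesignate`, and all parameter names are hypothetical, and the overseer lookup and clock are passed in as plain functions so the sketch does not depend on any Solr or ZooKeeper API.

```java
import java.util.Set;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not part of the actual patch.
final class DesignateWait {
    /**
     * Polls currentOverseer until it returns one of the designates.
     * Gives up when maxWaitMillis has elapsed in total, or when the overseer
     * has not changed for quietWindowMillis.
     * Returns true if a designate became overseer, false otherwise.
     */
    static boolean waitForDesignate(Supplier<String> currentOverseer,
                                    Set<String> designates,
                                    LongSupplier clockMillis,
                                    long maxWaitMillis,
                                    long quietWindowMillis,
                                    long pollMillis) {
        long start = clockMillis.getAsLong();
        long hardDeadline = start + maxWaitMillis;     // e.g. 300 seconds
        long softDeadline = start + quietWindowMillis; // e.g. 60 seconds
        String lastSeen = null;
        while (true) {
            String leader = currentOverseer.get();
            long now = clockMillis.getAsLong();
            if (leader != null && designates.contains(leader)) {
                return true;
            }
            if (leader != null && !leader.equals(lastSeen)) {
                // A new overseer was elected, so re-prioritization is
                // probably still in progress: push the soft deadline ahead.
                lastSeen = leader;
                softDeadline = now + quietWindowMillis;
            }
            if (now >= hardDeadline || now >= softDeadline) {
                return false;
            }
            if (pollMillis > 0) {
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
    }
}
```

With a fixed 60-second timeout, a run in which the overseer changes hands a couple of times before settling on a designate would fail; with the sliding soft deadline it keeps waiting as long as elections are still happening, and the hard deadline bounds the total wait.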

          Noble Paul added a comment -

          Yeah, the timeout has to take into account whether the leader has changed in the meantime.

          ASF subversion and git services added a comment -

          Commit 1612499 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1612499 ]

          SOLR-6231: Harden the RollingRestartTest

          ASF subversion and git services added a comment -

          Commit 1612500 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1612500 ]

          SOLR-6231: Harden the RollingRestartTest

          Shalin Shekhar Mangar added a comment -

          Thanks for reviewing, Noble!

          ASF subversion and git services added a comment -

          Commit 1613834 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1613834 ]

          SOLR-6231: Increased timeouts and hardened the RollingRestartTest

          Shalin Shekhar Mangar added a comment -

          This has helped a lot. I no longer see failures after this fix went in.

          ASF subversion and git services added a comment -

          Commit 1613835 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1613835 ]

          SOLR-6231: Increased timeouts and hardened the RollingRestartTest

          Shalin Shekhar Mangar added a comment -

          Haha, what do you know. The moment I mark it as resolved, I see a failure on Policeman Jenkins. Still more to do here so I am re-opening it.

          http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Windows/4122/

          Mark Miller added a comment -

          I think the test is just flawed at the moment. Depending on which random Jetty instances are chosen to be killed, there might not be an overseer designate alive after the restart, so a different node becomes the overseer and this check fails.

          Shalin Shekhar Mangar added a comment -

          Yeah, you're right. The test doesn't check that the Jetty being killed isn't hosting all of the designates. I'll fix, thanks!
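
The missing check discussed here could look something like the sketch below. It is only an illustration: `DesignateGuard` and `leavesDesignateAlive` are hypothetical names, node names are plain strings, and the change that actually went in (under SOLR-6291, per the later comments) may work quite differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical guard, for illustration only; not taken from the Solr test code.
final class DesignateGuard {
    /** Returns true if at least one overseer designate is on a node that is not being killed. */
    static boolean leavesDesignateAlive(Set<String> nodesToKill, Set<String> designateNodes) {
        Set<String> survivors = new HashSet<>(designateNodes);
        survivors.removeAll(nodesToKill);
        return !survivors.isEmpty();
    }
}
```

A test could call such a guard before killing a randomly chosen set of nodes and pick again when it returns false, so that the "designate is overseer" assertion is only made when it can actually hold.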

          Mark Miller added a comment -

          I've got one possible change for it in place as part of SOLR-6291. It has tested out very well so far on both regular and nightly runs - I've been running them all weekend.

          Mark Miller added a comment -

          I think SOLR-6291 has addressed this.

          Chris Kulinski added a comment -

          It appears that a System.out.println() debug statement was accidentally committed to DataInputHandler.java while attempting to fix this defect. We're seeing it in our Solr 4.10 logs. Could you please remove it?

          http://svn.apache.org/viewvc?view=revision&revision=r1612500

          ASF subversion and git services added a comment -

          Commit 1624352 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1624352 ]

          SOLR-6231: Remove debug code that shouldn't have been committed at all

          ASF subversion and git services added a comment -

          Commit 1624353 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1624353 ]

          SOLR-6231: Remove debug code that shouldn't have been committed at all

          Shalin Shekhar Mangar added a comment -

          Indeed, I had committed some code that was unrelated and for debugging purposes only. I've removed it. Thanks Chris!

          Chris Kulinski added a comment -

          Thanks!


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Shalin Shekhar Mangar
            • Votes: 0
            • Watchers: 6
