SOLR-6592

Re-try loop in the ZkController.waitForLeaderToSeeDownState method hangs unit test when leader is gone

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: None
    • Labels:
      None

      Description

      HttpPartitionTest is failing with a ThreadLeakError, which I believe occurs because the retry loop in ZkController.waitForLeaderToSeeDownState can take upwards of 12 minutes to fail (2-minute socket timeout, 6 max retries). The code should stop retrying if the leader is gone, which seems to be the case here (maybe). At the very least, we need to figure out how to avoid this ThreadLeakError.
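
      For reference, the 12-minute figure follows from multiplying the socket timeout by the retry count. A minimal sketch of that arithmetic, assuming every attempt blocks for the full socket timeout and ignoring any pause between attempts (the class and method names are hypothetical, not taken from ZkController):

```java
import java.time.Duration;

public class RetryBudget {

    /**
     * Worst-case time spent in a retry loop when every attempt blocks
     * until the socket timeout expires. Pauses between attempts are
     * not modelled here.
     */
    static Duration worstCase(Duration socketTimeout, int maxRetries) {
        return socketTimeout.multipliedBy(maxRetries);
    }

    public static void main(String[] args) {
        // Values quoted in the description: 2-minute timeout, 6 retries.
        Duration total = worstCase(Duration.ofMinutes(2), 6);
        System.out.println(total.toMinutes() + " minutes"); // prints "12 minutes"
    }
}
```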

      Build: http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/11234/
      Java: 64bit/jdk1.8.0_40-ea-b04 -XX:+UseCompressedOops -XX:+UseG1GC

      2 tests failed.
      FAILED: junit.framework.TestSuite.org.apache.solr.cloud.HttpPartitionTest

      Error Message:
      1 thread leaked from SUITE scope at org.apache.solr.cloud.HttpPartitionTest: 1) Thread[id=8655, name=Thread-2764, state=RUNNABLE, group=TGRP-HttpPartitionTest]

      Stack Trace:
      com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.cloud.HttpPartitionTest:
      1) Thread[id=8655, name=Thread-2764, state=RUNNABLE, group=TGRP-HttpPartitionTest]
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:170)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
      at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
      at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
      at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
      at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
      at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
      at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
      at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
      at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
      at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
      at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
      at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
      at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
      at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
      at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
      at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:466)
      at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
      at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
      at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1623)
      at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:422)
      at org.apache.solr.cloud.ZkController.access$100(ZkController.java:93)
      at org.apache.solr.cloud.ZkController$1.command(ZkController.java:261)
      at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      at __randomizedtesting.SeedInfo.seed([BE8A2D1EED13DDED]:0)

      FAILED: junit.framework.TestSuite.org.apache.solr.cloud.HttpPartitionTest

      1. SOLR-6592.patch
        4 kB
        Timothy Potter

        Activity

        Timothy Potter added a comment -

        Here's a patch that checks whether the leader node is live after receiving an IO error; if the leader is not live, it throws an exception instead of making another pass through the retry loop. This may be too aggressive, but my thinking is that there's no need to wait for the leader to see the down state if it's not live, right?
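
        The approach described in this comment could be sketched roughly as follows. This is an illustration only, not the contents of SOLR-6592.patch: the LeaderRequest interface, the liveNodes set, and the exception types are assumptions standing in for the real HTTP call and cluster-state lookup.

```java
import java.io.IOException;
import java.util.Set;

public class LeaderRetrySketch {

    /** Hypothetical stand-in for the HTTP request sent to the leader. */
    interface LeaderRequest {
        void send() throws IOException;
    }

    /**
     * Retries the request up to maxRetries times, but gives up as soon as
     * an I/O error occurs while the leader is missing from liveNodes.
     * Returns the attempt number on which the request succeeded.
     */
    static int waitForLeader(LeaderRequest request, String leaderNode,
                             Set<String> liveNodes, int maxRetries) {
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                request.send();
                return attempt;
            } catch (IOException e) {
                lastError = e;
                // Assumption: liveNodes reflects the current cluster state.
                if (!liveNodes.contains(leaderNode)) {
                    throw new IllegalStateException(
                        "Leader " + leaderNode + " is not live; giving up after attempt " + attempt, e);
                }
                // Leader still looks live, so fall through and retry.
            }
        }
        throw new IllegalStateException(
            "Leader did not respond after " + maxRetries + " attempts", lastError);
    }
}
```

        With a dead leader the loop exits after one failed attempt rather than waiting out every socket timeout; with a live leader the retry behaviour is unchanged.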

        Mark Miller added a comment -

        It should be fine. At worst, the cluster state is a bit stale and it throws that exception when the node is live (in some crazy scenario), but even then the replica recovery will just attempt again.

        Timothy Potter added a comment -

        Thanks Mark - hopefully this will resolve the weird failures of the HttpPartitionTest on Jenkins!

        ASF subversion and git services added a comment -

        Commit 1629719 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1629719 ]

        SOLR-6592: Avoid waiting for the leader to see the down state if that leader is not live.

        ASF subversion and git services added a comment -

        Commit 1631442 from Timothy Potter in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1631442 ]

        SOLR-6592: Avoid waiting for the leader to see the down state if that leader is not live.

        ASF subversion and git services added a comment -

        Commit 1631462 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1631462 ]

        SOLR-6592: add mention in solr/CHANGES.txt

        ASF subversion and git services added a comment -

        Commit 1631464 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1631464 ]

        SOLR-6592: add mention in solr/CHANGES.txt

        ASF subversion and git services added a comment -

        Commit 1631467 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1631467 ]

        SOLR-6592: add mention in solr/CHANGES.txt

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee:
            Timothy Potter
            Reporter:
            Timothy Potter
          • Votes:
            0
            Watchers:
            4
