  1. Solr
  2. SOLR-13896

Pausing a non-leader node can cause recovery on other nodes


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      All stack traces below are based on the 7.5 branch. The problem still exists on the 8.x branches. Here is the scenario: we have 3 replicas:

      • L: the leader replica
      • R: the normal replica
      • P: the poor one which was paused then resumed

      L is sending data to R and P when P gets paused. Here is what happens in L's threads:

      • Thread 1 is stuck at this line of StreamingSolrClients:
        public synchronized void blockUntilFinished() {
          for (ConcurrentUpdateSolrClient client : solrClients.values()) {
            client.blockUntilFinished();
          }
        } 

        This thread waits for the other sender threads to finish. Assume that solrClients.values() returns [clientToP, clientToR].

      • Thread 2 corresponds to clientToP. Since P is paused, it does not close the connection; it keeps the connection open and never returns any data to L. So this thread is stuck with the stack trace below, waiting for response data from P (with timeout=600000ms). This in turn keeps Thread 1 stuck at clientToP.blockUntilFinished().
        java.lang.Thread.State: RUNNABLE
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
            at java.net.SocketInputStream.read(SocketInputStream.java:171)
            at java.net.SocketInputStream.read(SocketInputStream.java:141)
            at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
            at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
            at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
            at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
            at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
            at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
            at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
            at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
            at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
            at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
            at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
            at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
            at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
            at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
            at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
            at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
            at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
            at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:347)
      • Since clientToR is the second element of the array, clientToR.blockUntilFinished() never gets called (or at least not until after the timeout). This causes Thread 3, the sender thread of clientToR, to stay stuck at this line:
        upd = queue.poll(pollQueueTime, TimeUnit.MILLISECONDS); 

        Note that pollQueueTime == Integer.MAX_VALUE (this is set by StreamingSolrClients). Therefore, unless clientToR.blockUntilFinished() is called (which interrupts Thread 3), Thread 3 stays stuck at the line above indefinitely.

      • Because clientToR is sending data to R but never closes the output stream, R just keeps waiting (until its idle timeout expires 120000ms later), which then leads to this exception:
        o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 120003/120000 ms
            at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1080)
            at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
            at org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWrapper.java:74)
            at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
            at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
            at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
            at org.apache.solr.common.util.FastInputStream.peek(FastInputStream.java:60)
            at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:107)
            at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
      • After that, the leader puts all replicas, including the non-paused one, into recovery.
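
The chain of waits described in the bullets above can be reproduced outside Solr with a small simulation. This is only a sketch under stated assumptions: PausedReplicaDemo, FakeClient, runSender and simulate are hypothetical names invented for illustration, not Solr classes, and the latches stand in for the real sockets. It shows that waiting on the clients one after another leaves the request stream to the healthy replica R open for as long as the paused replica P does not answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PausedReplicaDemo {

  /** Hypothetical stand-in for a per-replica update client; not the real ConcurrentUpdateSolrClient. */
  static class FakeClient {
    final CountDownLatch replicaResponded = new CountDownLatch(1); // released when the replica answers
    final CountDownLatch streamClosed = new CountDownLatch(1);     // released when the request body ends
    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final Thread runner;

    FakeClient(boolean replicaPaused) {
      if (!replicaPaused) {
        replicaResponded.countDown(); // a healthy replica answers immediately
      }
      runner = new Thread(this::runSender);
      runner.setDaemon(true);
      runner.start();
    }

    private void runSender() {
      try {
        // Keep the request stream open while waiting (almost forever) for more updates.
        while (queue.poll(Integer.MAX_VALUE, TimeUnit.MILLISECONDS) != null) {
          // a real sender would write the update to the stream here
        }
      } catch (InterruptedException e) {
        // blockUntilFinished() interrupted the poll: no more updates are coming
      }
      streamClosed.countDown(); // end of the request body
      try {
        replicaResponded.await(); // wait for the replica's response
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }

    void blockUntilFinished() throws InterruptedException {
      runner.interrupt(); // wake the sender out of its queue poll
      runner.join();      // then wait until the replica has responded
    }
  }

  /** Returns {R's stream closed while P is paused, R's stream closed after P resumes}. */
  static boolean[] simulate() {
    FakeClient clientToP = new FakeClient(true);
    FakeClient clientToR = new FakeClient(false);
    List<FakeClient> clients = Arrays.asList(clientToP, clientToR);

    // "Thread 1": waits on the clients strictly in order.
    Thread thread1 = new Thread(() -> {
      try {
        for (FakeClient c : clients) {
          c.blockUntilFinished();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    thread1.start();

    try {
      Thread.sleep(200); // P is still paused at this point
      boolean closedWhilePaused = clientToR.streamClosed.getCount() == 0;

      clientToP.replicaResponded.countDown(); // P resumes and answers
      thread1.join();
      boolean closedAfterResume = clientToR.streamClosed.getCount() == 0;
      return new boolean[] {closedWhilePaused, closedAfterResume};
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] r = simulate();
    System.out.println("R stream closed while P paused: " + r[0] + ", after P resumed: " + r[1]);
  }
}
```

In this simulation R's stream stays open until P answers, which mirrors how R ends up hitting its own idle timeout even though nothing is wrong with R itself.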
         

      This is a very bad outcome, and it is not just a theoretical problem, since some cloud platforms can freeze a node when doing maintenance.

      Thanks to Andrzej Bialecki and Shalin Shekhar Mangar for helping me debug this problem.
       

        Attachments

        1. SOLR-13896.patch
          0.7 kB
          Cao Manh Dat

            People

            • Assignee:
              ab Andrzej Bialecki
              Reporter:
              caomanhdat Cao Manh Dat