Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4, 1.5
    • Fix Version/s: 1.11
    • Component/s: protocol
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      For some reason, some URLs always time out with protocol-http but not with protocol-httpclient. The stack trace is always the same:

      2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
      java.net.SocketTimeoutException: Read timed out
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.read(SocketInputStream.java:129)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
              at java.io.FilterInputStream.read(FilterInputStream.java:116)
              at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
              at java.io.FilterInputStream.read(FilterInputStream.java:90)
              at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
              at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
              at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
              at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
      

      Some example URLs:

        Issue Links

          Activity

          Markus Jelsma created issue -
          Markus Jelsma added a comment -

          Patch for 1.6. This patch changes the behavior when a read timeout occurs. Currently the SocketTimeoutException is propagated to higher-level code without checking for edge cases. This patch assumes that if bytes were received and no Content-Length header was specified, the data read so far is acceptable.

          This change definitely fixes read timeout problems caused by badly configured servers, but it still relies on the connection timing out.
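
The behavior described in this comment could be sketched roughly as follows. This is not the attached patch: the class name `LenientBodyReader`, the method `readBody`, and the convention that a negative `contentLength` means "no Content-Length header" are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

/** Hypothetical helper for illustration; not the actual NUTCH-1342 patch. */
final class LenientBodyReader {

  /**
   * Reads the response body until EOF.
   *
   * @param contentLength declared Content-Length, or a negative value if the
   *                      server sent no such header (assumed convention)
   */
  static byte[] readBody(InputStream in, int contentLength) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    try {
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    } catch (SocketTimeoutException e) {
      // Tolerate the timeout only in the edge case from the comment:
      // some bytes were received and no Content-Length was declared.
      if (out.size() == 0 || contentLength >= 0) {
        throw e;
      }
    }
    return out.toByteArray();
  }
}
```

As the comment notes, a reader like this still has to wait for the full read timeout before it returns the partial data.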

          Please comment!

          Markus Jelsma made changes -
          Field Original Value New Value
          Attachment NUTCH-1342-1.6-1.patch [ 12525982 ]
          Markus Jelsma made changes -
          Assignee Markus Jelsma [ markus17 ]
          Markus Jelsma added a comment -

          Unless there are objections or improvements, I'll commit this one in the next few days.

          Ferdy Galema added a comment -

          Do you have any clue as to why protocol-httpclient has a different behaviour?

          Also, two suggestions for your patch:

          Perhaps you could fine-tune the mechanism by allowing a configurable number of timeouts before definitively failing. Something like:
          if (++timeoutRetries>this.allowedNumberOfTimeoutRetries) throw e; //rethrow

          Secondly, could you specifically catch SocketTimeoutException? (I'm not sure whether there are other IOExceptions that shouldn't be caught in any case.)
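
Taken together, the two suggestions might look something like the sketch below. Only `timeoutRetries` and `allowedNumberOfTimeoutRetries` come from the snippet above; the class `TimeoutTolerantReader` and the choice to count timeouts cumulatively (rather than resetting the counter after a successful read) are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

/** Hypothetical sketch of the suggested retry limit; not code from Nutch. */
final class TimeoutTolerantReader {

  private final int allowedNumberOfTimeoutRetries;

  TimeoutTolerantReader(int allowedNumberOfTimeoutRetries) {
    this.allowedNumberOfTimeoutRetries = allowedNumberOfTimeoutRetries;
  }

  byte[] readBody(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int timeoutRetries = 0;
    while (true) {
      try {
        int n = in.read(buffer);
        if (n == -1) {
          break; // regular end of stream
        }
        out.write(buffer, 0, n);
      } catch (SocketTimeoutException e) {
        // Only read timeouts are retried; every other IOException propagates.
        if (++timeoutRetries > allowedNumberOfTimeoutRetries) {
          throw e; // rethrow
        }
      }
    }
    return out.toByteArray();
  }
}
```

With `allowedNumberOfTimeoutRetries` set to 0, the first timeout is rethrown immediately, which matches the behavior before any patch.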

          Markus Jelsma added a comment -

          Hi Ferdy,

          No, I have no clue as to why httpclient is doing the correct thing. I'll check the patch again and catch a SocketTimeoutException instead of the IOException it catches now. The only problem is that right now the example URLs do not throw a SocketTimeoutException, so I cannot test it!

          I'll see if I can find another slow website.

          Markus Jelsma made changes -
          Patch Info Patch Available [ 10042 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 1.6 [ 12319941 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.8 [ 12324326 ]
          Fix Version/s 1.7 [ 12323281 ]
          Markus Jelsma made changes -
          Priority Critical [ 2 ] Major [ 3 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 1.8 [ 12324326 ]
          Julien Nioche made changes -
          Component/s protocol [ 12318529 ]
          Component/s fetcher [ 11591 ]
          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]
          Sebastian Nagel added a comment -

          Could this be the same problem as in NUTCH-1825? This would explain why only protocol-http is affected: it's a bug! Better to fix it than to catch it.

          Sebastian Nagel made changes -
          Link This issue is related to NUTCH-1825 [ NUTCH-1825 ]
          Markus Jelsma added a comment -

          Sebastian, I can no longer reproduce this problem for those URLs.

          Mengying Wang added a comment -

          Hey Markus Jelsma, please use https://www.aoncadis.org/scienceKeywordTopic/Cryosphere.html as the seed URL, and you will get the SocketTimeoutException. Thank you!

          Markus Jelsma added a comment -

          Mengying, I can download that URL without problems using the indexchecker. Which protocol plugin are you using? I have tried both protocol-http and protocol-httpclient; both work fine for that URL.

          Sebastian Nagel added a comment -

          Hi Mengying Wang, I also get a timeout for the mentioned URL (both protocol-http and httpclient on recent trunk) but that's not related:

          • the stack is different (it's when reading the response header):
              at java.net.SocketInputStream.socketRead0(Native Method)
              ...
              at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:475)
              at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:389)
              at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:211)
              ...
            
          • disappears if http.timeout is increased to 30 sec (30000) => looks like the server responds slowly
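
To illustrate why a larger timeout makes the error disappear for a slow server, here is a small self-contained sketch using plain `java.net` sockets. It assumes that `http.timeout` ends up as the socket read timeout (`Socket.setSoTimeout`); the class `ReadTimeoutDemo`, the method `respondsWithin`, and the local slow server are purely illustrative and not part of Nutch.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

/** Illustrative only: a local server that answers slowly, and a client with a read timeout. */
final class ReadTimeoutDemo {

  /** Returns true if the first response byte arrives within timeoutMs, false on a read timeout. */
  static boolean respondsWithin(int serverDelayMs, int timeoutMs)
      throws IOException, InterruptedException {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
      Thread slowServer = new Thread(() -> {
        try (Socket s = server.accept()) {
          Thread.sleep(serverDelayMs); // simulate a slow web server
          OutputStream out = s.getOutputStream();
          out.write("HTTP/1.0 200 OK\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
          out.flush();
        } catch (IOException | InterruptedException ignored) {
          // The client may already have given up; nothing to do in this demo.
        }
      });
      slowServer.start();
      try (Socket client = new Socket(loopback, server.getLocalPort())) {
        client.setSoTimeout(timeoutMs); // assumed equivalent of http.timeout
        return client.getInputStream().read() != -1;
      } catch (SocketTimeoutException e) {
        return false; // "Read timed out" while waiting for the status line
      } finally {
        slowServer.join();
      }
    }
  }
}
```

For example, `ReadTimeoutDemo.respondsWithin(400, 100)` models a server that is slower than the timeout, while `ReadTimeoutDemo.respondsWithin(100, 2000)` models the situation after the timeout has been raised.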
          Mengying Wang added a comment -

          Hi Sebastian Nagel, you are so great. After setting the http.timeout property to 30000, this error disappears. Thank you!
          Hey Markus Jelsma, I am using the protocol-httpclient plugin. Thank you!

          Lewis John McGibbney added a comment -

          Recently, on http://www.mrs.org, the parsechecker tool was working fine one minute, and then I got the same stack trace as originally posted by Markus Jelsma. I increased the protocol timeout and voila... all is good. Does anyone know whether web servers can increase their response times dynamically based upon incoming requests from certain clients?

          Lewis John McGibbney made changes -
          Fix Version/s 1.11 [ 12329358 ]
          Fix Version/s 1.10 [ 12327187 ]

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated:
