Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4, 1.5
    • Fix Version/s: 1.9
    • Component/s: protocol
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      For some reason, some URLs always time out with protocol-http but not with protocol-httpclient. The stack trace is always the same:

      2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
      java.net.SocketTimeoutException: Read timed out
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.read(SocketInputStream.java:129)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
              at java.io.FilterInputStream.read(FilterInputStream.java:116)
              at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
              at java.io.FilterInputStream.read(FilterInputStream.java:90)
              at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
              at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
              at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
              at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
      

      Some example URLs:

        Activity

        Markus Jelsma added a comment -

        Patch for 1.6. This patch changes the behavior when a read timeout occurs. Currently the SocketTimeoutException is propagated to higher-level code without checking for edge cases. This patch assumes that if bytes were received and no Content-Length header was specified, the data read so far is all right.

        This change definitely fixes read timeout problems caused by badly configured servers, but it still relies on the connection timing out.

        Please comment!
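
        The behavior described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch or the actual HttpResponse code: the class TolerantBodyReader, the method readBody, the convention that contentLength is -1 when the header is absent, and the StallingStream helper are all hypothetical names introduced here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

// Hypothetical sketch; not the Nutch HttpResponse implementation.
class TolerantBodyReader {

  /**
   * Reads a response body until EOF.
   * Assumption: contentLength is the declared Content-Length, or -1 if the
   * header was absent.
   */
  static byte[] readBody(InputStream in, int contentLength) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    } catch (SocketTimeoutException e) {
      // Keep the partial data only when some bytes arrived and the server
      // never declared a length; in every other case propagate as before.
      if (out.size() == 0 || contentLength >= 0) {
        throw e;
      }
    }
    return out.toByteArray();
  }

  /** Stand-in for a misbehaving server: serves the bytes, then times out instead of closing. */
  static final class StallingStream extends InputStream {
    private final byte[] data;
    private int pos = 0;

    StallingStream(byte[] data) {
      this.data = data;
    }

    @Override
    public int read() throws IOException {
      if (pos < data.length) {
        return data[pos++] & 0xff;
      }
      throw new SocketTimeoutException("Read timed out");
    }
  }
}
```

        Note that the caller still waits for the full socket timeout before the data is returned, which is the limitation mentioned above.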

        Markus Jelsma added a comment -

        Unless there are objections or improvements, I'll commit this one in the next few days.

        Ferdy Galema added a comment -

        Do you have any clue as to why protocol-httpclient has a different behaviour?

        Also, two suggestions for your patch:

        Perhaps you could fine-tune the mechanism by allowing a configurable number of timeouts before definitively failing. Something like:
        if (++timeoutRetries > this.allowedNumberOfTimeoutRetries) throw e; // rethrow

        Secondly, could you specifically catch SocketTimeoutException? (I'm not sure whether there are other IOExceptions that shouldn't be caught in any case.)
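
        Both suggestions could be combined along these lines. This is a minimal sketch, not code from the patch: the RetryingReader class, its constructor argument, and the FlakyStream helper are hypothetical, and only the counter check is taken from the one-liner above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

// Hypothetical sketch of a bounded retry on read timeouts.
class RetryingReader {
  // Assumed to come from configuration; the name follows the suggestion above.
  private final int allowedNumberOfTimeoutRetries;
  private int timeoutRetries = 0;

  RetryingReader(int allowedNumberOfTimeoutRetries) {
    this.allowedNumberOfTimeoutRetries = allowedNumberOfTimeoutRetries;
  }

  /** Reads into buf, retrying on timeouts until the allowed count is exceeded. */
  int read(InputStream in, byte[] buf) throws IOException {
    while (true) {
      try {
        return in.read(buf);
      } catch (SocketTimeoutException e) {
        // Only timeouts are caught; any other IOException propagates untouched.
        if (++timeoutRetries > allowedNumberOfTimeoutRetries) {
          throw e; // rethrow
        }
      }
    }
  }

  /** Stand-in for a slow server: times out a fixed number of times, then serves data. */
  static final class FlakyStream extends InputStream {
    private int remainingTimeouts;
    private final byte[] data;
    private int pos = 0;

    FlakyStream(int timeouts, byte[] data) {
      this.remainingTimeouts = timeouts;
      this.data = data;
    }

    @Override
    public int read() throws IOException {
      if (remainingTimeouts > 0) {
        remainingTimeouts--;
        throw new SocketTimeoutException("Read timed out");
      }
      return pos < data.length ? (data[pos++] & 0xff) : -1;
    }
  }
}
```

        With an allowance of 2, a stream that times out twice is still read successfully; with an allowance of 1, the second timeout is rethrown to the caller.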

        Markus Jelsma added a comment -

        Hi Ferdy,

        No, I have no clue as to why httpclient is doing the correct thing. I'll check the patch again and catch a SocketTimeoutException instead of the IOException it catches now. The only problem is that right now the example URLs do not throw a SocketTimeoutException, so I cannot test it!

        I'll see if I can find another slow website.


          People

          • Assignee:
            Markus Jelsma
            Reporter:
            Markus Jelsma
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:

              Development