  1. Nutch
  2. NUTCH-481

http.content.limit is broken in the protocol-httpclient plugin

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels:
      None

      Description

      When the protocol-httpclient plugin is used, the entire content of the requested URL is retrieved, regardless of the http.content.limit configuration setting. (The issue does not affect the protocol-http plugin.)

      For very large documents, this leads the Fetcher to believe that the FetcherThread is hung, and the Fetcher aborts its run, logging a warning about hung threads (Fetcher.java:433).

      org.apache.nutch.protocol.httpclient.HttpResponse counts the content length correctly and breaks out of its read loop at the proper point.

      However, when HttpResponse closes the InputStream from which it is reading, the InputStream object (an org.apache.commons.httpclient.AutoCloseInputStream) continues to read the rest of the document from the web server.

      Though I'm not certain this is the correct solution, a quick test shows that if HttpResponse is changed to abort the GET, the InputStream correctly aborts the read from the webserver, and the FetcherThread can continue.
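
The behavior described above can be illustrated with a small, self-contained simulation. This is only a sketch and not the Nutch or Commons HttpClient code: `DrainingStream`, `fetch`, and `bytesPulled` are hypothetical names, and the stream's `close()` is written to drain the remaining body so that it mimics the reported symptom. In Commons HttpClient 3.x the real counterpart of the simulated `abort()` would presumably be `abort()` on the GET method object, called before the stream is closed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ContentLimitSketch {

    /** Hypothetical stand-in for a response stream that drains the body on close(). */
    static class DrainingStream extends InputStream {
        private final InputStream body;
        private boolean aborted = false;
        long bytesPulled = 0; // bytes taken from the simulated server

        DrainingStream(byte[] data) {
            this.body = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            int b = body.read();
            if (b >= 0) bytesPulled++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = body.read(buf, off, len);
            if (n > 0) bytesPulled += n;
            return n;
        }

        /** Simulates aborting the request: the rest of the body is never read. */
        void abort() {
            aborted = true;
        }

        /** Simulates the reported symptom: close() reads the remaining body unless aborted. */
        @Override
        public void close() throws IOException {
            if (!aborted) {
                byte[] scratch = new byte[4096];
                while (read(scratch, 0, scratch.length) != -1) {
                    // discard the remaining content
                }
            }
            body.close();
        }
    }

    /**
     * Reads at most 'limit' bytes (the role played by http.content.limit),
     * optionally aborts, then closes. Returns the bytes pulled from the server.
     */
    static long fetch(byte[] document, int limit, boolean abortBeforeClose) {
        DrainingStream in = new DrainingStream(document);
        try {
            byte[] buf = new byte[1024];
            int total = 0;
            while (total < limit) {
                int n = in.read(buf, 0, Math.min(buf.length, limit - total));
                if (n == -1) break; // document shorter than the limit
                total += n;
            }
            if (abortBeforeClose) in.abort();
            in.close();
            return in.bytesPulled;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] document = new byte[1_000_000];
        System.out.println("close only:  " + fetch(document, 65536, false));
        System.out.println("abort first: " + fetch(document, 65536, true));
    }
}
```

In this simulation the read loop stops at the limit in both cases, but only the abort-first path leaves the rest of the document unread; the close-only path ends up pulling all 1,000,000 bytes.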


              People

              • Assignee:
                dogacan Doğacan Güney
                Reporter:
                mrbalky charlie wanek